DPIP Working Group Minutes

November 14, 2006

 

DLF Forum Report

 

Anurag Acharya talked about Google Scholar's desire to be the one place to do research.  He said that federated search is a dead end and that harvesting is more promising.  Google hopes to have an agreement with Elsevier within the year to index their journals.

 

Catalogs: Faceted search is being explored at many places.  One example is the USC Libraries' Gandhara Project: a single search interface to diverse resources, including multiple OPACs

(see slide #7 for a diagram of the harvesting architecture)

http://www.diglib.org/forums/fall2006/presentations/grappone1106.pdf

 

DLF-Aquifer Asset Actions Experiment with OAI Metadata Harvesting: Demonstrating the Value of Actionable URLs

http://dlib.org/dlib/october06/cole/10cole.html

 

Harvard Course Reserves System (including many advanced features)

http://www.diglib.org/forums/fall2006/presentations/stern1106.pdf

 

DLF Fall Forum 2006: Full Program with links to all Presentations

http://www.diglib.org/forums/fall2006/fall2006program.htm

 

Matthew and Youn attended a pre-DLF class on PREMIS.  PREMIS is still new, and there is a need for best practices and real-world implementations.

 

Cornell DCAPS staff talked about managing digital projects for the future.  For preservation and migration they are using the Ockham registry to manage the information they need.

 

Mass Digitization at Cornell

100,000 volumes to be done at Kirtas near Rochester, NY.  Kirtas has contacted Meg Bellinger to talk about the possibility of doing a similar project at Yale.  Fred talked with Oya Rieger at Cornell.  Cornell has a deal, but now needs to develop the infrastructure to store the data.  Cornell estimates 300 TB of storage.  Kirtas scanning appears to be higher quality than the Google mass digitization project.  Quality has an impact on preservation.  Mass digitization does not necessarily create a preservation copy.

 

Mass Digitization at Michigan

Michigan talked about the MBooks (digitization with Google) project.  They do some work that it is not clear we would need to do.  They provide two links (one to Google and one to the local MBooks interface).  The local Michigan system has a better page-turner, better access to serials and volumes, access to all government documents, and different rights.

Two-stage search: a bibliographic record in their catalog can be searched first, followed by a full-text search within the MBooks page-turner.  They have some structural metadata included in their METS records.  They have sequence numbers but are only beginning to include page numbers.

 

They have a rights management database that governs what full text can be seen through the MBooks interface, both internationally and on the Michigan campus.  They can arrange to pay copyright fees for current books they want to give access to.  They are doing 30,000 books a week.

 

Outline of Process at Michigan:

 

Content selection

Retrieval from stacks

All items must have barcodes

Tracking information is stored in LMS

Metadata extracted and made available to Google

Scanning by Google

Updates to OPAC bibliographic and holdings records (descriptive metadata)

METS object for structural metadata, technical metadata, images and OCR text

 

Page image and metadata repository (based on DLXS)

No separate preservation archive

Download from Google and ingest to repository (TIFF, JPG, UTF-8 OCR, metadata)

Validation through barcode check digit, MD5 checksum, JHOVE, etc.

Quality assurance (20-page samples manually reviewed through ACDSee)

Persistent identifier (Handle)
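
The MD5 checksum step in the validation above can be illustrated with a short sketch.  This is not Michigan's actual ingest code: the function names, the manifest layout (file name mapped to expected MD5 hex digest) and the file names are all assumptions made for illustration.  The barcode check digit and JHOVE steps are not shown.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(manifest, root):
    """Return the manifest entries that are missing or whose checksum differs.

    `manifest` is a hypothetical mapping of relative file name to the
    expected MD5 hex digest supplied alongside the downloaded files.
    """
    failures = []
    for name, expected in sorted(manifest.items()):
        path = os.path.join(root, name)
        if not os.path.isfile(path):
            failures.append(name)  # file never arrived
        elif md5_of_file(path) != expected.lower():
            failures.append(name)  # file arrived but is corrupt or altered
    return failures
```

A volume would pass this step only when find_mismatches returns an empty list; anything else would be downloaded again before ingest.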

 

Rights database (determine authoritative copyright status, store rights attribute)

GeoIP database (user country of origin derived from IP address)
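
One way the rights attribute and the GeoIP result might combine into an access decision is sketched below.  The rights values, the Viewer type and the rules themselves are hypothetical; the presentation did not give Michigan's actual codes or policy, so this only shows the general shape of such a check.

```python
from dataclasses import dataclass

# Hypothetical rights attribute values, invented for this sketch.
PUBLIC_DOMAIN_WORLD = "pd-world"     # viewable by anyone
PUBLIC_DOMAIN_US = "pd-us"           # public domain in the United States only
LICENSED_CAMPUS = "licensed-campus"  # access arranged for on-campus users
IN_COPYRIGHT = "in-copyright"        # search only, no full text


@dataclass(frozen=True)
class Viewer:
    country: str     # two-letter country code, as a GeoIP lookup might return
    on_campus: bool  # True when the request comes from a campus address


def can_view_full_text(rights, viewer):
    """Decide whether full text may be shown; unknown values are denied."""
    if rights == PUBLIC_DOMAIN_WORLD:
        return True
    if rights == PUBLIC_DOMAIN_US:
        return viewer.country == "US"
    if rights == LICENSED_CAMPUS:
        return viewer.on_campus
    return False
```

Denying by default means a volume with a missing or unrecognized rights attribute stays closed until someone determines its status.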

 

Library catalog (OPAC) serves as discovery interface

 

Pageturner

 –User interface to MBooks

 –Access rights determination

 –Descriptive metadata from OPAC

 –OCR, page image access

 –Full text search

 

Future cleanup

Future enhancements

 

Michigan / Google Mass Digitization: MBooks and Google Book Search

(see especially slide #26 for a diagram of MBooks system components)

http://www.diglib.org/forums/fall2006/presentations/powell1106.pdf

 

Woodrow Wilson Presidential Library

The Woodrow Wilson Presidential Library (WWPL) will pay for a digitization project covering 3000 pages of the House diaries.  House was a member of the Wilson cabinet.  This is a possible DPIP project.  Derek scanned a few pages.  Derek and Jennifer have sent these to a vendor to get a quote.  There are handwritten notes, so this won't be a simple OCR project.  Rekeying will be needed.  Once they get some quotes, they will negotiate with the WWPL over the amount it would pay to complete the project.

 

Yale Daily News Project

ContentDM is being purchased for the YDN project in order to provide a sophisticated interface to the newspaper content. The license will allow 50,000 images. The initial phase of the project will digitize 13,000 pages (roughly 7 years of publication), and will be used to pursue alumni funding for additional years.