Migrating government information from CD-ROMs: a pilot project

 

 

 

Gretchen Gano

Social Science Data Librarian

Social Science Libraries & Information Services

140 Prospect Street

432-6121

 gretchen.gano@yale.edu

 

Julie Linden

Government Information Librarian

Government Documents & Information Center

38 Mansfield Street

432-3310

julie.linden@yale.edu

Purpose

The purpose of this project is to address and document the major challenges to long-term access to government information on CD-ROMs by migrating a sample set of government CD-ROMs to a more stable server environment.

For more than a decade, CD-ROMs were a popular distribution format for the U.S. Federal Depository Library Program. Yale Library’s Government Documents & Information Center (GDIC) holds more than 3,000 individual CD-ROMs in its federal depository collection, a substantial number even though distribution of this format has decreased markedly in recent years. GDIC also holds approximately 360 CD-ROMs in its United Nations, European Union, Canadian, and Food and Agriculture Organization depository collections. Although these collections are much smaller than the U.S. collection, they resemble it in that the CDs may contain information that is not duplicated in other media or has not been preserved in a more stable, long-term format.

These collections are similar to others described in the Yale Library’s 2004 proposal for an interim “Rescue Repository” in that “The lifecycle management of this content has become a clear necessity as the volume continues to grow. The digital masters for much of this material are in immediate danger of permanent loss through media decay, physical damage, technological obsolescence, or difficulties in archival management” (http://www.library.yale.edu/iac/documents/RescueRepositoryProposal.pdf). Transferring the files from CDs to the Rescue Repository would, in the short term, address the issues of media decay and physical damage; while an important step, such action would not address the issues of long-term viability of the files on the CDs. Government and inter-governmental organization (IGO) CDs contain files in a wide range of formats, proprietary and non-proprietary, some still very much in use, some obsolete or nearly so. Government agencies and IGOs do not necessarily take responsibility for archival management of the data and information they have disseminated on CD-ROM. Further complicating the matter, some government CDs package public domain data or information with commercial software, thus raising intellectual property issues for the library that wishes to migrate or otherwise preserve the CD’s contents. Can the data and information be separated from the software and still be usable?

 

An analysis of these issues and others related to the “CD Legacy Problem” was presented by John Hernandez (Princeton University) and Tom Byrne (University of Kentucky) at the Spring 2004 Depository Library Council Meeting. They concluded that “any solution to the CD legacy problem will be highly labor intensive”; that web access was considered “optimal for access to most formats” (although it did not solve preservation issues); and that an “optimal solution” to the long-term preservation problem “must come from GPO [the Government Printing Office] & agencies with depository involvement.” GPO is working on a CD-ROM data migration strategy and has announced plans to test migration processes on a sample of federal agency CD-ROMs, but has no timeline either for the test or for the longer-term project of migrating all depository CD-ROMs. Even if GPO eventually resolves the problems of legacy CD-ROMs, this proposed pilot project will nevertheless be useful in helping us gather information and build expertise for dealing with migration issues for other Yale Library collections (e.g. Economic Growth Center Collection CD-ROMs; numeric data files in the Social Science Data Archive).

This project will take the Hernandez and Byrne analysis a step further by actually migrating a sample set of CD-ROMs. The migration will serve two purposes:

  1. To make the data and information on the CD-ROMs more easily accessible to faculty and students on the Yale network. The files on the CDs chosen for this project will be transferred to the ssrs server, which also houses the library’s Social Science Data Archive. The ssrs server is administered by Social Science Research Services (a unit of Academic Media & Technology); it is backed up and is accessible to faculty and students on the Yale network.
  2. To analyze and document the costs, processes, and challenges of migrating the files on the CD-ROMs to a more stable server environment and to non-proprietary formats that are recognized as acceptable preservation formats (e.g. migrating Excel to ASCII).

 

The lessons learned from this project can be shared with the depository library community and will potentially provide a basis for a collaborative project to “rescue” government and IGO CD-ROMs.

Expected outcomes of this project:

  1. A clear, detailed articulation of the challenges of providing long-term access to and preservation of information on government and IGO CD-ROMs.
  2. A draft workflow for migrating government information from CD-ROM to a more stable server environment and to non-proprietary file formats.
  3. A cost analysis of the labor and storage requirements.
  4. An initial draft of preservation metadata and digital file integrity requirements for these materials.
  5. An analysis of the project’s scalability and potential for collaboration with other institutions.
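
Outcomes 3 and 4 depend on per-disc information being recorded consistently. As an illustration only, the sketch below shows one way such a log could be kept as a CSV file. Every field name here is hypothetical; the actual metadata elements are to be determined during the project.

```python
import csv
from dataclasses import dataclass, asdict, fields


@dataclass
class MigrationRecord:
    # Hypothetical fields, chosen only to illustrate the kind of
    # information the project proposes to capture for each disc.
    disc_title: str
    source_format: str
    target_format: str
    sha256: str
    hours_spent: float
    problems: str = ""


def write_log(records, handle) -> None:
    """Write migration records to an open text handle as CSV with a header row."""
    names = [f.name for f in fields(MigrationRecord)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
```

A log in this shape can be opened in a spreadsheet or summed directly to feed the cost analysis.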

Methodology

  1. Determine selection criteria for the CD-ROMs to be migrated for this pilot project. Criteria will encompass a range of file formats, subject matter, and authoring agencies or bodies.
  2. Based on the criteria, select specific CD-ROMs from the U.S. federal, UN, EU, Canadian, and FAO depositories for inclusion in this project.
  3. In consultation with the Metadata Committee, determine what metadata to capture during the migration process. Devise a format for student workers to enter the information.
  4. For each CD-ROM, the librarians on this project will analyze and document:
       • Intellectual property issues (e.g. proprietary software)
       • File formats – viability for preservation; ability to migrate (if necessary) to a more long-term format
       • Location of existing metadata and technical documentation
       • Instructions for student worker(s) to perform migration tasks
  5. The student(s) hired for this project will then:
       • Transfer the files from CD to the ssrs server
       • Migrate specific files from one format to another (e.g. Excel to ASCII)
       • Enter metadata and link to (or load) external documentation
       • Document time spent on tasks
       • Document errors or problems encountered during migration tasks
  6. The librarians on this project will then:
       • Perform quality control analysis on the migrated files (e.g. checksums for numeric data). The librarians will explore the utility of using JHOVE as a layer of file identification for the numeric data.
       • Analyze and troubleshoot errors or problems encountered
       • Adjust the workflow as needed, based on testing the migrated files and on communication with student(s)
       • Analyze the data to determine cost and storage requirements
       • Analyze the metadata to understand workflow and cost requirements
       • Enhance Orbis records for the CD-ROMs with links to server locations; catalog numeric data files into StatCat
       • Develop cost estimates for scaling this process to the entire GDIC CD-ROM collections
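
The transfer and checksum steps above could be supported by a simple fixity check. The following is a minimal sketch rather than the project's actual tooling: it assumes the CD-ROM is mounted as an ordinary directory, that SHA-256 is an acceptable checksum algorithm, and that the function names are free to choose (both are hypothetical).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source_root: Path, target_root: Path) -> dict[str, str]:
    """Copy every file under source_root into target_root and verify each copy.

    Returns a manifest mapping relative paths to checksums.
    Raises IOError if a copied file does not match its source.
    """
    manifest = {}
    for source in sorted(p for p in source_root.rglob("*") if p.is_file()):
        relative = source.relative_to(source_root)
        target = target_root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copy2 also preserves timestamps
        before, after = sha256_of(source), sha256_of(target)
        if before != after:
            raise IOError(f"Checksum mismatch for {relative}")
        manifest[relative.as_posix()] = before
    return manifest
```

The returned manifest could be stored alongside the migrated files so that later integrity checks compare against the checksums recorded at transfer time. Note that this verifies only the byte-for-byte copy; a file converted to another format needs a separate content-level check.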

Timeline:

  • Jan-Feb: develop selection criteria; select CD-ROMs to migrate
  • Feb-Mar: analyze each CD-ROM as described above; hire and train student worker
  • Mar-May: student migrates CDs; librarians review work and adjust workflow as needed
  • June-Aug: librarians analyze student data and develop cost and storage requirements
  • Sept-Oct: librarians communicate with selected other institutions to discuss possible collaboration and scalability issues
  • Nov-Dec: write final report

 

Expenses:

Student labor: estimated at $12.00/hour

  • 5 hours training
  • 4 hours per CD-ROM migrated (includes documentation requirements) x 15 CDs = 60 hours (over approx 8 weeks, for an average of 7.5 hours/week)
  • Total: 65 hours, or $780

Travel to a peer institution in the region (likely Columbia University) to discuss potential for collaboration on a larger-scale CD-ROM migration project: $220

Server costs: $0. Estimated server storage space: 10 megabytes.
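
The budget arithmetic above can be captured in a small calculation, which also makes it easy to re-run with figures measured during the pilot. This is only a sketch; the class and variable names are hypothetical, and the inputs are the estimates stated in this proposal.

```python
from dataclasses import dataclass


@dataclass
class LaborEstimate:
    hourly_rate: float     # dollars per hour
    training_hours: float  # one-time training
    hours_per_cd: float    # migration plus documentation
    cd_count: int

    @property
    def total_hours(self) -> float:
        return self.training_hours + self.hours_per_cd * self.cd_count

    @property
    def labor_cost(self) -> float:
        return self.total_hours * self.hourly_rate


pilot = LaborEstimate(hourly_rate=12.00, training_hours=5, hours_per_cd=4, cd_count=15)
travel = 220.00

print(pilot.total_hours)          # 65
print(pilot.labor_cost)           # 780.0
print(pilot.labor_cost + travel)  # 1000.0
```

Changing `cd_count` gives a first-order estimate for a larger collection, on the assumption that the per-CD hours hold at scale; testing that assumption is one of the things the pilot is meant to do.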

Benefits:

This project will contribute to the Yale University Library in the following ways:

  • Cost estimates and storage requirements data from this pilot study can be used to determine costs and storage requirements for migrating the entire GDIC CD-ROM collection and can inform other digital migration projects.
  • Selection criteria developed can be used to shape a collection that should be deposited into the Rescue Repository for short-term preservation.
  • Information and data from selected CD-ROMs will be made available on the Yale network, thus improving access for Yale faculty and students.
  • Technical challenges and quality control aspects of migrating file formats will be documented and can inform other migration projects in the libraries.
  • OAI-compliant metadata created and external documentation gathered can enhance resource discovery.
  • Explorations of preservation metadata requirements will provide a test case environment as the Metadata Committee works on library-wide preservation metadata standards and practices.

This project will benefit the library and information science community in the following ways:

  • Documentation of cost estimates, storage requirements, workflow, metadata requirements, quality control measures, and technical challenges can be used by other libraries confronting similar migration projects.
  • Detailed articulation of challenges contributes to overall understanding of at-risk digital government information.
  • Dialogue with libraries with similar collections will lay the foundation for a potential collaborative, large-scale project to migrate at-risk collections to more stable long-term environments.
  • Examination and analysis of specific titles can inform collection development recommendations to the Inter-university Consortium for Political and Social Research (ICPSR), a major data archive and source of numeric data for Yale researchers.