Migrating government information from CD-ROMs: a pilot project
Gretchen Gano
Social Science Data Librarian
Social Science Libraries & Information Services
140 Prospect Street
432-6121
gretchen.gano@yale.edu
Julie Linden
Government Information Librarian
Government Documents & Information Center
38 Mansfield Street
432-3310
julie.linden@yale.edu
Purpose
The purpose of this project is to address and
document the major challenges to long-term access to government information on
CD-ROMs by migrating a sample set of government CD-ROMs to a more stable server
environment.
For more than a decade, CD-ROMs were a
popular distribution format for the U.S. Federal Depository Library Program.
Yale Library’s Government Documents & Information Center (GDIC) holds more
than 3,000 individual CD-ROMs in its federal depository collection, a
substantial number, even though distribution of this format has decreased
significantly in recent years. GDIC also holds approximately 360 CD-ROMs in its
United Nations, European Union, Canadian, and Food and Agriculture Organization
depository collections. While these CD-ROM collections are much smaller than
the U.S. CD-ROM collection, they are similar to the U.S. collection in that the CDs may
contain information that is not duplicated in other media or that has not been
preserved in a more stable, long-term format than CD-ROMs.
These collections are similar to others
described in the Yale Library’s 2004 proposal for an interim “Rescue
Repository” in that “The
lifecycle management of this content has become a clear necessity as the volume
continues to grow. The digital masters for much of this material are in
immediate danger of permanent loss through media decay, physical damage,
technological obsolescence, or difficulties in archival management” (http://www.library.yale.edu/iac/documents/RescueRepositoryProposal.pdf).
Transferring the files from CDs to the Rescue Repository would, in the short
term, address the issues of media decay and physical damage; while an important
step, such action would not address the issues of long-term viability of the files on the CDs. Government
and inter-governmental organization (IGO) CDs contain files in a wide range of
formats, proprietary and non-proprietary, some still very much in use, some
obsolete or nearly so. Government agencies and IGOs do not necessarily take
responsibility for archival management of the data and information they have
disseminated on CD-ROM. Further complicating the matter, some government CDs
package public domain data or information with commercial software, thus
raising intellectual property issues for the library that wishes to migrate or
otherwise preserve the CD’s contents. Can the data and information be separated
from the software and still be usable?
An analysis of these issues and others related to the “CD Legacy Problem” was
presented by John Hernandez (Princeton University) and Tom Byrne (University of
Kentucky) at the Spring 2004 Depository Library Council Meeting. They concluded that “any
solution to the CD legacy problem will be highly labor intensive”; that web
access was considered “optimal for access to most formats” (although it did not
solve preservation issues); and that an “optimal solution” to the long-term
preservation problem “must come from GPO [the Government Printing Office] &
agencies with depository involvement.” GPO is working on a CD-ROM data migration
strategy and has announced plans to test migration processes on a sample of
federal agency CD-ROMs, but has no timeline either for the test or for the
longer-term project of migrating all depository CD-ROMs. Even if GPO eventually resolves
the problems of legacy CD-ROMs, this proposed pilot project will nevertheless
be useful in helping us gather information and build expertise for dealing with
migration issues for other Yale Library collections (e.g. Economic Growth
Center Collection CD-ROMs; numeric data files in the Social Science Data
Archive).
This project will take the Hernandez and Byrne analysis a step further
by actually migrating a sample set of CD-ROMs. The migration will serve two
purposes:
- To make the data and information on the CD-ROMs more easily
accessible to faculty and students on the Yale network. The files on the
CDs chosen for this project will be transferred to the ssrs server,
which also houses the library’s Social Science Data Archive. The ssrs
server is administered by Social Science Research Services (a unit of
Academic Media & Technology); it is backed up and is accessible to
faculty and students on the Yale network.
- To analyze and document the costs, processes, and challenges of
migrating the files on the CD-ROMs to a more stable server environment and
to non-proprietary formats that are recognized as acceptable preservation
formats (e.g. migrating Excel to ASCII).
The lessons learned from this project can be
shared with the depository library community and will potentially provide a
basis for a collaborative project to “rescue” government and IGO CD-ROMs.
Expected outcomes of this project:
- A clear, detailed articulation of the challenges of providing
long-term access to and preservation of information on government and IGO
CD-ROMs.
- A draft workflow for migrating government information from CD-ROM
to a more stable server environment and to non-proprietary file formats.
- A cost analysis of the labor and storage requirements.
- An initial draft of preservation metadata and digital file
integrity requirements for these materials.
- An analysis of the project’s scalability and potential for
collaboration with other institutions.
Methodology
- Determine selection criteria for the CD-ROMs to be migrated for
this pilot project. Criteria will encompass a range of file formats,
subject matter, and authoring agencies or bodies.
- Based on the criteria, select specific CD-ROMs from the U.S.
federal, UN, EU, Canadian, and FAO depositories for inclusion in this
project.
- In consultation with the Metadata Committee, determine what
metadata to capture during the migration process. Devise a format for
student workers to enter the information.
- For each CD-ROM, the librarians on this project will analyze and
document:
  - Intellectual property issues (e.g. proprietary software)
  - File formats – viability for preservation; ability to migrate (if necessary) to a
    more long-term format
  - Location of existing metadata and technical documentation
  - Instructions for student worker(s) to perform migration tasks
- The student(s) hired for this project will then:
  - Transfer the files from CD to the ssrs server
  - Migrate specific files from one format to another (e.g. Excel to ASCII)
  - Enter metadata and link to (or load) external documentation
  - Document time spent on tasks
  - Document errors or problems encountered during migration tasks
- The librarians on this project will then:
  - Perform quality control analysis on the migrated files (e.g. checksums for numeric
    data). The librarians will explore the utility of using JHOVE as a layer of
    file identification for the numeric data.
  - Analyze and troubleshoot errors or problems encountered
  - Adjust the workflow as needed, based on testing the migrated files and on
    communication with student(s)
  - Analyze the data to determine cost and storage requirements
  - Analyze the metadata to understand workflow and cost requirements
  - Enhance Orbis records for the CD-ROMs with links to server locations; catalog numeric
    data files into StatCat
  - Develop cost estimates for scaling this process to the entire GDIC CD-ROM collection
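The transfer and quality-control steps above could be sketched roughly as follows. This is a minimal illustration in Python, not the project's actual tooling: the function names (`sha256_of`, `build_manifest`, `transfer_and_verify`) and the choice of SHA-256 are assumptions made for the example, since the proposal mentions only "checksums". The sketch copies a CD's directory tree to a server location, records a checksum for every file on both sides, and reports any file whose copy does not match the original.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its checksum."""
    return {
        item.relative_to(root).as_posix(): sha256_of(item)
        for item in sorted(root.rglob("*"))
        if item.is_file()
    }


def transfer_and_verify(source: Path, destination: Path) -> list:
    """Copy the source tree, then list files that are missing or altered.

    An empty list means every file on the source was found at the
    destination with an identical checksum.
    """
    # dirs_exist_ok (Python 3.8+) lets the copy reuse an existing folder.
    shutil.copytree(source, destination, dirs_exist_ok=True)
    original = build_manifest(source)
    copied = build_manifest(destination)
    return [
        name for name, checksum in original.items()
        if copied.get(name) != checksum
    ]
```

A call such as `transfer_and_verify(Path("/mnt/cdrom"), Path("/data/gdic/cd001"))` (both paths are hypothetical) would return an empty list when every file was copied intact. The manifest could also be stored alongside the migrated files so that later integrity checks have a baseline to compare against.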
Timeline:
- Jan-Feb: develop selection criteria;
select CD-ROMs to migrate
- Feb-Mar: analyze each CD-ROM as
described above; hire and train student worker
- Mar-May: student migrates CDs;
librarians review work and adjust workflow as needed
- June-Aug: librarians analyze student
data and develop cost and storage requirements
- Sept-Oct: librarians communicate with
selected other institutions to discuss possible collaboration and
scalability issues
- Nov-Dec: write final report
Expenses:
Student labor: estimate $12.00/hour
- 5 hours training
- 4 hours per CD-ROM migrated (includes documentation requirements)
x 15 CDs = 60 hours (over approximately 8 weeks, for an average of 7.5
hours/week)
- Total: 65 hours, or $780
Travel to a peer institution in the region
(likely Columbia University) to discuss potential for
collaboration on a larger-scale CD-ROM migration project: $220
Server costs: $0. Estimated server storage
space: 10 megabytes.
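The expense figures above can be checked with a few lines of arithmetic. The sketch below, in Python, uses only the estimates stated in this section; the function and variable names are illustrative, and the combined total is simply the sum of the three line items rather than a figure stated elsewhere in the proposal.

```python
HOURLY_RATE = 12.00    # estimated student wage, dollars per hour
TRAINING_HOURS = 5     # one-time training
HOURS_PER_CD = 4       # migration plus documentation, per CD-ROM
PILOT_CDS = 15         # CD-ROMs in the pilot
PILOT_WEEKS = 8        # approximate duration of the migration work
TRAVEL_COST = 220.00   # visit to a peer institution
SERVER_COST = 0.00     # no charge for server space


def student_hours(num_cds, hours_per_cd=HOURS_PER_CD,
                  training_hours=TRAINING_HOURS):
    """Training time plus per-CD migration time, in hours."""
    return training_hours + hours_per_cd * num_cds


def student_labor_cost(num_cds, hourly_rate=HOURLY_RATE):
    """Student labor cost in dollars for a given number of CD-ROMs."""
    return student_hours(num_cds) * hourly_rate


migration_hours = HOURS_PER_CD * PILOT_CDS          # 60 hours
hours_per_week = migration_hours / PILOT_WEEKS      # 7.5 hours/week
pilot_labor = student_labor_cost(PILOT_CDS)         # 65 hours at $12 = $780
pilot_total = pilot_labor + TRAVEL_COST + SERVER_COST

print(f"Migration hours: {migration_hours} ({hours_per_week} per week)")
print(f"Student labor:   ${pilot_labor:.2f}")
print(f"Pilot total:     ${pilot_total:.2f}")
```

The same functions could be called with a larger `num_cds` as a first, rough input to the scaling estimate, with the caveat that time per CD is likely to vary by file format and that the pilot itself is meant to measure it.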
Benefits:
This project will contribute to the Yale
University Library in the following ways:
- Cost estimates and storage requirements data from this pilot study
can be used to determine costs and storage requirements for migrating the
entire GDIC CD-ROM collection and can inform other digital migration
projects.
- Selection criteria developed can be used to shape a collection
that should be deposited into the Rescue Repository for short-term
preservation.
- Information and data from selected CD-ROMs will be made available
on the Yale network, thus improving access for Yale faculty and students.
- Technical challenges and quality control aspects of migrating file
formats will be documented and can inform other migration projects in the
libraries.
- OAI-compliant metadata created and external documentation gathered
can enhance resource discovery.
- Explorations of preservation metadata requirements will provide a
test case environment as the Metadata Committee works on library-wide
preservation metadata standards and practices.
This project will benefit the library and
information science community in the following ways:
- Documentation of cost estimates, storage requirements, workflow,
metadata requirements, quality control measures, and technical challenges
can be used by other libraries confronting similar migration projects.
- Detailed articulation of challenges contributes to overall
understanding of at-risk digital government information.
- Dialogue with libraries with similar collections will lay the
foundation for a potential collaborative, large-scale project to migrate
at-risk collections to more stable long-term environments.
- Examination and analysis of specific titles can inform collection
development recommendations to the Inter-university Consortium for
Political and Social Research (ICPSR), a major data archive and source of
numeric data for Yale researchers.