Thanks to Missy Roser for passing this along...
......................................................
from The Seybold Report
Vol. 4, No. 21, Feb. 9, 2005
Following the Trail of the Disappearing Data
BY VICTORIA MCCARGAR
Since the mid-1990s, it has become increasingly clear that information
stored digitally - unlike physical photos, for example - is unnervingly
fragile. Lacking the appropriate systems, workflows and metadata to
ensure longevity, news archives are setting the stage for future data
loss.
Sitting on my desk is a black-and-white aerial photograph looking up
Pasadena's Arroyo Seco at the Rose Bowl on a sparkling winter day. The
picture is in very good condition, the emulsion intact, with a couple of
minor wrinkles and a mark or two from an orange grease pencil. I can see
on the back the carefully applied caption from the day it ran in the Los
Angeles Times, Jan. 1, 1935 (Alabama beat Stanford, 29-13). This picture
looks like it's good to go for at least another 70 years.
On the monitor of my Macintosh G4, I have a JPEG from Mullaittivu, Sri
Lanka, from Jan. 1, 2005. It's an arresting image of the forearm and
hand of a dead woman, visual evidence of the human tragedy of the Dec.
26, 2004, tsunami. Someone in 2075 (amid global warming-induced
flooding, perhaps) might want to see exactly what Mullaittivu looked
like on this day. Will he or she be able to pull up this 200-dpi, 584KB
nugget of disaster history 70 years from now?
Don't bet on it just yet.
Since the mid-1990s, it has become increasingly clear that information
stored digitally is terribly fragile. Newspapers periodically run
stories about this phenomenon and give good coverage to heroic data
rescue efforts, such as the British project to salvage the Digital
Domesday Book, or conundrums, like the difficulties museums are having
curating digital works of art. But there appears to be a mysterious
disconnect when it comes to another group with an important cultural
stake in long-term preservation: newspaper archives.
Research on a global scale is under way to find solutions to preserving
born-digital content, but it's a field limited almost exclusively to
academic and research libraries, national archives and bureaucratic
record keepers - professionals invested with a defined responsibility to
keep digital files alive and accessible for a long time.
So it is ironic that even as they're publishing stories about data
fragility, newspapers haven't quite made the connection with what is
going on in their own electronic morgues. (I refer throughout to
newspaper archives, but in fact the same issues affect other news media
collections as well - for that matter, any data collection that is
supposed to last indefinitely.)
The fact is, photo and multimedia databases, and even text databases, are
potentially shorter-lived than yellowing newsprint, and some formats in
use today will ultimately prove more unstable than chemical color
photography. Indeed, the very technologies that have enabled the rapid
dissemination of news are conspiring to create a generation-size gap in
the historic record.
Only 1s and 0s
Digital data is basically a collection of on-off switches, strings of 1s
and 0s (bits) ordered in manageable chunks called bytes. In simplest
terms, what differentiates the million bytes of a 1MB JPEG from the
million bytes of a 1MB spreadsheet is which application interprets the
bytes, and how. But other factors besides software determine the
future accessibility and readability of the 1s and 0s: platform and
operating system, storage structure, technical metadata, content
description, copyright and even (maybe especially) institutional
discipline. Over time, sometimes catastrophically quickly but more
likely gradually, a byte stream will tend to become unreadable,
essentially reverting to the magnetic on-off switches of storage media,
the 1s and 0s.
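To make the point about interpretation concrete, here is a minimal
Python sketch. The eight bytes are an arbitrary, made-up example, not
taken from any real file format; the same bytes are read as text, as
pixel-like values and as two integers.

```python
import struct

# Eight arbitrary bytes (a made-up example). Nothing in the bytes
# themselves says what kind of information they hold.
raw = bytes([0x4E, 0x45, 0x57, 0x53, 0x31, 0x39, 0x33, 0x35])

# Interpretation 1: ASCII text.
as_text = raw.decode("ascii")

# Interpretation 2: eight unsigned 8-bit values, as an image
# application might read grayscale pixels.
as_pixels = list(raw)

# Interpretation 3: two little-endian unsigned 32-bit integers, as a
# spreadsheet-like application might read numeric cells.
as_ints = struct.unpack("<2I", raw)

print(as_text)    # NEWS1935
print(as_pixels)  # [78, 69, 87, 83, 49, 57, 51, 53]
print(as_ints)
```

All three readings are equally "valid" as far as the storage medium is
concerned; only knowledge kept outside the byte stream says which one
was intended.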
The task of identifying all the risk factors and putting preservation
solutions in place has barely begun. In the meantime, lacking the
appropriate systems, workflows and metadata to ensure longevity, news
archives are setting the stage for future data loss. It's not too much
of a stretch to say that byte streams that have been stored for the past
10 years - and those that will be captured and stored tonight or next
week - might already be lost.
It's not hard to see how this happened.
Lured by speed, unprecedented accessibility and flexibility, not to
mention gains in staff productivity, publishers and their newsrooms have
embraced technologies that enable a wealth of functions: easily
captured, edited and transmitted photography, full-page pagination, Web
publishing, content sharing and repurposing, and PDF workflows, to name
the big ones. Over in the news library, meanwhile, huge gains in storage
density and processing power have meant that big, increasingly sophisticated
image databases or burgeoning collections of images on CD-ROMs have
relegated black-and-white prints in envelopes to the back of the stacks.
"Archives" have morphed into "assets," and assets have come to refer to
a variety of formats beyond photography and text. Information graphics,
analytical databases, HTML pages and digital video have all become
part of the potential multimedia archival mix. As technology has come to
play a larger role in the news archives, responsibility for maintaining
content has in many cases been transferred from traditional archivists
and librarians to systems analysts. At the same time, the automatic
capture of bibliographic and descriptive metadata from the publishing
system has resulted, not surprisingly, in heavily downsized archives and
library staffs. This is a major shift in information management
philosophy, because IT departments arguably have a different approach
than libraries to long-term preservation.
Budgeting for Preservation
Archives consisting of envelopes of old clippings and black-and-white
photographs didn't require large capital outlays every few years to
sustain them; as long as they were protected from dangers such as fire
and water, and kept in a reasonably controlled environment, they could
survive almost indefinitely.
Digital data is very different, primarily because it doesn't respond
well to that kind of benign neglect. To forget about a few envelopes of
CD-ROMs in a file drawer for 10 or 15 years is asking to lose them; to
skip a couple of upgrades is to put an entire format at risk.
The problem with funding archives, moreover, is that it's difficult for
budgeters to see a return on investment. While digital preservation
costs are still mostly a matter of speculation, most researchers agree
that it will be expensive. True, some news archives generate a modest
revenue stream from reselling old images and articles in new digital
forms, but beyond that, publishers and chief financial officers aren't
necessarily willing to spend money to meet some vaguely perceived
obligation to maintain a record of history in the making.
Surviving Space and Time
Digital archives exist in a physical world and are subject to physical
hazards such as equipment failures, burst pipes and the like. Properly
backed up, the data will survive those dangers and can be restored. But
digital preservation is not the same thing as disaster recovery,
although IT professionals often assume it is. The threats I'm concerned
with here are much
more subtle, amounting to the gradual loss of information through a
variety of changes over time.
Software obsolescence. This is such a seemingly ordinary problem that
it's tempting to think that it really isn't one at all. If systems
administrators are careful enough to make every upgrade on schedule, the
objects will migrate naturally to the next version, or so the thinking
goes. But batch migration of thousands or millions of individual objects
from one version to the next is not common practice. The typical
workflow is to leave an object in its original version until a user
needs it for some new purpose.
But what if a user retrieves the object created in version N, and the
only available software in-house is version N+5? Backward compatibility
will never be unlimited, and the nature of forward migration is to
introduce errors with every upgrade, however minute or undetectable.
Even with well-executed batch migration, over time those errors are
cumulative and the data gradually becomes unreadable (see illustration).
That assumes the software continues to exist and function. WordStar, a
nearly ubiquitous word processor in the 1970s and 1980s, is often held
up as the poster child for digital obsolescence. No current word
processing programs will open a WordStar file, and the company stopped
manufacturing the software in 1991. Cracking old WordStar files now
amounts to a hobby for computer enthusiasts.
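The claim that small migration errors are cumulative can be illustrated
with a rough calculation. The sketch below is only a toy model: it
assumes, hypothetically, that every migration independently damages the
same small fraction of whatever is still intact. Real error rates are
not known and are unlikely to be this uniform.

```python
def surviving_fraction(loss_per_migration: float, migrations: int) -> float:
    """Fraction of content still intact after successive migrations.

    Toy model (an assumption, not a measured result): each migration
    independently damages the same fraction of what is still intact.
    """
    if not 0.0 <= loss_per_migration <= 1.0:
        raise ValueError("loss_per_migration must be between 0 and 1")
    if migrations < 0:
        raise ValueError("migrations must be non-negative")
    return (1.0 - loss_per_migration) ** migrations


# A hypothetical 1 percent loss per migration, compounded.
for n in (1, 5, 10, 20):
    print(n, round(surviving_fraction(0.01, n), 4))
```

Under those assumptions, a loss too small to notice in any single
upgrade leaves only about 90 percent of the content intact after 10
migrations and about 82 percent after 20.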
Hardware obsolescence. Every new data storage format signals the end of
its predecessors, be they Zip disks putting an end to 3.5-inch floppies
or EVDs (enhanced versatile disks) putting users on notice that there's
a format beyond DVD. While it's true that few people do any serious
archiving on Zips (or their successor, memory sticks), many news
archives have consigned their photography to CD-ROMs, and they're now
looking at having to shift to DVDs. Inasmuch as CDs are turning out to
be subject to more physical deterioration sooner than thought, having to
reformat on the more stable DVD platform is probably a good thing. But
it's still a moving target.
If photographs are stored in large databases with industrial-strength
hard disks and tape-drive backups, the material is easier to move
forward than collections of disks.
Inadequate metadata. In a January 1995 Scientific American article, RAND
Corp. researcher Jeffrey Rothenberg pointed out that if modern
civilization is going to hang onto digital information into the future,
its denizens are going to have to create a lot of other information
about the information to go with it, to enable future seekers to write
new software to "bootstrap" their way into rendering the obsolete data
into some form that humans can read.
That information about the information, or metadata, is critical to the
preservation process - probably a great deal more important than
software or hardware, in fact. Much of the research agenda in data
preservation focuses on what that metadata should comprise. In his
article, Rothenberg proposed that the information include, minimally,
specifications about hardware, operating system and software
requirements; byte-stream interpretation; and enough information about
the software code itself to allow a future user to crack it -
essentially a digital Rosetta Stone.
That so-called technical metadata is in addition to the more familiar
content and context metadata: the journalistic who, what, where, when,
and why of good caption-writing; bibliographic data such as date of
publication, section, edition, part and page; and enough information
about the copyright status of the object to ensure that future users
know what their access rights are.
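As a rough illustration of how those layers of metadata might travel
together with one object, here is a minimal Python sketch. The class
and field names are hypothetical and do not follow MIX, METS, IPTC or
any other standard; values marked "(placeholder)" are invented for the
example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class TechnicalMetadata:
    hardware: str
    operating_system: str
    software: str
    byte_stream_format: str  # how the bytes are meant to be interpreted


@dataclass
class ContentMetadata:
    who: str
    what: str
    where: str
    when: str
    why: str


@dataclass
class BibliographicMetadata:
    publication_date: str
    section: str
    edition: str
    part: str
    page: str


@dataclass
class ArchiveRecord:
    object_id: str
    technical: TechnicalMetadata
    content: ContentMetadata
    bibliographic: BibliographicMetadata
    rights: str  # copyright status and access rights


record = ArchiveRecord(
    object_id="(placeholder) photo-0001",
    technical=TechnicalMetadata(
        hardware="Macintosh G4",
        operating_system="(placeholder)",
        software="(placeholder) image editor and version",
        byte_stream_format="JPEG, 200 dpi",
    ),
    content=ContentMetadata(
        who="(placeholder)",
        what="Aftermath of the Dec. 26, 2004, tsunami",
        where="Mullaittivu, Sri Lanka",
        when="2005-01-01",
        why="(placeholder)",
    ),
    bibliographic=BibliographicMetadata(
        publication_date="(placeholder)",
        section="(placeholder)",
        edition="(placeholder)",
        part="(placeholder)",
        page="(placeholder)",
    ),
    rights="(placeholder) status unknown",
)

# Serialize to plain text so the record can be stored beside the object.
print(json.dumps(asdict(record), indent=2))
```

Even in this toy record, most of the fields cannot be filled in by the
camera or the publishing system; someone has to supply them.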
Some of this metadata can be captured or generated automatically, but a
lot of it cannot, and producing it will not be inexpensive. Assigning
index terms according to a controlled vocabulary, sometimes known as
keywording or taxonomy, is a good example of this. As much art as
science, good indexing provides ways to limit searches and zero in on
the subject of an article or image, saving the user from looking at a
lot of irrelevant material.
As multimedia databases grow and become more complex, smart metadata
will make the difference between a useable database and one that merely
contains objects. If an object can't be searched for, found, retrieved
and used, it is as good as lost. As brilliant as Google is, simple
free-text searching isn't up to the kind of sophisticated searching that
news users need. No one will want to slog through a Google-scaled 10,000
or 20,000 hits in his or her own multimedia database.
And just because an object is never retrieved doesn't mean it doesn't
still reside in the database. Over time, systems analysts and budget
writers will find themselves supporting - and financing - a larger and
larger chunk of this "dark" data.
Lack of standards and best practices. Preservation researchers agree
that tight standards are key to solving the data longevity problem. The
academic and research library and archives worlds, which have been
grappling with the digital preservation problem for most of a decade,
are coming at it from a foundation of fairly rigid standards for digital
data structures and description, beginning with MARC (machine-readable
cataloging) in the 1960s, and proceeding through today's emerging
standards like MIX (technical metadata for still images in XML) and METS
(metadata encoding and transmission standard). They are, consequently,
well prepared to begin adding preservation metadata to their
institutional workflows as standards begin to take final shape in the
next few years.
News archives practice has developed in response to the deadline demands
of news research and, more recently, the requirements of repurposing
material for the Web and other products, including sharing content with
sibling properties. One-off systems and local customization are
gradually giving way to discussions of ways to interoperate, developing
best-practices workflows not just within a single news organization, but
within a corporate chain. The venerable IPTC (International Press
Telecommunications Council) "header" is a logical place to start talking about
standards for preservation, but eventual solutions will come at the
expense of flexibility and the latitude to customize.
Lack of institutional discipline. Customization has usually been born of
necessity. Meeting production deadlines and the "get the paper out at
any cost" mentality that is the hallmark of working in a newsroom tend
to produce some really creative workflows. However, in the automated
capture and processing of metadata, spot innovations and one-off
workarounds can play havoc with the digital record.
Best practices for digital archiving suggest that the process actually
begins with the photographer or reporter and continues through the
entire editing process. But the burdens and requirements of well-formed
metadata are way beyond what can reasonably be expected of shooters,
wordsmiths and artists. On the archives end, the only way to guarantee
the compliance of the record is a set of quality controls, which are
usually humans drawing a salary and benefits. Without them, the
resulting record is basically an anomaly and, over time, subject to
becoming invisible to a future search engine.
Moreover, any current and future efforts to develop digital preservation
solutions will be aimed at solving a standardized problem - developing a
uniform migration path for JPEGs to a future format like JPEG2000, for
example. If an individual news archive isn't IPTC-compliant, is using a
slightly different version of JPEG or has incomplete technical metadata
because of one of a dozen possible user workarounds, the standard
"rescue" solution might pass it by.
XML is frequently mentioned as a preservation solution because of its
platform independence and highly intuitive, self-describing tag-sets.
XML in theory and XML in practical application are quite different,
however, and the rigid workflows required for well-formed XML are hard
to come by in most newsrooms, especially at the design desks, where a
lot of last-minute changes take place. When deadline performance is at
stake, the creative workaround will trump the compliant workflow every
time.
Copyright. It's not a technological problem, but it's almost as big a
threat as obsolescence and could turn out to be even harder to solve. In
the fallout from the Supreme Court's 2001 New York Times Co. v. Tasini
decision over the rights of freelancers, large parts of news archives
disappeared from their host databases, either moved offline or deleted
outright. As digital copyright continues to evolve, archive managers are
struggling with how to handle freelance material, for which in many
cases archiving is verboten.
What can a newspaper or magazine do with freelance stories and photos to
archive its own published record? The answer, surprisingly, is to
microfilm it with the rest of the paper. The electronic version, on the
other hand, may exist in a digital limbo, moved to the archive in an
automated workflow, invisible to users, its status uncertain. And
creating metadata for copyright that will be meaningful 50 or 100 years
from now seems to require a rather large crystal ball.
Coping Techniques
While preservation-oriented standards, practices, users and vendors sort
themselves out, there are a few seat-of-the-pants techniques that work
fairly well, as long as alert people in the organization stay on top of
the content they're trying to keep. None, however, is more than a
short-term, stop-gap method. At this point, that's simply all there is.
Migration on demand. Files are upgraded piecemeal as the need for one in
the newer version arises. Unneeded files remain in the old version
indefinitely. The migration process also necessitates accounting for the
transfer of all the metadata, which might exist in a separate format,
while retaining all its connections to the original object if the
metadata is not contained, or "encapsulated," with the object. A
thorough, well-documented testing program is essential before
undertaking a larger-scale migration, and careful documentation is
necessary for future users to understand the outcomes of successive
migrations.
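A minimal Python sketch of migration on demand follows. Everything in
it is hypothetical: StoredObject, migrate_on_demand and toy_convert are
illustrative names, and the "converter" merely rewrites a version tag
where a real one would translate a file format. The sketch shows the
three obligations described above: upgrade only when asked, carry the
metadata along, and document each migration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StoredObject:
    data: bytes
    version: int
    metadata: Dict[str, str]
    migration_log: List[Dict[str, int]] = field(default_factory=list)


def migrate_on_demand(
    obj: StoredObject,
    target_version: int,
    convert: Callable[[bytes, int, int], bytes],
) -> StoredObject:
    """Upgrade obj only when a user actually needs target_version."""
    if obj.version >= target_version:
        return obj  # already current; leave it alone
    new_data = convert(obj.data, obj.version, target_version)
    entry = {
        "from_version": obj.version,
        "to_version": target_version,
        "bytes_before": len(obj.data),
        "bytes_after": len(new_data),
    }
    return StoredObject(
        data=new_data,
        version=target_version,
        metadata=dict(obj.metadata),  # metadata travels with the object
        migration_log=obj.migration_log + [entry],  # document the outcome
    )


def toy_convert(data: bytes, old: int, new: int) -> bytes:
    # Stand-in for a real format converter: rewrites only a version tag.
    return data.replace(f"v{old}".encode("ascii"), f"v{new}".encode("ascii"), 1)


original = StoredObject(b"v1:caption text", 1, {"caption": "Rose Bowl, 1935"})
upgraded = migrate_on_demand(original, 2, toy_convert)
print(upgraded.data, upgraded.version, upgraded.migration_log)
```

The original object is left untouched, so a failed or doubtful
migration can be compared against its source.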
Technology preservation. This involves keeping one or more older
computers running and maintaining the software versions that require
older machines. Files that can't be migrated are stored here, too. This
is actually a fairly good, inexpensive approach, as long as the machines
are in working order or can be repaired if damage occurs. It's not a
viable solution beyond a few years, though. Similarly, the files might
not be formally backed up anywhere, meaning a system crash is
potentially the end of the data.
Normalization. This refers to saving the object in a single format that
is easier to preserve. In practice, this can mean exporting files to
flat ASCII or even printing everything out on paper (popular for
e-mail). The development of the so-called "archival" PDF, known as
PDF/A, is another example of this approach, one that aims to extend
"normalization" to any system in any institution. Loss of functionality
of the original document is an obvious drawback, and there are further
issues of how to authenticate the "original" if that is a consideration.
(For example, a PDF of a freelance contract, which is a legal document,
will require a fairly sophisticated method of authenticating the
signatures - yet another bit of software that will somehow have to
travel with the document for the life of the contract and beyond.)
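The loss that comes with normalization is easy to demonstrate. The
following Python sketch flattens text to plain ASCII; to_flat_ascii is
a hypothetical helper written for this example, not part of any
archiving product.

```python
import unicodedata


def to_flat_ascii(text: str) -> str:
    """Reduce text to 7-bit ASCII, dropping whatever cannot be kept."""
    # Split accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Discard every character that has no ASCII equivalent.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# An accented word and a pair of typographic quotation marks.
sample = "Caf\u00e9 r\u00e9sum\u00e9 \u201cquoted\u201d"
print(to_flat_ascii(sample))  # Cafe resume quoted
```

The words survive, but the accents and the typographic quotation marks
are gone for good - a small-scale version of the lost functionality
described above.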
Bit-level preservation. This is a fancy term for hanging onto problem
files but giving up on the ability to render them pending some future
technological development. The hope is that if the data can be
preserved, someone will eventually figure out a way to render it.
Interestingly, systems administrators might already be doing a fair
amount of bit-level preservation without knowing it, depending on how
many files they're accumulating in their databases that are obsolete,
can't be opened, are no longer identifiable, or lack enough metadata to
support search and retrieval. Whether that mass of dark data eventually
is measured in terabytes or more is a function of how comprehensive the
metadata is and how thoroughly the whole asset management process has
been documented.
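Deliberate bit-level preservation implies at least being able to tell
whether the stored bits have changed. One widely used way to do that,
which the discussion above does not itself name, is to record a
checksum for each byte stream and recompute it later. A minimal Python
sketch, with made-up object names and contents:

```python
import hashlib
from typing import Dict, List


def fingerprint(data: bytes) -> str:
    """SHA-256 digest of a byte stream, as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit(holdings: Dict[str, bytes], recorded: Dict[str, str]) -> List[str]:
    """Names of objects whose bytes no longer match the recorded digest."""
    return sorted(
        name
        for name, data in holdings.items()
        if fingerprint(data) != recorded.get(name)
    )


# Hypothetical holdings: two byte streams nobody can render any more.
holdings = {
    "photo_0001.bin": b"\x00\x01\x02\x03",
    "story_0001.bin": b"old word processor file",
}
recorded = {name: fingerprint(data) for name, data in holdings.items()}

# Simulate silent corruption of a single stored byte.
holdings["photo_0001.bin"] = b"\x00\x01\x02\xff"

print(audit(holdings, recorded))  # ['photo_0001.bin']
```

A check like this says nothing about whether a file can still be
opened; it only confirms that the 1s and 0s are the same ones that
went into storage.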
Hard Questions
News archives have a comparatively long track record in what is now
termed digital asset management. Nevertheless, it's important to
remember that we're still in the early stages of trying to support
digital content into the future, and what seems like a workable solution
now probably won't be after a number of years. All told, media archives
have about 20 years' experience with text databases and half that with
large-scale digital image archives. The success or failure of successive
migrations after 70 or 80 years won't be known for some time yet, at
which point there will be no analog original, such as film negatives or
prints, to fall back on.
While solutions evolve, news archivists should be asking themselves a
few questions that will go a long way toward putting solutions in place,
once they emerge, in an ongoing dialog among IT, news librarians and
journalists about the process of archiving.
What are we archiving? In the days of shelves and manila envelopes,
limits on archives were a function of space, and it was obvious that
periodic decisions had to be made about what to discard. One of the
interesting developments of the Digital Age is the gradual abandonment
of archival policies, written or otherwise, that spelled out what was
going to be kept permanently, what was to be kept temporarily and for
how long, and what was to be "de-accessioned" outright. Creators and
archivists didn't always see eye to eye on the policies, though, so it's
not surprising that as technology improved, creators began asking
archivists to take in more material than ever before, whether or not
they were equipped to handle it.
From a human standpoint, one of the great things about digital storage
is that it's compact, convenient and, unlike bulging shelves, out of
sight. But the bottomless accumulation of unpublished pictures from
photo assignments, for example, is likely to be every bit as expensive
as shelves of prints, or more so, if the intent is to keep the files
viable indefinitely.
And if users, archivists and IT support personnel haven't arrived at a
mutual understanding of what the system requirements are, including
appropriate expiration or selection strategies, the result will sooner
or later be an unmanageable, minimally described mass of data weighed in
terabytes or petabytes. Making policies now will save a lot of grief
later.
How much is preserving digital archives going to cost? There are so many
variables that preservation costs are difficult to estimate, but some
researchers put it conservatively at $1 million per terabyte per decade,
assuming that the institution has already developed (and paid for) all
the necessary metadata analysis and creation; has seamless, reliable,
ironclad workflows; and has established failsafe migration paths for all
of its format types - three pretty hefty assumptions. In other words,
once the expensive work of development has been accomplished, it is
still not going to be as cheap as maintaining paper and emulsion in
manila envelopes.
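To get a feel for what that estimate implies, here is a
back-of-the-envelope Python sketch. It assumes, simplistically, that
the cost scales linearly with both archive size and time and never
changes; the 5-terabyte archive is a hypothetical figure chosen for
illustration.

```python
# Dollars per terabyte per decade: the estimate quoted above.
COST_PER_TB_PER_DECADE = 1_000_000


def preservation_cost(terabytes: float, years: float) -> float:
    """Rough cost in dollars, assuming simple linear scaling."""
    return terabytes * (years / 10) * COST_PER_TB_PER_DECADE


# A hypothetical 5-terabyte archive kept readable for 70 years.
print(preservation_cost(5, 70))  # 35000000.0
```

By that rough reckoning, keeping a modest archive alive for as long as
the 1935 Rose Bowl print has already lasted would run to tens of
millions of dollars.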
Who is going to be responsible? There is a natural partnership to be
fostered among information professionals in the news library and
technologists in the IT department. Hardware and software, the
centerpieces of the IT approach, are only half of the equation. The rest
is metadata development, standards compliance and user workflows - the
domain of information professionals from libraries and archives. But the
system can't succeed without buy-in from users in the newsroom, who need
to be included in the development of realistic policies for long-term
preservation, and who can help to promote intelligent, compliant
workflows among their creative colleagues.
Responsibility extends to understanding standards and compliance, and
keeping a close eye on developments in the field. An emerging body of
literature about preservation metadata will eventually influence
standards, XML schemas and, in turn, systems developers and integrators.
See Preservation Metadata: Implementation Strategies, or PREMIS
(www.oclc.org/research/projects/pmwg/), for information about one
important effort. But since vendors won't develop preservation-aware
solutions until customers start asking for them, it behooves media
properties to be well-informed about preservation and their own internal
long-term retention strategies.
How do we pay for this? Some of the thorniest questions concern how to
pay for sustainable digital collections. There are more questions than
answers. What is the value of the collection, and to whom? What is the
ROI for text, images and other material, such as Web pages and video,
that is of little or no commercial value but has intrinsic historic
worth? The contents of news archives are the history of a city, a
nation, a culture, a snapshot of an epoch of humankind, but if you can't
sell it on your Web site, how can you justify the expense of maintaining
access decade after decade?
The short answer is that it might not be feasible. The problem might
just be too big, too complex and too expensive over time for individual
media properties or even their parent companies to sustain on their own.
In the research and academic world, there is ongoing work to scope out
models for "trusted digital repositories," third-party entities that
have the mission and expertise to take in the digital contents from
outside archives and do the preservation work on behalf of their
customers, guaranteeing continued access according to a predetermined
set of criteria.
Cooperative efforts - perhaps an industrywide project - would leverage
what limited expertise exists while the field grows and attracts more
practitioners. Research and development funding, moreover, could be
spread among a larger pool. But that will still require a concerted
effort at standards development and best practices to be a realistic
proposition. This will require partnerships between media companies and
vendors, as well as rethinking established newsroom workflows.
What about what we have already archived? Another provocative question
is, what has already been lost? News databases are full of complicated
multiplatform formats, compound, complex objects and nonstandard,
locally customized metadata schemas. A standard for preservation
metadata is close, but implementation will take a few years. Without
these critical components of a preservation-oriented archive, how will
old data move forward or how will it be rescued after the fact if
migration fails? Is there already a gap in the historic record? Some
archivists believe the 1990s are already gone. Only time will determine
whether they're alarmists - or actually right.
Fortunately, I know that my Jan. 1, 2005, picture from the devastation
at Mullaittivu will be human-readable in 2075. It'll be on microfilm.
About the Author
Victoria McCargar is involved in newsroom and library technology support
and strategic planning at the Los Angeles Times, where she is a senior
editor. A frequent lecturer, she is a member of two international teams
researching digital preservation and is investigating standards and
preservation strategies for the newspaper industry. She is an adjunct
professor at UCLA and holds master's degrees in information science and
journalism. She can be reached at mccargar@mac.com.