Requirements for mass digitization projects
Ongoing discussions about various mass digitization projects,
driven primarily by the Google Libraries program but including
the respective activities of Microsoft, the Open Content
Alliance, and others, prompt these comments about what should be
taken into account as these programs proceed. My concern is a
practical one: Some projects are incomplete in their design,
which will likely result in their having to be redone in the near
future, an expense that the world of scholarly communications can
ill afford. There are at least four essential characteristics of
any such project, and there may very well be more.
As many have noted, the first requirement of such a project is
that it adopt an archival approach. Some scanning is now being
done with little regard for preserving the entire informational
context of the original. Scanning first editions of Dickens
gives us nothing if the scans do not precisely copy first
editions of Dickens; the corollary to this is that clearly
articulated policies about archiving must be part of any mass
digitization project. Some commercial projects have little
regard for this, as archival quality simply is not part of the
business plan; only members of the library community are in a
position to assert the importance of this. An archival
certification board is evolving as a scholarly desideratum.
Archives of digital facsimiles are important, but we also need
readers' editions, the second requirement of mass digitization
projects. This goes beyond scanning and involves the editorial
process that is usually associated with the publishing industry.
The point is not simply to preserve the cultural legacy but to
make it more available to scholars, students, and interested
laypeople. The high school student who first encounters
Dickens's "Great Expectations" should not also be asked to fight
with Victorian typography, not to mention orthography. In the
absence of readers' editions, broad public support for mass
digitization projects will be difficult to come by.
As devotees of "Web 2.0" insist with increasing frequency, all
documents are in some sense community documents. Thus scanned
and edited material must be placed into a technical environment
that enables ongoing annotation and commentary. The supplemental
commentary may in time be of greater importance than the initial
or "founding" document itself, and some comments may themselves
become seminal. I become uneasy, however, when the third
requirement of community engagement is not paired with the first
of archival fidelity. What do we gain when "The Declaration of
Independence" is mounted on a Web site as a wiki? Sitting
beneath the fascinating activities of an intellectually engaged
community must be the curated archival foundation.
The fourth requirement is that mass digitization projects should
yield file structures and tools that allow machine processes to
work with the content. Whether this is called "pattern
recognition" or "data mining" or something else is not important.
What is important is to recognize that the world of research
increasingly will be populated by robots, a term that no longer
can or should carry a negative connotation. Some people call
this "Web 3.0", but I prefer to think of it as "the post-human
Internet," which may not even be a World Wide Web application.
To my knowledge, none of the current mass digitization projects
fully incorporates all four of these requirements.
Note that I am not including any mention of copyright here, which
is the topic that gets the most attention when mass digitization
is contemplated. All four of these requirements hold for public
domain documents. Copyright is a red herring.
Joe Esposito