Re: Requirements for mass digitization projects
There is another aspect of the Google approach to large-scale
text capture that calls for comment. Google has so far shown
little sensitivity to the requirement to provide consistent and
reliable metadata. There is now more information available from
Google on the books it has scanned (click on the 'About this
book' link). Indeed, some of it seems rather promising (e.g. the
list of 'key words and phrases', though I don't see an
explanation of how these phrases were selected; presumably by an
algorithm). Here is a typical instance:
http://books.google.com/books?vid=OCLC61913221&id=cD0JAAAAIAAJ&dq=clarendon
And for the public domain books, or some of them, these keywords
are themselves hyper-linked to occurrences throughout the text. A
very nice and typically Google-ish touch.
But in spite of some recent improvements, Google Book Search
still lacks explicit and satisfactory metadata overall. There is
still no reliable way of finding out systematically which books
have been incorporated into Google Book Search and when they
were incorporated, and it would be useful to be able to find out
from which copy of an early book the Google scan was made (this
information could easily be incorporated where library copies
have been used).
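To make the point concrete, here is a minimal sketch in Python of
the kind of per-volume record that would answer these questions.
It is purely illustrative: the class, its field names, and the
helper function are all assumptions, not anything Google Book
Search actually publishes.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ScanRecord:
    """Hypothetical per-volume record; every field name is illustrative."""
    title: str
    oclc_number: Optional[str] = None      # catalog identifier, if known
    holding_library: Optional[str] = None  # library whose copy was scanned
    shelfmark: Optional[str] = None        # identifies the physical copy
    date_added: Optional[str] = None       # ISO date the scan went online


def missing_fields(record: ScanRecord) -> List[str]:
    """Return the names of fields that are still unset (None)."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# A record that carries a catalog number but no copy-level information.
record = ScanRecord(title="Example early printed book",
                    oclc_number="(placeholder)")
print(missing_fields(record))
```

A listing like this, published for every scanned volume, would
show at a glance which copy was used and when it was added.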
I haven't been following the other large-scale digitization
projects at all closely, but I think it's very likely that the
best projects will have very open and explicit bibliographic
metadata -- and Google should behave in a more explicit and open
way if its project is to carry on.
BTW if you doubt that the Google project has some low-grade
scans, take a look at this page (which is proffered by Google as
the first of three typical Selected Pages for the book concerned)
http://books.google.com/books?vid=OCLC09756236&id=CrKWwZ4d29AC&pg=RA42-PA1&dq=thomas+hodgkin
One wonders whose fat fingers those were that have been scanned
in place of valuable 19th-century text.
These large-scale digitization projects have enormous potential.
I hope that the librarians ensure that the ones that work are the
ones with good catalog data and superior metadata. That
criterion should be added to any list of desiderata, and it bears
on all the aspects mentioned by Joe Esposito.
adam
On 12/25/06, Joseph J. Esposito <espositoj@gmail.com> wrote:
Ongoing discussions about various mass digitization projects,
driven primarily by the Google Libraries program but including
the respective activities of Microsoft, the Open Content
Alliance, and others, prompt these comments about what should
be taken into account as these programs proceed. My concern is
a practical one: Some projects are incomplete in their design,
which will likely result in their having to be redone in the
near future, an expense that the world of scholarly
communications can ill afford. There are at least four
essential characteristics of any such project, and there may
very well be more.
As many have noted, the first requirement of such a project is
that it adopt an archival approach. Some scanning is now being
done with little regard for preserving the entire informational
context of the original. Scanning first editions of Dickens
gives us nothing if the scans do not precisely copy first
editions of Dickens; the corollary to this is that clearly
articulated policies about archiving must be part of any mass
digitization project. Some commercial projects have little
regard for this, as archival quality simply is not part of the
business plan; only members of the library community are in a
position to assert the importance of this. An archival
certification board is evolving as a scholarly desideratum.
Archives of digital facsimiles are important, but we also need
readers' editions, the second requirement of mass digitization
projects. This goes beyond scanning and involves the editorial
process that is usually associated with the publishing
industry. The point is not simply to preserve the cultural
legacy but to make it more available to scholars, students, and
interested laypeople. The high school student who first
encounters Dickens's "Great Expectations" should not also be
asked to fight with Victorian typography, not to mention
orthography. In the absence of readers' editions, broad public
support for mass digitization projects will be difficult to
come by.
As devotees of "Web 2.0" insist with increasing frequency, all
documents are in some sense community documents. Thus scanned
and edited material must be placed into a technical environment
that enables ongoing annotation and commentary. The
supplemental commentary may in time be of greater importance
than the initial or "founding" document itself, and some
comments may themselves become seminal. I become uneasy,
however, when the third requirement of community engagement is
not paired with the first of archival fidelity. What do we
gain when "The Declaration of Independence" is mounted on a Web
site as a wiki? Sitting beneath the fascinating activities of
an intellectually engaged community must be the curated
archival foundation.
The fourth requirement is that mass digitization projects
should yield file structures and tools that allow machine
processes to work with the content. Whether this is called
"pattern recognition" or "data mining" or something else is not
important. What is important is to recognize that the world of
research increasingly will be populated by robots, a term that
no longer can or should carry a negative connotation. Some
people call this "Web 3.0", but I prefer to think of it as "the
post-human Internet," which may not even be a World Wide Web
application.
To my knowledge, none of the current mass digitization projects
fully incorporates all four of these requirements.
Note that I am not including any mention of copyright here,
which is the topic that gets the most attention when mass
digitization is contemplated. All four of these requirements
hold for public domain documents. Copyright is a red herring.
Joe Esposito