Re: Requirements for mass digitization projects
I should apologise to the list for not drawing your attention to
the hopeless and groaning pun:- large-scale digitization projects
do not have to result in digital enlargement projects....
And here is a cleaner link to the enlarged digits:
http://books.google.com/books?vid=OCLC09756236&id=CrKWwZ4d29AC&pg=RA42-PA1#PRA42-PA1,M1
And yes, I had noticed that there is an OCLC catalog number
buried in the URL, but it's not a helpful way of presenting that
information.
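For anyone who wants to dig the number out anyway, here is a minimal sketch. It assumes that the `vid` query parameter carries the OCLC number behind an "OCLC" prefix, which is only inferred from the links in this thread and is not a documented Google format; the function name `oclc_number_from_url` is made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def oclc_number_from_url(url):
    """Return the digits after an 'OCLC' prefix in the vid parameter, or None.

    Assumption: the vid=OCLC... convention is inferred from the URLs
    quoted in this thread, not from any published specification.
    """
    query = parse_qs(urlsplit(url).query)
    for vid in query.get("vid", []):
        if vid.upper().startswith("OCLC"):
            digits = vid[4:]
            if digits.isdigit():
                return digits
    return None


url = ("http://books.google.com/books?vid=OCLC09756236"
       "&id=CrKWwZ4d29AC&pg=RA42-PA1#PRA42-PA1,M1")
print(oclc_number_from_url(url))
```

The number is returned as a string so that any leading zero is kept as it appears in the link.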
Google has a lot of cataloging information to hand (there we go
again) or it could not possibly be organising its scanning
effort. It is a great pity that it does not help users by
providing easy access to it.
I hope that in a New Year spirit of optimism and charity the list
will forgive the dreadfulness of the puns.
Adam Hodgkin
www.exacteditions.com
mobile: +44 7931 371 744
skype name: adam.hodgkin
skype in: +44 20 7871 4537
On 12/28/06, adam hodgkin <adam.hodgkin@gmail.com> wrote:
There is another aspect to the Google approach to large-scale
text capture which calls for comment. Google has so far shown
little sensitivity to the requirement to provide consistent and
reliable meta-data. There is now more information available
from Google on the books it has scanned (click on the 'About
this book' link). Indeed some of it seems rather promising (e.g.
the list of 'key words and phrases', though I don't see an
explanation of how these phrases were selected; presumably by an
algorithm). Here is a typical instance:
http://books.google.com/books?vid=OCLC61913221&id=cD0JAAAAIAAJ&dq=clarendon
And for the public domain books, or some of them, these
keywords are themselves hyper-linked to occurrences throughout
the text. A very nice and typically Google-ish touch.
But in spite of some recent improvements there is overall a
lack of explicitness and satisfactory meta-data from Google
Book Search. There is still no reliable way of finding out
systematically which books have been incorporated into Google
Book Search and when they were incorporated, and it would be
useful to be able to find out from which copy of an early book
the Google scan was made (this information could be easily
incorporated where library copies have been used).
I haven't been following the other large-scale digitization
projects at all closely, but I think it's very likely that the
best projects will have very open and explicit bibliographic
metadata -- and Google should behave in a more explicit and
open way if its project is to carry on.
BTW if you doubt that the Google project has some low-grade
scans, take a look at this page (which is proffered by Google
as the first of three typical Selected Pages for the book
concerned):
http://books.google.com/books?vid=OCLC09756236&id=CrKWwZ4d29AC&pg=RA42-PA1&dq=thomas+hodgkin
One wonders whose fat fingers those were, that have been
scanned in place of valuable 19th C text?
These large-scale digitization projects have enormous
potential. I hope that the librarians ensure that the ones that
work are the ones with good catalog data and superior
meta-data. That criterion should be added to any list of
desiderata, and it bears on all the aspects mentioned by Joe
Esposito.
adam
On 12/25/06, Joseph J. Esposito <espositoj@gmail.com> wrote:
Ongoing discussions about various mass digitization projects,
driven primarily by the Google Libraries program but including
the respective activities of Microsoft, the Open Content
Alliance, and others, prompt these comments about what should
be taken into account as these programs proceed. My concern
is a practical one: Some projects are incomplete in their
design, which will likely result in their having to be redone
in the near future, an expense that the world of scholarly
communications can ill afford. There are at least four
essential characteristics of any such project, and there may
very well be more.
As many have noted, the first requirement of such a project is
that it adopt an archival approach. Some scanning is now
being done with little regard for preserving the entire
informational context of the original. Scanning first
editions of Dickens gives us nothing if the scans do not
precisely copy first editions of Dickens; the corollary to
this is that clearly articulated policies about archiving must
be part of any mass digitization project. Some commercial
projects have little regard for this, as archival quality
simply is not part of the business plan; only members of the
library community are in a position to assert the importance
of this. An archival certification board is evolving as a
scholarly desideratum.
Archives of digital facsimiles are important, but we also need
readers' editions, the second requirement of mass digitization
projects. This goes beyond scanning and involves the
editorial process that is usually associated with the
publishing industry. The point is not simply to preserve the
cultural legacy but to make it more available to scholars,
students, and interested laypeople. The high school student
who first encounters Dickens's "Great Expectations" should not
also be asked to fight with Victorian typography, not to
mention orthography. In the absence of readers' editions,
broad public support for mass digitization projects will be
difficult to come by.
As devotees of "Web 2.0" insist with increasing frequency, all
documents are in some sense community documents. Thus scanned
and edited material must be placed into a technical
environment that enables ongoing annotation and commentary.
The supplemental commentary may in time be of greater
importance than the initial or "founding" document itself, and
some comments may themselves become seminal. I become uneasy,
however, when the third requirement of community engagement is
not paired with the first of archival fidelity. What do we
gain when "The Declaration of Independence" is mounted on a
Web site as a wiki? Sitting beneath the fascinating
activities of an intellectually engaged community must be the
curated archival foundation.
The fourth requirement is that mass digitization projects
should yield file structures and tools that allow machine
processes to work with the content. Whether this is called
"pattern recognition" or "data mining" or something else is
not important. What is important is to recognize that the
world of research increasingly will be populated by robots, a
term that no longer can or should carry a negative
connotation. Some people call this "Web 3.0", but I prefer to
think of it as "the post-human Internet," which may not even
be a World Wide Web application.
To my knowledge, none of the current mass digitization
projects fully incorporate all four of these requirements.
Note that I am not including any mention of copyright here,
which is the topic that gets the most attention when mass
digitization is contemplated. All four of these requirements
hold for public domain documents. Copyright is a red herring.
Joe Esposito