SCOPA Professional Development Grant for 2000

Digitally Available Electronic Resource Licenses
Kimberly Parker, Collections, Yale University Library

Background

The Digitally Available Electronic Resource Licenses project was designed to facilitate the use and consultation of our licenses for electronic resources. The project was intended to make as many licenses available online as possible (given constraints of confidentiality and some readability issues). The online availability of the licenses is intended to permit consultation of our licenses from multiple locations and searching across licenses to answer questions about permitted uses or to find appropriate alternate language to use during license negotiations for new products.

Because our licenses needed to be available online in searchable form, this project would likely have been pursued even without a SCOPA grant. However, the grant permitted two major accomplishments. The first was the conversion of a large set of existing licenses in a relatively short period of time. This lets us move forward with a much smaller backfile of unconverted licenses to slow the ongoing work of digitizing new licenses as they enter the library. The second was the time and funding to experiment with the most appropriate way of making licenses available in searchable form.

Some background work had already been accomplished before the grant year began. A student assistant funded from departmental resources had devoted some time in Fall 1999 to sorting licenses and photocopying those that were unambiguously ready for digitization, in preparation for sending the documents to Reprographics and Imaging Services (RIS) for conversion to TIFF images. Licenses set aside at this point included those without countersignatures on file, those on legal-size paper (which would have complicated a standardized scanning process at RIS), and others with similar departures from the norm.

The year of the grant opened with a set of about 500 pages of license documents prepared for RIS digitization and no funding yet released from SCOPA to pay for the scanning. Because my department intended license work to be an ongoing activity (although at a much slower pace after the SCOPA year), it was possible for me to fund the digitization of these documents by RIS from departmental funds. Thus, although my original SCOPA budget covered a flat fee for RIS scanning in addition to student wages, in the end all the SCOPA funds were spent on student wages.

Explorations

Once I had a significant body of digitized documents on hand, I was able to begin experimenting with the best way to provide licenses as coherent multi-page documents with the original text both readable and searchable. A small amount of experimentation showed that OCR made the documents searchable, but that without a great deal of additional cleanup the OCR output was not very useful for simple reading or consultation when placed on a website as either a text or an HTML document.

I then turned to an examination of the possibilities of PDF documents. Adobe Acrobat has a document creation feature that permits the importation of a TIFF image into a PDF document. Once the image is in the document, Acrobat's own OCR functionality can be applied, and the OCR'd text can be set to be "hidden" behind the image of the page while remaining tied to portions of that image for searching and copying purposes. This capability permits us to leave the OCR "dirty" while still providing a reasonable level of searchability within a document. The process also results in cleanly readable documents that look like the pages from which they were scanned.

Having discovered a potential document format that would require a minimal amount of work at the processing end, I still needed to pursue the question of searchability across documents. In addition, some licenses for YUL resources are click-through HTML documents, and if the terms are acceptable, we may simply save a copy of the license we "accepted" in this manner. It is highly preferable to be able to cross-search these saved HTML files together with the documents we are digitizing from our paper licenses. Accordingly, I moved some samples of "original image with hidden text" PDF licenses to the web directory prepared for them, and began working with our web server support person to discover whether our library's web search engine would read and index PDF files of this type. We were able early on to determine that other PDF files on the library's web server were being indexed by the search engine. After that, we went through several iterations of experimentation, first ensuring the directory and files were being fed to the search engine properly. Then we discovered that the search engine needed to be tuned to accept larger files, because these PDF files combine page images with their accompanying text. The end result was successful, with searches on the license directory pulling results both from the sample PDF files and from some saved licenses in HTML format.

Once the decision about format was made, the next stage of the work was to devise procedures to enable a student assistant to quickly and efficiently process the TIFF images converted by RIS, as well as begin the process of dealing with ongoing license arrivals.

Procedures for processing the RIS scanned images were put into place and the student assistants (four different assistants worked with me at different times throughout this project) began processing the RIS backlog. As this work continued, I assessed the types of licenses that were arriving throughout the year. Those that were available in electronic format were easily divided into two types. Those on the web without modifications could simply be saved to our local web server license directory and enabled for search and display without any further work. Those that were provided as word processing documents could be "printed" to PDF format right from within Microsoft Word and uploaded to the web server license directory, again with very little additional work needed.

Licenses that arrived throughout the year in paper format (and this is still the majority of our licenses) needed to be converted to TIFF format and then could be processed like any of the RIS converted documents. I considered simply sending these new licenses to RIS to scan as they arrived, but RIS charges are based on batch efficiency, making it more expensive to send documents one at a time for scanning. Also, it is fairly easy to scan documents on our in-office scanner when no more than one or two arrive each week. Thus, we added a procedure for scanning licenses into TIFF format at the beginning of the student assistant's procedures for PDF conversion.
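
The intake procedure described above amounts to a small routing table keyed on how a license arrives. The following Python sketch is illustrative only: the function name, the format labels ("web", "word", "paper"), and the wording of each step are assumptions made for this example and do not correspond to any actual library system.

```python
# Illustrative sketch only: format labels and step names are hypothetical.

# Steps shared by every paper license once it exists as TIFF images.
TIFF_TO_WEB_STEPS = [
    "import TIFF images into a PDF document",
    "run OCR and keep the text hidden behind the page images",
    "upload PDF to the web server license directory",
]


def steps_for_license(arrival_format: str) -> list[str]:
    """Return the processing steps for a license, by the format it arrived in."""
    if arrival_format == "web":
        # Unmodified web licenses are saved as-is; no conversion needed.
        return ["save HTML to the web server license directory"]
    if arrival_format == "word":
        return [
            "print to PDF from the word processor",
            "upload PDF to the web server license directory",
        ]
    if arrival_format == "paper":
        # Low weekly volume makes in-office scanning cheaper than batch scanning.
        return ["scan to TIFF on the in-office scanner"] + TIFF_TO_WEB_STEPS
    raise ValueError(f"unknown arrival format: {arrival_format!r}")


for fmt in ("web", "word", "paper"):
    print(fmt, "->", len(steps_for_license(fmt)), "step(s)")
```

Paper licenses carry the longest path, which is consistent with the report's observation that they remain the majority of arrivals and account for most of the ongoing student labor.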

Sidebar and Continuing Cleanup

Ancillary to this project were efforts I was making to streamline the presentation of summary license information on our library's website, and to link that summary information to the scanned licenses as they became available online. It was not entirely easy to coordinate both of these activities at the same time. I attempted to use a database to let me know what processes had been accomplished on which licenses, but the work of keeping the database up to date proved more time-consuming than time-saving. In the end, I simply pushed ahead with both projects in order to accomplish as much as possible, trusting that when the backlog was cleared up, I would then have time to determine which products did not have licenses yet scanned, and which licenses did not yet have summaries created.

I am still finalizing procedures and implementation of how to handle licenses which stipulate confidentiality in one form or another. At present licenses of this type have been copied into a password-protected directory. My intent is to arrange with our web server administrator the establishment of a search option on this protected directory and to supply a search link (similar to what appears on the CDC website) that will offer the option to search the restricted directory if an individual knows the password.

Summary

The Digitally Available Electronic Resource Licenses SCOPA project was almost entirely successful in that a significant portion of our filed licenses has been made available online (see www.library.yale.edu/ecollections/licenses/licenses.html). The small number that are not yet online are those that presented a problem in one fashion or another, and the project was quite useful in turning up issues that need to be investigated in our licenses. Secondarily, we now have in place a method for incoming licenses to be digitized and provided in searchable form on our website, and the past two months have proven that a minimal amount of student labor is required to keep up with this task.

One final task remains to be accomplished: that of informing more people in the library that the scanned licenses are available for review and searching. In addition, I need to continue to convert the license summaries at www.library.yale.edu/journals/licensing.html to link to the scanned versions of the licenses.

Finally, the project has proven that converting paper-based files, which many individuals need to consult, to digitized, searchable, web-available documents is an achievable feat requiring minimal extra staff time to accomplish.

Budget Expenditure Information

Since department funds paid for the initial RIS scanning of our licenses, the only expense associated with this project was that of hiring the student assistants who completed the conversion process and later digitized incoming licenses. Over the course of the project, the hourly rate for a student to do this type of work ranged from $7.85/hr to $10.00/hr. I originally estimated that the work involved would most likely be no more than 3 hours a week, and anticipated not more than 32 weeks of work. Without the RIS conversion burden on the budget, I was able to adjust the number of hours per week that my student assistants worked in order to have them also work on licenses that came in throughout the year. Thus, although my students worked only 28 weeks of the year, the hours they worked ranged from 1 to 11 hours a week, with an average of almost 4 hours/week.

I requested and received $968 for my budget. Due to my own error, I later believed that I had been allocated $1000 for the project, and expended $999.21 on student wages. Upon discovering the error, I offered to refund the $31.21 overage to the SCOPA budget, but the Library Business Office indicated that this small amount of error was insignificant in the larger budget, and informed me that a refund would not be necessary.
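
The figures quoted in this section can be checked with a short calculation. The Python sketch below uses only numbers stated in the report; the variable names are chosen for this example, and because the report gives the average as "almost 4 hours/week", the hours figure computed here is an upper bound rather than an actual total.

```python
from decimal import Decimal

# Figures quoted in the report.
granted = Decimal("968.00")
spent = Decimal("999.21")

# Original plan: at most 3 hours/week for at most 32 weeks.
estimated_hours = 3 * 32

# Actual: 28 weeks at "almost 4 hours/week", so somewhat fewer hours than this.
actual_weeks = 28
upper_bound_hours = actual_weeks * 4

# Amount spent beyond the grant, and its size relative to the grant.
overage = spent - granted
overage_pct = (overage / granted * 100).quantize(Decimal("0.1"))

print(f"Estimated hours: {estimated_hours}")
print(f"Upper bound on actual hours: {upper_bound_hours}")
print(f"Overage: ${overage} ({overage_pct}% of the grant)")
```

Decimal is used instead of floating point so that the currency subtraction is exact.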




Yale University Library
Copyright © 2001
Web site comments to Jae Williams
Last updated 03/26/01