Final Report
Digitization of Hearings of the U.S. Congress House of Representatives Committee on Un-American Activities
Jennifer Weintraub and Sandy Peterson
Goals of the Project:
Summary of Activities:
The first step of the project was to select 25 hearings that were both of high interest and representative of the hearings as a whole.
The hearings were scanned and corrected. We then processed the images with optical character recognition (OCR) software. Finally, we marked up the resulting text in XML. We used TEI Lite as our DTD and followed the standards in the TEI Text Encoding in Libraries report (http://www.indiana.edu/~letrs/tei/), marking up at level 3 (Simple Content Analysis) with some additions. This enabled us to mark up the material relatively quickly. The markup emphasizes the page numbers, names, places, and people in the hearings, the elements most important to our users. We did all of this with free software, which left more money for student labor.
We also created PDFs as an interim delivery solution; we will mount them on the web before the public forum.
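As an illustration of why markup of page numbers, names, and places is useful, the following is a minimal Python sketch that builds a page-referenced name index from a marked-up fragment. The fragment is hypothetical: it uses elements that exist in TEI Lite (pb, sp, speaker, name), but the attribute values and the text are placeholders, not content from the project's files, and the function name index_names is our own.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment. The element names are TEI Lite elements, but the
# text, page numbers, and type values are placeholders for illustration only.
SAMPLE = """
<div type="hearing">
  <pb n="101"/>
  <sp>
    <speaker>The Chairman</speaker>
    <p>The committee will come to order in <name type="place">Washington</name>.</p>
  </sp>
  <pb n="102"/>
  <sp>
    <speaker>Mr. Witness</speaker>
    <p>I met <name type="person">Jane Placeholder</name> in <name type="place">New York</name>.</p>
  </sp>
</div>
"""


def index_names(xml_text):
    """Return (page, type, text) for each <name>, using the most recent <pb>."""
    root = ET.fromstring(xml_text.strip())
    page = None
    entries = []
    # iter() walks the tree in document order, so each <pb> is seen
    # before the names that follow it on that page.
    for elem in root.iter():
        if elem.tag == "pb":
            page = elem.get("n")
        elif elem.tag == "name":
            text = "".join(elem.itertext()).strip()
            entries.append((page, elem.get("type"), text))
    return entries


for page, kind, text in index_names(SAMPLE):
    print(f"p. {page}: {kind}: {text}")
```

A delivery system would do this kind of lookup at a much larger scale, but the sketch shows how page breaks and name tags together let a reader go from a person or place to the page of the printed hearing.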
Budget:
We were given $1,180 and spent it all on student labor, which totaled 112 hours (first at $10.35/hour and later at $10.65/hour).
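The budget figures can be cross-checked with a short Python sketch. The report does not say how many hours were paid at each rate, so the split computed below is an assumption (whole hours only), not a recorded figure; it simply shows that the stated total falls between the two possible extremes.

```python
from decimal import Decimal

BUDGET = Decimal("1180")
TOTAL_HOURS = 112
RATE_EARLY = Decimal("10.35")  # initial hourly rate
RATE_LATE = Decimal("10.65")   # later hourly rate


def cost(hours_early):
    """Total labor cost if `hours_early` hours were paid at the initial rate."""
    hours_late = TOTAL_HOURS - hours_early
    return hours_early * RATE_EARLY + hours_late * RATE_LATE


lowest = cost(TOTAL_HOURS)  # every hour at the initial rate
highest = cost(0)           # every hour at the later rate

# Assumption: whole hours at each rate. Find the split closest to the budget.
closest = min(range(TOTAL_HOURS + 1), key=lambda h: abs(cost(h) - BUDGET))

print(lowest, highest)          # bounds on the possible total
print(closest, cost(closest))   # best whole-hour split and its cost
```

Under the whole-hour assumption, the closest split is 43 hours at the initial rate and 69 hours at the later rate, which comes to $1,179.90, consistent with the full grant being spent on labor.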
Results:
Overall, this project was a qualified success. It was probably too ambitious for its timeline, given the resources involved. The grant provided much-needed seed money for accomplishing the most pedestrian aspects of the project, but we were not able to finish the work. We scanned all of the selected hearings, applied optical character recognition to 18 of them, and marked up 5 hearings entirely. The grant succeeded in getting the work started, enabling us to make decisions on how to work with the material and to put processes and procedures in place.
Once the material is scanned, OCR and markup are time-consuming, though not difficult. Our student spent about 35% of his time scanning and 60% on OCR, with markup coming last. For this reason, we would try to outsource the scanning and OCR in the future and have our students work on markup and basic quality control. We had thought it would be cost-effective to do this work in-house, but it turned out that to OCR the material efficiently we had to scan more carefully than we would have if we were simply making PDFs.
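To put the time split in terms of hours, here is a rough Python sketch that applies the approximate percentages to the 112 hours of student labor reported in the budget. Treating the remaining 5% as markup is our assumption, and because the percentages are stated only approximately, the results are estimates.

```python
TOTAL_HOURS = 112  # student hours reported in the budget

# Approximate shares of student time, in percent.
SHARES = {"scanning": 35, "ocr": 60}
# Assumption: whatever remains went to markup.
SHARES["markup"] = 100 - sum(SHARES.values())

hours = {task: TOTAL_HOURS * pct / 100 for task, pct in SHARES.items()}

for task, h in hours.items():
    print(f"{task}: about {h:.1f} hours")
```

On these figures, roughly 39 hours went to scanning and 67 to OCR, leaving only about 6 hours for markup, which is the work we would most like students to spend their time on.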
We were not able to finish all the material but aim to do so over the next few months (with some student labor). We are currently OCRing and marking up the remaining hearings (a process which has grown more efficient over the course of the project). We will be mounting a web page describing the project and producing preliminary PDFs.
We hope that further development of electronic text delivery in the library will enable us to load the marked-up files for better delivery of this material, potentially using DLXS, which is already in use at the library. We could then use this pilot project to seek further funding to deliver all of the hearings on the web.
In the future we would pursue more outsourcing of scanning and OCR, since capable students are better used on the more interesting parts of the job. We would also consider creating a customized DTD for this project if it expands. We were not able to find another DTD to borrow from when the project began and felt it would be best to start with an established DTD.