SCOPA Grant proposal: 2005                                                                                             

Scanning Arabic manuscripts and modern texts: today and beyond

Final report: March 15, 2006                                                           Submitted by: Elizabeth A. S. Beaudin

 

Original funding amount:      $1,510.00

Balance as of this report:     $   300.00 (as of 1/31/2006)

The SCOPA award was issued at the beginning of calendar year 2005.  A mid-year report was submitted to the SCOPA committee on June 15, 2005.

 

Acknowledgments:

Expert assistance and guidance have contributed to this project, especially from colleagues within Yale University Library, including Toby Appel, Historical Medical Librarian; Daniel Dollar, Associate Director, Collection Development and Management at the Med School Library; Kelly Perry, Library Service Assistant at the Med School Library; Ellen Cordes, formerly of the Beinecke Library and now Head of Technical Services, Lewis Walpole Library; and Simon Samoeil, Curator of the Near East Collection.  In addition, my colleagues in Electronic Collections, Kimberly Parker and Jennifer Weintraub, added their expertise and provided good counsel at different stages of the project.  Ann Okerson, AUL, provided special funding so that the SCOPA funds could support more student-worker hours.  I wish to thank them all for their generosity.

 

Goals of the Project:

·        to discover what steps are involved in scanning a manuscript and its modern critical equivalent when the text to be scanned is written in Arabic

·        to identify the resources currently available on campus and in the library system to carry out the scanning project

·        to publish the results for librarians and the Yale community, detailing the process of scanning a text and converting it to searchable text through Optical Character Recognition (OCR) for a non-Roman alphabet such as Arabic

Project Results:

1.      General

The web site for the project can be found at http://www.library.yale.edu/oacis/scopa/.  The site documents the project and displays a selection of the scanned images for the selected texts, which include two modern Arabic editions and an English translation, along with images from the manuscript held at the Historical Medical Library.

 

2.      Digitization Workflow

 

Workflow in digitization projects can be divided into three phases, excluding the check-out of the selected volume from the library’s ILS and its return to the library catalog.  These phases are: a) scanning, b) processing, and c) OCR conversion.

 

a)     Scanning

                                                              i.      My work-study student, Nahaliel Kanfer, completed the scanning of the sections in the selected modern editions.  We both learned more about the Minolta scanner during this phase, especially thanks to assistance provided by Jen Weintraub.

                                                            ii.      The scanning process for the Ibn Sina manuscript is radically different from that required for the modern text equivalent of this work.  Skilled imaging specialists at the Med School were contracted to scan 20 folios from the work.  A few of these images are posted on the project web site to show the richness of the manuscript.  None of the manuscript images continued through the workflow, because the manuscript text cannot be converted via OCR.  Thus, a digitization project involving manuscripts raises different efforts and concerns from the start, such as very skilled imaging and proper display mechanisms over the Internet.

 

b)     Processing

                                                              i.      Processing involves preparing the scanned image to improve OCR results.  Black-and-white contrast can be enhanced, speckles on the page removed, and the angle of a skewed image corrected (de-skewed), all to improve the image for the OCR step.

c)      OCR

                                                              i.      The software used in this step interprets the Arabic text and converts it into machine-readable format for later searching and retrieval by a library patron.  This is clearly the most challenging step in the digitization workflow.

                                                            ii.      The purchase of the Arabic OCR software from Sakhr, Inc. has helped a great deal in learning more about the challenges in converting Arabic text into its searchable electronic form.  The software package itself presented some challenges in the use of its English and Arabic interfaces and in understanding how to optimize OCR results. 
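
The processing phase (b) above can be illustrated with a minimal sketch.  It is not the software used in this project: it assumes a scanned page is represented as a grid of grayscale values (0 = black, 255 = white), and the function names and the threshold value are hypothetical.  It shows only contrast enhancement (by thresholding) and speckle removal (by a 3x3 median filter); de-skewing is omitted.

```python
from statistics import median_low


def binarize(page, threshold=128):
    """Enhance black-and-white contrast: pixels darker than the
    (assumed) threshold become ink (0), all others paper (255)."""
    return [[0 if px < threshold else 255 for px in row] for row in page]


def despeckle(page):
    """Remove isolated specks with a 3x3 median filter.

    Each pixel is replaced by the median of its neighborhood, so a
    single dark pixel surrounded by white paper disappears.
    """
    height, width = len(page), len(page[0])
    cleaned = [row[:] for row in page]
    for y in range(height):
        for x in range(width):
            window = [
                page[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            # median_low always returns a value present in the window
            cleaned[y][x] = median_low(window)
    return cleaned


# Illustrative 5x5 "page": light paper with one dark speck in the middle.
scan = [[230] * 5 for _ in range(5)]
scan[2][2] = 20
prepared = despeckle(binarize(scan))
```

In this sketch the speck is removed and every pixel of the prepared page is white, which is the kind of cleaner input an OCR step benefits from.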

 

3.      Digitizing sites interviewed on campus:

a)     Digital Conversion Facility in SML

b)     Beinecke

c)      Document Delivery unit in Medical Library

d)     Med Media Group (ITS-Med, Yale University School of Medicine)

 

4.      Lessons Learned:

a)     In general, the SCOPA project provided hands-on experience to complement the theoretical understanding I had from working with the Digital Lab staff at the Bibliotheca Alexandrina in Alexandria, Egypt.  It is one thing to observe a Digital Lab in operation; it is another to put this experience into practice.  Here are just a few of my observations.

b)     When dealing with printed books or journal volumes, the full workflow is used.  At the scanning stage, the size and binding of the book can have an impact on the quality of the scanned images.  Adjustments to the scanner settings are often needed as the scanning specialist takes scans of pages in the center of a larger book to compensate for gutters and margins.

c)      The use of Sakhr’s OCR software gave us an unexpected benefit.  When our Project AMEEL team heard of a new OCR software package under development, we contacted the software company, NovoDynamics in Ann Arbor, and offered to evaluate the package.  Thanks to using Sakhr on the SCOPA project, we had gained expertise within Yale University Library and were able to conduct a good product comparison as well as offer informed observations to the NovoDynamics developers. 

d)     Along with developing a solid workflow, any digitization project needs a well-planned structure for archiving the scanned images and OCR output to permit smooth and dependable retrieval for the library patron’s use.  I followed the example of the OACIS project digitization experiment, in which we used the file directory structure of the OACIS server as the repository for all digitized images and OCR output.  During the SCOPA project, I also reviewed the Greenstone suite of software as a potential alternative to the file directory method.  Later, as a member of the Vital / Fedora evaluation team, I used images from this SCOPA project to test the features offered by VTLS in their Vital interface product.  Thanks to the SCOPA project, I was better prepared to review the available features in both Greenstone and Vital.  Because the Vital evaluation required some experimenting, I mixed and matched formats for display purposes in the SCOPA project to determine the best format for quick and clear display over the Internet.

e)     Yale University Library has significant experience in creating digitization projects that include images.  Its experience with text-based projects, however, continues to be limited.  This SCOPA project, though a small one, offered me an opportunity to explore best practices for dealing with text, for choosing display formats, for studying archive structures, and for optimizing search results when OCR is an essential part of the project workflow.  As a result, this project has helped me to participate more fully in library-wide evaluations and planning.  In addition, the SCOPA project has provided me with important experience that will help us meet the goals of our larger digitization projects, such as Project AMEEL and Iraq ReCollection, more effectively.
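
As an illustration of the file directory method mentioned in item d), the sketch below shows one way a directory structure could pair each scanned image with its OCR output.  This report does not describe the actual layout on the OACIS server, so the folder names, the work identifier, the zero-padded page numbering, and the function name are all assumptions made for the example.

```python
from pathlib import PurePosixPath


def archive_paths(root, work_id, page, image_ext="tif"):
    """Return the (hypothetical) locations of a page image and its OCR text.

    Layout assumed for illustration only:
        <root>/<work_id>/images/<page>.<ext>
        <root>/<work_id>/ocr/<page>.txt
    """
    page_name = f"{page:04d}"  # zero-padded so files sort in page order
    base = PurePosixPath(root) / work_id
    return {
        "image": base / "images" / f"{page_name}.{image_ext}",
        "ocr": base / "ocr" / f"{page_name}.txt",
    }


# Hypothetical work identifier and root directory.
paths = archive_paths("/scopa", "modern-edition-1", 12)
```

Because the image and the OCR text share the same page name, a retrieval script or a display page can derive one location from the other without a separate lookup table.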