Scanning Arabic manuscripts and modern texts: today and beyond

 

 

Purpose and expected outcome of the project:

Scanning images to create digital image libraries is now a well-established and documented process on campus.  The ELI project has made great contributions to bringing these digital image collections into the classroom.  Scanning text for research and pedagogical purposes is beginning to make inroads in technical projects on campus when the texts involved are printed in Western languages.  Non-Western texts present special challenges when planning a digital text collection.  In the case of texts printed in Arabic, the challenges multiply because the fonts used in printing vary, depending mainly on the date and country of publication.

 

The purpose of this project is to discover what steps are involved in scanning a manuscript and its modern critical edition when the text to be scanned is written in Arabic.  To accomplish this, the project will identify the resources currently available on campus and in the library system to carry out the scanning.  The project will publish the results for librarians and the Yale community, detailing the process of scanning and Optical Character Recognition (OCR) conversion to text for a non-Roman alphabet such as Arabic.

 

Methodology:

  1. Identify a specific text. Select a manuscript in Arabic that has not been scanned for a digital image collection, together with its modern critical edition. For the purposes of this project, the manuscript that will be used is held at the Medical Historical Library: the القانون في الطب (Canon of Medicine) by Ibn Sīnā (Cushing Arabic ms. 5). The modern critical edition is al-Qānūn fī al-tibb / li-Abī `Alī al-Husayn ibn `Abd Allāh Ibn Sīnā. Frankfurt, 1996.
  2. For the purposes of analysis, select a specific and small excerpt from the manuscript and the corresponding passage in the modern text for scanning.
  3. Identify and document the scanning facilities on campus that can accommodate a special text, i.e. a manuscript requiring special handling.
  4. Identify and document the collaborations needed between departments on campus to complete the scanning tasks. Identify and document any charges assessed by departments for such tasks.
  5. Design a file storage structure for the results of the scanning task.
  6. Scan both texts and save the images in TIFF format.
  7. Pass the TIFF files through a software package that supports OCR conversion to text for a non-Western language.
  8. Store and document for retrieval all images and files from the scanning and OCR processes.
  9. Prepare the results for viewing via the Internet.
  10. Document findings for publication via the Internet.
  11. Publicize the results (and the location of the web site) via a SCOPA presentation and/or a paper at the 2005 MELA/MESA conference.
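Steps 5 and 8 above (designing a file storage structure and documenting the stored files for retrieval) could be sketched as follows. This is only an illustrative sketch, not a structure the project has adopted: the directory names, the file naming pattern, and the `manifest.json` file are all assumptions made for the example.

```python
from pathlib import Path
import hashlib
import json

# Hypothetical layout: <root>/<source>/<stage>/<source>_p<page>.<ext>
SOURCES = ("manuscript", "modern_edition")   # the two texts to be scanned
STAGES = {"tiff": "tif", "ocr_text": "txt"}  # scanned images and OCR output


def page_path(root: Path, source: str, stage: str, page: int) -> Path:
    """Return the path where one page's file for a given stage is stored."""
    if source not in SOURCES:
        raise ValueError(f"unknown source: {source}")
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    return root / source / stage / f"{source}_p{page:04d}.{STAGES[stage]}"


def create_layout(root: Path) -> None:
    """Create one directory per source text and processing stage."""
    for source in SOURCES:
        for stage in STAGES:
            (root / source / stage).mkdir(parents=True, exist_ok=True)


def build_manifest(root: Path) -> list[dict]:
    """List every stored file with its size and SHA-256 checksum."""
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "manifest.json":
            entries.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
            })
    return entries


def write_manifest(root: Path) -> list[dict]:
    """Write the manifest next to the data so files can be located later."""
    manifest = build_manifest(root)
    (root / "manifest.json").write_text(
        json.dumps(manifest, indent=2), encoding="utf-8"
    )
    return manifest
```

Keeping the manuscript and the modern edition in parallel directories with matching page numbers would make it straightforward to compare the image and OCR text for corresponding passages; the checksums give a simple way to confirm later that stored files have not changed.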

 

Project Requirements:

(1)   A survey of the scanning resources on campus would need to be drawn up and distributed electronically or conducted in person.

(2)   Some cooperation among the internal library units would be sought to cover costs normally assessed for special scanning projects. 

(3)   A simple yet flexible web site would need to be constructed to display the results of the discovery project.  Cooperation would be sought from the YUL Systems group to determine the best location for the site.  Attention would also be paid to the current MED DL project, which uses Greenstone software, so that the files resulting from this SCOPA project are compatible with that project’s data structure.

(4)   Software to conduct the OCR to text processing would need to be purchased.  An academic discount will be sought.


Expenses

Purchase of OCR software from Sakhr (does not reflect an academic discount): $1400.00

Student time for scanning @ $11.00/hr: $110.00

Total: $1510.00
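The budget arithmetic can be checked with a short calculation. The sketch below uses only the figures listed above; the discount function is a hypothetical helper, since no actual academic discount rate is stated in this proposal.

```python
from decimal import Decimal

SOFTWARE_COST = Decimal("1400.00")  # Sakhr OCR software, before any academic discount
HOURLY_RATE = Decimal("11.00")      # student scanning rate per hour
STUDENT_COST = Decimal("110.00")    # student time line item

# Hours of student time implied by the line item and the hourly rate.
student_hours = STUDENT_COST / HOURLY_RATE

# Total request as listed in the budget.
total = SOFTWARE_COST + STUDENT_COST


def total_with_discount(discount_fraction: Decimal) -> Decimal:
    """Total if a hypothetical discount applied to the software purchase only."""
    discounted = SOFTWARE_COST * (1 - discount_fraction)
    return (discounted + STUDENT_COST).quantize(Decimal("0.01"))


print(f"Implied student hours: {student_hours}")
print(f"Total: ${total}")
```

The student line item corresponds to ten hours of scanning at the stated rate, and any academic discount obtained would reduce only the software portion of the total.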

 

 

Benefits

If successful, this project will produce a guideline for librarians and scholars at Yale to follow when undertaking projects involving the scanning and processing of non-Western text.  The files produced from the manuscript can serve as a seed project for new electronic collection efforts or as additional data for existing collections in the Medical Historical Library.  The scanning of the modern counterpart will provide additional information on successful OCR techniques for future digitization projects managed by the Library.

 

 

Yale University Library

SCOPA Grant Proposal: 2004

submitted: 10/18/2003