Developing OCR (Optical Character Recognition) technologies for computerized reading of historical Arabic manuscripts
Computerized reading of classical texts in Arabic script is a complex but highly rewarding task undertaken by the Al-Zahrawi group of researchers in the TRDC, led by Dr. Raid Saabni. Quite apart from problems such as aging, torn pages, yellowing, dirt and the individual handwriting of long-dead scribes, Arabic is an inherently challenging script for Optical Character Recognition (OCR). Its basic structure, a cursive script in which characters change form in suffixes and prefixes, is a challenge that has so far met with far less success than OCR in non-cursive scripts. In classical Arabic manuscripts the challenge is compounded by calligraphy in distinctly different fonts. So far the tools developed by the Al-Zahrawi group have far surpassed the 50% success rate of currently available Arabic OCR technologies.
There are an estimated 40 million manuscripts in Arabic script in the world, written in languages such as Arabic, Persian, Turkish and Urdu. Only a small proportion of these have been photographed with digital cameras, and an even smaller proportion have been rendered into digital characters that enable searching and comparison. In effect, a very large part of this huge cultural heritage is not accessible to the public. Successful development of OCR or of keyword searches will open this trove to exploration and will doubtless lead to immensely important cultural insights.
The depth of knowledge acquired over several years of intensive research on these immensely complex classical manuscripts has created the possibility of developing real-world applications with potentially significant economic impact. One of these, now under development, is a check-reading OCR system for banks. Contrary to what many people might assume, the volume of checks processed by the world’s banks is actually increasing every year, giving the Al-Zahrawi group the opportunity to develop means of automating the reading of checks as part of the verification process. Other applications include smartphone apps that can read and decipher handwritten text.