Historical Document Image Analyzing

Triangle Research and Development Center

November 10, 2019

Dr. Raid Saabni

Between the seventh and the fifteenth centuries, a huge number of manuscripts were written in the Arabic alphabet (in Arabic, Farsi or Ottoman Turkish) in various fields. Around forty million manuscripts and books are currently available in different places around the world. The process of modernizing and revising these documents may include determining the writer, the region, and the age of the document. These are tedious and time-consuming tasks that are still performed largely by hand. So far, Arabic historical documents have received less attention than those in other languages, for various reasons that include the complexity of the script and the academic research atmosphere in the Middle East over the past century. Among more than seven million unique titles currently available in different places around the world, only 5% have been subject to scientific revision and modernization.

Following the Islamic conquests in the 7th century, Arabic gradually replaced Aramaic as the spoken and written language of the Jews in the occupied countries. This gave rise to Judeo-Arabic dialects of the Arabic language, which became the language that Jews spoke. Especially in the Middle Ages, Jews in Arabic-speaking countries wrote their documents using Hebrew letters, with special symbols for Arabic letters that have no equivalent in Hebrew. In the 9th century, Arabic was the everyday language among Jews in Islamic countries and had also penetrated Jewish law, the interpretation of the Bible and even Hebrew grammars. The Judeo-Arabic dialects developed simultaneously in two regions: the East (the Land of Israel, Babylon and Egypt) and the West (Spain and North Africa). Arabic literature transliterated into Hebrew script was written in two main areas: translations of works on science and philosophy, which made it possible to develop Jewish thought, and original Jewish works in Judeo-Arabic. Most Jewish literature written in Arabic was almost forgotten, yet it left its imprint on Jewish creative work. Once uncovered, the Cairo Geniza archives revealed the great extent of Jewish literature in Arabic and its impact on Arabic-speaking Jews throughout the ages. In other cases, original books written in Arabic have disappeared, and only translations or Judeo-Arabic versions of them still exist.

The special geographic position of Italy made it a bridge between West and East in trade and cultural interaction. This special role contributed to the building of one of the first collections of Islamic manuscripts (the Medici Oriental Press), gathered during the 15th to 17th centuries.

Document image analysis (DIA) refers to the process of converting a raster image of a document page (a matrix of pixels) to a symbolic form consisting of textual (characters, digits, punctuation, words) and graphical (lines, geometric shapes, etc.) objects; for a complete survey see [Nagy, 2000]. Document descriptions in terms of these high-level objects are significantly more compact than their image counterparts. More importantly, the rich semantic content of such descriptions makes it possible to manipulate these documents to serve a variety of uses such as searching them for specific patterns or classifying and combining them according to some criteria. Most DIA systems consist of the following main stages:

Image enhancement, noise reduction and binarization, to separate the foreground (ink) representing the written text and illustrations from the background (paper) and noise generated by several aging factors.

Page layout analysis and segmentation, to extract major text blocks, separate them from graphics (figures, logos, etc.) and segment them into columns, paragraphs, lines, words, parts of words and even strokes.

Preprocessing, to reduce noise, correct for skew (rotation of the document on the scanner surface) and convert the pixels of objects to a suitable representation, for instance their contours.

Feature extraction, to represent segmented objects by means of distinctive characteristics.

Classification, to assign each object to one “class” (e.g. the character label) on the basis of these features.

Post-processing, using lexicons, statistics and natural language processing to improve the recognition results.
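
To make the first two stages more concrete, the following is a minimal sketch in plain Python, not the method of any particular DIA system. It assumes Otsu's global threshold for binarization and a horizontal projection profile for text-line segmentation; both are common textbook choices that the text above does not prescribe. The function names (otsu_threshold, binarize, segment_lines) and the toy page are hypothetical, and real historical documents would usually need adaptive thresholding and skew correction first.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale page: rows of integer values in 0..255


def otsu_threshold(image: Image) -> int:
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0  # pixels at or below the candidate level
    for level in range(256):
        weight_low += hist[level]
        if weight_low == 0:
            continue  # no pixels this dark yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the low class
        sum_low += level * hist[level]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_level, best_variance = level, variance
    return best_level


def binarize(image: Image, threshold: int) -> Image:
    """Mark dark pixels (ink) as 1 and light pixels (paper) as 0."""
    return [[1 if value <= threshold else 0 for value in row] for row in image]


def segment_lines(binary: Image) -> List[Tuple[int, int]]:
    """Return inclusive (first_row, last_row) pairs for runs of rows with ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(binary):
        has_ink = any(row)  # projection profile reduced to "any ink in row"
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines


if __name__ == "__main__":
    # Toy 6x4 "page": light paper (250) with two dark text lines.
    page = [
        [250, 250, 250, 250],
        [250, 30, 40, 250],
        [250, 35, 250, 250],
        [250, 250, 250, 250],
        [250, 20, 25, 30],
        [250, 250, 250, 250],
    ]
    ink = binarize(page, otsu_threshold(page))
    print(segment_lines(ink))
```

On the toy page the sketch separates the two dark bands as two text lines. A global threshold of this kind tends to fail on stained or unevenly lit pages, which is one reason the enhancement and preprocessing stages receive so much attention for historical material.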
