AUTHORS: L. Sala, R. Martoglia, M. Vanzini, R. A. Vigliermo
URL: https://ceur-ws.org/Vol-3234/paper1.pdf
Work Package: WP5
Keywords: Cultural heritage, Digital Library, Islamic sciences, Arabic script OCR, Information extraction, Output alignment, Page layout analysis, Semiautomatic cataloguing, Software tool usage demo.
Abstract
Digital Maktaba (DM) is an interdisciplinary project to create a digital library of texts in non-Latin
alphabets (Arabic, Persian, Azerbaijani). The dataset comes from the digital collection of the
"La Pira" library on the history and doctrines of Islam, based in Palermo, which is the hub of the
Foundation for Religious Sciences (FSCIRE, Bologna). Establishing protocols for the creation, maintenance
and cataloguing of historical content in non-Latin alphabets is the long-term goal of DM. The first step of
this project was to create an innovative workflow for automatic extraction of information and metadata
from title pages of Arabic script texts. The Optical Character Recognition (OCR) tool combines several
recognition systems, text processing techniques and corpora to extract document content and its
metadata accurately. In this paper we address the ongoing development of this novel tool
and, for the first time, we present a demo of the current version that we have designed for the extraction
and cataloguing process by showing a use case on an Arabic book frontispiece. In particular, we delve
into the details of the tool workflow for automatically converting and uploading PDFs from the digital
library, for the automatic extraction of cataloguing metadata and the semiautomatic (at the current stage)
process of cataloguing. We also briefly discuss future prospects and the many additional features that
we are planning to develop.