Abstract
WP5, Digital Maktaba, aims to create a digital cataloguing system for libraries specialized in religious studies that need to manage cultural heritage data in non-Latin alphabets, starting with the Arabic alphabet. The goal is to establish procedures for extracting, managing, and cataloguing library and archive materials, and to develop best-practice models that can accommodate texts written in non-Latin alphabets.
Staffing
For WP5, the leader is Prof. Fabrizio D’Avenia and the product owners are Prof. Sonia BERGAMASCHI and Prof. Federico RUOZZI. Prof. D’Avenia oversees the overall direction and management of WP5, while Prof. RUOZZI and Prof. MARTOGLIA serve as the main points of contact for defining and prioritizing the features and requirements of the project. They are experts in their respective fields and play key roles in ensuring the success of WP5.
Steps
- Analysis of challenges and definition of a roadmap for addressing them, including recognition of the operational scenario and exploration of OCR tools and technologies, text mining techniques, long term preservation techniques, big data management tools, and machine learning techniques
- Definition of techniques for text acquisition/OCR and object fusion, based on fuzzy matching, to assist/automate text extraction
- Definition of techniques for extracting syntactic metadata, including exploitation of multilingual resources and techniques for automatic recognition of titles and authors
- Definition of requirements and specification for a database to store extracted catalogue data and metadata
- Supervision and validation of recognition and extraction techniques on samples of materials and larger corpora
- Design of database for storing extracted catalogue data and metadata, including data management techniques for interfacing/interchange with catalogue data from other libraries and exploitation of long term preservation practices and big data management/distributed techniques
- Definition of advanced search techniques, including approximate and full-text search, for searching archive data
- Design of web user interface for cataloguing new documents and searching the archive
- Definition of intelligent assistance techniques based on similarity search and supervised (incremental and interpretable) ML algorithms to assist data entry, automate publication type recognition, and allow the prototype to “learn” and become more automated and effective with use
- Validation of algorithms on samples of materials and on existing corpora
- Implementation of DevSecOps principles for faster and automated code deployment, increased speed to delivery of applications and services, alignment of development and IT operations, and integration of security in the design process
- Integration and testing of developed components
- Dissemination and exploitation of results, including creation of a web platform for Digital Maktaba and organization of training activities for librarians
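
To make the object fusion step more concrete, the following is a minimal sketch of fuzzy matching between the line outputs of two OCR tools. It is an illustration only, not the project's implementation: the function names, the dictionary layout, and the 0.8 threshold are assumptions, and the similarity measure is the one provided by Python's standard `difflib` module.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def fuse_ocr_lines(lines_a, lines_b, threshold=0.8):
    """Pair each line from OCR tool A with its closest line from OCR tool B.

    Hypothetical helper: the output layout and the threshold are
    illustrative assumptions, not part of the WP5 specification.
    """
    fused = []
    unused_b = list(lines_b)
    for line_a in lines_a:
        # Pick the still-unmatched line of B that is most similar to line_a.
        best = max(unused_b, key=lambda s: similarity(line_a, s), default=None)
        if best is not None and similarity(line_a, best) >= threshold:
            unused_b.remove(best)
            fused.append({"text": line_a, "alt": best, "agree": line_a == best})
        else:
            # No sufficiently similar line in B: keep A's reading alone.
            fused.append({"text": line_a, "alt": None, "agree": False})
    # Lines that only tool B produced are kept as well, flagged for review.
    for leftover in unused_b:
        fused.append({"text": leftover, "alt": None, "agree": False})
    return fused


# Two tools read the same title page; tool B misreads one letter.
tool_a = ["كتاب التوحيد", "تأليف محمد"]
tool_b = ["تأليف محمد", "كتاب التوحبد"]
for entry in fuse_ocr_lines(tool_a, tool_b):
    print(entry)
```

Entries whose `agree` flag is false are the ones a cataloguer would be asked to check, which is how such a fusion step could assist text extraction rather than fully automate it.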
Outcomes
- Development of text acquisition/OCR techniques for assisting/automating text extraction, including object fusion techniques based on fuzzy matching to align/merge outputs of different OCR tools
- Definition of knowledge extraction techniques for linguistic/semantic metadata, including use of multilingual resources and techniques for title and author recognition
- Definition of requirements and specification for a database to store extracted catalogue data and metadata
- Supervision and validation of recognition and extraction techniques on samples of materials and larger corpora
- Development of IT tools for data management, interactive search, and supervised cataloguing, including a database for storing extracted data and metadata, advanced search techniques, a web user interface for cataloguing and searching, and intelligent assistance techniques based on similarity search and supervised (incremental and interpretable) machine learning algorithms
- Integration and testing of developed techniques to create a reproducible and reusable web tool for simple cataloguing in different languages and fields
- Exploration of continuous integration/continuous deployment principles to improve speed of delivery and security in software development
- Use of agile development principles to align development and IT operations and improve collaboration among project partners
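
As an illustration of the approximate search and similarity search outcomes listed above, the sketch below ranks catalogue records by character trigram overlap with a query, so that a title typed with a small spelling or transliteration variation can still be found. The record layout, the function names, and the `min_score` cut-off are assumptions made for this example; a production system would more likely rely on the search features of the chosen database.

```python
def trigrams(text: str) -> set:
    """Split a string into overlapping three-character fragments."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_search(query, records, top_k=3, min_score=0.2):
    """Return up to top_k (score, record) pairs whose title resembles the query.

    Hypothetical helper: `records` is assumed to be a list of dictionaries
    with a "title" key; this is not the project's actual data model.
    """
    query_grams = trigrams(query)
    scored = [(jaccard(query_grams, trigrams(r["title"])), r) for r in records]
    # Discard weak matches, then rank the rest from best to worst.
    scored = [(score, r) for score, r in scored if score >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


catalogue = [
    {"id": 1, "title": "Kitab al-Tawhid"},
    {"id": 2, "title": "Sahih al-Bukhari"},
    {"id": 3, "title": "Tafsir al-Jalalayn"},
]
# The query omits the hyphen used in the catalogued transliteration.
for score, record in approximate_search("Kitab al Tawhid", catalogue):
    print(round(score, 2), record["title"])
```

The same ranking could be used during data entry to suggest existing records that resemble the one being catalogued, which is one simple form of the intelligent assistance described above.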