AUTHORS: S. Bergamaschi, R. Martoglia, F. Ruozzi, R. A. Vigliermo, L. Sala, M. Vanzini
URL: https://ceur-ws.org/Vol-3365/short11.pdf
Work Package: WP5
Keywords: Cultural heritage, Non-Latin alphabets, Knowledge extraction, Machine Learning, Natural Language Processing, Big data management, Long-term preservation, Big data integration, Named Entity Recognition
Abstract
The services provided by today’s cutting-edge digital library systems may benefit from new technologies that improve cataloguing efficiency and the preservation and accessibility of cultural heritage. We introduce the recently started Digital Maktaba (DM) project, which proposes a new model for knowledge extraction and semi-automatic cataloguing in digital libraries that contain documents in non-Latin scripts (e.g. Arabic). Since DM involves a large amount of unorganized data from several sources, particular emphasis is placed on big data integration, big data analysis and long-term preservation. The project aims to create an innovative workflow for the automatic extraction of information and metadata and for semi-automated cataloguing, exploiting Machine Learning, Natural Language Processing, Artificial Intelligence and data management techniques to provide a system capable of speeding up, enhancing and supporting the librarian’s work. We also report some promising results obtained in a preliminary proof-of-concept experiment. (Short paper, discussion paper)