Knowledge Extraction and Cross-Language Data Integration in Digital Libraries

AUTHORS: L. Sala

URL: https://ceur-ws.org/Vol-3478/paper17.pdf

Work Package : WP5

Keywords: Data Integration, Cross-Language Record Linkage, Knowledge Extraction, Long-term Preservation

Abstract
Digital Humanities (DH) is an interdisciplinary field that has grown rapidly in recent years, requiring the creation of an efficient and uniform platform capable of managing various types of data in several languages. This paper presents the research objectives and methodologies of my PhD project: the creation of a novel framework for Knowledge Extraction and Multilingual Data Integration in the context of digital libraries in non-Latin languages, in particular Arabic, Persian and Azerbaijani. The research began with the Digital Maktaba (DM) project and continued within the PNRR ITSERR infrastructure, in which the DBGroup1 participates. The project aims to develop a two-component framework consisting of a Knowledge Extraction Subsystem and a Data Integration Subsystem. The case study is based on the DM project, which seeks to create a flexible and efficient digital library for preserving and analyzing multicultural heritage documents by exploiting the available and ad-hoc created datasets, Explainable Machine Learning , Natural Language Processing (NLP) technologies and Data Integration approaches. Key challenges and future developments in Knowledge Extraction and Data Integration are examined, which involve leveraging the MOMIS system for Data Integration tasks and adopting a microservices-based architecture for the effective implementation of the system. The goal is to provide a versatile platform for organizing and integrating various data sources and languages, thereby fostering a more inclusive and accessible global perspective on cultural and historical artefacts that encourage collaboration in building an expanding knowledge base.