Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections

AUTHORS: Giovanni Sullutrone, Riccardo Amerigo Vigliermo, Luca Sala, Sonia Bergamaschi

WORK PACKAGE: WP 5 – Digital Maktaba

URL: https://link.springer.com/chapter/10.1007/978-3-031-72440-4_5

Keywords: Retrieval-Augmented Generation, Bias, Digital Libraries, Sensitive Topics, Islamic studies, ḥadīṯ collections

Abstract
The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG) to incorporate query-specific knowledge at inference time. In this paper, the trustworthiness of RAG systems is investigated, focusing on the performance of their retrieval phase when dealing with sensitive topics. This issue is particularly relevant as it could hinder a user’s ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library possibly containing sensitive topics, a ḥadīṯ dataset has been curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses the performance of document retrieval by operating in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications, this would mean relevant documents being placed lower in the retrieval list, leading to the presence of irrelevant information or, with a lower cut-off, the absence of relevant information.
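As an illustration of the kind of comparison the abstract describes, the following minimal Python sketch ranks passages by cosine similarity to a question embedding and measures how often the source passage appears within a cut-off k, separately for sensitive and non-sensitive passages. It is not the authors' QCR implementation: the function names, the use of recall at k as the metric, and the plain-list embeddings are all assumptions made for illustration, and in practice the vectors would come from a sentence embedding model.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_of_relevant(query_vec, passage_vecs, relevant_idx):
    """1-based position of the relevant passage when passages are sorted by similarity."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(relevant_idx) + 1


def recall_at_k(ranks, k):
    """Fraction of questions whose relevant passage is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def recall_gap(ranks, sensitive_flags, k):
    """Recall@k on non-sensitive questions minus recall@k on sensitive ones.

    Assumes both groups are non-empty. A positive value means sensitive
    passages are retrieved less reliably at this cut-off.
    """
    sensitive = [r for r, s in zip(ranks, sensitive_flags) if s]
    non_sensitive = [r for r, s in zip(ranks, sensitive_flags) if not s]
    return recall_at_k(non_sensitive, k) - recall_at_k(sensitive, k)
```

Each question would be generated from a known passage (so `relevant_idx` is known by construction), each passage would carry a sensitive or non-sensitive label from the classification step, and `recall_gap` would then summarise the difference at a chosen cut-off. A lower cut-off k makes a drop in rank more costly, which matches the abstract's remark about relevant documents going missing.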
