Isometric Sets of Words and Generalizations of the Fibonacci Cubes

AUTHORS: Marcella Anselmo, Giuseppa Castiglione, Manuela Flores, Dora Giammarresi, Maria Madonia, Sabrina Mantaci

WORK PACKAGE: WP 7 – REVER

URL https://link.springer.com/chapter/10.1007/978-3-031-64309-5_35

Keywords: Isometric sets of words, Hamming distance, Hypercubes, Generalized Fibonacci Cubes

Abstract
The hypercube Q_n is a graph whose 2^n vertices can be associated with the binary words of length n in such a way that adjacent vertices are assigned words differing in exactly one symbol. Given a word f, the subgraph Q_n(f) is defined by selecting all vertices whose words do not contain f as a factor. A word f is said to be isometric if Q_n(f) is an isometric subgraph of Q_n, i.e., it preserves the distances between the remaining vertices. The graphs Q_n(f) were defined and studied as a generalization of the Fibonacci cubes Q_n(11). Isometric words have been completely characterized using combinatorial methods for strings.

We introduce the notion of isometric sets of words with the aim of capturing further interesting cases in the scenario of isometric subgraphs of the hypercubes. We prove some combinatorial properties and study special interesting cases.
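
To make the construction concrete, the following Python sketch (our illustration, not code from the paper) builds Q_n(f) for a small n and checks whether it is isometric in Q_n by comparing shortest-path distances inside Q_n(f) with Hamming distances.

from itertools import product
from collections import deque

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def q_n_f(n, f):
    # Vertices of Q_n(f): binary words of length n that avoid f as a factor.
    return [''.join(w) for w in product('01', repeat=n) if f not in ''.join(w)]

def is_isometric(n, f):
    # Q_n(f) is isometric in Q_n if, for every pair of its vertices, the
    # shortest-path distance inside Q_n(f) equals their Hamming distance.
    vertices = q_n_f(n, f)
    adj = {v: [u for u in vertices if hamming(u, v) == 1] for v in vertices}
    for s in vertices:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        if any(dist.get(t, float('inf')) != hamming(s, t) for t in vertices):
            return False
    return True

print(is_isometric(6, '11'))   # Fibonacci cube Q_6(11): expected True

Note that this brute-force test only certifies a single value of n; a word f is isometric when Q_n(f) is isometric in Q_n for every n.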




Density of Ham- and Lee- non-isometric k-ary Words

AUTHORS: Marcella Anselmo, Manuela Flores, Maria Serafina Madonia

WORK PACKAGE: WP 7 – REVER

URL https://ceur-ws.org/Vol-3587/3914.pdf

Keywords: Isometric words, Overlap with errors, Hamming and Lee distance, Density

Abstract
Isometric k-ary words have been defined with reference to the Hamming and the Lee distances. A word is non-isometric if and only if it has a prefix at distance 2 from the suffix of the same length; such a prefix is called a 2-error overlap. The limit density of isometric binary words based on the Hamming distance was evaluated by Klavžar and Shpectorov, who obtained that about 8% of all binary words are isometric. In this paper, the issue is addressed for k-ary words with respect to both the Hamming and the Lee distances. Actually, the only meaningful case of Lee-isometric k-ary words is k = 4. It is proved that, as the length of words increases, the limit density of quaternary Ham-isometric words is around 17%, while the limit density of quaternary Lee-isometric words is even larger, about 30%. The results are obtained using combinatorial methods and algorithms for counting the number of k-ary isometric words.
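
The characterization recalled in the abstract lends itself to a direct test: a word is Ham-non-isometric exactly when some proper prefix is at Hamming distance 2 from the suffix of the same length (a 2-error overlap). Below is a minimal Python sketch of this test and of a brute-force density estimate for small lengths; it is illustrative and not the authors' counting algorithm.

from itertools import product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def has_2_error_overlap(w):
    # A proper prefix at Hamming distance 2 from the suffix of the same length.
    n = len(w)
    return any(hamming(w[:k], w[n - k:]) == 2 for k in range(1, n))

def is_ham_isometric(w):
    # Non-isometric if and only if a 2-error overlap exists.
    return not has_2_error_overlap(w)

def ham_isometric_density(k, length):
    # Fraction of Ham-isometric words of a given length over a k-ary alphabet.
    alphabet = [str(i) for i in range(k)]
    words = [''.join(w) for w in product(alphabet, repeat=length)]
    return sum(is_ham_isometric(w) for w in words) / len(words)

print(ham_isometric_density(4, 6))   # small-length estimate, not the limit density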




Using large language models to create narrative events

AUTHORS: Valentina Bartalesi, Emanuele Lenzi, Claudio De Martino

WORK PACKAGE:

URL https://peerj.com/articles/cs-2242/

Keywords:

Abstract
Narratives play a crucial role in human communication, serving as a means to convey experiences, perspectives, and meanings across various domains. They are particularly significant in scientific communities, where narratives are often utilized to explain complex phenomena and share knowledge. This article explores the possibility of integrating large language models (LLMs) into a workflow that, exploiting Semantic Web technologies, transforms raw textual data gathered by scientific communities into narratives. In particular, we focus on using LLMs to automatically create narrative events, maintaining the reliability of the generated texts. The study provides a conceptual definition of narrative events and evaluates the performance of different smaller LLMs compared to the requirements we identified. A key aspect of the experiment is the emphasis on maintaining the integrity of the original narratives in the LLM outputs, as experts often review texts produced by scientific communities to ensure their accuracy and reliability. We first perform an evaluation on a corpus of five narratives and then on a larger dataset comprising 124 narratives. LLaMA 2 is identified as the most suitable model for generating narrative events that closely align with the input texts, demonstrating its ability to generate high-quality narrative events. Prompt engineering techniques are then employed to enhance the performance of the selected model, leading to further improvements in the quality of the generated texts.
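
As a purely illustrative sketch of the event-generation step described here (model checkpoint, prompt wording, and event fields are our assumptions, not the paper's setup), a locally hosted LLaMA 2 chat model could be prompted through the Hugging Face transformers pipeline as follows.

from transformers import pipeline

# Assumption: a LLaMA 2 chat checkpoint is available locally; any
# instruction-tuned model served by the text-generation pipeline would do.
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

PROMPT = (
    "Extract the narrative events from the text below. For each event give a short "
    "title, the actors involved, the time expression (if any), and the sentence of "
    "the source text it comes from. Do not add information that is not in the text.\n\n"
    "Text:\n{text}\n\nEvents:"
)

def extract_events(text):
    out = generator(PROMPT.format(text=text), max_new_tokens=300, do_sample=False)
    return out[0]["generated_text"]

print(extract_events("In 1610 Galileo observed the moons of Jupiter and reported "
                     "the discovery in the Sidereus Nuncius."))

The "do not add information" instruction mirrors the integrity requirement stressed in the abstract.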




Sensitive Topics Retrieval in Digital Libraries: A Case Study of ḥadīṯ collections

AUTHORS: Giovanni Sullutrone, Riccardo Amerigo Vigliermo, Luca Sala, Sonia Bergamaschi

WORK PACKAGE: WP 5 – Digital Maktaba

URL https://link.springer.com/chapter/10.1007/978-3-031-72440-4_5

Keywords: Retrieval-Augmented Generation, Bias, Digital Libraries, Sensitive Topics, Islamic studies, ḥadīṯ collections

Abstract
The advent of Large Language Models (LLMs) has led to the development of new Question-Answering (QA) systems based on Retrieval-Augmented Generation (RAG) to incorporate query-specific knowledge at inference time. In this paper, the trustworthiness of RAG systems is investigated, particularly focusing on the performance of their retrieval phase when dealing with sensitive topics. This issue is particularly relevant as it could hinder a user’s ability to analyze sections of the available corpora, effectively biasing any subsequent research. To mimic a specialised library possibly containing sensitive topics, a ḥadīṯ dataset has been curated using an ad-hoc framework called Question-Classify-Retrieve (QCR), which automatically assesses the performance of document retrieval by operating in three main steps: Question Generation, Passage Classification, and Passage Retrieval. Different sentence embedding models for document retrieval were tested, showing a significant performance gap between sensitive and non-sensitive topics compared to the baseline. In real-world applications this would mean relevant documents being placed lower in the retrieval list, leading to the presence of irrelevant information or the absence of relevant information in the case of a lower cut-off.
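
To make the retrieval phase concrete, the sketch below (our illustration under assumed data structures, not the QCR implementation) embeds questions and passages with a sentence-embedding model and measures the rank at which the relevant passage is retrieved; comparing the mean rank over sensitive and non-sensitive question sets exposes the kind of gap reported here.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def rank_of_relevant(question, passages, relevant_idx):
    # Rank (1 = best) at which the relevant passage is retrieved for a question.
    q = model.encode(question, convert_to_tensor=True)
    p = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q, p)[0]
    order = scores.argsort(descending=True).tolist()
    return order.index(relevant_idx) + 1

def mean_rank(labelled_questions, passages):
    # labelled_questions: list of (question_text, index_of_relevant_passage).
    ranks = [rank_of_relevant(q, passages, i) for q, i in labelled_questions]
    return sum(ranks) / len(ranks)

# Comparing mean_rank(sensitive_questions, passages) with
# mean_rank(neutral_questions, passages) mirrors the evaluation idea.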




Text-to-SQL with Large Language Models: Exploring the Promise and Pitfalls

AUTHORS: Luca Sala, Giovanni Sullutrone, Sonia Bergamaschi

WORK PACKAGE: WP 5 – Digital Maktaba

URL https://ceur-ws.org/Vol-3741/paper65.pdf

Keywords: Large Language Models, Text-to-SQL, Relational Databases, SQL

Abstract
The emergence of Large Language Models (LLMs) represents a fundamental change in the ever-evolving field of natural language processing (NLP). Over the past few years, the enhanced capabilities of these models have led to their widespread use across various fields, in both practical applications and research contexts. In particular, as data science intersects with LLMs, new research opportunities and insights emerge, notably in translating text into Structured Query Language (Text-to-SQL). The application of this technology to such a task poses a unique set of opportunities and related issues that have significant implications for information retrieval. This discussion paper delves into these intricacies and limitations, focusing on challenges that jeopardise efficacy and reliability. This research investigates scalability, accuracy, and the concerning issue of hallucinated responses, questioning the trustworthiness of LLMs. Furthermore, we point out the limits of the current usage of test datasets created for research purposes in capturing real-world complexities. Finally, we consider the performance of Text-to-SQL with LLMs from different perspectives. Our investigation identifies the key challenges faced by LLMs and proposes viable solutions to facilitate the exploitation of these models to advance data retrieval, bridging the gap between academic research and real-world application scenarios.
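
As a minimal sketch of the task under discussion (the prompt format and the validation step are our assumptions, not a system evaluated in the paper), one can hand an LLM the schema together with the question and then guard against hallucinated tables or columns by executing the generated query before accepting it.

import sqlite3

PROMPT_TEMPLATE = (
    "Given the SQLite schema:\n{schema}\n"
    "Write a single SQL query that answers: {question}\n"
    "Return only the SQL."
)

def validate_sql(db_path, sql):
    # Execute the generated query; a failure typically signals hallucinated
    # tables/columns or invalid syntax rather than a wrong-but-valid answer.
    try:
        with sqlite3.connect(db_path) as conn:
            return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)

# Typical loop: fill PROMPT_TEMPLATE, call an LLM of choice to obtain `sql`,
# and accept the query only if validate_sql(db_path, sql) succeeds.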




Automatic Lemmatization of Old Church Slavonic Language Using A Novel Dictionary-Based Approach

AUTHORS: Usman Nawaz, Liliana Lo Presti, Marianna Napolitano, Marco La Cascia

WORK PACKAGE: WP 4 – DamySim

URL https://link.springer.com/chapter/10.1007/978-3-031-70442-0_25

Keywords: Old Church Slavonic, Lemmatization, Ancient Language, Natural Language Processing

Abstract
Old Church Slavonic (OCS) is an ancient language that poses unique challenges in natural language processing. Currently, there is a lack of Python libraries devised for the analysis of OCS texts. This research not only fills a crucial gap in the computational treatment of the OCS language but also produces valuable resources for scholars in historical linguistics, cultural studies, and the humanities, supporting further research in the field of ancient language processing. The main contribution of this work is the development of an algorithm for the lemmatization of OCS texts based on a learned dictionary. The approach can deal with ancient languages without the need for prior linguistic knowledge. Starting from a dataset of more than 330K OCS words and their corresponding lemmas, the approach integrates the algorithm and the dictionary efficiently to achieve accurate lemmatization on test data.
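
A stripped-down version of the dictionary-based idea (our own sketch; the paper's algorithm and its handling of unseen forms are not reproduced here) learns a word-to-lemma mapping from the annotated pairs and falls back to a longest-shared-suffix heuristic for out-of-dictionary words. The (word, lemma) pairs below are purely illustrative.

def build_dictionary(pairs):
    # pairs: iterable of (word, lemma) as in the annotated OCS corpus.
    return {word: lemma for word, lemma in pairs}

def lemmatize(word, dictionary):
    if word in dictionary:
        return dictionary[word]
    # Hypothetical fallback: reuse the lemma of the known word that shares
    # the longest suffix with the unseen form.
    best_lemma, best_len = word, 0
    for known, lemma in dictionary.items():
        k = 0
        while k < min(len(word), len(known)) and word[-1 - k] == known[-1 - k]:
            k += 1
        if k > best_len:
            best_lemma, best_len = lemma, k
    return best_lemma

dictionary = build_dictionary([("словеса", "слово"), ("словесе", "слово")])
print(lemmatize("словеса", dictionary))  # exact dictionary hit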




Unveiling the Truth: Exploring Human Gaze Patterns in Fake Images

AUTHORS: Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Rita Cucchiara

WORK PACKAGE: WP 6 – YASMINE

URL https://ieeexplore.ieee.org/document/10465604

Keywords: Deepfakes, gaze tracking, human in the loop, visual perception

Abstract
Creating high-quality and realistic images is now possible thanks to the impressive advancements in image generation. A description in natural language of your desired output is all you need to obtain breathtaking results. However, as the use of generative models grows, so do concerns about the propagation of malicious content and misinformation. Consequently, the research community is actively working on the development of novel fake detection techniques, primarily focusing on low-level features and possible fingerprints left by generative models during the image generation process. In a different vein, in our work, we leverage human semantic knowledge to investigate whether it can be included in fake image detection frameworks. To achieve this, we collect a novel dataset of partially manipulated images using diffusion models and conduct an eye-tracking experiment to record the eye movements of different observers while viewing real and fake stimuli. A preliminary statistical analysis is conducted to explore the distinctive patterns in how humans perceive genuine and altered images. Statistical findings reveal that, when perceiving counterfeit samples, humans tend to focus on more confined regions of the image, in contrast to the more dispersed pattern observed when viewing genuine images.
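
To illustrate the kind of statistic behind this finding (a hypothetical analysis sketch, not the authors' protocol), one can quantify how confined the fixations on each stimulus are, e.g. by the mean distance of fixations from their centroid, and compare the two conditions with a Welch t-test.

import numpy as np
from scipy.stats import ttest_ind

def fixation_dispersion(fixations):
    # fixations: array of shape (n, 2) with (x, y) coordinates of one trial.
    # Smaller values mean gaze confined to a more restricted region.
    pts = np.asarray(fixations, dtype=float)
    centroid = pts.mean(axis=0)
    return float(np.linalg.norm(pts - centroid, axis=1).mean())

def compare_conditions(real_trials, fake_trials):
    # Each argument is a list of per-trial fixation arrays.
    real = [fixation_dispersion(t) for t in real_trials]
    fake = [fixation_dispersion(t) for t in fake_trials]
    return ttest_ind(real, fake, equal_var=False)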




RoBERT2VecTM: A Novel Approach for Topic Extraction in Islamic Studies

AUTHORS: Sania Aftar, Luca Gagliardelli, Amina El Ganadi, Federico Ruozzi, Sonia Bergamaschi

WORK PACKAGE: WP 5 – Digital Maktaba

URL https://aclanthology.org/2024.findings-emnlp.534/

Keywords: 

Abstract
Investigating “Hadith” texts, crucial for theological studies and Islamic jurisprudence, presents challenges due to the linguistic complexity of Arabic, such as its complex morphology. In this paper, we propose an innovative approach to address the challenges of topic modeling in Hadith studies by utilizing the Contextualized Topic Model (CTM). Our study introduces RoBERT2VecTM, a novel neural-based approach that combines the RoBERTa transformer model with Doc2Vec, specifically targeting the semantic analysis of “Matn” (the actual content). The methodology outperforms many traditional state-of-the-art NLP models by generating more coherent and diverse Arabic topics. The diversity of the generated topics allows for further categorization, deepening the understanding of discussed concepts. Notably, our research highlights the critical impact of lemmatization and stopwords in enhancing topic modeling. This breakthrough marks a significant stride in applying NLP to non-Latin languages and opens new avenues for the nuanced analysis of complex religious texts.
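
The paper builds on the Contextualized Topic Model; the sketch below only illustrates the embedding-combination idea behind RoBERT2VecTM (contextual transformer embeddings concatenated with Doc2Vec document vectors), with a plain clustering step standing in for the topic model. Model names and parameters are assumptions, not the paper's configuration.

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.cluster import KMeans

def transformer_embeddings(texts, model_name="xlm-roberta-base"):
    # Mean-pooled contextual embeddings; any RoBERTa-style checkpoint would do.
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    vecs = []
    with torch.no_grad():
        for t in texts:
            inputs = tok(t, return_tensors="pt", truncation=True)
            hidden = model(**inputs).last_hidden_state.mean(dim=1)
            vecs.append(hidden.squeeze(0).numpy())
    return np.vstack(vecs)

def doc2vec_embeddings(texts, dim=100):
    tagged = [TaggedDocument(t.split(), [i]) for i, t in enumerate(texts)]
    d2v = Doc2Vec(tagged, vector_size=dim, min_count=1, epochs=40)
    return np.vstack([d2v.dv[i] for i in range(len(texts))])

def combined_topics(texts, n_topics=10):
    combined = np.hstack([transformer_embeddings(texts), doc2vec_embeddings(texts)])
    return KMeans(n_clusters=n_topics, n_init=10).fit_predict(combined)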




The Impact of Generative AI on Islamic Studies: Case Analysis of “Digital Muhammad ibn Isma’il Al-Bukharī”

AUTHORS: Amina El Ganadi, Sania Aftar, Luca Gagliardelli, Sonia Bergamaschi, Federico Ruozzi

WORK PACKAGE: WP 5 – Digital Maktaba

URL https://ieeexplore.ieee.org/document/10852480

Keywords: Analytical models, Accuracy, Text analysis, Large language models, Collaboration, Training data, Chatbots, Reliability engineering, Prompt engineering, Artificial intelligence

Abstract
The emergence of large language models (LLMs) such as ChatGPT, LLaMA, Gemini, and Claude has transformed natural language processing (NLP) tasks by demonstrating remarkable capabilities in generating fluent and contextually appropriate responses. This paper examines the current state of LLMs, their applications, inherent challenges, and potential future directions necessitating multidisciplinary collaboration. A key focus is the application of generative AI in Islamic studies, particularly in managing sensitive content such as the Ahadith (corpus of sayings, actions, and approvals attributed to the Prophet Muḥammad). We detail the customization and refinement of the AI model, “Digital Muḥammad ibn Ismail Al-Bukhari,” designed to provide accurate responses based on the Sahih Al-Bukhari collection. Our methodology includes rigorous dataset curation, preprocessing, model customization, and evaluation to ensure the model’s reliability. Strategies to mitigate hallucinations involve implementing context-aware constraints, regular audits, and continuous feedback loops to maintain adherence to authoritative texts and correct biases. Findings indicate a significant reduction in hallucinations, though challenges such as residual biases and handling ambiguous queries persist. This research underscores the importance of recognizing LLMs’ limitations and highlights the need for collaborative efforts in fine-tuning these models with authoritative texts. It offers a framework for the cautious application of generative AI in Islamic studies, emphasizing continuous improvements to enhance AI reliability.
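
One of the hallucination-mitigation strategies mentioned, context-aware constraints, can be pictured with the following sketch: the assistant may answer only from retrieved Sahih Al-Bukhari passages and must decline otherwise. The retrieval and generation callables and the threshold are hypothetical placeholders, not the deployed system.

def answer_with_constraints(question, retrieve, generate, min_score=0.5):
    # retrieve(question) -> list of (passage, relevance_score) from the curated
    # collection; generate(prompt) -> model completion. Both are assumed callables.
    supported = [(p, s) for p, s in retrieve(question) if s >= min_score]
    if not supported:
        # Context-aware constraint: no authoritative support, no generated answer.
        return "No passage in the Sahih Al-Bukhari collection supports an answer."
    context = "\n".join(p for p, _ in supported)
    prompt = (
        "Answer strictly and only from the passages below; if they do not contain "
        "the answer, say so.\n\nPassages:\n" + context + "\n\nQuestion: " + question
    )
    return generate(prompt)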




A Novel Methodology for Topic Identification in Hadith

AUTHORS: Sania Aftar, Luca Gagliardelli, Amina El Ganadi, Federico Ruozzi, Sonia Bergamaschi

WORK PACKAGE: WP 5 – Digital Maktaba

URL https://ceur-ws.org/Vol-3643/paper12.pdf

Keywords: Topic Modeling, Hadith, Neural Topic Model

Abstract
In this paper, we present our preliminary work on developing a novel neural-based approach named RoBERT2VecTM, aimed at identifying topics within the “Matn” of “Hadith”. This approach focuses on semantic analysis, showing potential to outperform current state-of-the-art models. Despite the availability of various models for topic identification, many struggle with multilingual datasets. Furthermore, some models have limitations in discerning deep semantic meanings, as they are not trained for languages such as Arabic. Considering the sensitive nature of Hadith texts, where topics are often complexly interleaved, careful handling is imperative. We anticipate that RoBERT2VecTM will offer substantial improvements in understanding contextual relationships within texts, a crucial aspect for accurately identifying topics in such intricate religious documents.