Same or Different? Diff-Vectors for Authorship Analysis

AUTHORS: Silvia Corbara, Alejandro Moreo, Fabrizio Sebastiani

WORK PACKAGE: WP 8 – UbiQuity

URL: https://dl.acm.org/doi/10.1145/3609226

Keywords: deep learning, machine learning, information retrieval, computer science, data mining, support vector, logistic regression, artificial intelligence, supervised learning

Abstract
In this article, we investigate the effects on authorship identification tasks (including authorship verification, closed-set authorship attribution, and closed-set and open-set same-author verification) of a fundamental shift in how to conceive the vectorial representations of documents that are given as input to a supervised learner. In “classic” authorship analysis, a feature vector represents a document, the value of a feature represents (an increasing function of) the relative frequency of the feature in the document, and the class label represents the author of the document. We instead investigate the situation in which a feature vector represents an unordered pair of documents, the value of a feature represents the absolute difference in the relative frequencies (or increasing functions thereof) of the feature in the two documents, and the class label indicates whether the two documents are by the same author or not. This latter (learner-independent) type of representation has occasionally been used before, but has never been studied systematically. We argue that it is advantageous and that, in some cases (e.g., authorship verification), it provides a much larger quantity of information to the training process than the standard representation. The experiments that we carry out on several publicly available datasets (including one that we make available here for the first time) show that feature vectors representing pairs of documents (which we here call Diff-Vectors) bring about systematic improvements in the effectiveness of authorship identification tasks, especially when training data are scarce (as is often the case in real-life authorship identification scenarios). Our experiments tackle same-author verification, authorship verification, and closed-set authorship attribution; while Diff-Vectors are naturally geared towards solving the first, we also provide two novel methods for solving the second and the third that use a solver for the first as a building block.
The code to reproduce our experiments is open-source and available online.
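The pair-based representation described above lends itself to a compact sketch. The following is a minimal illustration (the function names are ours, not the authors'), in which the features are simply tokens and their values are relative frequencies:

```python
from collections import Counter

def rel_freqs(tokens):
    """Relative frequency of each feature (here: tokens) in a document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def diff_vector(doc_a, doc_b, vocabulary):
    """Diff-Vector for an unordered document pair: the absolute difference
    of relative feature frequencies, one component per vocabulary feature."""
    fa, fb = rel_freqs(doc_a), rel_freqs(doc_b)
    return [abs(fa.get(t, 0.0) - fb.get(t, 0.0)) for t in vocabulary]

vocab = ["the", "of", "and"]
dv = diff_vector("the cat and the dog".split(), "of the sea".split(), vocab)
```

A same-author/different-author label attached to `dv` then turns any binary classifier into a same-author verifier.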




A Simple Method for Classifier Accuracy Prediction Under Prior Probability Shift

AUTHORS: Lorenzo Volpi, Alejandro Moreo, Fabrizio Sebastiani

WORK PACKAGE: WP 8 – UbiQuity

URL: A Simple Method for Classifier Accuracy Prediction Under Prior Probability Shift

Keywords: Classifier accuracy prediction, Prior probability shift, Label shift, Quantification

Abstract
The standard technique for predicting the accuracy that a classifier will have on unseen data (classifier accuracy prediction – CAP) is cross-validation (CV). However, CV relies on the assumption that the training data and the test data are sampled from the same distribution, an assumption that is often violated in many real-world scenarios. When such violations occur (i.e., in the presence of dataset shift), the estimates returned by CV are unreliable. In this paper we propose a CAP method specifically designed to address prior probability shift (PPS), an instance of dataset shift in which the training and test distributions are characterized by different class priors. By solving a system of n² independent linear equations, with n the number of classes, our method estimates the n² entries of the contingency table of the test data, and thus allows estimating any specific evaluation measure. Since a key step in this method involves predicting the class priors of the test data, we further observe a connection between our method and the field of “learning to quantify”. Our experiments show that, when combined with state-of-the-art quantification techniques, under PPS our method tends to outperform existing CAP methods.
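As a simplified illustration of the idea (not the paper's exact formulation; the numbers below are invented), each entry of the test contingency table can be written as one linear equation that combines validation-estimated conditional rates with a quantification-based estimate of the test priors:

```python
import numpy as np

# Hypothetical inputs: conditional rates P(yhat=i | y=j) estimated on
# validation data (rows: predicted class, cols: true class), and test-set
# priors P(y=j) estimated by a quantification method.
cond = np.array([[0.9, 0.2],
                 [0.1, 0.8]])
test_priors = np.array([0.3, 0.7])
n_test = 1000

# Each contingency-table entry solves one linear equation:
# C[i, j] = P(yhat=i | y=j) * P(y=j) * n_test
contingency = cond * test_priors * n_test

# Once the table is estimated, any evaluation measure follows from it.
accuracy = np.trace(contingency) / n_test
```

With the table in hand, precision, recall, F1, or any other contingency-based measure can be predicted the same way.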




The Questio de aqua et terra: A Computational Authorship Verification Study

AUTHORS: Martina Leocata, Alejandro Moreo, Fabrizio Sebastiani

WORK PACKAGE: WP 3 – T-Res

URL: The Questio de aqua et terra: A Computational Authorship Verification Study

Keywords:

Abstract
The Questio de aqua et terra is a cosmological treatise traditionally attributed to Dante Alighieri. However, the authenticity of this text is controversial, due to discrepancies with Dante’s established works and to the absence of contemporary references. This study investigates the authenticity of the Questio via computational authorship verification (AV), a class of techniques which combine supervised machine learning and stylometry. We build a family of AV systems and assemble a corpus of 330 13th- and 14th-century Latin texts, which we use to comparatively evaluate the AV systems through leave-one-out cross-validation. Our best-performing system achieves high verification accuracy (F1=0.970) despite the heterogeneity of the corpus in terms of textual genre. The key contribution to the accuracy of this system is shown to come from Distributional Random Oversampling (DRO), a technique specially tailored to text classification which is here used for the first time in AV.
The application of the AV system to the Questio returns a highly confident prediction concerning its authenticity. These findings contribute to the debate on the authorship of the Questio, and highlight DRO’s potential in the application of AV to cultural heritage.
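The core idea behind DRO can be sketched as follows (a simplified illustration under our own assumptions, not the authors' implementation): each synthetic minority-class vector is drawn from the multinomial word distribution of an existing document, so the oversampled vectors vary plausibly rather than merely duplicating the original:

```python
import numpy as np

rng = np.random.default_rng(0)

def dro_oversample(doc_counts, n_new):
    """Sketch of Distributional Random Oversampling: draw synthetic
    bag-of-words vectors from the multinomial distribution underlying an
    existing minority-class document, preserving its length and profile."""
    doc_counts = np.asarray(doc_counts)
    length = doc_counts.sum()
    probs = doc_counts / length
    return rng.multinomial(length, probs, size=n_new)

# One minority-class document with counts over a 4-term vocabulary,
# expanded into 10 synthetic training vectors.
synthetic = dro_oversample([5, 3, 0, 2], n_new=10)
```

Terms absent from the source document stay absent in every synthetic vector, while the frequencies of the others fluctuate around their original proportions.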




«Vi fui presente e vidi». Contributi per uno studio dei diari di pellegrinaggio a Roma tra genere letterario e documento storico

AUTHORS: Ilaria Sabbatini

WORK PACKAGE: WP 7 – REVER

URL: Pubblicazione | ILARIA SABBATINI | Università degli Studi di Palermo

Keywords: Pilgrimage, Diaries, Rome, Literature, History

Abstract
The document analyzes pilgrimage diaries to Rome, focusing on their value as both a literary genre and historical documents. By examining various texts, including medieval itineraries and travelers’ accounts, the author explores the unique characteristics of these diaries, highlighting how they combine elements of personal narrative with geographical and historical descriptions. The concept of pilgrimage is examined not only as a physical journey but also as a spiritual path, and the importance of religious destinations such as Rome and Jerusalem is discussed. The work raises questions about the nature and evolution of pilgrimage diaries, seeking to understand the motivations and experiences of pilgrims through the centuries.




ABBIE: Attention-Based BI-Encoders for Predicting Where to Split Compound Sanskrit Words

AUTHORS: Irfan Ali, Liliana Lo Presti, Igor Spanò, Marco La Cascia

WORK PACKAGE: WP 4 – DamySim

URL: ABBIE: Attention-Based BI-Encoders for Predicting Where to Split Compound Sanskrit Words – ICAART 2025

Keywords: Word Segmentation, Sanskrit Language, Sandhi Rule, Bi-Encoders, Attention.

Abstract

Sanskrit is a highly composite language, morphologically and phonetically complex. One of the major challenges in processing Sanskrit is the splitting of compound words that are merged phonetically. Recognizing the exact location of splits in a compound word is difficult, since several possible splits can be found, but only a few of them are semantically meaningful. This paper proposes a novel deep learning method that uses two bi-encoders and a multi-head attention module to predict the valid split location in Sanskrit compound words. The two bi-encoders process the input sequence in direct and reverse order, respectively. The model learns the character-level context in which the splitting occurs by exploiting the correlation between the direct and reverse dynamics of the character sequence. The results of the proposed model are compared with a state-of-the-art technique that adopts a bidirectional recurrent network to solve the same task. Experimental results show that the proposed model correctly identifies where a compound word should be split into its components in 89.27% of cases, outperforming the state-of-the-art technique. The paper also proposes a dataset developed from the repository of the Digital Corpus of Sanskrit (DCS) and the University of Hyderabad (UoH) corpus.




Benchmarking BERT-based Models for Latin: A Case Study on Biblical References in Ancient Christian Literature

AUTHORS: Davide Caffagni, Federico Cocchi, Anna Mambelli, Fabio Tutrone, Marco Zanella, Marcella Cornia, Rita Cucchiara

WORK PACKAGE: WP 8 – UbiQuity

URL: Benchmarking BERT-based Models for Latin: A Case Study on Biblical References in Ancient Christian Literature

Keywords:  

Abstract
Transformer-based language models like BERT have revolutionized Natural Language Processing (NLP) research, but their application to historical languages remains underexplored. This paper investigates the adaptation of BERT-based embedding models for Latin, a language central to the study of the sacred texts of Christianity. Focusing on Jerome’s Vulgate, pre-Vulgate Latin translations of the Bible, and patristic commentaries such as Augustine’s De Genesi ad litteram, we address the challenges posed by Latin’s complex syntax, specialized vocabulary, and historical variations at the orthographic, morphological, and semantic levels. In particular, we propose fine-tuning existing BERT-based embedding models on annotated Latin corpora, using self-generated hard negatives to improve performance in detecting biblical references in early Christian literature in Latin. Experimental results demonstrate the ability of BERT-based models to identify citations of and allusions to the Bible(s) in ancient Christian commentaries while highlighting the complexities and challenges of this field. By integrating NLP techniques with humanistic expertise, this work provides a case study on intertextual analysis in Latin patristic works. It underscores the transformative potential of interdisciplinary approaches, advancing computational tools for sacred text studies and bridging the gap between philology and computational analysis.




Data extraction from 3D scanning: post-processing filtering for analytic and informative models of small archaeological finds

AUTHORS: Filippo DIARA

URL: Data extraction from 3D scanning: post-processing filtering for analytic and informative models of small archaeological finds | Archeologia e Calcolatori

WORK PACKAGE: WP 9 – Taurus

Abstract
Modern 3D scanners based on the structured-light principle are opening up possibilities for creating detailed models (polygon populations) with micrometric resolution. Such highly detailed models, in turn, enable specific investigations. This work focuses on the 3D scanning and post-processing analysis/filtering of Ancient Near East finds, especially seals and cuneiform clay tablets: fragile artefacts that can hold a great deal of semantic information beyond transliteration, e.g. seal impressions (figurative and textual sealings), fingerprint evidence, and retraced or erased text. Behind the ease of use of portable structured-light scanners lies enormous potential for feature extraction and processing. Metric analysis (e.g. deviation analysis), coupled with the application of the MSII (Multi-Scale Integral Invariant) filter, enhances data extraction, changing the overall perception of the details of the archaeological artefact.




Medieval Sanctuaries and Miraculous Images and Relics: Tracing the Gaze through Eye Trackers

AUTHORS: Federico Ruozzi, Marco Papasidero

WORK PACKAGE: WP 6 – YASMINE

URL: Medieval Sanctuaries and Miraculous Images and Relics: Tracing the Gaze through Eye Trackers

Keywords: Devotion, Gaze Studies, Sanctuaries

Abstract
This article is part of the research activities of the PNRR ITSERR project, which seeks to apply new digital technologies to religious studies. Specifically focusing on gaze studies, we utilised Aria eye trackers provided by Meta to the team of computer engineers at the University of Modena and Reggio Emilia (Italy), with whom this study is being carried out. These devices can record the gaze of users who wear them, as well as identify the objects or spatial elements being observed, the user’s location, and the duration of their focus. Adopting an interdisciplinary approach, the article explores the application of this technology to Catholic sacred spaces, specifically two sanctuaries in the Tuscan-Emilian Apennines: the Sanctuary of Our Lady of Bismantova (Reggio Emilia) and that of Saints Pellegrino and Bianco in Alpe (Modena). By observing and analysing the gaze patterns of ten users – varying in age, profession, and religious orientation – the study examines how individuals engage with these sacred contexts, with particular attention to the Marian image and the relics of the saints.




The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding

AUTHORS: Lorenzo Bianchi, Fabio Carrara, Nicola Messina, Claudio Gennaro, Fabrizio Falchi

URL: https://openaccess.thecvf.com/content/CVPR2024/html/Bianchi_The_Devil_is_in_the_Fine-Grained_Details_Evaluating_Open-Vocabulary_Object_CVPR_2024_paper.html

WORK PACKAGE: WP 5 – Digital Maktaba

Keywords: open-vocabulary detection, fine-grained understanding, benchmark

Abstract
Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty, probing different properties such as color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol, and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD.
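The flavor of the dynamic-vocabulary protocol can be conveyed with a toy sketch (the attribute list and helper names are hypothetical, not the benchmark's actual generation code): each ground-truth description is perturbed in one fine-grained attribute to produce hard-negative classes that the detector must reject:

```python
# Hypothetical attribute pool used to perturb ground-truth descriptions.
COLORS = ["red", "blue", "green", "black"]

def hard_negatives(description, n=3):
    """Generate hard-negative class names by swapping one fine-grained
    attribute (here: the color) in the ground-truth description."""
    tokens = description.split()
    negatives = []
    for i, tok in enumerate(tokens):
        if tok in COLORS:
            for alt in COLORS:
                if alt != tok and len(negatives) < n:
                    negatives.append(" ".join(tokens[:i] + [alt] + tokens[i + 1:]))
            break
    return negatives

# Dynamic vocabulary for one object: the true class plus its hard negatives.
vocab = ["a red car"] + hard_negatives("a red car")
```

A detector that merely recognizes "car" scores no better than chance on such a vocabulary; only fine-grained attribute understanding separates the positive from its distractors.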




Ubiquity. Il design della comunicazione nel progetto ITSERR

AUTHORS: Fabrizio D’Avenia, Cinzia Ferrara, Marcello Costa, Chiara Palillo

URL: http://www.societaitalianadesign.it/2024/10/29/design-per-la-diversita-2/

WORK PACKAGE: WP 8 – UbiQuity

Keywords:

Abstract
Within the Italian Strengthening of ESFRI RI Resilience (ITSERR) project, Ubiquity is a research platform developed for detecting literal and non-literal quotations of the Bible and the Quran in later exegetical Greek, Latin, and Arabic commentaries. The objective of Ubiquity’s team, which is made up of humanists, computer scientists, and designers, is to study and visualize data from sacred texts and to interact with them through visual components belonging to analogue and digital infographic systems. This widespread availability of skills for designing material and immaterial artefacts can be a great support for religious studies and scientific research.