μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context

AUTHORS: Fabio Quattrini, Carmine Zaccagnino, Silvia Cascianelli, Laura Righi, Rita Cucchiara

URL: https://arxiv.org/abs/2408.15646

Work Package : WP 6 – YASMINE, WP 7 – REVER

Keywords:

Abstract
Regesta are catalogs of summaries of other documents and, in some cases, are the only source of information about the content of such full-length documents. For this reason, they are of great interest to scholars in many social and humanities fields. In this work, we focus on Regesta Pontificum Romanorum, a large collection of papal registers. Regesta are visually rich documents, where the layout is as important as the text content to convey the contained information through the structure, and are inherently multi-page documents. Among Digital Humanities techniques that can help scholars efficiently exploit regesta and other documentary sources in the form of scanned documents, Document Parsing has emerged as a task to process document images and convert them into machine-readable structured representations, usually in a markup language. However, current models focus on scientific and business documents, and most of them consider only single-page documents. To overcome this limitation, in this work, we propose μgat, an extension of the recently proposed Nougat document parsing architecture, which can handle elements that span beyond a single page. Specifically, we adapt Nougat to process a larger, multi-page context, consisting of the previous and the following page, while parsing the current page. Experimental results, both qualitative and quantitative, demonstrate the effectiveness of our proposed approach also in the case of the challenging Regesta Pontificum Romanorum.




Alfie: Democratising RGBA Image Generation With No $$$

AUTHORS: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

URL: https://arxiv.org/abs/2408.14826

Work Package : WP 6 – YASMINE

Keywords:

Abstract
Designs and artworks are ubiquitous across various creative fields, requiring graphic design skills and dedicated software to create compositions that include many graphical elements, such as logos, icons, symbols, and art scenes, which are integral to visual storytelling. Automating the generation of such visual elements improves graphic designers’ productivity, democratizes and innovates the creative industry, and helps generate more realistic synthetic data for related tasks. These illustration elements are mostly RGBA images with irregular shapes and cutouts, facilitating blending and scene composition. However, most image generation models are incapable of generating such images and achieving this capability requires expensive computational resources, specific training recipes, or post-processing solutions. In this work, we propose a fully-automated approach for obtaining RGBA illustrations by modifying the inference-time behavior of a pre-trained Diffusion Transformer model, exploiting the prompt-guided controllability and visual quality offered by such models with no additional computational cost. We force the generation of entire subjects without sharp croppings, whose background is easily removed for seamless integration into design projects or artistic scenes. We show with a user study that, in most cases, users prefer our solution over generating and then matting an image, and we show that our generated illustrations yield good results when used as inputs for composite scene generation pipelines. We release the code at this https URL.




Merging and Splitting Diffusion Paths for Semantically Coherent Panoramas

AUTHORS: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

URL: https://link.springer.com/chapter/10.1007/978-3-031-72986-7_14

Work Package : WP 6 – YASMINE

Keywords: Image Generation, Diffusion Models, Text-to-Image

Abstract
Diffusion models have become the state of the art for text-to-image generation, and increasing research effort has been dedicated to adapting the inference process of pretrained diffusion models to achieve zero-shot capabilities. An example is the generation of panorama images, which has been tackled in recent works by combining independent diffusion paths over overlapping latent features, which is referred to as joint diffusion, obtaining perceptually aligned panoramas. However, these methods often yield semantically incoherent outputs and trade off diversity for uniformity. To overcome this limitation, we propose the Merge-Attend-Diffuse operator, which can be plugged into different types of pretrained diffusion models used in a joint diffusion setting to improve the perceptual and semantic coherence of the generated panorama images. Specifically, we merge the diffusion paths, reprogramming self- and cross-attention to operate on the aggregated latent space. Extensive quantitative and qualitative experimental analysis, together with a user study, demonstrates that our method maintains compatibility with the input prompt and visual quality of the generated images while increasing their semantic coherence. We release the code at https://github.com/aimagelab/MAD.




Binarizing Documents by Leveraging both Space and Frequency

AUTHORS: Fabio Quattrini, Vittorio Pippi, Silvia Cascianelli, Rita Cucchiara

URL: https://dl.acm.org/doi/10.1007/978-3-031-70543-4_1

Work Package : All ITSERR WPs using Artificial Intelligence

Keywords: Document Enhancement, Document Image Binarization, Fast Fourier Convolution

Abstract
Document Image Binarization is a well-known problem in Document Analysis and Computer Vision, although it is far from being solved. One of the main challenges of this task is that documents generally exhibit degradations and acquisition artifacts that can greatly vary throughout the page. Nonetheless, even when dealing with a local patch of the document, taking into account the overall appearance of a wide portion of the page can ease the prediction by enriching it with semantic information on the ink and background conditions. In this respect, approaches able to model both local and global information have been proven suitable for this task. In particular, recent applications of Vision Transformer (ViT)-based models, able to model short and long-range dependencies via the attention mechanism, have demonstrated their superiority over standard Convolution-based models, which instead struggle to model global dependencies. In this work, we propose an alternative solution based on the recently introduced Fast Fourier Convolutions, which overcomes the limitation of standard convolutions in modeling global information while requiring fewer parameters than ViTs. We validate the effectiveness of our approach via extensive experimental analysis considering different types of degradations.




VATr++: Choose Your Words Wisely for Handwritten Text Generation

AUTHORS: Bram Vanherle, Vittorio Pippi, Silvia Cascianelli, Nick Michiels, Frank Van Reeth, Rita Cucchiara

URL: https://ieeexplore.ieee.org/document/10716806

Work Package : All ITSERR WPs using Artificial Intelligence

Keywords: Handwritten text generation, handwritten text generation evaluation, synthetic data

Abstract
Styled Handwritten Text Generation (HTG) has received significant attention in recent years, propelled by the success of learning-based solutions employing GANs, Transformers, and, preliminarily, Diffusion Models. Despite this surge in interest, there remains a critical yet understudied aspect – the impact of the input, both visual and textual, on the HTG model training and its subsequent influence on performance. This work extends the VATr (Pippi et al. 2023) Styled-HTG approach by addressing the pre-processing and training issues that it faces, which are common to many HTG models. In particular, we propose generally applicable strategies for input preparation and training regularization that allow the model to achieve better performance and generalization capabilities. Moreover, in this work, we go beyond performance optimization and address a significant hurdle in HTG research – the lack of a standardized evaluation protocol. In particular, we propose a standardization of the evaluation protocol for HTG and conduct a comprehensive benchmarking of existing approaches. By doing so, we aim to establish a foundation for fair and meaningful comparisons between HTG strategies, fostering progress in the field.




Dreams, Texts, and Truths: Augustine on Hermeneutics and Oneirocriticism

AUTHORS: Fabio Tutrone

URL: http://hdl.handle.net/10447/667817

Work Package : WP 8 – UbiQuity

Keywords: Augustine, dreams, oneirocriticism, oneirology, Bible, biblical exegesis, allegory, Tertullian, Origen, Philo of Alexandria, Passio Perpetuae et Felicitatis, early Christian literature, Artemidorus of Daldis

Abstract
In the Greek and Roman worlds, oneirocriticism is hermeneutics and presupposes an epistemology – these and other cognate fields of inquiry being involved in a continuous process of social, political, and religious change. The present paper explores the relationship between dreams and hermeneutics in a meaningful passage of Augustine’s twelve-book commentary On the Literal Meaning of Genesis (De Genesi ad litteram) – a work rightly considered the most important testimony to the Christian cosmology of antiquity and the Middle Ages – in which the greatest of the Latin Church Fathers establishes a parallel between the interpretation of dreams and that of sacred texts. By elucidating the cultural background of Augustine’s understanding of dream images as cognitive phenomena that underlie both crucial passages of the Bible and the common experience of humans – both the soul and the body, both natural and supernatural powers – this paper sheds new light upon Augustine’s reaction to the materialism and literalism of Tertullian and early Christian communities, his reception of the allegorical method of Origen and the Alexandrian school, and his mystical embracing of Neoplatonic theories of knowledge. Indeed, Augustine turns out to be perfectly aware of many Greco-Roman and early Christian debates on oneirology and hermeneutical methods, and while he fiercely warns against the belief that the revelation of the Bible can be superseded or contradicted by the individual revelations of dreams, he strives to put together an original paradigm of natural philosophy, cognitive psychology, and symbolic interpretation, in an attempt to give dreams a definite place in the order of things.




Aëriae Animae: Souls and Elements from the Roman Cosmos to the Christian Afterworld

AUTHORS: Fabio Tutrone

URL: http://hdl.handle.net/10447/667820

Work Package : WP 8 – UbiQuity

Keywords: Augustine, early Christian psychology, soul, body, four elements, M. Terentius Varro, Antiochus of Ascalon, Middle Platonism, Neoplatonism, Bible, Stoicism, demonology, philosophy of nature, theology

Abstract
It has been widely recognized that until the fourth century AD Christians freely discussed the source and the nature of the soul – the cases of Origen and Tertullian being emblematic of this situation in the East and in the West, respectively. It was only in the fourth century AD – after the so-called conversion of Constantine, with the Church’s increasing entanglement with political and social power and the emergence of a new generation of Platonizing intellectuals from the ranks of the upper class – that Christian bishops and theologians inaugurated a new discourse on the soul, its transcendent origin, immaterial constitution, and immortal destiny, which entailed the banishment and repression of earlier alternative visions. In the present paper, I shall be exploring an episode in this crucial historical transition, which, though limited in scope, can shed light upon the long-standing interactions between Greco-Roman theories of matter, elements, and principles, on the one hand, and Christian ideas of the soul and the afterworld, on the other. I am going to focus on the treatise On the City of God (De Civitate Dei) by Augustine of Hippo, who is usually regarded as one of the most decisive and influential figures in what can be called the Neoplatonic turn of fourth-century AD Christian eschatology. It is too often forgotten that throughout his long engagement with the issue of the nature and origin of the soul Augustine maintained an agnostic position, which is faithfully mirrored in all his writings. Indeed, I shall attempt to show that Augustine’s troubled reflection on the soul – on what he repeatedly terms the ‘extremely obscure question of the soul’ (obscurissimam de anima quaestionem) – includes a meaningful dialogue with Book 16 of Varro’s Divine Antiquities (Antiquitates Rerum Divinarum) and its theory that the four elements of the cosmos host four different kinds of souls.
I will investigate the philosophical pedigree of Varro’s cosmological-cum-psychological doctrine, with its recognizable mixture of Platonic and Stoic notions, arguing that Varro’s teacher, the Middle Platonist philosopher Antiochus of Ascalon, is its most likely source. However, far from restricting myself to an exercise in Quellenforschung, I shall claim that the Varronian theory reported in Book 7 of Augustine’s City of God should be read in light of Augustine’s sustained reception of the Platonic tradition in Book 8 of the same work, where the view that the body of demons is made up of air is endorsed by Augustine and attests to his serious pondering of the role of the natural elements in the emergence of a creature’s essence.




Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

AUTHORS: Nicholas Moratelli, Davide Caffagni, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

WORK PACKAGE: WP 6 – YASMINE

URL: Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

Keywords:  

Abstract
The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics.




The Revolution of Multimodal Large Language Models: A Survey

AUTHORS: Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Luca Barsellotti, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

WORK PACKAGE: WP 6 – YASMINE

URL: https://aclanthology.org/2024.findings-acl.807/

Keywords:  

Abstract
Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs.




Pixels of Faith: Exploiting Visual Saliency to Detect Religious Image Manipulation

AUTHORS: Giuseppe Cartella, Vittorio Cuculo, Marcella Cornia, Marco Papasidero, Federico Ruozzi, Rita Cucchiara

WORK PACKAGE: WP 6 – YASMINE

URL: 2024_ECCVW_Gaze_ITSERR.pdf

Keywords: Gaze-assisted AI, Human Attention, Deepfake Detection, Religious Studies

Abstract
The proliferation of generative models has revolutionized various aspects of daily life, bringing both opportunities and challenges. This paper tackles a critical problem in the field of religious studies: the automatic detection of partially manipulated religious images. We address the discrepancy between human and algorithmic capabilities in identifying fake images, particularly those visually obvious to humans but challenging for current algorithms. Our study introduces a new testing dataset for religious imagery and incorporates human-derived saliency maps to guide deep learning models toward perceptually relevant regions for fake detection. Experiments demonstrate that integrating visual attention information into the training process significantly improves model performance, even with limited eye-tracking data. This human-in-the-loop approach represents a significant advancement in deepfake detection, particularly for preserving the integrity of religious and cultural content. This work contributes to the development of more robust and human-aligned deepfake detection systems, addressing critical challenges in the era of widespread generative AI technologies.