Image Realness Assessment and Localization with Multimodal Features
Abstract
A reliable method of quantifying the perceptual realness of AI-generated images and identifying visually inconsistent regions is crucial for practical use of AI-generated images and for improving photorealism of generative AI via realness feedback during training. This paper introduces a framework that accomplishes both overall objective realness assessment and local inconsistency identification of AI-generated images using textual descriptions of visual inconsistencies generated by vision-language models trained on large datasets that serve as reliable substitutes for human annotations. Our results demonstrate that the proposed multimodal approach improves objective realness prediction performance and produces dense realness maps that effectively distinguish between realistic and unrealistic spatial regions.