MirrorDiff: Learning mirror diffusion for image captioning via regeneration
Abstract
Recently, diffusion models, which have achieved promising progress in text-to-image generation, have also been explored for image captioning. However, these diffusion-based image captioning methods usually suffer from semantic inconsistency between image content and textual description, and thus lag behind Auto-Regressive (AR) methods. To this end, in this paper, we propose MirrorDiff, a novel dual diffusion-based framework that achieves semantic consistency with a symmetric image-to-text-to-image generation model, which acts like a mirror mapping the original input image into a regenerated image via the generated caption. Specifically, it first utilizes pre-trained image and text encoders to obtain image and textual representations, respectively, then forwards the image representation and the noisy textual representation into a continuous diffusion model to output an intermediate sentence. To semantically align the intermediate sentence with the input image, a diffusion-based visual regenerator regenerates the input image conditioned on the intermediate sentence, yielding the proposed visual regeneration loss. Unlike most existing image captioning methods, MirrorDiff is a plug-and-play framework: it can be plugged into many previous image captioning methods and further evaluates the generated sentence via the visual similarity between the input image and the regenerated image. Extensive experiments on the MS COCO dataset show that our method achieves clear improvements over state-of-the-art diffusion-based methods, reaching up to 127.9 CIDEr, and achieves competitive performance on multiple evaluation metrics against auto-regressive methods trained on larger-scale datasets.
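The abstract outlines a two-stage pipeline: image-conditioned continuous text diffusion to produce an intermediate sentence, followed by caption-conditioned image regeneration that supplies a visual regeneration loss. The sketch below is a minimal, hypothetical PyTorch rendering of one training step under that description; every concrete module (the stand-in encoders, denoiser, and regenerator, the loss weight `lambda_vr`, and the feature dimensions) is an illustrative assumption, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MirrorDiffStep(nn.Module):
    """Hypothetical single training step for a MirrorDiff-style framework.

    The encoders, denoiser, and regenerator here are simple stand-ins;
    the paper's actual architectures are not given in the abstract.
    """

    def __init__(self, dim=512, vocab=10000, img_dim=2048):
        super().__init__()
        self.image_encoder = nn.Linear(img_dim, dim)   # stand-in for a pre-trained image encoder
        self.text_encoder = nn.Embedding(vocab, dim)   # stand-in for a pre-trained text encoder
        self.denoiser = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )
        self.to_logits = nn.Linear(dim, vocab)         # maps denoised states to word logits
        self.regenerator = nn.Linear(dim, img_dim)     # stand-in for a text-to-image diffusion model
        self.lambda_vr = 0.1                           # assumed weight on the visual regeneration loss

    def forward(self, image_feats, caption_ids, t_scale):
        img = self.image_encoder(image_feats)          # (B, dim)
        txt = self.text_encoder(caption_ids)           # (B, L, dim)

        # Continuous text diffusion: corrupt the caption embedding at noise level t_scale.
        noise = torch.randn_like(txt)
        noisy_txt = (1 - t_scale) * txt + t_scale * noise

        # Denoise the textual representation conditioned on the image representation.
        cond = img.unsqueeze(1).expand_as(noisy_txt)
        denoised = self.denoiser(torch.cat([noisy_txt, cond], dim=-1))
        diff_loss = F.mse_loss(denoised, txt)

        # Caption loss over the intermediate sentence.
        logits = self.to_logits(denoised)
        cap_loss = F.cross_entropy(logits.flatten(0, 1), caption_ids.flatten())

        # Visual regeneration ("mirror") loss: map the intermediate sentence back to
        # image space and penalize dissimilarity with the input image.
        regen = self.regenerator(denoised.mean(dim=1))  # (B, img_dim)
        vr_loss = 1 - F.cosine_similarity(regen, image_feats, dim=-1).mean()

        return cap_loss + diff_loss + self.lambda_vr * vr_loss
```

Here `t_scale` plays the role of a noise scale in [0, 1]; in the actual method the continuous diffusion would run over many timesteps with a learned noise schedule rather than this single interpolation, and the regenerator would itself be a full diffusion model rather than a linear map.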
Document Type
Conference Proceeding
Date of Publication
6-30-2025
Funding Information
Fundamental Research Funds for the Central Universities (D5000250044, D5000250060) / National Natural Science Foundation of China (62201460, 62302093) / Basic Research Programs of Taicang (TC2023JC22) / Jiangsu Province Natural Science Fund (BK20230833) / Big Data Computing Center of Southeast University / Edith Cowan University
School
School of Science
Copyright
Subscription content
Publisher
Association for Computing Machinery
Comments
Wang, J., Fu, L., Zhu, Y., Jin, Q., Wang, H., Li, Y., Wu, X., & Hu, K. (2025). MirrorDiff: Learning mirror diffusion for image captioning via regeneration. In Proceedings of the 2025 International Conference on Multimedia Retrieval (pp. 1331-1339). Association for Computing Machinery. https://doi.org/10.1145/3731715.3733389