MirrorDiff: Learning mirror diffusion for image captioning via regeneration

Abstract

Recently, diffusion models, which have achieved promising progress in text-to-image generation, have also been explored for image captioning. However, these diffusion-based image captioning methods usually suffer from semantic inconsistency between the image content and the textual description, and thus lag behind Auto-Regressive (AR) methods. To this end, in this paper we propose MirrorDiff, a novel dual diffusion-based framework that achieves semantic consistency with a symmetric image-to-text-to-image generation model, which acts like a mirror that maps the original input image into a regenerated image via the generated caption. Specifically, it first utilizes a pre-trained image encoder and a pre-trained text encoder to obtain the image representation and the textual representation, respectively, then forwards the image representation and the noisy textual representation into a continuous diffusion model to output an intermediate sentence. To semantically align the intermediate sentence with the input image, a diffusion-based visual regenerator regenerates the input image conditioned on the intermediate sentence, yielding the proposed visual regeneration loss. Unlike most existing image captioning methods, MirrorDiff is a plug-and-play framework that can be integrated into many previous image captioning methods and can further evaluate the generated sentence via the visual similarity between the input image and the regenerated image. Extensive experiments on the MS COCO dataset show that our method achieves clear improvements over state-of-the-art diffusion-based methods, reaching up to 127.9 CIDEr, and attains competitive performance on multiple evaluation metrics against auto-regressive methods trained on larger-scale datasets.
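To make the pipeline described above concrete, the following is a minimal training-step sketch in a PyTorch style, assuming a standard continuous diffusion interface over text embeddings. All module names (image_encoder, text_encoder, caption_diffusion, visual_regenerator) and the loss weighting are hypothetical placeholders for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a MirrorDiff-style training objective:
# a caption-diffusion loss plus a visual regeneration loss that
# compares the input image with an image regenerated from the caption.

def mirrordiff_loss(image, caption_tokens,
                    image_encoder, text_encoder,
                    caption_diffusion, visual_regenerator,
                    lambda_regen=1.0):
    """One training step; all passed-in modules are assumed interfaces."""
    # 1. Encode the input image and the ground-truth caption.
    img_feat = image_encoder(image)          # (B, D) visual representation
    txt_feat = text_encoder(caption_tokens)  # (B, L, D) textual representation

    # 2. Continuous diffusion over the textual representation:
    #    noise it at a random timestep, then denoise conditioned on the image.
    t = torch.randint(0, caption_diffusion.num_steps, (image.size(0),),
                      device=image.device)
    noise = torch.randn_like(txt_feat)
    noisy_txt = caption_diffusion.q_sample(txt_feat, t, noise)
    denoised_txt = caption_diffusion.denoise(noisy_txt, t, cond=img_feat)
    caption_loss = F.mse_loss(denoised_txt, txt_feat)

    # 3. Visual regeneration: map the intermediate caption representation
    #    back to an image and score its similarity to the input image
    #    in the shared visual feature space.
    regen_image = visual_regenerator(denoised_txt)
    regen_feat = image_encoder(regen_image)
    regen_loss = 1.0 - F.cosine_similarity(img_feat, regen_feat, dim=-1).mean()

    # Total objective: caption denoising plus weighted regeneration term.
    return caption_loss + lambda_regen * regen_loss
```

At inference, the same regeneration path can serve as a plug-and-play scorer: candidate captions from any captioner are ranked by the visual similarity between the input image and the image regenerated from each caption, matching the abstract's description of evaluating generated sentences via regeneration.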

Document Type

Conference Proceeding

Date of Publication

6-30-2025

Funding Information

Fundamental Research Funds for the Central Universities (D5000250044, D5000250060) / National Natural Science Foundation of China (62201460, 62302093) / Basic Research Programs of Taicang (TC2023JC22) / Jiangsu Province Natural Science Fund (BK20230833) / Big Data Computing Center of Southeast University / Edith Cowan University

School

School of Science

Copyright

Subscription content

Publisher

Association for Computing Machinery

Comments

Wang, J., Fu, L., Zhu, Y., Jin, Q., Wang, H., Li, Y., Wu, X., & Hu, K. (2025). MirrorDiff: Learning mirror diffusion for image captioning via regeneration. In Proceedings of the 2025 International Conference on Multimedia Retrieval (pp. 1331-1339). Association for Computing Machinery. https://doi.org/10.1145/3731715.3733389
