MirrorDiff: Learning mirror diffusion for image captioning via regeneration

Abstract

Recently, diffusion models, which have achieved promising progress in text-to-image generation, have also been explored for image captioning. However, these diffusion-based image captioning methods usually suffer from semantic inconsistency between the image content and the textual description, and thus lag behind Auto-Regressive (AR) methods. To this end, in this paper we propose MirrorDiff, a novel dual diffusion-based framework that achieves semantic consistency with a symmetric image-to-text-to-image generation model, which acts like a mirror that maps the original input image into a regenerated image via the generated caption. Specifically, it first utilizes a pre-trained image encoder and a pre-trained text encoder to obtain the image representation and the textual representation, respectively, then forwards the image representation and the noisy textual representation into a continuous diffusion model to output an intermediate sentence. To semantically align the intermediate sentence with the input image, a diffusion-based visual regenerator regenerates the input image conditioned on the intermediate sentence, yielding the proposed visual regeneration loss. Unlike most existing image captioning methods, MirrorDiff is a plug-and-play framework that can be integrated into many previous image captioning methods and can further evaluate the generated sentence via the visual similarity between the input image and the regenerated image. Extensive experiments on the MS COCO dataset show that our method achieves clear improvements over state-of-the-art diffusion-based methods, reaching up to 127.9 CIDEr, and attains competitive performance on multiple evaluation metrics against auto-regressive methods trained on larger-scale datasets.
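To make the pipeline described above concrete, the following is a minimal training-step sketch in a PyTorch style, assuming a standard continuous diffusion interface over text embeddings. All module names (image_encoder, text_encoder, caption_diffusion, visual_regenerator) and the loss weighting are hypothetical placeholders for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a MirrorDiff-style training objective:
# a caption-diffusion loss plus a visual regeneration loss that
# compares the input image with an image regenerated from the caption.

def mirrordiff_loss(image, caption_tokens,
                    image_encoder, text_encoder,
                    caption_diffusion, visual_regenerator,
                    lambda_regen=1.0):
    """One training step; all passed-in modules are assumed interfaces."""
    # 1. Encode the input image and the ground-truth caption.
    img_feat = image_encoder(image)          # (B, D) visual representation
    txt_feat = text_encoder(caption_tokens)  # (B, L, D) textual representation

    # 2. Continuous diffusion over the textual representation:
    #    noise it at a random timestep, then denoise conditioned on the image.
    t = torch.randint(0, caption_diffusion.num_steps, (image.size(0),),
                      device=image.device)
    noise = torch.randn_like(txt_feat)
    noisy_txt = caption_diffusion.q_sample(txt_feat, t, noise)
    denoised_txt = caption_diffusion.denoise(noisy_txt, t, cond=img_feat)
    caption_loss = F.mse_loss(denoised_txt, txt_feat)

    # 3. Visual regeneration: map the intermediate caption representation
    #    back to an image and score its similarity to the input image
    #    in the shared visual feature space.
    regen_image = visual_regenerator(denoised_txt)
    regen_feat = image_encoder(regen_image)
    regen_loss = 1.0 - F.cosine_similarity(img_feat, regen_feat, dim=-1).mean()

    # Total objective: caption denoising plus weighted regeneration term.
    return caption_loss + lambda_regen * regen_loss
```

At inference, the same regeneration path can serve as a plug-and-play scorer: candidate captions from any captioner are ranked by the visual similarity between the input image and the image regenerated from each caption, matching the abstract's description of evaluating generated sentences via regeneration.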

Document Type

Conference Proceeding

Date of Publication

6-30-2025

Funding Information

Fundamental Research Funds for the Central Universities (D5000250044, D5000250060) / National Natural Science Foundation of China (62201460, 62302093) / Basic Research Programs of Taicang (TC2023JC22) / Jiangsu Province Natural Science Fund (BK20230833) / Big Data Computing Center of Southeast University / Edith Cowan University

School

School of Science

Copyright

Subscription content

Publisher

Association for Computing Machinery

Comments

Wang, J., Fu, L., Zhu, Y., Jin, Q., Wang, H., Li, Y., Wu, X., & Hu, K. (2025). MirrorDiff: Learning mirror diffusion for image captioning via regeneration. In Proceedings of the 2025 International Conference on Multimedia Retrieval (pp. 1331-1339). Association for Computing Machinery. https://doi.org/10.1145/3731715.3733389
