SFAN: Selective Filter and Alignment Network for cross-modal retrieval
Author Identifier
Jianxin Li: https://orcid.org/0000-0002-9059-330X
Document Type
Journal Article
Publication Title
IEEE Transactions on Neural Networks and Learning Systems
Publisher
IEEE
School
School of Business and Law
Funders
National Key Research and Development Program of China (2023YFB4301800) / Natural Science Foundation of Shaanxi Province General Program Project (2025JC-YBMS-673) / New Generation Information Technology Innovation Project (2023IT080) / Basic Scientific Research Funds of Central Universities (300102404101)
Abstract
Effectively bridging the gap between the visual and textual modalities remains a key challenge in cross-modal retrieval. Fine-grained matching approaches improve performance by precisely aligning salient region features in the visual modality with word embeddings in the textual modality. However, filtering out irrelevant features (e.g., irrelevant background regions and non-meaningful prepositions) in both modalities, effectively and efficiently, remains a significant challenge. Furthermore, capturing key cross-modal relationships while minimizing interference from misalignment is crucial for effective cross-modal retrieval. In this work, we propose a novel approach, the selective filter and alignment network (SFAN), to tackle these challenges. First, we propose modality-specific selective filter modules (SFMs) to selectively and implicitly filter out redundant information within each modality. We then propose a selective alignment module (SAM), based on state-space models (SSMs), to selectively capture key correspondences and reduce the disturbance of irrelevant associations. Finally, we use a fusion operation to combine the embeddings from both the SFMs and the SAM to derive the final embeddings for similarity computation. Extensive experiments on the Flickr30k, MS-COCO, and MSR-VTT datasets show that the proposed SFAN can effectively learn robust patterns, outperforming state-of-the-art (SOTA) cross-modal retrieval methods by a wide margin.
DOI
10.1109/TNNLS.2025.3577292
Access Rights
subscription content
Comments
Huang, Y., Liu, Z., Sun, S., Cui, N., & Li, J. (2025). SFAN: Selective Filter and Alignment Network for cross-modal retrieval. IEEE Transactions on Neural Networks and Learning Systems. Advance online publication. https://doi.org/10.1109/TNNLS.2025.3577292