ArgusNet: Understanding 3D scenes more like humans
Author Identifier (ORCID)
Jianxin Li: https://orcid.org/0000-0002-9059-330X
Abstract
In areas such as autonomous driving, human-computer interaction, and augmented reality, machines must comprehend natural language commands and identify the referred targets within 3D scenes. To this end, this paper introduces and investigates the task of Monocular Multiple targets 3D Visual Grounding (MM-3DVG), which aims to detect multiple 3D targets in a monocular RGB image guided by natural language descriptions. To address the absence of suitable datasets for this task, we build two comprehensive datasets: MM3DRefer and MT3DRefer. Furthermore, we propose the ArgusNet architecture, which simulates human visual reasoning: it first identifies potential targets with a monocular 3D detector, then links language descriptions to these targets via the proposed Selective Matching Module (SMM). The SMM consists of the Selective Fusion Module (SFM) for multimodal information fusion and the Selective Interaction Module (SIM) for deep feature interaction, where the SIM incorporates our specifically designed GateMamba module. Experimental results demonstrate that ArgusNet significantly outperforms existing methods on multiple datasets, achieving state-of-the-art performance in language-guided multi-target 3D detection from monocular RGB images, a lightweight yet widely available 3D scene representation in practice. The code and datasets are available at: https://github.com/klaygky/ArgusNet.
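The gated cross-modal fusion idea mentioned in the abstract (a gate deciding how much language information to blend into each visual feature) can be sketched in miniature. This is an illustrative, hypothetical sketch only: the function names, the element-wise sigmoid gate, and the plain-Python feature vectors are our assumptions for exposition; the actual SMM/GateMamba design (including its state-space components) is described in the paper and the linked repository.

```python
import math

def sigmoid(x):
    """Standard logistic function, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual_feat, text_feat, gate_weights):
    """Element-wise gated fusion of two feature vectors.

    For each dimension, a learned weight (here passed in directly) produces
    a gate in (0, 1) that decides how much of the text feature to blend
    into the visual feature. With gate g, the output is g*text + (1-g)*visual,
    so every fused value lies between the two inputs.
    """
    assert len(visual_feat) == len(text_feat) == len(gate_weights)
    gates = [sigmoid(w * (v + t))
             for w, v, t in zip(gate_weights, visual_feat, text_feat)]
    return [g * t + (1.0 - g) * v
            for g, v, t in zip(gates, visual_feat, text_feat)]
```

With zero gate weights the gate is exactly 0.5 in every dimension, so the fused vector is the midpoint of the visual and text features; a trained gate would instead open or close per dimension depending on how informative the language cue is.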
Document Type
Journal Article
Date of Publication
4-7-2026
Volume
673
Publication Title
Neurocomputing
Publisher
Elsevier
School
School of Business and Law
Funders
National Key R&D Program of China (2023YFB4301800) / National Natural Science Foundation of China Joint Fund Project (U21B2041) / Fundamental Research Funds for the Central Universities (300102244202) / Shaanxi Province Science Foundation Fund (2025JC-YBQN-892, 2025JC-YBQN-902) / China Scholarship Council (202506560029)
Copyright
subscription content
Comments
Guo, K., Wei, H., Huang, Y., Song, X., Sun, S., Feng, M., Song, H., Li, J., & Zhang, Y. (2026). ArgusNet: Understanding 3D scenes more like humans. Neurocomputing, 673, 132895. https://doi.org/10.1016/j.neucom.2026.132895