ArgusNet: Understanding 3D scenes more like humans

Author Identifier (ORCID)

Jianxin Li: https://orcid.org/0000-0002-9059-330X

Abstract

In areas like autonomous driving, human-computer interaction, and augmented reality, it is essential for machines to comprehend natural language commands and identify targets within 3D scenes. To this end, this paper introduces and investigates the task of Monocular Multiple-target 3D Visual Grounding (MM-3DVG), which aims to detect multiple 3D targets in a monocular RGB image using natural language descriptions. To address the absence of suitable datasets for this task, we build two comprehensive datasets: MM3DRefer and MT3DRefer. Furthermore, we propose the ArgusNet network architecture, which simulates the visual reasoning process of humans. The network first identifies potential targets with a monocular 3D detector and then links language descriptions to these targets using the proposed Selective Matching Module (SMM). The SMM consists of the Selective Fusion Module (SFM) for multimodal information fusion and the Selective Interaction Module (SIM) for deep feature interaction, where the SIM incorporates our specifically designed GateMamba module. Experimental results demonstrate that ArgusNet significantly outperforms existing methods on multiple datasets, achieving state-of-the-art performance in language-guided multi-target 3D detection from monocular RGB images, a lightweight yet widely available 3D scene representation in practice. The code and datasets are available at: https://github.com/klaygky/ArgusNet.
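The abstract describes a gated fusion of detector features with language features inside the SMM. As a rough intuition only, such a gate can be pictured as a learned sigmoid that blends the two modalities per feature dimension. The sketch below is purely illustrative: the function name, shapes, and weights are assumptions, not the paper's actual SFM/SIM or GateMamba implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(visual, text, w_gate, b_gate):
    """Hypothetical gated multimodal fusion: a sigmoid gate, computed from
    the concatenated features, decides per dimension how much of the visual
    feature versus the text feature to keep."""
    gate = sigmoid(np.concatenate([visual, text], axis=-1) @ w_gate + b_gate)
    # Convex combination: each fused value lies between the two inputs.
    return gate * visual + (1.0 - gate) * text


rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(3, d))                      # 3 candidate targets from a 3D detector
text = np.broadcast_to(rng.normal(size=(1, d)), (3, d))  # one description, shared across candidates
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)

fused = gated_fusion(visual, text, w_gate, b_gate)
print(fused.shape)  # (3, 8)
```

Because the gate output is in (0, 1), the fused feature is an element-wise convex combination of the visual and text features, which is one simple way to let the model "select" how much each modality contributes.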

Document Type

Journal Article

Date of Publication

4-7-2026

Volume

673

Publication Title

Neurocomputing

Publisher

Elsevier

School

School of Business and Law

Funders

National Key R&D Program of China (2023YFB4301800) / National Natural Science Foundation of China Joint Fund Project (U21B2041) / Fundamental Research Funds for the Central Universities (300102244202) / Shaanxi Province Science Foundation Fund (2025JC-YBQN-892, 2025JC-YBQN-902) / China Scholarship Council (202506560029)

Comments

Guo, K., Wei, H., Huang, Y., Song, X., Sun, S., Feng, M., Song, H., Li, J., & Zhang, Y. (2026). ArgusNet: Understanding 3D scenes more like humans. Neurocomputing, 673, 132895. https://doi.org/10.1016/j.neucom.2026.132895

Copyright

Subscription content


Link to publisher version (DOI)

10.1016/j.neucom.2026.132895