ArgusNet: Understanding 3D scenes more like humans

Author Identifier (ORCID)

Jianxin Li: https://orcid.org/0000-0002-9059-330X

Abstract

In areas like autonomous driving, human-computer interaction, and augmented reality, it is essential for machines to comprehend natural language commands and identify targets within 3D scenes. To this end, this paper introduces and investigates the task of Monocular Multiple-target 3D Visual Grounding (MM-3DVG), which aims to detect multiple 3D targets in a monocular RGB image using natural language descriptions. To address the absence of suitable datasets for this task, we build two comprehensive datasets: MM3DRefer and MT3DRefer. Furthermore, we propose the ArgusNet network architecture, which simulates the visual reasoning process of humans. The network first identifies potential targets with a monocular 3D detector and then links language descriptions to these targets using the proposed Selective Matching Module (SMM). The SMM consists of the Selective Fusion Module (SFM) for multimodal information fusion and the Selective Interaction Module (SIM) for deep feature interaction, where the SIM incorporates our specifically designed GateMamba module. Experimental results demonstrate that ArgusNet significantly outperforms existing methods on multiple datasets, achieving state-of-the-art performance in language-guided multi-target 3D detection from monocular RGB images, a lightweight yet widely available 3D scene representation in practice. The code and datasets are available at: https://github.com/klaygky/ArgusNet.
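The abstract describes a gated fusion of detector features with language features inside the SMM. As a rough intuition only, such a gate can be pictured as a learned sigmoid that blends the two modalities per feature dimension. The sketch below is purely illustrative: the function name, shapes, and weights are assumptions, not the paper's actual SFM/SIM or GateMamba implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(visual, text, w_gate, b_gate):
    """Hypothetical gated multimodal fusion: a sigmoid gate, computed from
    the concatenated features, decides per dimension how much of the visual
    feature versus the text feature to keep."""
    gate = sigmoid(np.concatenate([visual, text], axis=-1) @ w_gate + b_gate)
    # Convex combination: each fused value lies between the two inputs.
    return gate * visual + (1.0 - gate) * text


rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(3, d))                      # 3 candidate targets from a 3D detector
text = np.broadcast_to(rng.normal(size=(1, d)), (3, d))  # one description, shared across candidates
w_gate = rng.normal(size=(2 * d, d)) * 0.1
b_gate = np.zeros(d)

fused = gated_fusion(visual, text, w_gate, b_gate)
print(fused.shape)  # (3, 8)
```

Because the gate output is in (0, 1), the fused feature is an element-wise convex combination of the visual and text features, which is one simple way to let the model "select" how much each modality contributes.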

Document Type

Journal Article

Date of Publication

4-7-2026

Volume

673

Publication Title

Neurocomputing

Publisher

Elsevier

School

School of Business and Law

Funders

National Key R&D Program of China (2023YFB4301800) / National Natural Science Foundation of China Joint Fund Project (U21B2041) / Fundamental Research Funds for the Central Universities (300102244202) / Shaanxi Province Science Foundation Fund (2025JC-YBQN-892, 2025JC-YBQN-902) / China Scholarship Council (202506560029)

Comments

Guo, K., Wei, H., Huang, Y., Song, X., Sun, S., Feng, M., Song, H., Li, J., & Zhang, Y. (2026). ArgusNet: Understanding 3D scenes more like humans. Neurocomputing, 673, 132895. https://doi.org/10.1016/j.neucom.2026.132895

Copyright

Subscription content


Link to publisher version (DOI)

10.1016/j.neucom.2026.132895