Author Identifier

Muhammad Zia Ur Rehman: https://orcid.org/0000-0001-9531-1941

Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903

David Blake: https://orcid.org/0000-0003-3747-2960

Document Type

Conference Proceeding

Publication Title

Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications

Volume

3

First Page

143

Last Page

153

Publisher

Science and Technology Publications

School

School of Science

Publication Unique Identifier

10.5220/0013191300003912

Comments

Rehman, M. Z. U., Islam, S. M. S., UlHaq, A., Blake, D., & Janjua, N. (2025). Towards robust multimodal land use classification: A convolutional embedded transformer. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP (pp. 143-153). SciTePress. https://doi.org/10.5220/0013191300003912

Abstract

Multisource remote sensing data has gained significant attention in land use classification. However, effectively extracting both local and global features from various modalities and fusing them to leverage their complementary information remains a substantial challenge. In this paper, we address this by exploring the use of transformers for simultaneous local and global feature extraction while enabling cross-modality learning to improve the integration of complementary information from hyperspectral imaging (HSI) and LiDAR data. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Additionally, cross-modal learning is incorporated to effectively capture complementary information from the HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of the proposed approach, achieving an average accuracy of 99.04% and surpassing comparable methods.
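
As a rough illustration of the cross-modal learning idea mentioned in the abstract, the sketch below shows generic single-head scaled dot-product cross-attention in NumPy, where tokens from one modality act as queries and tokens from the other supply keys and values. It is not the authors' SFEM or MSCE implementation: the function names, token counts, embedding size, and random inputs are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Generic single-head cross-attention (hypothetical helper).

    query_tokens:   (n_q, d) tokens from one modality (e.g. HSI patches)
    context_tokens: (n_c, d) tokens from the other modality (e.g. LiDAR patches)
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    # Each query token scores every context token; rows are normalised to sum to 1.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Random stand-in data; the sizes are arbitrary assumptions, not values from the paper.
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8
hsi_tokens = rng.normal(size=(num_tokens, dim))
lidar_tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

fused, weights = cross_attention(hsi_tokens, lidar_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch each HSI token is re-expressed as a weighted mixture of LiDAR-derived values, which is one common way to let one modality attend to complementary information in another.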

DOI

10.5220/0013191300003912

Creative Commons License

Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License
