Author Identifier
Muhammad Zia Ur Rehman: https://orcid.org/0000-0001-9531-1941
Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903
David Blake: https://orcid.org/0000-0003-3747-2960
Document Type
Conference Proceeding
Publication Title
Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications
Volume
3
First Page
143
Last Page
153
Publisher
Science and Technology Publications
School
School of Science
Publication Unique Identifier
10.5220/0013191300003912
Abstract
Multisource remote sensing data has gained significant attention in land use classification. However, extracting both local and global features from multiple modalities and fusing them to exploit their complementary information remains a substantial challenge. In this paper, we address this challenge by using transformers for simultaneous local and global feature extraction, with cross-modality learning to improve the integration of complementary information from hyperspectral imagery (HSI) and light detection and ranging (LiDAR) data. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer, which extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features. Cross-modal learning is then applied to capture complementary information from the HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of the proposed approach, which achieves an average accuracy of 99.04% and surpasses comparable methods.
DOI
10.5220/0013191300003912
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Comments
Rehman, M. Z. U., Islam, S. M. S., UlHaq, A., Blake, D., & Janjua, N. (2025). Towards robust multimodal land use classification: A convolutional embedded transformer. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP (pp. 143-153). SciTePress. https://doi.org/10.5220/0013191300003912