Abstract
Multisource remote sensing data has gained significant attention in land use classification. However, extracting both local and global features from different modalities and fusing them to exploit their complementary information remains a substantial challenge. In this paper, we address this challenge by exploring transformers for simultaneous local and global feature extraction, with cross-modality learning to improve the integration of complementary information from hyperspectral imaging (HSI) and light detection and ranging (LiDAR) data. We propose a spatial feature enhancer module (SFEM) that efficiently captures features across spectral bands while preserving spatial integrity for downstream learning tasks. Building on this, we introduce a cross-modal convolutional transformer that extracts both local and global features using a multi-scale convolutional embedded encoder (MSCE). The convolutional layers embedded in the encoder facilitate the blending of local and global features, and cross-modal learning captures complementary information from the HSI and LiDAR modalities. Evaluation on the Trento dataset highlights the effectiveness of the proposed approach, which achieves an average accuracy of 99.04% and surpasses comparable methods.
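As a rough illustration of the cross-modal learning idea mentioned in the abstract, the sketch below shows generic scaled dot-product cross-attention in which tokens from one modality query tokens from another. It is not the authors' implementation: the function name `cross_attention`, the token counts, the embedding size, and the random projection matrices are all assumptions chosen for the example, and the abstract does not specify which modality supplies the queries.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Generic cross-attention: `queries` attend over `context`.

    Hypothetical helper for illustration only; the paper's actual
    encoder (MSCE) is not described in enough detail here to reproduce.
    """
    q = queries @ w_q                      # (n_q, d)
    k = context @ w_k                      # (n_c, d)
    v = context @ w_v                      # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

# Assumed toy sizes: 6 HSI patch tokens, 4 LiDAR patch tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
n_hsi, n_lidar, dim = 6, 4, 8
hsi_tokens = rng.normal(size=(n_hsi, dim))
lidar_tokens = rng.normal(size=(n_lidar, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

# HSI tokens query the LiDAR tokens (an assumed direction; it could be reversed).
fused, attn = cross_attention(hsi_tokens, lidar_tokens, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each HSI token receives a weighted mix of LiDAR values, which is one common way to let one modality draw on complementary information from another.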
Document Type
Conference Proceeding
Date of Publication
1-1-2025
Volume
3
School
School of Science
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Publisher
Science and Technology Publications
Identifier
Muhammad Zia Ur Rehman: https://orcid.org/0000-0001-9531-1941
Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903
David Blake: https://orcid.org/0000-0003-3747-2960
Recommended Citation
Ur Rehman, M., Islam, S., Ulhaq, A., Blake, D., & Janjua, N. (2025). Towards robust multimodal land use classification: A convolutional embedded transformer. DOI: https://doi.org/10.5220/0013191300003912
Comments
Rehman, M. Z. U., Islam, S. M. S., UlHaq, A., Blake, D., & Janjua, N. (2025). Towards robust multimodal land use classification: A convolutional embedded transformer. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP (pp. 143-153). SciTePress. https://doi.org/10.5220/0013191300003912