Towards explainability of affordance learning in robot vision
Author Identifier
Nima Mirnateghi: https://orcid.org/0000-0002-1814-7452
Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903
Syed Afaq Ali Shah: https://orcid.org/0000-0003-2181-8445
Document Type
Conference Proceeding
Publication Title
Proceedings - 2024 25th International Conference on Digital Image Computing: Techniques and Applications, DICTA 2024
First Page
545
Last Page
552
Publisher
IEEE
School
Centre for Artificial Intelligence and Machine Learning (CAIML)
RAS ID
71854
Funders
Centre for Artificial Intelligence and Machine Learning (CAIML) at Edith Cowan University
Abstract
Recent advances in deep learning for robotic vision have yielded remarkable performance for robot-object interaction, including scene understanding and visual affordance learning. Nevertheless, the intrinsically opaque nature of deep neural networks and the consequent lack of explainability pose significant challenges. Understanding how these intelligent systems perceive and justify their decisions remains elusive to human comprehension. Although research efforts have focused extensively on enhancing the explainability of object recognition, achieving explainability in visual affordance learning for intelligent systems remains an ongoing challenge. To address this issue, we propose a novel post-hoc multimodal explainability framework that capitalizes on the emerging synergy between visual and language models. Our proposed framework initially generates a Class Activation Map (CAM) heatmap for the given affordances to provide visual explainability. It then systematically extracts textual explanations from a state-of-the-art Large Language Model (LLM), i.e., GPT-4, using the CAM to enrich the explainability of visual affordance learning. In addition, by harnessing the zero-shot learning capabilities of LLMs, we illustrate their capability to intuitively articulate the behaviour of intelligent systems in affordance learning tasks. We evaluate the efficacy of our approach on a comprehensive benchmark dataset for large-scale multi-view RGBD visual affordance learning. This dataset comprises 47,210 RGBD images spanning 37 object categories annotated with 15 visual affordance categories. Our experimental findings underscore the promising performance of the proposed framework. The code is available at: https://github.com/ai-voyage/affordance-xai.git
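The following is a minimal sketch of the two-stage pipeline the abstract describes (a CAM heatmap for an affordance class, followed by a textual explanation from a vision-capable LLM). It is not the authors' released code (see the linked repository for that): the backbone, class index, image path, prompt wording, and model name "gpt-4o" are illustrative assumptions, and Grad-CAM with a stock torchvision classifier stands in for the paper's affordance network.

    # Hedged sketch of the CAM-then-LLM explanation pipeline (assumptions noted above).
    import base64, io
    import numpy as np
    import torch.nn.functional as F
    from PIL import Image
    from torchvision import models, transforms
    from openai import OpenAI

    def grad_cam(model, layer, x, class_idx):
        """Grad-CAM heatmap for one target class of an image classifier."""
        feats, grads = [], []
        h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
        h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()
        h1.remove(); h2.remove()
        w = grads[0].mean(dim=(2, 3), keepdim=True)            # channel importance weights
        cam = F.relu((w * feats[0]).sum(dim=1, keepdim=True))  # weighted sum of feature maps
        cam = F.interpolate(cam, x.shape[-2:], mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0].detach().cpu().numpy()

    # Stage 1: visual explanation (CAM heatmap for one affordance class).
    model = models.resnet18(weights="IMAGENET1K_V1").eval()    # stand-in backbone, not the paper's model
    prep = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    img = Image.open("cup.png").convert("RGB")                  # hypothetical input image
    x = prep(img).unsqueeze(0)
    cam = grad_cam(model, model.layer4[-1], x, class_idx=0)     # e.g. a "graspable" affordance class

    # Overlay the heatmap on the image so the LLM can reason about the highlighted regions.
    overlay = (0.5 * np.array(img.resize((224, 224)), dtype=float) +
               0.5 * 255 * np.stack([cam, np.zeros_like(cam), 1 - cam], axis=-1))
    buf = io.BytesIO()
    Image.fromarray(overlay.astype(np.uint8)).save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    # Stage 2: textual explanation from a vision-capable LLM (zero-shot prompt).
    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed model name; the paper reports using GPT-4
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": "The heatmap highlights regions the vision model associates with the "
                     "'graspable' affordance. Explain in plain language why these regions "
                     "support that affordance."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]}],
    )
    print(resp.choices[0].message.content)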
DOI
10.1109/DICTA63115.2024.00085
Access Rights
subscription content
Comments
Mirnateghi, N., Islam, S. M. S., & Shah, S. A. A. (2024, November). Towards explainability of affordance learning in robot vision. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (pp. 545-552). IEEE. https://doi.org/10.1109/DICTA63115.2024.00085