Towards explainability of affordance learning in robot vision

Author Identifier

Nima Mirnateghi: https://orcid.org/0000-0002-1814-7452

Syed Mohammed Shamsul Islam: https://orcid.org/0000-0002-3200-2903

Syed Afaq Ali Shah: https://orcid.org/0000-0003-2181-8445

Document Type

Conference Proceeding

Publication Title

Proceedings - 2024 25th International Conference on Digital Image Computing: Techniques and Applications, DICTA 2024

First Page

545

Last Page

552

Publisher

IEEE

School

Centre for Artificial Intelligence and Machine Learning (CAIML)

RAS ID

71854

Funders

Centre for Artificial Intelligence and Machine Learning (CAIML) at Edith Cowan University

Comments

Mirnateghi, N., Islam, S. M. S., & Shah, S. A. A. (2024, November). Towards explainability of affordance learning in robot vision. In 2024 International Conference on Digital Image Computing: Techniques and Applications (DICTA) (pp. 545-552). IEEE. https://doi.org/10.1109/DICTA63115.2024.00085

Abstract

Recent advances in deep learning for robotic vision have yielded remarkable performance for robot-object interaction, including scene understanding and visual affordance learning. Nevertheless, the intrinsic opaque nature of deep neural networks and the consequent lack of explainability pose significant challenges. Understanding how these intelligent systems perceive and justify their decisions remains elusive to human comprehension. Although research efforts have focused extensively on enhancing the explainability of object recognition, achieving explainability in visual affordance learning for intelligent systems remains an ongoing challenge. To address this issue, we propose a novel post-hoc multimodal explainability framework that capitalizes on the emerging synergy between visual and language models. Our proposed framework initially generates a Class Activation Map (CAM) heatmap for the given affordances to provide visual explainability. It then systematically extracts textual explanations from a state-of-the-art Large Language Model (LLM), i.e., GPT-4, using the CAM to enrich the explainability of visual affordance learning. In addition, by harnessing the zero-shot learning capabilities of LLMs, we illustrate their capacity to intuitively articulate the behaviour of intelligent systems in affordance learning tasks. We evaluate the efficacy of our approach on a comprehensive benchmark dataset for large-scale multi-view RGBD visual affordance learning. This dataset comprises 47,210 RGBD images spanning 37 object categories annotated with 15 visual affordance categories. Our experimental findings underscore the promising performance of the proposed framework. The code is available at: https://github.com/ai-voyage/affordance-xai.git.
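The abstract describes a two-stage pipeline: a CAM heatmap gives the visual explanation, and the heatmap is then passed to a vision-capable LLM to obtain a textual explanation. The following is a minimal sketch of that idea, not the authors' implementation (see the linked repository for that): it assumes PyTorch, the pytorch-grad-cam package, and the OpenAI chat API; the ResNet-50 stand-in backbone, the file name mug.png, and the gpt-4o model name are illustrative assumptions.

import base64, io
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget
from pytorch_grad_cam.utils.image import show_cam_on_image
from openai import OpenAI  # requires OPENAI_API_KEY in the environment

# Stand-in backbone; the paper's affordance network would replace this.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
target_layers = [model.layer4[-1]]

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("mug.png").convert("RGB")           # hypothetical input image
rgb = np.array(img.resize((224, 224))).astype(np.float32) / 255.0
input_tensor = preprocess(img).unsqueeze(0)

# Stage 1: visual explanation via a Grad-CAM heatmap for the predicted class.
with GradCAM(model=model, target_layers=target_layers) as cam:
    pred_idx = int(model(input_tensor).argmax(dim=1))
    grayscale_cam = cam(input_tensor=input_tensor,
                        targets=[ClassifierOutputTarget(pred_idx)])[0]
overlay = show_cam_on_image(rgb, grayscale_cam, use_rgb=True)

# Stage 2: textual explanation, querying a vision LLM with the CAM overlay.
buf = io.BytesIO()
Image.fromarray(overlay).save(buf, format="PNG")
b64 = base64.b64encode(buf.getvalue()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",  # model name is an assumption, not taken from the paper
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "The heatmap highlights regions a vision model used to "
                     "predict an affordance (e.g. 'graspable'). Explain in "
                     "plain language why those regions support that affordance."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)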

DOI

10.1109/DICTA63115.2024.00085

Access Rights

subscription content
