Research outputs 2022 to 2026

Spatially-aware speaker for vision-and-language navigation instruction generation

Muraleekrishna Gopinathan, Edith Cowan University
Martin Masek, Edith Cowan University
Jumana Abu-Khalaf, Edith Cowan University
David Suter, Edith Cowan University

Author Identifier

Muraleekrishna Gopinathan: https://orcid.org/0000-0002-1550-1129

Martin Masek: https://orcid.org/0000-0001-8620-6779

Jumana Abu-Khalaf: https://orcid.org/0000-0002-6651-2880

David Suter: https://orcid.org/0000-0001-6306-3023

Document Type

Conference Proceeding

Publication Title

Proceedings of the Annual Meeting of the Association for Computational Linguistics

First Page

13601

Last Page

13614

Publisher

ACL Anthology

School

Centre for Artificial Intelligence and Machine Learning (CAIML)

Comments

Gopinathan, M., Masek, M., Abu-Khalaf, J., & Suter, D. (2024). Spatially-aware speaker for vision-and-language navigation instruction generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 13601–13614). Bangkok, Thailand: Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.acl-long.734

Abstract

Embodied AI aims to develop robots that can understand and execute human language instructions, as well as communicate in natural language. On this front, we study the task of generating highly detailed navigational instructions for embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in how they refer to objects and landmarks. Existing speaker models learn strategies that exploit the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator (Speaker model) that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an adversarial setting to avoid the systematic bias introduced by language evaluation metrics. Empirically, our method outperforms existing instruction generation models when evaluated using standard metrics. Our code is available at https://github.com/gmuraleekrishna/SAS.

DOI

10.18653/v1/2024.acl-long.734

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Included in

Artificial Intelligence and Robotics Commons
