Author Identifier

Muraleekrishna Gopinathan

http://orcid.org/0000-0002-1550-1129

Date of Award

2025

Document Type

Thesis

Publisher

Edith Cowan University

Degree Name

Doctor of Philosophy

School

School of Science

First Supervisor

Jumana Abu-Khalaf

Second Supervisor

David Suter

Third Supervisor

Martin Masek

Abstract

Embodied AI is a challenging but exciting field in which a robot learns to interact with human living spaces to perform various tasks. This thesis studies the embodied navigation problem, in which a robotic agent navigates a previously unseen indoor environment to complete a challenging task. In particular, the Vision-and-Language Navigation (VLN) task requires a robot to navigate based on a descriptive natural-language instruction. This thesis aims to improve VLN agents in four key aspects: their understanding of the environment, training via additional data, correcting navigational errors, and predicting the layout of the environment for better planning.

First, we address the planning aspect of VLN. We introduce our What Is Near (WIN) method, which enhances navigation planning by predicting local neighbourhood maps using prior knowledge of living spaces.

Next, we study the reverse problem of VLN, in which instructions are generated from trajectory demonstrations. Our Spatially-Aware Speaker (SAS) model attends to panoramic visual context and action history to decode instructions. To enhance training, we use a Path Mixing dataset derived from the existing expert-annotated dataset and apply adversarial training to improve instruction variety.

We observe that ambiguity in instructions and environments leads to navigation errors and lost agents. Our StratXplore method selects the optimal navigation frontier by evaluating all available options stored in the agent's memory based on novelty, recency, and instruction alignment.

Finally, we aim to minimise the sim-to-real gap in VLN by focusing on environment mapping in a realistic indoor simulator. This research uses the BenchBot simulator, which features photorealistic environments, a continuous action space, and realistic sensors, to map objects in indoor environments under varying conditions and sensor noise. A 2D-3D fusion pipeline is developed to evaluate state-of-the-art 3D detection models across different simulated environments. Experimental results from each study show that our methods improve upon existing work.

This thesis is an encouraging step towards realising intelligent social robots with applications in healthcare, education, and industry.

DOI

10.25958/f2b4-ts49
