Author Identifier

Mariia Khan: http://orcid.org/0000-0001-6662-4607

Date of Award

2025

Document Type

Thesis

Publisher

Edith Cowan University

Degree Name

Doctor of Philosophy (Joint-ECU home)

School

School of Science

First Supervisor

Jumana Abu-Khalaf

Second Supervisor

David Suter

Third Supervisor

Bodo Rosenhahn

Fourth Supervisor

Qiu Yue

Abstract

Embodied AI explores intelligent agents that learn through interaction with their environment, aiming to replicate human-like learning processes. Achieving this requires agents capable of understanding a scene through various sensors, reasoning about their actions, and reacting accordingly. These abilities are necessary for domestic service robots to assist humans in their day-to-day activities. Embodied AI tasks include, but are not limited to, visual exploration, visual navigation, instruction following, and embodied question answering, all of which typically assume static (unchanging) environments where objects do not move over time. This thesis addresses one of the most challenging Embodied AI tasks, visual room rearrangement, focusing on its Walkthrough (Scene Understanding) and Scene Change Detection stages.

The main purpose of the rearrangement task is for the agent to change the location or state of one or more objects in the environment from an initial state to a desired goal state [1]. In this task, the environment undergoes continual change as a result of the agent's actions over time. The rearrangement task presents several challenges, including understanding dynamic scenes through object recognition, localization, and tracking in constantly changing embodied environments, as well as detecting and describing scene changes across the different stages of the task. To address these challenges, novel methods are proposed and evaluated on data collected in the Ai2Thor simulator.

First, this work introduces four datasets (SAOM, M3T, EmbSCU, and PanoSCU) to support eight embodied research tasks: single-view and panoramic object detection, single-view and panoramic segmentation, single-view and panoramic change understanding, embodied object tracking, and change reversal.

For the Scene Understanding stage of the rearrangement task, this work proposes a real-to-simulation fine-tuning strategy for the Segment Anything Model (SAM). This includes the development of SAOMv1 and SAOMv2 for single-view object segmentation and the PanoSAM model for panoramic object segmentation. Furthermore, the M3T-Round method is proposed, enabling multi-class, multi-instance, and multi-view object tracking in embodied AI scenes.

In the Scene Change Detection stage of the rearrangement task, this thesis proposes methods for both single-view and panoramic Scene Change Understanding (SCU) tasks. The EmbSCU method not only detects changes but also describes them and generates language rearrangement instructions that allow robotic agents to revert the changes. The panoramic SCU task extends these capabilities to full-scene panoramas, capturing a broader range of changes in the scene. Through the experiments, the challenges and limitations of current methods for panoramic change captioning are highlighted.

This work advances embodied AI by enhancing a robot's perception, memory, and planning abilities, providing a foundation for intelligent agents to interact with dynamic embodied environments.
