Indoor Scene Change Understanding (SCU): Segment, describe, and revert any change
Author Identifier
Mariia Khan: https://orcid.org/0000-0001-6662-4607
David Suter: https://orcid.org/0000-0001-6306-3023
Jumana Abu-Khalaf: https://orcid.org/0000-0002-6651-2880
Document Type
Conference Proceeding
Publication Title
IEEE/RSJ International Conference on Intelligent Robots and Systems
First Page
9777
Last Page
9783
Publisher
IEEE
School
Centre for Artificial Intelligence and Machine Learning (CAIML) / School of Science
Abstract
Understanding scene changes is crucial for embodied AI applications such as visual room rearrangement, where the agent must revert changes by restoring objects to their original locations or states. Understanding the visual changes between two scenes, pre- and post-rearrangement, encompasses two tasks: scene change detection (locating changes) and image difference captioning (describing changes). Previous methods, focused on sequential 2D images, have addressed these tasks separately, but combining them is important. Therefore, we propose a new Scene Change Understanding (SCU) task for simultaneous change detection and description. Moreover, we go beyond generating language descriptions of changes and aim to generate rearrangement instructions that a robotic agent can follow to revert them. To solve this task, we propose a novel method, EmbSCU, which compares instance-level masks of changed objects (for 53 frequently seen indoor object classes) before and after changes and generates rearrangement language instructions for the agent. EmbSCU is built on our Segment Any Object Model (SAOMv2), a fine-tuned version of the Segment Anything Model (SAM) adapted to obtain instance-level object masks for both foreground and background objects in indoor embodied environments. EmbSCU is evaluated on our own dataset of sequential 2D image pairs before and after changes, collected from the Ai2Thor simulator. The proposed framework achieves promising results in both change detection and change description. Moreover, EmbSCU demonstrates positive generalization results on real-world scenes without using any real-life data during training. The dataset and the code are available here.
DOI
10.1109/IROS58592.2024.10801354
Access Rights
subscription content
Comments
Khan, M., Qiu, Y., Cong, Y., Rosenhahn, B., Suter, D., & Abu-Khalaf, J. (2024, October). Indoor Scene Change Understanding (SCU): Segment, describe, and revert any change. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 9777-9783). IEEE. https://doi.org/10.1109/IROS58592.2024.10801354