Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions

Yui Iioka, Yu Yoshida, Yuiga Wada, Shumpei Hatanaka, Komei Sugiura

In this study, we aim to develop a model that comprehends a natural language instruction (e.g., “Go to the living room and get the nearest pillow to the radio art on the wall”) and generates a segmentation mask for the target everyday object. The task is challenging because it requires (1) the understanding of the referring expressions for multiple objects in the instruction, (2) the prediction of...