Audio-Visual Grounding Referring Expression for Robotic Manipulation

Yefei Wang, Kaili Wang, Yi Wang, Di Guo, Huaping Liu, Fuchun Sun

People commonly use referring expressions in everyday dialogue to pick out a specific target. In this paper, we develop a novel task of audio-visual grounding referring expression for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in the given manipulation instruction, and the corresponding manipulations are implement...