DINOBot: Robot Manipulation via Retrieval and Alignment with Vision Foundation Models
Norman Di Palo, Edward Johns
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this object to align...
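As an illustration of the retrieval step described above, the following minimal sketch picks the stored demonstration object whose image-level feature vector is closest to that of a novel object. It is not the authors' implementation. The use of cosine similarity, the function names `cosine_similarity` and `retrieve_most_similar`, and the toy three-dimensional vectors are all assumptions made for this example; in practice the vectors would be features extracted from a DINO-trained Vision Transformer.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_most_similar(query: Sequence[float],
                          memory: Dict[str, Sequence[float]]) -> str:
    """Return the name of the stored object most similar to the query.

    `memory` maps a demonstration object's name to its image-level feature.
    The similarity metric (cosine) is an assumption of this sketch.
    """
    if not memory:
        raise ValueError("memory of demonstrations is empty")
    return max(memory, key=lambda name: cosine_similarity(query, memory[name]))


# Toy vectors standing in for image-level features (hypothetical values).
demo_memory = {
    "mug": [0.9, 0.1, 0.0],
    "hammer": [0.0, 0.2, 0.9],
    "bottle": [0.5, 0.8, 0.1],
}
query_feature = [0.8, 0.2, 0.1]  # feature of the novel object

best_match = retrieve_most_similar(query_feature, demo_memory)
print(best_match)  # mug
```

A nearest-neighbour lookup of this kind needs no additional training: adding a new demonstration only requires storing one more feature vector in the memory.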