When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach
Tao Ma, Bing Bai, Haozhe Lin, Heyuan Wang, Yu Wang, Lin Luo, Lu Fang
Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless recent advances in imaging technology have enabled the acquisition of gigapixel-level images providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable we introduce a novel dataset named GigaGrounding designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding gigapixel-level resolution significant variations in object scales and the "multi-hop expressions". Furthermore we introduced a simple yet effective grounding approach which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.