Extending the WILDS Benchmark for Unsupervised Adaptation
Shiori Sagawa · Pang Wei Koh · Tony Lee · Irena Gao · Sang Michael Xie · Kendrick Shen · Ananya Kumar · Weihua Hu · Michihiro Yasunaga · Henrik Marklund · Sara Beery · Etienne David · Ian Stavness · Wei Guo · Jure Leskovec · Kate Saenko · Tatsunori Hashimoto · Sergey Levine · Chelsea Finn · Percy Liang
Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development, we provide an open-source package that automates data loading and contains the model architectures and methods used in this paper. Code and leaderboards are available at https://wilds.stanford.edu.
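To make the self-training family mentioned above concrete, the following is a minimal, illustrative sketch of pseudo-label self-training under a distribution shift. It is not the implementation benchmarked in the paper and does not use the WILDS package. The 1-D toy data, the nearest-centroid classifier, and the names `fit_centroids`, `predict`, `self_train`, `accuracy`, and `min_margin` are all assumptions introduced here for illustration.

```python
from statistics import mean

# Hypothetical toy data: labeled source points and unlabeled target points.
# The target features are the source features shifted by +1.125.
SRC_X = [-1.25, -1.0, -0.75, 0.75, 1.0, 1.25]
SRC_Y = [0, 0, 0, 1, 1, 1]
TGT_X = [-0.125, 0.125, 0.375, 1.875, 2.125, 2.375]
TGT_Y = [0, 0, 0, 1, 1, 1]  # held out; used only for evaluation


def fit_centroids(xs, ys):
    """Fit a nearest-centroid classifier: one mean per class label."""
    return {c: mean([x for x, y in zip(xs, ys) if y == c]) for c in set(ys)}


def predict(centroids, x):
    """Return (label, margin), where margin is the distance gap between
    the runner-up centroid and the nearest centroid (a crude confidence)."""
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    return dists[0][1], dists[1][0] - dists[0][0]


def self_train(src_x, src_y, tgt_x, rounds=5, min_margin=1.0):
    """Refit on labeled source data plus confidently pseudo-labeled target data."""
    centroids = fit_centroids(src_x, src_y)
    for _ in range(rounds):
        keep_x, keep_y = [], []
        for x in tgt_x:
            label, margin = predict(centroids, x)
            if margin >= min_margin:  # keep only confident pseudo-labels
                keep_x.append(x)
                keep_y.append(label)
        centroids = fit_centroids(src_x + keep_x, src_y + keep_y)
    return centroids


def accuracy(centroids, xs, ys):
    """Fraction of points whose predicted label matches the true label."""
    correct = sum(predict(centroids, x)[0] == y for x, y in zip(xs, ys))
    return correct / len(xs)


if __name__ == "__main__":
    source_only = fit_centroids(SRC_X, SRC_Y)
    adapted = self_train(SRC_X, SRC_Y, TGT_X)
    print("source-only target accuracy:", accuracy(source_only, TGT_X, TGT_Y))
    print("self-trained target accuracy:", accuracy(adapted, TGT_X, TGT_Y))
```

On this toy data, the source-only classifier labels 4 of the 6 target points correctly and the self-trained one labels 5 of 6. The adaptation helps but does not fully close the gap, and this says nothing about how such methods behave on the real WILDS datasets.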