Shifts Challenge at NeurIPS 2021

Machine learning (ML) is essential for innovations such as autonomous vehicle (AV) technology, weather forecasting and machine translation. However, applying ML creates its own challenges. For instance, a mismatch between training and deployment data, known as distributional shift, can degrade model performance, and the degradation grows with the degree of shift. This is especially important in applications with strict safety requirements.

In an ideal world, ML models would perform reliably well under a wide range of conditions. For machine translation, this means models that translate formal and colloquial texts equally well; for weather prediction, it means accurate predictions far into the future and across a range of different climates. For AV technology, prediction models trained in sunny Tel Aviv must be able to anticipate the intentions of drivers in Moscow during the snowy winter.

At Yandex, we encountered the challenges of distributional shift on the day of our very first AV-technology demonstration: heavy snow had fallen the night before, changing the test range landscape beyond recognition. Though it was the first time that our self-driving cars had come across snow, they proved to be robust, skillfully navigating the environment and successfully completing their tasks. We continue to observe distributional shift as we test our self-driving cars in different cities, countries and weather conditions, and even on new platforms such as our autonomous delivery robot.

Yandex, alongside researchers from the universities of Oxford and Cambridge and as part of the NeurIPS conference on machine learning, is launching the Shifts Challenge to raise awareness of distributional shift and accelerate the development of robust models capable of providing accurate estimates when navigating uncertain situations.

Professor Mark Gales, who heads up Cambridge University’s collaboration in the Shifts Challenge: “As deep-learning approaches become more powerful, they are being applied in ever more interesting and diverse areas. It is increasingly important for these systems to ‘know when they don’t know’, to prevent bad decisions. Through participation in the global Shifts Challenge, researchers have an unprecedented opportunity to evaluate on large-scale, real-world data their models’ ability to measure confidence in their own predictions.”

As AV technology has strict safety requirements, we chose to focus on it in the Shifts Challenge. Shifts Challenge participants will have access to the largest AV dataset in the industry to date, courtesy of the Yandex Self-driving Group (Yandex SDG). Yandex SDG tests its AV technology in six cities located in the United States, Israel and Russia, and has collected data in all types of weather conditions, resulting in an impressively large and diverse dataset that includes 600,000 scenes, or more than 1,600 hours of driving. The data features tracks of cars and pedestrians in the vicinity of the vehicle, including parameters like location coordinates, velocity, acceleration and more, but does not contain any imagery showing personal information such as license plates or images of pedestrians. By releasing such a comprehensive real-life dataset to researchers and developers all around the world, we aim to accelerate the global development of safe and reliable AV technology.

Andrey Malinin, Yandex senior research scientist and Shifts Challenge lead organizer: “The main obstacle to the development of robust models which yield accurate uncertainty estimates is the availability of large, diverse datasets which contain examples of distributional shift from real, industrial tasks. Most research in the area has been done on small image classification datasets with synthetic distributional shift. Unfortunately, promising results on these datasets often don't generalize to large-scale industrial applications, such as autonomous vehicles. We aim to address this issue by releasing a large dataset with examples of real distributional shift on tasks which are different from image classification. We hope that this will set the new standard in uncertainty estimation and robustness research.”

The Shifts Challenge includes two additional competition tracks focused on machine translation and weather forecasting, with datasets from the Yandex.Translate and Yandex.Weather services. Translation track participants will be tasked with building models to deal with both literary text and everyday online speech, while the weather forecasting track will test participants’ models on data from different times of the year. While translation and weather prediction are not high-risk applications, they are still scientifically relevant and provide useful insights into the nature of robustness to distributional shift and uncertainty estimation.

Each challenge track is split into two stages: development and evaluation. In the development stage, participants are provided with training and development data which they use to create and assess their models. A development leaderboard is provided for each track which participants can use to follow their progress and compare their models. In the evaluation stage, participants are provided with additional data to evaluate their proposed solutions and compete for top place on the evaluation leaderboard.

July - early October 2021: Development Stage

Participants are provided with training and development data. In the AV track, for example, participants train their models to predict the possible trajectories of surrounding cars and pedestrians, as well as assess the uncertainty in each prediction. The available training data includes hundreds of thousands of scenes from real self-driving trips around Moscow, collected during the daytime and in good weather conditions. Challenge participants also have access to a ‘development’ dataset, which includes scenes from other cities across Israel and Russia and features a variety of weather types. Using this data, participants can see how well their models cope with distributional shift and make predictions across a range of unfamiliar environments.
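
To make the idea of assessing uncertainty more concrete, here is a minimal Python sketch of one way a participant might check whether a model's uncertainty estimates are useful. It is an illustration only and not the challenge's official metric. The function names, the per-scene error values and the uncertainty scores are all hypothetical. The sketch keeps only the most confident predictions and tracks how the average error changes as less confident ones are added back in; if the uncertainty estimates are informative, the error over the confident subset should stay low.

```python
from typing import List, Sequence


def retention_curve(errors: Sequence[float],
                    uncertainties: Sequence[float]) -> List[float]:
    """Mean error over the k most confident predictions, for k = 1..n.

    `errors` holds one prediction error per scene and `uncertainties`
    holds the model's own uncertainty score for the same scenes.
    Both inputs are hypothetical and used here for illustration.
    """
    if len(errors) != len(uncertainties):
        raise ValueError("errors and uncertainties must have the same length")
    if len(errors) == 0:
        raise ValueError("need at least one prediction")

    # Order scenes from most confident (lowest uncertainty) to least confident.
    order = sorted(range(len(errors)), key=lambda i: uncertainties[i])

    curve = []
    running_total = 0.0
    for k, index in enumerate(order, start=1):
        running_total += errors[index]
        curve.append(running_total / k)
    return curve


def mean_retained_error(curve: Sequence[float]) -> float:
    """Average height of the retention curve; lower is better."""
    return sum(curve) / len(curve)


# Made-up errors for four scenes: two easy, two hard.
scene_errors = [0.2, 0.3, 2.0, 3.5]
# Uncertainty that rises with the error (informative).
informative = [0.1, 0.2, 0.8, 0.9]
# Uncertainty that falls as the error rises (misleading).
misleading = [0.9, 0.8, 0.2, 0.1]

print(mean_retained_error(retention_curve(scene_errors, informative)))
print(mean_retained_error(retention_curve(scene_errors, misleading)))
```

Running the same comparison separately on scenes that match the training data and on scenes from the development dataset is one simple way to see how much both accuracy and uncertainty quality change under distributional shift.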

October 17 - 31, 2021: Evaluation Stage

Participants are provided with a new ‘evaluation’ dataset which contains examples that match the training data, as well as examples that are mismatched to both the training data and the development data from the previous stage. During this period, participants adjust and fine-tune their models, with the final deadline for submission on October 31. At the end of the evaluation stage, the top-scoring participants will present their solutions to the competition organizers, who will then verify that the models comply with competition rules.
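
Because the evaluation data mixes matched and mismatched examples, one simple sanity check is whether a model tends to be more uncertain on the mismatched ones. The Python sketch below is an illustration under stated assumptions and does not describe how the organizers score submissions. The function name and the uncertainty values are hypothetical. It computes the probability that a randomly chosen mismatched example receives a higher uncertainty score than a randomly chosen matched one, which is equivalent to the area under the ROC curve when uncertainty is used as a shift detector.

```python
from typing import Sequence


def shift_detection_auroc(matched_uncertainty: Sequence[float],
                          shifted_uncertainty: Sequence[float]) -> float:
    """Chance that a shifted example is scored as more uncertain than a matched one.

    Returns 1.0 when every shifted example has higher uncertainty than
    every matched example, and about 0.5 when uncertainty carries no
    information about the shift. Ties count as half a win.
    """
    if len(matched_uncertainty) == 0 or len(shifted_uncertainty) == 0:
        raise ValueError("both groups need at least one example")

    wins = 0.0
    for shifted in shifted_uncertainty:
        for matched in matched_uncertainty:
            if shifted > matched:
                wins += 1.0
            elif shifted == matched:
                wins += 0.5
    return wins / (len(matched_uncertainty) * len(shifted_uncertainty))


# Made-up uncertainty scores for illustration only.
matched_scores = [0.1, 0.2, 0.3]
shifted_scores = [0.25, 0.6, 0.9]
print(shift_detection_auroc(matched_scores, shifted_scores))
```

The pairwise comparison is quadratic in the number of examples, which is fine for a small check like this one; a rank-based implementation would be preferable on a dataset with hundreds of thousands of scenes.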

Organizers will verify submissions during November, and competition results will be announced on November 30, 2021. Evaluation set references and metadata will also be released at this point.

We are excited to share our experience with researchers all over the world and invite them to join this unique competition. The training and development datasets are already available on the challenge website and participants may join the competition right away. In October, we will release the evaluation datasets on which the participants will compete for top place. The challenge results will be revealed at the end of November at the NeurIPS conference, and participants will present their solutions at the NeurIPS 2021 Shifts Challenge Workshop, where the creators of the best models will be awarded cash prizes.

We wish all Shifts Challenge participants the very best of luck!