Machine Translation

Translation services, such as Google Translate or Yandex.Translate, often encounter atypical language in their translation queries, including slang, profanities, poor grammar, mistakes in spelling and punctuation, and emojis.

This type of language poses a significant challenge to modern translation systems, which are typically trained on corpora with a more ‘standard’ use of language. Models need to be robust to atypical language use and provide high-quality translations under these types of distributional shifts. It is also important to know when a model is challenged by the query and take appropriate action, such as refusing to make a prediction, giving multiple possible translations, or flagging the example for further examination.

Data & Task

This competition will feature a text dataset for machine translation — the freely available WMT’20 English-Russian corpus, which covers a variety of domains, but primarily focuses on parliamentary and news data. For the most part, this data is grammatically and orthographically correct and language use is formal. This is representative of the type of data used, for example, to build the Yandex.Translate systems.

In addition to the training data, we provide two in-domain (that is, not distributionally shifted) datasets for development and evaluation. The freely available English-Russian Newstest’20 will be used as the development set. The hold-out evaluation data will use a corpus of news data collected from the Global Voices news service and manually translated for this competition by expert human translators (annotations for this data will be kept private for the duration of the competition).

The distributionally shifted evaluation data will be taken from the Reddit corpus prepared for the WMT’19 robustness challenge. Russian target annotations are not available, so we pass the data through a two-stage process, where mistakes in spelling, grammar, and punctuation are corrected first, and then the source-side English sentences are translated into Russian by expert translators.

The goal is to translate a sentence in a source language into a target language. It is important for models to be robust to atypical language use and provide high-quality translations under a distribution shift.


For the machine translation task we use the GLEU metric, a per-sample variant of BLEU, to assess predictive performance. GLEU will be evaluated using the standard NLTK implementation. If your model returns more than a single hypothesis per source sentence, you must submit a confidence score for each hypothesis; the scores must be non-negative and sum to one. In this case we will use confidence-weighted GLEU (eGLEU) as the performance metric.
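The confidence weighting can be illustrated with a small standard-library sketch. It is not the official scorer: the competition uses NLTK's GLEU implementation, whereas `sentence_gleu_sketch` below is a hypothetical stand-in that computes the same kind of pooled 1- to 4-gram overlap (the smaller of precision and recall) on pre-tokenized text. Tokenization, normalization, and the function names are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(tokens, min_len=1, max_len=4):
    """Count all n-grams of length min_len..max_len in a token list."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu_sketch(reference, hypothesis):
    """Illustrative GLEU: min(precision, recall) over pooled n-grams, in [0, 1]."""
    ref = ngram_counts(reference)
    hyp = ngram_counts(hypothesis)
    matched = sum((ref & hyp).values())  # clipped n-gram matches
    total = max(sum(ref.values()), sum(hyp.values()))
    return matched / total if total else 0.0


def egleu(reference, hypotheses):
    """Confidence-weighted GLEU over (tokens, confidence) pairs."""
    confidences = [conf for _, conf in hypotheses]
    if any(c < 0 for c in confidences) or abs(sum(confidences) - 1.0) > 1e-6:
        raise ValueError("confidences must be non-negative and sum to one")
    return sum(conf * sentence_gleu_sketch(reference, tokens)
               for tokens, conf in hypotheses)


reference = "the cat sat".split()
hypotheses = [("the cat sat".split(), 0.5), ("the cat".split(), 0.5)]
print(egleu(reference, hypotheses))  # 0.75
```

The sketch returns values in [0, 1]; multiply by 100 to match the 0 to 100 scale implied by the 100-eGLEU error used for the competition score.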

To assess the joint quality of uncertainty estimates and robustness, we will use the area under an error retention curve (R-AUC) as the competition score. Here, predictions are replaced by ground-truth targets in order of descending uncertainty, which decreases the error, measured as 100-eGLEU. The area under the curve can be reduced either by improving predictive performance or by improving the uncertainty-based rank ordering, so that bigger errors are replaced first. This will be the target metric used to rank participants’ submissions to the translation track. All metrics will be evaluated jointly on in-domain and shifted data.
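A minimal sketch of how such a retention curve and its area might be computed is given here. It assumes per-sample errors (for example 100-eGLEU) and one uncertainty value per sample, treats a replaced prediction as having zero error, and integrates with the trapezoid rule. The function names are hypothetical, and the official implementation on GitHub may differ in details such as tie handling.

```python
def retention_curve(errors, uncertainties):
    """Mean error when only the k least-uncertain predictions are retained.

    Entry k (k = 0..n) assumes the n - k most uncertain predictions have
    been replaced by ground truth and therefore contribute zero error.
    """
    n = len(errors)
    if n == 0 or n != len(uncertainties):
        raise ValueError("need equally sized, non-empty inputs")
    order = sorted(range(n), key=lambda i: uncertainties[i])  # ascending
    curve = [0.0]
    running = 0.0
    for i in order:
        running += errors[i]
        curve.append(running / n)
    return curve


def r_auc(errors, uncertainties):
    """Trapezoid-rule area under the retention curve (lower is better)."""
    curve = retention_curve(errors, uncertainties)
    n = len(errors)
    return sum((curve[k] + curve[k + 1]) / 2 for k in range(n)) / n


errors = [10.0, 20.0, 30.0]
well_ranked = r_auc(errors, [0.1, 0.2, 0.3])    # largest error is most uncertain
badly_ranked = r_auc(errors, [0.3, 0.2, 0.1])   # largest error is least uncertain
print(well_ranked < badly_ranked)  # True
```

With identical errors, the ranking that places the largest errors at the highest uncertainty yields the smaller area, which is the behaviour the score is meant to reward.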

For the participants’ benefit and to provide greater insight, we report additional metrics for assessing models. Shifted-data detection performance will be evaluated using ROC-AUC. We also introduce two F1-based metrics, called F1-AUC and F1 @ 95%, which jointly evaluate the quality of uncertainty and predictions. These are also computed using 100-eGLEU as the error metric. These metrics do not affect leaderboard ranking.
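For intuition about the detection metric, ROC-AUC can be read as the probability that a randomly chosen shifted sample receives a higher uncertainty than a randomly chosen in-domain sample. The sketch below computes it by direct pairwise comparison, counting ties as one half. The function name and the input layout are assumptions for illustration, and the pairwise loop is only practical for small inputs.

```python
def detection_roc_auc(uncertainties, is_shifted):
    """ROC-AUC for separating shifted from in-domain samples by uncertainty."""
    shifted = [u for u, s in zip(uncertainties, is_shifted) if s]
    in_domain = [u for u, s in zip(uncertainties, is_shifted) if not s]
    if not shifted or not in_domain:
        raise ValueError("need at least one shifted and one in-domain sample")
    # Count pairs where the shifted sample is ranked as more uncertain.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in shifted for q in in_domain)
    return wins / (len(shifted) * len(in_domain))


# Shifted samples all have higher uncertainty: perfect separation.
print(detection_roc_auc([0.1, 0.2, 0.8, 0.9], [False, False, True, True]))  # 1.0
```

Multiply the result by 100 to express it as a percentage.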

Further details of all metrics are provided on GitHub and in our paper.

Getting Started

Links to the data, as well as detailed descriptions of the data, format and metrics are provided on our GitHub repository.

To help you get started we have also provided examples and made our baseline models available for download.

Submit Results

Required format: .json lines, 100MB maximum file size

Each line in your file should be a separate JSON object with the following example structure:

line1 = {"ID": "001",
		"hypos": [hypo1, hypo2, hypo3],
		"uncertainty": 9000}

hypo1 = {"text": "Кошка сидела на столе",
		"confidence": 0.3}

Here, hypos is an array which contains up to 5 separate translation hypotheses. Participants can submit multiple hypotheses, as translation is an inherently multimodal task. Each hypothesis is a dictionary which contains the text field and the confidence field. Confidence scores for all hypotheses in a particular sample are positive numbers which must add up to 1. Confidence scores represent the ranking of all hypotheses.

Uncertainty is the global uncertainty in the set of predictions for a particular input sentence. It is important that UTF-8 encoding is switched off. Note that the targets have been normalized using the perl normalizer and cleaner, as outlined in the example on GitHub.
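As a rough illustration of the format, the sketch below builds one submission line and checks the constraints stated above (between one and five hypotheses, positive confidences that sum to 1). The helper name, the `1e-6` tolerance, and the `ensure_ascii` parameter are assumptions and not part of the competition tooling. The default shown is simply Python's own default, so the encoding requirement should be checked against the example on GitHub before submitting.

```python
import json


def make_submission_line(sample_id, hypotheses, uncertainty, ensure_ascii=True):
    """Serialize one sample as a single JSON line.

    hypotheses is a list of (text, confidence) pairs.
    """
    if not 1 <= len(hypotheses) <= 5:
        raise ValueError("between 1 and 5 hypotheses are expected")
    confidences = [conf for _, conf in hypotheses]
    if any(c <= 0 for c in confidences):
        raise ValueError("confidence scores must be positive")
    if abs(sum(confidences) - 1.0) > 1e-6:  # tolerance is an assumption
        raise ValueError("confidence scores must add up to 1")
    record = {
        "ID": sample_id,
        "hypos": [{"text": text, "confidence": conf}
                  for text, conf in hypotheses],
        "uncertainty": uncertainty,
    }
    # json.dumps emits no newlines by default, so the result is one line.
    return json.dumps(record, ensure_ascii=ensure_ascii)


line = make_submission_line(
    "001",
    [("Кошка сидела на столе", 0.3), ("Кот сидел на столе", 0.7)],
    9000,
)
print(line)
```

Writing one such line per source sentence, each followed by a newline, produces a file in the JSON lines layout described above.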

Submissions will be checked for the correct format and number of lines. Additionally, you must name your method, so that it will be easier for you to distinguish it from your other submissions in the leaderboard.


Rank | Team name / username | Method Name | Score | BLEU | eGLEU | AUC-F1 | F1 @ 95% | ROC-AUC (%) | Date Submitted