Machine Translation
Translation services, such as Google Translate or Yandex.Translate, often encounter atypical and informal language in their translation queries, including slang, profanities, poor grammar, spelling and punctuation mistakes, and emojis.
This type of language poses a significant challenge to modern translation systems, which are typically trained on corpora with a more ‘standard’ use of language. Models need to be robust to atypical language use and provide high-quality translations under these kinds of distributional shift. It is also important to know when a model is challenged by a query, so that appropriate action can be taken, such as refusing to make a prediction, offering multiple possible translations, or flagging the example for further examination.
Data & Task
This competition will feature a text dataset for machine translation — the freely available WMT’20 English-Russian corpus, which covers a variety of domains, but primarily focuses on parliamentary and news data. For the most part, this data is grammatically and orthographically correct and language use is formal. This is representative of the type of data used, for example, to build the Yandex.Translate systems.
In addition to the training data, we provide two in-domain (non-distributionally shifted) datasets for development and evaluation. The freely available English-Russian Newstest’20 will be used as the development set. The held-out evaluation data will be a corpus of news data collected from the Global Voices news service and manually annotated for this competition by expert human translators (annotations for this data will be kept private for the duration of the competition).
The distributionally shifted evaluation data will be taken from the Reddit corpus prepared for the WMT’19 robustness challenge. Russian target annotations are not available, so we pass the data through a two-stage process, where mistakes in spelling, grammar, and punctuation are corrected first, and then the source-side English sentences are translated into Russian by expert translators.
The goal is to translate a sentence in a source language into a target language. Models must remain robust to atypical language use and provide high-quality translations under distributional shift.
Metrics
For the machine translation task we use the GLEU metric, a per-sample variant of BLEU, to assess predictive performance. GLEU will be evaluated using the standard NLTK implementation. If your model returns more than a single hypothesis per source sentence, you must submit confidence scores for each hypothesis that are non-negative and sum to one. In this case we will use confidence-weighted GLEU (eGLEU) as the error metric.
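As an illustration, the sketch below computes per-sample GLEU with NLTK and a confidence-weighted eGLEU over multiple hypotheses. The function and variable names are ours for illustration only and are not the official scoring code:

```python
# Minimal sketch of per-sample GLEU and confidence-weighted eGLEU scoring.
# Names here are illustrative, not the official competition scorer.
from nltk.translate.gleu_score import sentence_gleu

def egleu(reference, hypotheses, confidences):
    """Confidence-weighted GLEU for one source sentence.

    reference   -- the reference translation as a token list
    hypotheses  -- list of hypothesis translations, each a token list
    confidences -- non-negative weights over the hypotheses, summing to one
    """
    assert abs(sum(confidences) - 1.0) < 1e-6, "confidences must sum to one"
    return sum(c * sentence_gleu([reference], hyp)
               for c, hyp in zip(confidences, hypotheses))

# Example: two hypotheses with confidences 0.7 and 0.3.
ref = "он пошёл в магазин".split()
hyps = ["он пошёл в магазин".split(), "он ходил в магазин".split()]
print(egleu(ref, hyps, [0.7, 0.3]))
```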
To assess the joint quality of uncertainty estimates and robustness, we will use the area under an error retention curve (R-AUC) as the competition score. Here, predictions are replaced by ground-truth targets in order of descending uncertainty, which progressively decreases the error, measured as 100 − eGLEU. The area under the curve can be reduced either by improving predictive performance or by improving the uncertainty-based rank-ordering, so that larger errors are replaced first. This is the target metric used to rank participants’ submissions to the translation track. All metrics will be evaluated jointly on in-domain and shifted data.
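A hedged sketch of the error retention curve and its area is shown below, assuming per-sample errors of 100 − eGLEU and per-sample uncertainty scores. The implementation details and names are illustrative rather than the official evaluation code:

```python
# Illustrative error retention curve: the most uncertain predictions are
# successively replaced by ground truth (zero error), and the mean error is
# traced against the fraction of predictions retained.
import numpy as np

def retention_auc(errors, uncertainties):
    """Area under the error retention curve (lower is better).

    errors        -- per-sample error, e.g. 100 - eGLEU, shape (N,)
    uncertainties -- per-sample uncertainty scores, shape (N,)
    """
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    order = np.argsort(uncertainties)            # most certain first
    sorted_errors = errors[order]
    # retained_error[i]: mean error over the whole set when only the i most
    # certain predictions are kept; the replaced samples contribute zero.
    retained_error = np.concatenate(([0.0], np.cumsum(sorted_errors))) / n
    retention_fraction = np.linspace(0.0, 1.0, n + 1)
    return np.trapz(retained_error, retention_fraction)

# Example with made-up per-sample GLEU scores and uncertainties.
errors = 100.0 - np.array([60.0, 30.0, 80.0])
uncertainties = np.array([0.2, 0.9, 0.1])
print(retention_auc(errors, uncertainties))
```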
For the participants’ benefit and to provide greater insight, we report additional metrics for assessing models. Shifted-data detection performance will be evaluated using ROC-AUC. We also introduce two F1-based metrics, F1-AUC and F1 @ 95%, which jointly evaluate the quality of uncertainty estimates and predictions. These are also computed using 100 − eGLEU as the error metric. These metrics do not affect leaderboard ranking.
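For illustration, shifted-data detection can be scored as a binary classification problem using scikit-learn's roc_auc_score, treating the model's uncertainty as the detection score. The labels and scores below are made-up examples, not competition data:

```python
# Hedged sketch of shifted-data detection scoring with ROC-AUC.
# Label 1 marks shifted (out-of-domain) examples, 0 in-domain; higher
# uncertainty should indicate shift.
from sklearn.metrics import roc_auc_score

domain_labels = [0, 0, 1, 1, 0, 1]                   # 1 = distributionally shifted
uncertainties = [0.1, 0.3, 0.8, 0.7, 0.2, 0.9]
print(roc_auc_score(domain_labels, uncertainties))   # 1.0 for a perfect ranking
```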
Further details of all metrics are provided on GitHub and in our paper.
Leaderboard
| Rank | Team / Username | Method Name | Score (R-AUC eGLEU) | BLEU | eGLEU | F1-AUC | F1 @ 95% | ROC-AUC (%) | Date Submitted |
|---|---|---|---|---|---|---|---|---|---|