Papers accepted to EMNLP 2022
Two papers have been accepted for publication at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2022).
RuCoLA: Russian Corpus of Linguistic Acceptability by Vladislav Mikhailov, Tatiana Shamardina, Max Ryabinin, Alena Pestova, Ivan Smurov, Ekaterina Artemova
RuCoLA is a dataset of Russian sentences labeled with binary acceptability judgements. It includes expert-written sentences from linguistic publications and machine-generated examples. The corpus covers a variety of language phenomena, ranging from syntax and semantics to generative model hallucinations. We release RuCoLA to facilitate the development of methods for identifying errors in natural language, and we create a public leaderboard to track progress on this problem.
Improved grammatical error correction by ranking elementary edits by Alexey Sorokin
We decompose the task of grammatical error correction into two stages: candidate generation and candidate ranking. Any model that produces a list of candidate edits together with their scores can serve as the edit generator, provided its recall is sufficiently high. In the second stage, a binary classification model decides whether each edit is correct; roughly speaking, it answers whether the suggested modification improves the grammaticality of the sentence. The second model is trained on the candidate lists produced in the first stage. We improve over the current state of the art both in English, where the GECToR edit generator is used, and in Russian, where edits are generated by a fine-tuned GPT model.
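The two-stage idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Edit` structure, the `correct_sentence` function, the 0.5 threshold, the greedy handling of overlapping edits, and the toy generator and classifier (with hard-coded scores) are all assumptions made for illustration. In the paper, the generator is GECToR or a fine-tuned GPT model and the second stage is a trained binary classifier.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Edit:
    """Hypothetical elementary edit: replace tokens[start:end] with `replacement`."""
    start: int
    end: int
    replacement: Tuple[str, ...]
    generator_score: float  # score assigned by the stage-one generator


def overlaps(a: Edit, b: Edit) -> bool:
    """True if the two edits touch a common token span."""
    return a.start < b.end and b.start < a.end


def correct_sentence(
    tokens: Sequence[str],
    generate: Callable[[Sequence[str]], List[Edit]],
    classify: Callable[[Sequence[str], Edit], float],
    threshold: float = 0.5,  # assumed decision threshold, not from the paper
) -> List[str]:
    # Stage 1: the generator proposes candidate edits (high recall expected).
    candidates = generate(tokens)
    # Stage 2: the classifier scores each edit; keep those judged correct.
    scored = [(classify(tokens, edit), edit) for edit in candidates]
    accepted = sorted(
        (pair for pair in scored if pair[0] >= threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    # Greedily keep the best-scored edits that do not overlap (an assumption).
    chosen: List[Edit] = []
    for _, edit in accepted:
        if not any(overlaps(edit, kept) for kept in chosen):
            chosen.append(edit)
    # Apply right to left so earlier token indices stay valid.
    result = list(tokens)
    for edit in sorted(chosen, key=lambda e: e.start, reverse=True):
        result[edit.start:edit.end] = edit.replacement
    return result


# Toy stand-ins for the real models; all scores below are made up.
def toy_generator(tokens: Sequence[str]) -> List[Edit]:
    return [
        Edit(1, 2, ("goes",), 0.6),
        Edit(1, 2, ("went",), 0.5),
        Edit(2, 3, ("at",), 0.2),
    ]


TOY_PROBABILITIES = {"goes": 0.4, "went": 0.9, "at": 0.1}


def toy_classifier(tokens: Sequence[str], edit: Edit) -> float:
    return TOY_PROBABILITIES[" ".join(edit.replacement)]


source = ["She", "go", "to", "school", "yesterday"]
corrected = correct_sentence(source, toy_generator, toy_classifier)
print(" ".join(corrected))  # She went to school yesterday
```

In this toy run the generator prefers "goes", but the classifier's score decides which edit is applied, which mirrors the role the second stage plays in the description above.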