A few attention heads for reasoning in multiple languages

We discovered that the reasoning capabilities of cross-lingual Transformers are concentrated in a small set of their attention heads. We also compiled and released a multilingual dataset to encourage research on commonsense reasoning in languages other than English.
Building robust systems that work across many languages is one of today's main challenges in natural language processing (NLP). In a paper published in Findings of ACL 2021, we proposed a new method for commonsense reasoning. Specifically, we compiled XWINO — a cross-lingual dataset of Winograd Schema tasks, evaluated several state-of-the-art methods on it, and introduced a competitive new approach that relies only on self-attention outputs. Analysis of this method reveals an intriguing property of cross-lingual Transformer representations: it turns out that the same attention heads encode reasoning knowledge for different languages.
Winograd Schemas in NLP
Over the history of artificial intelligence, researchers have introduced numerous benchmarks for tracking progress in the field. One of them is the Winograd Schema Challenge, described by Levesque et al. in 2012. This benchmark was initially proposed as an alternative to the famous Turing test and was specifically designed so that it could not be easily solved by statistical methods.
Conceptually, every Winograd Schema is a simple binary choice problem. Given a sentence with two entities and a pronoun, the task is to determine which noun phrase the pronoun refers to. In the example below, the pronoun is "they", and the two candidates are "the town councilors" and "the demonstrators":
Problem: The town councilors refused to give the demonstrators a permit because they feared violence. 
Answer: The town councilors 
While choosing the correct answer is pretty straightforward for humans, the lack of specific clues makes the task much harder for machine learning algorithms. As a result, even modern ML-based language understanding systems that easily solve many other tasks often struggle with this challenge. 
Nevertheless, the emergence of pre-trained Transformer models in deep learning for NLP has led to noticeable improvements on this task. Over the past several years, multiple authors have proposed different strategies for solving the Winograd Schema Challenge. However, it is unclear whether the reported improvements translate to high performance in other languages: these studies have mainly focused on English, and there have been few attempts at a comprehensive multilingual evaluation of commonsense reasoning. We attribute this gap mainly to the lack of a proper benchmark that encompasses multiple languages and contains well-defined examples, similar to XGLUE or XTREME for other NLP tasks.
A multilingual dataset for commonsense reasoning
To resolve this issue, we first compiled XWINO — a multilingual dataset of Winograd Schema tasks in English, French, Japanese, Russian, Chinese and Portuguese, built from datasets published in prior work. Our dataset contains a total of 3,961 examples and can be used to evaluate both in-language performance and cross-lingual generalization of natural language processing models.
One major issue that we faced when compiling XWINO is that the task format varied noticeably across the source datasets, which would make it difficult to evaluate the same method on two languages without language-specific adjustments. We therefore converted all tasks in our dataset to the same schema, keeping the data with minimal changes where possible and filtering it out otherwise. For instance, we fixed minor inconsistencies, such as typos or missing articles, by hand; in some cases, however, the correct answer was not even present in the sentence, and such examples had to be removed.
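For illustration, a unified record could look roughly like the following; the field names here are hypothetical and do not reproduce the actual XWINO schema (see the GitHub repository for the real format):

```python
# A hypothetical illustration of a unified task record; the field names are
# made up for this example and are not the actual XWINO schema.
example = {
    "lang": "en",
    "sentence": "The town councilors refused to give the demonstrators a permit "
                "because they feared violence.",
    "pronoun": "they",
    "candidates": ["The town councilors", "the demonstrators"],
    "answer": 0,  # index of the correct candidate
}
```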
By releasing this benchmark, we hope to encourage research on multilingual commonsense reasoning and facilitate the development of new methods. The data, along with the code for our experiments, is available in the GitHub repository for the paper.
Our method
To provide a baseline for the performance of other models, we designed a straightforward approach. It relies only on the attention head outputs of a multilingual Transformer, training just a linear classifier on top of them. As a result, it does not involve changing the parameters of the pre-trained model and is much faster than finetuning the entire network. More specifically, for a given example, we take each candidate answer and construct a feature vector from the attention that the pronoun pays to this candidate across all attention heads. Next, we take the difference of these vectors for the two candidates and use the result as input to a binary logistic regression classifier.
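To make this concrete, here is a minimal Python sketch of the feature construction using the Hugging Face transformers library. It is not the implementation from the paper: the token index of the pronoun and the token spans of the candidates are assumed to be known in advance, and summing attention over a candidate's subword tokens is just one possible aggregation choice.

```python
# Minimal sketch of attention-based feature extraction (not the exact code from the paper).
# We assume the token index of the pronoun and the token spans of both candidates
# are already known for the tokenized sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_attentions=True)
model.eval()

def candidate_features(sentence, pronoun_idx, candidate_span):
    """Attention from the pronoun token to one candidate, one value per attention head."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.attentions: one tensor of shape (1, num_heads, seq_len, seq_len) per layer
    attn = torch.cat(outputs.attentions, dim=1)[0]  # (num_layers * num_heads, seq, seq)
    start, end = candidate_span
    # Sum the attention mass from the pronoun to the candidate's subword tokens
    return attn[:, pronoun_idx, start:end].sum(dim=-1)  # (num_layers * num_heads,)

def example_features(sentence, pronoun_idx, span_a, span_b):
    """Feature vector of one Winograd example: the difference between the two candidates."""
    return (candidate_features(sentence, pronoun_idx, span_a)
            - candidate_features(sentence, pronoun_idx, span_b))
```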
In the experiments, we evaluated this method and several state-of-the-art approaches on XWINO. We chose two cross-lingual language models to obtain sentence representations — multilingual BERT and XLM-RoBERTa. Despite its simplicity, our method performs competitively with other approaches, especially in a zero-shot scenario: we can train the model on examples from one language and test it on others with similar or even better results! 
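With features built as in the sketch above, zero-shot transfer amounts to fitting the classifier on one language and scoring it on another. The snippet below uses random placeholder arrays purely to show the shape of such an experiment:

```python
# Hypothetical zero-shot setup: train on one language, evaluate on another.
import numpy as np
from sklearn.linear_model import LogisticRegression

n_heads = 384  # e.g. 24 layers x 16 heads for XLM-R Large
rng = np.random.default_rng(0)
# Random placeholders standing in for feature matrices built with example_features above
X_en, y_en = rng.normal(size=(500, n_heads)), rng.integers(0, 2, 500)
X_fr, y_fr = rng.normal(size=(200, n_heads)), rng.integers(0, 2, 200)

clf = LogisticRegression(max_iter=1000).fit(X_en, y_en)
print("zero-shot accuracy:", clf.score(X_fr, y_fr))
```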
Language-agnostic reasoning from attention heads
Because this method uses only a linear classifier on top of pre-trained attention representations, it is pretty easy to analyze the learned weights to determine which attention heads contribute to the prediction the most.  
One of our key findings was that even a tiny subset of attention heads (just the top five, which amounts to 1.3 percent of the overall count for XLM-R Large) is enough for commonsense reasoning in all studied languages. Restricting the classifier to only these five features preserves, and sometimes even improves, the performance.
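Continuing the hypothetical sketch above, such a restriction boils down to ranking heads by the absolute value of their classifier weight and keeping only the top five features:

```python
# Keep only the attention heads with the largest absolute classifier weights
# and retrain on this tiny feature subset (continues the snippet above).
import numpy as np
from sklearn.linear_model import LogisticRegression

top_k = 5
top_heads = np.argsort(np.abs(clf.coef_[0]))[-top_k:]  # clf.coef_ has shape (1, n_heads)

clf_small = LogisticRegression(max_iter=1000).fit(X_en[:, top_heads], y_en)
print("accuracy with top-5 heads:", clf_small.score(X_fr[:, top_heads], y_fr))
```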
We consider this a remarkable result, which aligns with both prior work on the redundancy of attention heads and analyses of multilingual Transformers, yet provides a different perspective. Although masked language models are not explicitly trained to resolve pronouns in text, we can see that they encode specific subtasks of the NLP pipeline in a small subset of their heads. Moreover, these subsets appear to overlap considerably even for typologically different languages.
In addition, we discovered that restricting the set of heads to these top five significantly improves the quality of Masked Attention Score, an unsupervised method that is also based on attention head outputs. This suggests that a better strategy for choosing heads could further improve such methods.
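As a rough illustration of the idea (not the exact Masked Attention Score formula), an unsupervised decision rule of this kind could simply compare the attention mass that the pronoun sends to each candidate over a chosen subset of heads, reusing the candidate_features helper from the earlier sketch:

```python
# A rough unsupervised decision rule in the spirit of attention-score methods,
# restricted to a subset of heads; this is an illustration, not the MAS formula.
def predict_unsupervised(sentence, pronoun_idx, span_a, span_b, heads):
    score_a = candidate_features(sentence, pronoun_idx, span_a)[heads].sum()
    score_b = candidate_features(sentence, pronoun_idx, span_b)[heads].sum()
    return 0 if score_a > score_b else 1  # index of the predicted candidate
```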
Conclusion
In this work, we released XWINO — a dataset of Winograd Schemas in six languages meant to be used as a benchmark of multilingual commonsense reasoning capabilities. We also proposed an easy-to-use yet competitive baseline for this task that uses a linear classifier over attention weights. The analysis of this baseline provides valuable insights into the representations of popular pre-trained cross-lingual models. 
Of course, there are many research questions that one can raise from this point, and we can name just a few of them. First, are the existing datasets sufficient, or do we need more labeled examples for a benchmark that is balanced across languages and language groups? Second, do existing methods for the Winograd Schema Challenge perform well because of cross-lingual generalization or simply because of surface-level similarities? We would be excited to learn about other findings, so don't hesitate to contact us if you conduct research in this area or have questions about our work.