Yandex Research at BigScience
From 2021 to 2022, Yandex Research collaborated with hundreds of scientists within the BigScience workshop to produce BLOOM, the world’s largest open multilingual language model. In this post, we'll explain how the BigScience project resulted in the BLOOM model and how it could advance research in machine learning.
What is BigScience
The BigScience workshop was inspired by massive scientific endeavors like CERN, where open collaborations facilitate the creation of valuable knowledge for the entire research community and humanity in general.
BigScience was an international, volunteer-led collaboration: more than 1,000 researchers from over 70 countries and 250 organizations, including Yandex Research, joined to produce a very large multilingual language model named BLOOM and a multilingual text dataset called ROOTS.
One of the challenges BigScience had to address was collecting diverse and representative training data. Non-English models have traditionally underperformed their well-established English-language counterparts. We collected a 341-billion-token dataset from books, scientific papers, podcasts, and other sources across many languages, including Swahili, Bengali, and Vietnamese.
Overall, it took us over a year of planning and training to obtain the final version of BLOOM.
Why BLOOM is important
BLOOM is an autoregressive large language model trained on vast amounts of text using the industrial-scale computational resources of the Jean Zay supercomputer. With 176 billion parameters, BLOOM generates text in 46 natural languages and 13 programming languages. For many of these languages, such as Spanish, French, and Arabic, BLOOM is the first language model ever created with over 100 billion parameters.
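To illustrate what "autoregressive" means here: the model repeatedly predicts the next token from everything generated so far and appends it to the context. The sketch below is a minimal toy, not BLOOM's actual implementation; the hypothetical `toy_next_token_probs` lookup table stands in for the next-token distribution that a real model computes with its 176 billion parameters.

```python
# Minimal sketch of autoregressive (left-to-right) generation.
# The toy "model" is a hypothetical fixed lookup table; a real LLM
# like BLOOM computes these probabilities with a transformer.

def toy_next_token_probs(context):
    """Return a {token: probability} table given the tokens so far."""
    table = {
        ("<s>",): {"The": 0.6, "A": 0.4},
        ("<s>", "The"): {"model": 0.7, "data": 0.3},
        ("<s>", "The", "model"): {"generates": 0.8, "is": 0.2},
        ("<s>", "The", "model", "generates"): {"text": 0.9, "</s>": 0.1},
    }
    return table.get(tuple(context), {"</s>": 1.0})

def generate(max_tokens=10):
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_tokens):
        probs = toy_next_token_probs(tokens)
        # Greedy decoding: always pick the most likely next token.
        next_token = max(probs, key=probs.get)
        if next_token == "</s>":  # end-of-sequence: stop generating
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(" ".join(generate()))  # -> The model generates text
```

Real systems usually replace greedy decoding with sampling or beam search, but the outer loop — predict, append, repeat — is the same.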
To ensure the stability of BLOOM's training, we ran a series of preliminary experiments at a smaller scale of roughly 100 billion parameters. The goal was to encounter as many instabilities as possible, compare different methods for mitigating them, and document our findings.
One of BigScience's goals was to make large language models more accessible. For the first time, the training process of such a model was fully transparent. This will make it easier for researchers to investigate the model's performance and behavior and to replicate similar models in the future. BLOOM could also lead to a fresh wave of AI-driven products and research from organizations that did not previously have the resources to train or even use such large language models.
For example, another project that came out of BigScience is Petals, a decentralized platform for running BLOOM and other large models at home. Developed by several researchers and engineers from Yandex, Petals allows anyone to access the largest open models available today and to contribute their own computing resources for the benefit of all.