Large-scale machine learning

Today, training the most powerful models often requires significant computational resources. Our research aims to make large-scale training more efficient and accessible to the entire machine learning community.


Publications

  • Secure Distributed Training at Scale

    Large-scale machine learning
    Eduard Gorbunov
    Alexander Borzunov
    Michael Diskin
    Max Ryabinin
    ICML

    Many areas of deep learning benefit from using increasingly larger neural networks trained on public data, as is the case for pre-trained models for NLP and computer vision. Training such models requires a lot of computational resources (e.g., HPC clusters) that are not available to small research groups and independent researchers. One way to address this is for several smaller groups to pool their computational resources together and train a model that benefits all participants. Unfortunately, in this case, any participant can jeopardize the entire training run by sending incorrect updates, deliberately or by mistake. Training in the presence of such peers requires specialized distributed training algorithms with Byzantine tolerance. These algorithms often sacrifice efficiency by introducing redundant communication or passing all updates through a trusted server, making it infeasible to apply them to large-scale deep learning, where models can have billions of parameters. In this work, we propose a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency.
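    The sketch below illustrates why Byzantine tolerance requires more than naive averaging: a plain mean of peer updates can be shifted arbitrarily by a single malicious participant, whereas a robust aggregator such as the coordinate-wise median bounds its influence. This is only a minimal illustration of the general idea, not the protocol proposed in the paper; the function name `robust_aggregate` is ours.

    ```python
    import torch

    def robust_aggregate(peer_updates: list[torch.Tensor]) -> torch.Tensor:
        """Illustrative Byzantine-robust aggregation of peer gradient updates.

        The coordinate-wise median limits how much a minority of malicious or
        faulty peers can move the result. This is NOT the paper's protocol,
        just a toy example of the failure mode it addresses.
        """
        stacked = torch.stack(peer_updates)        # shape: (num_peers, num_params)
        return stacked.median(dim=0).values        # robust to outlier updates

    # Toy example: seven honest peers and one attacker sending a huge update.
    honest = [torch.randn(5) * 0.01 for _ in range(7)]
    byzantine = [torch.full((5,), 1e6)]
    print(robust_aggregate(honest + byzantine))    # stays close to the honest updates
    print(torch.stack(honest + byzantine).mean(0)) # plain mean is destroyed by the attacker
    ```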

  • Training Transformers Together

    Computer vision, Large-scale machine learning, Generative models
    Alexander Borzunov
    Max Ryabinin
    Tim Dettmers
    Quentin Lhoest
    Lucile Saulnier
    Michael Diskin
    Yacine Jernite
    Thomas Wolf
    NeurIPS Demos

    The infrastructure necessary for training state-of-the-art models is becoming overly expensive, which makes training such models affordable only to large corporations and institutions. Recent work proposes several methods for training such models collaboratively, i.e., by pooling together hardware from many independent parties and training a shared model over the Internet. In this demonstration, we collaboratively trained a text-to-image transformer similar to OpenAI DALL-E. We invited the viewers to join the ongoing training run, giving them instructions on how to contribute using their available hardware. We explained how to address the engineering challenges associated with such a training run (slow communication, limited memory, uneven performance between devices, and security concerns) and discussed how the viewers could set up collaborative training runs themselves. Finally, we showed that the resulting model generates images of reasonable quality on a number of prompts.
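    A recurring ingredient of such collaborative runs is gradient accumulation across slow, heterogeneous peers: each participant processes as many small microbatches as its hardware allows, and an optimizer step is taken only once the collaboration has jointly reached a large target batch. The sketch below shows this pattern from the perspective of a single peer in plain PyTorch; `average_gradients_with_peers` is a hypothetical placeholder for the actual peer-to-peer averaging, and the batch sizes are made up for illustration.

    ```python
    import torch
    import torch.nn.functional as F

    TARGET_GLOBAL_BATCH = 4096  # examples the whole collaboration processes per optimizer step
    LOCAL_MICROBATCH = 8        # what this particular device fits in memory

    def average_gradients_with_peers(model: torch.nn.Module) -> None:
        # Hypothetical placeholder: a real run would exchange and average
        # accumulated gradients with the other participants over the Internet.
        pass

    model = torch.nn.Linear(32, 10)                 # toy model for illustration
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    processed = 0  # simplified single-peer view; a real run tracks progress across all peers

    for _ in range(2 * TARGET_GLOBAL_BATCH // LOCAL_MICROBATCH):
        inputs = torch.randn(LOCAL_MICROBATCH, 32)
        targets = torch.randint(0, 10, (LOCAL_MICROBATCH,))
        F.cross_entropy(model(inputs), targets).backward()  # gradients accumulate locally
        processed += LOCAL_MICROBATCH
        if processed >= TARGET_GLOBAL_BATCH:        # collaboration reached the target batch
            average_gradients_with_peers(model)
            optimizer.step()
            optimizer.zero_grad()
            processed = 0
    ```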

  • Distributed Deep Learning In Open Collaborations

    Computer vision, Natural language processing, Large-scale machine learning
    Michael Diskin
    Alexey Bukhtiyarov
    Max Ryabinin
    Lucile Saulnier
    Quentin Lhoest
    Anton Sinitsin
    Dmitry Popov
    Dmitry Pyrkin
    Maxim Kashirin
    Alexander Borzunov
    Albert Villanova del Moral
    Denis Mazur
    Ilia Kobelev
    Yacine Jernite
    Thomas Wolf
    Gennady Pekhimenko
    NeurIPS

    Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with nearly 50 participants.
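    A core building block of such collaborative training is periodically averaging model state across whichever peers are currently online. The minimal sketch below shows plain parameter averaging over a static torch.distributed process group; the framework proposed in the paper instead forms averaging groups adaptively, so that peers with uneven hardware can join, leave, or fail without stalling the run.

    ```python
    import torch
    import torch.distributed as dist

    def average_parameters(model: torch.nn.Module) -> None:
        """Average model parameters across all processes in the current group.

        Minimal sketch: assumes dist.init_process_group() has already been
        called and all peers stay online, unlike the fault-tolerant, adaptive
        averaging used for open collaborations.
        """
        world_size = dist.get_world_size()
        with torch.no_grad():
            for param in model.parameters():
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM)  # sum over peers
                param.data /= world_size                           # turn the sum into a mean
    ```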