Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Published

December 2023

Venue

NeurIPS, 2023

Authors

Alexander Borzunov
HSE University
Yandex
Max Ryabinin
HSE University
Yandex
Artem Chumachenko
Neiro.ai
Dmitry Baranchuk
Yandex
Tim Dettmers
University of Washington
Younes Belkada
Hugging Face
Pavel Samygin
Yandex School of Data Analysis
Colin Raffel
Hugging Face

Paper

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-grade network. This could allow running LLMs efficiently by pooling together the idle compute resources of multiple research groups and volunteers. We address two open problems: (1) how to perform inference and fine-tuning reliably if any device can disconnect abruptly and (2) how to partition LLMs between devices with uneven hardware that join and leave at will. To do so, we develop special fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput. We showcase these algorithms in Petals, a decentralized system that runs Llama 2 (70B) and BLOOM (176B) over the Internet up to 10x faster than offloading for interactive generation. We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents.
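
The load-balancing idea in the abstract can be illustrated with a toy model. The sketch below is not the protocol from the paper or the Petals codebase; it is a simplified greedy heuristic under stated assumptions. The model is treated as a pipeline of equally costly blocks, each server hosts one contiguous span of blocks, and the system throughput is taken to be the throughput of the least-covered block. All names (`block_throughputs`, `system_throughput`, `choose_span`, `join`) and the numbers in the usage note are hypothetical.

```python
from typing import List, Tuple

# A server is (start, end, throughput): it hosts blocks start..end-1.
# This tuple layout is an assumption made for the sketch.
Server = Tuple[int, int, float]


def block_throughputs(num_blocks: int, servers: List[Server]) -> List[float]:
    """Sum the throughput of every server that covers each block."""
    totals = [0.0] * num_blocks
    for start, end, throughput in servers:
        for block in range(start, end):
            totals[block] += throughput
    return totals


def system_throughput(num_blocks: int, servers: List[Server]) -> float:
    """A pipeline is only as fast as its least-covered block."""
    return min(block_throughputs(num_blocks, servers))


def choose_span(num_blocks: int, servers: List[Server], span_len: int) -> Tuple[int, int]:
    """Pick the contiguous span whose blocks currently have the lowest
    combined throughput. Ties go to the earliest start index."""
    totals = block_throughputs(num_blocks, servers)
    best_start = min(
        range(num_blocks - span_len + 1),
        key=lambda s: sum(totals[s:s + span_len]),
    )
    return best_start, best_start + span_len


def join(num_blocks: int, servers: List[Server], span_len: int, throughput: float) -> List[Server]:
    """Return a new server list with the joining server placed greedily."""
    start, end = choose_span(num_blocks, servers, span_len)
    return servers + [(start, end, throughput)]
```

For example, with 8 blocks and servers that each fit 4 blocks, a first server with throughput 10.0 takes blocks 0 to 3, and a second server with throughput 5.0 is steered to the uncovered blocks 4 to 7. A third server with throughput 5.0 then reinforces the same weak span, which raises the bottleneck from 5.0 to 10.0. A real system would also have to handle servers leaving, network latency, and rebalancing, all of which this sketch ignores.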