Petals: decentralized inference and finetuning of large language models

Large language models are among the most significant recent advances in machine learning. Still, leveraging these models can be difficult: offloading and quantization have limitations, and third-party APIs are less flexible. As an alternative solution, we propose Petals, an open-source decentralized system (showcased this week at the ACL 2023 Demonstrations track) allowing anybody to run large models or even adapt them using the idle resources of volunteers. In this post, you will learn the motivation behind the system, its underlying ideas, and its advantages compared to other ways of using large models.
Petals was developed as a part of the BigScience collaboration by engineers and researchers from Yandex Research, HSE University, University of Washington, Hugging Face, ENS Paris-Saclay, and Yandex School of Data Analysis.

Background: open LLMs and methods of running them

Since 2020, we have seen large language models (LLMs) like GPT-3 rapidly improve their capabilities, sometimes gaining emergent properties such as in-context learning. In 2022 and early 2023, many open-access alternatives to proprietary LLMs were released: notable examples of this trend include BLOOM, OPT, LLaMA, as well as YaLM developed by Yandex. However, running these models efficiently is still an engineering challenge: a model with over 170 billion parameters needs over 340 gigabytes of GPU memory just to store its weights in FP16 precision, which exceeds the capacity of any single accelerator.
From a user perspective, an easy solution is to use APIs, where the model is hosted by an external provider that charges for requests to the LLM. While this approach requires no expertise in model serving, it is also the least flexible one: API maintainers usually do not allow inspecting the internal states of the neural network, which can be helpful for analyzing it. Moreover, the provider may phase the model out of service entirely, which makes it especially difficult to conduct reproducible research on top of such APIs.
In contrast, offloading the model weights to larger local storage (such as RAM or SSD) grants full control over the model. However, even if your hardware has enough memory to hold the weights, the latency of this approach is significantly higher: since the weights have to be streamed to the GPU for every forward pass, generating a single token with BLOOM-176B via offloading takes more than 5 seconds because of this data transfer bottleneck. Such a delay might be acceptable for batch processing but not for interactive applications: hence, we need something that is transparent yet fast enough.
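For intuition, here is a rough back-of-the-envelope estimate of that delay. It is a hedged calculation, not a measurement: we assume that all FP16 weights must be streamed from RAM to the GPU at every decoding step and that the aggregate host-to-GPU bandwidth is about 64 GB/s (for example, two PCIe 4.0 x16 links).
# Rough, hedged estimate of per-token latency for RAM offloading of BLOOM-176B.
# Assumptions: every decoding step streams all FP16 weights to the GPU over
# ~64 GB/s of aggregate host-to-GPU bandwidth (e.g., two PCIe 4.0 x16 links).
num_params = 176e9
weights_gb = num_params * 2 / 1e9        # FP16 = 2 bytes per parameter, ~352 GB
bandwidth_gb_per_s = 64
print(weights_gb / bandwidth_gb_per_s)   # ~5.5 seconds per generated token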

Overview of the approach

At a high level, Petals works as a decentralized pipeline designed for fast inference of neural networks. It splits any given model into several blocks (or layers) that are hosted on different servers. These servers can be spread out across continents, and anybody can connect their own GPU! In turn, users can connect to this network as a client and apply the model to their data.
When a client sends a request to the network, it is routed through a chain of servers built to minimize the total forward pass time. Upon joining the system, each server selects the set of blocks that best relieves the current bottlenecks in the pipeline. Below, you can see an illustration of Petals with several servers and clients running different inputs through the model.
An overview of Petals
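To make the routing described above more concrete, here is a minimal sketch of how a client could assemble such a chain. This is not the actual Petals routing code: the Server record and its timing estimates are illustrative assumptions, and the greedy rule is just one simple way to approximate "minimize the total forward pass time".
# Hypothetical sketch of chain building, NOT Petals' real routing algorithm.
from dataclasses import dataclass

@dataclass
class Server:
    start_block: int    # first transformer block hosted by this server
    end_block: int      # one past the last hosted block
    rtt: float          # estimated network round-trip time, seconds (assumed known)
    block_time: float   # estimated compute time per block, seconds (assumed known)

def build_chain(servers, num_blocks):
    """Greedily cover blocks 0..num_blocks-1 with the cheapest available spans."""
    chain, block = [], 0
    while block < num_blocks:
        candidates = [s for s in servers if s.start_block <= block < s.end_block]
        if not candidates:
            raise RuntimeError(f"no online server hosts block {block}")
        # Amortize the network hop over the number of blocks the server can run
        best = min(candidates, key=lambda s: (s.rtt + (s.end_block - block) * s.block_time)
                                             / (s.end_block - block))
        chain.append((best, block, best.end_block))
        block = best.end_block
    return chain
In Petals itself, such estimates are derived from information that servers publish about themselves and from the client's own measurements; the sketch simply assumes they are known.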
Since our network consists of volunteers rather than on-demand servers, each participant of Petals can disconnect at any point. To handle such failures, the client stores the intermediate activations it has sent to each block and, if a server goes offline, reroutes the request to another online server hosting the same blocks.
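The recovery logic can be sketched in the same spirit. Again, this is a simplified illustration rather than the actual client code: run_remote_blocks stands in for a hypothetical remote call that executes a span of blocks on a server, and build_chain is the routing sketch above.
# Hypothetical sketch of fault-tolerant inference, NOT Petals' real client code.
def forward_with_retries(servers, num_blocks, inputs, run_remote_blocks, max_attempts=3):
    """Run a forward pass over remote blocks, resuming from the last successful
    block if a server disconnects. `run_remote_blocks(server, start, end, x)` is a
    stand-in for the RPC that executes blocks [start, end) on a remote server."""
    checkpoints = {0: inputs}   # block index -> activations entering that block
    block = 0
    for _ in range(max_attempts):
        try:
            for server, start, end in build_chain(servers, num_blocks):
                if end <= block:
                    continue    # this span was already computed before a failure
                checkpoints[end] = run_remote_blocks(server, max(start, block), end,
                                                     checkpoints[block])
                block = end
            return checkpoints[num_blocks]
        except ConnectionError:
            continue            # a server went offline: rebuild the chain, resume at `block`
    raise RuntimeError("giving up after repeated server failures")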
Importantly, the transparency of intermediate states has an extra benefit here. Because each block's inputs and outputs pass through the client over the network, it is possible to insert task-specific adapters between layers of the model, which enables lightweight finetuning without altering the pretrained weights hosted on servers (a toy sketch of this idea follows below). The Petals paper [1] describes the system in more depth, including other components such as activation and weight compression.
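As a hypothetical illustration of what such an adapter could look like, the snippet below keeps small trainable modules on the client and treats the frozen, remotely served blocks as pass-through placeholders. RemoteSpan is not a real Petals class, and the hidden size is simply BLOOM-176B's; in Petals itself, the client library takes care of exchanging activations with the servers.
# Toy sketch of "adapters between remote blocks"; RemoteSpan is a placeholder.
import torch
import torch.nn as nn

HIDDEN_SIZE = 14336  # hidden dimension of BLOOM-176B

class RemoteSpan(nn.Module):
    """Stand-in for a span of frozen transformer blocks served by volunteers.
    In the real system the forward (and backward) computation happens remotely;
    here it is an identity so the sketch stays self-contained."""
    def forward(self, hidden_states):
        return hidden_states

class Adapter(nn.Module):
    """A small trainable bottleneck module that lives on the client."""
    def __init__(self, hidden_size=HIDDEN_SIZE, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))  # residual update

# Interleave frozen remote spans with local trainable adapters
model = nn.Sequential(RemoteSpan(), Adapter(), RemoteSpan(), Adapter())
# Only the adapters' parameters are updated; server-side weights never change
optimizer = torch.optim.AdamW(p for p in model.parameters() if p.requires_grad)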

Petals in practice

If you simply want to use Petals as a client, you do not need to know anything about the system internals. The interface of Petals is intentionally very similar to the Transformers library: if you only need to obtain generated outputs or adapt the model with prompt tuning, the snippet below covers all the necessary steps. As you can see, the connection to the public Petals swarm and all the routing logic are hidden from the end user. The Petals repository contains several tutorials and examples showing how to use it for different tasks.
import torch
from torch.nn.functional import cross_entropy
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

# Only the embeddings and the trainable prompts/adapters live locally;
# the transformer blocks are executed on volunteer servers
tokenizer = BloomTokenizerFast.from_pretrained("bigscience/bloom-petals")
model = DistributedBloomForCausalLM.from_pretrained(
    "bigscience/bloom-petals", tuning_mode="ptune", pre_seq_len=16)

# Inference: generate a continuation token by token
inputs = tokenizer("A cat sat", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))  # A cat sat on a mat ...

# Fine-tuning (updates only prompts or adapters hosted locally)
optimizer = torch.optim.AdamW(model.parameters())
for input_ids, labels in data_loader:  # data_loader yields your task-specific batches
    outputs = model.forward(input_ids)
    loss = cross_entropy(outputs.logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Of course, a volunteer-based platform is operational only if there are enough GPUs handling client requests. Assuming you have a Linux machine with CUDA and PyTorch installed, connecting it to the public swarm and becoming a server is a matter of two terminal commands:
pip install -U petals
python -m petals.cli.run_server bigscience/bloom
Alternatively, if you want to create your own swarm (for example, to use a specific model internally at your company), we offer a guide for deploying a self-hosted version of Petals.

Benchmarks

We compare the performance of Petals with offloading, as it is the most popular method for using 100B+ models on local hardware. We test both single-batch inference (an interactive setting) and parallel forward passes (a batch-processing scenario). Our experiments are run on BLOOM-176B and cover various network conditions, from a few high-speed nodes to real-world Internet links. As the table below shows, Petals is predictably slower than offloading in terms of throughput, but 3–25x faster in terms of latency in a realistic setup. This means that inference (and sometimes even finetuning) is much faster with Petals, despite the fact that we are using a distributed model instead of a local one.
Comparison of sequential and parallel inference speed with offloading (RTT is round-trip latency)

Conclusion

Our work on Petals continues a line of research aimed at making the latest advances in deep learning more accessible to everybody. With this work, we demonstrate that it is feasible not only to train large models with volunteer computing, but also to run their inference in such a setup. The development of Petals is an ongoing effort: it is fully open-source (hosted at https://github.com/bigscience-workshop/petals), and we would be happy to receive any feedback or contributions!