SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices

We introduce SpecExec, a speculative decoding method that delivers up to 20 tokens per iteration and up to 15x speedups in offloading settings. This makes LLM inference usable for users with limited GPU VRAM, who until now have had to rely on offloading or settle for lower-quality models.
Intro
As large language models (LLMs) like LLaMA and Mistral gain widespread adoption, data science enthusiasts and practitioners are looking for ways to run them faster and with lower hardware requirements. One option for users with less capable hardware is offloading, a method where only a few model layers are kept on the GPU at a time while the rest reside in RAM or on SSD and are streamed to the GPU sequentially. Naturally, this is quite slow: loading all of Llama-2-70B's 16-bit weights can take over 5 seconds even over a PCIe Gen 4 bus.
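For illustration, here is a minimal sketch of what layer-by-layer offloading looks like in PyTorch. This is not the actual SpecExec code; `blocks` is assumed to be a list of decoder blocks (plain `nn.Module`s mapping a hidden-state tensor to a tensor) resident in CPU RAM.

```python
import torch

@torch.no_grad()
def offloaded_forward(hidden_states, blocks, device="cuda"):
    # Stream one decoder block at a time from CPU RAM to the GPU.
    for block in blocks:
        block.to(device)                      # weight upload over PCIe dominates latency
        hidden_states = block(hidden_states)  # assumed: block maps a tensor to a tensor
        block.to("cpu")                       # evict so only ~one block occupies VRAM at a time
    return hidden_states
```

The compute itself is cheap relative to the transfers, which is why offloaded decoding is dominated by bus bandwidth rather than GPU speed.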
Speculative decoding
To accelerate generation in an offloading setup, one can use speculative decoding. This approach typically involves a much smaller “draft” model that quickly generates proposed continuation token sequences (or trees of such sequences). The main “target” model then validates the proposed continuations and accepts one of them, or none, using a stochastic sampling algorithm.
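As a rough illustration of the acceptance step, here is a simplified sketch of stochastic speculative sampling for a single draft chain (tree verification, as in SpecInfer or SpecExec, generalizes the same idea). The helper name is hypothetical; `p_target` and `p_draft` are assumed to be per-position next-token distributions over the vocabulary.

```python
import torch

def verify_chain(draft_tokens, p_target, p_draft):
    """Accept a prefix of the drafted tokens such that the result is distributed
    exactly as if it had been sampled from the target model."""
    accepted = []
    for t, tok in enumerate(draft_tokens):
        accept_prob = min(1.0, (p_target[t, tok] / p_draft[t, tok]).item())
        if torch.rand(()).item() < accept_prob:
            accepted.append(tok)  # drafted token accepted "for free"
        else:
            # On rejection, resample from the residual distribution max(0, p_target - p_draft).
            residual = torch.clamp(p_target[t] - p_draft[t], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break
    # (The extra "bonus" token sampled from the target when all drafts are accepted is omitted.)
    return accepted
```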
The performance of speculative decoding algorithms is measured in terms of the number of tokens generated per iteration, i.e., the number of draft-proposed tokens accepted by the target model. Combined with the per-iteration model timings, this metric determines the resulting generation speed.
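To see how these two quantities combine, here is a back-of-the-envelope estimate with assumed timings; the ~5-second target pass comes from the offloading discussion above, while the draft-phase time is a rough assumption rather than a measurement.

```python
# Illustrative numbers only, not measurements from the paper.
target_pass_s = 5.5      # assumed: one offloaded forward pass of the 70B target (see above)
draft_phase_s = 1.0      # assumed: time to build the draft proposals with the 7B model
accepted_per_iter = 20   # tokens accepted per iteration at large budgets

tokens_per_s = accepted_per_iter / (target_pass_s + draft_phase_s)
print(f"{tokens_per_s:.1f} tok/s")  # ~3.1 tok/s, vs ~0.2 tok/s for plain offloaded decoding
```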
SpecExec method
Our method, named SpecExec (after speculative execution), was designed to get the best speculative decoding performance in such situations. SpecExec takes the most probable token continuations from the draft model to build a “cache” tree for the target model, which is then validated by the target model in a single pass. It allows draft token trees of arbitrary shape and works with greedy, nucleus, or any other sampling function.
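The sketch below conveys the tree-building idea in simplified form; it is not the actual SpecExec implementation. `draft_next_probs` is a hypothetical helper returning the draft model's next-token distribution for a given prefix, and a real implementation would batch these calls and reuse the draft model's KV cache.

```python
import heapq
import torch

def build_draft_tree(prefix, draft_next_probs, budget=256, top_k=8):
    """Greedily grow a prefix tree of the globally most probable continuations."""
    tree = {tuple(prefix): 1.0}      # node (token path) -> cumulative probability
    heap = [(-1.0, tuple(prefix))]   # max-heap keyed by cumulative probability
    while heap and len(tree) < budget:
        neg_p, node = heapq.heappop(heap)
        probs = draft_next_probs(list(node))   # (vocab_size,) next-token distribution
        top_p, top_ids = torch.topk(probs, top_k)
        for p, tok in zip(top_p.tolist(), top_ids.tolist()):
            child = node + (tok,)
            tree[child] = -neg_p * p           # joint probability of this path
            heapq.heappush(heap, (-tree[child], child))
    return tree   # the whole tree is then verified by the target model in a single pass
```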
The method works especially well because of the high spikiness of token probability distributions in modern large LLMs. As shown in the figure below, the top-1 token of Llama-2-70B carries close to 90% of the probability mass on average, and a capable companion model such as Llama-2-7B covers almost 90% of that mass with a mere four tokens. This means that a few top predictions from the draft model can serve as an execution cache for the target model with a very high hit likelihood.
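This coverage is easy to measure. The sketch below computes how much of the target's next-token probability mass falls on the draft model's top-k tokens, assuming `target_probs` and `draft_probs` are (sequence, vocabulary) distributions computed by the two models on the same text.

```python
import torch

def topk_coverage(target_probs, draft_probs, k=4):
    """Fraction of the target's next-token mass covered by the draft's top-k tokens."""
    topk_ids = draft_probs.topk(k, dim=-1).indices        # draft model's k most likely tokens
    covered = target_probs.gather(-1, topk_ids).sum(-1)   # target mass on those tokens
    return covered.mean().item()                          # ~0.9 for Llama-2 7B vs 70B, per the text
```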
Since our method performs best with a fairly capable draft model (Llama-2-7B for a Llama-2-70B target), it shows the most impressive results in offloading settings, where the time budget for the speculation phase can be relatively large.
SpecExec Performance
We compare SpecExec with the popular SpecInfer speculative decoding method. At low token budgets, the two methods generate a similar number of tokens per step. As the budget grows into the hundreds and thousands, SpecInfer stops improving, while SpecExec reaches over 20 generated tokens per step with budgets beyond 1K. The chart above is based on the MTBench dataset and Llama-2 7B/70B chat models.
The table below compares SpecExec's performance in offloading settings with that of SpecInfer, a popular speculative decoding method introduced in 2023. While the latter shows impressive speedups, our method more than doubles its performance, both in generation speed and in accepted token counts.
| Draft / Target models | Dataset | Temperature | Method | Budget | Generation rate, tok/step | Speed, tok/s | Speedup |
|---|---|---|---|---|---|---|---|
| Llama2-7b / 70b | OAsst | 0.6 | SX | 2048 | 20.60 | 3.12 | 18.7x |
| Llama2-7b / 70b | OAsst | 0.6 | SI | 1024 | 8.41 | 1.34 | 8.0x |
| Llama2-7b / 70b | OAsst | 0 | SX | 1024 | 18.8 | 2.74 | 16.4x |
| Llama2-7b / 70b | OAsst | 0 | SI | 1024 | 7.86 | 1.18 | 7.1x |
| Llama2-7b / 70b GPTQ | OAsst | 0.6 | SX | 128 | 12.10 | 6.02 | 8.9x |
| Llama2-7b / 70b GPTQ | OAsst | 0 | SX | 256 | 13.43 | 6.17 | 9.1x |
| Mistral-7b / Mixtral-8x7b-GPTQ | OAsst | 0.6 | SX | 256 | 12.38 | 3.58 | 3.5x |
| Llama3-8b / 70b | | 0.6 | SX | 1024 | 18.88 | 2.62 | 15.6x |
| Llama3-8b / 70b | MTBench | 0.6 | SX | 1024 | 18.16 | 2.79 | 16.6x |
| Llama3-8b / 70b | MTBench | 0 | SX | 2048 | 21.58 | 2.94 | 17.5x |

Inference speed with RAM offloading on an A100 GPU, chat/instruct models, using the SpecExec (SX) and SpecInfer (SI) methods.
SpecExec can speed up LLM inference on various types of hardware. In addition to the researcher-grade A100, we evaluated SpecExec on consumer GPUs ranging from the RTX 2080 Ti to the RTX 4090. The results below were achieved with a quantized model that can fit in the RAM of consumer-grade computers. Note that the speedups range from 4.6x to 10.6x, with generation speeds of roughly 2-6 tokens per second.
| GPU | Draft model | Budget | Gen. rate, tok/step | Speed, tok/s | Speedup |
|---|---|---|---|---|---|
| RTX 4090 | Llama2-7b GPTQ | 256 | 13.46 | 5.66 | 8.3x |
| RTX 4060 | Llama2-7b GPTQ | 128 | 9.70 | 3.28 | 4.6x |
| RTX 3090 | Llama2-7b GPTQ | 256 | 14.3 | 3.68 | 10.6x |
| RTX 2080Ti | ShearedLlama-1.3B | 128 | 7.34 | 1.86 | 6.1x |
We evaluated SpecExec mainly with Llama-family models, but we believe the results carry over to other model families.
SpecExec represents a significant advance in running large language models on consumer hardware. By leveraging the high spikiness of token probability distributions and a capable draft model, it achieves impressive speedups and efficient token generation. This method not only democratizes access to powerful LLMs but also puts high-quality inference within reach of a broader audience.
Whether you are a researcher looking to maximize your hardware’s potential or a developer aiming to integrate powerful language models into your applications, SpecExec offers a robust, scalable, and efficient solution.
Refer to our paper for more details.
The implementation of SpecExec is available on GitHub.