Speculative and parallel decoding

Modern LLMs are autoregressive: they generate one token at a time, which leaves parallel hardware underutilized. The works below accelerate generation by processing multiple tokens per forward pass.
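
As a concrete illustration of that pattern, here is a minimal draft-and-verify sketch in Python, assuming hypothetical `draft_model` and `target_model` callables that map a list of token ids to next-token logits for every position. It is a greedy-decoding illustration of speculative decoding in general, not any specific paper's implementation.

```python
# Minimal draft-and-verify sketch (greedy variant). `draft_model` and
# `target_model` are hypothetical callables: given a list of token ids, they
# return an array of next-token logits for every position (shape [len, vocab]).
import numpy as np

def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens cheaply, then verify them with one pass of the target.
    `prefix` is a non-empty list of already-generated token ids."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    seq = list(prefix)
    draft_tokens = []
    for _ in range(k):
        tok = int(np.argmax(draft_model(seq)[-1]))
        draft_tokens.append(tok)
        seq.append(tok)

    # 2. Verify: a single target forward pass scores all drafted positions at once.
    target_logits = target_model(seq)          # seq = prefix + draft_tokens
    n = len(prefix)

    # 3. Accept the longest agreeing prefix of drafted tokens; replace the first
    #    mismatch with the target's own prediction, so every step emits >= 1 token.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_tok = int(np.argmax(target_logits[n + i - 1]))
        if tok == target_tok:
            accepted.append(tok)
        else:
            accepted.append(target_tok)
            break
    else:
        # All k drafts matched: the last position yields one extra token for free.
        accepted.append(int(np.argmax(target_logits[-1])))
    return accepted
```

Under greedy decoding this is lossless: the accepted tokens are exactly what the target model alone would produce, but each target forward pass now advances by more than one token on average.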

Publications

  • AutoJudge: Judge Decoding Without Manual Annotation

    Speculative and parallel decoding · Large-scale machine learning · Natural language processing
    Roman Garipov
    Fedor Velikonivtsev
    Ivan Ermakov
    Ruslan Svirschevski
    Vage Egiazarian
    Max Ryabinin
    NeurIPS, 2025

    We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the "unimportant" tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to ≈2× speedup over speculative decoding at the cost of a ≤1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting ≥25 tokens per speculation cycle at a 2% drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks. (An illustrative sketch of this relaxed acceptance rule appears after the publication list.)

  • Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

    Speculative and parallel decoding · Natural language processing
    Gleb Rodionov
    Roman Garipov
    Alina Shutova
    George Yakushev
    Erik Schultheis
    Vage Egiazarian
    Anton Sinitsin
    Denis Kuznedelev
    Dan Alistarh
    NeurIPS, 2025

    Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the explicit creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning. (A conceptual sketch of the shared-cache idea appears after the publication list.)

  • Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding

    Speculative and parallel decoding · Natural language processing · Large-scale machine learning
    Zhuoming Chen
    Avner May
    Ruslan Svirschevski
    Yuhsun Huang
    Max Ryabinin
    Zhihao Jia
    Beidi Chen
    NeurIPS, 2024

    As the usage of large language models (LLMs) grows, it becomes increasingly important to serve them quickly and efficiently. While speculative decoding has recently emerged as a promising direction for accelerating LLM serving, existing methods are limited in their ability to scale to larger speculation budgets and adapt to different hyperparameters. This paper introduces Sequoia, a scalable and robust algorithm for speculative decoding. To improve scalability, Sequoia introduces a dynamic programming algorithm to find an optimal tree structure for the speculated tokens. To achieve robust speculative decoding, Sequoia uses a novel sampling and verification method that outperforms prior work across different decoding temperatures. Sequoia improves the decoding speed of Llama2-7B, Llama2-13B, and Vicuna-33B on an A100 GPU by up to 4.04×, 3.73×, and 2.27×, respectively. To serve Llama3-70B-Instruct on a single L40 GPU through offloading, Sequoia reduces the per-token decoding latency to 0.60 s/token, which is 9.5× faster than DeepSpeed-Zero-Inference. (An illustrative sketch of the tree-construction idea appears after the publication list.)
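
Illustrative sketches

A hedged sketch of the relaxed acceptance idea behind AutoJudge, reusing the interfaces assumed in the first sketch; `embed` (a hook that exposes LLM embeddings at a position) and `judge` (the lightweight importance classifier) are illustrative stand-ins, not the paper's actual API.

```python
# Illustrative AutoJudge-style verification: exact matches are accepted as in
# standard speculative decoding, while mismatching draft tokens are kept
# whenever a small classifier over LLM embeddings deems them "unimportant"
# for the final answer. `embed` and `judge` are hypothetical stand-ins.
import numpy as np

def relaxed_verify(prefix, draft_tokens, target_logits, embed, judge, threshold=0.5):
    """`target_logits` are the target model's per-position next-token logits
    computed over prefix + draft_tokens, as in the first sketch."""
    n = len(prefix)
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_tok = int(np.argmax(target_logits[n + i - 1]))
        if tok == target_tok:
            accepted.append(tok)                        # exact match: accept as usual
            continue
        # Mismatch: estimate whether keeping the draft token affects answer quality.
        features = embed(list(prefix) + accepted, tok)  # hypothetical embedding hook
        if judge(features) < threshold:                 # predicted "unimportant"
            accepted.append(tok)                        # keep the faster draft token
        else:
            accepted.append(target_tok)                 # important: take the target's token
            break
    return accepted
```

Because some mismatches are deliberately kept, the output distribution is no longer guaranteed to match the target model token-by-token; per the paper, the classifier is trained on mismatches labeled by a semi-greedy search over their downstream effect on answer quality, with no human annotation.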
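
A conceptual, single-threaded sketch of the shared-cache mechanism in Hogwild! Inference; `model_step` is a hypothetical function (worker history plus a view of the shared cache in, new KV entry and next token out). The real engine runs workers concurrently and relies on RoPE to avoid recomputing positions.

```python
# Conceptual sketch: several workers of the same LLM decode in parallel while
# writing their key/value entries into one shared cache, so every worker can
# "see" what the others have generated so far. `model_step` is hypothetical.
from dataclasses import dataclass, field

@dataclass
class SharedCache:
    entries: list = field(default_factory=list)    # interleaved KV entries from all workers

    def append(self, worker_id, kv):
        self.entries.append((worker_id, kv))

    def view(self):
        return list(self.entries)                  # full, concurrently updated cache

def decode_round(workers, cache, model_step):
    """One round: each worker advances by one token conditioned on the shared
    cache, then publishes its new KV entry for the other workers to attend to.
    `workers` maps worker ids to mutable token histories."""
    emitted = {}
    for worker_id, history in workers.items():
        kv, token = model_step(history, cache.view())   # attend over everything so far
        cache.append(worker_id, kv)
        history.append(token)
        emitted[worker_id] = token
    return emitted
```

How the workers divide the problem is left to prompting; the shared cache is what lets them coordinate without a fixed voting or sub-task framework.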
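
An illustrative dynamic program in the spirit of Sequoia's tree construction: given a node budget and per-rank acceptance probabilities, choose the speculation-tree shape that maximizes the expected number of accepted tokens. The acceptance model (the i-th ranked child of a node is accepted with probability p[i], independently, and verification descends into the best accepted child) is a simplifying assumption for illustration, not the paper's exact objective or algorithm.

```python
# Toy tree-construction DP: best[n] is the maximum expected number of accepted
# tokens for a speculation subtree containing n drafted nodes, under the
# simplified acceptance model described above. Numbers in the example are made up.

def best_tree_value(budget, p, max_branch=4):
    max_branch = min(max_branch, len(p))
    best = [0.0] * (budget + 1)
    for n in range(1, budget + 1):
        prev = [0.0] * (n + 1)          # value using zero child slots
        no_skip = 1.0                   # prob. that all higher-ranked children were rejected
        for b in range(1, min(max_branch, n) + 1):
            enter = p[b - 1] * no_skip  # prob. that verification descends into child b
            cur = list(prev)            # option: leave child slot b empty
            for m in range(1, n + 1):
                for size in range(1, m + 1):     # nodes allocated to child b's subtree
                    val = prev[m - size] + enter * (1.0 + best[size - 1])
                    if val > cur[m]:
                        cur[m] = val
            prev = cur
            no_skip *= 1.0 - p[b - 1]
        best[n] = prev[n]
    return best[budget]

# Example: hypothetical per-rank acceptance probabilities, 8-node budget.
print(best_tree_value(budget=8, p=[0.8, 0.4, 0.2, 0.1]))
```

A wider or deeper tree trades draft cost against expected acceptances; per the abstract, Sequoia pairs its tree construction with a sampling-and-verification method designed to remain robust across decoding temperatures.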