Speculative and parallel decoding

Modern LLMs are autoregressive: they generate one token per forward pass, and each token must wait for the one before it, which leaves parallel hardware underutilized. The works in this section accelerate generation by producing or verifying multiple tokens per forward pass.
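
To make the idea concrete, here is a minimal sketch of the greedy variant of speculative decoding, using toy stand-ins rather than real networks. The names `target_model`, `draft_model`, `greedy_decode`, and `speculative_decode` are hypothetical and chosen only for illustration. The sketch assumes deterministic (greedy) decoding, so it does not cover the rejection-sampling scheme used for stochastic sampling. The verification step is simulated with a Python loop; in a real transformer it would be one batched forward pass over all drafted positions.

```python
from typing import Callable, List, Tuple

# A "model" here maps a token prefix to its greedy next token.
Model = Callable[[List[int]], int]


def target_model(prefix: List[int]) -> int:
    # Hypothetical "large" model: an arbitrary deterministic toy rule.
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical "small" model: agrees with the target except
    # when the prefix length is a multiple of 4.
    tok = target_model(prefix)
    return (tok + 1) % 50 if len(prefix) % 4 == 0 else tok


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Baseline: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target scores all k + 1 positions. This loop stands in
        #    for a single batched forward pass of a real model.
        verified = [target(tokens + proposal[:i]) for i in range(k + 1)]
        target_passes += 1
        # 3. Accept the longest prefix on which draft and target agree.
        n_ok = 0
        while n_ok < k and proposal[n_ok] == verified[n_ok]:
            n_ok += 1
        # 4. Keep the accepted tokens, then the target's own token at the
        #    first mismatch (or its extra token if everything matched).
        tokens.extend(proposal[:n_ok])
        tokens.append(verified[n_ok])
    # The last round may overshoot the requested length, so truncate.
    return tokens[:goal], target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 20)
spec_tokens, passes = speculative_decode(target_model, draft_model, prompt, 20)
```

In this sketch `spec_tokens` matches `baseline` exactly, because every emitted token is either confirmed or supplied by the target. Any speedup comes from needing fewer target passes than generated tokens, and that depends on how often the draft agrees with the target.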