Speculative and parallel decoding
Modern LLMs are autoregressive: they generate one token per forward pass, which underutilizes parallel hardware. The works in this category accelerate generation by processing multiple tokens per forward pass, for example by having a cheap draft model propose several tokens that the target model then verifies in a single pass, keeping the longest accepted prefix.
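To make the draft-then-verify idea concrete, below is a minimal sketch of the greedy variant of speculative decoding. The names draft_model, target_model, and the toy logit functions are illustrative stand-ins, not any particular library's API; a real system would batch the verification of all drafted positions into one forward pass of the large model rather than loop over them.

```python
import numpy as np

VOCAB_SIZE = 16

def toy_logits(tokens, seed):
    """Deterministic stand-in for a language model's next-token logits."""
    h = hash((tuple(tokens), seed)) % (2**32)
    return np.random.default_rng(h).normal(size=VOCAB_SIZE)

def draft_model(tokens):
    # Cheap model: proposes candidate tokens autoregressively.
    return toy_logits(tokens, seed=1)

def target_model(tokens):
    # Expensive model: verifies the drafted tokens. A different seed means it
    # sometimes disagrees with the draft, exercising both accept and reject paths.
    return toy_logits(tokens, seed=2)

def speculative_decode(prompt, max_new_tokens=12, k=4):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1) Draft k tokens with the cheap model.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            t = int(np.argmax(draft_model(ctx)))
            draft.append(t)
            ctx.append(t)

        # 2) Verify each drafted position against the target model's greedy choice.
        #    In practice these k positions are scored in ONE batched forward pass.
        accepted, correction = 0, None
        for i in range(k):
            target_tok = int(np.argmax(target_model(tokens + draft[:i])))
            if target_tok == draft[i]:
                accepted += 1
            else:
                correction = target_tok  # first mismatch: keep the target's own token
                break

        # 3) Commit the accepted prefix, then either the correction token or,
        #    if everything was accepted, one extra token from the target for free.
        tokens.extend(draft[:accepted])
        if correction is not None:
            tokens.append(correction)
        else:
            tokens.append(int(np.argmax(target_model(tokens))))
    return tokens[:len(prompt) + max_new_tokens]

print(speculative_decode([1, 2, 3]))
```

The speedup comes from the acceptance rate: when the draft model agrees with the target on several consecutive tokens, the expensive model advances multiple positions per verification step instead of one.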