Publications
Hogwild! Inference: Parallel LLM Generation via Concurrent Attention
NeurIPS, 2025
Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy to expedite work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the creation of independent sub-tasks that can be executed in parallel. However, each of these frameworks may not be suitable for all types of tasks, which can hinder their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel, allowing them to synchronize via a concurrently-updated attention cache, and prompt these workers to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
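The RoPE property that enables this cache reuse can be shown in a few lines: attention logits under rotary embeddings depend only on the relative offset between query and key positions, so a cached key can be "moved" to a new position by a cheap re-rotation instead of being recomputed. A minimal sketch of that property (assuming the standard interleaved rotary parameterization from the RoFormer paper; `rope` below is illustrative, not the actual Hogwild! Inference code):

```python
# Minimal sketch of the RoPE shift invariance behind KV-cache reuse.
import torch

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)
    cos, sin = torch.cos(pos * theta), torch.sin(pos * theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
q = torch.randn(64, dtype=torch.float64)
k = torch.randn(64, dtype=torch.float64)

# The attention logit depends only on the query/key position offset, so a
# key cached at position 5 scores against a query at position 12 exactly
# as the same key re-rotated to position 105 scores against a query at 112.
s1 = rope(q, 12) @ rope(k, 5)      # offset -7
s2 = rope(q, 112) @ rope(k, 105)   # offset -7
print(torch.allclose(s1, s2))      # True
```

This shift invariance is presumably what lets each worker splice the other workers' cached tokens into its own view of the sequence at new positions without recomputing their representations.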
YaART: Yet Another ART Rendering Technology
KDD, 2025
In the rapidly progressing field of generative models, the development of efficient and high-fidelity text-to-image diffusion systems represents a significant frontier. This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences using Reinforcement Learning from Human Feedback (RLHF). During the development of YaART, we especially focus on the choices of the model and training dataset sizes, aspects that had not been systematically investigated for text-to-image cascaded diffusion models before. In particular, we comprehensively analyze how these choices affect both the efficiency of the training process and the quality of the generated images, which are highly important in practice. Furthermore, we demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets, establishing a more efficient regime for diffusion model training. From the quality perspective, YaART is consistently preferred by users over many existing state-of-the-art models.
Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models
ICML, 2025
Efficient real-world deployments of large language models (LLMs) rely on Key-Value (KV) caching for processing and generating long outputs, reducing the need for repetitive computation. For large contexts, Key-Value caches can take up tens of gigabytes of device memory, as they store vector representations for each token and layer. Recent work has shown that the cached vectors can be compressed through quantization, pruning, or merging, but these techniques often compromise quality at higher compression rates. In this work, we aim to improve Key & Value compression by exploiting two observations: 1) the inherent dependencies between keys and values across different layers, and 2) the existence of high-compression methods for internal network states (e.g. attention Keys & Values). We propose AQUA-KV, an adaptive quantization for Key-Value caches that relies on compact adapters to exploit existing dependencies between Keys and Values, and aims to “optimally” compress the information that cannot be predicted. AQUA-KV significantly improves compression rates, while maintaining high accuracy on state-of-the-art LLM families. On Llama 3.2 LLMs, we achieve near-lossless inference at 2-2.5 bits per value with under 1% relative error in perplexity and LongBench scores. AQUA-KV is one-shot, simple, and efficient: it can be calibrated on a single GPU within 1-6 hours, even for 70B models.
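The adapter idea can be illustrated with a toy experiment: fit a small predictor from keys to values, then quantize only the residual that the predictor cannot explain. A minimal sketch under simplified assumptions (a single linear adapter fitted by least squares and a per-tensor round-to-nearest quantizer; AQUA-KV's actual adapter structure, calibration, and quantizer are more elaborate):

```python
# Toy illustration: predict values from keys, quantize only the residual.
import torch

torch.manual_seed(0)
n, d = 4096, 128
K = torch.randn(n, d)                                       # cached keys (calibration set)
V = K @ torch.randn(d, d) * 0.1 + 0.05 * torch.randn(n, d)  # values correlated with keys

W = torch.linalg.lstsq(K, V).solution                       # fit adapter: V ~= K @ W

def quantize(x, bits=2):
    """Toy symmetric round-to-nearest quantizer with one per-tensor scale."""
    levels = 2 ** (bits - 1) - 1
    scale = x.abs().max() / levels
    return torch.round(x / scale).clamp(-levels, levels) * scale

residual = V - K @ W                   # the part the adapter cannot predict
V_hat = K @ W + quantize(residual)     # store only the quantized residual

print(f"2-bit direct:           MSE {(quantize(V) - V).pow(2).mean():.4f}")
print(f"adapter + 2-bit resid.: MSE {(V_hat - V).pow(2).mean():.4f}")  # much lower
```

Because the residual has far less energy than the raw values, the same bit budget loses much less information after the adapter's prediction is subtracted out.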
Denis Kuznedelev
My scholarly and technical focus spans two cutting‑edge domains within artificial intelligence:
- Efficient Deep Learning — exploring methods to optimize neural network architectures, reduce computational costs, and improve inference speed without sacrificing model performance. This includes work on model compression, quantization, pruning, knowledge distillation, and hardware‑aware algorithm design to make deep learning more scalable and accessible across diverse computing environments.
- Generative AI — investigating advanced generative models such as diffusion models, generative adversarial networks (GANs), and large autoregressive transformers. My work in this area involves enhancing the quality, diversity, and controllability of generated content (e.g., images, text, audio), as well as exploring ethical implications, bias mitigation, and responsible deployment of generative technologies.
Together, these research interests aim to advance the frontiers of AI by making powerful models more efficient, interpretable, and broadly applicable across real‑world scenarios.
Datasets
Heterophilous graph datasets
A graph dataset is called heterophilous if nodes prefer to connect to other nodes that are not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders. Learning under heterophily is an important subfield of graph ML. Thus, having diverse and reliable benchmarks is essential.
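Heterophily is often quantified by edge homophily: the fraction of edges whose endpoints share a label, with heterophilous graphs scoring low. A minimal sketch of that measure (the tiny "dating network" graph and its labels below are made up for illustration; networkx is used only for bookkeeping):

```python
# Toy computation of edge homophily on a small heterophilous graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([(0, 2), (0, 3), (1, 2), (1, 3), (0, 1)])
labels = {0: "A", 1: "A", 2: "B", 3: "B"}  # two groups; edges mostly cross groups

same = sum(labels[u] == labels[v] for u, v in G.edges())
print(f"edge homophily: {same / G.number_of_edges():.2f}")  # 0.20 -> heterophilous
```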
We propose a benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes a word dependency graph Roman-empire, a product co-purchasing network Amazon-ratings, a synthetic graph emulating the minesweeper game Minesweeper, a crowdsourcing platform worker network Tolokers, and a question-answering website interaction network Questions.