Denis Kuznedelev

My research focuses on two areas of artificial intelligence:

  1. Efficient Deep Learning — exploring methods to optimize neural network architectures, reduce computational costs, and improve inference speed without sacrificing model performance. This includes work on model compression, quantization, pruning, knowledge distillation, and hardware‑aware algorithm design to make deep learning more scalable and accessible across diverse computing environments (a minimal quantization sketch follows this overview).

  2. Generative AI — investigating advanced generative models such as diffusion models, generative adversarial networks (GANs), and large autoregressive transformers. My work in this area involves enhancing the quality, diversity, and controllability of generated content (e.g., images, text, audio), as well as exploring ethical implications, bias mitigation, and responsible deployment of generative technologies.

Together, these research interests aim to advance the frontiers of AI by making powerful models more efficient, interpretable, and broadly applicable across real‑world scenarios.
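
To make the first theme concrete, here is a minimal sketch of symmetric round-to-nearest weight quantization in NumPy. It is illustrative only: a single per-tensor scale, no calibration data, and all names are my own rather than taken from any particular paper.

    import numpy as np

    def quantize_symmetric(w: np.ndarray, bits: int = 8):
        """Symmetric round-to-nearest quantization of a weight tensor."""
        qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8
        scale = np.abs(w).max() / qmax        # one scale for the whole tensor
        codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
        return codes, scale

    def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
        return codes.astype(np.float32) * scale

    # Quantization error shrinks as the bit-width grows.
    w = np.random.randn(256, 256).astype(np.float32)
    for bits in (8, 4, 2):
        codes, scale = quantize_symmetric(w, bits)
        err = np.abs(w - dequantize(codes, scale)).max()
        print(bits, "bits -> max abs error:", err)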

Publications

  • Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

    Model compression, Natural language processing, Large-scale machine learning
    Vage Egiazarian
    Roberto L. Castro
    Denis Kuznedelev
    Andrei Panferov
    Eldar Kurtić
    Shubhra Pandit
    Alexandre Marques
    Mark Kurtz
    Saleh Ashkboos
    Torsten Hoefler
    Dan Alistarh
    ICLR, 2026

    The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVFP4 for post-training quantization, revealing gaps between their promise and real-world performance. Our analysis shows that state-of-the-art methods struggle with FP4 due to two key issues: (1) NVFP4’s small group size provably neutralizes traditional outlier mitigation techniques; (2) MXFP4’s power-of-two scale quantization severely degrades accuracy due to high induced error. To bridge this gap, we introduce Micro-Rotated-GPTQ (MR-GPTQ), a variant of the classic GPTQ quantization algorithm that tailors the quantization process to FP4’s unique properties by using block-wise Hadamard transforms and format-specific optimizations. We support our proposal with a set of high-performance GPU kernels that enable the MR-GPTQ format with negligible overhead by fusing rotations into the weights and computing the activation rotations quickly online. This leads to speedups over FP16 of up to 3.6x layer-wise and 2.2x end-to-end on NVIDIA B200, and of 6x layer-wise and 4x end-to-end on RTX 5090. Our extensive empirical evaluation demonstrates that MR-GPTQ matches or outperforms state-of-the-art accuracy, significantly boosting MXFP4 to the point where its accuracy nears that of NVFP4. We conclude that, while FP4 is not an automatic upgrade over INT4, format-specialized methods like MR-GPTQ can unlock a new frontier of accuracy-performance trade-offs.
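
    As a toy illustration of two ingredients the abstract names — small-group FP4 scaling (in the NVFP4 style) and a block-wise Hadamard rotation — the NumPy sketch below rounds weights to the E2M1 grid with one scale per 16-element group. This is my own simplified sketch, not the MR-GPTQ algorithm or its GPU kernels.

        import numpy as np

        # Toy FP4 rounding; not MR-GPTQ. Signed E2M1 grid:
        # magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
        GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
        GRID = np.concatenate([-GRID[:0:-1], GRID])

        def hadamard(n: int) -> np.ndarray:
            """Orthonormal Hadamard matrix via Sylvester's construction (n = 2^k)."""
            h = np.array([[1.0]])
            while h.shape[0] < n:
                h = np.block([[h, h], [h, -h]])
            return h / np.sqrt(n)

        def fp4_quantize(w: np.ndarray, group: int = 16) -> np.ndarray:
            """Round each `group`-element slice to the FP4 grid with its own scale."""
            g = w.reshape(-1, group)
            scale = np.abs(g).max(axis=1, keepdims=True) / GRID.max()
            scale[scale == 0] = 1.0
            idx = np.abs(g[..., None] / scale[..., None] - GRID).argmin(-1)
            return (GRID[idx] * scale).reshape(w.shape)

        # Rotate 128-weight blocks before quantizing, then rotate back:
        # the orthogonal transform spreads outliers across the block.
        w = np.random.randn(4096).astype(np.float32)
        w[::512] *= 20.0                                   # inject outliers
        H = hadamard(128).astype(np.float32)
        w_hat = fp4_quantize(w.reshape(-1, 128) @ H.T).reshape(-1, 128) @ H
        print("plain FP4 error  :", np.abs(w - fp4_quantize(w)).mean())
        print("rotated FP4 error:", np.abs(w - w_hat.ravel()).mean())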

  • Scale-wise Distillation of Diffusion Models

    Computer vision, Generative models
    Nikita Starodubcev
    Ilya Drobyshevskiy
    Denis Kuznedelev
    Artem Babenko
    Dmitry Baranchuk
    ICLR, 2026

    Recent diffusion distillation methods have achieved remarkable progress, enabling high-quality ∼4-step sampling for large-scale text-conditional image and video diffusion models. However, further reducing the number of sampling steps becomes increasingly challenging, suggesting that efficiency gains may be better sought along other axes of the model. Motivated by this perspective, we introduce SwD, a scale-wise diffusion distillation framework that equips few-step models with progressive generation, avoiding redundant computations at intermediate diffusion timesteps. Beyond efficiency, SwD enriches the family of distribution matching distillation approaches by introducing a simple patch-level distillation objective based on Maximum Mean Discrepancy (MMD). This objective significantly improves the convergence of existing distillation methods and performs surprisingly well in isolation, offering a competitive baseline for diffusion distillation. Applied to state-of-the-art text-to-image/video diffusion models, SwD approaches the sampling speed of two full-resolution steps and substantially outperforms alternatives under the same compute budget, as evidenced by automatic metrics and human preference studies.
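
    The patch-level MMD idea can be sketched in a few lines of PyTorch: extract non-overlapping patches from student and teacher images and minimize a biased squared MMD between the two patch sets. This is my own toy version with an RBF kernel over raw pixel patches; the paper's exact objective and feature space may differ.

        import torch
        import torch.nn.functional as F

        def patchify(x: torch.Tensor, p: int = 8) -> torch.Tensor:
            """Split images (B, C, H, W) into flattened p x p patches: (B*N, C*p*p)."""
            patches = F.unfold(x, kernel_size=p, stride=p)   # (B, C*p*p, N)
            return patches.transpose(1, 2).reshape(-1, patches.shape[1])

        def mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 14.0) -> torch.Tensor:
            """Biased squared MMD with an RBF kernel (bandwidth ~ sqrt(dim))."""
            def k(a, b):
                return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
            return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

        # Toy objective, not SwD's exact loss: gradients flow into the
        # student, pulling its patch statistics toward the teacher's.
        teacher = torch.randn(4, 3, 64, 64)
        student = torch.randn(4, 3, 64, 64, requires_grad=True)
        loss = mmd2(patchify(student), patchify(teacher))
        loss.backward()
        print("patch MMD^2:", loss.item(), "| grad norm:", student.grad.norm().item())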

  • Hogwild! Inference: Parallel LLM Generation via Concurrent Attention

    Speculative and parallel decoding, Natural language processing
    Gleb Rodionov
    Roman Garipov
    Alina Shutova
    George Yakushev
    Erik Schultheis
    Vage Egiazarian
    Anton Sinitsin
    Denis Kuznedelev
    Dan Alistarh
    NeurIPS, 2025

    Large Language Models (LLMs) have demonstrated the ability to tackle increasingly complex tasks through advanced reasoning, long-form content generation, and tool use. Solving these tasks often involves long inference-time computations. In human problem solving, a common strategy for expediting work is collaboration: dividing the problem into sub-tasks, exploring different strategies concurrently, and so on. Recent research has shown that LLMs can also operate in parallel by implementing explicit cooperation frameworks, such as voting mechanisms or the creation of independent sub-tasks that can be executed in parallel. However, no single such framework is suitable for all types of tasks, which hinders their applicability. In this work, we propose a different design approach: we run LLM “workers” in parallel, allow them to synchronize via a concurrently updated attention cache, and prompt them to decide how best to collaborate. Our approach allows the LLM instances to come up with their own collaboration strategy for the problem at hand, all the while “seeing” each other’s memory in the concurrent KV cache. We implement this approach via Hogwild! Inference: a parallel LLM inference engine where multiple instances of the same LLM run in parallel with the same attention cache, with “instant” access to each other’s memory. Hogwild! Inference takes advantage of Rotary Position Embeddings (RoPE) to avoid recomputation while improving parallel hardware utilization. We find that modern reasoning-capable LLMs can perform inference with a shared Key-Value cache out of the box, without additional fine-tuning.
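
    The RoPE property the abstract relies on can be checked in a few lines of NumPy: because rotary embeddings compose additively, a cached key can be "moved" to a new absolute position with one extra rotation instead of recomputation. This is my own minimal check of that property, not the Hogwild! engine itself.

        import numpy as np

        def rope(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
            """Rotary position embedding: pair (2i, 2i+1) of each row of x
            (shape (seq, dim), dim even) is rotated by pos * base**(-2i/dim)."""
            dim = x.shape[-1]
            ang = pos[:, None] * base ** (-np.arange(0, dim, 2) / dim)
            cos, sin = np.cos(ang), np.sin(ang)
            out = np.empty_like(x)
            out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
            out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
            return out

        # Keys cached at positions 0..7 are moved to 100..107 by a single
        # extra rotation of +100 -- identical to recomputing from scratch.
        k = np.random.randn(8, 64)
        pos = np.arange(8, dtype=np.float64)
        cached = rope(k, pos)
        shifted = rope(cached, np.full(8, 100.0))
        direct = rope(k, pos + 100.0)
        print("max difference:", np.abs(shifted - direct).max())  # ~0 (float64 round-off)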

Datasets

  • Heterophilous graph datasets

    Graph machine learning
    Oleg Platonov
    Denis Kuznedelev
    Michael Diskin
    Artem Babenko
    Liudmila Prokhorenkova

    A graph dataset is called heterophilous if nodes prefer to connect to other nodes that are not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders. Learning under heterophily is an important subfield of graph ML. Thus, having diverse and reliable benchmarks is essential.

    We propose a benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes a word dependency graph Roman-empire, a product co-purchasing network Amazon-ratings, a synthetic graph emulating the minesweeper game Minesweeper, a crowdsourcing platform worker network Tolokers, and a question-answering website interaction network Questions.