Model compression

Deep learning models are outgrowing the hardware that runs them. We develop techniques that make large models fit on smaller devices through quantization, pruning, factorization, and other means.

Publications

  • EvoPress: Accurate Dynamic Model Compression via Evolutionary Search

    Model compression · Large-scale machine learning · Natural language processing
    Oliver Sieberling
    Denis Kuznedelev
    Eldar Kurtic
    Dan Alistarh
    ICML, 2025

    The high computational costs of large language models (LLMs) have led to a flurry of research on LLM compression, via methods such as quantization, sparsification, or structured pruning. A new frontier in this area is given by dynamic, non-uniform compression methods, which adjust the compression levels (e.g., sparsity) per-block or even per-layer in order to minimize accuracy loss, while guaranteeing a global compression threshold. Yet, current methods rely on estimating the “importance” of a given layer, implicitly assuming that layers contribute independently to the overall compression error. We begin from the motivating observation that this independence assumption does not generally hold for LLM compression: pruning a model further may even significantly recover performance. To address this, we propose EvoPress, a novel evolutionary framework for dynamic LLM compression. By formulating dynamic compression as a general optimization problem, EvoPress identifies optimal compression profiles in a highly efficient manner, and generalizes across diverse models and compression techniques. Via EvoPress, we achieve state-of-the-art performance for dynamic compression of Llama, Mistral, and Phi models, setting new benchmarks for structural pruning (block/layer dropping), unstructured sparsity, and quantization with dynamic bitwidths.
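
As a rough illustration of the search idea described in this abstract, the sketch below evolves per-layer compression "profiles" under a fixed global budget and selects candidates by a measured fitness. The layer count, mutation rule, and fitness function are toy stand-ins, not the paper's actual setup.

```python
# Minimal sketch of an evolutionary search over dynamic compression profiles
# (not the authors' implementation): a candidate profile assigns a compression
# level to each layer; mutations preserve a global budget, and survivors are
# selected by fitness. The fitness here is a toy stand-in for the loss a
# compressed model would incur on calibration data.
import random

NUM_LAYERS = 12
LEVELS = [0, 1, 2, 3]          # e.g. 0 = dense ... 3 = most aggressive
BUDGET = NUM_LAYERS * 2        # global compression target (sum of levels)

def random_profile():
    # Start from a uniform profile that exactly meets the budget.
    return [BUDGET // NUM_LAYERS] * NUM_LAYERS

def mutate(profile):
    # Move one "unit" of compression from one layer to another, so the
    # global budget is preserved while the profile becomes non-uniform.
    child = profile[:]
    i, j = random.sample(range(NUM_LAYERS), 2)
    if child[i] < max(LEVELS) and child[j] > min(LEVELS):
        child[i] += 1
        child[j] -= 1
    return child

def fitness(profile):
    # Toy stand-in for "loss on calibration data after applying the profile".
    # In practice this would compress the model per-layer and evaluate it.
    return sum((lvl - (idx % 3)) ** 2 for idx, lvl in enumerate(profile))

def evolve(generations=200, population=16, offspring=4):
    pop = [random_profile() for _ in range(population)]
    for _ in range(generations):
        children = [mutate(random.choice(pop)) for _ in range(offspring * population)]
        pop = sorted(pop + children, key=fitness)[:population]  # keep the fittest
    return pop[0]

best = evolve()
print("best profile:", best, "fitness:", fitness(best))
```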

  • PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

    Model compression · Large-scale machine learning · Natural language processing
    Vladimir Malinovskii
    Denis Mazur
    Ivan Ilin
    Denis Kuznedelev
    Konstantin Burlachenko
    Kai Yi
    Dan Alistarh
    Peter Richtarik
    NeurIPS, 2024

    There has been significant interest in "extreme" compression of large language models (LLMs), i.e. to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accuracy-vs-bit-width trade-off. State-of-the-art quantization methods such as QuIP# and AQLM include fine-tuning (part of) the compressed parameters over a limited amount of calibration data; however, such fine-tuning techniques over compressed weights often make exclusive use of straight-through estimators (STE), whose performance is not well-understood in this setting. In this work, we question the use of STE for extreme LLM compression, showing that it can be sub-optimal, and perform a systematic study of quantization-aware fine-tuning strategies for LLMs. We propose PV-Tuning - a representation-agnostic framework that generalizes and improves upon existing fine-tuning strategies, and provides convergence guarantees in restricted cases. On the practical side, when used for 1-2 bit vector quantization, PV-Tuning outperforms prior techniques for highly-performant models such as Llama and Mistral. Using PV-Tuning, we achieve the first Pareto-optimal quantization for Llama-2 family models at 2 bits per parameter.
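
For context on the fine-tuning question raised in this abstract, the sketch below shows the straight-through estimator (STE) baseline that the paper scrutinizes, applied to a toy uniform 2-bit quantizer in PyTorch. It illustrates what STE does during quantization-aware fine-tuning, not PV-Tuning itself; the tensor shapes and the objective are arbitrary.

```python
# Minimal sketch of the straight-through estimator (STE) baseline: the forward
# pass uses quantized weights, while the backward pass pretends quantization
# is the identity, so gradients flow to the full-precision master weights.
# The uniform 2-bit quantizer and all names here are illustrative.
import torch

class STEQuantize(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w, num_levels=4, w_max=1.0):
        # Uniform quantization to `num_levels` values in [-w_max, w_max].
        step = 2 * w_max / (num_levels - 1)
        return torch.clamp(torch.round(w / step) * step, -w_max, w_max)

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: the gradient w.r.t. w passes through unchanged.
        return grad_output, None, None

w = torch.randn(16, 16, requires_grad=True)       # full-precision "master" weights
x = torch.randn(8, 16)

opt = torch.optim.SGD([w], lr=1e-2)
for _ in range(10):
    w_q = STEQuantize.apply(w)                    # quantized weights used in forward
    loss = (x @ w_q.t()).pow(2).mean()            # toy objective
    opt.zero_grad()
    loss.backward()                               # STE routes gradients to w
    opt.step()
```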

  • The Iterative Optimal Brain Surgeon: Faster Sparse Recovery by Leveraging Second-Order Information

    Optimization · Model compression
    Diyuan Wu
    Ionut-Vlad Modoranu
    Mher Safaryan
    Denis Kuznedelev
    Dan Alistarh
    NeurIPS, 2024

    The rising footprint of machine learning has led to a focus on imposing model sparsity as a means of reducing computational and memory costs. For deep neural networks (DNNs), the state-of-the-art accuracy-vs-sparsity trade-off is achieved by heuristics inspired by the classical Optimal Brain Surgeon (OBS) framework [LeCun et al., 1989, Hassibi and Stork, 1992, Hassibi et al., 1993], which leverages loss curvature information to make better pruning decisions. Yet, these results still lack a solid theoretical understanding, and it is unclear whether they can be improved by leveraging connections to the wealth of work on sparse recovery algorithms. In this paper, we draw new connections between these two areas and present new sparse recovery algorithms inspired by the OBS framework that come with theoretical guarantees under reasonable assumptions and have strong practical performance. Specifically, our work starts from the observation that we can leverage curvature information in an OBS-like fashion within the projection step of classic iterative sparse recovery algorithms such as Iterative Hard Thresholding (IHT). We show for the first time that this leads both to improved convergence bounds in well-behaved settings and to stronger practical convergence. Furthermore, we present extensions of this approach to training accurate sparse DNNs, and validate it experimentally at scale.
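
The toy example below illustrates the kind of curvature-aware projection this abstract describes: plain Iterative Hard Thresholding on a least-squares problem, with an optional OBS-style saliency score replacing plain weight magnitude in the hard-thresholding step. It is an illustrative sketch under a diagonal-Hessian assumption, not the authors' algorithm.

```python
# Sketch contrasting plain IHT with a curvature-aware projection step in the
# OBS spirit: instead of keeping the k largest-magnitude entries, keep the
# entries with the largest saliency w_i^2 * H_ii (with a diagonal Hessian this
# equals the classic OBS score w_i^2 / [H^{-1}]_ii). Illustrative toy only.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 50, 5
A = rng.normal(size=(n, d))
x_true = np.zeros(d)
x_true[rng.choice(d, k, replace=False)] = rng.normal(size=k)
y = A @ x_true

H_diag = np.sum(A * A, axis=0)          # diagonal of the Hessian A^T A
lr = 1.0 / np.linalg.norm(A, 2) ** 2    # step size from the largest singular value

def project(x, k, use_curvature):
    # Keep the k most "salient" entries, zero out the rest.
    scores = x**2 * H_diag if use_curvature else np.abs(x)
    keep = np.argsort(scores)[-k:]
    out = np.zeros_like(x)
    out[keep] = x[keep]
    return out

def iht(use_curvature, iters=200):
    x = np.zeros(d)
    for _ in range(iters):
        grad = A.T @ (A @ x - y)        # gradient of 0.5 * ||Ax - y||^2
        x = project(x - lr * grad, k, use_curvature)
    return x

for flag in (False, True):
    err = np.linalg.norm(iht(flag) - x_true)
    print(f"curvature-aware={flag}: recovery error {err:.3e}")
```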