Papers accepted to ICML 2023

Four papers by the Yandex Research team and our collaborators have been accepted for publication at the International Conference on Machine Learning (ICML 2023).
TabDDPM: Modelling Tabular Data with Diffusion Models by Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, Artem Babenko
Denoising diffusion probabilistic models are currently becoming the leading paradigm of generative modeling for many data modalities. In this work, we investigate whether the framework of diffusion models can be advantageous for general tabular problems, where datapoints are represented by vectors of heterogeneous features. We introduce TabDDPM, a diffusion model that can be universally applied to any tabular dataset and handles any feature type. We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, consistent with the advantage of diffusion models in other fields. Additionally, we show that TabDDPM is suitable for privacy-oriented setups, where the original datapoints cannot be publicly shared. We also compare our approach with a simple SMOTE baseline, which, despite its simplicity, turns out to be hard to beat.
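To make the diffusion framing more concrete, below is a minimal sketch of the Gaussian forward (noising) process that such a model applies to the numerical columns of a table. The schedule, shapes, and function names are illustrative assumptions rather than the TabDDPM implementation; categorical features are handled separately in the paper via multinomial diffusion.

```python
# Gaussian forward diffusion for numerical tabular features (illustrative sketch).
import torch

def make_noise_schedule(num_steps: int = 1000):
    # Linear beta schedule; the cumulative products of (1 - beta) allow
    # sampling x_t from x_0 in closed form at any timestep.
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return betas, alphas_cumprod

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor):
    # Sample x_t ~ q(x_t | x_0) = N(sqrt(a_t) * x_0, (1 - a_t) * I)
    # for a batch of numerical feature vectors x0 of shape [batch, n_features].
    a_t = alphas_cumprod[t].unsqueeze(-1)          # [batch, 1]
    noise = torch.randn_like(x0)
    return torch.sqrt(a_t) * x0 + torch.sqrt(1.0 - a_t) * noise, noise

# Usage: noise a batch of standardized numerical features at random timesteps.
betas, alphas_cumprod = make_noise_schedule()
x0 = torch.randn(8, 5)                             # 8 rows, 5 numerical columns
t = torch.randint(0, len(betas), (8,))
x_t, eps = q_sample(x0, t, alphas_cumprod)
```

A denoising network trained to predict the added noise (or the original features) from x_t and t can then generate new rows by reversing this process step by step.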
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient by Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov
We present a new approach for training large neural networks that is cost-efficient and accessible to researchers with limited resources. This approach relies on the finding that pipeline-parallel training of larger models is less communication-intensive relative to its compute costs. Our proposed method, SWARM parallelism, enables the training of large models on heterogeneous and unreliable devices, such as preemptible instances or volunteer servers. The method achieves robustness to network failures by creating randomized pipelines between nodes that are rebalanced in case of a failure. We show that SWARM outperforms existing model-parallel algorithms in these conditions and can train a large Transformer language model with 1B shared parameters on preemptible T4 GPUs with network bandwidth below 200 Mb/s.
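The toy sketch below illustrates the core idea of randomized, rebalanced pipelines: each pipeline stage is served by a pool of interchangeable workers, every microbatch routes through a randomly chosen worker per stage, and failed workers are simply dropped so later microbatches avoid them. All names here are hypothetical; the actual system is far more involved (asynchronous communication, load balancing, and fault-tolerant training built on hivemind).

```python
# Toy model of randomized pipeline routing with failure handling (illustrative only).
import random

class StagePool:
    def __init__(self, workers):
        self.workers = set(workers)

    def pick(self):
        # Temporary randomized routing: any live worker can serve this stage.
        return random.choice(tuple(self.workers))

    def drop(self, worker):
        # On failure, remove the worker; subsequent microbatches reroute.
        self.workers.discard(worker)

def run_microbatch(stages, microbatch_id):
    route = [pool.pick() for pool in stages]
    for stage_idx, worker in enumerate(route):
        # A real system would send activations over the network here.
        print(f"microbatch {microbatch_id}: stage {stage_idx} on {worker}")
    return route

stages = [StagePool({"gpu-a", "gpu-b"}), StagePool({"gpu-c", "gpu-d", "gpu-e"})]
run_microbatch(stages, microbatch_id=0)
stages[1].drop("gpu-d")                  # simulate a preempted instance
run_microbatch(stages, microbatch_id=1)  # the pipeline is rebuilt from live workers
```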
High-throughput Generative Inference of Large Language Models with a Single GPU by Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
Large language models (LLMs) typically require powerful hardware because of their high computational and memory requirements. Our paper introduces FlexGen, a new offloading-based generation engine that allows these models to run efficiently on a single commodity GPU. FlexGen adapts to resource constraints by aggregating memory and computation from the GPU, CPU, and disk. It uses a linear programming optimizer to find the best way to store and access tensors, and compresses the model weights and the attention key-value (KV) cache to just 4 bits with negligible loss in accuracy. FlexGen outperforms state-of-the-art solutions for inference with offloading, achieving a generation throughput of 1 token/s for OPT-175B on a single 16GB GPU. With FlexGen, models with 30B parameters can be evaluated with one 16GB GPU on a wide range of HELM benchmark scenarios in less than 24 hours.
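As a rough illustration of the 4-bit compression ingredient, here is a minimal sketch of group-wise 4-bit quantization of a tensor. The group size, rounding scheme, and function names are assumptions for illustration, not the FlexGen implementation, and the sketch keeps one 4-bit code per byte for clarity (in practice two codes would be packed per byte).

```python
# Group-wise 4-bit quantization/dequantization sketch (illustrative, not FlexGen's code).
import torch

def quantize_4bit(x: torch.Tensor, group_size: int = 64):
    # Quantize a flat tensor in groups: each group stores 4-bit integer codes
    # plus a per-group (min, scale) pair used for dequantization.
    groups = x.reshape(-1, group_size)
    mins = groups.min(dim=1, keepdim=True).values
    scales = (groups.max(dim=1, keepdim=True).values - mins).clamp(min=1e-8) / 15
    codes = torch.round((groups - mins) / scales).to(torch.uint8)   # values in 0..15
    return codes, mins, scales

def dequantize_4bit(codes, mins, scales):
    # Reconstruct an approximation of the original tensor.
    return (codes.float() * scales + mins).reshape(-1)

w = torch.randn(1024)                       # e.g. a slice of weights or KV cache
codes, mins, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, mins, scales)
print(f"max reconstruction error: {(w - w_hat).abs().max():.4f}")
```

Compressing weights and the KV cache this way shrinks what has to be moved between disk, CPU, and GPU, which is exactly where offloaded inference spends most of its time.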
Which Tricks are Important for Learning to Rank? by Ivan Lyzhin, Aleksei Ustimenko, Andrey Gulin, Liudmila Prokhorenkova
The best-known learning-to-rank algorithm, LambdaMART, is based on gradient-boosted decision trees (GBDT) and was proposed more than a decade ago. Since then, several other GBDT-based ranking methods have been proposed. In this paper, we thoroughly analyze these methods in a unified setup and show which algorithmic details are important for GBDT-based ranking. In addition to LambdaMART, we cover another long-known algorithm called YetiRank and the recently proposed StochasticRank. We empirically show that YetiRank outperforms its competitors in most cases. We also propose a simple improvement of the YetiRank approach that allows for optimizing specific ranking loss functions. As a result, we gain insights into learning-to-rank techniques and obtain a new state-of-the-art algorithm.
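For readers unfamiliar with this family of methods, the sketch below shows the textbook "lambda" gradients at the core of LambdaMART-style GBDT rankers: for each document pair within a query, the gradient is a pairwise logistic term scaled by the ranking-metric change from swapping the pair. This is a standard illustration under simplified assumptions, not the exact formulations compared in the paper (YetiRank and StochasticRank differ in how the pair weights are defined).

```python
# Textbook lambda-gradient computation for one query (illustrative sketch).
import numpy as np

def dcg_discounts(n: int) -> np.ndarray:
    # Position discounts 1 / log2(rank + 2) for ranks 0..n-1.
    return 1.0 / np.log2(np.arange(n) + 2)

def lambda_gradients(scores: np.ndarray, relevance: np.ndarray) -> np.ndarray:
    order = np.argsort(-scores)                 # current ranking by model score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(len(scores))       # rank of each document
    discounts = dcg_discounts(len(scores))
    gains = 2.0 ** relevance - 1.0
    lambdas = np.zeros_like(scores, dtype=float)
    for i in range(len(scores)):
        for j in range(len(scores)):
            if relevance[i] <= relevance[j]:
                continue                        # only pairs where i should rank above j
            # |ΔDCG| if documents i and j swapped positions
            # (NDCG adds a constant per-query normalization).
            delta = abs((gains[i] - gains[j]) *
                        (discounts[ranks[i]] - discounts[ranks[j]]))
            rho = 1.0 / (1.0 + np.exp(scores[i] - scores[j]))
            lambdas[i] += delta * rho           # push the more relevant document up
            lambdas[j] -= delta * rho           # push the less relevant document down
    return lambdas

scores = np.array([0.2, 1.5, 0.3, 0.9])
relevance = np.array([3, 0, 2, 1])
print(lambda_gradients(scores, relevance))
```

Each boosting iteration fits a regression tree to these per-document lambdas, which is what ties the pairwise ranking objective to the GBDT machinery analyzed in the paper.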