Artem Babenko

TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling
Tabular data
Yury Gorishniy
Akim Kotelnikov
Artem Babenko
ICLR, 2025
Deep learning architectures for supervised learning on tabular data range from simple multilayer perceptrons (MLP) to sophisticated Transformers and retrieval-augmented methods. This study highlights a major, yet so far overlooked opportunity for substantially improving tabular MLPs; namely, parameter-efficient ensembling -- a paradigm for imitating an ensemble of models with just one model. We start by describing TabM -- a simple model based on MLP and BatchEnsemble (an existing technique), improved with our custom modifications. Then, we perform a large scale evaluation of tabular DL architectures on public benchmarks in terms of both task performance and efficiency, which renders the landscape of tabular DL in a new light. In particular, we find that TabM outperforms prior tabular DL models, while the complexity of attention- and retrieval-based methods does not pay off. Lastly, we conduct a detailed empirical analysis, that sheds some light on the high performance of TabM. For example, we show that parameter-efficient ensembling is not an arbitrary trick, but rather a highly effective way to reduce overfitting and improve optimization dynamics of tabular MLPs. Overall, our work brings an impactful technique to tabular DL, analyses its behaviour, and advances the performance-efficiency tradeoff with TabM -- a simple and powerful baseline for researchers and practitioners.
TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks
Tabular data
Ivan Rubachev
Nikolay Kartashev
Yury Gorishniy
Artem Babenko
ICLR, 2025
Advances in machine learning research drive progress in real-world applications. To ensure this progress, it is important to understand the potential pitfalls on the way from a novel method's success on academic benchmarks to its practical deployment. In this work, we analyze existing tabular deep learning benchmarks and find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature. First, in real-world deployment scenarios, distribution of data often changes over time. To account for this distribution drift, time-based train/test splits should be used in evaluation. However, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. This can have an impact on the absolute and relative number of predictive, uninformative, and correlated features compared to academic datasets. In this work, we aim to understand how recent research advances in tabular deep learning transfer to these underrepresented conditions. To this end, we introduce TabReD — a collection of eight industry-grade tabular datasets. We reassess a large number of tabular ML models and techniques on TabReD. We demonstrate that evaluation on both time-based data splits and richer feature sets leads to different methods ranking, compared to evaluation on random splits and smaller number of features, which are common in academic benchmarks. Furthermore, simple MLP-like architectures and GBDT show the best results on the TabReD datasets, while other methods are less effective in the new setting.
Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Generative models Computer vision
Nikita Starodubcev
Mikhail Khoroshikh
Artem Babenko
Dmitry Baranchuk
NeurIPS, 2024
Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.

Heterophilous graph datasets
Graph machine learning
Oleg Platonov
Denis Kuznedelev
Michael Diskin
Artem Babenko
Liudmila Prokhorenkova
A graph dataset is called heterophilous if nodes prefer to connect to other nodes that are not similar to them. For example, in financial transaction networks, fraudsters often perform transactions with non-fraudulent users, and in dating networks, most connections are between people of opposite genders. Learning under heterophily is an important subfield of graph ML. Thus, having diverse and reliable benchmarks is essential.

We propose a benchmark of five diverse heterophilous graphs that come from different domains and exhibit a variety of structural properties. Our benchmark includes a word dependency graph Roman-empire, a product co-purchasing network Amazon-ratings, a synthetic graph emulating the minesweeper game Minesweeper, a crowdsourcing platform worker network Tolokers, and a question-answering website interaction network Questions.
Text-to-Image dataset for billion-scale similarity search
Nearest neighbor search Natural language processing Computer vision
Dmitry Baranchuk
Artem Babenko
Yandex Text-to-Image (T2I) dataset is collected to foster the research in billion-scale nearest neighbor search (NNS) when query distribution differs from the indexing one. In particular, this dataset addresses the cross-domain setting: a user specifies a textual query and requests the search engine to retrieve the most relevant images to the query. Notably, current large-scale indexing methods perform poorly in this setting. Therefore, novel highly-performant indexing solutions robust to out-of-domain queries are in high demand.

The dataset represents a snapshot of the Yandex visual search engine and contains 1 billion 200-dimensional image embeddings for indexing. The image embeddings are produced by the Se-ResNext-101 model. The embeddings for textual queries are extracted by a variant of the DSSM model.

Read more about the data format and how to download the dataset in the related post.

Artem Babenko

Publications

TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling

TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks

Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps

Posts

Benchmarks for Billion-Scale Similarity Search

Datasets

Heterophilous graph datasets

Text-to-Image dataset for billion-scale similarity search