Papers accepted to NeurIPS

October 19, 2021

We are happy to announce that eight papers, two benchmarks and one demo were accepted for publication at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). 


Revisiting Deep Learning Models for Tabular Data by Yury Gorishniy, Ivan Rubachev, Valentin Khrulkov and Artem Babenko

Tabular data problems are ubiquitous, highly practical and cover many business domains, motivating researchers to develop new neural network architectures. At the same time, the field lacks strong baselines, i.e., powerful but simple and easy-to-use models. In this work, we propose two such models. The first is a fast ResNet-like architecture that already matches (and often surpasses) the performance of many existing sophisticated architectures. The second is FT-Transformer, our adaptation of the Transformer architecture for tabular data. It is slower but demonstrates the best performance among deep learning models on the tasks where gradient boosting dominates.
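The ResNet-like baseline can be sketched in a few lines. This is a minimal illustration of the general idea (normalize, transform, add back the input), not the paper's exact architecture; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def resnet_block(x, w1, w2):
    # One ResNet-style block for tabular features:
    # normalize -> linear -> ReLU -> linear -> residual connection.
    h = (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-5)
    h = np.maximum(h @ w1, 0.0)   # linear layer followed by ReLU
    return x + h @ w2             # the skip connection keeps the input pathway

d, hidden = 8, 16
x = rng.normal(size=(4, d))             # a batch of 4 rows of tabular features
w1 = rng.normal(size=(d, hidden)) * 0.1
w2 = rng.normal(size=(hidden, d)) * 0.1
out = resnet_block(x, w1, w2)
print(out.shape)  # (4, 8): output shape equals input shape, so blocks stack
```

Because each block preserves the feature dimension, the same function can be applied repeatedly to build a deeper model.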

Moshpit SGD: Communication-Efficient Decentralized Training on Heterogeneous Unreliable Devices by Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk and Gennady Pekhimenko

Training of deep neural networks is often accelerated by combining the power of multiple servers with distributed algorithms. Unfortunately, communication-efficient versions of these algorithms frequently require reliable high-speed connections usually available only in dedicated clusters. This work proposes Moshpit All-Reduce — a fault-tolerant scalable algorithm for decentralized averaging with convergence properties comparable to those of regular distributed approaches. We show that Moshpit SGD, a distributed optimization method based on this algorithm, has both strong theoretical guarantees and high practical efficiency. In particular, we demonstrate gains of 1.3-1.5x in large-scale deep learning experiments such as ImageNet classification with ResNet-50 or ALBERT-large pretraining on BookCorpus.
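The averaging primitive behind this approach can be illustrated with a toy simulation. Note the simplifications: real Moshpit All-Reduce arranges peers on a grid of groups and tolerates failures, whereas here groups are drawn uniformly at random and every "peer" is reliable.

```python
import random

def averaging_round(values, group_size):
    # One round: shuffle peers into random groups and replace each peer's
    # value with its group's average (an all-reduce within the group).
    idx = list(range(len(values)))
    random.shuffle(idx)
    out = values[:]
    for start in range(0, len(idx), group_size):
        group = idx[start:start + group_size]
        avg = sum(values[i] for i in group) / len(group)
        for i in group:
            out[i] = avg
    return out

random.seed(0)
values = [float(i) for i in range(16)]  # each "peer" holds one parameter value
target = sum(values) / len(values)      # the exact global average
for _ in range(8):                      # repeated rounds mix the groups
    values = averaging_round(values, group_size=4)
print(max(abs(v - target) for v in values))  # spread shrinks toward 0
```

Each round preserves the global mean exactly while shrinking the spread of the peers' values, so repeated rounds drive every peer toward the global average without any central coordinator.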

Good Classification Measures and How to Find Them by Martijn Gösgens, Anton Zhiyanov, Alexey Tikhonov and Liudmila Prokhorenkova

Several measures can be used for evaluating classification results: accuracy, F-measure, Cohen's Kappa, and so on. Can we say that some of them are better than others, or, ideally, choose one best measure for all situations? We conduct a systematic theoretical analysis to answer this question: we formally define a list of desirable properties and theoretically analyze which measures satisfy which properties. We also prove an impossibility theorem: some desirable properties cannot be simultaneously satisfied. Finally, we propose a new family of measures that generalizes known measures with good properties and satisfies all desirable properties except one.
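For concreteness, the three measures mentioned above can be computed from scratch. These are the standard textbook definitions, not anything specific to the paper:

```python
def accuracy(y_true, y_pred):
    # Fraction of predictions that match the true labels.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred):
    # Harmonic mean of precision and recall for the positive class.
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn)

def cohens_kappa(y_true, y_pred):
    # Observed agreement corrected for chance: (p_o - p_e) / (1 - p_e).
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
print(accuracy(y_true, y_pred))      # 0.75
print(f1_binary(y_true, y_pred))     # 0.75
print(cohens_kappa(y_true, y_pred))  # 0.5
```

Even on this tiny example the measures already take different values, and in general they can rank the same pair of classifiers differently — which is exactly why choosing a measure deserves the systematic analysis the paper provides.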

Distributed Deep Learning In Open Collaborations by Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitriy Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Yacine Jernite, Thomas Wolf and Gennady Pekhimenko

Training the most powerful neural networks requires computational resources that are often unavailable outside of large organizations, ultimately slowing down scientific progress. In this work, we propose an approach that allows training large neural networks in collaborations that can span the entire globe. Our method, named DeDLOC, can adapt to different hardware and network conditions, which makes it significantly more efficient than standard methods designed for uniform setups. We demonstrate the beneficial properties of DeDLOC in cost-efficient cloud setups and a volunteer experiment, training a high-quality language model for Bengali with 40 participants.

Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets by Max Ryabinin, Andrey Malinin and Mark Gales

Ensembles of machine learning models produce improved system performance as well as robust and interpretable uncertainty estimates. However, their inference costs can be prohibitively high. Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture the predictive performance and uncertainty estimates of an ensemble. For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion. Although the criterion is theoretically principled, this work shows that it exhibits poor convergence when applied to large-scale tasks where the number of classes is very high. We propose a new training objective — the reverse KL-divergence to a Proxy-Dirichlet target derived from the ensemble — which resolves the gradient issues of the Ensemble Distribution Distillation approach. We demonstrate this on the ImageNet and WMT17 En-De datasets containing 1,000 and 40,000 classes, respectively.

Overlapping Spaces for Compact Graph Representations by Kirill Shevkunov and Liudmila Prokhorenkova 

We introduce overlapping spaces that can be used for compact graph embeddings. The principal idea is to allow subsets of coordinates to be shared between spaces of different types (Euclidean, hyperbolic, spherical). As a result, parameter optimization automatically learns an optimal combination of spaces. Additionally, overlapping spaces allow for more compact representations due to their complex geometry.  
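The idea of shared coordinates can be sketched with a simple two-space example. This is a simplified illustration under our own assumptions (one fixed way of combining the components, hand-picked coordinate ranges), not the paper's learned construction:

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def spherical(u, v):
    # Angular distance between the projections onto the unit sphere.
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))

def overlapping_distance(u, v, euc_end, sph_start):
    # Coordinates [0:euc_end] live in the Euclidean component and
    # coordinates [sph_start:] in the spherical one; when sph_start < euc_end,
    # the coordinates in between participate in both geometries.
    d_e = euclidean(u[:euc_end], v[:euc_end])
    d_s = spherical(u[sph_start:], v[sph_start:])
    return d_e + d_s  # one simple way to combine the two components

u = [0.1, 0.5, -0.3, 0.8, 0.2, -0.1]
v = [0.0, 0.4, 0.1, 0.7, -0.2, 0.3]
print(overlapping_distance(u, v, euc_end=4, sph_start=2))
```

Sharing coordinates this way is what makes the representation compact: a six-dimensional vector serves both a four-dimensional Euclidean component and a four-dimensional spherical one.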

Distributed Saddle-Point Problems Under Similarity by Aleksandr Beznosikov, Gesualdo Scutari, Alexander Rogozin and Alexander Gasnikov

We study solution methods for saddle point problems over networks of two types: master/workers (centralized) architectures and meshed (decentralized) networks. The local functions at each node are assumed to be similar, due to statistical data similarity or otherwise. We establish lower complexity bounds for a fairly general class of algorithms solving saddle point problems. We show that in such a setup, the number of communications can be significantly reduced and in some cases does not depend on the properties of the functions at all. We then propose optimal algorithms matching the lower bounds over either type of network.
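For readers new to saddle-point problems, here is what solving one looks like in the simplest single-machine case. This is the textbook extragradient method on a toy bilinear problem, not the distributed algorithms proposed in the paper:

```python
# Extragradient on the toy bilinear saddle-point problem min_x max_y x*y,
# whose unique saddle point is (0, 0). Plain gradient descent-ascent
# spirals away from the saddle point on this problem; the extra
# look-ahead step is what makes the iterates converge.
def extragradient(x, y, lr=0.1, steps=2000):
    for _ in range(steps):
        # Extrapolation (look-ahead) point.
        xh = x - lr * y   # gradient of x*y with respect to x is y
        yh = y + lr * x   # gradient of x*y with respect to y is x
        # Update using the gradients evaluated at the look-ahead point.
        x, y = x - lr * yh, y + lr * xh
    return x, y

x, y = extragradient(1.0, 1.0)
print(abs(x), abs(y))  # both close to 0, the saddle point
```

The paper's setting replaces this single objective with a sum of similar local objectives held by different nodes, and asks how few communication rounds suffice to solve it.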

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay by Katerina Lobacheva, Maxim Kodryan, Nadezhda Chirkova, Andrey Malinin and Dmitry Vetrov 

Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent work shows their common usage may cause instabilities at the late stages of training. Other work, in contrast, shows convergence to the equilibrium, i.e., the stabilization of training metrics. We study this contradiction and show that the training dynamics converge to consistent periodic behavior instead of converging to a stable equilibrium. Specifically, the training process regularly exhibits instabilities that do not lead to complete failure but cause a new training period. We rigorously investigate the mechanism underlying this periodic behavior and show that it is caused by the interaction between batch normalization and weight decay. 

Datasets & Benchmarks:

Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks by Andrey Malinin, Neil Band, Alexander Ganshin, German Chesnokov, Yarin Gal, Mark J. F. Gales, Alexey Noskov, Andrey Ploskonosov, Liudmila Prokhorenkova, Ivan Provilkov, Vatsal Raina, Vyas Raina, Denis Roginskiy, Mariya Shmatova, Panos Tigas and Boris Yangel

Researchers have rarely considered developing standard datasets and benchmarks for assessing robustness to distributional shift and uncertainty estimation. Instead, most work in this area has focused on developing new techniques based on small-scale regression or image classification tasks. However, many practical tasks have different modalities, such as tabular data, audio, text, or sensor data, that offer significant challenges involving regression and discrete or continuous structured prediction. In this work, we propose the Shifts Dataset to evaluate uncertainty estimates and robustness to distributional shifts across various tasks and data modalities. The dataset, collected from industrial sources and services, comprises three tasks, each corresponding to a particular data modality: tabular weather prediction, machine translation and self-driving vehicle motion prediction. Real distributional shifts affect these data modalities and tasks and pose exciting challenges concerning uncertainty estimation. We hope that the dataset will enable researchers to meaningfully evaluate the plethora of recently developed uncertainty quantification methods, as well as assessment criteria and state-of-the-art baselines.

CrowdSpeech and Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription by Nikita Pavlichenko, Ivan Stelmakh and Dmitry Ustalov

Audio transcription is one of the most popular crowdsourcing tasks. Since crowdsourced annotations are noisy, every recording receives transcriptions from multiple performers, which requires aggregation — choosing the best possible transcription. To benchmark transcription aggregation methods, we present CrowdSpeech, the first large-scale crowdsourced transcription dataset for English, based on recordings from the popular LibriSpeech dataset annotated on the Toloka crowdsourcing platform. We also offer a crowdsourcing pipeline, Vox DIY, for creating similar datasets for an arbitrary natural language and release such a dataset for Russian. Our evaluation shows that language models for text summarization outperform all other techniques, including retrieval-based methods, establishing a new state-of-the-art in this task.
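To make the aggregation problem concrete, here is the simplest possible baseline: a position-wise majority vote over the words of competing transcriptions. This naive sketch assumes the transcriptions are already roughly aligned word by word; the methods benchmarked in the paper are considerably more sophisticated.

```python
from collections import Counter
from itertools import zip_longest

def majority_vote(transcriptions):
    # Naive aggregation: split each transcription into words, then take a
    # majority vote position by position (real methods align words first).
    split = [t.split() for t in transcriptions]
    result = []
    for words in zip_longest(*split, fillvalue=None):
        winner, _ = Counter(w for w in words if w is not None).most_common(1)[0]
        result.append(winner)
    return " ".join(result)

answers = [
    "the quick brown fox",
    "the quick browne fox",
    "a quick brown fox",
]
print(majority_vote(answers))  # "the quick brown fox"
```

Even this crude vote recovers the correct sentence when each performer makes a different isolated mistake; the hard cases are correlated errors and misaligned word sequences, which is what dedicated aggregation methods target.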


Demo:

Training Transformers Together by Max Ryabinin, Michael Diskin, Tim Dettmers, Lucile Saulnier, Quentin Lhoest, Alexander Borzunov, Yacine Jernite, and Thomas Wolf

We invite volunteers to train a large Transformer language model over the Internet. Instead of using supercomputers, we will pool together all available computational resources: desktops, laptops, servers and even cloud TPUs from around the world. All training artifacts, such as model checkpoints and optimizer states, will be shared online for public use. For this demonstration, we will provide an open-source starter kit that volunteers can use to join the globally distributed training run and host similar experiments independently in the future.