On December 6, at the NeurIPS conference in Montreal, the Yandex team presented the paper “CatBoost: unbiased boosting with categorical features”. This work introduces and studies two key algorithmic techniques of CatBoost, an open-source machine learning library developed by Yandex and used in its numerous services.
These techniques are based on the same Ordering Principle and solve two different problems. The first technique avoids target leakage, a special kind of overfitting that arises in all previous implementations of gradient boosting and makes their predictions biased. The second one effectively converts categorical features to numerical ones before they are used in the training process. The Ordering Principle consists of the following steps. First, the training examples are ordered, either by their natural temporal order or by a random permutation (depending on the task). Then, to obtain a prediction for each example in the boosting process, CatBoost uses only the examples preceding it, which makes the obtained values unbiased. In a similar way, to convert a categorical feature of an example to a numerical value, CatBoost uses only the preceding examples.
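To make the second application of the Ordering Principle concrete, here is a minimal stdlib-only sketch of an ordered target statistic: each example's categorical value is replaced by a smoothed mean of the targets of the *preceding* examples with the same category, so the example's own label never leaks into its encoding. The function name, the smoothing scheme, and the parameters (`prior`, `seed`) are illustrative, not CatBoost's actual implementation.

```python
import random

def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    """Encode a categorical feature using only preceding examples
    (a sketch of the Ordering Principle; names and smoothing are
    illustrative). Each example gets the smoothed mean target of the
    earlier examples sharing its category in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the training set
    sums = {}    # running sum of targets seen so far, per category
    counts = {}  # running count of examples seen so far, per category
    encoded = [0.0] * len(categories)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # Smoothed mean over preceding examples only; the example's own
        # target is added to the running statistics *after* encoding it.
        encoded[i] = (s + prior) / (n + 1)
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded
```

Note that the first example visited for each category receives only the prior, since no preceding examples of that category exist in the chosen order; this is the price paid for an unbiased encoding.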
In the experiments described in the paper, these techniques substantially improve the quality of classification models trained by CatBoost. Moreover, their combination allows CatBoost to significantly outperform XGBoost and LightGBM. The results were well received by the conference audience, many of whom planned to try CatBoost for their own machine learning tasks. If you are interested, see the CatBoost website and the source code.