Internet companies use crowdsourcing to collect large amounts of data needed for creating products based on machine learning techniques. A significant source of such labels for OCR data sets is (re)CAPTCHA, which distinguishes humans from automated bots by asking them to recognize text and, at the same time, receives new labeled data in this way. An important component of such approach to data collection is the reduction of noisy labels produced by bots and non-qualified users.
In this paper, we address the problem of labeling text images via CAPTCHA, where user identification is generally impossible. We propose a new algorithm to aggregate multiple guesses collected through CAPTCHA. We employ incremental relabeling to minimize the number of guesses needed for obtaining the recognized text of a good accuracy. The aggregation model and the stopping rule for our incremental relabeling are based on novel machine learning techniques and use meta features of CAPTCHA tasks and accumulated guesses. Our experiments show that our approach can provide a large amount of accurately recognized texts using a minimal number of user guesses. Finally, we report the great improvements of an optical character recognition model after implementing our approach in Yandex.