Running machine learning models in a production environment brings its own challenges. In this talk we would like to present our solution for the machine learning lifecycle of the text-based catalog classification system at idealo.de. We will share lessons learned from migrating this lifecycle from a hosted cluster to a cloud solution over the last three years. In addition, we will outline how we embedded our ML components in the overall idealo.de processing architecture.
idealo.de offers a price comparison service for millions of products across a wide variety of categories. Offers are classified automatically using both traditional and deep learning-based approaches. Our machine learning components are part of a fully automated lifecycle and process up to 500 million offers per day at peak times.
Beyond the enormous amount of data we process, we particularly face the challenge of staying online 24/7 while adapting to an ever-changing catalog structure. This demands a highly reliable inference service as well as continuous, automated retraining and model deployment.
In particular, we will present our view on MLOps:
- How we integrate our CI/CD and continuous-training pipelines with GitHub and AWS SageMaker (see the pipeline sketch after this list)
- How we migrated the lifecycle from a hosted cluster (running Kubernetes, Argo Workflows, and ArgoCD) to the cloud (running AWS SageMaker and a data lake)
- How we monitor our models, data, and performance indicators, and alert in case of disruptions
- How we embed the classifiers in an event-driven, heterogeneous software architecture (based on Kotlin and Python)
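
To make the first point concrete, here is a minimal, illustrative sketch of how a continuous-training pipeline can be defined with the SageMaker Python SDK. All names (role ARN, image URI, S3 paths) are placeholders rather than our actual setup:

```python
# Minimal sketch of a continuous-training pipeline using the SageMaker
# Python SDK. All names below are placeholders, not idealo's configuration.
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

ROLE = "arn:aws:iam::123456789012:role/sagemaker-execution"  # placeholder

# The Estimator wraps the training container and its compute resources.
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.eu-central-1.amazonaws.com/classifier:latest",
    role=ROLE,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-bucket/models/",
)

# A single training step; a real pipeline would add evaluation and
# conditional model-registration steps behind it.
train_step = TrainingStep(
    name="TrainOfferClassifier",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://example-bucket/data/train/")},
)

pipeline = Pipeline(name="offer-classifier-retraining", steps=[train_step])

# upsert() creates or updates the pipeline definition; a CI job (e.g.
# triggered by a GitHub merge) can call this to keep it in sync.
pipeline.upsert(role_arn=ROLE)
```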
We will also share lessons learned on:
- How we keep reliability high while deploying, updating, and scaling our classification inference services
- How we strike a workable compromise between performance and cost requirements (see the autoscaling sketch below)
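
One common lever for this performance/cost trade-off is target-tracking autoscaling of an inference endpoint. The sketch below shows this pattern for a SageMaker endpoint; the endpoint and variant names are hypothetical, not taken from our system:

```python
# Illustrative sketch: target-tracking autoscaling for a SageMaker endpoint,
# one way to balance latency and reliability against instance cost.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/offer-classifier/variant/AllTraffic"  # hypothetical

# Register the endpoint variant as a scalable target (1-4 instances).
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance: add capacity under load, remove it
# (and its cost) when traffic drops.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,  # target invocations per instance per minute
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
    },
)
```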