Architecture for the extraction, automation and massive data processing

PyCon Sweden 2021

Live broadcast: https://www.youtube.com/watch?v=OcgLuOs1Hrc

This talk presents a solution whose architecture integrates several components: computational resources, databases, our own Python applications and other open source ones. The idea is to show the problems and challenges posed by traditional scraping and how we have been able to build solutions that reduce them, especially when the goal is to scrape en masse and in parallel. It also covers building an automated flow for post-processing and transforming the data using machine learning services such as NLP and classification.

Because web content varies so widely in format and technology, the talk proposes a microservice architecture built in Python that integrates a workflow with advanced scraping techniques and supports transforming the extracted data, up to applying NLP and ML classification services. The proposal involves Linux, PostgreSQL, Redis, MongoDB, ClickHouse and Airflow, among others, but above all in-house developments and frameworks, which consider not only the extraction process but also RAM consumption, parallel processing and even blocking by websites, as well as the analysis and transformation of the data obtained.
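
As a rough illustration of the concerns named above (parallel extraction, website blocking, and a transformation step), here is a minimal sketch in Python. It is not the speaker's framework: the names `Blocked`, `fetch_with_retry`, `transform` and `run_pipeline` are hypothetical, the fetcher is passed in by the caller, and the transform is a trivial stand-in for real NLP or classification services.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class Blocked(Exception):
    """Hypothetical signal: the fetcher raises this when a site refuses us."""


def fetch_with_retry(fetch, url, retries=3, base_delay=0.01):
    """Call fetch(url), backing off exponentially each time the site blocks us."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Blocked:
            time.sleep(base_delay * 2 ** attempt)
    return None  # give up after `retries` blocked attempts


def transform(raw):
    """Stand-in for post-processing: collapse whitespace and lowercase."""
    return " ".join(raw.split()).lower()


def run_pipeline(fetch, urls, max_workers=4):
    """Extract pages with a bounded thread pool, then transform the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pages = list(pool.map(lambda u: fetch_with_retry(fetch, u), urls))
    # URLs that stayed blocked are dropped rather than failing the whole run.
    return {u: transform(p) for u, p in zip(urls, pages) if p is not None}
```

Bounding `max_workers` is one simple way to cap memory use and request pressure on a target site; in a setup like the one described, a scheduler such as Airflow would presumably invoke a step like `run_pipeline` as one task and hand its output to later processing tasks.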

Speakers: Alfonso de la Guarda