Data Analytics with MySQL, Apache Spark and Apache Drill

FOSDEM 2017

Apache Spark is a cluster computing framework, similar to Apache Hadoop. There are a number of tasks where MySQL does not perform well: for example, MySQL is not a massively parallel system, and a single query will only use one CPU core. Spark, on the other hand, is designed to be massively parallel. In addition, Spark is a clustering framework, so you can easily add more compute nodes to let Spark use more resources and scale.

Apache Drill is a similar project that aims to make data discovery easier. For example, it allows you to join data sources in MySQL, MongoDB, flat files, other RDBMSs, etc.

In this talk I will demonstrate how to use Apache Spark together with MySQL for data analysis. I will show how Apache Spark aggregates data (Wikipedia pageview statistics) and stores the result set in MySQL. I will also show how to use Apache Spark with multiple sources and join virtual tables from MySQL, flat files and even MongoDB.
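
As a rough illustration of the aggregate-and-store flow described above, here is a minimal PySpark sketch. It is not the code from the talk. It assumes the input lines have the space-separated form "project page_title view_count bytes", and the function names, column names, JDBC URL, table name, credentials and driver class are all hypothetical placeholders. It also assumes the MySQL Connector/J jar is available on the Spark classpath.

```python
def parse_line(line):
    """Parse one pageview line into (project, title, views), or None if malformed.

    Assumed format: "project page_title view_count bytes", space separated.
    """
    parts = line.strip().split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return (project, title, int(views))
    except ValueError:
        return None


def aggregate_to_mysql(input_path, jdbc_url, table, user, password):
    """Sum page views per project with Spark and write the totals to MySQL."""
    # Imported here so that parse_line can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pageview-aggregation").getOrCreate()

    # Read the raw text files in parallel and drop lines that fail to parse.
    rows = (
        spark.sparkContext.textFile(input_path)
        .map(parse_line)
        .filter(lambda row: row is not None)
    )
    df = spark.createDataFrame(rows, ["project", "title", "views"])

    # The aggregation is distributed across the available executors.
    totals = df.groupBy("project").agg(F.sum("views").alias("total_views"))

    # Store the (small) result set in MySQL over JDBC.
    # "overwrite" replaces the target table if it already exists.
    (
        totals.write.format("jdbc")
        .option("url", jdbc_url)          # e.g. "jdbc:mysql://localhost:3306/wikistats" (placeholder)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("driver", "com.mysql.cj.jdbc.Driver")  # assumption: Connector/J 8.x class name
        .mode("overwrite")
        .save()
    )
    spark.stop()
```

The heavy work (parsing and summing the raw pageview files) runs in Spark, while MySQL only receives the aggregated totals, which can then be queried with ordinary SQL.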

Speakers: Sveta Smirnova, Alexander Rubin