Using BigBench to compare Hive and Spark versions and features

FOSDEM 2017

BigBench is the brand new standard for benchmarking and testing Big Data systems. This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in versions 1 and 2 of each, under different configurations. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis.

BigBench is the brand new standard (TPCx-BB) for benchmarking and testing Big Data systems. The BigBench specification describes several application use cases that combine SQL queries, Map/Reduce, user code (UDF), Machine Learning, and even streaming. Using the available implementation, we can test different framework combinations, such as Hadoop+Hive+Tez (with Mahout) and Spark (SparkSQL+SparkML), in their different versions and configurations, helping us spot problems and possible optimizations in our data stacks.

This talk first introduces BigBench and the problems it can solve. It then presents benchmark results for both Hive and Spark, in versions 1 and 2 of each, under different configurations including Tez, LLAP, and file formats. Experiments are run on Cloud and On-Prem clusters with different numbers of nodes and test data scales, taking into account both interactive and batch usage. Results are further classified by use case, showing where each platform shines (or doesn't), and why, based on performance metrics and log-file analysis. The talk concludes with the main findings and the scalability and limits of each framework.

Speakers: Nicolas Poggi, Alejandro Montero