Join operator is one of the standard operations available in relational databases. In a large-scale distributed scenario, efficiently implementing joins poses unique challenges as the data is usually spread around a cluster of machines instead of stored on a single machine. The goal of this talk is to illustrate the approach to implementing distributed joins in the CrateDB database that exhibits significant performance improvements compared to the existing algorithms. In the first part of the talk, we will cover the limitations of the nested loop and block nested loop join algorithms. The second part will show how the hash algorithm can work in distributed settings by addressing some of its memory limitations. Finally, we will introduce the distributed block hash join algorithm and how it enables CrateDB to analyze massive amounts of data 23 thousand times faster compared to the initial nested loop implementation.
In relational databases, join operators are usually implemented with nested or block nested loop algorithms. However, in a large-scale distributed scenario, the ability to efficiently query massive amounts of data can be challenging. Traditional approaches in implementations of join operators are no longer enough to achieve the high performance of complex joins in distributed data processing systems. For instance, a nested loop algorithm is relatively simple to implement and could be easily adjusted to execute distributed joins. Unfortunately, it comes with a high-performance cost that equals quadratic time complexity with respect to the number of rows of the two tables joined.
This talk will show our approach to implementing distributed equi-join operator in CrateDB that exhibits significant performance improvements compared to the original nested loop algorithm. CrateDB is an open-source, distributed SQL database that runs queries on millions of data records daily. It scales up to hundreds of nodes and PBs of indexed data making the performance of join operators highly important: it is required to have efficient algorithms that can scale with the input size.
More specifically, we explore the implementation of the distributed block hash join algorithm. First, we address the memory limitations of the basic hash join algorithm with a switch to block-based processing. Block-based processing refers to a procedure of dividing a large dataset up into smaller blocks that can be worked on separately. As those blocks can be distributed across the CrateDB cluster the join can be executed in parallel using multiple nodes for increased performance and load distribution. Second, we illustrate the changes in the single node block hash join algorithm to enable its distributed execution.
To evaluate the performance of distributed block hash join algorithm, we run CrateDB benchmarks against two algorithms: the original nested loop algorithm and the single node block hash join algorithm. The benchmark consists of queries with join operators and runs on tables of various sizes, up to 50 million rows. The final result illustrates that the distributed block hash join algorithm enables CrateDB to analyze massive amounts of data 23 thousand times faster than the initial nested loop implementation.