Having a separate cluster of Metadata Servers (MDSs) is a well-known design strategy among distributed file-system architectures. One challenge faced by this approach is how to distribute metadata among the MDSs. Unlike data storage and its associated I/O throughput, which can be scaled linearly with the number of storage devices, file-system metadata is difficult to scale because of its hierarchical nature. At first glance, a pure hashing-based metadata distribution strategy seems like a perfect fit, but this is not exactly the case. What are the pitfalls? POSIX traversal semantics cause too many inter-MDS hops, and the loss of hierarchical locality degrades file-system performance. As a result, pure hashing is not beneficial for a workload whose directory tree grows in depth rather than breadth. CephFS's metadata balancer takes a different approach: it partitions metadata sub-trees across MDSs, thereby preserving locality. Although efficient, this involves a lot of back-and-forth migration of sub-trees, and the locality benefits are sometimes outweighed by sub-optimal distributions.
In this talk, we present a new metadata distribution strategy employed in CephFS: Ephemeral Pinning. This strategy combines the benefits of hashing and naive sub-tree partitioning by intelligently pinning sub-trees to MDSs so as to obtain a balanced distribution as the workload metadata grows in depth and breadth. A load balancer based on consistent hashing helps maintain an optimal distribution when MDSs are added or fail.
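To make the consistent-hashing idea concrete, here is a minimal sketch of a hash ring that maps directory sub-trees to MDS ranks. It is an illustration under stated assumptions, not the CephFS implementation: the class name `HashRing`, the `vnodes` parameter, the `mds.<rank>#<i>` point labels, and the choice of SHA-256 are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping sub-tree paths to MDS ranks."""

    def __init__(self, ranks, vnodes=64):
        self.vnodes = vnodes   # virtual points per MDS, to smooth the spread
        self._points = []      # sorted list of (ring position, rank)
        for rank in ranks:
            self.add(rank)

    def add(self, rank: int) -> None:
        """Place an MDS on the ring at `vnodes` pseudo-random positions."""
        for i in range(self.vnodes):
            bisect.insort(self._points, (_hash(f"mds.{rank}#{i}"), rank))

    def remove(self, rank: int) -> None:
        """Drop every point owned by a failed or retired MDS."""
        self._points = [p for p in self._points if p[1] != rank]

    def lookup(self, subtree: str) -> int:
        """Return the rank owning the first point at or after the sub-tree's hash."""
        i = bisect.bisect_left(self._points, (_hash(subtree),))
        return self._points[i % len(self._points)][1]  # wrap around the ring


if __name__ == "__main__":
    dirs = [f"/home/user{i}" for i in range(1000)]
    ring = HashRing([0, 1, 2])
    before = {d: ring.lookup(d) for d in dirs}
    ring.remove(1)  # simulate the failure of one MDS
    moved = sum(1 for d in dirs if ring.lookup(d) != before[d])
    print(f"{moved} of {len(dirs)} sub-trees changed owner")
```

In this sketch, removing an MDS reassigns only the sub-trees that the removed MDS owned, and adding an MDS moves sub-trees only onto the new one. That is the property that makes consistent hashing attractive for rebalancing when the set of servers changes.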
This talk will cover the following key ideas:
This talk would be beneficial for every distributed file-system project that handles file metadata separately. Attendees will get an overview of existing metadata distribution strategies, their pitfalls and benefits, and the reasons why we at CephFS came up with this approach. The benefits of using consistent hashing for distributing metadata are also discussed.
Speakers: Sidharth Anupkrishnan