Having a separate cluster of Metadata Servers (MDS) is a well known design strategy among distributed file-system architectures. One challenge faced by this approach is how to distribute metadata among the MDSs. Unlike data storage and it's associated I/O throughput, which can be scaled linearly with the number of storage devices, file-system metadata is a fairly complex entity to scale due to it's hierarchical nature. In hindsight, a pure hashing based metadata distribution strategy seems like a perfect fit. But, this is not exactly the case. What are the pitfalls then? Too many inter-MDS hops (due to POSIX traversal semantics), loss of hierarchical locality degrades file-system performance, and as a result, this is not beneficial for a workload whose directory hierarchy tree grows in depth rather than breadth. CephFS's metadata balancer takes a different approach by partitioning metadata sub-trees across MDSs thereby preserving good locality benefits. Although efficient, this involves a lot of back and forth migrations of sub-trees and the locality benefits are sometimes trumped by sub-optimal distributions.
In this talk, we present a new metadata distribution strategy employed in CephFS - Ephemeral Pinning. This strategy combines the benefits of hashing and naive sub-tree partitioning by intelligently pinning sub-trees to MDSs so as to obtain a balanced distribution as the workload metadata grows by depth and breadth. A consistent hashing based load balancer helps in maintaining an optimal distribution during addition or failure of MDSs.
This talk will cover the following key ideas:
- How metadata is handled in distributed file systems.
- Why it is so important to have an optimal distribution of metadata among Metadata Servers.
- The drawbacks and advantages of commonly used and popular metadata distribution strategies.
- Ephemeral Pinning - The new metadata distribution strategy employed by CephFS. We will explain the design and implementation of the distribution strategy and delineate it's strong suits.
This talk would be beneficial for every distributed file-system project that handles file metadata separately. They would get an overview on existing metadata distribution strategies - it's pitfall's and benefits and the reason why we at CephFS came up with this approach. The benefit's of using consistent hashing for distributing metadata are also discussed.