Lessons learnt managing and scaling a 200TB GlusterFS cluster @PhonePe
We manage a 200TB GlusterFS cluster in production at PhonePe, and along the way we have learnt some key lessons. In this session, we will share:
- The minimal health checks needed for a GlusterFS volume to ensure high availability and consistency (a minimal health-check sketch follows this list)
- The problems we experienced with the current cluster expansion step (rebalance) in GlusterFS, how we avoided the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm for the future
- How we schedule maintenance activities so that we never have downtime, even when things go wrong
- How we reduced the time to replace a node from weeks to a day
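
As a purely illustrative example of such checks, the sketch below verifies two basics with the standard gluster CLI: every brick process is online, and no self-heal entries are pending. The volume name is hypothetical and the parsing assumes the plain-text output format of the gluster CLI; this is a sketch of the idea, not our production checker.

    #!/usr/bin/env python3
    """Illustrative minimal health checks for a GlusterFS volume:
    (1) every brick process is online, (2) no self-heal entries pending.
    VOLUME is a placeholder; parsing assumes the plain-text CLI output."""
    import re
    import subprocess
    import sys

    VOLUME = "prod-vol"  # hypothetical volume name

    def gluster(*args):
        """Run a gluster CLI command and return its stdout."""
        res = subprocess.run(["gluster", *args], capture_output=True, text=True)
        if res.returncode != 0:
            sys.exit("gluster %s failed: %s" % (" ".join(args), res.stderr.strip()))
        return res.stdout

    def offline_bricks():
        """Brick rows of `gluster volume status` carry an Online (Y/N) column."""
        offline = []
        for line in gluster("volume", "status", VOLUME).splitlines():
            fields = line.split()
            if line.startswith("Brick") and len(fields) >= 2 and fields[-2] == "N":
                offline.append(fields[1])
        return offline

    def pending_heals():
        """`gluster volume heal <vol> info` prints 'Number of entries: N' per brick."""
        out = gluster("volume", "heal", VOLUME, "info")
        return sum(int(n) for n in re.findall(r"Number of entries: (\d+)", out))

    if __name__ == "__main__":
        bad, heals = offline_bricks(), pending_heals()
        if bad or heals:
            sys.exit("UNHEALTHY: offline bricks=%s, pending heals=%d" % (bad, heals))
        print("OK: all bricks online, no pending heal entries")
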
As the number of clients grew, we had to scale the system to handle the increasing load. Here is what we learnt while scaling GlusterFS:
- How to profile GlusterFS to find performance bottlenecks (see the profiling sketch after this list)
- Why the client-io-threads feature didn't work for us, and how we achieved 4x application throughput by scaling the number of mounts instead (see the mount round-robin sketch after this list)
- How we improved incremental heal speed, and the patches we contributed upstream
- The GlusterFS road map based on these findings
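
As an illustration of the profiling step above, GlusterFS ships a built-in profiler (gluster volume profile) that reports per-brick call counts and FOP latencies. The sketch below, with a hypothetical volume name and sampling window, shows the basic flow:

    #!/usr/bin/env python3
    """Illustrative use of gluster's built-in profiler. The volume name
    and the 60-second sampling window are placeholders."""
    import subprocess
    import time

    VOLUME = "prod-vol"  # hypothetical volume name

    def gluster(*args):
        return subprocess.run(["gluster", *args], check=True,
                              capture_output=True, text=True).stdout

    gluster("volume", "profile", VOLUME, "start")  # begin collecting stats
    time.sleep(60)                                 # sample under real load
    # `info` prints per-brick call counts and min/avg/max latency for each
    # FOP (LOOKUP, WRITE, FSYNC, ...); slow bricks and FSYNC-heavy workloads
    # stand out here.
    print(gluster("volume", "profile", VOLUME, "info"))
    gluster("volume", "profile", VOLUME, "stop")   # stop to avoid overhead
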
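To picture the "scaling mounts" idea, the following hypothetical sketch assumes the same volume is mounted at several FUSE mount points and spreads application I/O across them round-robin, so I/O fans out over multiple client stacks instead of serializing on one. The mount paths and helper are illustrative, not our actual application code:

    import itertools
    import os

    # Hypothetical: the same volume mounted at several FUSE mount points,
    # each with its own client stack (e.g. mount -t glusterfs host:/vol /mnt/g0).
    MOUNTS = ["/mnt/g0", "/mnt/g1", "/mnt/g2", "/mnt/g3"]
    _next_mount = itertools.cycle(MOUNTS)

    def open_for_write(relative_path):
        """Round-robin new files across mounts so writes fan out over several
        client stacks instead of serializing on a single FUSE mount."""
        return open(os.path.join(next(_next_mount), relative_path), "wb")
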
Speakers:
Sanju Rakonde
Pranith Kumar Karampuri