Lessons learnt managing and scaling a 200TB GlusterFS cluster @PhonePe
We manage a 200TB GlusterFS cluster in production at PhonePe, and along the way we have learnt some key lessons. In this session, we will share:
- The minimal health checks needed for a GlusterFS volume to ensure high availability and consistency (a minimal health-check sketch follows this list)
- The problems we experienced with the current cluster expansion step (rebalance) in GlusterFS, how we avoided the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm for the future
- How we schedule maintenance activities so that we never have downtime, even when things go wrong
- How we reduced the time to replace a node from weeks to a day
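
As a purely illustrative example of such checks, the sketch below verifies two basics with the standard gluster CLI: every brick process is online, and no self-heal entries are pending. The volume name is hypothetical and the parsing assumes the plain-text output format of the gluster CLI; this is a sketch of the idea, not our production checker.

    #!/usr/bin/env python3
    """Illustrative minimal health checks for a GlusterFS volume:
    (1) every brick process is online, (2) no self-heal entries pending.
    VOLUME is a placeholder; parsing assumes the plain-text CLI output."""
    import re
    import subprocess
    import sys

    VOLUME = "prod-vol"  # hypothetical volume name

    def gluster(*args):
        """Run a gluster CLI command and return its stdout."""
        res = subprocess.run(["gluster", *args], capture_output=True, text=True)
        if res.returncode != 0:
            sys.exit("gluster %s failed: %s" % (" ".join(args), res.stderr.strip()))
        return res.stdout

    def offline_bricks():
        """Brick rows of `gluster volume status` carry an Online (Y/N) column."""
        offline = []
        for line in gluster("volume", "status", VOLUME).splitlines():
            fields = line.split()
            if line.startswith("Brick") and len(fields) >= 2 and fields[-2] == "N":
                offline.append(fields[1])
        return offline

    def pending_heals():
        """`gluster volume heal <vol> info` prints 'Number of entries: N' per brick."""
        out = gluster("volume", "heal", VOLUME, "info")
        return sum(int(n) for n in re.findall(r"Number of entries: (\d+)", out))

    if __name__ == "__main__":
        bad, heals = offline_bricks(), pending_heals()
        if bad or heals:
            sys.exit("UNHEALTHY: offline bricks=%s, pending heals=%d" % (bad, heals))
        print("OK: all bricks online, no pending heal entries")
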
As the number of clients grew, we had to scale the system to handle the increasing load. Here is what we learnt while scaling GlusterFS:
- How to profile GlusterFS to find performance bottlenecks (see the profiling sketch after this list)
- Why the client-io-threads feature didn't work for us, and how we achieved 4x application throughput by scaling the number of mounts instead (see the mount round-robin sketch after this list)
- How we improved incremental heal speed, and the patches we contributed upstream
- The GlusterFS road map based on these findings
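
As an illustration of the profiling step above, GlusterFS ships a built-in profiler (gluster volume profile) that reports per-brick call counts and FOP latencies. The sketch below, with a hypothetical volume name and sampling window, shows the basic flow:

    #!/usr/bin/env python3
    """Illustrative use of gluster's built-in profiler. The volume name
    and the 60-second sampling window are placeholders."""
    import subprocess
    import time

    VOLUME = "prod-vol"  # hypothetical volume name

    def gluster(*args):
        return subprocess.run(["gluster", *args], check=True,
                              capture_output=True, text=True).stdout

    gluster("volume", "profile", VOLUME, "start")  # begin collecting stats
    time.sleep(60)                                 # sample under real load
    # `info` prints per-brick call counts and min/avg/max latency for each
    # FOP (LOOKUP, WRITE, FSYNC, ...); slow bricks and FSYNC-heavy workloads
    # stand out here.
    print(gluster("volume", "profile", VOLUME, "info"))
    gluster("volume", "profile", VOLUME, "stop")   # stop to avoid overhead
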
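To picture the "scaling mounts" idea, the following hypothetical sketch assumes the same volume is mounted at several FUSE mount points and spreads application I/O across them round-robin, so I/O fans out over multiple client stacks instead of serializing on one. The mount paths and helper are illustrative, not our actual application code:

    import itertools
    import os

    # Hypothetical: the same volume mounted at several FUSE mount points,
    # each with its own client stack (e.g. mount -t glusterfs host:/vol /mnt/g0).
    MOUNTS = ["/mnt/g0", "/mnt/g1", "/mnt/g2", "/mnt/g3"]
    _next_mount = itertools.cycle(MOUNTS)

    def open_for_write(relative_path):
        """Round-robin new files across mounts so writes fan out over several
        client stacks instead of serializing on a single FUSE mount."""
        return open(os.path.join(next(_next_mount), relative_path), "wb")
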
Speakers:
Sanju Rakonde
Pranith Kumar Karampuri