Historically, changes in the scheduling algorithm of Dask have often been based on theory, single use cases, or even gut feeling. Coiled has now moved to using hard, comprehensive performance metrics for all changes - and it's been a turning point!
Any developer worth their salt scrupulously practices functional regression testing: all functionality is covered by automated tests, and every change must keep all tests green.
Performance testing, however, is a much fuzzier and often neglected area: measuring realistic performance frequently requires a production-sized test bench, and the measurements themselves include some degree of variance.
Historically, changes to the scheduling algorithm in Dask suffered from exactly this problem. There have always been plenty of functional unit tests verifying that the scheduler makes the minute decisions the developers expect, but until recently there weren't any end-to-end, production-sized test benches running realistic use cases to measure performance.
At Coiled, we have now implemented a test suite that does exactly that: it runs realistic, production-sized workloads and applies statistical analysis to the resulting performance metrics, letting us understand whether a change is beneficial or detrimental in terms of runtime and memory usage.
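To give a flavor of what such statistical analysis can look like, here is a minimal sketch of a regression check that flags a run whose wall-clock time falls outside the historical spread. All names, the threshold, and the numbers below are hypothetical illustrations, not taken from the actual suite:

```python
# A minimal sketch of regression detection on performance metrics.
# The function name, threshold, and sample data are hypothetical and
# not part of Dask or Coiled's actual test suite.
import statistics

THRESHOLD = 3.0  # flag runs more than 3 standard deviations above baseline

def is_regression(baseline_runtimes: list[float], new_runtime: float) -> bool:
    """Flag a run whose wall-clock time exceeds the historical mean
    by more than THRESHOLD standard deviations."""
    mean = statistics.mean(baseline_runtimes)
    stdev = statistics.stdev(baseline_runtimes)
    return new_runtime > mean + THRESHOLD * stdev

# Example: ten historical runs of a workload (seconds), plus two new runs
baseline = [132.0, 128.5, 135.2, 130.1, 129.8, 133.4, 131.0, 127.9, 134.6, 130.5]
print(is_regression(baseline, 158.3))  # True: well outside the usual variance
print(is_regression(baseline, 133.0))  # False: within normal noise
```

The key design point is that the check compares against the measured variance of past runs rather than a fixed pass/fail time, so normal run-to-run noise on a production-sized cluster doesn't produce false alarms.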
This presentation delves into how we collect, visualize, and act on this data, and how much it has changed our development process for the better.