Designing for Failure
While we all work very hard to build highly available, fault-tolerant and
resilient applications and infrastructures, the end goal is currently often
something along the lines of loosely coupled microservices with zero downtime
in mind.
Upgrades are tied to CI/CD pipelines and we should be sipping pina coladas
on the beach. Time to unleash the Chaos Monkey, because that is what Netflix
does, so we should try it as well.
Now, the backend DB failed. The middleware application is returning errors, and
your frontend is showing a fancy 5xx.
While each layer is able to scale independently or fail over to another region,
even a simple timeout at the DB can cause a cascading failure.
The application is designed to work, not designed to recover from failure.
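
To make the cascading-failure point concrete, here is a minimal sketch of one
common way to design for recovery: a circuit breaker with a fallback. It is an
illustration under stated assumptions, not the approach the talk necessarily
takes. The names CircuitBreaker, CircuitOpenError, fetch_profile and query_db
are hypothetical, as are the thresholds; query_db stands in for whatever call
reaches the backend DB.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the backend call is skipped."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker for illustration only."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds before a retry
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more requests onto a sick backend.
                raise CircuitOpenError("backend marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


def fetch_profile(breaker, query_db, user_id, fallback):
    """Return DB data when available, otherwise a degraded fallback value."""
    try:
        return breaker.call(query_db, user_id)
    except Exception:
        # Covers both real backend errors and CircuitOpenError.
        return fallback
```

With something like this in the middleware, a DB timeout leads to a degraded
but valid response (cached or default data) rather than a 5xx travelling all
the way up to the frontend, and the DB gets a quiet period in which to recover.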
Designing for failure applies to both software development and infrastructure
architecture, and I'd like to talk about a couple of points to highlight this
paradigm.
Please note that this talk replaces one entitled "Introduction to Metal³" that was due to have been given by Stephen Benjamin, who has sent his apologies but is now unable to attend.