Designing for Failure
While we all work very hard to build highly available, fault-tolerant and resilient applications and infrastructures, the end goal is currently often something along the lines of loosely coupled microservices with zero downtime in mind. Upgrades are tied to CI/CD pipelines and we should be sipping piña coladas on the beach. Time to unleash the Chaos Monkey, because that is what Netflix does, so we should try it as well.
Now, the backend DB failed. The middleware application is returning errors, and your frontend is showing a fancy 5xx.
While each layer is able to scale independently or fail over to another region, even a simple timeout at the DB can cause a cascading failure.
The application is designed to work, not designed to recover from failure.
Designing for failure applies to both software development and infrastructure architecture, and I'd like to talk about a couple of points to highlight this paradigm.
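To make the cascading-timeout scenario concrete, here is a minimal sketch of one common way to design for failure: a circuit breaker with a fallback value. It is not taken from the talk itself; the names CircuitBreaker, CircuitOpenError, fetch_with_fallback, max_failures and reset_after are all hypothetical and chosen for illustration only.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the backend call is skipped."""


class CircuitBreaker:
    """Hypothetical, illustrative breaker: stops calling a failing backend for a while."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the behaviour can be tested
        self.failures = 0
        self.opened_at = None             # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more requests onto a sick backend.
                raise CircuitOpenError("backend marked unavailable")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success: close the breaker and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, query_db, fallback):
    """Return the backend result, or a degraded fallback when the call fails."""
    try:
        return breaker.call(query_db)
    except Exception:
        return fallback
```

With something along these lines in the middleware, a DB timeout results in a degraded answer (for example cached data) rather than a 5xx that propagates to the frontend, and the failing DB is given room to recover instead of being hammered with retries.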
Please note that this talk replaces one entitled "Introduction to Metal³" that was due to have been given by Stephen Benjamin, who has sent his apologies but is now unable to attend.
Speakers: Walter Heck