In a Cloud Native infrastructure, component failure is normal and expected. The loss of a single node or a dozen hard drives is handled automatically by the systems running a datacenter, removing the need to page someone at 4 a.m. This calls for an alerting system that understands service availability at a global scope, yet can still give detailed reports when a service-impacting incident occurs. Prometheus achieves this by defining alerting conditions directly on time series data. The resulting alerts are grouped and aggregated into comprehensive and meaningful notifications.
Fabian will walk through the philosophy of time-series-based alerting, the Prometheus architecture behind it, and how practical anomaly detection can be implemented.
This talk was previously scheduled for 09:50.