There is no shortage of great tools to monitor distributed applications. However, most of them focus on monitoring overall performance metrics and error rates, giving only a general idea of the health of an infrastructure.
Unfortunately, rare issues are often hidden by general trends, which makes it difficult to fully understand infrequent, yet sometimes catastrophic, problems.
Tracers are great at tracking down sporadic problems in production environments, but the amount of data they generate can be hard to manage in the wild.
This talk will present how the work done on LTTng over the last year, notably the introduction of a session rotation mode, makes it easier to integrate fine-grained monitoring into production environments.
The talk will also cover approaches to collecting both kernel and user-space traces from multiple hosts in order to troubleshoot problems occurring in distributed systems.