Much of Twitter's infrastructure is powered by open source. For cluster management we use Apache Aurora and Apache Mesos to run a variety of workloads across multiple large clusters, each typically comprising tens of thousands of servers. Almost all of Twitter's stateless services run in Aurora/Mesos, and we're constantly working to migrate other workloads onto it. Through many years of development, and lots of hard-learned lessons, we've helped build an extremely reliable, scalable, and robust platform.
Mesos provides a unified abstraction of cluster resources to one or more scheduling frameworks (Aurora is one example), which in turn run a variety of workloads. To fully achieve this resource abstraction, Mesos must ensure that co-located jobs are well contained and isolated from each other. To do this, Mesos offers several options for containerization and resource isolation: it can implement its own control of Linux cgroups and namespaces, or it can delegate to external providers like Docker.
In this talk I'll cover some of the lessons learned running large-scale containerized infrastructure at Twitter and spend much of the time on where we're heading next. I'll focus on isolation improvements that support additional workloads and higher system utilization, all while maintaining (as closely as feasible) the ideal resource abstraction. In particular, we're looking at effective CPU isolation when co-locating very latency-sensitive services with latency-insensitive batch-style workloads.
Speakers: Ian Downes