Open Source and Scale at Twitter

Keynote at 2015 Kansas Linux Fest, hosted at Lawrence Public Library

Dave Lester @davelester

OSS Advocate at Twitter, Inc.
Apache Mesos and Aurora PMC Member

A lot of metadata are in tweets.

Twitter is a big proponent of Open Source — see their website. Some are on Github; some are not.

Front-end developing — Bootstrap; typeahead.js

Dave focuses on key infrastructure projects at Twitter: finagle, scalding; analytics and infrastructure

1. How is Twitter scaling?

What is scaling? See Wikipedia entry. Reaching beyond your current capacity — social and technical solutions.

Twitter numbers (2014):

  • 500 million tweets /day
  • 3.5 billion/week;
  • 6000+ tweets/sec (steady state)

Twitter is “the pulse of the planet”. Can sometimes predict spikes (live, popular events, like the World Cup); sometimes can’t. Could throw 10x the servers at the problem OR improve scalability.

Remember the Fail Whale?

Previously, Twitter: ruby on rails, 200 engineers pushing code; needed a solution to isolate failure and isolate feature development

During 2010 World Cup — lots of issues keeping Twitter up; 2014 after scalability and OS projects, much more stable.

Breaking up monolithic applications into microservices. Common pattern among companies; see Groupon talk, “Breaking up the monolithic

Today, building a distributed system.

2. Twitter’s Open Source infrastructure

  • “Twitter Stack” including Apache Mesos, Aurora, Finagle
    • Mesos: top-level software at Apache; began as research project at UC Berkeley; layer of abstraction between machines in a datacenter and applications that run: cluster manager & resource manager. Mesos actively monitors what’s happening across the cluster (Zookeeper). Addresses the problems of fault tolerance and resource efficiency and utilization.
      • Design Challenges: each framework may have different scheduling needs; must scale to tens of thousands of nodes running hundreds of jobs with millions of tasks; must be fault-tolerant and highly available
      • Master-Worker architecture + Zookeeper cluster
      • Marathon scheduler
      • A lot of #klf15 scalability preso is going over my head, but I do wonder what #kohails project could “get” from Mesos/Aurora scalability
  • Why care about resource utilization? Fewer machines; less human resources.
    • How to best reuse idling times? Early research
    • Quasar — users specify performance target for applications instead of typical resource reservations; machine-learning used to predict resources usage and for cluster scheduling; research by Christina Delimitrou and Christos Kozyrakis at Stanford
    • Google Borg — Google’s cluster management solution; AMP Lab, and John Wilkes spoke at MesosCon 2014.
    • Aurora provides deployment and scheduling of jobs; rich DSL for defining services; health checking; one scheduler to rule them all: can manage both long-running services, as well as cron; can mark production and non-production jobs; production jobs can pre-empt non-prod jobs; has an additional priority system. Aurora has executor features — responsible for executive code on individual worker machines, sending status to Mesos when a task completes.
  • Hundreds of separate services with different owners
  • Managed by Site Reliability Engineer (SRE) teams

3. How and why OSS?

“many parts building on and amplifying each other” –Gordon Haff, Red Hat

Building an ecosystem.

Frameworks

Services: Aurora; Marathon; Kubernetes; Singularity

Big Data: Spark; Storm; Hadoop

Batch: Chronos; Jenkins

Framework bindings — C++, Java, Clojure, Haskell, Python, or write your own.

Resources for writing mesos frameworks — his slides will go online with links to this info.

Community > Code. Very very much true.

Let’s Scale in the Open: increased speed of innovation; more-reliable software; more-visible contributions and impact; broader peer group and sense of community.