Smells Like Teen Systems: DevOps Nirvana

Frank Wiles, @fwiles @revsys  Slides will be online later.

Smells Like Teen Systems: Advice for raising healthy, happy systems and getting to DevOps nirvana

People are fearful of change, so changes must be small at first: baby steps. Be agile with a little a, not a big A; be spiritual, not fundamentalist. Just because you read about a practice somewhere doesn’t mean you must adopt it if it doesn’t work for your organization. Have ammunition: managers need data and explanations to make decisions.

Apply metrics mentality to:

  • change requests
  • trouble tickets and bugs
  • deployments
  • outages of the smallest magnitude
  • interoffice political fights
  • approved and denied requests for equipment or funds
  • hires, fires, and quits
  • $$, labor hours, etc.

“We spend on average 19 hours per week requesting more information”

Guilt tripping: show that there is no other option if you want to keep up.

“Once we put <insert system> in place, we realized we no longer needed that weekly meeting…”

DevOps: Develop Everything Visibly; Operate/Automate Paranoid Services

DEV: Develop Everything Visibly: “Everything has to happen out in the open”

OPS: Operate/Automate Paranoid Services “Automate everything with ridiculous amounts of monitoring and metrics”

Everything is version-controlled. Log of why things happened.
Everything is tracked. Ticketing; Trello; Bugs; etc.

Even more visibility:

  • Level 1: Team Chat. Like Slack. Email is for outsiders.
  • Level 2: Chat Ops <– mmmmmbot!
  • Level 3: Have some fun <– Fun bots

Chat ops suggestions

  • Deployments and config changes
  • Status summaries: bot check load db3
  • Maintenance: bot start maintenance file-server-1
  • Display Alerts and Warnings
  • Server boot/shutdown messages
  • Ops logs: bot log Upgraded redis to 2.8.19
  • Resolutions: bot resolve ticket #8 Ended up just needing to restart Apache
  • Common actions: bot restart apache on production
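
As a concrete illustration of the command style above, here is a minimal sketch of a chat-ops command dispatcher in Python. The bot framework, regex commands, ssh status check, and log path are all hypothetical; they just mirror the example commands, not anything shown in the talk.

```python
# Minimal chat-ops command dispatcher (hypothetical; command names mirror the
# examples above).
import re
import subprocess

COMMANDS = {}

def command(pattern):
    """Register a handler for chat messages matching the given regex."""
    def decorator(func):
        COMMANDS[re.compile(pattern)] = func
        return func
    return decorator

@command(r"^bot check load (?P<host>\S+)$")
def check_load(host):
    # Status summary: shell out and post the result back to the channel.
    out = subprocess.run(["ssh", host, "uptime"], capture_output=True, text=True)
    return out.stdout.strip()

@command(r"^bot log (?P<entry>.+)$")
def ops_log(entry):
    # Ops log entries go to a shared, searchable file instead of an inbox.
    with open("/var/log/opslog.txt", "a") as fh:
        fh.write(entry + "\n")
    return "logged: " + entry

def handle_message(text):
    """Called by the chat bot for every message it sees in the channel."""
    for pattern, func in COMMANDS.items():
        match = pattern.match(text)
        if match:
            return func(**match.groupdict())
    return None

if __name__ == "__main__":
    print(handle_message("bot log Upgraded redis to 2.8.19"))
```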

Tools: This is how we do it

  • Python: scripting language {relatively easy to learn and readable; libraries for talking to everything}. Lots of libraries; Fabric is highly recommended: shell scripting on steroids (see the Fabric sketch after this list).
  • SaltStack: master and minion architecture; as simple or as complicated as you want; fast communication even among hundreds of systems (ZeroMQ + AES); extensible via Python; can return data to the master for monitoring or metrics purposes; simple to crazy-complicated orchestration between systems. Examples of uses: targeting (/srv/salt/top.sls); pillars (/srv/pillar/*, config differences expressed as data); templating. See the Salt sketch after this list.
  • Consul: service discovery and monitoring; health checks; discover services via DNS or HTTP REST APIs; deadman health checks. See the Consul sketch after this list.
  • ELK: Elasticsearch/Logstash/Kibana <– fast log searching.
  • “Logs that aren’t centralized are rarely checked and logs that aren’t searchable are never correlated” -Frank Wiles
  • Grafana: for metrics visualization; pretty graphs.
  • Don’t capture exceptions in your inbox; put them in a system: Exception.io, Rollbar. Rollbar also tracks deployments.
  • What to capture? As much as you can store (see the metrics sketch after this list):
    • general collectd system stats
    • logins/signups/emails sent
    • failed login attempts/emails bounced
    • run time of crons and batch jobs
    • backup run times and file size(s)

Resistance. Route around it. If you don’t work with the process….

Maverick, Ricardo Semler {1993}

Turn resistance back on others: sometimes make the process so cumbersome that it burdens their way of thinking.

Open Source and Scale at Twitter

Keynote at 2015 Kansas Linux Fest, hosted at Lawrence Public Library

Dave Lester @davelester

OSS Advocate at Twitter, Inc.
Apache Mesos and Aurora PMC Member

Tweets carry a lot of metadata.

Twitter is a big proponent of open source (see their website). Some projects are on GitHub; some are not.

Front-end development: Bootstrap; typeahead.js

Dave focuses on key infrastructure projects at Twitter: Finagle, Scalding; analytics and infrastructure

1. How is Twitter scaling?

What is scaling? See Wikipedia entry. Reaching beyond your current capacity — social and technical solutions.

Twitter numbers (2014):

  • 500 million tweets /day
  • 3.5 billion/week;
  • 6000+ tweets/sec (steady state)

Twitter is “the pulse of the planet”. Can sometimes predict spikes (live, popular events, like the World Cup); sometimes can’t. Could throw 10x the servers at the problem OR improve scalability.

Remember the Fail Whale?

Previously, Twitter ran on Ruby on Rails with 200 engineers pushing code; it needed a solution to isolate failures and isolate feature development.

During the 2010 World Cup there were lots of issues keeping Twitter up; by 2014, after the scalability and open-source projects, it was much more stable.

Breaking up monolithic applications into microservices is a common pattern among companies; see Groupon’s talk, “Breaking Up the Monolith.”

Today, building a distributed system.

2. Twitter’s Open Source infrastructure

  • “Twitter Stack” including Apache Mesos, Aurora, Finagle
    • Mesos: top-level project at Apache; began as a research project at UC Berkeley; a layer of abstraction between the machines in a datacenter and the applications that run on them: cluster manager and resource manager. Mesos actively monitors what’s happening across the cluster (coordination via ZooKeeper). Addresses the problems of fault tolerance and resource efficiency/utilization.
      • Design Challenges: each framework may have different scheduling needs; must scale to tens of thousands of nodes running hundreds of jobs with millions of tasks; must be fault-tolerant and highly available
      • Master-Worker architecture + Zookeeper cluster
      • Marathon scheduler
      • A lot of #klf15 scalability preso is going over my head, but I do wonder what #kohails project could “get” from Mesos/Aurora scalability
  • Why care about resource utilization? Fewer machines; less human effort.
    • How to best reuse idle time? Early research:
    • Quasar: users specify performance targets for applications instead of typical resource reservations; machine learning is used to predict resource usage and for cluster scheduling; research by Christina Delimitrou and Christos Kozyrakis at Stanford
    • Google Borg: Google’s cluster-management solution; Berkeley’s AMP Lab and Google’s John Wilkes spoke at MesosCon 2014.
    • Aurora provides deployment and scheduling of jobs; a rich DSL for defining services (see the sketch after this list); health checking; one scheduler to rule them all: it can manage both long-running services and cron jobs; jobs can be marked production or non-production, and production jobs can pre-empt non-prod jobs; there is an additional priority system. Aurora’s executor is responsible for executing code on individual worker machines and sending status to Mesos when a task completes.
  • Hundreds of separate services with different owners
  • Managed by Site Reliability Engineer (SRE) teams
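
To give a feel for Aurora’s “rich DSL,” here is a sketch in the style of its Python-based config files. The cluster, role, and resource values are illustrative, and names like Process, Task, Resources, and Service are supplied by Aurora’s config loader rather than plain Python imports.

```python
# hello_world.aurora -- illustrative Aurora job definition
hello = Process(
    name="hello",
    cmdline="while true; do echo hello world; sleep 10; done",
)

task = Task(
    processes=[hello],
    resources=Resources(cpu=1, ram=128 * MB, disk=128 * MB),
)

jobs = [
    Service(
        cluster="devcluster",
        environment="prod",   # production jobs can pre-empt non-prod ones
        role="www-data",
        name="hello_world",
        task=task,
    )
]
```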

3. How and why OSS?

“many parts building on and amplifying each other” –Gordon Haff, Red Hat

Building an ecosystem.

Frameworks

Services: Aurora; Marathon; Kubernetes; Singularity

Big Data: Spark; Storm; Hadoop

Batch: Chronos; Jenkins

Framework bindings — C++, Java, Clojure, Haskell, Python, or write your own.
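
For a sense of what “write your own” looked like at the time, here is a minimal scheduler skeleton against the old mesos.interface/mesos.native Python bindings; the framework name and ZooKeeper URL are hypothetical, and a real framework would launch tasks instead of declining every offer.

```python
# Skeleton Mesos framework scheduler (legacy Python bindings; hypothetical names).
import mesos.interface
from mesos.interface import mesos_pb2
import mesos.native

class HelloScheduler(mesos.interface.Scheduler):
    def registered(self, driver, framework_id, master_info):
        print("Registered with framework id %s" % framework_id.value)

    def resourceOffers(self, driver, offers):
        # A real framework inspects each offer's resources and launches tasks;
        # this skeleton simply declines everything.
        for offer in offers:
            driver.declineOffer(offer.id)

    def statusUpdate(self, driver, update):
        print("Task %s changed state: %s" % (update.task_id.value, update.state))

if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""  # Mesos fills in the current user
    framework.name = "hello-framework"
    driver = mesos.native.MesosSchedulerDriver(
        HelloScheduler(), framework, "zk://localhost:2181/mesos")
    driver.run()
```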

Resources for writing Mesos frameworks: his slides will go online with links to this info.

Community > Code. Very very much true.

Let’s Scale in the Open: increased speed of innovation; more-reliable software; more-visible contributions and impact; broader peer group and sense of community.