Smells Like Teen Systems: DevOps Nirvana

Frank Wiles (@fwiles, @revsys). Slides will be online later.

Smells Like Teen Systems: Advice for raising healthy happy systems and getting to DevOps nirvana

People are fearful of change, so changes must be small at first. Baby steps. Be agile, with a little a, not a big A: be spiritual, not fundamentalist. Don’t mandate a practice just because you read about it somewhere; you don’t have to do it if it doesn’t work for your organization. Have ammunition: managers need data and explanations to make decisions.

Apply metrics mentality to:

  • change requests
  • trouble tickets and bugs
  • deployments
  • outages of the smallest magnitude
  • interoffice political fights
  • approved and denied requests for equipment or funds
  • hires, fires, and quits
  • money, labor hours, etc.

“We spend on average 19 hours per week requesting more information”
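
A number like that has to come out of tracked data. As a minimal sketch, assuming each request for more information is logged as a (date, hours) pair, the weekly average could be computed as below. The record layout, the sample figures, and the name `average_weekly_hours` are hypothetical and not from the talk.

```python
from collections import defaultdict
from datetime import date


def average_weekly_hours(entries):
    """Average hours per ISO week across (date, hours) entries.

    Only weeks that have at least one entry are counted.
    """
    per_week = defaultdict(float)
    for day, hours in entries:
        iso = day.isocalendar()
        # Key by (ISO year, ISO week) so weeks from different years stay apart.
        per_week[(iso[0], iso[1])] += hours
    if not per_week:
        return 0.0
    return sum(per_week.values()) / len(per_week)


# Made-up sample data: time spent asking requesters for more information.
sample = [
    (date(2015, 3, 2), 8.0),    # ISO week 10
    (date(2015, 3, 4), 12.0),   # ISO week 10
    (date(2015, 3, 10), 18.0),  # ISO week 11
]
print(average_weekly_hours(sample))  # 19.0
```

Weeks with no entries are ignored here; whether a quiet week should count as zero is a choice to settle before quoting the number to a manager.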

Guilt tripping — no other option to keep up.

“Once we put <insert system> in place, we realized we no longer needed that weekly meeting…”

DevOps: Develop Everything Visibly, Operate/Automate Paranoid Services

DEV: Develop Everything Visibly: “Everything has to happen out in the open”

OPS: Operate/Automate Paranoid Services: “Automate everything with ridiculous amounts of monitoring and metrics”

Everything is version-controlled. Log of why things happened.
Everything is tracked. Ticketing; Trello; Bugs; etc.

Even more visibility:

  • Level 1: Team Chat. Like Slack. Email is for outsiders.
  • Level 2: Chat Ops <– mmmmmbot!
  • Level 3: Have some fun <– Fun bots

Chat ops suggestions

  • Deployments and config changes
  • Status summaries: bot check load db3
  • Maintenance: bot start maintenance file-server-1
  • Display Alerts and Warnings
  • Server boot/shutdown messages
  • Ops logs: bot log Upgraded redis to 2.8.19
  • Resolutions: bot resolve ticket #8 Ended up just needing to restart Apache
  • Common actions: bot restart apache on production
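
The commands above share one shape: the bot's name, a command, then free-form arguments. A rough sketch of how a chat bot might route them follows. It is not tied to any real chat platform or bot framework; the handler names and reply strings are made up for illustration, and the handlers only echo what a real one would do.

```python
def check_load(args):
    # Placeholder: a real handler would ask monitoring for the host's load.
    return "would check load on " + " ".join(args)


def ops_log(args):
    # Placeholder: a real handler would append to a shared ops log.
    return "logged: " + " ".join(args)


def restart(args):
    # Placeholder: a real handler would trigger the restart via automation.
    return "would restart " + " ".join(args)


# Hypothetical command table: command words -> handler.
HANDLERS = {
    ("check", "load"): check_load,
    ("log",): ops_log,
    ("restart",): restart,
}


def dispatch(message, bot_name="bot"):
    """Route a chat message to a handler; return None if not addressed to the bot."""
    words = message.split()
    if not words or words[0] != bot_name:
        return None
    rest = words[1:]
    # Try the longest command prefix first so "check load" beats "check".
    for size in (2, 1):
        key = tuple(rest[:size])
        if len(key) == size and key in HANDLERS:
            return HANDLERS[key](rest[size:])
    return "unknown command"


print(dispatch("bot check load db3"))
print(dispatch("bot log Upgraded redis to 2.8.19"))
```

Because every command passes through one function in a shared channel, the channel history doubles as a record of who ran what and when.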

Tools: This is how we do it

  • Python: a scripting language that is relatively easy to learn and readable, with libraries for talking to everything. Fabric is highly recommended: shell scripting on steroids.
  • SaltStack: a master, and then Salt (minion) code. As simple or as complicated as you want; fast communication even among hundreds of systems (ZeroMQ + AES); extensible via Python; able to return data to the master for monitoring or metrics purposes; simple to crazy-complicated orchestration between systems. Examples of uses: targeting (/srv/salt/top.sls); pillars (/srv/pillar/*: config differences as data, such as …); templating.
  • Consul: service discovery and monitoring: health checks; discover services via DNS or HTTP REST APIs; deadman health checks.
  • ELK: Elasticsearch/Logstash/Kibana <– fast log searching for when you don’t…
  • “Logs that aren’t centralized are rarely checked and logs that aren’t searchable are never correlated” -Frank Wiles
  • Grafana: for metrics visualization; pretty graphs.
  • Don’t capture exceptions in your inbox; put them in a system. Exception.io; Rollbar. Rollbar also tracks deployments.
  • What to capture? As much as you can store.
    • general collectd system stats
    • logins/signups/emails sent
    • failed login attempts/emails bounced
    • run time of crons and batch jobs
    • backup run times and file size(s)
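
Two items from the capture list, job run times and backup file sizes, are cheap to record from Python. The sketch below uses only the standard library and keeps results in an in-memory dict; the `METRICS` store, the `timed` and `record_file_size` helpers, and the metric names are assumptions for illustration. A real setup would ship the values to whatever metrics system is in use.

```python
import os
import tempfile
import time
from contextlib import contextmanager

# Hypothetical in-memory store standing in for a real metrics backend.
METRICS = {}


@contextmanager
def timed(name, store=METRICS):
    """Record the wall-clock run time of the wrapped block, even if it raises."""
    start = time.monotonic()
    try:
        yield
    finally:
        store[name + ".seconds"] = time.monotonic() - start


def record_file_size(name, path, store=METRICS):
    """Record the size in bytes of a file such as a finished backup."""
    store[name + ".bytes"] = os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "backup.tar")
        with timed("nightly_backup"):
            with open(path, "wb") as fh:
                fh.write(b"x" * 1024)  # stand-in for the real backup job
        record_file_size("nightly_backup", path)
    print(METRICS)
```

Graphed over time, a backup that suddenly finishes much faster or shrinks in size is often the first sign that it has quietly stopped backing things up.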

Resistance. Route around it. If you don’t work with the process….

Maverick, by Ricardo Semler (1993)

Turn the resistance back on others; sometimes that makes their approach so cumbersome that it burdens their own way of thinking.