We’ve invested a lot of time and thought into the ways that we can reduce risks to the system, and avoid disrupting the thousands of applications that rely on us. However, mistakes and accidents do occasionally happen. If you don’t have a process for doing post-mortems after incidents, you probably should invest time in building one.

While you can prepare for as many eventualities as you can think of, an effective operations team also needs to retrospect on things that went wrong, and work out how to prevent them happening again.

What happens when a bad thing occurs?

Downtime hits our customers hard, so at Pusher we have to keep disruption to a minimum. A critical incident generally looks like this:

  • An alert is triggered
  • Someone looks at the issue
  • If necessary, other people are roped in
  • We chat in IRC while we investigate and fix
  • An “incident commander” keeps our status page up to date and chats to users
  • We fix the issue and go to bed

However, this is just the start of the process for us. Any disruption is so unacceptable that we need to work out why it happened and prevent it from happening in future.

Being systematic about your post-mortems

Learning from mistakes is something that’s often quite difficult to do. Without a framework to help you do it consistently, it can be haphazard and important details can be overlooked or forgotten.

After trying other options (such as adding pages to our wiki), we opted to write a small application that would help us keep our reports consistent.

icarus.io Post mortem app

As I mentioned in another article, we see process as a necessary evil that lets us focus the majority of our time on important things. Part of this involves making computers do as much of our work as possible. Any app that can make a process more consistent and bearable is a great asset.

What makes a good post-mortem

While we’re still perfecting our technique for post-mortems, we bear the following goals in mind:

  • Educate other technical team members as to why an incident occurred
  • Highlight the impact (if any) to our customers (for stakeholders and support)
  • Commit to some changes that will reduce the likelihood of the same thing happening again

After an incident, the primary person who dealt with it (usually the person who was on call) writes up the post-mortem. At the next weekly meeting, we go through recent post-mortems as a team and create actions for them. We do this whether the incident was large (it involved an outage) or small (it could have become dangerous).

Our post-mortems require the following:

  • Time frame it occurred in
  • Impact to our users
  • Timeline of events
  • Conclusions
  • Actions
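
As a rough illustration, the required sections above could be modelled as a small record that knows which parts are still empty. This is only a sketch in Python: the class name, field names and sample values are assumptions made for the example, not the schema our app actually uses.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class PostMortem:
    """Hypothetical report record; the fields mirror the list above."""
    started_at: datetime                                 # time frame: start
    ended_at: datetime                                   # time frame: end
    impact: str = ""                                     # impact to our users
    timeline: List[str] = field(default_factory=list)    # timeline of events
    conclusions: str = ""
    actions: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Return the names of required sections that are still empty."""
        sections = {
            "impact": self.impact,
            "timeline": self.timeline,
            "conclusions": self.conclusions,
            "actions": self.actions,
        }
        return [name for name, value in sections.items() if not value]


# Made-up draft: only the time frame and impact have been filled in so far.
draft = PostMortem(
    started_at=datetime(2014, 1, 1, 2, 0),
    ended_at=datetime(2014, 1, 1, 2, 30),
    impact="Placeholder impact summary",
)
print(draft.missing_sections())
```

A check like this makes it easy to refuse a report that skips a section, which is most of what consistency means in practice.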

The importance of a timeline

It’s generally not good enough just to say that a process died, or that a particular user was acting in an unpredictable way. We need to show what broke, in what order and how we reacted.

Doing this teaches people about symptoms to look for in future incidents, and where to look for them. It also allows us to have a well-informed discussion about how effective our response was.

Timelines are a fundamental part of our reports. However, piecing together notes from the event afterwards is a big pain, and leads to process friction (people not wanting to do it).

We reduce this friction by including logs from our chatroom in the report. These show how people reacted, what they were thinking, and how long things took. When something we did had consequences, we can see the fallout. Knowing that the logs will be recorded, we even suggest things we’d like to change in IRC while we’re working. The channel acts as a real-time stream of consciousness that captures our impressions while they’re still fresh.

Having monitoring systems report into the incident channel on IRC is also a great help here. It anchors the stream of consciousness to definite measurable changes in the system status, further reducing the friction in creating an accurate timeline later – we get a solid base just by curating the chat logs.
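
To give a feel for how little curation this needs, here is a minimal sketch that merges chat lines and monitoring messages into one ordered timeline. It assumes each line looks like `2014-01-01T02:14:05 <source> message`. That format, the nicknames and the sample lines are invented for the example, and real IRC logs will need their own parser.

```python
from datetime import datetime
from typing import List, NamedTuple


class Entry(NamedTuple):
    at: datetime   # when the line was logged
    source: str    # a person's nickname or a monitoring bot
    text: str      # what was said or reported


def parse_line(line: str) -> Entry:
    """Parse the assumed format: 'ISO-timestamp <source> message'."""
    stamp, rest = line.split(" ", 1)
    source, text = rest.split("> ", 1)
    return Entry(datetime.fromisoformat(stamp), source.lstrip("<"), text)


def build_timeline(chat: List[str], alerts: List[str]) -> List[Entry]:
    """Interleave human chat and monitoring output by timestamp."""
    return sorted(parse_line(line) for line in chat + alerts)


# Made-up sample lines, purely for illustration.
chat = [
    "2014-01-01T02:16:00 <alice> looking at it now",
    "2014-01-01T02:25:30 <alice> restarted the process, watching graphs",
]
alerts = [
    "2014-01-01T02:14:05 <monitor> CRITICAL: example check failed",
    "2014-01-01T02:27:10 <monitor> OK: example check recovered",
]

for entry in build_timeline(chat, alerts):
    print(f"{entry.at:%H:%M:%S}  {entry.source:<8} {entry.text}")
```

Because the monitoring lines carry their own timestamps, they give the human commentary fixed points to hang off.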

Anything we can add to a report to show how the events unfolded is very useful for later discussion. We attach graphs from Munin or our other internal tools where they help.

Committing to and recording actions

Committing to actions that might prevent similar incidents is a key part of our post-mortems. It’s not good enough to say in passing “oh it would be good if we had a Nagios alert on X”, because it’ll rarely get prioritised. We also want to see trends, such as open actions that could have prevented an incident.

We discuss actions as a group and attach them to the report. These are stored in the post-mortem system, and are also created as GitHub issues. The created issues are labelled ‘post-mortem’ and include a link to the report to help with our normal backlog prioritisation.
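
For anyone wanting to wire up something similar, the sketch below builds the JSON body for one action. The `title`, `body` and `labels` fields are the ones GitHub’s “create an issue” REST endpoint accepts. The function name, the wording of the body and the example URL are assumptions for illustration, not how our app does it.

```python
import json


def issue_payload(action: str, report_url: str) -> dict:
    """Build the request body for one post-mortem action (hypothetical helper)."""
    return {
        "title": action,
        # Link back to the report so the context travels with the issue.
        "body": f"Action agreed in post-mortem: {report_url}",
        "labels": ["post-mortem"],
    }


# Placeholder action and report URL.
payload = issue_payload(
    "Add an alert on X",
    "https://postmortems.example.com/reports/1",
)
print(json.dumps(payload, indent=2))
```

Sending it is then an authenticated POST of this body to `/repos/OWNER/REPO/issues` on the GitHub API, with whichever HTTP client you prefer.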

We’ve found that methodically recording incidents we encounter allows us to deal with issues systematically, and ensures we learn from our mistakes.

About Max Williams

Max is CEO of Pusher, and is passionate about the ways that technology can be used to make life better and more enjoyable for people. He loves using APIs and developer tools, and is obsessed with finding things that can be better done by a machine. His posts tend to be about life at Pusher, and the ways that we experiment with our culture and processes to create awesome things.