How we built Pusher-JS 2.0 – part 3 – metrics



Introduction

This is the final part of my series on how we built pusher-js 2.0; don’t forget to read part 1 and part 2 too.

One of the trickiest parts of rolling out a library is figuring out how it behaves in production scenarios. You can spend months designing and writing tests, but nothing beats releasing your software into the wild, especially when you’re programming for Web browsers. Additionally, metrics collected from real-world usage allow us to take a more scientific approach to improving our connection strategy. These are just some of the reasons why we decided to make metric collection one of the key aspects in the 2.0 release of pusher-js.

Timelines

We started designing the metric collection system by asking ourselves: what questions do we need to answer to know that the new strategy works correctly? We came up with a list of potential queries we’d need to run against the gathered data:

  • what was the connection environment (browser, Flash support)?
  • which transport connected first?
  • what was the connection latency?
  • if additional files were needed, how long did it take to fetch them?
  • did the first transport work or was it replaced later?
  • how was the strategy evaluated?
  • were there any errors when/after connecting?
  • were clients getting disconnected periodically?

We wanted the code used to collect and store the stats to be simple. We also wanted the collected data to be easy to navigate and query. Finding a solution that satisfied these requirements, along with our need to collect as much data as possible, wasn’t easy. The best solution we found was append-only logs.

Every Pusher connection has a log containing a number of events, which we decided to call a timeline. There are only two operations: append and send. Appending an event stamps it with a timestamp and severity and stores it in a buffer. Since lots of events happen over relatively short periods of time, sending the log to our servers had to be separated from appending, and is done every minute or when the client connects.
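To make this concrete, here is a minimal sketch of what such a timeline could look like. The names (Timeline, log, send) and the event shape are illustrative assumptions, not the actual pusher-js internals.

```javascript
// Illustrative sketch only - the names and event shape are assumptions,
// not the actual pusher-js internals.
function Timeline(session) {
  this.session = session; // identifies this connection's log
  this.events = [];       // append-only buffer
}

// Append: stamp the event with a timestamp and severity, then buffer it.
Timeline.prototype.log = function(severity, event) {
  this.events.push({ timestamp: Date.now(), severity: severity, event: event });
};

// Send: flush the buffer to the server, separately from appending.
// In practice this would run every minute or when the client connects.
Timeline.prototype.send = function(sendFn, callback) {
  var batch = this.events;
  this.events = [];
  sendFn({ session: this.session, events: batch }, callback);
};
```

Keeping send separate from log means appending stays cheap even when many events arrive within a short window.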

Currently, only transports and strategies get the timeline instance; they report the following events (a sketch of this wiring follows the list):

  • method calls with debug level,
  • per-instance transport state transitions, e.g. Connecting -> Connected,
  • transport errors,
  • transport cache usage.
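For illustration, a transport could be wired into the timeline roughly as below; the transport and method names are hypothetical, not the real pusher-js API.

```javascript
// Hypothetical transport wiring - for illustration only.
function WebSocketTransport(timeline) {
  this.timeline = timeline;
  this.state = "new";
}

// Per-instance state transitions, e.g. "connecting" -> "open".
WebSocketTransport.prototype.changeState = function(newState) {
  this.timeline.log("info", {
    transport: "ws",
    transition: this.state + " -> " + newState
  });
  this.state = newState;
};

// Transport errors are recorded with a higher severity.
WebSocketTransport.prototype.onError = function(error) {
  this.timeline.log("error", { transport: "ws", error: String(error) });
};
```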

By collecting all these events, along with the environment details sent with the first timeline submission, we can extract a large number of informative metrics about our production connections, including answers to all the questions listed a few paragraphs above.

Rollout

Even though we spent a lot of time testing the library and thinking about the connection strategy, we were fully aware that we might have missed some edge cases. The initial connection timeouts and delays were also our best guess, so we couldn’t be 100% sure they would work well for our users.

To gain more confidence before starting the beta-testing phase, we ran a pre-release version of 2.0 on our homepage, so we could start recording metrics, including success rates, connection latency distributions and transport usage. We also used the flexibility of the new architecture to run an A/B test comparing the strategies from 1.12 and 2.0. Since metric collection wasn’t supported before 2.0, we emulated the old strategy in the new codebase.
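One straightforward way to run such an A/B test is to bucket each client into a strategy variant at random and record the chosen variant in its timeline, so the collected metrics can later be split by strategy. The sketch below follows that idea; it is an assumed shape of the approach, not the code we actually shipped.

```javascript
// Sketch of A/B bucketing between two connection strategies - illustrative only.
function buildLegacyStrategy() { return { name: "1.12-emulated" }; } // old strategy rebuilt on the 2.0 architecture
function buildNewStrategy()    { return { name: "2.0" }; }           // the new default strategy

function pickStrategy(timeline) {
  // Assign roughly half of the clients to each variant.
  var useNew = Math.random() < 0.5;
  var strategy = useNew ? buildNewStrategy() : buildLegacyStrategy();
  // Record the variant so the collected metrics can be split by strategy.
  timeline.log("info", { strategyVariant: strategy.name });
  return strategy;
}
```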

In a matter of days, we learned that:

  • the new strategy improved 99th percentile connection latency by a few seconds,
  • backporting the timeout scheme from the old strategy resulted in unnecessary retries,
  • our initial guess for the HTTP fallback delay was working well for our users,
  • there was a bug with clients reconnecting too quickly,
  • the first implementation of disabling failed transports didn’t work as expected.

Every time we updated the pre-release, we checked the metrics to make sure that our changes worked as expected. Sometimes problems were fixed; occasionally new edge cases were exposed. Thanks to the gathered statistics, we could iterate quickly and, in a matter of days, produce a beta version without any major flaws.

Current state of affairs

After releasing version 2.0.0, we started receiving more and more logs; we now capture several gigabytes of timeline data per day. Since so much valuable information is submitted, we are planning to expand our local monitoring setup with lots of client metrics that we didn’t have before. For example, we can now graph the following (one such aggregation is sketched after the list):

  • connection latency distributions over time,
  • real-time connection latency on a World map,
  • user-agent and browser feature trends.
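For instance, the 99th percentile connection latency mentioned earlier can be derived from collected timelines with an aggregation along these lines. The event shape (a first buffered event plus a connected marker) is assumed for illustration; this is not our actual analytics code.

```javascript
// Rough sketch of one aggregation over collected timelines - illustrative only.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return null;
  var index = Math.min(sortedValues.length - 1, Math.floor(p * sortedValues.length));
  return sortedValues[index];
}

function connectionLatencyP99(timelines) {
  var latencies = [];
  timelines.forEach(function(timeline) {
    var first = timeline.events[0];
    var connected = timeline.events.filter(function(e) {
      return e.event && e.event.connected;
    })[0];
    if (first && connected) {
      latencies.push(connected.timestamp - first.timestamp);
    }
  });
  latencies.sort(function(a, b) { return a - b; });
  return percentile(latencies, 0.99);
}
```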

We can now dig into individual timeline entries to troubleshoot problematic connections for our customers. Also, just as in the beta-testing phase, before releasing each new version of pusher-js we test it on our homepage and compare the results to previous versions, keeping a close eye on the resulting statistics.

Without our metrics it would be impossible to tune the connection strategy as confidently as we now can. Thanks to both granular and aggregate usage statistics from our production system, we were able to focus on the main goal of the 2.0 release: improving the real-world user experience.