How the Pusher team built subscription counting at scale
Last week we released Subscription Count events, which allow you to build scaled presence views for Channels apps. You can enable the feature from the Channels dashboard.
Presence channels are very popular and work well for showing who is online in smaller groups of users (<100), but the use case has limits. We introduced Subscription Count events so that you can display a broader view of subscriptions in very large channels (100–1,000,000).
The small scale presence solution
Previously, the common solution for presence views was to use Presence channels. They are great for building chat rooms, collaborator presence, and “who’s online” style functionality. But they had a limit: we allowed a maximum of 100 subscribers within a channel because Presence channels were designed to serve small groups.
The other way to get the subscription count was to poll our API for the current subscriber count and then serve that to the client.
The solution we implemented for Presence channels was not scalable to a large group of subscribers. At 100+ users, it’s very unlikely that everyone needs to be notified about anyone joining or leaving the channel. It also eats away at your message count.
Imagine 1% of channel users leave every second. For a channel with 100 members, that would be roughly 100 messages per second: 1 user leaving triggers a notification to each of the other 99 members.
For a channel with 1,000 members, it’s roughly 10,000 events per second (10 departures, each notified to about 999 members).
For 10,000 members, it’s roughly 1,000,000 events per second (100 departures, each notified to about 9,999 members)!
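The arithmetic above can be sketched in a few lines. This is only an illustration of the back-of-the-envelope estimate, not code from our system; the function name and the `leave_percent` parameter are made up for the example.

```python
def leave_notifications_per_second(members: int, leave_percent: int = 1) -> int:
    """Estimate member-left notifications per second for one channel.

    Assumes `leave_percent` of members leave each second and every
    departure is delivered to every other member of the channel.
    """
    leavers = members * leave_percent // 100
    return leavers * (members - 1)


for size in (100, 1_000, 10_000):
    print(size, leave_notifications_per_second(size))
```

The exact results (99, 9,990 and 999,900) are what the rounded figures above approximate. The cost grows with the square of the channel size, which is why per-member notifications stop being practical long before a channel reaches thousands of subscribers.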
This could not have scaled for many of the public and private channels we serve. So we limited our scope to only counting socket connections. This seems like a relatively easy task: just use Redis and increment a counter for every subscription. However, the Pusher architecture makes it a bit trickier…
Brainstorming the build
We run Pusher services as a distributed application. That means connections might be coming from multiple pods. Having a single global counter runs into issues when a pod goes down and takes all its connections with it. Then you don’t have the one true count. It’s not reliable anymore.
Because of this, we maintain counts at two levels: a global level and an instance level. If an instance goes down, a cleanup script is in charge of correcting the global count.
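To make the two-level idea concrete, here is a minimal in-memory sketch. It is not our production code: plain dictionaries stand in for Redis, and the class and method names (`SubscriptionCounts`, `cleanup_instance`, and so on) are hypothetical. It assumes the cleanup step simply subtracts whatever the dead instance was holding.

```python
from collections import defaultdict


class SubscriptionCounts:
    """In-memory stand-in for the shared store (Redis in the article)."""

    def __init__(self):
        # channel -> total subscriptions across all instances
        self.global_counts = defaultdict(int)
        # instance id -> channel -> subscriptions held by that instance
        self.instance_counts = defaultdict(lambda: defaultdict(int))

    def subscribe(self, instance, channel):
        self.instance_counts[instance][channel] += 1
        self.global_counts[channel] += 1
        return self.global_counts[channel]

    def unsubscribe(self, instance, channel):
        self.instance_counts[instance][channel] -= 1
        self.global_counts[channel] -= 1
        return self.global_counts[channel]

    def cleanup_instance(self, instance):
        """Remove a dead instance's contribution from the global counts."""
        for channel, count in self.instance_counts.pop(instance, {}).items():
            self.global_counts[channel] -= count


counts = SubscriptionCounts()
counts.subscribe("pod-a", "game-lobby")
counts.subscribe("pod-a", "game-lobby")
counts.subscribe("pod-b", "game-lobby")
counts.cleanup_instance("pod-a")  # pod-a went down with its two connections
print(counts.global_counts["game-lobby"])  # 1
```

Keeping the per-instance breakdown is what makes the correction possible: without it, there would be no record of how many connections disappeared with the pod.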
To broadcast these events at a given interval, the easiest way would be to use a fan-out approach. We would have a worker that periodically scans all active channels and publishes their count to the respective channels. This has a time complexity of O(n^2). #NotScalable
To refine the above-mentioned approach, we also considered sharding the events into various time buckets of 5 seconds each and using Lambda and SQS to process these. This could have worked but the approach was very complicated to begin with and we needed something simpler.
How we built subscription_count
Luckily for us, the Pusher team gathered in Amsterdam this May for our first summit. We’re a global company now and getting everyone in a room to solve things in person has been really exciting. Together we realized we wanted to make things far simpler.
After approaching the problem from multiple angles and weighing the pros and cons of each, we boiled it down to an event-driven model. The new solution was much simpler to implement and to scale.
We still maintain the count as stated above. To broadcast these events, we use `subscription_succeeded` as our event trigger. There we increment the count for that specific subscription event and, in the same operation, obtain the current connection count for that channel.
We divided channels into two groups: large (more than 100 subscribers) and small (equal to or less than 100 subscribers).
When you implement subscription_count:
Small groups immediately receive the new subscriber count.
For a large group, we again use Redis to apply a 5-second lock on broadcasting these events. If a lock is detected, we ignore the event. If the lock has expired, we broadcast the new count and set the lock on broadcasting new events for the next 5 seconds.
So does that mean if a user joins within that 5-second window they will have to wait for it to expire before any count is broadcast to them?
No. In that case, we already have the count from the moment the user subscribed (`subscription_succeeded`), so we broadcast an event to the new subscriber only.
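The join path described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than our actual implementation: dictionaries stand in for Redis, the `sent` list stands in for real message delivery, and `CountBroadcaster` and its method names are hypothetical. Only the 100-subscriber threshold and the 5-second lock come from the description above.

```python
import time

LARGE_CHANNEL_THRESHOLD = 100  # more than 100 subscribers counts as "large"
LOCK_SECONDS = 5               # broadcast lock duration for large channels


class CountBroadcaster:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.counts = {}      # channel -> current subscription count
        self.lock_until = {}  # channel -> time at which the broadcast lock expires
        self.sent = []        # (target, channel, count), standing in for delivery

    def on_subscription_succeeded(self, channel, subscriber):
        # Increment and read the count in one step
        # (a single atomic increment in a real shared store).
        count = self.counts.get(channel, 0) + 1
        self.counts[channel] = count
        now = self.clock()

        if count <= LARGE_CHANNEL_THRESHOLD:
            # Small channel: everyone gets the new count immediately.
            self.sent.append(("all", channel, count))
        elif now >= self.lock_until.get(channel, float("-inf")):
            # Large channel, no active lock: broadcast and lock for 5 seconds.
            self.lock_until[channel] = now + LOCK_SECONDS
            self.sent.append(("all", channel, count))
        else:
            # Lock held: skip the broadcast, but still tell the new subscriber.
            self.sent.append((subscriber, channel, count))
        return count
```

In a real deployment the lock would need to live in the shared store with an expiry, so that every instance sees the same lock and it clears itself even if the instance that set it disappears. The sketch only covers the subscribe trigger described above.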
Here’s a sequence diagram of how our Pusher subscription count event works.
Technical trade-offs and pitfalls
You might have noticed a trade-off. What if users join a large channel within the lock-in period but no one joins after it has expired? In that case, the latest count is published only to those who joined last, while subscribers who joined earlier keep seeing an older count until the next broadcast.
This is true. If your use case depends on a narrow subscription period—a meeting stream platform for example, where many users will join in a short period and there will be little activity following—subscription_count might not be the ideal solution for you.
When you’re working with very large channels which are consistently active, the exact count at all times doesn’t necessarily matter that much. Often you’re looking to give a sense of scale: hundreds of game players online or thousands of people in a group.
We’re now considering how we might go about building a version of this feature with a different approach backed up by more complex infrastructure.