Lower delays in presence channel disconnection timeouts

Introduction

Pusher Channels’ “presence channel” feature lets you implement “who’s online” functionality. When a client subscribes to a presence channel, all other subscribers are notified with a pusher:member_added event. When the client unsubscribes, the remaining subscribers are notified with a pusher:member_removed event. Previously, the pusher:member_removed event could be delayed by up to 230 seconds, leading other users to believe that someone was still online when they were not. Today, the maximum delay is down to 75 seconds. In this post I show why there’s a delay, and how we reduced it.
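To make the two events concrete, here is a minimal sketch of how a subscriber might keep a “who’s online” roster from them. The PresenceRoster class is hypothetical and not part of any Pusher library, and the sketch assumes each event’s data carries a user_id field (plus an optional user_info on pusher:member_added).

```python
class PresenceRoster:
    """Hypothetical in-memory roster driven by presence events."""

    def __init__(self):
        # Maps user_id -> user_info (may be None).
        self.members = {}

    def handle_event(self, event_name, data):
        # Assumption: `data` is the already-decoded event payload.
        if event_name == "pusher:member_added":
            self.members[data["user_id"]] = data.get("user_info")
        elif event_name == "pusher:member_removed":
            # pop() with a default tolerates duplicate removals.
            self.members.pop(data["user_id"], None)

    def online(self):
        return sorted(self.members)


roster = PresenceRoster()
roster.handle_event("pusher:member_added", {"user_id": "alice"})
roster.handle_event("pusher:member_added", {"user_id": "bob"})
roster.handle_event("pusher:member_removed", {"user_id": "alice"})
print(roster.online())  # ['bob']
```

With a roster like this, a late pusher:member_removed event means a user who has already gone stays listed as online until the event arrives.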

In most cases, the pusher:member_removed event is real-time: the client notifies Pusher Channels that it is unsubscribing or disconnecting, and then the pusher:member_removed event is immediately broadcast to everyone else. However, there is one case where pusher:member_removed events can be delayed: when a client disconnects without sending Pusher a FIN packet to tell us that it’s going away. This is called an “unclean disconnection”, and it’s most common on mobile devices, which can suddenly lose their network connection or run out of battery.

While the client is sending or acknowledging messages, we know that the client is still connected. However, if the client uncleanly disconnects, Pusher Channels has no way to know until it tries to send the client a message. On an otherwise idle connection, that could mean we never send the pusher:member_removed event for that client! So, to check whether an inactive connection is still alive, our protocol has a ping/pong mechanism. Either party may check that the other side is responding by sending a ping message, to which the other party should respond with a pong message.

Previously, Pusher Channels would send a ping message to a client if the connection was inactive for 200 seconds. Then Pusher Channels would wait for 30 seconds, hoping to receive a pong message from the client. If Pusher Channels did not receive a pong message within this timeout, it would broadcast the pusher:member_removed event. In the worst case, where a client uncleanly disconnects immediately after activity, the pusher:member_removed event would be delayed by 200 + 30 = 230 seconds.
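The server-side check described above can be sketched as a small state machine. This is an illustration under stated assumptions, not Pusher’s actual implementation: the ConnectionLiveness class, its method names and the action strings are all hypothetical, and time is passed in explicitly (in seconds) so the logic is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConnectionLiveness:
    """Hypothetical liveness tracker for one client connection."""

    last_activity: float  # time of the last message seen from the client
    ping_timeout: float   # inactivity allowed before a ping is sent
    pong_timeout: float   # how long to wait for a pong after a ping
    ping_sent_at: Optional[float] = None

    def on_activity(self, now: float) -> None:
        # Any message from the client, including a pong, proves it is alive.
        self.last_activity = now
        self.ping_sent_at = None

    def tick(self, now: float) -> str:
        if self.ping_sent_at is not None:
            if now - self.ping_sent_at >= self.pong_timeout:
                # No pong in time: treat the client as gone. This is the
                # point at which pusher:member_removed would be broadcast.
                return "disconnect"
            return "waiting"
        if now - self.last_activity >= self.ping_timeout:
            self.ping_sent_at = now
            return "send_ping"
        return "idle"


# A client that vanishes right after activity at t=0, with the old timeouts.
conn = ConnectionLiveness(last_activity=0.0, ping_timeout=200.0, pong_timeout=30.0)
for t in (199.0, 200.0, 229.0, 230.0):
    print(t, conn.tick(t))
# 199.0 idle
# 200.0 send_ping
# 229.0 waiting
# 230.0 disconnect
```

The disconnect is only detected at t=230, which matches the worst case of 200 + 30 seconds.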

We have improved this delay by reducing these timeouts. The ping timeout is reduced from 200 seconds to 60 seconds. The pong timeout is reduced from 30 seconds to 15 seconds. This leads to a lower maximum delay of 60 + 15 = 75 seconds.
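As a quick check of the arithmetic, the worst-case delay is simply the sum of the two timeouts. The helper name below is hypothetical and used for illustration only.

```python
def worst_case_removal_delay(ping_timeout_s: int, pong_timeout_s: int) -> int:
    """Longest wait before pusher:member_removed after an unclean disconnect.

    Worst case: the client disappears immediately after its last activity,
    so the full ping timeout and then the full pong timeout must elapse.
    """
    return ping_timeout_s + pong_timeout_s


old = worst_case_removal_delay(200, 30)
new = worst_case_removal_delay(60, 15)
print(f"old={old}s new={new}s reduction={old - new}s ({(old - new) / old:.0%})")
# old=230s new=75s reduction=155s (67%)
```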

We reached these timeout configurations via careful measurement. Reducing the ping timeout results in more ping and pong messages, and thus more network and battery usage on your clients. Reducing the pong timeout too far results in cutting off high-latency clients, and thus message loss. We chose the new configuration, based on those measurements, to balance these trade-offs.
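To give a rough sense of the cost side of this trade-off, the sketch below estimates how many pings a completely idle connection receives per hour. It assumes each pong resets the inactivity timer and that the round trip is negligible, so the figures are upper bounds rather than measured values. The function name is hypothetical.

```python
def idle_pings_per_hour(ping_timeout_s: float) -> float:
    """Upper bound on pings sent per hour to a connection with no other traffic."""
    return 3600 / ping_timeout_s


print(idle_pings_per_hour(200))  # 18.0 (old ping timeout)
print(idle_pings_per_hour(60))   # 60.0 (new ping timeout)
```

Each ping also triggers a pong from the client, so an idle connection exchanges roughly three times as many keep-alive messages as before under these assumptions.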

This change mostly affects presence channels, but it’s worth noting that it has some other nice side effects:

  • Your connection counts may be marginally lower, since an uncleanly disconnected client counts towards your connection quota until we detect the disconnection, and we now detect it sooner.
  • Presence webhooks had the same delay, which is now reduced to 75 seconds.
  • Channel existence webhooks also had the same delay, which is now reduced to 75 seconds.

For now, we do not provide custom configuration of these values (unless you have a dedicated cluster). We believe these new defaults will work better for everyone.