Increase in Locations Offline
Incident Report for Omnivore.io
Postmortem

Executive Summary

On January 24, 2022, between 23:10 and 23:50 UTC, Omnivore agents experienced a major interruption of service. Agent connectivity was intermittent during this window, with a period of total downtime between 23:23 and 23:39.

We know that the uptime and stability of POS integrations are critical, and we take any disruption to agent connectivity seriously. We have already remediated the root cause, and we apologize for the disruption.

Background and Root Cause

Omnivore agents communicate primarily with a service named 'connect', located at connect.omnivore.io. The 'connect' service acts as a proxy for multiple backend services used by agents, including connectivity, monitoring, log aggregation, and upgrades.

The agent log aggregation service (referred to simply as 'logs') receives uploads of log content from agents and forwards that content to our logging vendor for real-time indexing and storage. This system is used heavily by Omnivore Support to diagnose and troubleshoot agent issues, and it is a critical component of Omnivore infrastructure.

The 'logs' service receives and manages more than a hundred thousand active connections from agents. When our upstream vendor has performance issues or downtime, the 'logs' service can back up and sometimes crash with out-of-memory (OOM) events, because data keeps flowing in from agents with little or no corresponding outflow to the vendor. After such a crash, all of the active connections must be reestablished, and that surge of new connections can degrade the performance of the 'connect' service.

In response to this known failure mode, Omnivore staff have been implementing a more intelligent 'logs' service that manages connections and memory more carefully to prevent this kind of crash and backpressure. This replacement 'logs' service has been running in production for some time, receiving a fraction of incoming traffic. On the day of this event, it was configured for the first time to receive all traffic.
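As a rough sketch of the kind of memory management this involves (the actual 'logs' implementation is not described in this report, and all names below are illustrative), a forwarder can cap its in-memory backlog and shed the oldest batches when the upstream vendor slows down, so that degraded outflow never turns into unbounded memory growth:

```python
import queue


class BoundedLogForwarder:
    """Minimal sketch: bound the in-memory backlog of log batches.

    When the upstream vendor slows down or stops accepting data, the
    backlog stays capped and the oldest batches are dropped instead of
    accumulating until the process is OOM-killed.
    """

    def __init__(self, max_batches=10_000):
        self._backlog = queue.Queue(maxsize=max_batches)

    def enqueue(self, batch):
        try:
            self._backlog.put_nowait(batch)
        except queue.Full:
            # Shed the oldest batch rather than grow without bound.
            # (Sketch only; a production version would need to handle
            # concurrent producers more carefully.)
            try:
                self._backlog.get_nowait()
            except queue.Empty:
                pass
            self._backlog.put_nowait(batch)

    def drain(self, send_to_vendor):
        # Called by a forwarding loop; blocks until a batch is available.
        batch = self._backlog.get()
        send_to_vendor(batch)
```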

This new 'logs' service was deployed in Kubernetes and was configured with an incorrect priority class and pod disruption budget, using the configuration of a less critical component. Priority classes tell Kubernetes which pods matter most, and which to evict if it needs to make room to run a higher-priority pod. Disruption budgets tell Kubernetes how many of a service's pods are allowed to be taken down at once. The incorrect configuration of these two service-level controls allowed an unrelated service to preempt the 'logs' pods, terminating a significant number of agent connections simultaneously. The affected agents all reconnected through connect.omnivore.io, increasing load at the proxy layer.
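For readers less familiar with these controls, the sketch below shows roughly what assigning a dedicated priority class and a disruption budget looks like, using the official Kubernetes Python client. The names, namespace, and values are hypothetical and are not Omnivore's actual configuration:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

# Hypothetical high-priority class for connection-heavy services.
priority_class = client.V1PriorityClass(
    metadata=client.V1ObjectMeta(name="connection-critical"),
    value=1_000_000,  # higher value = higher scheduling priority, preempted last
    global_default=False,
    description="Services behind 'connect' that hold many agent connections.",
)
client.SchedulingV1Api().create_priority_class(priority_class)

# Hypothetical disruption budget: keep at least 90% of 'logs' pods running
# during voluntary disruptions such as node drains.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="logs-pdb", namespace="default"),
    spec=client.V1PodDisruptionBudgetSpec(
        min_available="90%",
        selector=client.V1LabelSelector(match_labels={"app": "logs"}),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="default", body=pdb
)
```

For the priority class to take effect, the deployment's pod template must also reference it by name via priorityClassName.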

The increased load was significant enough to cause 'connect' processes both to be removed from the load balancer due to failed health checks and to crash from out-of-memory events. The resulting disconnects from other backend services (including agent connectivity services) further exacerbated the issue and produced a thundering herd problem that overwhelmed the 'connect' service beyond the point where it could recover on its own.
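A common way to soften this kind of thundering herd, shown here only as a generic pattern and not as Omnivore's actual mitigation, is for clients to reconnect with randomized ("full jitter") exponential backoff so that a mass disconnect does not become a synchronized wave of reconnects:

```python
import random
import time


def reconnect_with_backoff(connect, base=1.0, cap=120.0):
    """Retry `connect` with full-jitter exponential backoff.

    `connect` is any callable that raises ConnectionError on failure.
    Randomizing each delay spreads a fleet-wide reconnect across the
    backoff window instead of hammering the proxy in lockstep.
    """
    attempt = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            # Sleep a random duration in [0, min(cap, base * 2**attempt)].
            delay = random.uniform(0.0, min(cap, base * (2 ** attempt)))
            time.sleep(delay)
            attempt += 1
```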

The issue was resolved when Omnivore Operations staff significantly increased the number of instances running the 'connect' service and rolled back the release that directed all traffic to the new 'logs' service.

Timeline

All times are in UTC.

19:35: Omnivore Operations performed a software deployment to the agent log aggregation service ('logs').

23:10: An unrelated service in the same Kubernetes cluster scaled up and preempted a significant portion of the 'logs' pods.

23:15: Omnivore Operations staff are paged for a connection timeout on a 'connect' server.

23:16: Omnivore Operations staff respond and begin investigating.

23:20: Omnivore Operations staff receive several additional alerts as the situation degrades.

23:22: Most agents begin disconnecting and are unable to reconnect.

23:23: Omnivore Development staff are paged for elevated delivery order failure rates through Menu Management.

23:25 - 23:35: Additional 'connect' servers are put in rotation multiple times in an attempt to remediate the instability. The change request from earlier in the day is identified as a possible contributor to the issue, and the engineer who executed it is paged directly.

23:38: The 'logs' release is rolled back.

23:39: Most agents begin successfully reconnecting.

23:47: Additional, larger 'connect' servers are put into rotation to prevent out-of-memory crashes.

23:50: Online agent counts return to normal.

00:54 (January 25): Incident conference call terminated.

Action Items

  1. The priority class and disruption budget of the 'logs' service have already been corrected to prevent this from causing further disruption. Additionally, moving forward, any service behind 'connect' that handles a large number of connections will be considered an essential service that is not eligible for preemption.
  2. Omnivore Development staff are evaluating mechanisms to prevent thundering herd problems caused by disruptions to other services behind 'connect' from affecting agent connectivity.
Posted Jan 28, 2022 - 11:46 PST

Resolved
All agents remain connected and orders are flowing. We will continue to monitor closely. Postmortem by 1/28.
Posted Jan 24, 2022 - 16:47 PST
Update
All agents remain online and connected. We are continuing to investigate the root cause, and will continue monitoring throughout the evening.
Posted Jan 24, 2022 - 16:19 PST
Monitoring
All agents have reconnected or are actively reconnecting.
Posted Jan 24, 2022 - 15:51 PST
Investigating
We are investigating a high number of locations reporting as Offline. This is causing a high number of order failures on the MMS side.
Posted Jan 24, 2022 - 15:40 PST
This incident affected: Core Services (API).