On January 24th, 2022, between 23:10 and 23:50 UTC, Omnivore agents experienced a major interruption of service. Agent connectivity was intermittent during this window, with a period of total downtime between 23:23 and 23:39.
We know the uptime and stability of POS integrations are critical, and we take any disruption to agent connectivity seriously. We have already remediated the root cause, and we apologize for the disruption.
Omnivore agents communicate primarily with a service named 'connect', located at connect.omnivore.io. The 'connect' service acts as a proxy for multiple backend services used by agents, including connectivity, monitoring, log aggregation, and upgrades.
The agent logs aggregation service (referred to simply as 'logs') receives uploads of log content from agents, and forwards said content to our logging vendor for real-time indexing and storage. This system is used heavily by Omnivore Support to diagnose and troubleshoot agent issues, and is a critical component of Omnivore Infrastructure.
The 'logs' service receives and manages more than a hundred thousand active connections from agents. When our upstream vendor has performance issues or downtime, the 'logs' service can back up, sometimes crashing with out-of-memory (OOM) events as data continues to flow in from agents with little or no corresponding outflow to the vendor. After such a crash, every active connection must be reestablished, and this surge of simultaneous new connections can degrade performance in the 'connect' service.
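The failure mode described above, unbounded buffering of inbound data when the outflow stalls, can be avoided by bounding the in-memory buffer and shedding the oldest entries under pressure. The sketch below illustrates the idea only; the class and method names are hypothetical and are not Omnivore's actual implementation.

```python
from collections import deque


class BoundedLogBuffer:
    """Illustrative sketch: a log buffer that bounds memory use when the
    upstream vendor slows down, evicting the oldest entries instead of
    growing until the process is OOM-killed."""

    def __init__(self, max_entries: int):
        # deque with maxlen silently evicts from the left when full
        self.buffer = deque(maxlen=max_entries)
        self.dropped = 0  # count of entries shed under backpressure

    def ingest(self, entry: str) -> None:
        if len(self.buffer) == self.buffer.maxlen:
            self.dropped += 1  # oldest entry is about to be evicted
        self.buffer.append(entry)

    def drain(self, batch_size: int) -> list:
        """Called when the vendor is accepting data again."""
        batch = []
        while self.buffer and len(batch) < batch_size:
            batch.append(self.buffer.popleft())
        return batch
```

Shedding old log lines is a deliberate trade-off: losing some log content during a vendor outage is preferable to a crash that forces every agent connection to reconnect at once.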
In response to this known failure mode, Omnivore staff have been implementing a more intelligent logs service that more carefully manages connections and memory to prevent this kind of crash and backpressure. This replacement 'logs' service has been running in production for some time receiving a fraction of incoming traffic. On the day of this event, it was configured for the first time to receive all traffic.
This new 'logs' service was deployed in Kubernetes and was configured with an incorrect priority class and pod disruption budget, copied from the configuration of a less critical component. Priority classes tell Kubernetes which pods matter most, and which to evict when it needs room to run a higher-priority pod. Pod disruption budgets tell Kubernetes how many of a service's pods may be taken down at once. The incorrect configuration of these two service-level controls allowed an unrelated service to preempt the 'logs' pods, terminating a significant number of connections simultaneously. The affected agents all reconnected through connect.omnivore.io, sharply increasing load at the proxy layer.
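In manifest form, the two controls look roughly like the following. This is illustrative only; the names, labels, and values are hypothetical and are not Omnivore's actual manifests.

```yaml
# Hypothetical example of the two Kubernetes controls discussed above.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-infrastructure
value: 1000000            # higher value = scheduled first, preempted last
globalDefault: false
description: "For services whose mass disconnection cascades."
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: logs-pdb
spec:
  maxUnavailable: 1       # evict at most one 'logs' pod at a time
  selector:
    matchLabels:
      app: logs
```

Had the 'logs' pods carried a high-value priority class, the unrelated service could not have preempted them; the disruption budget additionally limits how many pods can be taken down at once during voluntary disruptions.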
The increased load was significant enough for 'connect' processes both to fail health checks, removing them from the load balancer, and to die from out-of-memory events. The resulting disconnects from other backend services (including agent connectivity services) further exacerbated the issue, producing a thundering herd that overwhelmed the 'connect' service past the point where it could recover organically.
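A standard client-side mitigation for a thundering herd is exponential backoff with jitter, so that a mass disconnect is spread over time rather than arriving at the proxy as one synchronized wave. The sketch below shows the "full jitter" variant; the function name and parameters are illustrative, not Omnivore's actual agent settings.

```python
import random


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff.

    Each client waits a random interval in [0, min(cap, base * 2**attempt)]
    before retrying, so retries from many clients decorrelate instead of
    hammering the server in lockstep.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

The jitter is the important part: plain exponential backoff keeps the clients synchronized, so each retry wave still arrives all at once, just less often.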
The issue was resolved when Omnivore Operations staff significantly increased the number of instances running the 'connect' service and rolled back the release that directed all traffic to the new 'logs' service.
All times are in UTC.
19:35: Omnivore Operations performed a software deployment to the agent logs aggregation service ('logs').
23:10: An unrelated service in the same Kubernetes cluster scaled up and preempted a significant portion of the 'logs' pods.
23:15: Omnivore Operations staff are paged for a connection timeout on a connect server.
23:16: Omnivore Operations staff respond and begin investigating.
23:20: Omnivore Operations staff receive several additional alerts as the situation degrades.
23:22: Most agents begin disconnecting and are unable to reconnect.
23:23: Omnivore Development staff are paged for elevated delivery order failure rates through Menu Management.
23:25 - 23:35: Additional 'connect' servers are put in rotation multiple times in an attempt to remediate the instability. The change request from earlier in the day is identified as a possible contributor to the issue, and the engineer who executed it is paged directly.
23:38: The 'logs' release is rolled back.
23:39: Most agents begin successfully reconnecting.
23:47: Additional, larger 'connect' servers are put in rotation to prevent out-of-memory crashes.
23:50: Online agent counts return to normal.
00:54: Incident conference call terminated.