Between 08:48 UTC and 09:53 UTC on August 5th approximately 40% of Omnivore agents experienced a period of instability during which API calls would have succeeded intermittently. Additionally between 09:00 UTC and 09:30 UTC approximately 11% of Omnivore agents went offline causing all API calls to those locations to be unsuccessful.
The incident began when the service responsible for maintaining agent connections, known internally as broker
, became overloaded and stopped functioning properly. Our on-call operations team member responded to monitoring alerts and was able to appropriately scale broker to restore customer service.
Much of our infrastructure is hosted on Amazon Web Services (AWS). On July 13th, as part of standard operations, the Operations team adjusted the resource allocations to many containerized services, including broker
. Too little memory was allocated to broker
and the service became unstable and would occasionally crash. These crashes were not frequent enough to be detected by our monitoring system and did not significantly impact customers due to the design of the system so the problem went unnoticed.
The outage response / recovery section below references our "typical deployment process". This is the process by which software is deployed in a slow and controlled fashion. When, for example, broker
is typically deployed it is done so over the course of an hour so that customers are not impacted and engineers can verify functionality and stability.
After receiving a monitoring alert at 08:53 UTC our on-call operations engineer began to troubleshoot the issue to determine severity and scope. After making the determination that this was a customer impacting outage the engineer opened a conference call and notified other engineers.
Investigation continued until 09:07 UTC when the engineer received another monitoring alert which indicated the broker
service had run out of memory and was crashing. This caused the engineer to increase the memory allocation to broker
through the typical deployment process. After seeing the broker
service was still crashing the engineer again increased the memory allocation through the typical deployment process and created an incident on our status page.
At 09:25 UTC the engineer saw the increased memory allocations made through the typical deployment process were not being applied due to the slow nature of the process. The engineer once again increased the memory allocation but did so in a fashion which bypassed the typical deployment process and allowed the broker
service to restart quickly with the increased memory allocation.
After verifying the increased memory allocation had taken effect, the engineer noticed the broker
database had significantly higher CPU and memory load than usual and started the process of scaling the database to ensure that was not the cause of any instability. After the database was resized all instability resided and the incident was marked as resolved.
The root cause of the outage was that the broker
service went into an unrecoverable failure mode after an initial disruption to network traffic. While broker
was busy processing new inbound connections for those lost during the initial disruption, CPU and memory consumption were higher than was allocated. Connections and queries to the database to record those new connections spiked significantly from the high influx of traffic, causing more delays, further overloading the system and preventing it from reaching a new stable and steady state. The combination resulted in some inbound connections to not be processed within an acceptable time-frame, causing them to be retried close to completion. The increased database usage tied up all available connections preventing critical actions from completing, further backing up the system, continuing the failure mode.
broker
as well as other pieces of critical infrastructure.The development team has already developed and deployed changes to broker
that significantly alter its database usage profile in a way that will be resilient to similar overloads and prevent the kind of unrecoverable failure mode experienced. Specifically, these changes include:
All times are from August 5th, 2020, UTC.
08:48 - Agent response times increased and general agent instability begins
08:53 - Agent response times trigger monitoring alerts and Operations investigation
09:00 - broker
began to crash due to lack of RAM causing agents to go offline
09:11 - broker
RAM increased, but as a slow rollout, yielding no impact
09:17 - broker
RAM increased again, but as a slow rollout, yielding no impact
09:18 - Statuspage incident created
09:25 - broker
RAM restart forced to enable using higher memory allocation
09:30 - Agents come back online as broker
stabilizes
09:35 - broker
database increased in size and restarted due to high CPU/memory usage
09:50 - broker
database resize completed
09:53 - Agent general instability ends
10:17 - Status page incident resolved