Agent Connectivity
Incident Report for Omnivore.io
Postmortem

Summary

Between 08:48 UTC and 09:53 UTC on August 5th, approximately 40% of Omnivore agents experienced a period of instability during which API calls succeeded only intermittently. Additionally, between 09:00 UTC and 09:30 UTC, approximately 11% of Omnivore agents went offline, causing all API calls to those locations to fail.

The incident began when the service responsible for maintaining agent connections, known internally as broker, became overloaded and stopped functioning properly. Our on-call operations engineer responded to monitoring alerts and scaled broker appropriately to restore service to customers.

Leadup / Background

Much of our infrastructure is hosted on Amazon Web Services (AWS). On July 13th, as part of standard operations, the Operations team adjusted the resource allocations of many containerized services, including broker. Too little memory was allocated to broker, and as a result the service became unstable and would occasionally crash. These crashes were not frequent enough to be detected by our monitoring system, and because of the system's design they did not significantly impact customers, so the problem went unnoticed.
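
For illustration only, the kind of change involved might look roughly like the following sketch, which assumes an ECS-style task definition managed with boto3; the service name, image, and figures are hypothetical and not the values actually used.

    import boto3

    # Hypothetical sketch of a resource-allocation change for a
    # containerized service; the family, image, and figures below are
    # invented for illustration and do not reflect broker's real values.
    ecs = boto3.client("ecs")

    ecs.register_task_definition(
        family="broker",
        containerDefinitions=[
            {
                "name": "broker",
                "image": "registry.example.com/broker:latest",
                "cpu": 512,
                # If this limit is set below the service's real working
                # set, the container is killed and restarted whenever it
                # runs out of memory -- the occasional crashes described
                # above.
                "memory": 1024,
            }
        ],
    )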

The Response / Recovery section below references our "typical deployment process", the process by which software is deployed in a slow and controlled fashion. A typical broker deployment, for example, is rolled out over the course of an hour so that customers are not impacted and engineers can verify functionality and stability.
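
As a rough sketch of what such a rollout looks like (the instance names, pacing, and health check below are assumptions for illustration, not our actual deployment tooling):

    import time

    # Hypothetical sketch of a slow, controlled rollout: instances are
    # replaced one at a time with a pause between each, so that problems
    # can be caught before the whole fleet runs the new configuration.
    INSTANCES = ["broker-1", "broker-2", "broker-3", "broker-4"]
    PAUSE_SECONDS = 900  # spreads the rollout over roughly an hour

    def replace_instance(name: str) -> None:
        # Placeholder for restarting one instance with the new config.
        print(f"replacing {name}")

    def healthy(name: str) -> bool:
        # Placeholder health check; a failure halts the rollout.
        return True

    for name in INSTANCES:
        replace_instance(name)
        if not healthy(name):
            raise RuntimeError(f"rollout halted: {name} is unhealthy")
        time.sleep(PAUSE_SECONDS)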

Response / Recovery

After receiving a monitoring alert at 08:53 UTC, our on-call operations engineer began troubleshooting to determine the severity and scope of the issue. After determining that this was a customer-impacting outage, the engineer opened a conference call and notified other engineers.

Investigation continued until 09:07 UTC, when the engineer received another monitoring alert indicating that the broker service had run out of memory and was crashing. In response, the engineer increased broker's memory allocation through the typical deployment process. After seeing that the broker service was still crashing, the engineer increased the memory allocation again through the typical deployment process and created an incident on our status page.

At 09:25 UTC the engineer saw that the increased memory allocations were not yet taking effect because of the deliberately slow nature of the typical deployment process. The engineer increased the memory allocation once more, this time bypassing the typical deployment process so that the broker service could restart quickly with the increased allocation.

After verifying that the increased memory allocation had taken effect, the engineer noticed that the broker database was under significantly higher CPU and memory load than usual and began scaling the database to rule it out as a source of instability. After the database was resized, all instability subsided and the incident was marked as resolved.

Root Cause

The root cause of the outage was that the broker service entered an unrecoverable failure mode after an initial disruption to network traffic. While broker was busy processing new inbound connections to replace those lost during the initial disruption, its CPU and memory consumption exceeded what had been allocated. Connections and queries to the database to record those new connections spiked significantly under the influx of traffic, causing further delays, overloading the system even more, and preventing it from reaching a new stable and steady state. The combination meant that some inbound connections were not processed within an acceptable time frame and were retried even when they were close to completion. The increased database usage tied up all available connections, preventing critical actions from completing, further backing up the system, and perpetuating the failure mode.
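
This feedback loop can be illustrated with a simplified sketch; the timeout, queue, and function names below are assumptions made for illustration and are not broker's actual code.

    import queue
    import time

    # Simplified illustration of the failure mode: each inbound agent
    # connection requires database work during its handshake. When the
    # database is saturated, that work is slow, the handshake exceeds the
    # agent's timeout, and the agent retries -- adding yet another
    # connection to an already overloaded backlog.
    AGENT_TIMEOUT = 5.0      # assumed client-side handshake timeout
    retry_backlog = queue.Queue()

    def db_record_connection(agent_id: str, db_latency: float) -> None:
        time.sleep(db_latency)  # stands in for a database round trip

    def handle_handshake(agent_id: str, db_latency: float) -> None:
        start = time.monotonic()
        db_record_connection(agent_id, db_latency)
        if time.monotonic() - start > AGENT_TIMEOUT:
            # The work finished, but too late: the agent has already
            # given up and reconnected, so the effort is wasted and the
            # backlog grows instead of shrinking.
            retry_backlog.put(agent_id)

With normal database latency the handshake completes well inside the timeout; once latency climbs past it, every handshake both fails from the agent's perspective and still consumes broker and database resources, which is what kept the system from reaching a new steady state.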

Action Items

  • Operations engineers have already significantly improved resource monitoring (RAM, CPU, network, etc.) for broker and other pieces of critical infrastructure, and will continue to do so.
  • The development team has already developed and deployed changes to broker that significantly alter its database usage profile, making it resilient to similar overloads and preventing the kind of unrecoverable failure mode experienced. Specifically, these changes include (a condensed sketch follows this list):

    • Making some previously asynchronous queries in the connection handshake process synchronous, ensuring the handshake completes before subsequent connections from the same agent are established and reducing the number of possible parallel operations.
    • Resolving an identified race condition whose recovery path generated a significant number of database queries.
    • Adjusting the database connection pool settings to control the distribution of query types performed, ensuring that online agents remain available during similar disruptions.
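
A condensed sketch of the shape of these changes follows; the class names, pool sizes, and SQL are illustrative assumptions rather than broker's actual implementation.

    from contextlib import contextmanager
    from threading import BoundedSemaphore

    class Pool:
        # Minimal stand-in for a database connection pool.
        def __init__(self, max_connections: int):
            self._slots = BoundedSemaphore(max_connections)

        @contextmanager
        def acquire(self):
            with self._slots:
                yield self  # a real pool would yield a live connection

        def execute(self, sql: str, params=()):
            print("executing:", sql, params)

    class ConnectionPools:
        # Separate pools per query type, so a burst of handshake work
        # cannot consume the connections needed to keep already-online
        # agents serviced.
        def __init__(self, handshake_size: int = 10, liveness_size: int = 20):
            self.handshake = Pool(handshake_size)
            self.liveness = Pool(liveness_size)

    def complete_handshake(pools: ConnectionPools, agent_id: str) -> None:
        # Previously fire-and-forget; now run synchronously so the
        # handshake fully completes before the same agent can establish
        # another connection, capping parallel operations per agent.
        with pools.handshake.acquire() as conn:
            conn.execute(
                "INSERT INTO agent_sessions (agent_id) VALUES (%s)",
                (agent_id,),
            )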

Timeline

All times are in UTC on August 5th, 2020.

08:48 - Agent response times increase and general agent instability begins

08:53 - Agent response times trigger monitoring alerts and Operations investigation

09:00 - broker begins to crash due to lack of RAM, causing agents to go offline

09:11 - broker RAM increased, but as a slow rollout, so not yet in effect

09:17 - broker RAM increased again, but as a slow rollout, so not yet in effect

09:18 - Status page incident created

09:25 - broker restart forced so the higher memory allocation could take effect

09:30 - Agents come back online as broker stabilizes

09:35 - broker database increased in size and restarted due to high CPU/memory usage

09:50 - broker database resize completed

09:53 - Agent general instability ends

10:17 - Status page incident resolved

Posted Aug 12, 2020 - 06:56 PDT

Resolved
Agents are now able to properly connect to Omnivore services.
Posted Aug 05, 2020 - 03:17 PDT
Investigating
We are currently investigating reports of agents not being able to contact Omnivore services.
Posted Aug 05, 2020 - 02:18 PDT
This incident affected: Core Services (API).