20:35 UTC | 12:35 PT
We are experiencing dropped calls and Talk performance issues across all pods. Investigation is underway.
On Feb 20, from 12:00 PM to 12:33 PM PST, our Talk vendor Twilio's Programmable Voice /Calls REST API experienced elevated latency and error rates. As delays cascaded into failures, 2.6% of dialed calls failed between 12:10 PM and 12:33 PM PST, and 75% of /Calls REST API GET requests failed. More generally, Programmable Voice outbound API calls saw higher latency.
Root Cause Analysis
The issue was caused by increased disk latency affecting writes of customer data into a historical database. Continued retries against this internal Twilio database caused API requests to queue internally, which starved a critical service that processes database queries of resources. This led to delays and failures across all internal Voice databases, affecting both dialed Voice calls and the API. The incident was resolved when the affected databases and query-processing services were replaced.
Twilio's resolution plan for this outage is as follows:
1. We will add measures to isolate delays and failures in database writes so they do not impact other servers.
2. We will add a fallback mechanism to handle high disk latency, thus increasing the system's fault tolerance.
3. We will also add more monitoring to detect this pattern earlier and automatically fail over to another database system.
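Twilio has not published the details of the fallback mechanism in item 2, but the general pattern is bounded retries with backoff plus a secondary write path, so that a slow primary store cannot queue requests indefinitely (the failure mode described in the root cause above). A minimal sketch, with hypothetical function names and parameters, might look like this:

```python
import time


class PrimaryUnavailable(Exception):
    """Raised when the primary store exceeds its latency or error budget."""


def write_with_fallback(write_primary, write_fallback, retries=3, base_delay=0.01):
    """Try the primary store a bounded number of times with exponential
    backoff, then divert the write to a fallback store instead of
    retrying forever and letting requests pile up in a queue."""
    for attempt in range(retries):
        try:
            return write_primary()
        except PrimaryUnavailable:
            # Exponential backoff: 1x, 2x, 4x, ... the base delay.
            time.sleep(base_delay * (2 ** attempt))
    # Primary stayed unhealthy; fail over rather than keep waiting.
    return write_fallback()
```

The key design choice is that retries are capped: once the budget is spent, the write fails over immediately, which keeps latency bounded for callers and prevents the resource starvation that cascading retries caused in this incident.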
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.