17:29 UTC | 09:29 PT
Pod 14 performance should be back to normal for all services at this point.
16:32 UTC | 08:32 PT
Pod 14 performance should be improving, but you may see slowness with Views for the time being. Please reach out if you see other issues.
16:01 UTC | 08:01 PT
We are still working on a resolution for the POD14 performance issues. Thank you for your patience. We will keep you updated.
15:46 UTC | 07:46 PT
We are continuing to work on a resolution for the POD14 performance issues. More info to follow.
15:31 UTC | 07:31 PT
We have identified a potential root cause for the issues on POD14 and are working on a resolution. More info to follow.
15:13 UTC | 07:13 PT
We are currently experiencing performance issues with POD14. Updates to follow.
On November 17, 2017, a known bug in third-party software used by Zendesk caused a failure in our US-based POD 14 environment. This resulted in an outage lasting 1 hour and 5 minutes, as well as a period of degraded service lasting an additional 27 minutes. During this time, agents would initially have been unable to use the system effectively, and later would have experienced poor performance. Talk customers would have experienced dropped calls, and end users of Guide would have seen poor performance.
Starting at 14:58 UTC / 6:58 AM PST, customers in POD 14 began experiencing service availability issues. The Zendesk team began investigating and identified that the issue was centered on Kafka (third-party software we use). During the recovery period we noted a known bug that causes underlying issues for the Kafka server processes. An exception triggered by this bug was caught on one of the Kafka brokers in POD 14, causing partitions to go offline and ultimately restarting the Kafka service itself. Once the Kafka service restarted, it became available again.
Kafka availability is critical for many of our services, including the querying service. When Kafka was restored, a thundering herd of queued requests flooded and overloaded our querying service hosts. After we upgraded the instance type to increase the number of CPUs, the querying service quickly recovered, and performance returned to normal shortly afterward.
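As a generic illustration of this failure mode (not Zendesk's actual querying-service code), the sketch below shows how retrying with exponential backoff and randomized jitter can spread out the burst of requests that otherwise arrives the instant a dependency recovers; the function and parameter names are hypothetical.

```python
import random
import time

def query_with_backoff(execute_query, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Issue a query, retrying with exponential backoff plus jitter.

    Without a randomized delay, every client that failed while the
    dependency was down retries at the same moment it recovers,
    producing the "thundering herd" described above. Jitter spreads
    those retries out over time. (Illustration only.)
    """
    for attempt in range(max_attempts):
        try:
            return execute_query()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at max_delay, with full jitter.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```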
The Zendesk Data Operations team was already aware of this Kafka bug, which initiated the incident. A patch to upgrade Kafka has been in testing to ensure a smooth deployment to our production environment. As noted below, the Kafka upgrade has been given higher priority as a result of this incident.
To prevent this specific incident from recurring, we will deploy the Kafka patch that has been in testing to address the known bug. We will also conduct further analysis on a possible architectural change to address the cascading effects on other services. Finally, we will increase cores on all querying service nodes in AWS pods to handle additional concurrent threads.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.