00:02 UTC | 17:02 PT
Issues on Pods 5, 6, and 9 are resolved. The issue may have caused minor delays in email processing.
23:18 UTC | 16:18 PT
Monitoring indicates service has been restored for most customers. Work continues to confirm full stability on Pods 5, 6, and 9.
22:39 UTC | 15:39 PT
We are seeing performance improvements across all affected Pods. We'll continue to monitor the situation and will update shortly.
22:00 UTC | 15:00 PT
We are still working to diagnose network issues impacting service availability in Pods 5, 6, 9, and 13. Next update in 30 minutes.
21:25 UTC | 14:25 PT
We are working to isolate network-related issues on the East Coast. We will update as soon as we have more information.
20:50 UTC | 13:50 PT
We are continuing the investigation of issues on Pods 5, 6, 9, and 13. Support, Talk, and Chat may still be impacted.
20:15 UTC | 13:15 PT
We continue to investigate the issues impacting services on Pods 5, 6, and 9. Next update in 30 minutes.
19:36 UTC | 12:36 PT
We are currently experiencing issues on Pods 5, 6, and 9, primarily impacting Support, Chat, and Talk. More updates to follow.
18:58 UTC | 11:58 PT
We are currently experiencing issues on Pods 5, 6, and 9 that may impact the availability of services. We are working to remediate.
POST-MORTEM

Starting March 15 at 6:18 PM UTC, Zendesk Support, Talk, and Help Center services became unavailable, primarily for customers in our East Coast data center, including Pods 5, 6, and 9. This service impact lasted until 10:27 PM UTC. During the period of impact, service was unavailable for normal use, although there were brief windows when services returned.
Our cross-functional incident response team immediately investigated multiple potential sources for this event and ultimately took steps to limit inbound connections while gradually ramping traffic back up to Pods 5, 6, and 9 in the East Coast data center. At the conclusion of this process, all services were back online as of 10:27 PM UTC.
Two back-end service changes, one production deploy, and exceptionally high-volume traffic from some customers were each systematically eliminated as sources of this threshold-exceeding traffic. The root cause was identified as a maintenance event on a host server that rendered a key agent service unavailable. This unavailable agent service in turn caused secondary issues: a network load imbalance resulting from suboptimal failover performance, and overactive platform security services. We redirected traffic to load-bearing resources and modified the platform security services to allow known good traffic to be restored.
To prevent an immediate recurrence of this issue, we have halted all changes related to a set of infrastructure upgrade programs we have been working through, which directly contributed to this incident. This freeze will continue through the end of March, and until we have a new plan for these programs that substantially reduces risk.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.