17:25 UTC | 10:25 PT
Zopim access, history, and analytics have now recovered.
17:03 UTC | 10:03 PT
Chat history and analytics are available but delayed. We expect full recovery in less than 1 hour.
16:51 UTC | 09:51 PT
Access to Zopim chat is functioning properly; however, history and analytics are slow or unavailable.
16:31 UTC | 09:31 PT
We are currently investigating an issue with Zopim.
During this incident, customers experienced a brief chat outage with login failures and disconnections, followed by degraded service in which chat history and analytics were slow or unavailable. While troubleshooting an issue where some accounts were not able to chat, we put in an emergency request to the Network Operations team to disable DDoS mitigation in order to address network issues. Unfortunately, this caused connections to be re-established globally, which led to a "thundering herd" scenario: a large number of clients reconnecting at the same time. Most critical subsystems self-recovered in under 5 minutes, which allowed business-critical functions to resume, but chat history and analytics remained slow to load or unavailable. The Zopim Elasticsearch cluster then took an hour to recover fully, after which chat history and analytics were available again.
Going forward, we are investigating an Elasticsearch upgrade and associated architecture changes so that a thundering herd scenario would only affect less critical systems.
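For readers unfamiliar with the term, a thundering herd occurs when many clients retry at the same moment and overload the service they are reconnecting to. The sketch below shows one common, general way to reduce that effect: exponential backoff with random jitter. It is an illustration only and does not describe how Zopim clients or servers actually behave. The function name `backoff_delays` and its parameters are hypothetical.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return one randomized wait time (in seconds) per reconnect attempt.

    Hypothetical helper for illustration; not part of any Zopim or Zendesk API.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles with each attempt, limited to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Picking a random point below the bound spreads clients out,
        # so they do not all reconnect at the same instant.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Two simulated clients get different reconnect schedules.
    print(backoff_delays(5, rng=random.Random(1)))
    print(backoff_delays(5, rng=random.Random(2)))
```

Because each client draws its own random delay, reconnect attempts are spread over time rather than arriving together.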
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.