16:27 UTC | 09:27 PT
Chat history and search service functionality have been fully restored; all known issues should be resolved.
15:31 UTC | 08:31 PT
We continue to work to bring chat history and search back - data is becoming gradually available. More updates soon.
14:22 UTC | 07:22 PT
Our Operations team continues to work on bringing Zopim chat history and search back to normal functionality.
13:46 UTC | 06:46 PT
We are working to bring chat history and search back to normal functionality. We will provide an update shortly. (Tweet 2/2)
13:46 UTC | 06:46 PT
Chat functionality for the Zopim service has been restored. (Tweet 1/2)
13:08 UTC | 06:08 PT
The unscheduled maintenance for Zopim has now started. We expect the Zopim service to be available in approximately 30 minutes.
12:34 UTC | 05:34 PT
in order to resolve ongoing performance issues affecting Zopim customers. (Tweet 2/2)
12:34 UTC | 05:34 PT
At 12:45 UTC, the Zopim service will not be available for approximately 30 minutes (Tweet 1/2)
12:17 UTC | 05:17 PT
We’re still working on fixing the issue regarding connectivity to Zopim services. Thanks again for bearing with us!
11:25 UTC | 04:25 PT
We’re still working on resolving the connectivity issues with our Zopim service. We want to thank you for your ongoing patience!
10:52 UTC | 03:52 PT
We have identified the cause and are working on a resolution - please bear with us. Thank you!
10:22 UTC | 03:22 PT
We are investigating an outage impacting the Zopim dashboard. More details to follow.
Zendesk recently completed a Disaster Recovery test exercise of our Zopim Chat service. The goal of this exercise was to evaluate our disaster recovery plans, technical runbooks, and ability to fail over to our disaster recovery environment. This is an important part of our overall set of practices to ensure that our services can be recovered and remain available to customers in the event of a significant disaster. The test exercise included an initial service disruption during the failover, as well as a longer interruption resulting from a later, unanticipated service failure in our Recovery environment.
On September 18, 2016, during our previously scheduled maintenance window for Zopim Chat, we conducted the Disaster Recovery exercise. We successfully failed over from our primary Zopim data center to the Recovery environment in less than 30 minutes. Once in Recovery mode, we encountered a networking issue that caused database replication between the two sites to fall out of sync.
On September 19, 2016, at approximately 3:00 am PDT, we experienced a failure in the Recovery environment. A key part of our background job processing environment relies on a data store to manage short-term data changes and job status. Although it is a small part of our overall infrastructure, this data store is important for keeping many functions of our application platform operational. The loss of this data store resulted in a cascading failure in other services, such as our Elasticsearch cluster.
At this point, the chat service was no longer available in the Recovery environment, and work began to troubleshoot the issue and recover the service. After troubleshooting these issues for 1 hour and 20 minutes, at 5:30 am PDT, we decided that the best course of action was to roll back to our primary Zopim data center. In the time since the initial failover activity, the database sync issue had been resolved, and our primary environment was ready for full service once again. The rollback was started at 6:19 am and completed at 6:30 am PDT. At that time, chat activity and ticket creation were restored. During the following 2 hours, we focused on the search infrastructure to ensure that chat history and analytics were current and responsive.
Zendesk conducted a post-mortem review of these events with parties from all groups involved in the planning, execution, and recovery of this exercise. During the review, a full timeline was assembled and discussed. Most importantly, a list of remediations was identified to improve our execution of future recovery test exercises. The areas of remediation include reviews of our disaster recovery communications process, functionality, and automation.
Zendesk regrets the unscheduled service interruption that occurred during this exercise. We appreciate your patience as we continue to improve our ability to recover from catastrophic events that could impact future service. Our ability to recover from such events is important to ensure the continuity of your business. We look to improve the experience for our next test event.
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.