19:24 UTC | 12:24 PT
Impact for customers on Pod 13 is now resolved. We expect Pod 5 to return to normal service for Chat in approximately 10 hours. If you are on Pod 5 and urgently need to change a chat setting, please reach out to us at email@example.com.
18:42 UTC | 11:42 PT
We are continuing to monitor Pods 5 and 13 as they recover. We will give an update in 30 minutes.
18:11 UTC | 11:11 PT
We have identified the cause of the access issues and are deploying a fix. Pod 5 has stabilized and Pod 13 is recovering.
17:36 UTC | 10:36 PT
We are investigating reports of users being unable to log in to Chat on Pods 5 and 13.
As part of a remediation item for a prior incident, our engineering team deployed a new version of our staff service that included a significant change to how we handle accounts. Unfortunately, that change contained a defect that affected consumers under heavy production load. A large volume of notifications triggered a large number of account lookups, which eventually exceeded the rate limit set in the staff service. The account service side did not handle the rate limit properly, and this stopped all consumers on the affected host. To resolve the issue, we restarted the account service, which eventually cleared a backlog of events and restored normal service. To prevent a recurrence, we have fixed the defect and will be improving our error monitoring, rate limiting, and internal playbook for these types of incidents.
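The failure mode described above, where a rate-limit rejection stopped every consumer on a host, can be illustrated with a minimal sketch. This is not our production code: `RateLimitError`, `consume`, and `lookup` are hypothetical names chosen for illustration, and the retry counts and delays are assumptions. It shows one common way a consumer can tolerate a rate limit, by backing off and retrying, then setting the event aside rather than halting.

```python
import time
from typing import Callable, Iterable, List


class RateLimitError(Exception):
    """Hypothetical error raised when a lookup is rejected by a rate limiter."""

    def __init__(self, retry_after: float = 0.0):
        super().__init__("rate limit exceeded")
        self.retry_after = retry_after


def consume(
    events: Iterable[str],
    lookup: Callable[[str], str],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Process every event and return those that still failed after retries."""
    failed: List[str] = []
    for event in events:
        for attempt in range(max_attempts):
            try:
                lookup(event)
                break  # lookup succeeded, move on to the next event
            except RateLimitError as err:
                # Wait for the hinted interval, or back off exponentially,
                # whichever is longer.
                delay = max(err.retry_after, base_delay * (2 ** attempt))
                sleep(delay)
        else:
            # Retries exhausted: set the event aside instead of
            # stopping the whole consumer.
            failed.append(event)
    return failed
```

In this sketch, events that keep hitting the limit are returned to the caller, which could re-queue them or raise an alert, while the remaining events continue to be processed.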
FOR MORE INFORMATION
For current system status information about your Zendesk, check out our system status page. During an incident, you can also receive status updates by following @ZendeskOps on Twitter. The summary of our post-mortem investigation is usually posted here a few days after the incident has ended. If you have additional questions about this incident, please log a ticket with us.