Google Workspace Status Dashboard
Incident affecting Google Groups
Incident began at 2021-11-12 08:30 and ended at 2021-11-12 10:26 (times are in Coordinated Universal Time (UTC)).
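
The incident report in the table below gives its timeline in US/Pacific time, while the header above uses UTC. The following is a minimal sketch of the conversion and the duration arithmetic. It assumes a fixed UTC-8 offset (Pacific Standard Time was in effect on this date), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: US/Pacific was on standard time (UTC-8) on 2021-11-12.
PST = timezone(timedelta(hours=-8), "PST")

# Incident window as stated in the header, in UTC.
start_utc = datetime(2021, 11, 12, 8, 30, tzinfo=timezone.utc)
end_utc = datetime(2021, 11, 12, 10, 26, tzinfo=timezone.utc)

# Convert to the time zone used by the incident report.
start_pacific = start_utc.astimezone(PST)
end_pacific = end_utc.astimezone(PST)
duration = end_utc - start_utc

print(start_pacific.strftime("%Y-%m-%d %H:%M %Z"))
print(end_pacific.strftime("%Y-%m-%d %H:%M %Z"))
print(duration)
```

The converted window, 00:30 to 02:26 US/Pacific, matches the 1 hour, 56 minutes stated in the incident report.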
| Date | Time | Description |
|---|---|---|
| Dec 2, 2021 | 11:37 PM UTC | INCIDENT REPORT. DATE/TIME OF THE ISSUE (US/Pacific time): Friday, 12 November 2021 00:30 - Friday, 12 November 2021 02:26. Duration: 1 hour, 56 minutes. Summary: On November 12, 2021, the Google Cloud Load Balancing (GCLB) service experienced failures resulting in impact to several downstream Google Cloud services in Europe for a duration of 1 hour, 56 minutes. We understand that this issue has impacted our valued customers and users, and we apologize to those who were affected. Background: Google Cloud Load Balancing is a collection of software and services that load balance traffic across Google properties. There are two main components: a control plane and a data plane. The control plane programs the data plane with instructions on how to handle requests. A key component of the data plane is the Google Front End (GFE). The GFE is an HTTP/TCP reverse proxy, which is used to serve requests to Google properties including Search, Ads, Workspace (Gmail, Chat, Meet, Docs, Drive, etc.), Cloud External HTTP(S) Load Balancing, Proxy/SSL Load Balancing, and many Cloud APIs. Updates are regularly rolled out to GFEs, typically via configuration flags, starting with canary GFEs and gradually expanding to production globally. GFEs support and terminate QUIC [1] connections before connecting to downstream backend services. QUIC is a general-purpose transport layer network protocol. On first connection, a QUIC server supplies a source address token, which the client presents when resuming a later connection to prove that it has previously used a given address. Root Cause: On Friday, 12 November at 00:27, a configuration change modifying the format of the source address token provided to QUIC clients was rolled out to a small set of GFEs. This change resulted in a misconfigured token that could crash GFEs that had not yet received this update. 
Shortly thereafter, the monitoring service automatically detected a problem with GFEs using this flag and rolled back the change within four minutes. However, clients that had connected to a GFE with the updated configuration during that period received a misconfigured token, which they subsequently presented to other GFEs during reconnection. So despite the rollback, impact remained until additional mitigations were put in place. Remediation and Prevention: Google engineers were alerted to the issue via automated alerting on Friday, 12 November 2021, at 00:30 US/Pacific and immediately started an investigation. At 00:31, the configuration change was automatically rolled back. However, by 00:42, it was clear the impact remained widespread, and our engineering team continued further investigation. Mitigation began at 01:38, when traffic was redirected away from the impacted GFEs. At 02:12, a flag change was pushed to temporarily disable QUIC support on GFEs, which mitigated all impact by 02:26. In order to prevent this type of outage from happening again we are pursuing the following: We want to apologize for the length and severity of this incident. We are taking immediate steps to prevent recurrence and improve reliability in the future. If your service or application was affected, we apologize. This is not the level of quality and reliability we strive to offer you, and we are taking immediate steps to improve the platform’s performance and availability. Detailed Description of Impact: On Friday, 12 November 2021, at 00:30 US/Pacific, the GCLB service experienced failures resulting in impact to several downstream Google Cloud services for 1 hour, 56 minutes. Some customers in Europe were unable to access web and mobile clients for services including Gmail, Groups, Calendar, Tasks, and Chat. Google Gmail: Affected customers were unable to access web and mobile clients. 
This resulted in a ~2% traffic drop for Gmail services. This mostly affected customers in Europe. The period of impact was between 00:30 and 02:53. Google Groups: Affected customers in Europe were unable to access web and mobile clients. The period of impact was between 01:28 and 03:06, during which time affected customers had issues loading the Groups UI. Google Tasks: Google Tasks experienced error rates up to ~0.2% in Europe. Affected customers were unable to access web and mobile clients. The period of impact was between 00:30 and 02:10. Google Calendar: Google Calendar experienced error rates up to ~0.5% in Europe. Affected customers were unable to access web and mobile clients. The period of impact was between 00:30 and 02:10. Google Chat: 14.5% of Chat users could not connect, which impaired functionality in their clients. This affected mostly European users, both web and mobile. The period of impact was between 00:30 and 02:20. Appendix: [1] - https://cloud.google.com/blog/products/gcp/introducing-quic-support-https-load-balancing |
| Nov 13, 2021 | 1:04 AM UTC | We apologize for the inconvenience this service disruption/outage may have caused. We would like to provide some information about this incident below. Please note, this information is based on our best knowledge at the time of posting and is subject to change as our investigation continues. If you have experienced impact outside of what is listed below, please reach out to Google Support by opening a case using https://cloud.google.com/support. (All Times US/Pacific) Incident Start: 12 November 2021 00:30 Incident End: 12 November 2021 02:14 Duration: 1 hour 44 minutes Affected Services and Features:
Regions/Zones: Europe Description: Google’s Front End load balancing service experienced failures resulting in impact to several downstream Google Cloud services in Europe. From preliminary analysis, the issue was caused by a new infrastructure feature triggering a latent issue within internal network load balancer code. Customer Impact:
Additional details: The error was caught within 4 minutes by automated safety systems, and further spread was slowed at this point. The issue was fully mitigated approximately 1 hour 44 minutes later, when our engineering team completed a rollout to disable the vulnerable code path. The issue will be fully prevented going forward via a root cause fix, which will complete rollout by 12 November 2021 21:00 US/Pacific. |
| Nov 12, 2021 | 10:55 AM UTC | The problem with Google Groups has been resolved. We apologize for the inconvenience and thank you for your patience and continued support. |
| Nov 12, 2021 | 10:39 AM UTC | Our team is continuing to investigate this issue. We will provide an update by Nov 12, 2021, 11:00 AM UTC with more information about this problem. Thank you for your patience. The affected users are unable to access Google Groups. Some users in Europe may experience issues when attempting to access services. |
| Nov 12, 2021 | 10:05 AM UTC | We're investigating reports of an issue with Google Groups. We will provide more information shortly. The affected users are unable to access Google Groups. We are investigating an issue affecting the ability of some users in Europe to access some services. |
- Times are listed in Coordinated Universal Time (UTC)
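
The root cause described in the incident report (a changed token format reaching servers that do not understand it) can be illustrated with a toy example. The report does not describe the real token layout or how GFEs failed, so everything below is an assumption made for illustration: the one-byte version prefix, the function names, and the use of an exception to stand in for a crash are all hypothetical.

```python
from typing import Optional

KNOWN_VERSION = 1  # hypothetical: the only format an un-updated server understands


def make_token(version: int, client_ip: str) -> bytes:
    """Build a toy token: one version byte followed by the client address."""
    return bytes([version]) + client_ip.encode("ascii")


def parse_token_strict(token: bytes) -> str:
    """Fragile parser: assumes every token uses the known format."""
    if token[0] != KNOWN_VERSION:
        # Stands in for a crash in a server that cannot interpret the token.
        raise ValueError(f"unknown token version {token[0]}")
    return token[1:].decode("ascii")


def parse_token_tolerant(token: bytes) -> Optional[str]:
    """Defensive parser: treats anything unrecognized as 'no token'."""
    if not token or token[0] != KNOWN_VERSION:
        return None  # caller would fall back to validating the address again
    try:
        return token[1:].decode("ascii")
    except UnicodeDecodeError:
        return None


old_token = make_token(1, "192.0.2.10")
new_token = make_token(2, "192.0.2.10")  # issued during the rollout window

print(parse_token_tolerant(old_token))
print(parse_token_tolerant(new_token))
```

Under these assumptions, the tolerant parser returns the address for the old token and `None` for the new one, while the strict parser raises on the new one. The sketch also shows why a server-side rollback alone would not be enough in such a design: a token like `new_token` stays with the client and is presented again on reconnection, as the report describes.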