Summary
On March 26th, 2025, an incident occurred during the deployment of version v4.12.1 on the production cluster, as part of a scheduled maintenance window. The incident resulted in HTTP 500 errors for users due to a defective implementation of a "fallback mode" used temporarily during a database schema change related to tags. These errors persisted for nearly 20 minutes before mitigation and resolution measures were implemented.
Impact
Users experienced intermittent 500 internal server errors when accessing services related to the cloud-api, specifically affecting the resource workspace availability. The incident occurred during working hours but toward the end of the day, minimizing potential peak user impact.
Root Cause Analysis
The issue originated from a temporary mechanism introduced to support database changes related to tags. This approach, while intended to minimize impact, contained bugs that resulted in 500 errors. Testing did not cover all relevant scenarios, and some warning signs were missed during testing. Additionally, a deployment configuration oversight contributed to delays in the rollout process
Resolution
Upon identification of the issue, the engineering team:
- Disabled the Tags plugin, effectively mitigating the 500 errors.
- Resolved the underlying issue by 19:05 local time.
Timeline
March 26, 2025
- 15:00 CEST: scheduled maintenance start
- 15:00 CEST: scheduled maintenance start
- 17:00 CEST: Monitoring system reports incident
- 17:10 CEST: Incident accepted and triaged
- 17:15 CEST: Tag feature disabled
- 17:25 CEST: Last 500 error logged
- 19:05 CEST: Fix rolled out, tag feature re-nabled
- 19:15 CEST: Incident resolved and documentation initiated
March 27, 2025
- 15:00 CEST: Internal documentation completed
April 10, 2025
- 15:00 CEST: External documentation completed