All incidents

Cloud outage June 10th

Summary

Impact

Root Cause Analysis

  • Change of Resource Usage: The deployed applications were experiencing a change in usual resource consumption.
  • Aggressive Scaling Policy: The recently implemented scaling policy was overly aggressive in resource deallocation.
  • Cluster Topology: The specific configuration of the underlying cluster exacerbated the resource constraints.

Resolution

Next steps

  • Scaling Policy Review: Our team is actively reviewing and adjusting the scaling policy to ensure optimal resource allocation and prevent future resource contention.
  • Application Resource Optimization: We are continuing to investigate opportunities to optimize resource utilization within our applications.
  • Cluster Configuration Analysis: The underlying cluster configuration is being evaluated to identify any potential optimizations that could mitigate similar incidents in the future.

Timeline

  • 15:30 CEST: Deployment of a new release that includes performance improvements specific to the cloud.
  • 16:05 CEST: Service degradation began.
  • 16:10 CEST: Service degradation reported by customers and monitoring tools.
  • 16:20 CEST: Incident response team mobilized
  • 16:30 CEST: Incident response team starts rolling back the release
  • 16:45 CEST: Rollback of the release completed
  • 17:10 CEST: Public communication about the incident on public channels
  • 17:15 CEST: Remaining affected components are restored in working state
  • 17:20 CEST: Service is back to nominal performance.
  • 17:40 CEST: End of incident communicated on public channels.
  • 18:00 CEST: Incident response team investigation ends
  • 12:00 CEST: Incident report published on the website.

Current status:

Last updated: 2024-06-12 10:30:00 CET
Flag of European UnionMade in Europe. Privacy by default.