
Overview
On Thursday, February 6, 2025, multiple Cloudflare services, including R2 object storage, experienced a significant outage lasting 59 minutes. The incident caused a complete failure of operations against R2 and disrupted dependent services such as Stream, Images, Cache Reserve, Vectorize, and Log Delivery. The root cause was human error combined with inadequate validation safeguards during routine abuse remediation procedures.
Impact Summary
Incident Duration: 08:14 UTC to 09:13 UTC (primary impact), with residual effects until 09:36 UTC.
Primary Issue: Inadvertent disabling of the R2 Gateway service, which serves the R2 API.
Data Integrity: No data loss or corruption occurred within R2.
Affected Services
R2: 100% failure of operations (uploads, downloads, metadata) during the outage. Minor residual errors (<1%) post-recovery.
Stream: Complete service disruption during the outage.
Images: Full impact on upload/download; delivery minimally affected (97% success rate).
Cache Reserve: Increased origin requests, impacting <0.049% of cacheable requests.
Log Delivery: Delays and data loss (up to 4.5% for non-R2, 13.6% for R2 jobs).
Durable Objects: 0.09% error rate spike post-recovery.
Cache Purge: 1.8% error rate increase, 10x latency during the incident.
Vectorize: 75% query failures, 100% insert/upsert/delete failures during the outage.
Key Transparency Auditor: Complete failure of publish/read operations.
Workers & Pages: Minimal deployment failures (0.002%) for projects with R2 bindings.
Incident Timeline
08:12 UTC: R2 Gateway service inadvertently disabled.
08:14 UTC: Service degradation begins.
08:25 UTC: Internal incident declared.
08:42 UTC: Root cause identified.
08:57 UTC: Operations team begins re-enabling the R2 Gateway.
09:10 UTC: R2 starts to recover.
09:13 UTC: Primary impact ends.
09:36 UTC: Residual error rates recover.
10:29 UTC: Incident officially closed after monitoring.
Root Cause Analysis
The incident stemmed from human error while remediating a phishing-site abuse report. Instead of disabling the specific endpoint named in the report, the action taken disabled the entire R2 Gateway service. Contributing factors included the following (a sketch of the kind of guardrail that was missing appears after this list):
Lack of system-level safeguards.
Inadequate account tagging and validation.
Limited operator training on critical service disablement risks.
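The report does not describe Cloudflare's internal tooling, but the missing safeguard can be illustrated with a minimal TypeScript sketch. The names below (AbuseAction, AccountTags, validateRemediation) are hypothetical and exist only to show the idea: an abuse-remediation action should be rejected unless it is scoped to a specific resource and the target account is not tagged as an internal or critical service.

```typescript
// Hypothetical guardrail for an abuse-remediation action. All names here
// (AbuseAction, AccountTags, validateRemediation) are invented for
// illustration and do not describe Cloudflare's internal tooling.

interface AbuseAction {
  accountId: string;
  product: "r2" | "stream" | "images";
  targetBucket?: string;          // the specific resource named in the report
  disableEntireProduct?: boolean;
}

interface AccountTags {
  isInternalService: boolean;     // e.g. the account backing the R2 Gateway itself
  isCritical: boolean;
}

function validateRemediation(action: AbuseAction, tags: AccountTags): void {
  // Guardrail 1: an abuse action must be scoped to a single resource,
  // never to an entire product.
  if (action.disableEntireProduct || !action.targetBucket) {
    throw new Error("Remediation must target a specific bucket, not a whole product.");
  }
  // Guardrail 2: internal or critical accounts require escalation and
  // review rather than direct disablement.
  if (tags.isInternalService || tags.isCritical) {
    throw new Error("Target account is tagged internal/critical; escalate for review.");
  }
}
```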
The Risks of CDN Dependencies in Critical Systems
Content Delivery Networks (CDNs) play a vital role in improving website performance, scalability, and security. However, relying heavily on CDNs for critical systems can introduce significant risks when outages occur:
Lost Revenue: Downtime on e-commerce platforms or SaaS services can result in immediate lost sales and financial transactions, directly affecting revenue streams.
Lost Data: Although R2 did not suffer data loss in this incident, disruptions in data transmission processes can lead to lost or incomplete data, especially in logging and analytics services.
Lost Customers: Extended or repeated outages can erode customer trust and satisfaction, leading to churn and damage to brand reputation.
Operational Disruptions: Businesses relying on real-time data processing or automated workflows may face cascading failures when a critical CDN service is unavailable; a defensive fallback pattern is sketched below.
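One way dependent applications can limit this blast radius is to degrade gracefully when the object store is unreachable. The following is a minimal sketch, assuming a Cloudflare Worker with an R2 bucket binding named ASSETS and an ORIGIN_URL environment variable (both names are assumptions for this example); it serves from R2 when possible and falls back to the origin otherwise.

```typescript
// Minimal graceful-degradation sketch for a Cloudflare Worker that serves
// assets from R2 and falls back to an origin server when R2 is unreachable.
// The binding name ASSETS and the ORIGIN_URL variable are assumptions made
// for this example. Types such as R2Bucket come from @cloudflare/workers-types.

interface Env {
  ASSETS: R2Bucket;   // R2 bucket binding configured in wrangler.toml
  ORIGIN_URL: string; // e.g. "https://origin.example.com"
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname.slice(1);
    try {
      const object = await env.ASSETS.get(key);
      if (object !== null) {
        return new Response(object.body, { headers: { etag: object.httpEtag } });
      }
    } catch (err) {
      // Errors from the R2 Gateway land here; log and fall through to the
      // origin rather than failing the whole request.
      console.warn("R2 unavailable, falling back to origin:", err);
    }
    // Fallback path: proxy the request to the origin of record.
    return fetch(new Request(`${env.ORIGIN_URL}/${key}`, request));
  },
};
```

With a pattern like this, an outage similar to the one described above degrades into slower origin fetches (much as Cache Reserve fell back to origins) rather than hard failures.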
Remediation Steps
Immediate Actions:
Deployment of additional guardrails in the Admin API.
Disabling high-risk manual actions in the abuse review UI.
In-Progress Measures:
Improving internal account provisioning.
Restricting product disablement permissions.
Implementing two-party approval for critical actions (a minimal sketch follows this list).
Enhancing abuse checks to prevent internal service disruptions.
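To illustrate the two-party approval idea, here is a short, hypothetical TypeScript sketch; the types and functions (CriticalAction, approve, execute) are invented for this example and are not Cloudflare's internal API.

```typescript
// Hypothetical two-party approval flow for high-impact actions such as
// disabling a product. Types and functions are invented for illustration.

interface CriticalAction {
  id: string;
  description: string;   // e.g. "Disable product X for account Y"
  requestedBy: string;
  approvedBy?: string;
}

function approve(action: CriticalAction, approver: string): CriticalAction {
  // The requester can never be their own second pair of eyes.
  if (approver === action.requestedBy) {
    throw new Error("Two-party approval: requester cannot approve their own action.");
  }
  return { ...action, approvedBy: approver };
}

function execute(action: CriticalAction, run: () => void): void {
  // Execution is blocked until a second person has signed off.
  if (!action.approvedBy) {
    throw new Error(`Action ${action.id} requires a second approver.`);
  }
  run();
}
```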
Cloudflare acknowledges the severity of this incident and the disruption it caused to customers. We are committed to strengthening our systems, implementing robust safeguards, and ensuring that similar incidents are prevented in the future.
For more information about Cloudflare's services or to explore career opportunities, visit our website.