
Overview
On Thursday, February 6, 2025, multiple Cloudflare services, including R2 object storage, experienced a significant outage lasting 59 minutes. The incident caused a complete failure of operations against R2 and disrupted dependent services such as Stream, Images, Cache Reserve, Vectorize, and Log Delivery. The root cause was human error combined with inadequate validation safeguards during routine abuse remediation procedures.
Impact Summary
Incident Duration: 08:14 UTC to 09:13 UTC (primary impact), with residual effects until 09:36 UTC.
Primary Issue: Inadvertent disabling of the R2 Gateway service, which serves the R2 API.
Data Integrity: No data loss or corruption occurred within R2.
Affected Services
R2: 100% failure of operations (uploads, downloads, metadata) during the outage. Minor residual errors (<1%) post-recovery.
Stream: Complete service disruption during the outage.
Images: Full impact on upload/download; delivery minimally affected (97% success rate).
Cache Reserve: Increased origin requests, impacting <0.049% of cacheable requests.
Log Delivery: Delays and data loss (up to 4.5% for non-R2, 13.6% for R2 jobs).
Durable Objects: 0.09% error rate spike post-recovery.
Cache Purge: 1.8% error rate increase, 10x latency during the incident.
Vectorize: 75% query failures, 100% insert/upsert/delete failures during the outage.
Key Transparency Auditor: Complete failure of publish/read operations.
Workers & Pages: Minimal deployment failures (0.002%) for projects with R2 bindings.
Incident Timeline
08:12 UTC: R2 Gateway service inadvertently disabled.
08:14 UTC: Service degradation begins.
08:25 UTC: Internal incident declared.
08:42 UTC: Root cause identified.
08:57 UTC: Operations team begins re-enabling the R2 Gateway.
09:10 UTC: R2 starts to recover.
09:13 UTC: Primary impact ends.
09:36 UTC: Residual error rates recover.
10:29 UTC: Incident officially closed after monitoring.
Root Cause Analysis
The incident stemmed from human error while remediating a phishing-site abuse report. Instead of disabling the specific endpoint named in the report, the action taken disabled the entire R2 Gateway service. Contributing factors included the following (a sketch of the kind of guardrail that was missing appears after this list):
Lack of system-level safeguards.
Inadequate account tagging and validation.
Limited operator training on critical service disablement risks.
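The report does not describe Cloudflare's internal tooling, but the missing safeguard can be illustrated with a minimal TypeScript sketch. The names below (AbuseAction, AccountTags, validateRemediation) are hypothetical and exist only to show the idea: an abuse-remediation action should be rejected unless it is scoped to a specific resource and the target account is not tagged as an internal or critical service.

```typescript
// Hypothetical guardrail for an abuse-remediation action. All names here
// (AbuseAction, AccountTags, validateRemediation) are invented for
// illustration and do not describe Cloudflare's internal tooling.

interface AbuseAction {
  accountId: string;
  product: "r2" | "stream" | "images";
  targetBucket?: string;          // the specific resource named in the report
  disableEntireProduct?: boolean;
}

interface AccountTags {
  isInternalService: boolean;     // e.g. the account backing the R2 Gateway itself
  isCritical: boolean;
}

function validateRemediation(action: AbuseAction, tags: AccountTags): void {
  // Guardrail 1: an abuse action must be scoped to a single resource,
  // never to an entire product.
  if (action.disableEntireProduct || !action.targetBucket) {
    throw new Error("Remediation must target a specific bucket, not a whole product.");
  }
  // Guardrail 2: internal or critical accounts require escalation and
  // review rather than direct disablement.
  if (tags.isInternalService || tags.isCritical) {
    throw new Error("Target account is tagged internal/critical; escalate for review.");
  }
}
```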
The Risks of CDN Dependencies in Critical Systems
Content Delivery Networks (CDNs) play a vital role in improving website performance, scalability, and security. However, relying heavily on CDNs for critical systems can introduce significant risks when outages occur:
Lost Revenue: Downtime on e-commerce platforms or SaaS services can result in immediate lost sales and financial transactions, directly affecting revenue streams.
Lost Data: Although R2 did not suffer data loss in this incident, disruptions in data transmission processes can lead to lost or incomplete data, especially in logging and analytics services.
Lost Customers: Extended or repeated outages can erode customer trust and satisfaction, leading to churn and damage to brand reputation.
Operational Disruptions: Businesses relying on real-time data processing or automated workflows may face cascading failures when a critical CDN service is unavailable; a defensive fallback pattern is sketched below.
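One way dependent applications can limit this blast radius is to degrade gracefully when the object store is unreachable. The following is a minimal sketch, assuming a Cloudflare Worker with an R2 bucket binding named ASSETS and an ORIGIN_URL environment variable (both names are assumptions for this example); it serves from R2 when possible and falls back to the origin otherwise.

```typescript
// Minimal graceful-degradation sketch for a Cloudflare Worker that serves
// assets from R2 and falls back to an origin server when R2 is unreachable.
// The binding name ASSETS and the ORIGIN_URL variable are assumptions made
// for this example. Types such as R2Bucket come from @cloudflare/workers-types.

interface Env {
  ASSETS: R2Bucket;   // R2 bucket binding configured in wrangler.toml
  ORIGIN_URL: string; // e.g. "https://origin.example.com"
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const key = new URL(request.url).pathname.slice(1);
    try {
      const object = await env.ASSETS.get(key);
      if (object !== null) {
        return new Response(object.body, { headers: { etag: object.httpEtag } });
      }
    } catch (err) {
      // Errors from the R2 Gateway land here; log and fall through to the
      // origin rather than failing the whole request.
      console.warn("R2 unavailable, falling back to origin:", err);
    }
    // Fallback path: proxy the request to the origin of record.
    return fetch(new Request(`${env.ORIGIN_URL}/${key}`, request));
  },
};
```

With a pattern like this, an outage similar to the one described above degrades into slower origin fetches (much as Cache Reserve fell back to origins) rather than hard failures.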
Remediation Steps
Immediate Actions:
Deployment of additional guardrails in the Admin API.
Disabling high-risk manual actions in the abuse review UI.
In-Progress Measures:
Improving internal account provisioning.
Restricting product disablement permissions.
Implementing two-party approval for critical actions (a minimal sketch follows this list).
Enhancing abuse checks to prevent internal service disruptions.
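To illustrate the two-party approval idea, here is a short, hypothetical TypeScript sketch; the types and functions (CriticalAction, approve, execute) are invented for this example and are not Cloudflare's internal API.

```typescript
// Hypothetical two-party approval flow for high-impact actions such as
// disabling a product. Types and functions are invented for illustration.

interface CriticalAction {
  id: string;
  description: string;   // e.g. "Disable product X for account Y"
  requestedBy: string;
  approvedBy?: string;
}

function approve(action: CriticalAction, approver: string): CriticalAction {
  // The requester can never be their own second pair of eyes.
  if (approver === action.requestedBy) {
    throw new Error("Two-party approval: requester cannot approve their own action.");
  }
  return { ...action, approvedBy: approver };
}

function execute(action: CriticalAction, run: () => void): void {
  // Execution is blocked until a second person has signed off.
  if (!action.approvedBy) {
    throw new Error(`Action ${action.id} requires a second approver.`);
  }
  run();
}
```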
Cloudflare acknowledges the severity of this incident and the disruption it caused to customers. We are committed to strengthening our systems, implementing robust safeguards, and ensuring that similar incidents are prevented in the future.
For more information about Cloudflare's services or to explore career opportunities, visit our website.