Cloudflare is detailing the root cause of a major global outage that disrupted traffic across a large portion of the Internet on November 18, 2025, marking the company’s most severe service incident since 2019. While early internal investigations briefly raised the possibility of a hyper-scale DDoS attack, Cloudflare cofounder and CEO Matthew Prince confirmed that the outage was entirely self-inflicted.
The Cloudflare disruption, which began at 11:20 UTC, produced spikes of HTTP 5xx errors for users attempting to access websites, APIs, security services, and applications running through Cloudflare’s network – an infrastructure layer relied upon by millions of organizations worldwide.
According to Prince, the outage was caused by a misconfiguration in a database permissions update, which triggered a cascading failure in the company’s Bot Management system and, in turn, caused Cloudflare’s core proxy layer to fail at scale.
The error originated from a ClickHouse database cluster that was in the process of receiving new, more granular permissions. A query designed to generate a ‘feature file’ – a configuration input for Cloudflare’s machine-learning-powered Bot Management classifier – began producing duplicate entries once the permissions change allowed the system to see more metadata than before. The file doubled in size, exceeded the memory pre-allocation limits in Cloudflare’s routing software, and triggered software panics across edge machines globally.
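To make that failure mode concrete, here is a minimal sketch in Python, assuming a hypothetical metadata query and a made-up feature limit; the names, the 64-slot limit, and the code itself are illustrative assumptions, not Cloudflare’s actual schema or software.

MAX_FEATURES = 64  # imagined pre-allocated feature slots in the proxy

def query_feature_columns(visible_databases):
    """Simulate a metadata query that lists feature columns but never filters on
    the database name, so every schema the account can now see contributes rows."""
    feature_columns = [f"feature_{i}" for i in range(40)]
    rows = []
    for _db in visible_databases:      # one copy of each column per visible database
        rows.extend(feature_columns)
    return rows

def build_feature_file(rows):
    """Assemble the feature file without de-duplicating, mirroring a query that
    implicitly assumed only one database would ever be visible."""
    if len(rows) > MAX_FEATURES:
        raise RuntimeError(f"{len(rows)} features exceed pre-allocated limit of {MAX_FEATURES}")
    return rows

# Before the permissions change: one visible database, 40 features, well under the limit.
print(len(build_feature_file(query_feature_columns(["default"]))))

# After the change: a second database becomes visible, every feature appears twice,
# and the file blows past the limit the proxy expects.
try:
    build_feature_file(query_feature_columns(["default", "underlying_db"]))
except RuntimeError as exc:
    print(exc)

The point of the sketch is the implicit assumption: the query was correct only as long as exactly one database was visible, so a permissions change elsewhere silently doubled its output.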
Those feature files are refreshed every five minutes and propagated to all Cloudflare servers worldwide. The intermittent nature of the database rollout meant that some nodes generated a valid file while others created a malformed one, causing the network to oscillate between functional and failing states before collapsing into a persistent failure mode.
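The flapping behavior can be sketched the same way, assuming (hypothetically) that each five-minute generation cycle happens to run on either an updated or a not-yet-updated node; the probabilities and names below are invented for illustration.

import random

def generate_feature_file(node_is_updated):
    """An updated node sees the extra metadata and emits an oversized file."""
    return "oversized" if node_is_updated else "valid"

def five_minute_cycles(total_nodes=10, updated_nodes=4, cycles=12, seed=0):
    rng = random.Random(seed)
    for cycle in range(cycles):
        node = rng.randrange(total_nodes)          # which node ran the job this cycle
        file_state = generate_feature_file(node < updated_nodes)
        fleet_state = "FAILING" if file_state == "oversized" else "healthy"
        print(f"cycle {cycle:2d}: generated {file_state:9s} file -> fleet {fleet_state}")

five_minute_cycles()

Once the rollout covered the whole cluster, every cycle produced a bad file and the oscillation settled into a persistent outage.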
The initial symptoms were misleading. Traffic spikes, noisy error logs, intermittent recoveries, and even a coincidental outage of Cloudflare’s independently hosted status page contributed to early suspicion that the company was under attack. Only after correlating file-generation timestamps with error propagation patterns did engineers isolate the issue to the Bot Management configuration file.
By 14:24 UTC, Cloudflare had frozen propagation of new feature files, manually inserted a known-good version into the distribution pipeline, and forced restarts of its core proxy services – known internally as FL and FL2. Normal traffic flow began stabilizing around 14:30 UTC, with all downstream services recovering by 17:06 UTC.
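In outline, the mitigation amounted to three steps: stop automatic propagation, pin a last-known-good artifact, and restart the consumers already holding the bad one. The Python sketch below shows that pattern with entirely invented names; it does not correspond to any real Cloudflare tooling.

class ConfigPipeline:
    def __init__(self):
        self.propagation_frozen = False
        self.pinned_version = None

    def freeze(self):
        """Stop pushing newly generated feature files to the fleet."""
        self.propagation_frozen = True

    def pin_known_good(self, version):
        """Force distribution of a specific, previously validated artifact."""
        self.pinned_version = version

def mitigate(pipeline, proxies, known_good):
    pipeline.freeze()                    # 1. keep the bad file from spreading further
    pipeline.pin_known_good(known_good)  # 2. distribute a validated version instead
    for proxy in proxies:                # 3. restart consumers holding the bad state
        print(f"restarting {proxy} with feature file {known_good}")

mitigate(ConfigPipeline(), ["fl-edge-001", "fl2-edge-002"], "features-known-good")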
The impact was widespread because the faulty configuration hit Cloudflare’s core proxy infrastructure, the traffic-processing layer responsible for TLS termination, request routing, caching, security enforcement, and API traffic. When the Bot Management module failed, the proxy returned 5xx errors for every request that relied on that module. On the newer FL2 architecture, this manifested as widespread service errors; on the legacy FL system, bot scores defaulted to zero, creating potential false positives for customers who block traffic based on those scores.
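The difference between the two proxy generations comes down to how a module failure is handled: fail closed (reject the request with a 5xx) or fail open (continue with a degraded default). Here is a simplified Python sketch of that distinction; the module behavior and default values are assumptions, not Cloudflare’s actual code.

class FeatureFileTooLarge(Exception):
    pass

def score_request(request, feature_file_ok):
    """Pretend bot-scoring step that depends on a valid feature file."""
    if not feature_file_ok:
        raise FeatureFileTooLarge("feature file exceeds pre-allocated limit")
    return 30  # some plausible bot score

def handle_fail_closed(request, feature_file_ok):
    """FL2-style behavior during the incident: a module error fails the request."""
    try:
        score = score_request(request, feature_file_ok)
    except FeatureFileTooLarge:
        return 500, None            # request rejected with a 5xx
    return 200, score

def handle_fail_open(request, feature_file_ok):
    """FL-style behavior: the request proceeds, but the score defaults to 0,
    so rules that block low scores may block legitimate visitors."""
    try:
        score = score_request(request, feature_file_ok)
    except FeatureFileTooLarge:
        score = 0                   # degraded default
    return 200, score

print(handle_fail_closed({"path": "/"}, feature_file_ok=False))  # (500, None)
print(handle_fail_open({"path": "/"}, feature_file_ok=False))    # (200, 0)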
Multiple services either failed outright or degraded, including Turnstile (Cloudflare’s CAPTCHA-alternative challenge, used for dashboard logins), Workers KV (the distributed key-value store underpinning many customer applications), Access (Cloudflare’s Zero Trust authentication layer), and portions of the company’s dashboard. Internal APIs slowed under heavy retry load as customers attempted to log in or refresh configurations during the disruption.
Cloudflare emphasized that email security, DDoS mitigation, and core network connectivity remained operational, although spam-detection accuracy temporarily declined due to the loss of an IP reputation data source.
Prince acknowledged the magnitude of the disruption, noting that Cloudflare’s architecture is intentionally built for fault tolerance and rapid mitigation, and that a failure blocking core proxy traffic is deeply painful to the company’s engineering and operations teams. The outage, he said, violated Cloudflare’s commitment to keeping the Internet reliably accessible for organizations that depend on the company’s global network.
Cloudflare has already begun implementing systemic safeguards. These include hardened validation of internally generated configuration files, global kill switches for key features, more resilient error-handling across proxy modules, and mechanisms to prevent debugging systems or core dumps from consuming excessive CPU or memory during high-failure events.
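As an illustration of what hardened ingestion of internally generated configuration can look like, the Python sketch below validates size, length, and duplicates before accepting a new file, honors a global kill switch, and otherwise falls back to the last known-good version; the limits, checks, and names are assumptions, not Cloudflare’s published design.

MAX_FILE_BYTES = 1_000_000
MAX_FEATURES = 64

def validate_feature_file(raw, features):
    """Reject files that are oversized, over-long, or contain duplicate features."""
    if len(raw) > MAX_FILE_BYTES:
        raise ValueError("feature file larger than the expected upper bound")
    if len(features) > MAX_FEATURES:
        raise ValueError("more features than pre-allocated capacity")
    if len(set(features)) != len(features):
        raise ValueError("duplicate feature names suggest a bad generation query")

def ingest(raw, features, last_known_good, kill_switch_engaged=False):
    """Return the feature set the proxy should use for bot scoring."""
    if kill_switch_engaged:
        return None                  # feature globally disabled; proxy skips the module
    try:
        validate_feature_file(raw, features)
    except ValueError:
        return last_known_good       # keep serving traffic on the previous good file
    return features

# A file with duplicated features is rejected; the proxy keeps the last known-good set.
print(ingest(b"{}", ["f1", "f1"], last_known_good=["f1"]))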
The full incident timeline reflects a multi-hour race to diagnose symptoms, isolate root causes, contain cascading failures, and bring the network back online. Automated detection triggered alerts within minutes of the first malformed file reaching production, but fluctuating system states and misleading external indicators complicated root-cause analysis. Cloudflare teams deployed incremental mitigations – including bypassing Workers KV’s reliance on the proxy – while working to identify and replace the corrupted feature files.
By the time a fix reached all global data centers, Cloudflare’s network had stabilized, customer services were back online, and downstream errors were cleared.
As AI-driven automation and high-frequency configuration pipelines become fundamental to global cloud networks, the Cloudflare outage underscores how a single flawed assumption – in this case, about metadata visibility in ClickHouse queries – can ripple through distributed systems at Internet scale. The incident serves as a high-profile reminder that resilience engineering, configuration hygiene, and robust rollback mechanisms remain mission-critical in an era where edge networks process trillions of requests daily.
Executive Insights FAQ: Understanding the Cloudflare Outage
What triggered the outage in Cloudflare’s global network?
A database permissions update caused a ClickHouse query to return duplicate metadata, generating a Bot Management feature file twice its expected size. This exceeded memory limits in Cloudflare’s proxy software, causing widespread failures.
Why did Cloudflare initially suspect a DDoS attack?
Error rates spiked, services recovered intermittently, and Cloudflare’s independently hosted status page went down by coincidence – a combination of symptoms that resembled a coordinated attack and contributed to the early misdiagnosis.
Which services were most affected during the disruption?
Core CDN services, Workers KV, Access, and Turnstile all experienced failures or degraded performance because they depend on the same core proxy layer that ingests the Bot Management configuration.
Why did the issue propagate so quickly across Cloudflare’s global infrastructure?
The feature file responsible for the crash is refreshed every five minutes and distributed to all Cloudflare servers worldwide. Once malformed versions began replicating, the failure rapidly cascaded across regions.
What long-term changes is Cloudflare making to prevent future incidents?
The company is hardening configuration ingestion, adding global kill switches, improving proxy error handling, limiting the impact of debugging systems, and reviewing failure modes across all core traffic-processing modules.