Mohammed Ufraan is a backend engineer and system architecture specialist currently pursuing Computer Science at LIET. With expertise in distributed systems, microservices architecture, API development, and cloud infrastructure, Mohammed focuses on building scalable backend solutions and optimizing database performance. This portfolio showcases projects and technical expertise in backend development, system design, and software engineering.
recently read about a major outage that hit cloudflare's network. it's a good case study, so let's break it down in layman's terms. for the full technical post mortem, check out the cloudflare blog for the official breakdown.
what happened??
on 18 nov at about 11:20 UTC, cloudflare's network started showing widespread failures. end users trying to reach sites behind cloudflare got error pages instead.
this wasn’t because of a cyberattack.
instead, it was triggered by a change to a database system's permissions. yes, a simple change to a database's access control list triggered all of this.
that change made the system output a feature file, used by their bot management module, that was much larger than expected. the oversized file got propagated throughout their network to every machine running the module. but the software wasn't built to handle it: it had a hard size limit, and when the file blew past that limit, boom, failure.
why it went wrong
here’s a simpler breakdown of the chain of events
bot management module: cloudflare runs a module called bot management. it relies on a feature file describing the traits its machine learning models use to detect bots
feature file generation: that file is generated every few minutes by their database cluster (clickhouse) and then distributed to all proxy machines
unexpected data growth: a change to query behaviour (metadata visibility was widened) meant the feature file generator started picking up duplicate rows. this roughly doubled the number of features in the file
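to make the duplicate row issue concrete, here's a minimal rust sketch (the feature names, the query function, and the exact doubling are my own illustrative assumptions, not cloudflare's actual query or schema): the same metadata rows showing up twice silently doubles the feature count unless the generator deduplicates them.

```rust
use std::collections::BTreeSet;

// hypothetical rows returned by the metadata query: (feature_name, type).
// after the permission change, each row shows up twice because the query
// now also sees the underlying replicated database.
fn query_feature_columns() -> Vec<(String, String)> {
    let base = vec![
        ("req_rate".to_string(), "Float64".to_string()),
        ("ua_entropy".to_string(), "Float64".to_string()),
    ];
    // duplicate every row, mimicking the doubled visibility
    base.iter().cloned().chain(base.iter().cloned()).collect()
}

// counts duplicates too: the generated file silently grows
fn feature_count_naive(rows: &[(String, String)]) -> usize {
    rows.len()
}

// deduplicate by feature name before counting: growth stays bounded
fn feature_count_deduped(rows: &[(String, String)]) -> usize {
    rows.iter()
        .map(|(name, _)| name.clone())
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    let rows = query_feature_columns();
    println!("naive:   {}", feature_count_naive(&rows));   // 4
    println!("deduped: {}", feature_count_deduped(&rows)); // 2
}
```

the point: nothing "broke" at this step. the query still succeeded, it just returned more rows than anyone assumed it could.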
memory limit hit: the bot management module had a preallocated memory limit of about 200 features. under normal conditions it ran with around 60. when the file went past that limit, the module (rust code in the FL2 proxy) panicked and caused 5xx http errors for all traffic depending on it. the unhandled error came from a size check in the FL2 rust code.
the failed check produced the panic below, which surfaced to users as a 5xx error:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
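the failure pattern is easy to reproduce in miniature. this is a hedged sketch, not cloudflare's actual FL2 code (the function name and error message are made up): a loader with a preallocated limit returns an Err when the file is too big, and calling unwrap() on that Err is exactly the kind of call that panics a worker thread.

```rust
const MAX_FEATURES: usize = 200; // preallocated capacity, per the post mortem

// hypothetical loader: succeeds only if the file fits the preallocated space
fn load_feature_file(feature_count: usize) -> Result<usize, String> {
    if feature_count > MAX_FEATURES {
        return Err(format!(
            "feature file has {} features, exceeds limit of {}",
            feature_count, MAX_FEATURES
        ));
    }
    Ok(feature_count)
}

fn main() {
    // normal day: ~60 features, well under the limit
    assert!(load_feature_file(60).is_ok());

    // after the duplicate rows: the count blows past 200 and we get an Err.
    // unwrapping it is what turns a recoverable error into a thread panic:
    let oversized = load_feature_file(260);
    assert!(oversized.is_err());
    // oversized.unwrap(); // <- this line would panic the worker thread
}
```

the safer move is to propagate the Err (or fall back to a previous valid file) instead of unwrapping, so one bad input can't take the whole proxy down.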
because a core module failed, other dependent services suffered too (cdn, security services, workers kv, access, dashboard login, etc.)
core parts of their traffic routing were impacted. initially they suspected a massive ddos attack because of the symptoms (status page down, huge error spike), but the cause was internal
mitigation kicks in: workers kv and access bypass the failing module
14:30: main fix deployed globally (a known good feature file pushed back out)
17:06: all services fully resolved
lessons learned
even such powerful systems fail: small changes can trigger unexpected growth of data/config and hit preallocated limits
monitoring matters: the fail recover fail pattern (sometimes a good file was generated, sometimes a bad one) made the root cause tricky to pin down
dependency chains are risky: one module failing can drag down many others
high uptime is expensive: moving from 99.99% to 99.999% requires serious effort
start small, plan growth: find the sweet spot between just works now and can scale later
admit mistakes: cloudflare called this their worst outage since 2019
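the "dependency chains are risky" lesson suggests a defensive pattern worth sketching (my own illustration, not anything from cloudflare's codebase): validate a new config and keep the last known good one on failure, instead of crashing.

```rust
// minimal "last known good" config reload sketch.
// the struct, validate(), and the 200-feature limit are illustrative assumptions.
struct FeatureConfig {
    features: Vec<String>,
}

fn validate(candidate: &FeatureConfig) -> Result<(), String> {
    const MAX_FEATURES: usize = 200;
    if candidate.features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds limit of {}",
            candidate.features.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}

// apply the new config only if it validates; otherwise keep serving
// with the current one and report the error instead of panicking
fn reload(current: FeatureConfig, candidate: FeatureConfig) -> FeatureConfig {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(e) => {
            eprintln!("rejecting new feature file: {e}; keeping last known good");
            current
        }
    }
}

fn main() {
    let good = FeatureConfig { features: vec!["req_rate".to_string(); 60] };
    let bad = FeatureConfig { features: vec!["dup".to_string(); 260] };
    let active = reload(good, bad);
    assert_eq!(active.features.len(), 60); // still serving the old, valid file
}
```

the trade off: you serve slightly stale bot detection data for a few minutes, but you keep serving.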
final note:
well, almost all huge failures come down to unexpected data growth, broken assumptions, or one module dragging others down with it.
you also might wanna read this XD