Mohammed Ufraan is a backend engineer and system architecture specialist currently pursuing Computer Science at LIET. With expertise in distributed systems, microservices architecture, API development, and cloud infrastructure, Mohammed focuses on building scalable backend solutions and optimizing database performance. This portfolio showcases projects and technical expertise in backend development, system design, and software engineering.
recently read about a major outage that hit cloudflare's network. it's a good case study, so let's break it down in layman's terms. for the full technical post mortem, check out the cloudflare blog for the official breakdown.
what happened??
on 18 nov at about 11:20 UTC, cloudflare's network started showing widespread failures. end users trying to reach sites behind cloudflare got error pages instead.
this wasn’t because of a cyberattack.
instead, it was triggered by a change to a database system's permissions. yes, a simple change to a database's access control list triggered all of this.
that change made the system output a feature file, used by their bot management module, that was much larger than expected. the oversized file got propagated throughout their network to every machine running the module. but the software wasn't built to handle it: it had a hard size limit, and when the file blew past that limit, boom, failure.
why it went wrong
here’s a simpler breakdown of the chain of events
bot management module: cloudflare runs a module called bot management. it relies on a feature file describing the traits its machine learning models use to detect bots
feature file generation: that file is generated every few minutes by their database cluster (clickhouse) and then distributed to all proxy machines
unexpected data growth: a change to query behaviour (metadata visibility was widened) meant the feature file generator started picking up duplicate rows. this roughly doubled the number of features in the file
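to make the duplicate row issue concrete, here's a minimal rust sketch (the feature names, the query function, and the exact doubling are my own illustrative assumptions, not cloudflare's actual query or schema): the same metadata rows showing up twice silently doubles the feature count unless the generator deduplicates them.

```rust
use std::collections::BTreeSet;

// hypothetical rows returned by the metadata query: (feature_name, type).
// after the permission change, each row shows up twice because the query
// now also sees the underlying replicated database.
fn query_feature_columns() -> Vec<(String, String)> {
    let base = vec![
        ("req_rate".to_string(), "Float64".to_string()),
        ("ua_entropy".to_string(), "Float64".to_string()),
    ];
    // duplicate every row, mimicking the doubled visibility
    base.iter().cloned().chain(base.iter().cloned()).collect()
}

// counts duplicates too: the generated file silently grows
fn feature_count_naive(rows: &[(String, String)]) -> usize {
    rows.len()
}

// deduplicate by feature name before counting: growth stays bounded
fn feature_count_deduped(rows: &[(String, String)]) -> usize {
    rows.iter()
        .map(|(name, _)| name.clone())
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    let rows = query_feature_columns();
    println!("naive:   {}", feature_count_naive(&rows));   // 4
    println!("deduped: {}", feature_count_deduped(&rows)); // 2
}
```

the point: nothing "broke" at this step. the query still succeeded, it just returned more rows than anyone assumed it could.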
memory limit hit: the bot management module had a preallocated memory limit of about 200 features. under normal conditions it ran with around 60. when the file went past that limit, the module (rust code in the FL2 proxy) panicked and caused 5xx http errors for all traffic depending on it. the unhandled error came from a size check in the FL2 rust code.
the failed check produced the panic below, which surfaced to users as a 5xx error:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
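the failure pattern is easy to reproduce in miniature. this is a hedged sketch, not cloudflare's actual FL2 code (the function name and error message are made up): a loader with a preallocated limit returns an Err when the file is too big, and calling unwrap() on that Err is exactly the kind of call that panics a worker thread.

```rust
const MAX_FEATURES: usize = 200; // preallocated capacity, per the post mortem

// hypothetical loader: succeeds only if the file fits the preallocated space
fn load_feature_file(feature_count: usize) -> Result<usize, String> {
    if feature_count > MAX_FEATURES {
        return Err(format!(
            "feature file has {} features, exceeds limit of {}",
            feature_count, MAX_FEATURES
        ));
    }
    Ok(feature_count)
}

fn main() {
    // normal day: ~60 features, well under the limit
    assert!(load_feature_file(60).is_ok());

    // after the duplicate rows: the count blows past 200 and we get an Err.
    // unwrapping it is what turns a recoverable error into a thread panic:
    let oversized = load_feature_file(260);
    assert!(oversized.is_err());
    // oversized.unwrap(); // <- this line would panic the worker thread
}
```

the safer move is to propagate the Err (or fall back to a previous valid file) instead of unwrapping, so one bad input can't take the whole proxy down.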
because a core module failed, other dependent services suffered too (cdn, security services, workers kv, access, dashboard login, etc.)
core parts of their traffic routing were impacted. initially they suspected a massive ddos attack because of the symptoms (status page down, huge error spike), but the cause was internal
mitigation kicks in: workers kv and access bypass the failing module
14:30: main fix deployed globally (a known good feature file pushed back out)
17:06: all services fully resolved
lessons learned
even such powerful systems fail: small changes can trigger unexpected growth of data/config and hit preallocated limits
monitoring matters: the fail recover fail pattern (sometimes a good file was generated, sometimes a bad one) made the root cause tricky to pin down
dependency chains are risky: one module failing can drag down many others
high uptime is expensive: moving from 99.99% to 99.999% requires serious effort
start small, plan growth: find the sweet spot between just works now and can scale later
admit mistakes: cloudflare called this their worst outage since 2019
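the "dependency chains are risky" lesson suggests a defensive pattern worth sketching (my own illustration, not anything from cloudflare's codebase): validate a new config and keep the last known good one on failure, instead of crashing.

```rust
// minimal "last known good" config reload sketch.
// the struct, validate(), and the 200-feature limit are illustrative assumptions.
struct FeatureConfig {
    features: Vec<String>,
}

fn validate(candidate: &FeatureConfig) -> Result<(), String> {
    const MAX_FEATURES: usize = 200;
    if candidate.features.len() > MAX_FEATURES {
        return Err(format!(
            "{} features exceeds limit of {}",
            candidate.features.len(),
            MAX_FEATURES
        ));
    }
    Ok(())
}

// apply the new config only if it validates; otherwise keep serving
// with the current one and report the error instead of panicking
fn reload(current: FeatureConfig, candidate: FeatureConfig) -> FeatureConfig {
    match validate(&candidate) {
        Ok(()) => candidate,
        Err(e) => {
            eprintln!("rejecting new feature file: {e}; keeping last known good");
            current
        }
    }
}

fn main() {
    let good = FeatureConfig { features: vec!["req_rate".to_string(); 60] };
    let bad = FeatureConfig { features: vec!["dup".to_string(); 260] };
    let active = reload(good, bad);
    assert_eq!(active.features.len(), 60); // still serving the old, valid file
}
```

the trade off: you serve slightly stale bot detection data for a few minutes, but you keep serving.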
final note:
well, almost all huge failures come down to unexpected data growth, broken assumptions, or one module dragging others down with it.
you also might wanna read this XD