Mohammed Ufraan

Mohammed Ufraan is a backend engineer and system architecture specialist currently pursuing Computer Science at LIET. With expertise in distributed systems, microservices architecture, API development, and cloud infrastructure, Mohammed focuses on building scalable backend solutions and optimizing database performance. This portfolio showcases projects and technical expertise in backend development, system design, and software engineering.


recent cloudflare outage


recently read about a major outage that hit cloudflare’s network. it’s a good case study, so let’s break it down in layman’s terms. for the full technical post mortem, check out the cloudflare blog for the official breakdown.


what happened??

on 18 nov 2025 at about 11:20 UTC, cloudflare’s network started showing widespread failures. end users trying to reach sites behind cloudflare got error pages

(screenshot: the 500 error page users saw)

this wasn’t a cyberattack. it was triggered by a change to a database system’s permissions. yes, a simple change to a database’s access control list caused all of this.

that change made the system output a feature file, used by their bot management module, that was much larger than expected. the oversized file got propagated throughout their network to every machine running the module, but the software wasn’t built to handle it: it had a hard size limit, and once the file crossed that limit.. boom, visible failure
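to make the “file got bigger than expected” part concrete, here’s a minimal rust sketch of the mechanism. this is not cloudflare’s actual code; `Row`, `generate_features`, and the idea of passing in a list of visible databases are made-up illustrations of how a metadata visibility change can silently double a generated file:

```rust
#[derive(Clone)]
struct Row {
    database: String,
    table: String,
    column: String,
}

// builds the feature list from metadata rows; only rows from databases
// currently visible to the query are picked up
fn generate_features(rows: &[Row], visible_dbs: &[&str]) -> Vec<String> {
    rows.iter()
        .filter(|r| visible_dbs.contains(&r.database.as_str()))
        .map(|r| format!("{}.{}", r.table, r.column))
        .collect()
}

fn main() {
    // the same underlying columns exist in two databases: the main one
    // and a replica ("r0" here, mirroring the public post mortem)
    let mut rows = Vec::new();
    for db in ["default", "r0"] {
        for i in 0..60 {
            rows.push(Row {
                database: db.to_string(),
                table: "bot_features".to_string(),
                column: format!("feature_{i}"),
            });
        }
    }

    // before the permissions change: only `default` was visible
    let before = generate_features(&rows, &["default"]);
    // after: the replica became visible too, and since the query never
    // filtered on database name, every row showed up twice
    let after = generate_features(&rows, &["default", "r0"]);

    println!("before: {} features", before.len()); // before: 60 features
    println!("after: {} features", after.len());   // after: 120 features
}
```

the query itself never changed. only what it could *see* changed, which is why the file doubled without anyone touching the generator.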


why it went wrong

here’s a simpler breakdown of the chain of events

  1. bot management module: cloudflare runs a module called bot management. it uses a feature file that describes the traits a machine learning model uses to detect bots
  2. feature file generation: that file is regenerated every few minutes by their database cluster (clickhouse) and then distributed to all proxy machines
  3. unexpected data growth: a change to query behaviour meant the feature file generator started picking up duplicate rows, because more database metadata became visible to it. this roughly doubled the number of features in the file
  4. memory limit hit: the bot management module had a pre-allocated limit (~200 features). under normal use it carried around 60. when the file went past the limit, the module (rust code) panicked and caused 5xx http errors for any traffic depending on it. the rust code in their FL2 proxy that makes the check, and was the source of the unhandled error, is this:
    (screenshot: the rust source of the failing check)
    this then resulted in the below panic, which in turn produced the 5xx errors:
thread fl2_worker_thread panicked: called Result::unwrap() on an Err value
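the shape of that bug is easy to reproduce. here’s a minimal sketch, assuming a hard feature limit like the ~200 cloudflare described; `load_features` and the limit value are illustrative, not the real FL2 source:

```rust
const MAX_FEATURES: usize = 200;

// refuses oversized feature files instead of truncating them,
// returning an Err the caller is supposed to handle
fn load_features(file: &[&str]) -> Result<Vec<String>, String> {
    if file.len() > MAX_FEATURES {
        return Err(format!(
            "feature file has {} entries, limit is {}",
            file.len(),
            MAX_FEATURES
        ));
    }
    Ok(file.iter().map(|s| s.to_string()).collect())
}

fn main() {
    // a file roughly double the normal size, like in the incident
    let oversized: Vec<&str> = vec!["feat"; 240];

    // the shape of the bug: .unwrap() on the Err value panics the
    // worker thread, which surfaced as 5xx errors for real traffic
    // let features = load_features(&oversized).unwrap(); // panics here

    // a safer pattern: handle the Err and keep serving with the
    // last known good config instead of taking the thread down
    match load_features(&oversized) {
        Ok(features) => println!("loaded {} features", features.len()),
        Err(e) => eprintln!("bad feature file, keeping previous config: {e}"),
    }
}
```

the point isn’t that limits are bad.. it’s that an input you “know” can’t happen (a config file your own pipeline generates) still needs a recovery path, not an unwrap.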

because a core module failed, other dependent services suffered too (cdn, security services, workers kv, access, dashboard login, etc.)

core parts of their traffic routing were impacted. initially they suspected a mega ddos attack because of the symptoms (status page down, huge error spike), but it was internal


timeline in short

source

| time (utc) | event |
| --- | --- |
| 11:05 | database access control change deployed |
| 11:20-ish | first errors seen |
| 13:05 | mitigation kicks in (workers kv and access bypass) |
| 14:30 | main fix deployed globally (old good feature file inserted) |
| 17:06 | all services fully resolved |

lessons learned

  • even powerful systems fail: small changes can trigger unexpected growth of data/config and blow past pre-allocated limits
  • monitoring matters: the bounce-recover-bounce behaviour (the file was regenerated every few minutes, sometimes good, sometimes bad) made root-causing tricky
  • dependency chains are risky: one module failing can drag down many others
  • high uptime is expensive: moving from 99.99% to 99.999% requires serious effort
  • start small, plan growth: find the sweet spot between just works now and can scale later
  • admit mistakes: cloudflare called this their worst outage since 2019

final note:

almost all huge failures come down to unexpected data growth, broken assumptions, or one module dragging others down with it. you also might wanna read this XD

adios!

© 2025 mohammed ufraan