Serial dependencies are a catastrophe already starting to happen

Wednesday, November 19, 2025

Leveraging existing tools sure makes the development of new products faster and easier, but it comes with some serious risk propagation problems that can quickly add up to major headaches.

cats.com is down because of cloudflare failure

Cloudflare went down today and took an awful lot of the internet with it, including ChatGPT, Twitter, and Claude, which threw an incorrect error message because the developers assumed, absurdly, that Cloudflare couldn’t be the cause of the failure.

Claude murdered by Cloudflare outage

This sort of thing happens more and more: not just individual sites but entire ecosystems are knocked offline by a single failure, leading one to consider the growing fragility of an interconnected world. This is a consequence of dependency chains: libraries, objects, modules, gems, packages, etc., and their real-world equivalents, services and supply chains. It is more convenient to reuse someone else’s work than to write the code yourself. Want to round a number in your Python script? Why use the round() function built into Python when you can pull in NumPy to do it? It’s only 19.6 MB, compressed, and has had 100s of releases since 1995; what could possibly go wrong?

Take FreeBSD, for example. It has an excellent ports system to manage software, and each port lists its dependencies, so it is fairly straightforward to write some code to produce a dotgraph of those dependencies so one can clearly visualize exactly what depends on what, and what the consequences of a port failing are:

FreeBSD Ports Dotgraph

See, it’s obvious! No worries at all, why are people fussing about this stuff? Who needs static compiles, AppImage, Snap, or Flatpak?

Someday ImageMagick will finally break for good and we'll have a long period of scrambling as we try to reassemble civilization from the rubble.

XKCD dependency_2x

The math behind it is simple: All dependencies, even branching dependencies, are effectively serial risks. Some are non-catastrophic: if a web page trades your visit information, IP address and browser fingerprint to Google or Adobe in exchange for “free” access to some pretty fonts, and their server goes down, the page will still render, albeit with improved privacy and typically reduced aesthetic purity. But when a site relies on a web service, such as Cloudflare, which may itself rely on other services, the serial risk grows acute.

The basic problem is that serial risks multiply. The probability of success of a serial risk function is the product of the individual success probabilities, and even if every step is likely to be successful, the net probability of success of a long chain becomes quite low. If the total probability of success is Pₜ (between 0 and 1) and each individual step i of n has a probability of success Pᵢ, then Pₜ = P₁ × P₂ × P₃ × … × Pₙ. If all steps have the same probability of success P, then Pₜ = Pⁿ. Even highly reliable steps, say a success probability of 95%, are risky in long dependency chains. With 4 such steps the probability of success is down to 81.5%; 10 steps: 59.9%; 100 steps: 0.6%.
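The arithmetic above is easy to check. The following is a minimal sketch in Python; the function names are illustrative and not taken from any tool mentioned in this post.

```python
import math

def chain_success(probabilities):
    """Probability that every step in a serial chain succeeds."""
    return math.prod(probabilities)

def uniform_chain_success(p, n):
    """Same thing when all n steps share one success probability p."""
    return p ** n

# Reproduce the figures quoted in the text for 95% reliable steps.
for n in (4, 10, 100):
    print(f"{n:>3} steps at 95%: {uniform_chain_success(0.95, n):.1%}")
```

Steps with different reliabilities go through `chain_success`, for example `chain_success([0.99, 0.95, 0.999])`.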

Now it may seem a little pessimistic to consider 100 serial steps, but given the port dependency tree for FreeBSD (above), 100 serial steps is probably optimistic, even if a 95% build/execution probability is pessimistic. Yet even at 99.9% uptime per step, total system uptime across 100 steps is only 90.5%. “Five nines” uptime, a typical corp BS claim of 99.999% uptime (or 5.26 minutes of downtime per year), becomes a “three nines” system with 100 dependencies.
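The "nines" conversions can be sketched the same way. This assumes a 365-day year and independent elements; the helper names are made up for illustration.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year

def nines(availability):
    """Express an availability fraction as a (fractional) count of nines."""
    return -math.log10(1.0 - availability)

def downtime_minutes_per_year(availability):
    """Expected downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

def series_availability(availability, n):
    """Availability of n independent, identical elements in series."""
    return availability ** n

print(f"{downtime_minutes_per_year(0.99999):.2f} min/year at five nines")
print(f"{series_availability(0.999, 100):.1%} for 100 elements at 99.9%")
print(f"{nines(series_availability(0.99999, 100)):.2f} nines for 100 five-nines elements")
```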

And 100 dependencies is pretty optimistic: a request map for the New York Times shows a “modern” web infrastructure’s fragility.

Request Map For the New York Times

Or to take a supply chain example:

Global supply chain example on a world map.

A typical supply chain from raw materials to consumer

That is, 100-element serial complexity isn’t surprising. Even assuming every element in that chain is held to a five nines level of availability, and even if the mean time to repair (MTTR) is only 2 hours, the mean time to failure (MTTF) starts to get short. I wrote a little Monte Carlo analysis tool to compute the odds. The failure distribution for a single “five nines” device (one device in series), sampled over a million such devices, spans years.

That’s about what you’d expect: MTTF 22.83 years. But if you have a complex system, like everything is getting to be, with 100 such elements in series, we go from years to days.

The system MTTF is only 83 days and the effective system availability is only three nines (which is how that math works: 1,000 devices in series would be two nines).
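These MTTF figures also follow from a closed-form estimate, without any simulation. The sketch below assumes the usual steady-state relation availability = MTTF / (MTTF + MTTR), constant failure rates, and independent elements, so that failure rates in series simply add. The function names are illustrative only.

```python
HOURS_PER_YEAR = 365 * 24  # assumes a 365-day year

def element_mttf_hours(availability, mttr_hours):
    """Solve availability = MTTF / (MTTF + MTTR) for MTTF."""
    return mttr_hours * availability / (1.0 - availability)

def series_mttf_hours(mttf_hours, n):
    """Approximate MTTF of n identical elements in series.

    Assumes constant, independent failure rates, which add in series.
    """
    return mttf_hours / n

mttf = element_mttf_hours(0.99999, 2.0)
print(f"single element: {mttf / HOURS_PER_YEAR:.2f} years")
print(f"100 in series:  {series_mttf_hours(mttf, 100) / 24:.1f} days")
```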

For example, I configured the tool for “three nines,” which is low but visually appealing; a 15h MTTR, which was the AWS outage time; 250 serial elements, which comes out fairly square in the graphic; and a 1 year simulation period.

=== RESULTS ===

Simulation Configuration:
  Simulation Time: 31536000 seconds
  Run Iterations: 1000

Element Configuration:
  Availability: 3 nines (99.900000%)
  MTTR: 15.00h
  MTTF: 1.71y

System Configuration:
  Number of elements: 250
  System Availability: 0.657 nines (77.964102%)
  System MTTR: 16.95h
  System MTTF: 2.50d
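A simulation of this kind can be sketched in a few dozen lines of Python. This is not the tool that produced the results above; it is a minimal illustration with several assumptions: exponentially distributed times to failure and to repair, independent elements that all start live, and a system that is down whenever any element is down. All names and defaults are hypothetical, and the exact numbers will vary with the seed and the iteration count.

```python
import random

def simulate_run(n_elements, mttf_h, mttr_h, sim_h, rng):
    """One run: return (system down hours, number of system outages)."""
    down = []  # (start, end) down intervals collected from every element
    for _ in range(n_elements):
        t = rng.expovariate(1.0 / mttf_h)  # time of the first failure
        while t < sim_h:
            end = min(t + rng.expovariate(1.0 / mttr_h), sim_h)
            down.append((t, end))
            t = end + rng.expovariate(1.0 / mttf_h)  # next failure
    # Merge overlapping intervals: the system is down if any element is.
    down.sort()
    down_hours, outages = 0.0, 0
    cur_start = cur_end = None
    for start, end in down:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                down_hours += cur_end - cur_start
            cur_start, cur_end = start, end
            outages += 1
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        down_hours += cur_end - cur_start
    return down_hours, outages

def run(n_elements=250, availability=0.999, mttr_h=15.0,
        sim_h=8760.0, iterations=200, seed=1):
    """Return (system availability, system MTTF hours, system MTTR hours)."""
    rng = random.Random(seed)
    mttf_h = mttr_h * availability / (1.0 - availability)
    total_down, total_outages = 0.0, 0
    for _ in range(iterations):
        d, o = simulate_run(n_elements, mttf_h, mttr_h, sim_h, rng)
        total_down += d
        total_outages += o
    total_time = sim_h * iterations
    if total_outages == 0:
        return 1.0, float("inf"), 0.0
    return (1.0 - total_down / total_time,
            (total_time - total_down) / total_outages,
            total_down / total_outages)

avail, sys_mttf_h, sys_mttr_h = run()
print(f"System availability: {avail:.3%}")
print(f"System MTTF: {sys_mttf_h / 24:.2f}d")
print(f"System MTTR: {sys_mttr_h:.2f}h")
```

Under these assumptions the availability should land near 0.999²⁵⁰, about 78%, and the system MTTR comes out longer than the element MTTR because overlapping element outages merge into a single longer system outage.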

A typical state representation is below. The state is overwhelmingly “green” (live) because the MTTF is 1.71 years while the sim period is only 1 year: most elements do not fail. But as the total system state, the fat bottom row, depends on all links in the chain being live, the system uptime demonstrates the non-intuitive exponential behavior that catches people off guard, whether in bacterial growth, neutron capture, or system reliability.

This is why, despite highly reliable systems and everyone’s marketing assurances that they’re all executing at five nines, we’re beset by system failures. Coders need to really internalize that:

Every dependency is a humiliation.

(The only place that isn’t true is crypto code. If you wrote your own, you did a bad thing and you should feel bad. Crypto code is an exception because security confidence can only be achieved by accumulated unsuccessful bashing; no amount of unit testing can substitute for global access to the source and global bragging rights for breaking the algo.)

So let’s all work toward system independence and parallelism, stop trading PII for code convenience, and think twice about linking to a complex image processing library just to scale an image.

Author

David Gessel
