My understanding, paraphrased: "In order to gradually roll out one change, we ha...

jsnell · 2025-12-05T15:53:12 1764949992

That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)
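To make the "fail closed vs. optional sidecar" point concrete, here is a minimal sketch of the fail-open alternative. Everything in it (`Decision`, `decide`, the score and threshold) is hypothetical and made up for illustration; it is not Cloudflare's code, just one way an upstream service could treat a missing sidecar result as "no opinion" rather than as a fatal error.

```rust
/// Hypothetical verdict for a request; not taken from any real codebase.
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Block,
}

/// Fail-open handling of an optional sidecar (an assumed design, for illustration).
/// `sidecar_score` is `None` when the sidecar crashed, timed out, or returned nothing.
fn decide(sidecar_score: Option<u8>, block_threshold: u8) -> Decision {
    match sidecar_score {
        // The sidecar answered and flagged the request.
        Some(score) if score >= block_threshold => Decision::Block,
        // The sidecar answered and the request looks fine.
        Some(_) => Decision::Allow,
        // No answer: the sidecar is optional, so keep serving traffic.
        None => Decision::Allow,
    }
}
```

Whether failing open is acceptable depends on what the sidecar protects; for a security control you may well prefer to fail closed, but then the sidecar is not really optional.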

littlestymaar · 2025-12-05T16:04:33 1764950673

In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

It required a significant organizational failure to happen. These happen, but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is).

greatgib · 2025-12-05T16:13:25 1764951205

The issue also would not have happened if someone had written the right code and tests, and the review or CI had caught it...

marcosdumay · 2025-12-05T19:05:12 1764961512

Expecting somebody to write the correct program every time is different from expecting somebody not to call the "break_my_system" procedure that has warnings all over it telling people it's there for quick learning-to-use examples or other things you'll never run.

Hamuko · 2025-12-05T18:47:51 1764960471

Yeah, my first thought was that had they used Rust, maybe we would've seen them point out a rule_result.unwrap() as the issue.

pdimitar · 2025-12-05T17:42:52 1764956572

To be precise, the previous problem with Rust was because somebody copped out and used a temporary escape hatch function that absolutely has no place in production code.

It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.

"Damned if they do, damned if they don't" kind of situation.

There are even lints for the usage of the `unwrap` and `expect` functions.

As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.
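For reference, the lints mentioned above are Clippy's `clippy::unwrap_used` and `clippy::expect_used`. They are allow-by-default, so you have to opt in. A minimal sketch, where the module and function names are hypothetical:

```rust
// Opt in to the lints for one module. Under `cargo clippy`, any
// `.unwrap()` or `.expect()` call inside it becomes a hard error.
#[deny(clippy::unwrap_used, clippy::expect_used)]
mod request_path {
    /// Hypothetical parser: the caller has to deal with the error,
    /// because unwrapping it here would be rejected by Clippy.
    pub fn parse_limit(raw: &str) -> Result<u32, std::num::ParseIntError> {
        raw.trim().parse::<u32>()
    }
}
```

Plain `rustc` accepts the attribute but does not enforce it; the check only runs under `cargo clippy`, so it needs to be part of CI to catch anything.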

zozbot234 · 2025-12-05T19:55:04 1764964504

You can make an argument that .unwrap() should have no place in production code, but .expect("invariant violated: etc. etc.") very much has its place. When the system is in an unpredicted and not-designed-for state it is supposed to shut down promptly, because this makes it easier to troubleshoot the root cause failure whereas not doing so may have even worse consequences.
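A small sketch of what I mean, with a made-up invariant (the function name and the "validated at startup" assumption are hypothetical):

```rust
/// Hypothetical lookup. Assumes config validation at startup already
/// guarantees that the backend list is non-empty.
fn primary_backend(backends: &[String]) -> &str {
    backends
        .first()
        .map(String::as_str)
        // If this fires, the program is in a state nobody designed for:
        // stop right here, with a message that names the broken invariant.
        .expect("invariant violated: backend list must be non-empty after config validation")
}
```

The message is the difference from a bare `.unwrap()`: whoever reads the crash log sees which assumption broke, not just a line number.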

pdimitar · 2025-12-05T20:31:42 1764966702

I don't disagree, but you might as well manually send an error to, e.g., Sentry and just halt processing of the request.

Though that really depends. In companies where k8s is used the app will be brought back up immediately anyway.
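Roughly what I have in mind, as a sketch. `report_error` is a hypothetical stand-in for whatever error tracker you use (it is not the Sentry API), and `Outcome` and `handle_request` are made-up names:

```rust
/// Hypothetical result of handling one request.
#[derive(Debug, PartialEq)]
enum Outcome {
    Served(u32),
    Rejected,
}

/// Hypothetical reporter: a real service would forward this to its
/// error tracker; here it just appends to an in-memory sink.
fn report_error(sink: &mut Vec<String>, message: &str) {
    sink.push(message.to_string());
}

/// Report the missing result and reject only this request,
/// instead of panicking and taking the whole process down.
fn handle_request(rule_result: Option<u32>, sink: &mut Vec<String>) -> Outcome {
    match rule_result {
        Some(score) => Outcome::Served(score),
        None => {
            report_error(sink, "rule result missing; rejecting this request only");
            Outcome::Rejected
        }
    }
}
```

The trade-off against `.expect()` is that the process keeps running in a state it was not designed for, so this only makes sense when the bad state is confined to the single request.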

debugnik · 2025-12-05T15:51:12 1764949872

Prevented unless they assert the wrong invariant at runtime like they did last time.

skywhopper · 2025-12-05T15:56:56 1764950216

This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.

inejge · 2025-12-05T19:09:35 1764961775

It's not remotely the same type of error -- error non-handling is very visible in the Rust code, while the Lua code shows the happy path, with no indication that it could explode at runtime.

Perhaps what's similar is the failure to test the possible error path, which is an organizational problem.