
None of those APIs are cheap enough to call in a fast path.


No syscall will be cheap enough for a fast path. You would need a hardware instruction that tells you whether a load or store would fault.


Rather than a direct syscall, you could imagine something like rseq, where there is a shared userspace/kernel data structure and the userspace code gets aborted and restarted if the page was evicted while being processed. But making this work correctly, without a perf overhead, and with an ergonomic API is super hard. In practice, people who care are probably satisfied by direct I/O via io_uring with a custom page cache, and a truly optimal implementation where the OS can still manage and evict file pages but the application still knows when that happened isn't worth it.


Unfortunately, a lot of the shared state with userland became much more difficult to implement securely once the Meltdown and Spectre (and other) exploits became concerns that had to be mitigated. They make the OS's job a heck of a lot harder.

Sometimes I feel modern technology is basically a delicately balanced house of cards that falls over when breathed upon or looked at incorrectly.


> You would need a hardware instruction that tells you whether a load or store would fault.

You have MADV_FREE pages/ranges. They get cleared when purged, so reading zeros tells you that the load would have faulted and needs to be populated from storage.


MADV_FREE is insufficient: userspace doesn't get a signal from the OS when there's system-wide memory pressure, and having userspace try to respond to such a signal would be counterproductive and slow in a kernel operation that needs to be a fast path. It's more that you want to madvise a memory range (as page cache) and then have a shared data structure that tells you whether it's still resident and lets you lock it against being paged out.


MADV_FREE is also extremely expensive. CPU vendors have finally simplified TLB shootdown in recent CPUs with both AMD and Intel now having instructions to broadcast TLB flushes in hardware, which gets rid of one of the worst sources of performance degradation in threaded multicore applications (oh the pain of IPIs mixed with TLB flushing!). However, it's still very expensive to walk page tables and free pages.

Hardware reference counting of memory allocations would be very interesting. It would be shockingly simple to implement compared to many other features hardware already has to tackle.


> MADV_FREE is also extremely expensive.

It's quite expensive to free pages under memory pressure (though it's not clear that there's any other choice to be made), but if the pages are never freed it should be cheap, AIUI.


> userspace doesn’t get a signal from the OS to know when there’s system wide memory pressure

Memory pressure indicators exist, https://docs.kernel.org/accounting/psi.html

> have some way to have a shared data structure where you are told if it’s still resident and can lock it from being paged out.

What's more efficient than fetching data and comparing it with zero? Any write within the range cancels the MADV_FREE property on the written-to page, thus "locking" it again, and this is also very efficient.



