I think the model I described is more precise than madvise. madvise would usually be called on large ranges of pages, which is why it has `MADV_RANDOM`, `MADV_SEQUENTIAL`, etc.: you're not specifying which memory/pages are about to be accessed, but the likely access pattern.
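For concreteness, a minimal sketch of that interface (the file name is invented, error handling omitted): the hint covers a whole mapped range and describes a pattern, not specific pages to fetch.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
    struct stat st;
    fstat(fd, &st);

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

    /* One call covering the whole mapping: tells the kernel the likely
       access pattern (so it can ramp readahead up or disable it), not
       which pages will be touched next. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    /* ... walk the mapping from start to finish ... */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```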

If you're just using mmap to read a file from start to finish, then the `hint_read` mechanism is indeed pointless, since multiple `hint_read` calls would do the same thing as a single `madvise(..., MADV_SEQUENTIAL)` call.

The point of `hint_read`, and indeed of io_uring or `readv`, is that the program knows exactly which parts of the file it wants to read first, so it would be best if those were read concurrently, preferably using a single system call or page fault (i.e., one switch to kernel space).
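As a hedged sketch of "many reads, one kernel entry" using liburing (the file name, buffer sizes, and offsets are all invented): several reads at different offsets are queued in userspace and then submitted with a single system call.

```c
#include <fcntl.h>
#include <liburing.h>
#include <unistd.h>

#define NREADS 3

int main(void) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open("data.bin", O_RDONLY);
    static char bufs[NREADS][4096];
    off_t offs[NREADS] = { 0, 1 << 20, 8 << 20 };  /* arbitrary spots */

    /* Queue all three reads in userspace... */
    for (int i = 0; i < NREADS; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], sizeof bufs[i], offs[i]);
    }
    /* ...then enter the kernel once; the reads proceed concurrently. */
    io_uring_submit(&ring);

    for (int i = 0; i < NREADS; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```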

I would expect the `hint_read` function to push to a queue in thread-local storage, so it shouldn't need a switch to kernel space. User/kernel space switches are slow: on the order of a couple of tens of millions per second at best. This is why the vDSO exists, and why libc buffers writes behind `fwrite`/`println`/etc.: function calls within userspace can happen at rates of billions per second.
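A minimal sketch of what I mean; everything here is invented (no such API exists). Hints accumulate in a thread-local ring, and only a full ring pays for a kernel transition:

```c
#include <stddef.h>

struct read_hint { size_t offset; size_t length; };

#define RING_SIZE 256
static _Thread_local struct read_hint ring[RING_SIZE];
static _Thread_local unsigned head;

/* Stand-in for the (hypothetical) kernel interface that would consume
   the queued hints; a real design might instead have the kernel poll
   the ring directly, like io_uring's submission queue. */
static void hint_flush(struct read_hint *hints, unsigned count) {
    (void)hints;
    (void)count;
}

/* Pure userspace on the fast path: a store and an increment. */
static inline void hint_read(size_t offset, size_t length) {
    ring[head++] = (struct read_hint){ offset, length };
    if (head == RING_SIZE) {   /* only a full ring costs a syscall */
        hint_flush(ring, head);
        head = 0;
    }
}
```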



You can do fine-grained madvise via io_uring, which indeed uses a queue. But at that point, why use mmap at all? Just do async reads via io_uring.
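A sketch of that, assuming liburing and a kernel recent enough to support `IORING_OP_MADVISE`; the mapping, offsets, and helper name are invented. Several page-granular `MADV_WILLNEED` hints are batched behind one submit:

```c
#include <liburing.h>
#include <sys/mman.h>

/* Batch page-granular hints for several regions of an existing
   mapping `base`; one io_uring_submit covers all of them. */
void hint_regions(struct io_uring *ring, char *base,
                  const size_t *offs, const size_t *lens, int n) {
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_madvise(sqe, base + offs[i], lens[i],
                              MADV_WILLNEED);
    }
    io_uring_submit(ring);   /* one kernel entry for all the hints */
}
```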


The entire point I was trying to make at the beginning of the thread is that mmap gives you memory pages in the page cache that the OS can drop under memory pressure. io_uring is close on the performance and fine-grained access pattern fronts, but it's not as good on system-wide cooperative behavior with memory, and it has a higher cost: either you're still copying from the page cache into a user buffer (a non-trivial performance impact versus the read itself) and thrashing your CPU caches, or you're doing direct I/O and implementing a page cache manually (which risks duplicating page data inefficiently in userspace if the same file is accessed by multiple processes).


Right, so zero-copy I/O, but still with the ability to share the page cache across processes and to let the kernel drop caches under high memory pressure. One issue is that, under pressure, a process might not actually manage to read a page and would keep retrying and failing (with an LRU replacement policy this is unlikely and probably self-limiting, but still...).


To take advantage of zero-copy I/O, which I believe has become much more important since the shift from spinning rust to Flash, I think applications often need to adopt a file format that's amenable to zero-copy access. Examples include Arrow (but not compressed Feather), HDF5, FlatBuffers, Avro, and SBE. A lot of file formats developed during the spinning-rust eon require full parsing before the data in them can be used, which is fine for a 1KB file but suboptimal for a 1GB file.
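A hedged illustration of the difference (the record layout and file name are invented; real formats like Arrow or SBE define their own fixed layouts): with a zero-copy-friendly layout you can read a field straight out of the mapping, faulting in only the pages you touch, with no parse or copy step.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct record {        /* hypothetical fixed-width row */
    uint64_t id;
    double   value;
};

int main(void) {
    int fd = open("table.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    const struct record *rows =
        mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    size_t n = st.st_size / sizeof(struct record);

    /* Touching one row faults in only its page; nothing is
       deserialized or copied into a separate user buffer. */
    if (n > 1000)
        printf("%llu -> %f\n",
               (unsigned long long)rows[1000].id, rows[1000].value);

    munmap((void *)rows, st.st_size);
    close(fd);
    return 0;
}
```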
