Based on what you've shared, the second version can start reading instantly beca...

vlovich123 · on March 20, 2024

That was just a typo in the comment. The command run locally was just a strait pipe.

Using both invocation variants, I ran:

8a5ecac63e44999e14cdf16d5ed689d5770c101f (before buffered changes)

78188ecbc66af6e5889d14067d4a824081b4f0ad (after buffered changes)

On my machine, they're all equally fast at ~28 us. Clearly the changes only had an impact on machines with a different configuration (kernel version or kernel config or xxd version or hw).

One hypothesis outlined above is that the when you pipeline all 3 applications, the single byte reader version is doing back-to-back syscalls and that's causing contention between your code and xxd on a kernel mutex leading to things going to sleep extra long.

It's not a strong hypothesis though just because of how little data there is and the fact that it doesn't repro on my machine. To get a real explanation, I think you have to actually do some profile measurements on a machine that can repro and dig in to obtain a satisfiable explanation of what exactly is causing the problem.