Based on what you've shared, the second version can start reading instantly because "INFILE" was populated in the previous test. Did you clear it between tests?
On my machine, they're all equally fast at ~28 µs. Clearly the changes only had an impact on machines with a different configuration (kernel version, kernel config, xxd version, or hardware).
One hypothesis outlined above is that when you pipeline all 3 applications, the single-byte reader version does back-to-back syscalls, which causes contention between your code and xxd on a kernel mutex and makes things sleep for extra long.
It's not a strong hypothesis, though, given how little data there is and the fact that it doesn't repro on my machine. To get a real explanation, I think you have to actually profile on a machine that can repro and dig in until you have a satisfying explanation of what exactly is causing the problem.
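To make the "back-to-back syscalls" part of that hypothesis concrete, here is a minimal Python sketch that counts how many reads it takes to drain a pipe one byte at a time versus in larger chunks. The helper name `count_reads` and the 4096-byte payload are my own assumptions for illustration. It is not your code, and it only shows the difference in syscall count, not the contention itself.

```python
import os


def count_reads(data: bytes, chunk_size: int) -> int:
    """Put `data` into a pipe, drain it with os.read(fd, chunk_size),
    and return how many os.read calls that took (including the final
    call that returns b"" at EOF).

    Assumption: `data` is small enough to fit in the pipe buffer,
    so the single os.write below does not block.
    """
    read_fd, write_fd = os.pipe()
    try:
        os.write(write_fd, data)
    finally:
        # Close the write end so the reader eventually sees EOF.
        os.close(write_fd)

    calls = 0
    try:
        while True:
            chunk = os.read(read_fd, chunk_size)
            calls += 1
            if not chunk:  # empty bytes means EOF
                break
    finally:
        os.close(read_fd)
    return calls


if __name__ == "__main__":
    payload = b"x" * 4096  # hypothetical payload size
    print("1-byte reads:   ", count_reads(payload, 1))
    print("4096-byte reads:", count_reads(payload, 4096))
```

Each `os.read` here goes to the kernel, so the one-byte version enters the pipe code thousands of times for the same data. Whether that actually contends with xxd on the other side is exactly what profiling on a machine that repros would have to show.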
Here are the benchmarks before and after fixing the benchmarking code:
Before: https://output.circle-artifacts.com/output/job/2f6666c1-1165...
After: https://output.circle-artifacts.com/output/job/457cd247-dd7c...
What would explain the drastic performance increase if the pipelining behavior is irrelevant?