If I were trying to optimize my code, I would start by loading the entire benchmark input into memory and only then starting the counter. Otherwise I can't tell how much time is spent reading from a pipe/file into memory versus how much is spent in my code.
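A minimal sketch of that idea in Python, assuming a hypothetical `process` function stands in for the code under test: all I/O happens before the timer starts, so only the in-memory work is measured.

```python
import time

def benchmark(path, process):
    """Time `process` on the contents of `path`, excluding I/O.

    `process` is a placeholder for the code being optimized.
    """
    with open(path, "rb") as f:
        data = f.read()          # all reading happens here, before timing
    start = time.perf_counter()  # counter starts only after the load
    result = process(data)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

The same separation applies with any timer; the key point is that the read and the measurement never overlap.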
Then I would benchmark what happens when the data fits in the L1 cache, in L2, in L3, and when it only fits in main memory.
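One way to sketch that sweep: run the same workload over buffers sized to fit in each cache level. The sizes below are assumptions based on typical desktop CPUs, not measured values; check your machine's actual cache sizes (e.g. with `lscpu`) and size each buffer somewhat below the level it targets.

```python
import time

# Assumed cache sizes; adjust to your CPU. Each buffer is sized
# to fit comfortably within the level it is meant to exercise.
SIZES = {
    "L1 (~32 KiB)": 16 * 1024,
    "L2 (~256 KiB)": 128 * 1024,
    "L3 (~8 MiB)": 4 * 1024 * 1024,
    "RAM": 64 * 1024 * 1024,
}

def sweep(process, repeats=10):
    """Average the runtime of `process` over buffers at each size."""
    results = {}
    for label, size in SIZES.items():
        data = bytes(size)  # buffer sized for the target cache level
        start = time.perf_counter()
        for _ in range(repeats):
            process(data)
        results[label] = (time.perf_counter() - start) / repeats
    return results
```

A real harness would also warm the cache before timing and pin the process to one core, but this shows the shape of the experiment.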
Of course, if the common use case is reading from a file, network, or pipe, maybe you can optimize that, but I would take it step by step.