Yeah, mapping Chinese characters into linear UTF-8 byte space throws a lot of information away. Each language brings its own ideas to text processing. The creator of SentencePiece is Japanese, for example, and Japanese doesn't have explicit word delimiters.
It's not throwing any information away, because the original text can be faithfully reconstructed (via an admittedly arduous process). No entropy has been lost, if you consider the sum of both the "input bytes" and the "knowledge of UTF-8 encoding/decoding".
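A minimal sketch of that round trip in Python (the example string and names are just illustrative): each CJK character expands to three UTF-8 bytes, yet decoding recovers the original text exactly.

```python
text = "森林"  # arbitrary two-character example

# Each BMP CJK character expands to 3 bytes under UTF-8.
encoded = text.encode("utf-8")
print(list(encoded))   # [230, 163, 174, 230, 158, 151] -- 6 bytes for 2 characters

# The mapping is invertible, so no information is lost.
decoded = encoded.decode("utf-8")
assert decoded == text
```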
Yeah, that sounds quite interesting. I'm wondering whether the performance (i.e. quality) gap between text-only and vision OCR is bigger for Chinese than for English.
There is indeed a lot of semantic information contained in the characters that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), whereas an LLM that only sees the English words "tree" and "forest" would have a much harder time drawing that connection, regardless of whether it's fed them as text or vision tokens.
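A small illustration of what the byte-level view misses (just a sketch, reusing the characters above): the glyph for 林 literally contains two copies of 木, but nothing in the code points or UTF-8 bytes reflects that shared radical.

```python
for ch in ["木", "林"]:
    # Code point and UTF-8 bytes for each character.
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex(" "))

# Output:
# 木 0x6728 e6 9c a8
# 林 0x6797 e6 9e 97
#
# Only the leading byte (the CJK block prefix) matches; the shared
# 木 radical that is obvious to the eye is invisible at the byte level,
# which is exactly the structure vision tokens can still pick up.
```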
I can read Kanji (Japanese), and sometimes I'll understand a sentence but can't pronounce it (Japanese Kanji reading rules are quite arbitrary). Your brain definitely handles information differently with Chinese characters.