Yeah, mapping Chinese characters into linear UTF-8 byte space throws a lot of information away. Each language brings its own ideas to text processing. The creator of SentencePiece is Japanese, for example, and Japanese doesn't have explicit word delimiters.
It's not throwing any information away, because the original text can be faithfully reconstructed (via an admittedly arduous process). No entropy has been lost, if you consider the sum of both the "input bytes" and the "knowledge of UTF-8 encoding/decoding".
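A minimal sketch of that round trip in Python (the example string and names are just illustrative): each CJK character expands to three UTF-8 bytes, yet decoding recovers the original text exactly.

```python
text = "森林"  # arbitrary two-character example

# Each BMP CJK character expands to 3 bytes under UTF-8.
encoded = text.encode("utf-8")
print(list(encoded))   # [230, 163, 174, 230, 158, 151] -- 6 bytes for 2 characters

# The mapping is invertible, so no information is lost.
decoded = encoded.decode("utf-8")
assert decoded == text
```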
Yeah, that sounds quite interesting. I'm wondering whether the performance (i.e. quality) gap between text-only and vision OCR is bigger for Chinese than for English.
There is indeed a lot of semantic information contained in the characters that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), whereas an LLM that only sees the English words "tree" and "forest" would have a much harder time drawing that connection, regardless of whether it's fed them as text or vision tokens.
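A small illustration of what the byte-level view misses (just a sketch, reusing the characters above): the glyph for 林 literally contains two copies of 木, but nothing in the code points or UTF-8 bytes reflects that shared radical.

```python
for ch in ["木", "林"]:
    # Code point and UTF-8 bytes for each character.
    print(ch, hex(ord(ch)), ch.encode("utf-8").hex(" "))

# Output:
# 木 0x6728 e6 9c a8
# 林 0x6797 e6 9e 97
#
# Only the leading byte (the CJK block prefix) matches; the shared
# 木 radical that is obvious to the eye is invisible at the byte level,
# which is exactly the structure vision tokens can still pick up.
```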
I can read Kanji (Japanese), and sometimes I'll understand a sentence but can't pronounce it (Japanese Kanji reading rules are quite arbitrary). Your brain definitely handles information differently with Chinese characters.