
Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?


Yeah, mapping Chinese characters onto a linear UTF-8 byte sequence throws a lot of information away. Each language brings its own ideas to text processing: the inventor of SentencePiece is Japanese, for example, and Japanese has no explicit word delimiters.
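
A tiny Python sketch (my own illustration, not from the thread) of both points: a naive whitespace split works for English but not for Japanese or Chinese, which is part of why SentencePiece does unsupervised segmentation, and each character expands into multiple opaque UTF-8 bytes.

    # Whitespace tokenization: fine for English, useless for Japanese/Chinese,
    # which have no explicit word delimiters (hence tools like SentencePiece).
    english = "the forest is quiet"
    japanese = "森は静かだ"  # rough translation of the same sentence

    print(english.split())   # ['the', 'forest', 'is', 'quiet']
    print(japanese.split())  # ['森は静かだ'] -- one undivided blob

    # At the byte level, each kanji also expands into three UTF-8 bytes:
    print(list("森".encode("utf-8")))  # [230, 163, 174]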


It's not throwing any information away, because the original can be faithfully reconstructed (via an admittedly arduous process); no entropy is lost if you count both the input bytes and the knowledge of UTF-8 encoding/decoding.
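
A quick Python check of that claim (example string chosen by me for illustration): the byte form is longer, but decoding the encoded bytes gives back exactly the original string, so nothing is lost as long as you also carry the knowledge of the encoding rules.

    # UTF-8 is lossless: decoding the encoded bytes reconstructs the original.
    original = "木林森"                 # arbitrary example string
    encoded = original.encode("utf-8")  # the "linear byte space" view
    decoded = encoded.decode("utf-8")   # faithful reconstruction

    assert decoded == original
    print(len(original), "characters ->", len(encoded), "bytes")  # 3 characters -> 9 bytes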


Yeah, that sounds quite interesting. I'm wondering whether the performance (= quality) gap between text-only and vision-based OCR is bigger for Chinese than for English.

There is indeed a lot of semantic information contained in the characters that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that only has the words "tree" and "forest" would have a much harder time seeing that connection, whether it's fed them as text or as vision tokens.
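
To make that concrete, here's a small Python illustration (mine, not the commenter's): the code points and bytes of 木 and 林 show no trace of the shared 木 component, even though it's obvious in the rendered glyphs.

    # Text-level representation of visually related characters.
    for ch in ["木", "林"]:
        print(ch, f"U+{ord(ch):04X}", list(ch.encode("utf-8")))

    # 木 U+6728 [230, 156, 168]
    # 林 U+6797 [230, 158, 151]
    # Nothing in the code points or bytes hints that 林 is two 木 side by side;
    # a vision model seeing rendered pixels could, in principle, pick that up.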


Chinese text == Method of loci

Many Chinese students have a good enough memory to recall a particular paragraph and understand its meaning, but have no idea how the words are pronounced.


I can read Kanji (Japanese), and sometimes I understand a sentence but can't pronounce it (Japanese Kanji readings are quite arbitrary). Your brain definitely handles information differently with Chinese characters.


and if you master the skill, it will speed up your reading dramatically.

Ideograms can let you attach meanings to the glyphs directly, skipping the single-threaded "vocal serialization" step.



