1 English character ≈ 0.3 tokens. 1 Chinese character ≈ 0.6 tokens.

While reading the DeepSeek pricing model, I came to the realization that China has an inherent advantage in AI thanks to its own language. The same language that at one point hindered the development of print production has now turned into the opposite (hehe, dialectics): a more efficient alternative.

Think of the first printing machines: you needed to arrange pieces, each carrying a single character, to form words. For Latin-alphabet languages this meant producing 26 uppercase pieces plus 26 lowercase pieces; for Chinese you needed thousands lol. For the same reason, Chinese typewriters were never mass-produced, and the ones that were built are ridiculously complex (see the MingKwai typewriter, recently found after being lost for decades).

Coming back to tokens: according to Google Gemini, the average English word is 4.7-5.1 characters long, while in Chinese a single character is a word by itself. This means an English word averages ~1.5 tokens while a Chinese word is ~0.6 tokens, more than twice as efficient. At scale this could be huge, possibly the difference between needing 10 data centers instead of 5.
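
A quick back-of-envelope sketch in Python of that arithmetic, using the 0.3 / 0.6 tokens-per-character rules of thumb above; the sample sentences and the per-character estimate are just illustrations, not real tokenizer output:

```python
# Rough token estimates from the per-character rules of thumb:
# ~0.3 tokens per English character, ~0.6 tokens per Chinese character.
EN_TOKENS_PER_CHAR = 0.3
ZH_TOKENS_PER_CHAR = 0.6

def estimate_tokens(text: str, tokens_per_char: float) -> float:
    """Character count times the per-character ratio (very rough)."""
    return len(text) * tokens_per_char

english = "The quick brown fox jumps over the lazy dog."
chinese = "敏捷的棕色狐狸跳过懒狗。"  # roughly the same sentence in Chinese

print(f"English: {len(english)} chars ≈ {estimate_tokens(english, EN_TOKENS_PER_CHAR):.1f} tokens")
print(f"Chinese: {len(chinese)} chars ≈ {estimate_tokens(chinese, ZH_TOKENS_PER_CHAR):.1f} tokens")
```

With these ratios the Chinese sentence comes out at roughly half the token count of the English one, which is the scale argument in a nutshell.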

So if you want to reduce your DeepSeek bill, get ready to learn Chinese, buddy.

  • 小莱卡OP

    It’s a topic that deserves a more in-depth study for sure. Some words can be on the longer end, like 爱沙尼亚 (Estonia), though they still tend to be shorter. Comparing the sizes of the same book in English and Chinese could give us a better approximation.
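
    A minimal sketch of that comparison, assuming you have the same book as plain text in both languages (book_en.txt / book_zh.txt are hypothetical file names) and reusing the per-character ratios from the post:

    ```python
    # Estimate tokens for parallel editions of the same book using the
    # rule-of-thumb ratios (~0.3 tokens/char English, ~0.6 tokens/char Chinese).
    # The file paths are hypothetical; swap in your own plain-text editions.
    EN_TOKENS_PER_CHAR = 0.3
    ZH_TOKENS_PER_CHAR = 0.6

    def estimated_tokens_from_file(path: str, tokens_per_char: float) -> float:
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # Ignore whitespace so formatting differences don't skew the count.
        chars = sum(1 for c in text if not c.isspace())
        return chars * tokens_per_char

    en = estimated_tokens_from_file("book_en.txt", EN_TOKENS_PER_CHAR)
    zh = estimated_tokens_from_file("book_zh.txt", ZH_TOKENS_PER_CHAR)
    print(f"English edition ≈ {en:,.0f} tokens, Chinese edition ≈ {zh:,.0f} tokens")
    print(f"Chinese/English token ratio ≈ {zh / en:.2f}")
    ```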