1 English character ≈ 0.3 tokens. 1 Chinese character ≈ 0.6 tokens.
While reading the DeepSeek pricing model I came to the realization that China has an inherent advantage in AI thanks to its language. The same writing system that at one point hindered the development of production has now turned into the opposite (hehe dialectics): a more efficient alternative.
Think of the first printing presses: you needed to arrange pieces, each containing a single character, to form words. In Latin-alphabet languages this meant producing 26 upper case pieces + 26 lower case pieces; for Chinese you needed thousands lol. For the same reason, typewriters in Chinese were never mass produced and the ones that were produced are ridiculously complex (see the MingKwai typewriter that was recently found after being lost for decades).
Coming back to tokens: according to Google Gemini the average length of an English word is 4.7-5.1 characters, while in Chinese a single character is a word by itself. This means that an English word is on average ~1.5 tokens (about 5 characters × 0.3) while a Chinese word is 0.6 tokens, so Chinese is more than twice as efficient. This could be huge at scale and could prove to be the difference between needing 10 data centers and needing 5.
So if you want to reduce your DeepSeek bill, get ready to learn Chinese buddy.
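For anyone who wants to poke at the numbers, here is a rough sketch of that arithmetic in Python. The 0.3 and 0.6 ratios are the ones quoted at the top; the 4.9 average is just my assumed midpoint of the 4.7-5.1 range, and one character per Chinese word is the assumption made above, not a measured value.

```python
# Rough tokens-per-word estimate from the per-character ratios quoted above.
TOKENS_PER_EN_CHAR = 0.3
TOKENS_PER_ZH_CHAR = 0.6

EN_CHARS_PER_WORD = 4.9  # assumption: midpoint of the 4.7-5.1 range
ZH_CHARS_PER_WORD = 1.0  # assumption: one character = one word


def tokens_per_word(chars_per_word, tokens_per_char):
    """Average tokens per word, given average word length in characters."""
    return chars_per_word * tokens_per_char


en = tokens_per_word(EN_CHARS_PER_WORD, TOKENS_PER_EN_CHAR)
zh = tokens_per_word(ZH_CHARS_PER_WORD, TOKENS_PER_ZH_CHAR)

print(f"English: {en:.2f} tokens/word")
print(f"Chinese: {zh:.2f} tokens/word")
print(f"English/Chinese ratio: {en / zh:.2f}x")
```

Under these assumptions the ratio comes out a bit under 2.5x, which is where the "more than twice as efficient" figure comes from.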
the average length of an english word is 4.7-5.1
If you exclude function words like “a”, “the”, “of” and the various short prepositions, the average length of semantically meaningful words is for sure even higher.
while in chinese a single character is a word by itself
A single character can be a word by itself, and a lot of the most commonly used ones are, but I think most words tend to be around two characters long, with words expressing more complex or specialized concepts being three or four. Loan words from other languages also tend to be longer.
Of course the shorter words are used more frequently in text and speech than the longer ones. So you probably end up with maybe 1.6-1.7 characters per word in a normal text.
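A quick sketch of how much that changes the comparison, in Python. It reuses the 0.3 and 0.6 per-character ratios from the original post; the 4.9 characters per English word is an assumed midpoint of the 4.7-5.1 range, and the 1.6-1.7 figures are my guess from above, not measured data.

```python
# How the English/Chinese token ratio shifts with Chinese word length.
TOKENS_PER_EN_CHAR = 0.3
TOKENS_PER_ZH_CHAR = 0.6
EN_CHARS_PER_WORD = 4.9  # assumption: midpoint of the quoted 4.7-5.1 range


def en_to_zh_ratio(zh_chars_per_word):
    """Tokens per English word divided by tokens per Chinese word."""
    en_tokens = EN_CHARS_PER_WORD * TOKENS_PER_EN_CHAR
    zh_tokens = zh_chars_per_word * TOKENS_PER_ZH_CHAR
    return en_tokens / zh_tokens


for zh_chars in (1.0, 1.6, 1.7):
    print(f"{zh_chars} chars/word -> {en_to_zh_ratio(zh_chars):.2f}x")
```

With 1.6-1.7 characters per word the advantage shrinks from roughly 2.5x to somewhere around 1.4-1.5x, so still an edge, just a smaller one.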
I remember reading somewhere that Chinese is less monosyllabic than English actually, which makes sense because there is a lower number of possible distinct syllables in (spoken) Chinese than English, so you need at least two to avoid confusion.
It’s a topic that deserves a more in-depth study for sure. Some words can be on the longer end, like 爱沙尼亚 (Estonia), though they still tend to be shorter. Comparing book sizes in English and Chinese could give us a better approximation.
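If someone wants to try the book comparison, a minimal sketch could look like this. `estimate_tokens` and `is_cjk` are hypothetical helpers I made up for illustration; they only apply the 0.3 / 0.6 per-character ratios from the original post and are not a real tokenizer, so treat the output as a ballpark.

```python
# Hypothetical helper: estimate tokens from character counts using the
# per-character ratios quoted in the original post (not a real tokenizer).
TOKENS_PER_ZH_CHAR = 0.6
TOKENS_PER_OTHER_CHAR = 0.3  # assumption: all non-CJK characters count like English


def is_cjk(ch):
    # Checks the main CJK Unified Ideographs block only; extensions are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def estimate_tokens(text):
    """Estimate the token count of a text from its character counts."""
    zh = sum(1 for ch in text if is_cjk(ch))
    other = len(text) - zh
    return zh * TOKENS_PER_ZH_CHAR + other * TOKENS_PER_OTHER_CHAR


for sample in ("Estonia", "爱沙尼亚"):
    print(sample, round(estimate_tokens(sample), 2))
```

Run it over the same chapter in both languages and divide the two totals to get a rough ratio. Funnily enough, by this estimate 爱沙尼亚 costs slightly more than "Estonia", so long loan words can flip the advantage.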

