How NOT to convert Simplified and Traditional Chinese

tl;dr Do not use a random website or tool you find on the internet.

Languages and Scripts

Chinese can be written in two different scripts, Simplified Chinese and Traditional Chinese. I often get asked about them, so I would like to put together what you need to know when handling these scripts.

First off, these are scripts, not languages; the two are somewhat separate concepts. People often get confused because websites list both “Chinese (Simplified)” and “Chinese (Traditional)” in their language menus, but you can write, for example, Mandarin Chinese in either script. Mandarin Chinese, which is the official language of both mainland China and Taiwan, is usually written in Simplified Chinese in mainland China (and also in Singapore and Malaysia) and in Traditional Chinese in Taiwan.

N-to-N Mapping

Another pitfall regarding Simplified and Traditional Chinese is that there is often no simple 1-to-1 character mapping between the two scripts. This is far more complex than, e.g., English upper and lower case. It is true that many characters do have a 1-to-1 mapping, most of which involve simplification of radicals as shown below, but for many of the more common, high-frequency characters, one simplified form corresponds to multiple traditional forms.

Characters with 1-to-1 Mappings

Simplified | Traditional | Pinyin and Meanings
马 | 馬 | ma3 - horse
开 | 開 | kai1 - to open
饭 | 飯 | fan4 - food
语 | 語 | yu3 - language

Characters with 1-to-Many Mappings

Simplified | Traditional | Pinyin and Meanings
发 | 發 | fa1 - to send out
发 | 髮 | fa4 - hair
干 | 乾 | gan1 - dry
干 | 幹 | gan4 - to do
面 | 面 | mian4 - face
面 | 麵 | mian4 - noodle
后 | 後 | hou4 - after
后 | 后 | hou4 - queen

There are also a few characters with many-to-1 mappings, where multiple simplified forms correspond to one traditional form:

Simplified | Traditional | Pinyin and Meanings
着 | 著 | zhe - (an aspect marker)
著 | 著 | zhu4 - to write

Wrong Way

The most common mistake when converting between the two scripts is to convert character by character: keeping a mapping table and replacing each simplified character with a traditional one (or the other way around), one at a time. If you apply this method to the words below, it produces wrong conversions.

Simplified | Traditional (wrong) | Traditional (correct) | Pinyin and Meanings
头发 | 頭發 | 頭髮 | tou2fa4 - hair
干面 | 干面 | 乾麵 | gan1mian4 - dry noodle
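To make the failure concrete, here is a minimal sketch of the character-by-character approach. The table `NAIVE_TABLE` and the function `naive_s2t` are hypothetical names made up for illustration, and the table is deliberately tiny; the point is only that a one-to-one table has to commit to a single traditional form per simplified character.

```python
# Hypothetical, deliberately tiny one-to-one table (simplified -> traditional).
# Each ambiguous character is forced to pick exactly one traditional form.
NAIVE_TABLE = {
    "头": "頭",
    "发": "發",  # could also be 髮 (hair)
    "干": "干",  # could also be 乾 (dry) or 幹 (to do)
    "面": "面",  # could also be 麵 (noodle)
}


def naive_s2t(text: str) -> str:
    """Replace each character independently, ignoring the surrounding word."""
    return "".join(NAIVE_TABLE.get(ch, ch) for ch in text)


print(naive_s2t("头发"))  # 頭發 (wrong; should be 頭髮)
print(naive_s2t("干面"))  # 干面 (wrong; should be 乾麵)
```

Swapping the table entries does not help: mapping 发 to 髮 would fix 头发 but break every word where 发 means "to send out".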

Unfortunately, if you use a random conversion tool you find on the Internet (for example, the one that is currently the top Google search result for “Simplified to Traditional Conversion”), there is a good chance that the tool is based on a simple but wrong algorithm like this. If you want to try out a tool for converting between these scripts, at least test it on some simple words like the ones above to see whether it was created by people who know what they are doing.
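If the tool you are evaluating can be called from code, a quick sanity check along these lines may help. This is only a sketch: `check_converter` and `TRICKY_WORDS` are hypothetical names, the word list contains just the two words from the table above, and passing it says nothing about a tool's quality beyond those two cases.

```python
from typing import Callable, List, Tuple

# Test words taken from the table above: (simplified, expected traditional).
TRICKY_WORDS: List[Tuple[str, str]] = [
    ("头发", "頭髮"),
    ("干面", "乾麵"),
]


def check_converter(convert: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every word the converter gets wrong."""
    failures = []
    for simplified, expected in TRICKY_WORDS:
        actual = convert(simplified)
        if actual != expected:
            failures.append((simplified, expected, actual))
    return failures


# Example: an identity "converter" that changes nothing fails both checks.
for simplified, expected, actual in check_converter(lambda s: s):
    print(f"{simplified}: expected {expected}, got {actual}")
```

Any function that takes a simplified string and returns a string can be passed in as `convert`, so the same check works for a local library or a thin wrapper around a web API.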

Correct Way

So what is the correct way to convert between the two scripts? I said earlier that these are scripts, not languages, but the conversion problem is complex enough that it is good practice to treat it as a (somewhat simpler form of) machine translation problem between two different languages.

If you wanted to translate between English and Spanish, you wouldn’t simply grab a dictionary and replace one word at a time, would you? To “convert” them correctly, you need to actually understand what is written and rewrite the sentence if necessary. Similarly, when you are converting from, for example, traditional to simplified, you need to “understand” whether a given “著” is an aspect marker (which should be simplified to 着) or part of a word like “著名” (which should be left as is) and decide accordingly.
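One simple way to approximate this kind of context awareness is to look up whole words before falling back to single characters. The sketch below uses greedy longest-match over a phrase table; this is a stand-in technique chosen for illustration, not a full solution and not necessarily what any particular tool does. The names `PHRASES`, `CHARS`, and `phrase_t2s` are hypothetical, and the tables are tiny. A real converter would need a large curated phrase table and would still miss cases that require genuine disambiguation.

```python
# Hypothetical, tiny tables for traditional -> simplified conversion.
# Phrase entries take priority over single-character entries.
PHRASES = {
    "著名": "著名",  # "famous": 著 is left as is
    "頭髮": "头发",
}
CHARS = {
    "著": "着",  # default: treat a lone 著 as the aspect marker
    "頭": "头",
    "髮": "发",
    "發": "发",
}
MAX_PHRASE_LEN = max(len(p) for p in PHRASES)


def phrase_t2s(text: str) -> str:
    """Convert using the longest matching phrase at each position."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate phrase starting at position i first.
        for n in range(min(MAX_PHRASE_LEN, len(text) - i), 1, -1):
            chunk = text[i:i + n]
            if chunk in PHRASES:
                out.append(PHRASES[chunk])
                i += n
                break
        else:
            # No phrase matched: fall back to the character table.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(phrase_t2s("看著"))  # 看着
print(phrase_t2s("著名"))  # 著名
```

The phrase table is what lets the two occurrences of 著 come out differently; with only `CHARS`, both would become 着.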

Correctly converting between the two scripts requires far more than character-based replacement. In fact, my recommendation as of this writing is not to implement your own conversion. Google Translate does a decent job of converting between the two if you set the source language to Simplified Chinese and the target language to Traditional Chinese (or vice versa).

If you still need to implement your own conversion, I’d recommend following Wikipedia’s conversion algorithm. I’ll further discuss this in a future post.

Further Reading

A lot of the “pitfalls” I mentioned above are elegantly covered by this amazing paper by Halpern and Kerman: The Pitfalls and Complexities of Chinese to Chinese Conversion. The paper is a bit outdated (for example, an average developer no longer needs to know the different character sets and encodings for Chinese, since everyone uses Unicode/UTF nowadays), but it is a great starting point if you’d like to understand this issue in more depth.

© Masato Hagiwara. All rights reserved.
