Friday, January 9, 2009

Chinese, Japanese, and Korean Input in Ubuntu

For most European languages (and many other alphabet-based non-European languages), pressing a letter on the keyboard simply prints that letter to the screen. However, Chinese, Japanese, and Korean require a kind of conversion process that is handled by a special application (actually a set of applications) called an Input Method Editor (IME). Of course, this is a Windows-world term, but I will use it here for convenience’s sake. In any case, each of these languages has its own IME, and each is quite different due to the basic differences in the three writing systems.


Chinese
While most people (at least those in the linguistic know) would think that Chinese would be the most complicated system, because the writing system consists of thousands of characters, it is in fact the simplest. The Chinese IME simply takes the romanized keyboard input, known as pinyin, and converts it into Chinese characters, or Hanzi. For the IME, it is essentially a simple dictionary lookup task—big dictionary, simple IME. In the event that there is more than one character for the pinyin input, a list of possible candidates will appear, and the user can then simply select the appropriate character from that list.
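To make the "big dictionary, simple IME" idea concrete, here is a minimal sketch of the lookup step in Python. The table `PINYIN_TABLE`, its tiny set of entries, the candidate ordering, and the function names `candidates` and `choose` are all made up for illustration; a real IME ships a far larger dictionary and ranks candidates by frequency and context.

```python
# Toy pinyin-to-Hanzi table (hypothetical; a real IME dictionary is far larger
# and orders candidates by usage frequency and context).
PINYIN_TABLE = {
    "ma": ["吗", "妈", "马", "码"],
    "zhong": ["中", "种", "重"],
    "wen": ["文", "问", "闻"],
}


def candidates(pinyin):
    """Return the list of candidate characters for one pinyin syllable."""
    return list(PINYIN_TABLE.get(pinyin.lower(), []))


def choose(pinyin, choice):
    """Pick a candidate by its 1-based position, as in a candidate window.

    Unknown input is returned unchanged, since there is nothing to convert.
    """
    options = candidates(pinyin)
    if not options:
        return pinyin
    if not 1 <= choice <= len(options):
        raise ValueError("choice is outside the candidate list")
    return options[choice - 1]


if __name__ == "__main__":
    print(candidates("ma"))                        # the candidate list
    print(choose("zhong", 1) + choose("wen", 1))   # first candidate of each
```

Typing a syllable corresponds to `candidates`, and picking an entry from the pop-up list corresponds to `choose`.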


Japanese
The Japanese IME has a considerably more complicated task to perform, as it has three writing systems to deal with: Kanji (ideographic characters borrowed long ago from China), hiragana (the phonetic syllabary used mainly for tense and case endings), and katakana (used mainly for words borrowed from other languages). Still, the standard input method for Japanese is primarily via the standard Roman keyboard layout, plus a few extra special-function keys. Thus, typing in Japanese is a two-step process whereby the IME first converts the romanized text into hiragana as it is typed and then converts it to appropriate Kanji, katakana, or hiragana elements after the spacebar is pressed.

Consider an example in two lines. In the first line (りなっくすでにほんごにゅうりょくもできます), the IME has already converted the romanized input on the fly: it has converted rinakkusudenihongonyuuryokumodekimasu (which means "You can also input Japanese in Linux") to hiragana. The fact that the line is underlined on screen means that it has not yet been converted beyond that. In the second line (リナックスで日本語入力もできます), however, the user has subsequently pressed the spacebar, which caused the IME to convert the hiragana string into the appropriate Kanji, hiragana, and katakana elements. The first word, Linux, has been converted to katakana text, as it is a borrowed word, while Japanese input has been converted to Kanji; the rest stays in hiragana.
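The first of the two steps, turning romanized keystrokes into hiragana, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table `ROMAJI_TO_HIRAGANA` is deliberately partial (just enough syllables for the sample sentence, romanized here as rinakkusudenihongonyuuryokumodekimasu), the function name `to_hiragana` is made up, and the greedy longest-match strategy is a simplification of what real IMEs do. The second step, choosing Kanji and katakana, needs a dictionary and grammar analysis and is not shown.

```python
# Partial romaji-to-hiragana table (hypothetical and incomplete on purpose).
ROMAJI_TO_HIRAGANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "su": "す", "se": "せ", "so": "そ",
    "na": "な", "ni": "に", "nu": "ぬ", "ne": "ね", "no": "の",
    "ha": "は", "hi": "ひ", "ho": "ほ",
    "ma": "ま", "mi": "み", "mu": "む", "me": "め", "mo": "も",
    "ra": "ら", "ri": "り", "ru": "る", "re": "れ", "ro": "ろ",
    "ga": "が", "go": "ご", "de": "で", "do": "ど",
    "nyu": "にゅ", "ryo": "りょ",
    "n": "ん",
}


def to_hiragana(romaji):
    """Convert a romanized string to hiragana using greedy longest match."""
    text = romaji.lower()
    out = []
    i = 0
    while i < len(text):
        # A doubled consonant (other than n) becomes the small tsu.
        if (i + 1 < len(text) and text[i] == text[i + 1]
                and text[i] not in "aeioun"):
            out.append("っ")
            i += 1
            continue
        # Try the longest chunk first: 3 letters, then 2, then 1.
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            # Unknown letter: pass it through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(to_hiragana("rinakkusudenihongonyuuryokumodekimasu"))
```

A lone `n` only becomes ん when no longer match such as `ni` or `nyu` is available, which is why the longest chunk is tried first.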


Korean
The job of the Korean IME is again quite different from that of the Chinese and Japanese IMEs, as the language itself is written in a very different way. Korean is written either entirely in alphabetic letters, called Hangul, or in a combination of Hangul and ideographic characters borrowed from Chinese called Hanja. While the Hanja characters are essentially the same as their Chinese and Japanese counterparts, Hanzi and Kanji, the Korean phonetic alphabet, Hangul, has its own unique appearance, as you can see in the Korean word for Korea, Hangug(k), spelled out letter by letter: ㅎㅏㄴㄱㅜㄱ.

This seems simple; however, the representation is not quite correct, as Korean is unique in the way that its alphabetic characters are put to the page. Unlike hiragana, katakana, and the letters of most other alphabetic scripts, which are simply placed side by side, Hangul letters are grouped in pairs, triplets, or even quadruplets, which are written, as a general rule, clockwise. The IME, therefore, must take the input (usually based on a Korean alphabetical keyboard layout) while it is being typed, and it must adjust the size, spacing, and positioning of each of the letters as it puts them into appropriate clusters.
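In Unicode, each of these clusters has its own precomposed code point, so the grouping can be illustrated with simple arithmetic. The Python sketch below uses the standard Unicode formula for Hangul syllables (base 0xAC00, 21 vowels, 28 final-consonant slots); the function names `compose` and `compose_word` are made up for illustration, and a real IME additionally has to decide on the fly where one cluster ends and the next begins, which is not shown here.

```python
# Letter orderings follow the Unicode Hangul syllable layout.
LEADS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
VOWELS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
TAILS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
         "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
         "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]   # 28 finals, "" = none

HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable


def compose(lead, vowel, tail=""):
    """Combine an initial consonant, a vowel, and an optional final consonant
    into one precomposed Hangul syllable."""
    lead_index = LEADS.index(lead)
    vowel_index = VOWELS.index(vowel)
    tail_index = TAILS.index(tail)
    code = HANGUL_BASE + (lead_index * len(VOWELS) + vowel_index) * len(TAILS) + tail_index
    return chr(code)


def compose_word(clusters):
    """Compose a word from a list of (lead, vowel[, tail]) tuples."""
    return "".join(compose(*cluster) for cluster in clusters)


if __name__ == "__main__":
    # The six letters of "Hanguk" grouped into two triplets.
    print(compose_word([("ㅎ", "ㅏ", "ㄴ"), ("ㄱ", "ㅜ", "ㄱ")]))
```

Passing a letter that is not valid in a given position raises `ValueError` from `str.index` or `list.index`, which is acceptable for a sketch but would need friendlier handling in real input code.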

Source of Information: Ubuntu for Non-Geeks (2nd Ed)
