Thursday, December 04, 2003

More Languages

In order to use Simputers everywhere in the world, we need good support for many languages and writing systems. Support for a writing system consists of the following.

For simple writing systems the minimum is just a font containing a glyph (visible form) for each Unicode character, so that information can be displayed as plain text.
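As a rough illustration of that minimum, here is a small Python sketch that checks which characters in a piece of text have no glyph in a font. The font is modeled simply as a set of code points; the name `ASCII_ONLY_FONT` and the function `missing_glyphs` are made up for this example, and a real check would read the coverage table out of the font file.

```python
def missing_glyphs(text, covered_code_points):
    """Return the characters in text that the font cannot show, in order, without repeats."""
    missing = []
    for ch in text:
        if ord(ch) not in covered_code_points and ch not in missing:
            missing.append(ch)
    return missing


# Hypothetical font that covers only printable ASCII.
ASCII_ONLY_FONT = set(range(0x20, 0x7F))

print(missing_glyphs("Simputer", ASCII_ONLY_FONT))
print(missing_glyphs("na\u00efve caf\u00e9", ASCII_ONLY_FONT))
```

An empty result means plain-text display is possible with that font; anything else is a character that would show up as a blank or a box.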

Arabic and the alphabets of India and other Asian countries require special rendering software, because the shapes of letters change in combination with others (multiple glyphs for each character), and some combinations have shapes that are not made from the shapes of the separate letters (ligature glyphs). Every writing system requires rendering software to go beyond minimal text display. Among the functions of rendering software are placing accents on letters, letter spacing, word spacing, line breaking, and justification. Mathematics also has special rendering requirements including two-dimensional layout.
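To make "multiple glyphs for each character" concrete, the sketch below uses Python's standard `unicodedata` module to list the positional forms that Unicode encodes for one Arabic letter, and to show that a ligature and an accented letter each correspond to more than one underlying character. This is only an illustration: real rendering software normally selects glyphs from tables in the font, not from these compatibility characters, and the function name `contextual_forms` is my own.

```python
import unicodedata


def contextual_forms(letter):
    """Map a position tag (isolated, final, initial, medial) to the
    presentation-form character Unicode encodes for one Arabic letter."""
    forms = {}
    for cp in range(0xFE70, 0xFF00):  # Arabic Presentation Forms-B block
        parts = unicodedata.decomposition(chr(cp)).split()
        # A positional form decomposes to a tag plus one base letter,
        # for example "<initial> 0628".
        if len(parts) == 2 and int(parts[1], 16) == ord(letter):
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


# Four shapes for the single letter BEH (U+0628).
for position, form in contextual_forms("\u0628").items():
    print(position, f"U+{ord(form):04X}", unicodedata.name(form))

# The LAM-ALEF ligature (U+FEFB) stands for two separate letters.
print([f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", "\uFEFB")])

# An accent placed on a letter: "e" plus a combining acute composes to one character.
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```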

Fonts and rendering take care of the output. Keyboards and IMEs (input method editors) handle input. There are two kinds of keys in a keyboard layout, character keys and modifier keys. Pressing a character key inserts a character into the text stream. Modifier keys (shift, alt, ctrl, and possibly others) temporarily change the assignment of characters to keys. Standard keyboards have 47 keys for visible characters, plus Space, Tab, Back Space, and Enter.

IMEs are used for languages with character sets too large for a keyboard layout, mainly Chinese, Korean, and Japanese. The Korean Hangul alphabet and the Japanese kana syllabaries map easily to keyboards. However, entering the thousands of Chinese characters used in each language requires other techniques, including phonetic conversion, code-based systems, and shape-based systems. In phonetic conversion the user types the words in the appropriate alphabet or syllabary, and the IME software looks up the words in a dictionary to find the right characters. Code-based systems require the user to memorize or look up the character codes and type them numerically. Shape-based systems have rules for dividing characters into pieces, and assign the various pieces to a regular keyboard layout. One key typically represents several related shapes, as in the Cangjie IME for Chinese.
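Phonetic conversion is easy to sketch in Python. The example below is a toy, not how any particular IME works: the three-entry dictionary `READINGS` and the functions `candidates` and `convert` are invented for illustration, the input is assumed to be already split into words, and the user's choice among candidates is passed in as a plain mapping. A real IME does its own segmentation and ranks candidates from a very large dictionary.

```python
# Toy dictionary: kana reading -> candidate spellings, most likely first.
# The entries are illustrative only.
READINGS = {
    "にほん": ["日本"],
    "ご": ["語", "五", "後"],
    "かんじ": ["漢字", "感じ", "幹事"],
}


def candidates(reading):
    """Return possible spellings for a reading, ending with the reading itself."""
    return READINGS.get(reading, []) + [reading]


def convert(segments, choices=None):
    """Convert pre-segmented readings to text.

    choices maps a segment index to the candidate the user picked;
    segments without an entry get the first candidate.
    """
    choices = choices or {}
    return "".join(
        candidates(seg)[choices.get(i, 0)] for i, seg in enumerate(segments)
    )


print(convert(["にほん", "ご"]))    # first candidate for each word
print(convert(["かんじ"], {0: 1}))  # user picks the second candidate
```

Keeping the typed reading as the last candidate means that a word missing from the dictionary still comes through as kana.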

Support for a language requires support for one or more writing systems used to write that language. Many languages are written in more than one writing system. Croats write their language in the Latin alphabet, and Serbs write the same language in Cyrillic. Turkish was written in the Arabic alphabet for centuries, but is now written in the Latin alphabet. The Soviet Union mandated Cyrillic for almost all languages. The newly independent republics have in many cases gone back to their traditional writing system, as in Mongolia (Mongolian) and Azerbaijan (Arabic), or to the Latin alphabet. In each writing system, there are further requirements.

Several languages are used in more than one country. Typically, they require separate locale support, dictionaries, and style checkers for each country.

Although these needs have been obvious for many years, commercial software platform vendors (Microsoft, Apple, Sun) have been slow to provide support for the writing systems and languages of developing countries. We are in sight of the goal of complete support, however, through several Free Software projects.

Pango (Greek παν, pan, all; Japanese 語, go, language) comes from the Linux world, where it has been integrated into The GIMP (GNU Image Manipulation Program) and the gedit text editor. Graphite began on Windows, but a Linux version is well under way. The SILA browser is a version of Mozilla for Windows with Graphite built in. There is some discussion going on about merging the Pango and Graphite projects, but there are no definite plans. Linux User Groups in several countries are working on Unicode fonts for their writing systems, and there are other Unicode font projects.

Mandrake Linux, the version I use on the desktop, comes with some support for the following writing systems. I have supplied a few random characters for each. You will need appropriate fonts and a browser that can display them in order to see them correctly. Mozilla, SILA, Opera, and MS Internet Explorer all have moderately good Unicode support, but none is complete.

The writing systems in the current version of Unicode still missing from Mandrake Linux are

Thaana, Ethiopic, and Cherokee present no rendering problems and could be added quickly. Linguists at SIL and Evertype are working on the others. There are also lots of historical Chinese characters in Unicode that are not provided in readily available Unicode fonts.

You can look at the full set of characters for each writing system, and find out where it sits in Unicode, in the PDFs on the Code Charts page of the Unicode site. One of the best browser tests, also available in PDF format, is the Compelling Unicode Examples page, which gives the names of famous people from around the world, both in Latin alphabet versions and the way they write them at home.
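If you would rather test your own fonts than read PDFs, a few lines of Python will print a rough code chart for any range. This is only a sketch: the function name `code_chart` is my own, and the names come from the Unicode database bundled with your Python, which may be older or newer than the current charts.

```python
import unicodedata


def code_chart(first, last):
    """Return (code point label, character, name) for each named character in a range."""
    rows = []
    for cp in range(first, last + 1):
        name = unicodedata.name(chr(cp), None)  # None for unassigned code points
        if name is not None:
            rows.append((f"U+{cp:04X}", chr(cp), name))
    return rows


# The first few letters of the Cherokee block, which starts at U+13A0.
for label, ch, name in code_chart(0x13A0, 0x13A5):
    print(label, ch, name)
```

Whether the characters in the middle column appear as letters or as boxes tells you whether you have a font for that writing system installed.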
