Monthly Archives: June 2007

Unicode on the Console

As Linux spreads, it is necessary to bring it into line with the current standards for internationalization. Arguably the most important of these is Unicode, the standard which allows for representation of virtually all modern languages in plain text. It is important for Linux to support Unicode well, and the console is a core part of a Linux system.

I’ve recently been experimenting with implementing Unicode normalization, largely for learning purposes. In testing my code, I came across a discrepancy between how UTF-8 sequences containing combining characters, such as <U+1E0C, U+0307>, are displayed and how they are handled at a lower level. Such sequences would appear as one glyph per character, rather than one glyph per combining sequence. Worse, backspacing or moving through the text in question would throw off the display of the cursor and the characters. Backspacing through the example above would delete both characters together, but remove only the glyph for U+0307. Seemingly, the shell was handling combining sequences as single characters, while those same sequences were being displayed (and treated) as multiple characters.
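To make the example sequence concrete, here is a minimal sketch that inspects it with Python's standard unicodedata module. The helper name describe is my own invention for illustration and is not taken from any of the tools discussed here.

```python
import unicodedata

def describe(text):
    """Return (code point, character name, canonical combining class) per character.

    Hypothetical helper for illustration only.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]

# The sequence from the text: a precomposed base letter plus one combining mark.
sequence = "\u1E0C\u0307"

# NFC leaves this sequence unchanged, since Unicode has no single precomposed
# character for this combination; NFD splits the base letter further.
nfc = unicodedata.normalize("NFC", sequence)
nfd = unicodedata.normalize("NFD", sequence)

for form, value in (("input", sequence), ("NFC", nfc), ("NFD", nfd)):
    print(form, describe(value))
```

The combining class is zero for base characters and non-zero for marks such as U+0307. That distinction is what a terminal or shell would need to consult to decide where one combining sequence ends and the next begins.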

I filed a bug with bash, and it turns out that the problem comes partially from gnome-terminal, the emulator that I use. (Incidentally, the problem is also present in Xfce’s terminal emulator.) For bash’s Unicode handling to work properly, the terminal should display the characters as composed sequences. zsh did not have this problem, as it treats each character in a sequence as separate. xterm properly composes the characters and thus has no problems with this particular issue.
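The mismatch can be sketched as two different ways of counting screen cells for the same string. This is a deliberate simplification, not how bash, zsh, or any terminal actually computes widths. Both function names are hypothetical, and the sketch ignores wide East Asian characters and zero-width characters whose combining class is zero.

```python
import unicodedata

def cells_per_code_point(text):
    """Model a program that draws one cell for every code point."""
    return len(text)

def cells_per_sequence(text):
    """Model a program that draws one cell per combining sequence.

    Assumption: a character with a non-zero canonical combining class
    attaches to the preceding base character and takes no cell of its own.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

example = "\u1E0C\u0307"
print(cells_per_code_point(example), cells_per_sequence(example))
```

When the shell uses one model and the terminal uses the other, the two disagree about where the cursor is after the example sequence, which matches the misplaced cursor and leftover glyph described above.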

This isn’t quite the end of the problem, however. Applications usually handle these sequences as separate characters, though they display them as one. Thus, backspacing through the above example would delete first U+0307 and then U+1E0C. This is especially important in certain scripts where a combining sequence can contain three or more individual characters, and where deleting full sequences at once could grow to be quite an inconvenience.
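The two backspace behaviours can be contrasted in a short sketch. The function names are hypothetical, and the sequence boundary rule used here (a base character followed by any characters with a non-zero combining class) is a rough approximation, not the full grapheme cluster rules from the Unicode standard.

```python
import unicodedata

def backspace_code_point(text):
    """Delete only the last code point, as most applications do."""
    return text[:-1]

def backspace_sequence(text):
    """Delete the last base character together with its trailing combining marks."""
    end = len(text)
    # Skip backwards over trailing combining marks.
    while end > 0 and unicodedata.combining(text[end - 1]) != 0:
        end -= 1
    # Then remove the base character they were attached to, if there is one.
    if end > 0:
        end -= 1
    return text[:end]

line = "a\u1E0C\u0307"
after_one_mark = backspace_code_point(line)   # leaves "a" + U+1E0C
after_sequence = backspace_sequence(line)     # leaves "a"
```

With the first function, a user can remove a mistyped mark without retyping the base letter. With the second, a single keypress removes everything that is drawn in one cell.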

Further, insisting on display of composed sequences could create spacing problems in some scripts, such as Tibetan. With this script, one will often come across characters that, when properly composed, become quite tall. Any proper console renderer would then need to deal with vertical character spacing in strange and possibly unpredictable ways. Indic scripts or scripts like New Tai Lue, where logical and visual order do not necessarily match, would also have to be handled.

The best option, perhaps, is for community members to gather and discuss the direction of Unicode text on the console. How much support for complex scripts should there be? Will terminals need to be able to do visual reordering for Indic scripts? Will vertical character stacks be supported? Will consoles need to be able to render RTL as well? The terminals we use are relatively simple, and I think many would see this simplicity as a virtue. Full support for Unicode rendering would greatly increase the required complexity, but the current support is mixed and inconsistent at best. I believe it to be imperative that some sort of consensus be reached as to how best to support Unicode and the potential it offers to Linux around the world.