Unicode on the Console

As Linux spreads, it is necessary to bring it into line with the current standards for internationalization. Arguably the most important of these is Unicode, the standard which allows for representation of virtually all modern languages in plain text. It is important for Linux to support Unicode well, and the console is a core part of a Linux system.

I’ve recently been experimenting with implementing Unicode normalization, largely for learning purposes. In testing my code, I came across a problem with discrepancies in display and lower-level handling of UTF-8 sequences containing combining characters, such as <U+1E0C, U+0307>. Sequences would appear as one glyph per character, rather than one glyph per combining sequence. However, backspacing or moving through the text in question would throw off the display of the cursor and the characters. Backspacing through the example above would delete both characters together, but remove only the glyph for U+0307. Seemingly, the shell was handling combining sequences as single characters, but those sequences were being displayed (and treated) as multiple characters.
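To make the example sequence concrete, here is a minimal sketch, assuming Python 3 and its standard `unicodedata` module (neither of which the post itself uses). It inspects <U+1E0C, U+0307> at the code point, byte, and normalization levels. The `cells` count is only a rough stand-in for how many glyph positions a composing terminal might use, not a real width calculation.

```python
import unicodedata

# The sequence from the post: U+1E0C followed by U+0307.
seq = "\u1e0c\u0307"

# Show each code point with its name and canonical combining class.
for ch in seq:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} "
          f"(combining class {unicodedata.combining(ch)})")

utf8 = seq.encode("utf-8")
nfc = unicodedata.normalize("NFC", seq)
nfd = unicodedata.normalize("NFD", seq)

print(len(seq), "code points,", len(utf8), "UTF-8 bytes")
print("NFC:", [f"U+{ord(c):04X}" for c in nfc])
print("NFD:", [f"U+{ord(c):04X}" for c in nfd])

# Rough estimate (an assumption, not a real width algorithm): count only
# characters whose general category is not a mark (Mn, Mc, Me).
cells = sum(1 for ch in seq if not unicodedata.category(ch).startswith("M"))
print("estimated glyph positions:", cells)
```

Unicode has no single precomposed character for this combination, so the sequence stays at two code points even under NFC. Anything that counts code points and anything that counts composed glyphs will therefore disagree about its length.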

I filed a bug with bash, and it turns out that the problem comes partially from gnome-terminal, the emulator that I use. (Incidentally, the problem is also present in Xfce’s terminal emulator.) To work properly with bash, the terminal should display each combining sequence as a single composed glyph. zsh did not have this problem, as it treats each character in a sequence as separate. xterm properly composes the characters and thus does not have this particular issue.

This isn’t quite the end of the problem, however. Applications usually handle these sequences as separate characters, though they display them as one. Thus, backspacing through the above example would delete first U+0307 and then U+1E0C. This is especially important in certain scripts where a combining sequence can contain three or more individual characters, and deleting full sequences could grow to be quite an inconvenience.
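The two backspace behaviours can be compared in a short Python sketch. The function names here are hypothetical, and the grouping rule (a base character plus any following characters whose general category is a mark) is a simplification. It is not the full grapheme cluster algorithm, and it is not how bash, zsh, or any terminal actually implements deletion.

```python
import unicodedata

def combining_sequences(text):
    """Split text into runs of one base character plus trailing marks."""
    sequences = []
    for ch in text:
        # Categories Mn, Mc and Me are the Unicode mark categories.
        if sequences and unicodedata.category(ch).startswith("M"):
            sequences[-1] += ch
        else:
            sequences.append(ch)
    return sequences

def backspace_code_point(text):
    """Delete only the last code point."""
    return text[:-1]

def backspace_sequence(text):
    """Delete the whole last combining sequence."""
    return "".join(combining_sequences(text)[:-1])

line = "a\u1e0c\u0307"

print(combining_sequences(line))          # two sequences: 'a' and the pair
print(ascii(backspace_code_point(line)))  # U+0307 removed, U+1E0C kept
print(ascii(backspace_sequence(line)))    # both U+1E0C and U+0307 removed
```

Whichever policy is chosen, the shell and the terminal have to agree on it. The cursor problems described earlier come from one side counting sequences while the other counts code points.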

Further, insisting on display of composed sequences could create spacing problems in some scripts, such as Tibetan. With this script, one will often come across characters that, when properly composed, become quite tall. This would mandate that all proper console rendering would need to deal with vertical character spacing in strange and possibly unpredictable ways. Indic scripts or scripts like New Tai Lue, where logical and visual order do not necessarily match, would also have to be handled.
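As a rough illustration of how deep a Tibetan stack can get, the following Python sketch counts how many code points pile onto each base character. The syllable is an assumption chosen for illustration (it does not appear in the post), and the code point count is only a proxy for rendered height, which depends on the font and the shaping engine.

```python
import unicodedata

def stack_sizes(text):
    """Return, per base character, the number of code points in its stack."""
    sizes = []
    for ch in text:
        # Subjoined letters and vowel signs are marks (category Mn).
        if sizes and unicodedata.category(ch).startswith("M"):
            sizes[-1] += 1
        else:
            sizes.append(1)
    return sizes

# Illustrative syllable: BA, SA + subjoined GA + subjoined RA + vowel U, BA, SA.
syllable = "\u0f56\u0f66\u0f92\u0fb2\u0f74\u0f56\u0f66"

print(stack_sizes(syllable))

# The subjoined letters have canonical combining class 0, so testing
# unicodedata.combining() alone would not attach them to their base.
print([unicodedata.combining(ch) for ch in "\u0f92\u0fb2"])
```

The second cell of this syllable carries four code points stacked on one base, which is the kind of tall cluster a fixed-height character grid would have to accommodate.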

The best option, perhaps, is for community members to gather and discuss the direction of Unicode text on the console. How much support for complex scripts should there be? Will terminals need to be able to do visual reordering for Indic scripts? Will vertical character stacks be supported? Will consoles need to be able to render RTL as well? The terminals we use are relatively simple, and I think many would see this simplicity as a virtue. Full support for Unicode rendering would greatly increase the required complexity, but the current support is mixed and inconsistent at best. I believe it to be imperative that some sort of consensus be reached as to how best to support Unicode and the potential it offers to Linux around the world.

4 thoughts on “Unicode on the Console”

  1. atg

    Hey, do you have a screenshot of what the Tibetan script looks like, properly composed (or not)? And, when people use that script in print, do they have uniform line spacing or what? (If this is a bit clueless, sorry, but I just know nothing about vertical character stacks etc.) And, hey, great first post. Waiting for more…

  2. Sean Burke

    Well, I've put up an example here. The text is from the Babelstone blog, used as a test case for a certain font. (The font in question doesn't work very well, presumably because of the incompleteness of the Pango Tibetan renderer.) This is that text, however, rendered with Tibetan Machine Uni. I've indicated certain spots where I think (though I'm not sure) that the rendering is a little off. Other than that, it looks fine to me. This isn't a very complex example. A more complex one is available at Babelstone, but I don't have the text he uses so I can't try to replicate it. (I'm not willing to hand-type it all.) As far as typographical features like line spacing, I couldn't say. It's possible that the writer of Babelstone could shed some light on that, though.

  3. Beat

    You can have a look at the site above for Tibetan script. Konsole shows simple Tibetan characters, but not complex ones.
    bkra shis bde legs
    will look like
    bka shas bda lags
    Any hint for a pango enabled console welcome
    — Beat
