MetaBrainz GSoC – Week 1

This year, I am participating in the Google Summer of Code under the banner of the MetaBrainz Foundation. My project is to implement improved support for collections in MusicBrainz, allowing users to have multiple collections, share their collections with other users, and have more power in interacting with their collections.

The first week of GSoC is over, so it’s time for a week-in-review. Accomplished this week:

  • After a bit of tooth-and-claw, got a working install of the current git version of the server
  • Dug into the documentation of and familiarized myself with Moose, the Template Toolkit, and Catalyst
  • Spent some time reading and comparing current collection code with bits from the rest of the server, to get a feel for how things work in MB

A bit of a slow start, but I feel that I now have a firm understanding of the internal structure of MusicBrainz and have a clear idea of what needs to be changed and how. Bearing that in mind, my goals for the next week:

  • Get a working implementation of multiple user collections
  • Allow for privacy settings on collections and discovery of other users’ collections
  • Get feedback on and smooth out UIs

Arrakis, the planet known as…

I am a frequent contributor at the English Wiktionary, largely contributing Irish translations of English words. Recently, someone added a request for the etymology of the Irish word “dún”, meaning “fort” and a common element in Irish placenames, such as Dún na nGall (“Fort of the Foreigners”), the Irish for Donegal. This is a request easily filled. The earliest attested form of the word is the Old Irish “dún”, carrying the same meaning. I decided I’d see what else I could dig up, though.

There’s an online etymological dictionary for Celtic languages. It has some apparent encoding problems and does not source its data, but what I’ve been able to compare of its Proto-Indo-European roots usually matches up with other sources. I fed it “dún” to see what it had to say. It traces the word back to the Proto-Celtic root *dūno- and lists a few cognates in other Celtic languages. Then, however, it gets interesting.

Matasović provides both a Proto-Indo-European form *dʰuHno- (“enclosure”) and a cognate in none other than Old English, namely “dūn”. Come the Great Vowel Shift ca. 1500 and you end up with Modern English “down”, a now-archaic word meaning “hill” (and still found in some placenames). The Oxford English Dictionary also suggests that an inflected form meaning “off the hill” is the origin of the adverbial form we know and love.

Things do get more interesting from there. Matasović also suggests that English “town” and German “Zaun” (“fence”) are related to this word through a Proto-Germanic form *tūno-, the Modern English coming from the Old English “tūn”. This would, of course, suggest that in Proto-Germanic the forms *dūno- and *tūno- existed side by side. To account for this, Matasović suggests that *tūno- is borrowed from Proto-Celtic *dūno- somewhere along the line. To see how that could work, we’ll look more closely at the Proto-Indo-European root and a few sound changes that occurred in languages derived from Proto-Indo-European.

As mentioned above, the Proto-Indo-European form Matasović gives is *dʰuHno-. Proto-Germanic underwent a series of sound changes, including the change from PIE *dʰ > *d. Presumably this change happens in Proto-Celtic as well, though I haven’t read any research on this. (It seems both also have PIE *uH > *ū. I’m not clear on what exactly happens here. The H itself could be one of three proposed sounds in PIE, but I’m too unfamiliar with PIE to speculate much.) However, Proto-Germanic also undergoes a change of PIE *d > *t. If Matasović is correct, this would indicate the following order of events: Proto-Celtic changes *dʰ > *d, Proto-Germanic borrows Proto-Celtic *dūno-, Proto-Germanic changes *d > *t and *dʰ > *d. The first of the Proto-Germanic sound changes would have to happen either earlier than or contemporaneously with the second for the sounds to remain separate.
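The ordering argument above can be checked mechanically. What follows is only a toy sketch in Python: the function names are my own, the rules are limited to the three changes named in the text, and real historical phonology is of course far messier than string substitution. It applies the two Proto-Germanic changes in a single pass (that is, contemporaneously) and then, for contrast, applies *dʰ > *d first and *d > *t afterwards.

```python
import re

# Only the sound changes named in the text; everything else is ignored.
CELTIC_SHIFT = {"dʰ": "d"}
GERMANIC_SHIFT = {"dʰ": "d", "d": "t"}


def apply_shift(form, rules):
    """Apply all rules in one left-to-right pass, so the output of one
    rule is never fed into another (a 'contemporaneous' change)."""
    # Try longer inputs first so that "dʰ" is matched before plain "d".
    keys = sorted(rules, key=len, reverse=True)
    pattern = "|".join(re.escape(k) for k in keys)
    return re.sub(pattern, lambda m: rules[m.group()], form)


def lengthen(form):
    """PIE *uH > *ū, which the text says both branches share."""
    return form.replace("uH", "ū")


pie = "dʰuHno-"

# Proto-Celtic: *dʰ > *d
proto_celtic = lengthen(apply_shift(pie, CELTIC_SHIFT))

# Proto-Germanic, inherited directly from PIE with both changes at once
inherited = lengthen(apply_shift(pie, GERMANIC_SHIFT))

# Proto-Germanic, borrowed from Proto-Celtic before *d > *t
borrowed = apply_shift(proto_celtic, GERMANIC_SHIFT)

# Wrong order: *dʰ > *d first, then *d > *t sweeps up the new *d as well
merged = lengthen(apply_shift(apply_shift(pie, {"dʰ": "d"}), {"d": "t"}))

print(proto_celtic, inherited, borrowed, merged)
```

With the changes applied together, the inherited form keeps its *d while the borrowed form ends up with *t, so the two stay distinct. With *dʰ > *d applied first, the inherited form is shifted a second time and the two fall together, which is why the order matters.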

There’s more, of course. Thinking of Old English “dūn” and the semantically similar Modern English “dune”, I decided to investigate the provenance of that word as well. The OED’s earliest citation for the word is 1790 and its etymology section indicates it comes from Modern French “dune”. It also indicates that this latter is cited in Old French in the 13th century, Old French having borrowed it from Middle Dutch “dûne”. This in turn derives from Old Dutch “dûna”. Wiktionary goes on to suggest that this is a borrowing from a Celtic language, while the Online Etymology Dictionary suggests specifically that it is from Gaulish *dunom (Gaulish being a Celtic language spoken on the Continent). Of note, neither Wiktionary nor the Online Etymology Dictionary gives citations for its suggestions, and neither is necessarily reliable. (Though a niggling point, the fact that other sources consistently give *dunum as the reconstruction of the Gaulish reflex of *dūno- may demonstrate why I distrust Etymonline.)

It is possible that this form is a borrowing from a Celtic language. It also seems to me (though unbolstered by research or clear knowledge of the various changes that occurred in the languages involved) that Old Dutch could simply have derived its form from the same Proto-Germanic root that yielded Old English “dūn”. Then again, that root could itself be an even earlier borrowing from Proto-Celtic. It’s possible that some research into Indo-European languages outside of Proto-Celtic and Proto-Germanic would reveal a reflex that indicated that the PIE root did exist. Further research is, as always, necessary.

Unicode on the Console

As Linux spreads, it is necessary to bring it into line with the current standards for internationalization. Arguably the most important of these is Unicode, the standard which allows for representation of virtually all modern languages in plain text. It is important for Linux to support Unicode well, and the console is a core part of a Linux system.

I’ve recently been experimenting with implementing Unicode normalization, largely for learning purposes. In testing my code, I came across a problem with discrepancies in display and lower-level handling of UTF-8 sequences containing combining characters, such as <U+1E0C, U+0307>. Sequences would appear as one glyph per character, rather than one glyph per combining sequence. However, backspacing or moving through the text in question would throw off the display of the cursor and the characters. Backspacing through the example above would delete both characters together, but remove only the glyph for U+0307. Seemingly, the shell was handling combining sequences as single characters, but those sequences were being displayed (and treated) as multiple characters.
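For anyone who wants to poke at the example sequence, here is a small sketch using Python’s standard unicodedata module (not my own normalization code). It lists the code points of <U+1E0C, U+0307> with their canonical combining classes and shows what the sequence looks like under NFD and NFC.

```python
import unicodedata

# D WITH DOT BELOW followed by COMBINING DOT ABOVE
seq = "\u1E0C\u0307"

# One line per code point: value, canonical combining class, name
for ch in seq:
    ccc = unicodedata.combining(ch)
    print(f"U+{ord(ch):04X}  ccc={ccc:3d}  {unicodedata.name(ch)}")

# NFD splits U+1E0C into D + COMBINING DOT BELOW and keeps the marks
# sorted by combining class (220 before 230).
nfd = unicodedata.normalize("NFD", seq)
print([f"U+{ord(ch):04X}" for ch in nfd])

# NFC recomposes D + dot below; there is no precomposed character that
# also carries the dot above, so the sequence stays two code points long.
nfc = unicodedata.normalize("NFC", seq)
print([f"U+{ord(ch):04X}" for ch in nfc])

# The same text typed with the marks in the other order normalizes
# to the same result.
reordered = unicodedata.normalize("NFC", "D\u0307\u0323")
print(reordered == nfc)
```

The point for the console is that even in fully composed form this is still two code points that a reader sees as a single letter, so normalization alone cannot make the problem go away.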

I filed a bug with bash and it turns out that the problem comes partially from gnome-terminal, the emulator that I use. (Incidentally, the problem is also present in Xfce’s terminal emulator.) For bash to handle Unicode properly, the terminal should display the characters as composed sequences. zsh did not have this problem, as it treats each character in a sequence as separate. xterm properly composes the characters and thus does not have this particular issue.

This isn’t quite the end of the problem, however. Applications usually handle these sequences as separate characters, though they display them as one. Thus, backspacing through the above example would delete first U+0307 and then U+1E0C. This is especially important in certain scripts where a combining sequence can contain three or more individual characters, and deleting full sequences could grow to be quite an inconvenience.
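The two backspace behaviours described here can be sketched in a few lines of Python. This is a rough approximation, not how bash, zsh, or any terminal actually does it: the function names are made up for illustration, and grouping by “base character plus following marks” is much cruder than the grapheme cluster rules in the Unicode text segmentation annex.

```python
import unicodedata


def combining_sequences(text):
    """Group each base character with the marks that follow it.
    A rough stand-in for real grapheme cluster segmentation."""
    groups = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if groups and is_mark:
            groups[-1] += ch       # attach the mark to the previous base
        else:
            groups.append(ch)      # start a new sequence
    return groups


def backspace_code_point(text):
    """Application-style backspace: remove one code point."""
    return text[:-1]


def backspace_sequence(text):
    """Shell-style backspace: remove the whole combining sequence."""
    return "".join(combining_sequences(text)[:-1])


line = "a\u1E0C\u0307"

print(len(line))                        # code points in the line
print(len(combining_sequences(line)))   # glyphs a composing terminal would draw
print(ascii(backspace_code_point(line)))
print(ascii(backspace_sequence(line)))
```

The display goes wrong when the program editing the line and the terminal drawing it disagree about which of these two counts to use, which is what I was seeing with bash in gnome-terminal.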

Further, insisting on display of composed sequences could create spacing problems in some scripts, such as Tibetan. With this script, one will often come across characters that, when properly composed, become quite tall. This would mandate that all proper console rendering would need to deal with vertical character spacing in strange and possibly unpredictable ways. Indic scripts or scripts like New Tai Lue, where logical and visual order do not necessarily match, would also have to be handled.

The best option, perhaps, is for community members to gather and discuss the direction of Unicode text on the console. How much support for complex scripts should there be? Will terminals need to be able to do visual reordering for Indic scripts? Will vertical character stacks be supported? Will consoles need to be able to render RTL as well? The terminals we use are relatively simple, and I think many would see this simplicity as a virtue. Full support for Unicode rendering would greatly increase the required complexity, but the current support is mixed and inconsistent at best. I believe it to be imperative that some sort of consensus be reached as to how best to support Unicode and the potential it offers to Linux around the world.