News:ICU collation

From CODECS: Online Database and e-Resources for Celtic Studies

12 Apr 2016
Update (Dec 2016): I'm happy to know that this issue is now being targeted by ticket #2065 (GitHub). (May 2017): Better still, it is on the roadmap for SMW 3.0!

CODECS needs coders! This website is in need of an appropriate ICU collation (demo) with which to sort its semantic query results in a more satisfactory alphanumeric order. While MW (MediaWiki) supports various such collations (e.g. uca-default and uca-ga) for page listings in categories and has even begun to take tailorings for natural number sorting more seriously (link), SMW (Semantic MediaWiki) has not yet caught up with these developments. Fortunately, SMW has an open community and an open-source repository at GitHub. If you are a programmer with relevant skills and experience in this field, you are very much encouraged to contribute a fix and help out CODECS as well as many others around the globe! Please see the report in this link.

What should happen (uca-default)

[D]

  • de
  • De
  • di
  • Di

[E]

  • Ed
  • Éd
  • Érainn
  • Erc

[F]

  • Fiachu
  • Fíachu
  • Fróech

[Z]

  • Z 2
  • Z 10

What actually happens (uppercase)

[D]

  • De
  • Di

[E]

  • Ed
  • Erc

[F]

  • Fiachu
  • Fróech
  • Fíachu

[Z]

  • Z 10
  • Z 2

[É]

  • Éd
  • Érainn

[d]

  • de
  • di

This call for help is long overdue. In layman’s terms, collation is about comparing different character strings and putting them in a normalised order. Notoriously, default sort behaviour does not always follow the logic that most of us might expect to see and so may seem rather counterintuitive and unnatural. In that respect, SMW and most earlier versions of MW are no different: the uppercase and identity collations dictate a simple binary comparison of values but make insufficient linguistic or strategic – ‘natural’ – sense.

Consider the comparison above, which painfully demonstrates at least three key issues:

1.    As shown by the values under [D] / [d], the uppercase collation places entries beginning with uppercase A–Z before those beginning with lowercase a–z. Especially in lengthy lists of lexical items, the distance between results under [D] and those under [d] is unhelpful.

2.    As shown by the values under [E] / [É] and [F], characters with diacritics (accented letters) do not quite end up where you would want them to be. Results beginning with É, for instance, are relegated to a position after [Z]. Again, the problem worsens when the list gets to a certain length. Those familiar with orthographic variation in Irish would not be too pleased to discover that Fiachu and Fíachu, which represent the same name, are now living miles apart.

3.    As shown by the values under [Z], numbers are not sorted naturally: Z 10 ends up before Z 2 because digits are compared character by character rather than by numeric value.
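
The three issues can be reproduced with a few lines of Python, whose default string comparison is a plain code point (binary) comparison. This is only a sketch: it is not SMW's own code, and the exact position of the [É] group relative to [d] differs from the listing above, which presumably reflects the wiki's own storage and comparison rules rather than raw code points.

```python
import unicodedata

# Sample titles taken from the comparison above, normalised to NFC so that
# each accented letter is a single code point.
titles = [
    unicodedata.normalize("NFC", t)
    for t in ["de", "De", "di", "Di", "Ed", "Éd", "Érainn", "Erc",
              "Fiachu", "Fíachu", "Fróech", "Z 2", "Z 10"]
]

# Python compares strings code point by code point: a binary comparison
# with no linguistic knowledge.
binary_order = sorted(titles)

for title in binary_order:
    print(title)
```

In the output, De and Di precede de and di by a wide margin, Fróech slips in between Fiachu and Fíachu, Éd and Érainn land after everything in plain ASCII, and Z 10 comes before Z 2.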

For a long time, MediaWiki’s answer to this conundrum has been to tinker with the sortkey of each page: for instance, to add the sortkey Eriu for a page with the title Ériu. This approach has also been adopted here for quite a large number of pages and probably remains a convenient workaround for idiosyncratic cases, but the work is tedious and time-consuming. That is not all. Even if I could muster the volunteers necessary to take on such work, which is doubtful, that would still leave the even larger number of page-type values, that is, values of a semantic property declared with the datatype Page for which no dedicated page is available.
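
For titles whose only peculiarity is an accent, a sortkey such as Eriu could in principle be derived mechanically rather than typed by hand. The helper below is hypothetical (the name derive_sortkey is mine, not part of MW or SMW) and merely illustrates the idea by dropping combining marks after Unicode decomposition.

```python
import unicodedata


def derive_sortkey(title: str) -> str:
    """Hypothetical helper: strip combining marks, so 'Ériu' becomes 'Eriu'."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", title)
    # Keep only the characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for title in ["Ériu", "Fíachu", "Fróech"]:
    print(title, "->", derive_sortkey(title))
```

Letters that have no Unicode decomposition pass through unchanged, and the sketch says nothing about where such keys would be stored for page-type values without a page of their own, so it does not remove the need for a proper collation.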

Of course, there is no one-size-fits-all solution. Because languages and writing systems have different needs, there is no single sort order which can satisfy them all at the same time. As a basis for our mixed, multilingual indexes, however, UCA (Unicode Collation Algorithm) – or its open-source implementation ICU, to be precise – is comfortably adequate: it solves #1 and #2 and offers a tailoring for #3. I have already switched to uca-xx collation in MediaWiki core, but need SMW to follow suit.
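
To give a feel for why a multi-level comparison solves #1 to #3, here is a rough stand-in written with the Python standard library only. It is emphatically not UCA or ICU: the function approx_collation_key is a hypothetical illustration that compares base letters first (with runs of digits read as numbers), accents second and case third, which happens to be enough to reproduce the uca-default column for the sample values.

```python
import re
import unicodedata


def approx_collation_key(text: str):
    """Hypothetical three-level key; a simplification, not the real UCA."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Level 1: base letters, case-insensitive, digit runs compared as numbers.
    primary = tuple(
        (0, int(chunk), "") if chunk[0] in "0123456789" else (1, 0, chunk)
        for chunk in re.findall(r"[0-9]+|[^0-9]+", base.casefold())
    )
    # Level 2: accents; combining marks sort after any ASCII letter.
    secondary = decomposed.casefold()
    # Level 3: case; swapping case makes lowercase sort before uppercase.
    tertiary = base.swapcase()
    return (primary, secondary, tertiary)


# Sample values from the comparison, deliberately out of order.
titles = [
    unicodedata.normalize("NFC", t)
    for t in ["Z 10", "Fróech", "Érainn", "Di", "De", "Fíachu", "Erc",
              "Z 2", "Éd", "di", "Fiachu", "Ed", "de"]
]

approx_order = sorted(titles, key=approx_collation_key)

for title in approx_order:
    print(title)
```

A production setup would rely on ICU itself rather than on a hand-rolled key like this one, since the real algorithm covers far more scripts and language-specific tailorings.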
