Sunday, October 15, 2006

A great deal of the complexity in MediaWiki's language support appears to stem from PHP's relatively poor support for Unicode. Java's "near-UTF-16" strings are a lot less complicated to deal with, although not perfect: Java will treat a character represented by a surrogate pair as two characters, at least with 1.4 JREs. Fortunately, there are relatively few such characters. Another large hunk of the complexity deals with transitioning between older MediaWiki databases and current Unicode-based ones. I can reasonably insist that the database be UTF-8, UTF-16, or UCS-2 (UCS-4/UTF-32, not so much...) and not have to worry about on-the-fly conversion of outdated database formats. I don't expect the schema to survive intact, either, although I haven't made any schema changes yet; the most I promise to provide is some means of making a one-time, nonreversible conversion of a MediaWiki database to whatever I end up naming this thing.
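
To make the surrogate-pair caveat concrete, here is a minimal sketch. The class and method names are made up for illustration, and it assumes Java 5 or later, since `codePointCount` does not exist in 1.4. The point is only that `length()` counts UTF-16 code units, not characters.

```java
class SurrogateDemo {
    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points (needs Java 5 or later).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written out as its surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println(codeUnits(clef));  // 2
        System.out.println(codePoints(clef)); // 1
    }
}
```

Anything that indexes or truncates by `char` position can split such a pair in half, which is the kind of bug I would expect to hit with these characters.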

The other thing that makes this irritating is that, in MediaWiki at least, the same datatype is used for both byte streams and strings, even though one is a stream of bytes (which may or may not be a string encoded in some character set) and the other is a string (a stream of encoded characters). It's therefore rarely obvious what encoding (if any) the contents of a given PHP string in MediaWiki are in, which means I will probably make some errors in that regard down the road.
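
For contrast, here is a rough sketch of how Java keeps the two apart: `byte[]` for raw bytes, `String` for text, with the encoding named explicitly at the boundary. The class and method names are hypothetical, and `StandardCharsets` assumes a much newer JDK than 1.4 (on older JREs the charset-name overloads of `getBytes` and `new String` do the same job). This is not code from the port itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Text to bytes: the encoding has to be chosen explicitly.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Bytes to text: the caller must say which encoding the bytes are in.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        byte[] raw = encodeUtf8("\u00e9"); // e with acute accent, two bytes in UTF-8
        System.out.println(raw.length);
        // Right charset: the original character comes back.
        System.out.println(decode(raw, StandardCharsets.UTF_8).equals("\u00e9"));
        // Wrong charset: each byte becomes its own character (mojibake).
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length());
    }
}
```

With a single datatype for both, as in PHP, nothing forces that decision to be made at the boundary, which is exactly where I expect my mistakes to come from.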