Thursday, September 27, 2007

More on my MediaWiki port

So, some people have inquired about my port of MediaWiki to Java.

One commenter implied that this is a foolish, wasted venture because it's "already done". JAMWiki is a pure Java wiki that resembles MediaWiki, but it is not a port; rather, it is a from-scratch implementation that uses MediaWiki's markup syntax and mimics its behavior. It does not, however, use MediaWiki's database schema (a situation that they attribute to licensing issues). I have nothing but respect for what they're doing, but what they're doing is not what I'm doing. There are a number of other Java-implemented wikis out there, but I'm not attempting to compete with any of them. And, given my motivations (see below), even if it had already been done I'd still likely want to do it.

Another intriguing project that has been brought to my attention is Quercus, a native Java implementation of PHP. The Quercus people claim that Quercus runs PHP code significantly faster than the standard mod_php interpreter and is on a par with the performance offered by PHP accelerated by APC. Certainly an option for incremental development would be to run MediaWiki under Quercus, and then port portions of it to pure Java piece by piece. It has occurred to me that the Wikimedia Foundation might benefit from doing this (if nothing else, it would likely significantly simplify their interface with Lucene, which at the moment is done with a really grotty .NET interface), but of course it's not my place to advise Brion and company how to run their show. For now, however, I don't want to take the time to immerse myself in another product. Might be something to look at down the road, though.

But, fundamentally, neither of these really serves my interests. As I've noted before, more than once, I don't like PHP very much. On the other hand, I would like to better understand the MediaWiki code. What better way to understand a body of code than to port it to another language? At the end of this, I will have a better understanding of MediaWiki's codebase (I've already submitted three bug reports, all for admittedly minor items) and even more reasons to hate PHP; hopefully, I will also have a product that is a drop-in replacement for MediaWiki, outperforms it, and can be more easily modified, to boot.

As to why Java? At the moment, Java is my favorite language for large applications. I actually prefer C#'s generics to Java's, but I do not trust Microsoft enough to feel comfortable committing my labors to a language and runtime with dubious intellectual property issues. Java is far more open than C#, and so I'm much more comfortable (philosophically) developing to that environment. Furthermore, there is an open source Web engine for Java already (Tomcat) as well as well-known compatible high-performance enterprise products from Sun and IBM for someone who might want to use this product in an enterprise environment. I haven't worked much with Java in quite a while (I was a Java programmer back in 2001 for a while, but not much since) so this is also a chance to resharpen my skills.

Syntactically, PHP and Java are actually pretty close. I can mechanically convert PHP code to pretty-close-to-Java with just a handful of Perl scripts. I'm probably about halfway done with the first pass of rough code conversion (although admittedly the parser has not yet been done, and it will likely demand considerably more attention than many of the modules I've already done). Certainly it will be quite a long time before an actually executable product comes out of this, but I'm not on any timeline here. (Fortunately, MediaWiki makes very little use of the parts of PHP that are especially hard to port.)
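
To give a flavor of what "mechanical" means here, this is a minimal sketch of the kind of line-by-line rewriting involved. It is written in Java rather than Perl, and it is not the actual conversion scripts: the class name `PhpToJavaSketch`, the method `convertLine`, the particular rules, and the sample input lines are all made up for illustration. It only handles a few surface differences (variable sigils, `->` and `::` member access, `function` declarations) and will happily mangle a `$` inside a string literal, so the output still needs hand editing for types, string concatenation, arrays and so on.

```java
import java.util.regex.Pattern;

public class PhpToJavaSketch {
    // "$this->" becomes "this." in Java.
    private static final Pattern THIS_ARROW = Pattern.compile("\\$this->");
    // Any other "$name" variable reference loses its sigil.
    private static final Pattern VARIABLE =
            Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");
    // Member access "->" and static access "::" both become ".".
    private static final Pattern MEMBER_ACCESS = Pattern.compile("->|::");
    // "function name(" becomes a method with a placeholder return type.
    private static final Pattern FUNCTION =
            Pattern.compile("\\bfunction\\s+([A-Za-z_][A-Za-z0-9_]*)\\s*\\(");

    /** Applies the rewrite rules, in order, to one line of PHP source. */
    public static String convertLine(String php) {
        String out = php;
        out = THIS_ARROW.matcher(out).replaceAll("this.");
        out = VARIABLE.matcher(out).replaceAll("$1");
        out = MEMBER_ACCESS.matcher(out).replaceAll(".");
        // Return and parameter types are unknown at this stage; "Object"
        // is only a marker for later hand editing.
        out = FUNCTION.matcher(out).replaceAll("public Object $1(");
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input lines in the style of MediaWiki code.
        System.out.println(convertLine("function getText($title) {"));
        System.out.println(convertLine("$text = $this->mArticle->getContent();"));
        System.out.println(convertLine("$t = Title::newFromText($name);"));
    }
}
```

The result of a pass like this is not compilable Java; it is just close enough that the remaining work is editing rather than retyping.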

The curious may observe my progress via SVN. I chose the name "Myrtle" because I am fond of wood (my other main hobby these days is woodworking) and because my full name anagrams to "Kill Nanny Myrtle".

Once I've finished this project, I am very likely to work on automated PHP-to-Java porting tools. That's going to require writing a PHP parser, but that can't be terribly hard, now, can it?

4 comments:

  1. Wow, cool. (Your myrtle link doesn't work, by the way.)

    When you're done with this, do you want to talk about other possible projects? I have a million ideas, but a severe dearth of programming talent. (Though there was that one time I wrote a terrible wikisyntax parser in Ruby and AppleScript ...) For starters, there are a lot of brilliant client-side animation tools -- Apple's Core Animation comes to mind -- that could make an utterly kick-ass Wikipedia client possible.

  2. I'm not averse to talking about ideas, although this project will probably take me quite a while yet. (I've been working on it, on and off, for nearly a year, mind you.)

    I also fixed the link, my bad.

  3. Relevant to this old post: serious work is now proceeding on writing a reimplementable syntax for MediaWiki. See the new wikitext-l mailing list. Steve Bennett is doing the heavy lifting in ANTLR (rather than attempting it with EBNF). See http://www.mediawiki.org/wiki/Markup_spec . The hard part appears to be distinguishing quirks that are mere artifacts of the current parser from quirks that are linguistically important in languages other than English.

  4. Hey I heard my name mentioned*. Yeah, I'm doing a grammar in ANTLR, which coincidentally is producing a parser in Java. So in a sense I'm also producing a Java port of MediaWiki, though my focus is more on understanding and documenting the language.

    The biggest problems are really:

    1) ANTLR 3 is flaky and poorly documented. I hope it improves, or a lot of this effort is going to be wasted by switching to something else.
    2) The wikitext language is poorly defined, and there is little separation of "the parser renders input X as Y" and "X is an error which the parser happens to render as Y". Endless debate required...
    3) The wikitext language is hard to parse. Coming up with sensible tokens is a serious pain in the arse.

    Steve
    *Seriously. I was self-googling.
