Sunday, September 30, 2007

The seething idiocy continues

For the last several months, I've had the occasional person ask me if I would run for admin again on the English Wikipedia. I've rebuffed those requests, feeling that it hadn't been long enough. The one-year anniversary of my resignation just passed, however, and in recognition of that I decided to set the conditions under which I'd allow a request for adminship to take place: three separate people (none of them anyone I consider a troll), whom I dubbed my "fools", would have to jointly nominate me. I put notice of this on my user page; within 24 hours a group of my more ardent "fans" predictably turned up on my talk page to tell me what a vile person I was to even consider the possibility of ever being an admin again, prompting me to issue a clarification. (It's impressive how obsessed some Wikipedians can be with their petty feuds.) And only a few days after that, I was contacted on IRC and informed that three fools had come together and intended to nominate me. A few days later, their nomination was ready, and the RfA was off.

Unsurprisingly, it's failing, and failing badly, after less than 24 hours, just as I fully expected; my "fans" hate me too much to pass this up. Newyorkbrad (having previously tried to convince me not to allow the RfA) is now trying to convince me to allow it to be withdrawn. Apparently, it would be "bad" for Wikipedia for this RfA to run the usual seven days, since it has "no chance of passing". As I feared, the community is unwilling to actually discuss anything. I had a sliver of hope that a meaningful discussion would arise from the RfA, and there has been a small smattering of comments that might have led that way. Newyorkbrad's long, thoughtful comment could have been a step in that direction, but it was immediately met with a "tl;dr" from Nishkid64.

It would be my preference that the RfA be allowed to run the full seven days. I doubt it will; too many people will decide that having this "discussion" will do more harm than good and will cut it off. I'm going to save my comments on the individual "opinions" expressed in the RfA until after it's over. The only thing I will say is that the almost continuous assumptions of bad faith are just astounding.

The outcome of this RfA notwithstanding, my "three fools" offer will remain; the community can repeat this show as many times as three members wish it. You all know how to reach me.

Friday, September 28, 2007

Open Letter to Rod Blagojevich, the Illinois Toll Highway Authority, and the Illinois Department of Transportation

Dear Governor Blagojevich:

Thank you ever so much for making the newly-renamed Jane Addams Memorial Tollway so much less usable by implementing open-road tolling. Ever since the eastbound O'Hare toll plaza was restructured, the congestion there has been horrible. And it's entirely because of open-road tolling.

The entire open-road tolling project has operated on the assumption that doubled tolls for cash payers would give frequent users of the tollway system an incentive to purchase I-Pass units, because I-Pass holders can use the open-road tolling lanes, which collect tolls at full highway speed and thus mean less delay for travelers, and also get a significant discount on the tolls themselves. A nice side effect is that the Toll Highway Authority gets a ton of personalized data on the usage of the tollway system and saves on labor costs, because it costs a lot less to collect tolls electronically than manually. I'm not disputing any of this. And, from what I've seen, it actually seems to have helped on the Tri-State, which is mostly commuter traffic. But it's made things worse on the Jane Addams. Here's why.

The Jane Addams is, of course, also Interstate 90. Those of us who live in Chicago and the near suburbs probably think of this as "the road you use to get to Woodfield". When I lived in Niles, it was one of our routes to visit friends in Palatine, especially on weekends when the congestion would be relatively light. But for anybody living west of the Mississippi and north of the 40th parallel, it's "the road you use to get to New York City". Interstate 90, the nation's longest interstate, runs from Seattle all the way to Boston, and is heavily used by cross-country travelers (not to mention truckers). As a result, there are a lot of people on this road -- at all hours of the day and night -- who are not residents of northern Illinois and have no reason to obtain an I-Pass. These people all have to go through the manual toll plazas and pay the penalty toll that the Toll Highway Authority charges to try to convince people to buy I-Passes. There are so many of these travelers that the congestion backs up from the small number of toll plazas that remain here (the conversion to open-road tolling cut the number of manual and automatic toll collection lanes at least in half, if not more) past the point where the high-speed open-road tolling lanes separate from the main roadway. As a result, I-Pass holders have to wait for non-I-Pass holders to pay their tolls, which just compounds the congestion. I've gone through here at 11pm on a Saturday and encountered substantial congestion at this plaza. There's simply no excuse for that.

I don't have a problem with the Toll Highway Authority charging penalty tolls to out-of-area travelers; that means more money for the Tollway Authority taken mainly from people who aren't Illinois taxpayers, which brings money into the state and gives the system more cash to keep the roads improved. But can we please do one little thing? How about advising through traffic heading to Indiana and points east to divert from Interstate 90 onto Interstate 290 (the Eisenhower Extension) in Schaumburg, from there onto Interstate 294 (the southbound Tri-State) in Hillside, and then onto Interstate 80 into Indiana? They can then pick up Interstate 90 again where 80 and 90 come together in northern Indiana. This will actually increase toll revenues for the Toll Highway Authority, because there are either two or three toll plazas on the Tri-State that this traffic will pass through, instead of the single plaza at the end of the Addams. It will, of course, reduce revenue on the Skyway (sorry, Mayor Daley) and the Indiana Toll Road, but at the same time it'll also reduce congestion on the Kennedy and Dan Ryan.

Another thing that might help is convincing the route mapping services to recommend the bypass routes around Chicago instead of the direct through route. I ran test routes from Minneapolis to New York on both Google Maps and Mapquest. Both advised me to take I-90 all the way through Illinois, which means taking the Kennedy and Dan Ryan straight through the city. That is certainly the shorter route, but it's almost never the faster one.

How hard would it be to put up "THROUGH TRAFFIC TO INDIANA USE I-290 EAST" signs approaching the I-90/I-290 interchange in Schaumburg (and then "THROUGH TRAFFIC TO INDIANA USE I-294 SOUTH" on the Eisenhower near Elmhurst)? Doesn't seem like it would be that hard. And it might make the Jane Addams more useful for the people who actually live in and around Chicagoland -- and vote for things like Governor.

Just a thought. I suppose this might make congestion on the Ike and the Tri-State worse. Maybe we should just get rid of the tollways entirely. That would be better for the environment, too, you know....

Thursday, September 27, 2007

More on my MediaWiki port

So, some people have inquired about my port of MediaWiki to Java.

One commentator implied that this is a foolish/wasted venture because it's "already done". JAMWiki is a pure Java wiki that resembles MediaWiki, but it is not a port; rather, it is a from-scratch implementation that uses MediaWiki's markup syntax and mimics its behavior. However, it does not use MediaWiki's database schema (a situation its developers attribute to licensing issues). I have nothing but respect for what they're doing, but what they're doing is not what I'm doing. There are a number of other Java-implemented wikis out there, but I'm not attempting to compete with any of them. And, given my motivations (see below), even if it had already been done I'd still likely want to do it.

Another intriguing project that has been brought to my attention is Quercus, a native Java implementation of PHP. The Quercus people claim that Quercus runs PHP code significantly faster than the standard mod_php interpreter and on a par with PHP accelerated by APC. One option for incremental development would be to run MediaWiki under Quercus and then port portions of it to pure Java piece by piece. It has occurred to me that the Wikimedia Foundation might benefit from doing this (if nothing else, it would likely significantly simplify their interface with Lucene, which at the moment is done with a really grotty .NET interface), but of course it's not my place to advise Brion and company how to run their show. But at the moment I don't want to take the time to immerse myself in yet another product. Might be something to look at down the road, though.

But, fundamentally, neither of these really serves my interests. As I've noted before, more than once, I don't like PHP very much. On the other hand, I would like to understand the MediaWiki code better. What better way to understand a body of code than to port it to another language? At the end of this, I will have a better understanding of MediaWiki's codebase (I've already submitted three bug reports, all for admittedly minor items) and even more reasons to hate PHP; with luck, I will also have a drop-in replacement for MediaWiki that outperforms it and is easier to modify, to boot.

As to why Java? At the moment, Java is my favorite language for large applications. I actually prefer C#'s generics to Java's, but I do not trust Microsoft enough to feel comfortable committing my labors to a language and runtime with dubious intellectual property issues. Java is far more open than C#, so I'm much more comfortable (philosophically) developing for that environment. Furthermore, there is already an open source servlet container for Java (Tomcat), as well as well-known compatible high-performance enterprise products from Sun and IBM for anyone who might want to use this product in an enterprise environment. I haven't worked much with Java in quite a while (I was a Java programmer back in 2001, but have done little with it since), so this is also a chance to resharpen my skills.

Syntactically, PHP and Java are actually pretty close. I can mechanically convert PHP code to pretty-close-to-Java with just a handful of Perl scripts. I'm probably about halfway through the first pass of rough code conversion (although admittedly the parser remains untouched, and it will likely demand considerably more attention than many of the other modules I've already converted). It will certainly be quite a long time before an actually executable product comes out of this, but I'm not on any timeline here. (Fortunately, MediaWiki makes very little use of the parts of PHP that are especially hard to port.)
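To give a sense of what I mean by "pretty close" (this is a made-up example, not actual MediaWiki code, and the names are mine), here is a trivial PHP function, shown in comments, alongside the sort of near-mechanical Java rendering my scripts aim to produce:

    // A hypothetical PHP function (not taken from MediaWiki):
    //
    //   function wordCount( $text ) {
    //       $count = 0;
    //       foreach ( explode( ' ', trim( $text ) ) as $word ) {
    //           if ( $word !== '' ) {
    //               $count++;
    //           }
    //       }
    //       return $count;
    //   }
    //
    // and a near-literal Java rendering of it; the control flow is unchanged,
    // only the declarations and library calls differ:
    public class WordCounter {
        public static int wordCount(String text) {
            int count = 0;
            for (String word : text.trim().split(" ")) {   // explode() becomes split()
                if (word.length() > 0) {                   // $word !== '' becomes a length check
                    count++;
                }
            }
            return count;
        }

        public static void main(String[] args) {
            System.out.println(wordCount("  the quick  brown fox  "));  // prints 4
        }
    }

Of course, the real work is in the dynamic features and the library calls, not the surface syntax.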

The curious may observe my progress via SVN. I chose "Myrtle" because I am fond of wood (my other main hobby these days is woodworking) and because my full name anagrams to "Kill Nanny Myrtle".

Once I've finished this project, I am very likely to work on automated PHP-to-Java porting tools. That's going to require writing a PHP parser, but that can't be terribly hard, now, can it?

Saturday, September 08, 2007

Devolving power

One of Wikipedia's major problems today is the role of administrator. Administrators have substantially more power than non-administrators; they and only they get to decide what stays and what goes, and for that matter who stays and who goes. Although Jimbo has long tried to impose the notion that "administrators are no big deal", the simple fact is that being an administrator is a big deal and people will go to considerable lengths to gain that power.

Devolving power from administrators seems an obvious thing to do. The problem is finding ways to diffuse the power currently held by administrators and devolve it into the community without creating a lot of bureaucracy. The community already votes on deletions, and it's quite obvious that this isn't working terribly well. Voting on blocks is even more problematic, because a block decision practically amounts to a trial, and trial by popular vote is a really bad idea.

Deletion is especially easy to devolve, so easy in fact that there's been a proposal to do it since October 2003. The so-called "pure wiki deletion system" (which is labeled on the English Wikipedia as a "rejected policy") effectively devolves the deletion decision away from administrators by allowing any editor to delete any article. It also resolves one of my personal gripes with deletion on Wikipedia: the fact that most editors cannot retrieve deleted articles, even when those articles have been deleted merely for being "nonnotable". Right now, deletion on Wikipedia is used for two totally different purposes: one is to remove articles deemed "unworthy of inclusion in the encyclopedia", and the other is to remove content which is legally, morally, or ethically problematic. The problem with using the same mechanism for both is that the first leaves too few people with the ability to see the "unworthy" article, while the second leaves the dangerous content where too many can see it -- any of 1300+ admins, whom we really have no reason to trust, because the selection process does a piss-poor job of screening for trustworthiness.

The pure wiki deletion system allows "ordinary" deletions to be made, and reversed, by anyone; anyone can also come along, examine the prior content, and reuse it in some other article (or for some other purpose entirely), restore any prior version of the article, or write a better version. The current practice loses the history, which encourages people to repeat the same mistakes that came before. "Extraordinary" deletions, the sort required by copyright, libel, or other legal, moral, or ethical standards, would be exercised by a much smaller, more carefully chosen group with a deletion power similar to the current badly-named "oversight" privilege. I don't see any way around that power being held by a small group; the risk of abuse is too high, as we already well know. Devolving this power is one of the problems I haven't figured out how to deal with... fundamentally, some things are going to have to be held by only a limited number of people; the trick becomes choosing those people wisely, something the Wikipedia community has not shown much capacity for to date.
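For what it's worth, the mechanics are easy enough to sketch. Here's a rough illustration in Java (a toy model of my own, not MediaWiki's actual schema or code) of a revision history in which ordinary deletion is just another revision that anyone can make or revert, while suppression of dangerous content is a separate, narrowly held operation:

    import java.util.ArrayList;
    import java.util.List;

    // Toy model only; the names and structure are mine, not MediaWiki's.
    public class PureWikiDeletionSketch {

        static class Revision {
            final String text;
            boolean suppressed = false;        // set only by a small, trusted group
            Revision(String text) { this.text = text; }
        }

        static class Page {
            private final List<Revision> history = new ArrayList<Revision>();

            void edit(String newText) {        // anyone may edit...
                history.add(new Revision(newText));
            }

            void delete() {                    // ...and "deletion" is just a blanking edit
                edit("");
            }

            void restore(int revisionIndex) {  // anyone may bring back a prior version
                Revision old = history.get(revisionIndex);
                if (!old.suppressed) {
                    edit(old.text);
                }
            }

            void suppress(int revisionIndex) { // restricted: hides legally dangerous text
                history.get(revisionIndex).suppressed = true;
            }

            String currentText() {
                Revision last = history.get(history.size() - 1);
                return last.suppressed ? "" : last.text;
            }
        }

        public static void main(String[] args) {
            Page p = new Page();
            p.edit("An article somebody thinks is nonnotable.");
            p.delete();                        // any editor can do this
            p.restore(0);                      // and any editor can undo it
            System.out.println(p.currentText());
        }
    }

The point is not the details but the separation: the common case needs no special privilege at all, and the rare, dangerous case is handled by a narrowly held power.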

So the real question is, why hasn't the pure wiki deletion system been implemented? It's not technically very challenging; I suspect I could make the necessary code changes in a few days, and someone more familiar with the MediaWiki codebase could probably do it faster than that. No, the problem here is that it devolves power from those who have it. And since the people who would be losing the power are the people who at least have a large say in whether or not they will lose that power, they naturally resist it. Entrenchment's a bitch, isn't it?

So, while it's certainly quite possible to think of ways to make Wikipedia run better or more reasonably, there's no hope of getting them implemented, precisely because the people who stand to lose the most through such changes are in a position to prevent them from happening. There is no leadership to push any real effort at reform; in fact, there is a very strong community attitude against having leaders of any sort.

So at this point I don't have much hope that any of my proposals, suggestions, or ideas will ever see the light of day on Wikipedia, but I do hope that they will inform the people who build the next online encyclopedia project -- the one that will eventually replace Wikipedia, so that they at least do not mindlessly repeat the same mistakes that Wikipedia made.

Friday, September 07, 2007

Why I hate PHP

I've been working quite a bit lately on my project to rewrite MediaWiki in Java. Doing so has reminded me of why I dislike languages like PHP. There are a couple of specific things I'm referring to here, so I'll elaborate.

First, PHP is a weakly typed language. Any variable can hold a value of any type supported by the language, and the runtime will gleefully convert between types willy-nilly in the event that one accidentally tries to, for example, pass a variable containing an array to something that more reasonably expects an integer. A strongly typed language like Java (or, to some degree, C) will refuse to let you do something like this; either the compiler will throw an error or the runtime (in Java, at least) will throw an exception. PHP, though, assumes that the programmer intended to do this and provides a result, although not necessarily a useful one. While this is occasionally useful (and in fact many PHP and Perl programmers take advantage of implicit type conversions to save typing and make their code "more clever"), it also introduces many opportunities for bugs. The same is true of PHP's lazy handling of function and method arguments; putting the wrong number or type of arguments on a function call is an error in Java, but is gleefully ignored in PHP. I've found at least two instances in MediaWiki 1.11 where there's a mismatch between a function's definition and its invocation, although both are in low-probability codepaths as far as I can tell.
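To illustrate the contrast (a made-up example, not anything from MediaWiki): the equivalent mistakes simply refuse to compile in Java.

    public class TypeCheckDemo {
        static int addTax(int price, int taxPercent) {
            return price + (price * taxPercent) / 100;
        }

        public static void main(String[] args) {
            int total = addTax(100, 8);              // fine: 108

            // Each of the following is a compile-time error in Java, whereas
            // PHP would silently coerce the arguments or ignore the problem:
            //
            //   addTax("100", 8);        // wrong type: String where int expected
            //   addTax(new int[]{100});  // wrong type and wrong argument count
            //   addTax(100);             // too few arguments
            //   addTax(100, 8, 3);       // too many arguments

            System.out.println(total);
        }
    }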

Another aspect that makes PHP a less desirable language for me is that data structures in PHP are far too fluid. An object's class determines only what methods are available to it; it doesn't determine what the instance variables are. An object can have members that aren't mentioned directly anywhere, because PHP lets you both create and access instance variables indirectly. Java allows indirect access using reflection, but there's no way to create members at run time. The programmer can also create local (and global) variables with names that are completely arbitrary (using $$var or extract) and impossible to predict at compile time. This introduces great opportunity for bugs and other forms of unexpected behavior, and also results in terribly cryptic code at times. The "mysql_fetch_object" function is a particular pet peeve.
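Again, a made-up Java example to show the difference: reflection lets you reach a field indirectly, but only a field the class actually declares; there is no equivalent of conjuring a brand-new member into existence at run time.

    import java.lang.reflect.Field;

    public class ReflectionDemo {
        static class Point {
            public int x = 3;
            public int y = 4;
        }

        public static void main(String[] args) throws Exception {
            Point p = new Point();

            // Java does allow indirect access to fields that already exist...
            Field f = Point.class.getField("x");
            System.out.println(f.getInt(p));        // prints 3

            // ...but the set of fields is fixed by the class definition.
            // There is no analogue of PHP's  $p->$name = $value;  creating
            // a new member on the fly: asking for one simply fails.
            try {
                Point.class.getField("z");          // no such field
            } catch (NoSuchFieldException e) {
                System.out.println("no field z: the structure is fixed at compile time");
            }
        }
    }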

But the real problem I have with PHP (and also with Perl and to a lesser degree with Python) is the use of the regular expression as the universal solution to virtually any problem. MediaWiki is an especially fine example of this particular syndrome: MediaWiki's parser is written basically as a very complicated series of regular expressions. This leads to a number of related problems. First, regular expressions are typically not cheap to use, and especially not PCRE extended regular expressions. What looks to be a simple regexp match may actually be a very expensive operation. Regular expressions are also cryptic, which makes them difficult to understand and to modify. This is yet another factor that adds to the unmaintainability of PHP code.
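As a made-up illustration of both complaints (my own toy example, not MediaWiki's actual heading handling), here is roughly the same check written with a regular expression and with plain string operations; the regex is shorter, but it is harder to read and its cost is harder to reason about:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexCostDemo {
        // Compiled once; recompiling a pattern on every call is a common hidden cost.
        private static final Pattern HEADING = Pattern.compile("^(={2,6})\\s*(.*?)\\s*\\1\\s*$");

        // The regex version: compact, but cryptic, and backtracking makes its
        // cost hard to predict on pathological input.
        static String headingTextRegex(String line) {
            Matcher m = HEADING.matcher(line);
            return m.matches() ? m.group(2) : null;
        }

        // Roughly the same check with ordinary string operations: longer, but
        // obvious, and its cost is plainly linear in the line length.
        static String headingTextPlain(String line) {
            String s = line.trim();
            int level = 0;
            while (level < s.length() && s.charAt(level) == '=') level++;
            if (level < 2 || level > 6) return null;
            if (s.length() < 2 * level) return null;
            if (!s.endsWith(s.substring(0, level))) return null;
            return s.substring(level, s.length() - level).trim();
        }

        public static void main(String[] args) {
            System.out.println(headingTextRegex("== History =="));   // prints "History"
            System.out.println(headingTextPlain("== History =="));   // prints "History"
        }
    }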

Another pet peeve (which I've blogged about before) that shows up in PHP and Perl code a lot is the use of array types as anonymous structure types. This is bad for a bunch of reasons: it decreases maintainability for all the same reasons already mentioned above, and it can also impair performance, because pulling data out of a general-purpose array structure costs more than using quick offsets into a fixed layout the way a dedicated data structure object can (although in PHP you end up paying those costs all the time anyway, because every object is itself a general-purpose array structure).
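In Java terms (another made-up example), it's the difference between stuffing a record into a general-purpose map and giving it a real class:

    import java.util.HashMap;
    import java.util.Map;

    public class StructureDemo {
        // The "anonymous structure" style, as one would write it in PHP with an
        // array: nothing documents which keys exist, a typo in a key name is not
        // caught, and every access is a hash lookup plus a cast.
        static Map<String, Object> makeUserAsMap() {
            Map<String, Object> user = new HashMap<String, Object>();
            user.put("name", "kelly");
            user.put("editCount", 1234);
            return user;
        }

        // The dedicated-structure style: the fields are spelled out, the compiler
        // checks every access, and reads are simple offsets instead of hash lookups.
        static class User {
            final String name;
            final int editCount;
            User(String name, int editCount) {
                this.name = name;
                this.editCount = editCount;
            }
        }

        public static void main(String[] args) {
            Map<String, Object> m = makeUserAsMap();
            int edits = (Integer) m.get("editCount");   // cast needed; a typo'd key fails only at run time
            System.out.println(edits);

            User u = new User("kelly", 1234);
            System.out.println(u.editCount);            // checked at compile time, no cast, no lookup
        }
    }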

I'm doing my port to Java not specifically because I like Java but because Java is the best general-purpose language for which I have a good IDE and compiler. I actually like C# better (C#'s generics are better than Java's), but there's no C# environment that isn't Microsoft-encumbered, so I don't use it, in part because I don't want to pay for it and in part because I don't want to deal with Microsoft's licensing. C++ has its own set of problems; one of my major gripes is operator overloading, which allows a programmer to create counterintuitive definitions for primitive operations and frequently conceals expensive operations from the programmer, which leads to inefficient code.

Some time ago, Greg Maxwell shared with me the results of profiling one of Wikimedia's Apache servers (running, of course, MediaWiki). I seem to recall that the server spent a good chunk of its time in either the regular expression handler or various PHP symbol table manipulation functions. (See, for example, these profiling results and note that 8.8% of the CPU is being used either to find things in hash tables or update things in hash tables, most of which is symbol table and array index manipulation, and almost 2.5% is being used to run match, which is PHP's primary regexp processing routine.) There has to be a way to eliminate at least some of that overhead.