Friday, September 07, 2007

Why I hate PHP

I've been working quite a bit lately on my project to rewrite MediaWiki in Java. Doing so has reminded me of why I dislike languages like PHP. There are a couple of specific things that I'm referring to here, so I'll elaborate.

First, PHP is a weakly typed language. Any variable can hold a value of any type supported by the language, and the runtime will gleefully convert between types willy-nilly in the event that one accidentally tries to, for example, pass a variable containing an array to something that more reasonably expects an integer. A strongly typed language like Java (or, to some degree, C) will refuse to let you do something like this; either the compiler will report an error or the runtime (in Java, at least) will throw an exception. PHP, though, assumes that the programmer intended to do this and provides a result, although not necessarily a useful one. While this is occasionally useful (and in fact many PHP and Perl programmers take advantage of implicit type conversions to save typing and make their code "more clever"), it also introduces many opportunities for bugs. The same is true of PHP's lazy handling of function/method arguments; putting the wrong number or type of arguments on a function call is an error in Java, but is quietly ignored in PHP. I've found at least two instances in MediaWiki 1.11 where there's a mismatch between a function's definition and its invocation, although both are in low-probability codepaths as far as I can tell.
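To make the Java side of that comparison concrete, here is a minimal sketch. The class and method names (StrictTypingSketch, doubleIt, describeAsInteger) are made up for illustration and have nothing to do with MediaWiki. The commented-out calls are the ones the compiler refuses; the cast shows the run-time check.

```java
final class StrictTypingSketch {
    // Accepts exactly one int; the compiler rejects any other argument list.
    static int doubleIt(int n) {
        return n * 2;
    }

    // Tries to treat an arbitrary value as an Integer. A mismatch is reported
    // as a ClassCastException rather than being converted silently.
    static String describeAsInteger(Object value) {
        try {
            Integer n = (Integer) value; // checked at run time
            return "integer " + n;
        } catch (ClassCastException e) {
            return "rejected " + value.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(doubleIt(21));
        // doubleIt(new int[] {1, 2}); // does not compile: int[] is not an int
        // doubleIt(1, 2);             // does not compile: wrong number of arguments
        System.out.println(describeAsInteger(21));
        System.out.println(describeAsInteger(new int[] {1, 2}));
    }
}
```

The point is only that the mistake surfaces either at compile time or as an exception, instead of producing a quietly converted value.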

Another aspect that makes PHP a less desirable language for me is that data structures in PHP are way too fluid. An object's class only determines what methods are available to it; it doesn't determine what the object's instance variables are. An object can have members that aren't even mentioned directly anywhere, because PHP lets you both create and access instance variables indirectly. Java allows indirect access using reflection, but there's no way to create members at run time. The programmer can also create local (and global) variables whose names are completely arbitrary (using $$var or extract) and cannot be predicted at compile time. This also introduces great opportunity for bugs and other forms of unexpected behavior, and also results in terribly cryptic code at times. The "mysql_fetch_object" function is a particular pet peeve.
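Here is a rough sketch of the Java half of that claim: reflection can look up a field by a name computed at run time, but only if the class already declares it. The Page class and the readField helper are hypothetical names invented for this example.

```java
import java.lang.reflect.Field;
import java.util.Optional;

final class FixedShapeSketch {
    // Hypothetical class; its set of fields is fixed when it is compiled.
    static final class Page {
        public String title = "Main Page";
    }

    // Looks up a public field by name. Returns an empty Optional when the
    // class declares no such field (or when the field holds null).
    static Optional<Object> readField(Object target, String name) {
        try {
            Field f = target.getClass().getField(name);
            return Optional.ofNullable(f.get(target));
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Page page = new Page();
        System.out.println(readField(page, "title"));    // declared, so it is found
        System.out.println(readField(page, "subtitle")); // never declared, so empty
    }
}
```

The reflection API has no call for adding a "subtitle" field to Page at run time; the lookup can only fail.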

But the real problem I have with PHP (and also with Perl and to a lesser degree with Python) is the use of the regular expression as the universal solution to virtually any problem. MediaWiki is an especially fine example of this particular syndrome: MediaWiki's parser is written basically as a very complicated series of regular expressions. This leads to a number of related problems. First, regular expressions are typically not cheap to use, and especially not PCRE extended regular expressions. What looks to be a simple regexp match may actually be a very expensive operation. Regular expressions are also cryptic, which makes them difficult to understand and to modify. This is yet another factor that adds to the unmaintainability of PHP code.
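As a small illustration (not MediaWiki's actual parser, and with a deliberately simplified notion of a "[[link]]"), here is the same extraction written once as a regular expression and once as a plain scan. All names are invented for the example, and no timings are claimed; the sketch only shows that the scan is short and that every step it takes is visible in the code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkScanSketch {
    // Simplified link syntax assumed for illustration: [[target]]
    private static final Pattern LINK =
            Pattern.compile("\\[\\[(.*?)\\]\\]", Pattern.DOTALL);

    // Regex version: compact, but the matching work is hidden in the engine.
    static List<String> linksByRegex(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Hand-written version: two indexOf calls per link.
    static List<String> linksByScan(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = text.indexOf("[[", pos);
            if (open < 0) break;
            int close = text.indexOf("]]", open + 2);
            if (close < 0) break;
            out.add(text.substring(open + 2, close));
            pos = close + 2;
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "See [[Main Page]] and [[Help:Contents]].";
        System.out.println(linksByRegex(text));
        System.out.println(linksByScan(text));
    }
}
```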

Another pet peeve (which I've blogged about before) that shows up in PHP and Perl code a lot is the use of array types as anonymous structure types. This is bad for a couple of reasons. First, it decreases maintainability for all the same reasons already mentioned above. Second, it can impair performance because of the cost of pulling data out of a general-purpose array structure instead of using quick offsets into a fixed structure, the way you can with a dedicated data structure object (although really you end up paying those costs all the time in PHP, because all objects are a "general purpose array structure").
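The maintainability half of that complaint can be sketched in Java by comparing a Map used as an ad hoc record with a small dedicated class. Revision and revisionAsMap are hypothetical names chosen for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class StructSketch {
    // Dedicated structure: the fields and their types are fixed and checked.
    static final class Revision {
        final int id;
        final String author;

        Revision(int id, String author) {
            this.id = id;
            this.author = author;
        }
    }

    // The "array as anonymous structure" style, imitated with a Map.
    static Map<String, Object> revisionAsMap(int id, String author) {
        Map<String, Object> m = new HashMap<>();
        m.put("id", id);
        m.put("author", author);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> loose = revisionAsMap(42, "alice");
        System.out.println(loose.get("auther")); // misspelt key: null, no error

        Revision strict = new Revision(42, "alice");
        System.out.println(strict.author);
        // System.out.println(strict.auther);    // does not compile
    }
}
```

With the map, the misspelt key is only noticed when something downstream trips over the null; with the class, the compiler catches it.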

I'm doing my port to Java not specifically because I like Java but because Java is the best general-purpose language for which I have a good IDE and compiler. I actually like C# better (C#'s generics are better than Java's), but there's no C# environment that isn't Microsoft-encumbered and so I don't use it, in part because I don't want to pay for it and in part because I don't want to deal with Microsoft's licensing. C++ has its own set of problems; one of my major gripes is with operator overloading, which allows a programmer to create counterintuitive definitions for primitive operations and frequently conceals expensive operations from the programmer, which leads to inefficient code.

Some time ago, Greg Maxwell shared with me the results of profiling one of Wikimedia's Apache servers (running, of course, MediaWiki). I seem to recall that the server spent a good chunk of its time in either the regular expression handler or various PHP symbol table manipulation functions. (See, for example, these profiling results and note that 8.8% of the CPU is being used either to find things in hash tables or update things in hash tables, most of which is symbol table and array index manipulation, and almost 2.5% is being used to run match, which is PHP's primary regexp processing routine.) There has to be a way to eliminate at least some of that overhead.