I've been working quite a bit lately on my project to rewrite MediaWiki in Java. Doing so has reminded me of why I dislike languages like PHP. There are a couple of specific things I'm referring to here, so I'll elaborate.
First, PHP is a weakly typed language. Any variable can hold a value of any type supported by the language, and the runtime will gleefully convert between types willy-nilly in the event that one accidentally tries to, for example, pass a variable containing an array to something that more reasonably expects an integer. A strongly typed language like Java (or, to some degree, C) will refuse to let you do something like this; either the compiler will report an error or the runtime (in Java, at least) will throw an exception. PHP, though, assumes that the programmer intended to do this and provides a result, although not necessarily a useful one. While this is occasionally useful (and in fact many PHP and Perl programmers take advantage of implicit type conversions to save typing and make their code "more clever"), it also introduces many opportunities for bugs. The same is true of PHP's lazy handling of function/method arguments; putting the wrong number or type of arguments on a function call is an error in Java, but is gleefully ignored in PHP. I've found at least two instances in MediaWiki 1.11 where there's a mismatch between a function's definition and its invocation, although both are in low-probability codepaths as far as I can tell.
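To make the Java side of that contrast concrete, here is a minimal sketch. The class and method names (StrictTypingSketch, doubleIt, castFails) are made up for illustration and have nothing to do with MediaWiki or my port. The two commented-out calls are the kind of thing the compiler rejects outright; the cast shows the run-time check that catches what the compiler cannot see.

```java
class StrictTypingSketch {
    // Hypothetical helper that only accepts an int.
    static int doubleIt(int n) {
        return n * 2;
    }

    // Returns true if casting the value to Integer fails at run time.
    static boolean castFails(Object value) {
        try {
            Integer n = (Integer) value; // checked cast: throws if value is not an Integer
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Object notANumber = new int[] {1, 2, 3};

        // doubleIt(notANumber); // does not compile: an Object is not an int
        // doubleIt(1, 2);       // does not compile: wrong number of arguments

        System.out.println(doubleIt(21));          // prints 42
        System.out.println(castFails(notANumber)); // prints true
        System.out.println(castFails(42));         // prints false
    }
}
```

Neither mistake survives long enough to produce a "result" the way it would in PHP: one never gets past the compiler, and the other fails loudly at the exact line where the wrong type is used.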
Another aspect that makes PHP a less desirable language for me is that data structures in PHP are way too fluid. An object's class only determines what methods are available to it; it doesn't determine what the instance variables are. An object can have members that aren't even mentioned directly anywhere, because PHP lets you both create and access instance variables indirectly. Java allows indirect access using reflection, but there's no way to create members at run time. The programmer can also create local (and global) variables with completely arbitrary names (using $$var or extract) that cannot be predicted at compile time. This also introduces great opportunity for bugs and other forms of unexpected behavior, and results in terribly cryptic code at times. The "mysql_fetch_object" function is a particular pet peeve.
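As a rough illustration of the Java half of that comparison, the sketch below reads a field by name through reflection. The Page class and the readField helper are hypothetical names invented for this example. Asking for a field the class never declared raises NoSuchFieldException rather than quietly creating it.

```java
import java.lang.reflect.Field;

class ReflectionSketch {
    // Hypothetical class standing in for some record type.
    static class Page {
        public String title = "Main Page";
    }

    // Reads a public field by name; returns null if the class declares no such field.
    static Object readField(Object target, String name) {
        try {
            Field f = target.getClass().getField(name);
            return f.get(target);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Page p = new Page();
        System.out.println(readField(p, "title"));  // prints Main Page
        System.out.println(readField(p, "rev_id")); // prints null: Page has no such field
    }
}
```

The set of fields is fixed by the class definition, so anything that shows up at run time can also be found by reading the source.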
But the real problem I have with PHP (and also with Perl and to a lesser degree with Python) is the use of the regular expression as the universal solution to virtually any problem. MediaWiki is an especially fine example of this particular syndrome: MediaWiki's parser is written basically as a very complicated series of regular expressions. This leads to a number of related problems. First, regular expressions are typically not cheap to use, and especially not PCRE extended regular expressions. What looks to be a simple regexp match may actually be a very expensive operation. Regular expressions are also cryptic, which makes them difficult to understand and to modify. This is yet another factor that adds to the unmaintainability of PHP code.
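Here is a small sketch of what the alternative can look like for one very simple case: pulling the targets out of [[...]] links. This is not MediaWiki's parser and not code from my port; the class and method names are invented, and real wikitext links (pipes, namespaces, nesting) need far more than this. The point is only that the hand-written scan does an obviously bounded amount of work, while the cost of the regex version depends on how the pattern and the engine interact.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkScanSketch {
    // Compiled once and reused; compiling a pattern on every call adds avoidable cost.
    private static final Pattern LINK = Pattern.compile("\\[\\[(.*?)\\]\\]");

    // Regex version: collect the text between each "[[" and the next "]]".
    static List<String> linksByRegex(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = LINK.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    // Scan version: the same extraction using plain substring searches.
    static List<String> linksByScan(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (true) {
            int open = text.indexOf("[[", pos);
            if (open < 0) break;
            int close = text.indexOf("]]", open + 2);
            if (close < 0) break; // no closing brackets anywhere ahead: stop
            out.add(text.substring(open + 2, close));
            pos = close + 2; // never revisit text already consumed
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "See [[Main Page]] and [[Help:Contents]].";
        System.out.println(linksByRegex(text)); // prints [Main Page, Help:Contents]
        System.out.println(linksByScan(text));  // prints [Main Page, Help:Contents]
    }
}
```

For single-line input like this the two methods return the same list. The scan is longer, but every step is visible, which helps with both the cost and the readability complaints.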
Another pet peeve (which I've blogged about before) that shows up in PHP and Perl code a lot is the use of array types as anonymous structure types. This is bad for a bunch of different reasons: it decreases maintainability for all the same reasons already mentioned above, and it can also impair performance because of the cost of pulling data out of a general-purpose array structure instead of using quick offsets into a fixed structure the way you can with a dedicated data structure object (although really you end up paying those costs all the time in PHP, because all objects are a "general purpose array structure").
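The same habit can be reproduced in Java with a string-keyed map, which makes the difference easy to see side by side. Everything in this sketch (RecordSketch, Revision, the "id" and "user" keys, the sample values) is hypothetical and chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

class RecordSketch {
    // Dedicated structure: field names and types are checked by the compiler.
    static final class Revision {
        final int id;
        final String user;

        Revision(int id, String user) {
            this.id = id;
            this.user = user;
        }
    }

    // "Anonymous structure" style: a general-purpose map keyed by strings.
    static Map<String, Object> revisionAsMap(int id, String user) {
        Map<String, Object> m = new HashMap<>();
        m.put("id", id);
        m.put("user", user);
        return m;
    }

    public static void main(String[] args) {
        Map<String, Object> loose = revisionAsMap(7, "ExampleUser");
        // A mistyped key compiles fine and silently yields null.
        System.out.println(loose.get("usr")); // prints null
        // Every read needs a hash lookup plus a cast back to the real type.
        int id = (Integer) loose.get("id");
        System.out.println(id); // prints 7

        Revision strict = new Revision(7, "ExampleUser");
        // strict.usr would not compile; strict.user is a direct field access.
        System.out.println(strict.user); // prints ExampleUser
    }
}
```

With the dedicated class, a misspelled member name is a compile error and reading a field involves no hashing at all.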
I'm doing my port to Java not specifically because I like Java but because Java is the best general purpose language for which I have a good IDE and compiler. I actually like C# better (C#'s generics are better than Java's), but there's no C# environment that isn't Microsoft-encumbered and so I don't use it, in part because I don't want to pay for it and in part because I don't want to deal with Microsoft's licensing. C++ has its own set of problems; one of my major gripes is with operator overloading, which allows a programmer to create counterintuitive definitions for primitive operations and frequently conceals expensive operations from the programmer, which leads to inefficient code.
Some time ago, Greg Maxwell shared with me the results of profiling one of Wikimedia's Apache servers (running, of course, MediaWiki). I seem to recall that the server spent a good chunk of its time in either the regular expression handler or various PHP symbol table manipulation functions. (See, for example, these profiling results and note that 8.8% of the CPU is being used either to find things in hash tables or update things in hash tables, most of which is symbol table and array index manipulation, and almost 2.5% is being used to run match, which is PHP's primary regexp processing routine.) There has to be a way to eliminate at least some of that overhead.
I agree with a whole lot of this, though personally I wouldn't rewrite it in Java; I can't stand Java. But that's personal.
PHP and Perl are great for banging out quick and dirty solutions to problems that frankly don't deserve a polished tool.
In most cases, Perl actually allows a good programmer to write pretty good code; you can make Perl be your own language, fundamentally, with your own rules. Of course, this all breaks down if you have to use anyone else's code... PHP, on the other hand, seems to make it almost impossible to work in a more disciplined manner, although that could be from lack of experience in it.
And I wholly agree about the regular expressions. Not that they're not useful in many circumstances, but they lead one down wrong paths - e.g. Wikitext parsing, that should have been a dedicated grammar from the beginning, and now probably can't be implemented in a specifiable way at all.
Do you think the developers will switch to your Java site engine once it's finished and a significant improvement over the PHP one?
No, I have no reason to believe that the developers would just decide to use my version. It would be unwise for them to do so, in fact, at least not until after my version has proven itself elsewhere.
MediaWiki, for all its warts, has proven itself to be a very robust platform that serves the Foundation reasonably well. My main reasons for doing this have always been to get more practice with Java and to learn more about MediaWiki. I do plan to eventually replace MediaWiki with Myrtle on my own wikis, and maybe someday I'll even release it to the public and all that (I haven't even cleaned up the copyrights yet, so while you can get it from my SVN server and it is, of course, available under the GPL, it's not been formally released), but I have no intention of actually competing with MediaWiki for market share or anything like that.
Re: "I've found at least two instances in MediaWiki 1.11 where there's a mismatch between a function's definition and its invocation, although both are in low-probability codepaths as far as I can tell." - please feel free to email me (nickpj at gmail) the details of these places so that they can be corrected in SVN, if you or someone else has not already corrected this.
Nick, if I could remember where they were I would have already emailed a patch to Brion. Unfortunately, I don't, not specifically enough. I've been working on this project for quite some time now, and my focus has always been on completing and improving my project, not on vetting MediaWiki for bugs.
Understood. Well, if you come across any more things, and if it's quick & easy for you to do so, then please email details to Brion / log it in bugzilla / email wikitech / email just about any dev, and it'll be investigated and fixed. And best of luck with your project! Also if you haven't done so, maybe check out Jamwiki ( http://jamwiki.org/ ), which I think is trying to be a mediawiki-compatible wiki written in java. -- All the best, Nick.
I think almost all the things you don't like about PHP are some of its greatest strengths. Properly written PHP is easy to read, elegant, and powerful.
There are many examples of good PHP, but it's also true that they are few and far between.
I hate php so much.
I can't say i agree at all! thank god for independent thought otherwise we'd all sound the same :)
I've recently blogged about just the opposite effect: I've been working on a Java project and i'm pining to get back to PHP.
Your blog shows an interesting point about background and how it has a bearing on how we choose to develop, I love RegExs because of my original UN*X background, they're (usually) pretty easy to read IMHO and comments can be embedded into them, they're self-documenting and portable and that's a major boon to those of us who have to swap languages pretty often.
Secondly, the problems you mention with PHP are also bonuses - it's quick to write (and admittedly it's quick to write junk too) but PHP forces you to use unit testing (try SimpleTest - i love it) so those issues you mention don't come up.
Performance: fair point - PHP needs to improve. I can't argue with that; PCRE regexes are pretty bad.
What i'm surprised at though is that you don't hate PHP because it's so unpredictable within the language itself (are this function's arguments ordered haystack then needle or vice-versa? is this a function that starts with str_ or just str?). Those are the things that I think need sorting out ASAP.
Anyway thanks for a good article, I always like to read stuff with contrary views :)
If you get a chance then take a read of my own moaning and whining at Code Wizards' Blog and Ewelike's Blog and see how strongly you disagree.
Yes PHP is such a damned mess.
Having loose types in PHP is a complete advantage. If you need to validate that a variable is a number, float or string, there are functions to check for this.
Having mixed types in an array is a really flexible way to code.
My only complaints about PHP are that some function names are not consistent and the arrangement of arguments can also be inconsistent. This is inconvenient, but that is why I always have the PHP manual handy; PHP happens to be the best-documented coding language of all time.
PHP abstracts away a lot of the bullshit, so you can focus on coding.
With php5 and now namespaces, php is a full blown OOP language with even GTK, QT and other graphics tool kits to come.
I love PHP.
One aspect about runtime creation of methods and members in a class should be removed, as this may create 'spaghetti' code.
I hate PHP. It's so inconsistent, slow and messy. I always feel dirty when working in PHP...
ReplyDeleteI know this article is old but I have to ask this. You mention the shortcomings of a lot of languages here and reasons why you would not use them so what language do you consider to be good?
ReplyDeleteBTW, on PHP, I think it has it's warts like any language but it's good for what it was made for - building web apps/sites. Besides, like it or not, it's so ubiquitous that not being competent in PHP probably means that you are doing yourself out of at least some work.
I agree. PHP is cheap and ugly. When I have to close VisualStudio and open NetBean i'm so sad I could cry. The debugger is so cheap and weak. Yeurk.
ReplyDeleteAfter using PHP as main language for 8 years I can say: I hate it so much that I even cannot shape my feeling into proper words. There are beautiful dynamic languages like python, ruby with its rails, groovy, also statically-typed: Scala with powerful frameworks like Play or Lift. Don't waste your time with PHP.