Wednesday, September 06, 2006

Aaron Swartz: Why is he getting so much attention?

I'm sure we've all seen Aaron's blog entry about contributors to the Alan Alda article. For those of us who went to Wikimania, this is old news: Seth's study not only reported on this, but did so in more detail (as Ross Mayfield has finally noticed on his blog).

The community has long known that edit count is a poor measure of contributions, although at the same time the community is also quite addicted to edit counting. Aaron's letter-counting metric also incidentally heavily rewards people who revert page-blanking vandalism (which is actually quite common on high-traffic articles), since restoring a blanked page counts as adding all of its text. Aaron's metric is better than edit count, but it still isn't very good.

Quite simply, we need a way to classify edits. This isn't that easy; if it were, we'd have the software do it automatically and reject the edits that fall into the classifications of "edits we don't want". That would be really nice -- but it would certainly require natural language processing that doesn't exist and won't for a long time. Greg and I discussed (after Seth's presentation) the idea of a research portal that would present edits to research assistants who have been trained to classify them, and store the classifications in its database. Mechanisms could be added to facilitate quality checking to ensure classification is being done correctly. This portal could then be made available to researchers interested in this sort of thing. The problem is that it takes from 5 to 30 seconds to classify an edit, and there are currently 76,732,244 of them on the English Wikipedia alone. Classifying all of them would take over 50 years of full-time labor -- or, in dollar terms, about three quarters of a million dollars of labor (and that assumes 5 seconds per edit classified and no allowances for quality issues). Furthermore, classifying edits is boring: it's not going to be easy to motivate people to do it, and certainly not in large volume.
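For anyone who wants to check the arithmetic, here is a rough sketch of the estimate. The edit count and the 5-to-30-second range come from the paragraph above; the 2,000-hour working year (40 hours a week, 50 weeks) is my assumption for this illustration, and the hourly rate is simply whatever the three-quarters-of-a-million figure implies, not a number I looked up.

```python
# Back-of-the-envelope cost of hand-classifying every English Wikipedia edit.
EDIT_COUNT = 76_732_244          # edits on the English Wikipedia (from the post)
HOURS_PER_WORK_YEAR = 40 * 50    # assumption: 40 h/week, 50 weeks/year
QUOTED_LABOR_COST = 750_000      # "three quarters of a million dollars"


def labor_estimate(seconds_per_edit, edits=EDIT_COUNT):
    """Return (hours, full-time years) needed to classify `edits` edits."""
    hours = edits * seconds_per_edit / 3600
    years = hours / HOURS_PER_WORK_YEAR
    return hours, years


if __name__ == "__main__":
    for seconds in (5, 30):  # best case and worst case per edit
        hours, years = labor_estimate(seconds)
        print(f"{seconds:>2} s/edit: {hours:,.0f} hours, {years:,.1f} full-time years")

    # Hourly rate implied by the quoted dollar figure at 5 s/edit.
    best_case_hours, _ = labor_estimate(5)
    print(f"implied rate: ${QUOTED_LABOR_COST / best_case_hours:.2f}/hour")
```

Under those assumptions the 5-second case comes out at a bit over 53 full-time years and an implied rate of roughly seven dollars an hour; the 30-second case is six times worse on both time and cost.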

I appreciate Aaron bringing this issue up -- again -- but I think he needs to spend more time talking to the people who are already in the field instead of trying to use his unoriginal discovery as a justification for his own board candidacy -- which is quite clearly the real reason for his blog post. I'd be very curious to know how many more votes he got after his blog story got picked up by the media. I can't imagine it's none....