Sunday, March 25, 2007

Today's paradigm of quality on Wikipedia

Today's punching bag article is "List of ministers of the environment". This is potentially interesting information, but it is presented in such a mind-wrenchingly bad way that it should probably just be deleted.

The main problem is the use of multiple unaligned tables. Each country's listings are in independent tables, which means the columns do not line up from section to section. This largely defeats the purpose of having columns.

In addition, significant amounts of information are being conveyed using shaded pastel colors. (This is actually how this article came to my attention: it abuses the {{yes2}} and {{no2}} templates to get the coloring. Formerly it used the {{but-yes}} and {{but-no}} templates, but that was recently changed when those templates were deleted.) This is unfriendly to color-blind readers; the red and green colors used for those boxes are visually indistinguishable to someone with protanopia.

There is an argument that concluding that some of these people qualify as "ministers of the environment" constitutes original research. At the very least, the sources that support identifying any particular position as a "minister of the environment" need to be more clearly associated with the individual claims.

Ideally, all of the information on this list would be moved into relevant articles as metadata and the list generated automatically from the metadata. This is a persistent problem with Wikipedia lists.

Citizendium plagiarizes?

It seems that Citizendium, despite its fancy claims, is not above plagiarism: compare this image on Citizendium with this image at Commons.  Note the stunning lack at Citizendium of any credit to Magnus as the source.  Not only is this copyright infringement, but it's morally dishonest for Citizendium and its contributors to take credit for work they did not do.

Shame on you, Larry.  I would have expected better of you.

The demographics of Wikipedia

Reading Geoff Burling's recent post on "Age and Wikipedia" got me thinking about age, and about other demographics, in the Wikipedia communities. However, I think the behavior Geoff is writing about is not entirely a symptom of age (although certainly many of the people expressing the behavior in question are teenaged boys, as I've commented on before), but rather of psychological characteristics that Wikipedia selects for, combined with the simple fact that there are a lot of teenagers (boys, mainly) with scads of free time to burn on the Internet.

A few days later, I saw this, from Language Log, about a college freshman who credits Wikipedia for his passion for linguistics. Wikipedia probably does attract significant numbers of people who first discover it while scratching some knowledge itch. I seem to recall that my first encounter with Wikipedia was due to an interest in mathematics. Of course, not nearly everyone who finds Wikipedia as a reference source goes on to be even so much as a casual editor, let alone a dedicated editor, and I doubt that anyone has anything better than a wild guess as to the conversion rates there.

And this brings me to the major annoyance in discussing demographics of Wikipedians: there are no meaningful demographics about Wikipedians. About the only subgroup of Wikipedians for which there is even a hope of meaningful demographics is the group that goes to Wikimania. The culture of anonymity on Wikipedia is so strong that many admins, and quite probably a majority of editors, do not reveal even basic demographic information about themselves. There are a number of voluntary surveys (e.g. the "list of Wikimedians by age" on meta), but any statistician knows that self-selected surveys are problematic at best, and the response rate on these surveys is generally so low as to make them useless for any meaningful purpose. Even estimating the number of distinct editors on Wikimedia projects is hard, because of anonymous editing (which makes multiple people difficult to distinguish) and sockpuppets (which make a single person appear to be multiple people). The English Wikipedia has millions of user accounts, but a rather large percentage of them were created for the sole purpose of vandalism. The number of true, non-anonymous editors is simply not known, and is likely unknowable as well. And nobody can really agree on the best method for finding reasonable estimates (most of the proposed methods count only editors who make more than some number of edits in some fixed length of time).
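To make the "more than some number of edits in some fixed length of time" idea concrete, here is a minimal sketch in Python. Everything in it is my own assumption for illustration: the function name, the threshold, the window, and the sample edit log are all made up, and it is not how any actual Wikimedia statistic is computed. Note that it counts accounts, not people, so it inherits exactly the anonymous-editing and sockpuppet problems described above.

```python
from collections import Counter
from datetime import datetime


def count_active_editors(edits, start, end, threshold):
    """Count accounts with more than `threshold` edits in [start, end).

    `edits` is an iterable of (account_name, timestamp) pairs.
    This counts accounts, not distinct human beings.
    """
    per_account = Counter(
        account for account, when in edits if start <= when < end
    )
    return sum(1 for n in per_account.values() if n > threshold)


# Entirely made-up sample data, for illustration only.
edits = [
    ("Alice", datetime(2007, 3, 1)),
    ("Alice", datetime(2007, 3, 2)),
    ("Alice", datetime(2007, 3, 20)),
    ("Bob", datetime(2007, 3, 5)),
    ("192.0.2.7", datetime(2007, 3, 6)),  # an IP address: one person or many?
    ("192.0.2.7", datetime(2007, 3, 7)),
    ("Alice", datetime(2007, 4, 2)),      # outside the March window
]

active = count_active_editors(
    edits,
    start=datetime(2007, 3, 1),
    end=datetime(2007, 4, 1),
    threshold=1,  # hypothetical cutoff: "more than one edit in the month"
)
print(active)
```

Move the threshold or the window and the answer changes, which is part of why nobody agrees on the "right" number.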

So, while it is "common knowledge" that "Wikipedia is run by high schoolers", there really is not any objective basis for this statement. At best, it's an intuitive guess extrapolated from very limited information. Certainly there are high school students involved with Wikipedia, but I think the above-linked article about the passionate linguist is proof of why this can be a good thing. Extrapolating from a few instances to the general case, however, is fallacious. I would love to see real, meaningful statistical data on the demographics of Wikipedia readers, contributors, and community members instead of the current mishmash of wild guesses, extrapolations, and outright hyperbole that is sadly passing for fact in such discussions.

Friday, March 23, 2007

On moving

I do not enjoy moving.  It is a very time-consuming process.  It is a process that does not leave one time to write interesting things in one's blog.

Thursday, March 15, 2007

Notability, maintainability, and quality

Sage Ross reports (in his blog, at "Wikipedia and Notability") that the community is unhappy with the current definition of notability. I've touched on notability before, in the limited context of webcomics (see Webcomics and Wikipedia and On Webcomics, again). As Sage notes, "notability" has always been a contentious issue in Wikipedia, and there is indeed currently a dispute over what, if anything, "notability" should mean.

In the interest of disclosure, I will reveal that I am an eventualist inclusionist mergist. (I do not consider the latter two mutually contradictory.) My experience, in the somewhat over two years that I've been involved with Wikipedia, is that the scope of what constitutes "acceptable content" for inclusion in the encyclopedia has consistently broadened over time, although certainly in some areas (such as webcomics) there have been pushbacks. High schools are a good example of this trend. When I first started at Wikipedia, in late 2004, very few high schools had articles, and most attempts to create one were met with a rather quick deletion on the basis of being "not notable". By 2006, it was generally accepted that high school articles were not subject to being deleted on the basis that they were "insufficiently notable", and today nobody (except for the most hardcore deletionist) contemplates deleting a high school article for very long. Similar trends have seen individual articles on every Pokémon, articles on individual episodes of various television shows, and all sorts of other content that would likely have been summarily deleted in 2004 become generally accepted as appropriate content in 2007.

This is, I believe, largely due to the fact that the people who feel the urge to remove what they feel is meritless content are simply outnumbered by the people who would create such content. There has not, in most cases, been any conscious decision by the Wikipedia community (if in fact that entity is capable of making decisions, which I rather doubt) that articles on individual episodes of The Simpsons are appropriate for inclusion; rather, the articles were created by dedicated Simpsons fans, and nobody with an eye for trimming the encyclopedia got to them quickly enough to effectively resist their presence, and so they, by default, became part of the accepted corpus. I see no reason why this trend would not continue, and I therefore expect that over time the margins of notability will continue to be pushed further and further back. I don't think that the margins will ever be pushed out completely to the point that (e.g.) the serial numbers of the dollar bills in my purse will merit their own articles (although it's not entirely out of the question, as many of them are catalogued already at wheresgeorge.com), but I think there's still a great deal of room for expansion and I expect to see Wikipedia expand into that space over the long haul.

The ongoing battle over webcomics seems to be the current exception to this trend, and I don't expect it to continue. Assuming that they don't give up, the webcomics fans will eventually win, as they simply outnumber the notability pruners. At the moment, the pruners are organized against webcomics, and they are assiduously defending that territory. However, the pruners are more subject to attrition in the ranks than the webcomics fans, and it is likely inevitable that too many of their faction will leave Wikipedia or be drawn off into some other battle (say, amateur sports leagues, or radio towers, or some other equally borderline area) and the resulting loss of active focus will let the webcomics fans win out. It's far easier, in most cases, to recruit people in favor of keeping content than it is to recruit those opposed to it.

So, rather than spending a lot of time refining the definition of notability, I would advise discarding it entirely. Notability is, in practice, a proxy for a large number of largely personal beliefs about what should be in an encyclopedia, beliefs for which there is no consensus within the Wikipedia community. Furthermore, those beliefs shift over time, and I believe that shift will tend toward broader inclusion. The problem with broad inclusionism is that it will inevitably lead to more articles than the Wikipedia community can effectively maintain. (It is difficult to deny that this has already happened.)

The problem with discarding notability is that immediately people will scream "But then we will have articles about what you had for breakfast yesterday". Well, no, we won't. (Although it might be interesting to have that data; I'm sure that there will be people in 2150 who will be interested in knowing about the dietary habits of early 21st century IT professionals. There are probably people in 2007 with that interest, for that matter.) I am not advocating having no standards at all; that would be irrational. Instead, the standards must reflect maintainability as the main consideration. A record of my breakfast yesterday (for the record, two glazed Dunkin Donuts and a bottle of Aquafina) is unverifiable, and thus unmaintainable, and thus unfit for inclusion in Wikipedia. Verifiability isn't enough for maintainability, but it's definitely a minimum characteristic.

This seems to be the general direction of the discussion that Sage refers to, although they're not characterizing it as maintainability, but instead as attributability. I don't think attributability is enough. One of Wikipedia's largest problems right now is that it's larger than its community can effectively tend to. Wikipedia needs to aggressively limit its growth, at least in the short term, to give its community enough time to structure itself better to be able to handle the content it has now, to say nothing of the content it will acquire in the future. The problem with adopting attributability (or verifiability) as a minimum criterion for inclusion is that someone is going to have to check the cited sources for accuracy. Nobody is doing that now, except on a haphazard basis. Wikipedia has no process now for any sort of organized maintenance of the encyclopedia; even vandalism management is done haphazardly.

Quite frankly, I think it would be appropriate for Wikipedia to disable new page creation (except for admins, to deal with special cases) for an entire month and spend that month developing the infrastructure to better maintain both the articles it currently has and the new articles it'll gain once new page creation is reenabled. New page review needs to be systematic, not haphazard, and there need to be systems to ensure that every new page is looked at by at least one and preferably several experienced editors promptly after creation, both to properly categorize it (the stub sorters already sorta do this, but they do so in a far less useful way than they could) and to evaluate the article for what action the community needs to take with respect to it. And then the community needs to actually do those things.

There are currently 21,598 articles tagged as needing cleanup and 55,928 tagged as lacking sources. And I suspect that only represents about 20% of the articles that actually belong in those respective categories. These numbers are not falling with time; they are growing (a month ago, there were only 49,607 tagged as lacking sources). These backlogs reflect the rapidly declining overall quality of Wikipedia. The situation may already be out of control; if it is not yet, it likely will be soon. The problem is that the community largely seems not to care, and that really bothers me.
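For anyone who wants to check the arithmetic behind those figures, here is a quick back-of-the-envelope sketch in Python. The tagged counts are the ones quoted above; the 20% tagging rate is only my suspicion, not a measurement, so the "implied" totals are rough guesses, and the function names are invented for this illustration.

```python
def monthly_growth(previous, current):
    """Return (absolute increase, percentage increase) between two counts."""
    added = current - previous
    return added, 100.0 * added / previous


def implied_total(tagged, tagged_fraction):
    """Estimate how many articles belong in a category if only
    `tagged_fraction` of them have actually been tagged."""
    return round(tagged / tagged_fraction)


# Counts quoted in the post.
UNSOURCED_LAST_MONTH = 49_607
UNSOURCED_NOW = 55_928
CLEANUP_NOW = 21_598

# ASSUMPTION: the 20% tagging rate is a guess, not measured data.
GUESSED_TAGGED_FRACTION = 0.20

added, pct = monthly_growth(UNSOURCED_LAST_MONTH, UNSOURCED_NOW)
print(added, round(pct, 1))                                  # 6321 12.7
print(implied_total(UNSOURCED_NOW, GUESSED_TAGGED_FRACTION))  # 279640
print(implied_total(CLEANUP_NOW, GUESSED_TAGGED_FRACTION))    # 107990
```

In other words, the unsourced backlog grew by roughly an eighth in a single month, and if my guess about tagging is anywhere near right, the real backlog is several times the size of the tagged one.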

Deleting all unverified articles would be a good start: not all at once, but through a deliberate, systematic process to either source or delete those 55,928 articles. Proper use of automation is critical to this, and I really think that's where Wikipedia needs to be concentrating its activities in the next year. It would be great if the Foundation would help to recruit the volunteers needed for this effort; the problem with the current community is that there don't seem to be enough people interested in this sort of work to get it done.