Thursday, March 15, 2007

Notability, maintainability, and quality

Sage Ross reports (in his blog, at "Wikipedia and Notability") that the community is unhappy with the current definition of notability. I've touched on notability before, in the limited context of webcomics (see Webcomics and Wikipedia and On Webcomics, again). As Sage notes, "notability" has always been a contentious issue in Wikipedia, and there is indeed currently a dispute over what, if anything, "notability" should mean.

In the interest of disclosure, I will reveal that I am an eventualist inclusionist mergist. (I do not consider the latter two mutually contradictory.) My experience, in the somewhat over two years that I've been involved with Wikipedia, is that the scope of what constitutes "acceptable content" for inclusion in the encyclopedia has consistently broadened over time, although certainly in some areas (such as webcomics) there have been pushbacks. A good example of this trend is high schools. When I first started at Wikipedia, in late 2004, very few high schools had articles, and most attempts to create one were met with a rather quick deletion on the basis of being "not notable". By 2006, it was generally accepted that high school articles were not subject to being deleted on the basis that they were "insufficiently notable", and today nobody (except for the most hardcore deletionist) contemplates deleting a high school article for very long. Similar trends have seen individual articles on every Pokémon, articles on individual episodes of various television shows, and all sorts of other content that would likely have been summarily deleted in 2004 become generally accepted as appropriate content in 2007.

This is, in my belief, largely due to the fact that the people who feel the urge to remove what they feel is meritless content are simply outnumbered by the people who would create such content. There has not, in most cases, been any conscious decision by the Wikipedia community (if in fact that entity is capable of making decisions, which I rather highly doubt) that articles on individual episodes of The Simpsons are appropriate for inclusion; rather, the articles were created by dedicated Simpsons fans, and nobody with an eye for trimming the encyclopedia got to them quickly enough to effectively resist their presence, and so they, by default, became part of the accepted corpus. I see no reason why this trend would not continue, and so I expect that over time the margins of notability will continue to be pushed further and further back. I don't think that the margins will ever be pushed out completely to the point that (e.g.) the serial numbers of the dollar bills in my purse will merit their own articles (although it's not entirely out of the question, as many of them are catalogued already), but I think there's still a great deal of room for expansion and I expect to see Wikipedia expand into that space over the long haul.

The ongoing battle over webcomics seems to be the current exception to this trend, and I don't expect it to continue. Assuming that they don't give up, the webcomics fans will eventually win, as they simply outnumber the notability pruners. At the moment, the pruners are organized against webcomics, and they are assiduously defending that territory. However, the pruners are more subject to attrition in the ranks than the webcomics fans, and it is likely inevitable that too many of their faction will leave Wikipedia or be drawn off into some other battle (say, amateur sports leagues, or radio towers, or some other equally borderline area) and the resulting loss of active focus will let the webcomics fans win out. It's far easier, in most cases, to recruit people in favor of keeping content than it is to recruit those opposed to it.

So, rather than spending a lot of time refining the definition of notability, I would advise discarding it entirely. Notability is, in practice, a proxy for a large number of largely personal beliefs about what should be in an encyclopedia, beliefs for which there is no consensus within the Wikipedia community. Furthermore, those beliefs shift over time, and I believe that shift will tend toward broader inclusion. The problem with broad inclusionism is that it will inevitably lead to more articles than the Wikipedia community can effectively maintain. (It is difficult to deny that this has already happened.)

The problem with discarding notability is that immediately people will scream "But then we will have articles about what you had for breakfast yesterday". Well, no, we won't. (Although it might be interesting to have that data; I'm sure that there will be people in 2150 who will be interested in knowing about the dietary habits of early 21st century IT professionals. There are probably people in 2007 with that interest, for that matter.) I am not advocating having no standards at all; that would be irrational. Instead, the standards must reflect maintainability as the main consideration. A record of my breakfast yesterday (for the record, two glazed Dunkin Donuts and a bottle of Aquafina) is unverifiable, and thus unmaintainable, and thus unfit for inclusion in Wikipedia. Verifiability isn't enough for maintainability, but it's definitely a minimum characteristic.

This seems to be the general direction of the discussion that Sage refers to, although they're not characterizing it as maintainability, but instead attributability. I don't think attributability is enough. One of Wikipedia's largest problems right now is that it's larger than its community can effectively tend to. Wikipedia needs to aggressively limit its growth, at least in the short term, to give its community enough time to structure itself better to be able to handle the content it has now, to say nothing of the content it will acquire in the future. The problem with adopting attributability (or verifiability) as a minimum criterion for inclusion is that someone is going to have to check the cited sources for accuracy. Nobody is doing that now, except on a haphazard basis. Wikipedia has no process now for any sort of organized maintenance of the encyclopedia; even vandalism management is done haphazardly.

Quite frankly, I think it would be appropriate for Wikipedia to disable new page creation (except for admins, to deal with special cases) for an entire month and spend that month developing the infrastructure to better maintain both the articles it currently has and the new articles it'll gain once new page creation is reenabled. New page review needs to be systematic, not haphazard, and there need to be systems to ensure that every new page is looked at by at least one and preferably several experienced editors promptly after creation, both to properly categorize it (the stub sorters already sorta do this, but they do so in a far less useful way than they could) and to evaluate the article for what action the community needs to take with respect to it. And then the community needs to actually do those things.

There are currently 21,598 articles tagged as needing cleanup and 55,928 tagged as lacking sources. And I suspect that only represents about 20% of the articles that actually belong in those respective categories. These numbers are not falling with time; they are growing (a month ago, there were only 49,607 tagged as lacking sources). These backlogs reflect the rapidly declining overall quality of Wikipedia. The situation may already be out of control; if it is not yet, it likely will be soon. The problem is that the community largely seems not to care, and that really bothers me.
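For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above; the 20% tagging share is my guess rather than a measured value, and the variable names are purely illustrative.

```python
# Figures quoted in the post (tag counts as of March 2007).
cleanup_tagged = 21_598
unsourced_tagged = 55_928
unsourced_month_ago = 49_607

# Assumption: tagged articles are only about 20% of those that
# actually belong in each category. This is a guess, not a measurement.
assumed_tagged_percent = 20

# One-month growth of the "lacking sources" backlog.
monthly_growth = unsourced_tagged - unsourced_month_ago
growth_percent = 100 * monthly_growth / unsourced_month_ago

# Scale the tagged counts up to the estimated true backlog sizes.
estimated_unsourced = unsourced_tagged * 100 // assumed_tagged_percent
estimated_cleanup = cleanup_tagged * 100 // assumed_tagged_percent

print(f"Unsourced backlog grew by {monthly_growth:,} ({growth_percent:.1f}%) in a month")
print(f"Estimated true unsourced backlog: {estimated_unsourced:,}")
print(f"Estimated true cleanup backlog: {estimated_cleanup:,}")
```

Under that assumption, the unsourced backlog grew by 6,321 articles (about 12.7%) in one month, and the real backlogs would be roughly 279,640 unsourced articles and 107,990 needing cleanup.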

Deleting all unverified articles would be a good start. Not all at once, but a deliberate, systematic process to either source or delete those 55,928 articles would be a great start. Proper use of automation is critical to this, and I really think that's where Wikipedia needs to be concentrating its activities in the next year. It would be great if the Foundation would help to recruit the volunteers needed for this effort; the problem with the current community is that there don't seem to be enough people interested in this sort of work to get it done.


  1. Hi,
    as someone who's been following a large open source project (Debian), I find myself thinking of parallels to Wikipedia. Here are a few:
    1. problems related to scalability and maintainability
    Debian grew from a handful to now about 1000 developers, with others contributing patches, translation, QA, and infrastructure. There are now 16,000+ binary packages. Similarly, Wikipedia now has a million English articles and some large number of 'developers', with a large number of folks who do the equivalent of patches, translation, infrastructure, and QA. And you think that there should be a temporary halt on articles so that you can regroup the people and infrastructure.
    2. need to determine policy for inclusion
    Debian has its DFSG and -devel list to see if something new should be added. But after a while, folks check to see if the package is still OK or obsolete. So Wikipedia gives folks a chance to make an article, but it is then scrutinized initially, and after a while you check back and see if it's still 'noteworthy'.
    3. need to have checks on sources
    Debian uses the DFSG and folks check source for attribution and freeness. Wikipedia also has to make sure an article is not plagiarized, is free, and check its attribution.
    4. need to have teams to do non-fun bits like qa
    Debian has folks doing translation, proofing man pages, documentation, the technical committee, and keeping the servers running. Wikipedia has folks who do some of this and more. I guess the arbitration stuff.
    5. need folks to work on translations
    Not much to say here. Both translate things.
    6. building a method to ensure certain members are trusted.
    Debian has a new maintainer process and a web of trust. Wikipedia has a certain trust metric related to number of articles and such and, IIRC, maybe a new credential verification.

  2. You describe the problem of growing cleanup backlogs and an increasing proportion of low-quality articles as an indication of "rapidly declining overall quality of Wikipedia". I don't think that's the case. It's more a sign of rising standards (i.e., less tolerance of lack of sources and other problems) and rapidly increasing number of articles on increasingly narrow topics.

    A better way to look at the quality gradient of Wikipedia is to consider a fixed set of articles and see how they change over time. And if that's the criterion, I think it's clear that Wikipedia's quality is rising (if slowly). The biggest barrier to faster progress is the lack of experts with relevant mastery of the literature of big topics (particularly social scientists and historians).

  3. There's an RfC/straw poll in the German-language Wikipedia that emerged from the same insights that you have written down here. The idea is to stop creating new articles for a week every month.

    The poll ends on March 21st. Today there are 91 Wikipedians in favor of testing this idea and 164 against it.

  4. The dynamic you describe is like local government. There are boards for libraries, parks, fire protection districts, etc. A guy gets elected to one of these having promised to 'say no'|'stop xyz'|'resist expansion'. And he does say no, but eventually the others wear him down; he leaves the board or joins the expansionists. It's more fun to do stuff than prevent stuff. The result is a bias toward expanding everything, and a strong bias toward preserving anything with an established constituency. It's not that it's good or bad, just a part of human nature to take into account.

  5. I agree completely with what Sage said.

  6. In some ways, the articles that people get all worked up about are the ones likely to be maintained. Most of them are 'fanboy' topics that people want deleted because 'they shouldn't be in a serious encyclopedia'. But the fanboys are numerous, and hard-working, and over time will obsessively improve the articles in their area of obsession. Original research is a problem, but fanboy arguments will over time make sure that everything is nicely sourced, just to win the arguments.