Thursday, April 26, 2007

How to raise a fur coat

"Members of the Society for the Prevention of Cruelty to Animals hav lobbied, successfully in many places, for bylaws that make it more expensive to keep a breeding animal than to sterilize Man's closest enemy. If you liked the stray, then you'll love the kitten. It'll be coloured the way you want. It won't shed. With the right breed you can wear it in fifteen years."

The foregoing text was taken from Wikipedia's article on the domestic cat. I wish we had "blame" functionality so I could see how long this text has been here and who added it. Oh well.

Friday, April 20, 2007

Does vandalism make Wikipedia less of an encyclopedia?

This is, of course, old news; it's from March 12th, and the fact that I'm just getting around to blogging about it is a reflection of how busy I've been. (March 12th was three days after we closed on the new house, and so I was understandably busy with more important things than amusing y'all about the insanity that is Wikipedia.)

By now everyone -- including Sabre Publishing Ltd., who, amusingly enough, does not have a Wikipedia article -- knows that anything you find on Wikipedia might contain random nonsense that, if used without inspection, could prove quite embarrassing. And certainly this might have had something to do with this; the raw possibility that a Wikipedia page might contain anything will certainly put many people off.

This is both Wikipedia's problem, and not Wikipedia's problem.

It is Wikipedia's problem insofar as people are putting garbage into articles; the Wikipedia community needs to find better ways to deal with this than it has to date. The much-talked-about but yet-to-be-seen "stable versions" would likely help a great deal, but I've been saying that now since last summer and Wikipedia still doesn't have them and (as far as I know) there isn't even a timetable for when Wikipedia will have them. Apparently the problem has something to do with page moves; I looked at this briefly after talking to someone on IRC and recognized that, yes, that would be difficult with the current database schema. But I already knew that MediaWiki's database schema had issues. Even setting aside stable versions as a solution, there are other possibilities: coordinated vandalism patrolling tools, for example. A coordinated vandalism management tool would ensure that each edit was examined by only one or two patrollers. This can be done by putting each new edit into a queue, and having patrollers pop edits off the queue and review them at a pace driven not by the march of the RC stream, but by their own availability. The queue might back up if not enough patrollers are on duty at any time, but the software can probably deal with this as well by merging multiple edits to the same article (and dropping them when a subsequent edit is from a trustworthy user). This system could also implement quality control and metric collection, both areas that are currently not addressed in the vandalism management arena.
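
To make the queue idea concrete, here is a minimal sketch in Python. It is not how MediaWiki or any existing patrol tool works; the names (Edit, ReviewItem, PatrolQueue, submit, next_item) and the notion of a fixed set of trusted users are all assumptions made up for illustration. It shows only the three behaviors described above: edits wait in a queue until a patroller asks for work, multiple edits to the same article are merged into one review item, and a pending item is dropped when a later edit to that article comes from a trusted user.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Edit:
    """One edit to one article (hypothetical record, not a MediaWiki type)."""
    article: str
    user: str
    revision: int


@dataclass
class ReviewItem:
    """All unreviewed edits to a single article, merged into one unit of work."""
    article: str
    edits: list = field(default_factory=list)


class PatrolQueue:
    """Hands each pending article to exactly one patroller, oldest first."""

    def __init__(self, trusted_users):
        # Assumption: trust is a simple allow-list of user names.
        self.trusted_users = set(trusted_users)
        # article title -> ReviewItem, kept in order of first pending edit.
        self._pending = OrderedDict()

    def submit(self, edit):
        """Record a new edit as it arrives from the recent-changes stream."""
        if edit.user in self.trusted_users:
            # Assumption: a trusted user's edit implies the article state
            # has been looked at, so earlier pending edits are dropped.
            self._pending.pop(edit.article, None)
            return
        item = self._pending.get(edit.article)
        if item is None:
            item = ReviewItem(edit.article)
            self._pending[edit.article] = item
        # Merge: later edits to the same article join the existing item.
        item.edits.append(edit)

    def next_item(self):
        """Pop the oldest pending item, or return None if the queue is empty."""
        if not self._pending:
            return None
        _, item = self._pending.popitem(last=False)
        return item

    def __len__(self):
        return len(self._pending)
```

A patroller's client would call next_item() whenever the patroller is ready for more work, so the pace is set by the reviewer rather than by the edit stream. Because an item leaves the queue when it is popped, no two patrollers are handed the same edits. The length of the queue is also a first crude metric: if it keeps growing, not enough patrollers are on duty.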

It is not Wikipedia's problem insofar as anyone reading Wikipedia, or in fact any reference source, is expected to use their common sense in evaluating what they find, and not merely accept it unquestioned. So, for example, even if Wikipedia says that Sioux Lookout is "full of drunks and a dirty little town", or that chickens can "fly into magic dragon helecopters", even a relatively undiscriminating reader would likely take a moment to consider whether these remarkable facts are in fact true. However, it's somewhat more difficult with respect to the Jesuit problem cited by Noam Cohen's NYT article (mentioned above). The average person is not likely to have any reason to believe that the statement "the rebels themselves were backed by the foreign power of the Jesuits and the Roman Catholic Church" (as found in this edition of the Wikipedia article in question) is wrong. One would have to be independently knowledgeable about the Shimabara Rebellion (which I do not recall even hearing of prior to today despite taking several Asian history classes in high school and college) and about the Jesuits' role in it. Errors like this can be due to honest mistakes of fact by editors, to deliberate bias by editors, or to so-called "sneaky vandalism", where an editor deliberately introduces false but plausible information to degrade the quality of Wikipedia. I can't say which of the three is responsible for this particular error; the point, however, is that "caveat lector" clearly applies here. One ought not to rely on anything in Wikipedia for anything more than casual purposes without verifying it independently, lest one find oneself accidentally claiming that artificial poultry incubation was invented by George W. Bush.

Clearly the Sioux Lookout incident was primarily Sabre Publications' fault: they were abjectly negligent to send copy to press without looking it over even superficially. At the same time, the Wikipedia community was negligent for letting the article say such unpleasantly nasty things about Sioux Lookout for nearly two days back in January (the vandalism was in place from January 27 at 2310Z until January 29 at 1515Z). A more comprehensive, more responsible activity monitoring system would have discovered this edit more promptly. (Tellingly, the vandalism was reverted by an anonymous editor, not by one of Wikipedia's vaunted vandal patrollers. One wonders how often this happens.)

The Jesuits in Japan incident, though, is more the result of a structural defect in Wikipedia, combined with a perceptual problem with Wikipedia unfairly calling itself an "encyclopedia". People expect an encyclopedia to be accurate. Wikipedia is not systematically fact-checked, which means that its accuracy falls well short of what most people probably expect, including, apparently, Professor Waters' students. Wikipedia could do a lot by implementing systematic fact-checking methods, but in my experience attempts to suggest this to the Wikipedia community are generally met with "That's too much bother, people won't do it". And that attitude, quite frankly, is why I really question the commitment of the Wikipedia community to writing an encyclopedia.

Thursday, April 19, 2007

Motivation and non-English versions of Wikipedia

One of the things I noticed when Greg and I did our study of edits by project and country of origin last year is that the Indian language projects are very lightly trafficked, and that the traffic patterns didn't correspond to what we saw on other projects. For languages where most of the speakers of the language are to be found in one country, we saw that typically 65 to 95 percent of the edits to that language's project came from that country. This pattern covers German (76.2% from Germany), French (67.3% from France), Japanese (93.2% from Japan), Polish (80.1% from Poland), Italian (85.7% from Italy), Dutch (73.0% from the Netherlands), Swedish (72.2% from Sweden), Finnish (77.4% from Finland), Hebrew (79.2% from Israel), Czech (79.5% from the Czech Republic), Norwegian (66.1% from Norway), and Korean (67.6% from South Korea).



Several languages did not fit this pattern: English, Spanish, Portuguese, Russian, Chinese, Slovak, Danish, Romanian, Esperanto, Bengali, Tamil, Kannada, Telugu, and Marathi. English is widely spoken worldwide, and this shows in the edit source patterns. There are a large number of Russian speakers throughout Europe and North America, and this also shows. Spanish is the national language of a large number of countries in South and Central America; if the edits from these were all collected together and counted along with those from Spain, they would easily exceed the 65% cutoff. Similarly, Portugal and Brazil together account for 89% of edits to the Portuguese edition. Edits from China, Hong Kong and Taiwan together amount to about 60.6% of edits to the Chinese edition, and it's likely that many editors to the Chinese edition show up as being elsewhere due to the use of tunneling to evade the Great Firewall of China. Edits to Slovak from the Czech Republic and the Slovak Republic combine to 89.4% of that project's total. Denmark plus the Netherlands amount to 81.5% of edits to Danish. Only Romania lacks a good excuse for its low native participation, and I suspect that the repressive politics there in the not too distant past account for some of that. That leaves just Esperanto (an artificial language with no native land), and several Indian native languages.



All of the Indian languages in question are spoken mainly in India (Kannada, Telugu, Marathi, Tamil), Bangladesh (Bengali) or Sri Lanka (Tamil). Yet none of these has a majority of its contributors in any of these countries, except for Bengali (60.6%). In the cases of Kannada and Marathi, a majority of the edits come from the United States. In general, Indian languages seem to get as much participation from outside the country as inside it. Now, some of this could be explained by a lack of access in the country. However, Bangladesh, which is likely to have Internet access levels comparable (if not inferior) to India, manages to score 60.6% of the edits to its project. I think something more is going on.



Utkarshraj Atmaram's comments (from his blog) explain why Americans seem more interested in Kannada than Indians. And I've heard similar comments elsewhere. Indians (and this appears to be an Indian thing not shared by Bangladeshis, who appear to have a much stronger linguistic national identity than Indians do) just do not see their native languages as languages of learning.



Add to that the motivational aspects of participating in Wikipedia. One can either write articles for, say, Kannada, where relatively few people will read them, or for English, where far more people will read them. I suspect a lot of people are at least somewhat motivated by the desire to have their content read, and writing it on an internet backwater like the Kannada Wikipedia doesn't stroke that desire quite so much as writing it on the English Wikipedia. On the other hand, at least some Americans editing Kannada are likely doing so because it makes them feel like they're "empowering a disadvantaged group" (Kannada speakers) or something.



In any case, it seems to me that Jimbo's frequent focus on Indian native languages is not well-placed. Wikipedia isn't going to change the attitudes of the Indian people toward English or their native languages. A far more fertile area to target would be Middle Eastern languages, some African languages (although English is nearly as pervasive in Africa as in India, due to the same sort of colonial histories), and North (and South) American aboriginal languages (the last area being one that Jeff Merkey has apparently put a lot of effort into). In general, languages which have a lot of national identity tied up in them (e.g. Belorussian, which was recently the subject of a great deal of division) are going to be more fruitful than languages whose speakers don't seem to care a whole lot about them.

Wikipedia as primary source: just what is original research, anyway?

Erik Moeller, back on March 13th, wrote this email to the WikiEN-L mailing list, entitled "Wikipedia as a primary source". It seems mainly to address the topic of when the subjects of articles wish to interact with the Wikipedia editorial process and change the content of their articles.



It is quite common for Wikipedia to have articles with factual errors. Disturbingly common, in fact. It is also not surprising that the people who have articles about them that are in error would like to have them corrected. What is surprising, at least to me, is that many Wikipedia volunteers have been refusing to correct errors in such articles upon complaint from the subject until and unless the subject goes to the hassle of publishing a statement (online, of course, since these people can't be bothered to read any sort of offline source) correcting the factually false information on Wikipedia. Erik's comments seem intended to address this quite silly behavior, and to the extent that they do so they're on point. (Incidentally, I also agree that the Foundation should not be involved in this process any more than necessary.)



However, the whole role of primary sources in encyclopedic authorship is still underdeveloped at Wikipedia. There seems to be a categorical attitude that primary sources should never be used, not even to verify (or discredit) claims from other sources. In general, this reflects the broad misunderstandings of Wikipedia's No original research policy. Wikipedia should not have original research, yes, this is true. However, verifying facts is not original research; it's verification. Similarly, removing alleged facts which cannot be verified or which are actually found to be false is also not original research; it's, once again, verification. Verification is such a critical part of the encyclopedic authoring process that one must wonder at the sanity of those who propose, with a straight face, that Wikipedians should not do it.



And this is why I oppose the deletion of this template. A private email might not be a valid encyclopedic source to assert a point, but it is definitely a valid source to confirm one. It makes sense to provide the means to record that a fact has been confirmed, by whatever means were used. I imagine it'll be used infrequently, but it might still end up being used from time to time. Deleting a citation format won't prevent people from doing original research; it'll just make it harder for people doing real, proper work to document what they've done.

New blogging software

Blogger seems to have changed something that broke Semagic completely. I'm trying this post with ScribeFire, just to see if it works. If so, then so, if not, then not.



Seems to work. I'm about a month behind on topics, so my next several posts may seem oldish while I clear my topic backlog. You'll just have to live with that. :)