Thursday, April 19, 2007

Motivation and non-English versions of Wikipedia

One of the things I noticed when Greg and I did our study of edits by project and country of origin last year is that the Indian language projects are very lightly trafficked, and that the traffic patterns didn't correspond to what we saw on other projects. For languages where most of the speakers of the language are to be found in one country, we saw that typically 65 to 95 percent of the edits to that language's project came from that country. This pattern covers German (76.2% from Germany), French (67.3% from France), Japanese (93.2% from Japan), Polish (80.1% from Poland), Italian (85.7% from Italy), Dutch (73.0% from the Netherlands), Swedish (72.2% from Sweden), Finnish (77.4% from Finland), Hebrew (79.2% from Israel), Czech (79.5% from the Czech Republic), Norwegian (66.1% from Norway), and Korean (67.6% from South Korea).

Several languages did not meet this pattern: English, Spanish, Portuguese, Russian, Chinese, Slovak, Danish, Romanian, Esperanto, Bengali, Tamil, Kannada, Telugu, and Marathi. English is widely spoken worldwide, and this shows in the edit source patterns. There are a large number of Russian speakers throughout Europe and North America, and this also shows. Spanish is the national language of a large number of countries in South and Central America; if the edits from these were all collected together and counted along with those from Spain, they would easily exceed the 65% cutoff. Similarly, Portugal and Brazil together account for 89% of edits to the Portuguese edition. Edits from China, Hong Kong and Taiwan together amount to about 60.6% of edits to the Chinese edition, and it's likely that many editors to the Chinese edition show up as being elsewhere due to the use of tunneling to evade the Great Firewall of China. Edits to Slovak from the Czech Republic and the Slovak Republic combine to 89.4% of that project's total. Denmark plus the Netherlands amount to 81.5% of edits to Danish. Only Romania lacks a good excuse for its low native participation, and I suspect that the repressive politics there in the not too distant past accounts for some of that. That leaves just Esperanto (an artificial language with no native land) and several Indian native languages.

All of the Indian languages in question are spoken mainly in India (Kannada, Telugu, Marathi, Tamil), Bangladesh (Bengali) or Sri Lanka (Tamil). Yet none of these has a majority of its contributors in any of these countries, except for Bengali (60.6%). In the cases of Kannada and Marathi, a majority of the edits come from the United States. In general, Indian languages seem to get as much participation from outside the country as inside it. Now, some of this could be explained by a lack of access in the country. However, Bangladesh, which is likely to have Internet access levels comparable (if not inferior) to India, manages to score 60.6% of the edits to its project. I think something more is going on.

Utkarshraj Atmaram's comments (from his blog) explain why Americans seem more interested in Kannada than Indians. And I've heard similar comments elsewhere. Indians (and this appears to be an Indian thing not shared by Bangladeshis, who appear to have a much stronger linguistic national identity than Indians do) just do not see their native languages as languages of learning.

Add to that the motivational aspects of participating in Wikipedia. One can either write articles for, say, Kannada, where relatively few people will read them, or for English, where far more people will read them. I suspect a lot of people are at least somewhat motivated by the desire to have their content read, and writing on an internet backwater like the Kannada Wikipedia doesn't stroke that quite so much as writing on the English Wikipedia. On the other hand, at least some Americans editing Kannada are likely doing so because it makes them feel like they're "empowering a disadvantaged group" (Kannada speakers) or something.

In any case, it seems to me that Jimbo's frequent focus on Indian native languages is not well-placed. Wikipedia isn't going to change the attitudes of the Indian people toward English or their native languages. A far more fertile area to target would be Middle Eastern languages, some African languages (although English is nearly as pervasive in Africa as in India, due to the same sort of colonial histories), and North (and South) American aboriginal languages (the last area being one that Jeff Merkey has apparently put a lot of effort into). In general, languages which have a lot of national identity tied up in them (e.g. Belorussian, which was recently the subject of a great deal of division) are going to be more fruitful than languages whose speakers don't seem to care a whole lot about them.


  1. a) Expat Indians have this whole "native" thing, where they want to show contributions to the native culture.

    b) Internet access in India is restricted to pretty much the richer classes, who are usually more comfortable writing in English than their own mother tongue (I include myself in this, because I tend to correspond more with non-local-language speakers than locals). Computers are still fairly expensive in India, where a cheap PC can be priced at two to three months of family income.

    Most access from India is from cyber cafes, where you get billed by the hour, or from a broadband connection, where you get billed by the byte.

    c) Indic language computing lags far behind computing in most other languages. Both Microsoft Windows and Linux have gotten usable local language interfaces only fairly recently. The first version of Windows with Indic support was reportedly Windows 2000.

    Linux is getting usable thanks to the Indlinux project, but still lacks good fonts.

    As computer penetration increases in the Indian hinterland, we should start seeing changes (and content).

  2. To anonymouse:

    That still doesn't explain bnwiki. The conditions in Bangladesh would seem to be as bad for a Wikipedia as the ones you described, if not worse.

  3. Good analysis. yes, nationality attached to a particular language is very important. second is the language in which the highly elite, upper class people educate themselves. as far as india goes, most of the higher, technical education is in english. so, they don't feel a need to write in the local language. even if people wish to write in the local language, they lack a lot of technical words and practice in writing local languages.

    regarding edits from outside the land of the language spoken - these people have the comfort of unlimited internet access, time to spend and an urge to contribute in some way to their mother tongue