Obsolete:Wiki farm

This page contains historical information. It may be outdated or unreliable.

The configuration that Wikimedia uses for MediaWiki is quite different from the one documented for external use. I thought I'd give a brief explanation of how running 374 wikis on 7 top-level domains differs from running a single wiki, and how we overcome the technical challenges involved.

The Matrix

There are currently four multi-subdomain "sites" operated by Wikimedia: Wikipedia, Wiktionary, Wikibooks and Wikiquote. Our setup is unusual in that instead of using a database prefix to indicate which site a wiki belongs to, we use a database suffix. This is for historical reasons. The matrix below lists the four sites and the language codes they can be combined with; only some of the possible combinations actually exist as wikis.

Projects (columns): Wikipedia (w), Wiktionary (wikt), Wikibooks (b), Wikiquote (q).

Languages (rows): aa, ab, af, ak, als, am, an, ar, arc, as, ast, av, ay, az, ba, be, bg, bh, bi, bm, bn, bo, br, bs, ca, ce, ch, cho, chr, chy, co, cr, cs, csb, cv, cy, da, de, dv, dz, ee, el, en, eo, es, et, eu, fa, ff, fi, fj, fo, fr, fy, ga, gd, gl, gn, gu, gv, ha, haw, he, hi, ho, hr, ht, hu, hy, hz, ia, id, ie, ig, ii, ik, io, is, it, iu, ja, jv, ka, kg, ki, kj, kk, kl, km, kn, ko, kr, ks, ku, kv, kw, ky, la, lb, lg, li, ln, lo, lt, lv, mg, mh, mi, minnan, mk, ml, mn, mo, mr, ms, mt, mus, my, na, nah, nb, nds, ne, ng, nl, nn, no, nv, ny, oc, om, or, pa, pi, pl, ps, pt, qu, rm, rn, ro, roa-rup, ru, rw, sa, sc, sd, se, sg, sh, si, simple, sk, sl, sm, sn, so, sq, sr, ss, st, su, sv, sw, ta, te, tg, th, ti, tk, tl, tlh, tn, to, tokipona, tpi, tr, ts, tt, tw, ty, ug, uk, ur, uz, ve, vi, vo, wa, wo, xh, yi, yo, za, zh, zh-cfr, zu.

There are also a number of "special" wikis:

  • sources (Wikisource)
  • Wikinews
  • meta
  • sep11 (September 11 Memorial)
  • wikimedia (experimental)
  • mediawiki

There are also a few experimental wikis that have their own script directories and so don't need to be listed in all.dblist. They aren't backed up by the normal process and won't be included in maintenance operations:

  • test
  • rel12test
  • code.wikimedia.org

History

In the beginning, all wikis had database names ending in "wiki": for example, frwiki for the French Wikipedia, metawiki for Meta, and textbookwiki for Wikibooks. This scheme was broken when, by popular demand, Brion added French and Polish Wiktionaries with the database names "frwiktionary" and "plwiktionary". These were the first language-specific subdomains outside Wikipedia. Unfortunately this didn't fit in too well with various maintenance scripts, which assumed that the database name could be obtained by concatenating the "language" (from /home/wikipedia/common/langlist) with "wiki". This was a rather loose definition of language, which included entries such as meta.

At this time, every wiki had its own directory in htdocs, containing a "skeleton" LocalSettings.php. This skeleton file set the $lang variable appropriately and then passed processing on to CommonSettings.php. Also, every wiki had a separate <VirtualHost *> entry in the Apache configuration and a separate MySQL GRANT to wikiuser. This was difficult to maintain. Faced with demand for more Wiktionaries, I decided to make some changes.
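A skeleton file of this kind might have looked roughly like the following (a hypothetical sketch, not the actual file; the exact variable names beyond $lang and the include path are assumptions):

  <?php
  # Hypothetical per-wiki skeleton LocalSettings.php, e.g. for the French Wikipedia.
  # It only selects the wiki, then hands processing over to the shared configuration.
  $lang = 'fr';
  $site = 'wikipedia';
  require_once '/home/wikipedia/common/CommonSettings.php';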

I decided to create companion Wiktionaries for all existing Wikipedias. I did this by moving to a shared document root layout. A single VirtualHost section was created with a ServerAlias of *.wiktionary.org. All wiktionaries had the same document root. In CommonSettings.php, the language was detected by retrieving the hostname from Apache. At the time I couldn't work out how to keep the same URLs for the upload directories, so I set them up with the /upload/en/0/0/Thing.png URL style, that is, including the language. I later realised that a rewrite rule could be used to rewrite traditional upload URLs to language-specific URLs. This involves a little trick with a RewriteCond that always matches. I also converted the MySQL permissions to use database wildcards, removing the need to add grants for every added wiki.
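As a rough illustration of the shared document root and the upload rewrite trick (a hypothetical sketch, not the production configuration; paths, hostnames and the exact rules are assumptions):

  <VirtualHost *>
      ServerName wiktionary.org
      ServerAlias *.wiktionary.org
      DocumentRoot /home/wikipedia/htdocs/wiktionary.org

      RewriteEngine On
      # This condition always matches; its only purpose is to capture the
      # language subdomain into %1 for use in the rule below.
      RewriteCond %{HTTP_HOST} ^([a-z-]+)\.
      # Rewrite traditional upload URLs to the language-specific layout,
      # e.g. /upload/0/0/Thing.png -> /upload/en/0/0/Thing.png
      RewriteRule ^/upload/(.*)$ /upload/%1/$1 [L]
  </VirtualHost>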

Auto-creation

This was all very well, but it became obvious that the sheer number of wikis was making maintenance difficult. Each of the 300 wikis had its own MediaWiki namespace with a copy of about 750 messages. Updating these messages took a long time. Other kinds of maintenance tasks were also tedious. There was a lot of demand from the users for a multi-subdomain layout in other projects. Adding languages was a tedious, error-prone, time-consuming process, which developers had to perform on a very regular basis. I decided that I needed to automate the process. At first I wrote a command-line script to add languages, but the script was complicated and needed developer involvement due to the unwieldy legacy layout of the Wikipedias. For a shared document-root layout, the only thing a script needs to do is to set up the database. Armed with my new upload rewriting trick, I decided to convert Wikipedia to a shared document root layout. Instead of creating 150 new wikis for Wikibooks and 150 for Wikiquote, I decided to make an on-demand system, with a script invoked by the user to create new wikis. This consists of the following components.

missing.php

/home/wikipedia/common/php-new/missing.php

This script is invoked by CommonSettings.php if the detected hostname does not correspond to an existing wiki. "Existing wikis" are those listed in /home/wikipedia/common/all.dblist. This script displays some nice-looking HTML. If the subdomain is in $wgLanguageNames (from Names.php), it also displays a "create wiki" button. Clicking on this button adds a line to /home/wikipedia/logs/addwiki_requests. Since security restrictions do not allow the apache user to create tables, the requests are fulfilled by an hourly cron job running as tstarling, which invokes a command-line script called addwiki.php.
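Roughly speaking, the relevant dispatch in CommonSettings.php might look like this (a simplified sketch; the variable names and exact host parsing are assumptions):

  # Determine the wiki from the hostname, e.g. nl.wikibooks.org -> nlwikibooks
  $host = $_SERVER['SERVER_NAME'];                 # "nl.wikibooks.org"
  list( $lang, $site ) = explode( '.', $host );    # "nl", "wikibooks"
  $dbname = $lang . ( $site == 'wikipedia' ? 'wiki' : $site );

  $existing = array_map( 'trim', file( '/home/wikipedia/common/all.dblist' ) );
  if ( !in_array( $dbname, $existing ) ) {
      # Unknown wiki: show the "this wiki does not exist yet" page, with a
      # "create wiki" button if $lang is a known language code.
      require '/home/wikipedia/common/php-new/missing.php';
      exit;
  }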

addwiki.php

/home/wikipedia/common/php-new/maintenance/addwiki.php

This script creates wikis based on requests filed in addwiki_requests. To prevent an attack by a script automatically requesting creation of all wikis, at most one request per hour is fulfilled. A particularly difficult part of writing this script (and indeed a difficult part of adding wikis before the script was written) is handling interwiki links. I gave up on trying to write a script to incrementally add links, and instead used rebuildInterwiki.inc.
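The throttling amounts to something like the following (purely illustrative, not the real addwiki.php; the "addwiki_done" log and the createWiki() helper are hypothetical):

  $requests = array_map( 'trim', file( '/home/wikipedia/logs/addwiki_requests' ) );
  $done     = array_map( 'trim', file( '/home/wikipedia/logs/addwiki_done' ) );
  $pending  = array_values( array_diff( $requests, $done ) );

  if ( count( $pending ) > 0 ) {
      # Fulfil only the first pending request; since the cron job runs hourly,
      # at most one wiki can be created per hour no matter how many are requested.
      $dbname = $pending[0];
      createWiki( $dbname );   # hypothetical helper: creates the database and tables
      $fp = fopen( '/home/wikipedia/logs/addwiki_done', 'a' );
      fwrite( $fp, "$dbname\n" );
      fclose( $fp );
  }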

rebuildInterwiki.inc

/home/wikipedia/common/php-new/maintenance/rebuildInterwiki.inc

This script rebuilds all interwiki tables by looping through all.dblist. For each database, it truncates the interwiki table and then reinserts all necessary entries in a single multi-row insert statement. Strictly speaking, it doesn't execute anything itself; it just returns the SQL to do so. The SQL is executed by addwiki.php using dbsource(). There's about 4.6 MB of SQL altogether, and it takes a few minutes to run.
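As a rough illustration of the kind of SQL generated for each database (a simplified sketch, not the actual script; the function name and the prefix map are assumptions):

  function interwikiSqlFor( $dbname, $prefixes ) {
      # $prefixes maps interwiki prefix => target URL,
      # e.g. 'w' => 'http://en.wikipedia.org/wiki/$1'
      $sql = "USE $dbname;\nTRUNCATE TABLE interwiki;\n";
      $rows = array();
      foreach ( $prefixes as $prefix => $url ) {
          $rows[] = "('" . addslashes( $prefix ) . "', '" . addslashes( $url ) . "', 1)";
      }
      # One multi-row insert per database keeps the total number of statements down
      $sql .= "INSERT INTO interwiki (iw_prefix, iw_url, iw_local) VALUES\n  "
            . implode( ",\n  ", $rows ) . ";\n";
      return $sql;
  }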

Special wikis

There are always special cases left over, and these come under the "special wiki" banner. Special wikis such as sep11 were absorbed into the *.wikipedia.org handling. Special wikis which are not subdomains of wikipedia.org were left at their original locations in the htdocs directory, each with its own document root. Skeleton LocalSettings.php files were done away with some time ago; instead, CommonSettings.php constructs the database name by concatenating the last component of the document root with "wiki". So meta has a document root of /home/wikipedia/htdocs/meta, and hence is assigned a database name of metawiki. For the purposes of CommonSettings.php, such wikis are considered to be Wikipedias ($site is set to "wikipedia"). The hostname used by MediaWiki needs to be overridden explicitly so that self-referential URLs can be constructed.
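In other words, the logic amounts to something like this (a simplified sketch; the exact variable names used in CommonSettings.php are assumptions):

  # e.g. DOCUMENT_ROOT = /home/wikipedia/htdocs/meta
  $lang = basename( $_SERVER['DOCUMENT_ROOT'] );   # "meta"
  $site = 'wikipedia';                             # special wikis count as Wikipedias
  $wgDBname = $lang . 'wiki';                      # "metawiki"
  $wgServer = 'http://meta.wikimedia.org';         # hostname overridden explicitly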

MediaWiki configuration

Much to our chagrin, the communities of the individual wikis like to have their own individual settings, resisting our attempts to homogenise them all with great tenacity. CommonSettings.php used to be a mess of switch($lang) structures and special cases, and with so many wikis, things were steadily getting uglier. I decided that we needed to move the settings from code to data.

This problem is resolved by the SiteConfiguration object. This object stores a two-dimensional array, with the names of the settings as the first index and the names of the wikis as the second index. The keys in the second index can be of three types, checked in this order:

  1. Database name
  2. Site name (wikipedia, wiktionary, etc.)
  3. "default"

The object provides a method to extract all defined settings into the global scope. That is, it sets global variables. If no database-specific setting exists for a given variable, it will check to see if there is a site default. If there is no site default, it will check for a global default. If there is no global default, it will not set the variable, and hence the value set in DefaultSettings.php will be used. At some stage I intend to add language-wide settings between #1 and #2.
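The lookup order can be sketched like this (a minimal sketch; the real SiteConfiguration class has a richer interface, and the function name here is made up):

  function lookupSetting( $settings, $name, $dbname, $site ) {
      if ( isset( $settings[$name][$dbname] ) ) {
          return $settings[$name][$dbname];       # 1. per-database setting
      }
      if ( isset( $settings[$name][$site] ) ) {
          return $settings[$name][$site];         # 2. per-site setting (wikipedia, wiktionary, ...)
      }
      if ( isset( $settings[$name]['default'] ) ) {
          return $settings[$name]['default'];     # 3. global default
      }
      return null;  # 4. leave the global unset, so DefaultSettings.php applies
  }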

The initial idea was to construct this SiteConfiguration object only occasionally, and to store it in serialised form on NFS. But I decided this would make changing settings difficult, so a new object is constructed on every request. Caching is still possible in principle: the object provides for delayed variable expansion, so that strings such as "$lang" can be stored in the cache and then expanded on each invocation.

Backup

Backups occur on manual request, by running /home/wikipedia/bin/backup-all. It uses the site-specific database name lists, e.g. wiktionary.dblist and wikibooks.dblist, so that the HTML pages on backup.wikipedia.org are of a manageable size. The backup script dumps SQL, compresses it and makes MD5 checksums.

How-To

Some how-to pages related to the wiki farm: