Obsolete:Compress old revisions
There is a script to compress individual old revisions. It has two modes: single revision compression (about 50% of the original space) and concatenated multiple revision compression (about 25%). It needs to be run as root to create the log files.
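For example, before the first run on the Wikimedia setup you would create (as root) the directory the per-run logs are written to; the path here is the one used by the commands below, so adjust it for your installation:
- mkdir -p /home/wikipedia/logs/compressOld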
Concatenated multiple revision compression
This reduces the size of old records to about 25% of the original by combining multiple revisions and compressing them all into one record. It's not available as a configuration setting, so you need to apply it as a batch job.
- cd /home/wikipedia/common/php-new/maintenance
- nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
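While a run is in progress you can follow its output in the log file the command appends to, for example:
- tail -f /home/wikipedia/logs/compressOld/20050108enwiki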
If the preceding run was interrupted after getting as far as Burke, it would be resumed with:
- nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
The -q " cur_namespace not in (10,11,14,15) " part is optional, but it should be used at present for Wikimedia-hosted projects while deletion and undeletion of articles with concatenated compressed revisions remains unavailable. It disables concatenated compression of template and category pages and their talk pages, which are currently being changed at a high rate.
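For reference, namespaces 10, 11, 14 and 15 are Template, Template talk, Category and Category talk. To see how many pages the condition excludes you can count them in the cur table (the database name here is only illustrative):
- mysql enwiki -e "SELECT cur_namespace, COUNT(*) FROM cur WHERE cur_namespace IN (10,11,14,15) GROUP BY cur_namespace"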
It's not a problem to apply concatenated compression to records which are already compressed.
Normal operation looks like this:
[user@zwinger:/home/wikipedia/common/php-1.4/maintenance]$ nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Cleanthes | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
Depending on the size of your database this may take a while!
If you abort the script while it's running it shouldn't harm anything,
but if you haven't backed up your data, you SHOULD abort now!
Press control-c to abort first (will proceed automatically in 5 seconds)
Starting from Cleanthes
Starting article selection query cur_title >= 'Cleanthes' AND cur_namespace not in (10,11,14,15) ...
Cleanthes
Talk:Cleanthes
Wikipedia_talk:Cleanup
..................../...................././
Waiting for 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.24 10.0.0.23
Cleanup
MediaWiki_talk:Cleanup
Wikipedia:Cleanup
........../........../........../........../
.........../........../.........../............./............../
............../............./............./............./............/
............/............/............./.............../
When there are a large number of revisions for an article it's possible that you'll lose connection (timeout) to one of the database servers. Restarting after that is harmless, if irritating:
Waiting for 10.0.0.2 10.0.0.1
A database error has occurred
Query: COMMIT
Function: Database::immediateCommit
Error: 2013 Lost connection to MySQL server during query (10.0.0.1)
Backtrace:
Database.php line 345 calls wfdebugdiebacktrace()
Database.php line 297 calls databasemysql::reportqueryerror()
Database.php line 1345 calls databasemysql::query()
Database.php line 1262 calls databasemysql::immediatecommit()
compressOld.inc line 249 calls databasemysql::masterposwait()
compressOld.inc line 226 calls waitforslaves()
compressOld.php line 74 calls compresswithconcat()
The cause is being investigated; it may be servmon killing the idle slave threads during long master operations, since retrieving all the old records can sometimes take a long time, perhaps 400 seconds for 20,000 records on a lightly loaded master.
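If you suspect the idle slave connections are being killed, the process list and idle-connection timeout on the affected slave (10.0.0.1 in the trace above) are worth checking; whether servmon or the server's own timeout is responsible is still the open question:
- mysql -h 10.0.0.1 -e "SHOW PROCESSLIST"
- mysql -h 10.0.0.1 -e "SHOW VARIABLES LIKE 'wait_timeout'"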
Single revision compression
This produces about a 50% reduction and is also available automatically via a configuration setting ($wgCompressRevisions, which compresses new revisions as they are saved). Use the batch job to apply the compression to existing records if the setting wasn't on before.
- cd /home/wikipedia/common/php-new/maintenance
- nice php compressOld.php en wikipedia -t 1 -c 100 5467442
- -t 1 : the time to sleep between batches, in seconds
- -c 100: the number of old records per batch
- 5467442: the old_id to start at, usually 1 to start. The current value is displayed as the job runs; if you stop the job, note the last value reached and use it to resume later. You get a warning for every record which has already been converted, so don't start much below the point you need.
- a batch size of 5000 is OK off peak (see the example below)
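For example, an off-peak run using that larger batch size and starting from the beginning of the old table would look like this (the values are illustrative):
- nice php compressOld.php en wikipedia -t 1 -c 5000 1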
Completed. It left about 40GB lost to fragmentation. Freeing that will take a table rebuild, but that can't be done on Ariel with an InnoDB table because it would add 40GB to the tablespace for the copy.
Full options
Usage:

Non-Wikimedia:
php compressOld.php [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>] [-a <first-article>] [--exclude-ns0]

Wikimedia:
php compressOld.php <database> [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>] [-f <max-factor>] [-h <factor-threshold>] [--exclude-ns0] [-q <query-condition>]

<type> is either:
- gzip: compress revisions independently
- concat: concatenate revisions and compress in chunks (default)

<start-id> is the old_id to start from.

The following options apply only to the concat type:
- <begin-date> is the earliest date to check for uncompressed revisions
- <end-date> is the latest revision date to compress
- <chunk-size> is the maximum number of revisions in a concat chunk
- <max-factor> is the maximum ratio of compressed chunk bytes to uncompressed avg. revision bytes
- <factor-threshold> is a minimum number of KB, where <max-factor> cuts in
- <first-article> is the title of the first article to process
- <query-condition> is an extra set of SQL query conditions for the article selection query
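As an illustration, a non-Wikimedia installation compressing each revision independently from the start of the old table could run something like the following (the start id is only an example):
- php compressOld.php -t gzip -s 1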
Database fragmentation
Because the compression reduces record sizes it can result in substantial database record fragmentation. In the case of the English-language Wikipedia the old text started at 80GB and was reduced to 40GB, but the MySQL InnoDB storage engine didn't make the space free for reuse by other tables in the tablespace.
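You can see how much reusable room the shared tablespace is reporting via the table status; for InnoDB tables the Comment column includes an "InnoDB free: ... kB" figure (the database name here is only illustrative):
- mysql enwiki -e "SHOW TABLE STATUS LIKE 'old'"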
The space can be fully freed by using alter table old engine=InnoDB, but this requires as much extra free space in the tablespace as the complete new copy of the table requires. If the space isn't available in the tablespace, the tablespace will be enlarged to make room. If you're short of disk space that can be impossible, or it could leave insufficient space for temporary files and logs. In a multiple-wiki situation it's best to apply the compression to the smallest wikis first, alter them to free the space, and move on up to larger sizes. By the time you get to the largest you'll have freed much of the space they will need.
Alternatively, you can temporarily convert some tables to MyISAM using alter table tablename engine=MyISAM to move them out of the tablespace and into the normal free space, freeing space in the tablespace. Once the alter table for the big projects has completed you can use alter table tablename engine=InnoDB to convert them back to InnoDB.
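A minimal sketch of that sequence, with illustrative database and table names:
- mysql enwiki -e "ALTER TABLE recentchanges ENGINE=MyISAM" (park a table in MyISAM to free room in the shared tablespace)
- mysql enwiki -e "ALTER TABLE old ENGINE=InnoDB" (rebuild the big table to reclaim the fragmented space)
- mysql enwiki -e "ALTER TABLE recentchanges ENGINE=InnoDB" (convert the parked table back once the rebuild has finished)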
A combination of both, doing the smaller wikis first and converting some tables in some wikis to MyISAM, may be necessary if space is very tight. For Wikimedia, the minimum safe free disk space is between 9 and 10GB. Even at 10GB there's the risk that a large set of temporary files can leave the server without sufficient log space and break replication.
If you're using MySQL 4.1 there's also the option of keeping tables in their own tablespace files rather than in the main shared tablespace. You'll still need enough free space for the copy of the table, but the main tablespace won't be enlarged.
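Assuming the option in question is innodb_file_per_table (added in MySQL 4.1), you can check whether it's enabled with:
- mysql -e "SHOW VARIABLES LIKE 'innodb_file_per_table'"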
Compression results
Some raw data for the Wikimedia compression in the week preceding 18 February 2005. Where present, the after-compression data is in FlagCount format while the before data is currently in countFlag format. Times are the run time on ariel for the alter table to free the space. Free space in the Ariel tablespace went from 8GB to 15.7GB.
changes for this set: -e 20050108000000 -q " cur_namespace not in (10,11,14,15) "
meta 97517 rec size pre 471465984 post 446283776 94.66% ariel 4:13 min 570 no;3 0;945gzip;79672object;15956utf-8,gzip. 543/1/841/84438/12034
commons 79446 rec size pre 129253376 post 94978048 73.48% ariel 0:40 min 904 no;4 0;22889gzip;54666utf-8,gzip. 896/2/5181/38078/36125
sources 38486 rec size pre 765607936 post 590938112 77.19% ariel 12:32 min 6247 no;1 0;21184gzip;10966utf-8,gzip. 287/2653/29698/5994
hewiki 230649 rec size pre 905347072 post 453722112 50.12% ariel 7:12 min 1st 1630 no;29 0;175031gzip;52771utf-8,gzip. 83/8/6684/189966/34989
etwiki 61803 rec size pre 195248128 post 70844416 36.28% ariel 1:40 min 12145 no;1 0;39846gzip;9575utf-8,gzip. 413/1/2115/53765/5747
cawiki 65178 rec size pre 147292160 post 75038720 50.95% ariel 2:56 min 2699 no;2 0;52742gzip;9543utf-8,gzip. 50/4210/55272/5796
huwiki 67255 rec size pre 425934848 post 212418560 49.87% ariel 7:20 min 7174 no;1 0;43199gzip;16524utf-8,gzip. 222/1572/56323/9402
slwiki 80759 rec size pre 199589888 post 93929472 47.06% ariel 2:56 min 1235 no;63942gzip;14541utf-8,gzip. 77/2893/69033/9138
nowiki 105901 rec size pre 228016128 post 129597440 56.84% ariel 2:52 min 1581 no;72071gzip;31435utf-8,gzip. 60/5933/80965/19431
bgwiki 108585 rec size pre 219398144 post 117014528 53.33% ariel 2:39 min 3036 no;1 0;83734gzip;21563utf-8,gzip. 328g5978o93974u8451
ruwiki 111898 rec size pre 382042112 post 210354176 55.06% ariel 3:51 min 1119 no;12 0;78024gzip;32120utf-8,gzip. 81/8g3476o89080u18929
eowiki 120872 rec size pre 203833344 post 97091584 47.63% ariel 3:13 min 1631 no;106402gzip;12641utf-8,gzip. 45g5167o108343u7476
fiwiki 127004 rec size pre 376274944 post 310099968 82.24% ariel 4:14 min 1st 1697 no,9 0,96339 gzip,28367utf-8,gzip.
dawiki 168081 rec size pre 537673728 post 122273792 22.74% ariel 3:52 min 125298 no;14 0;42385gzip. 5406/1g10781o152094
enwikiquote 37614 rec size pre 279101440 post 115933184 41.54% ariel 3:13 min 10331 no;20167gzip;7004utf-8,gzip. 341g842o32676u3850
enwikibooks 91536 rec size pre 618070016 post 238649344 38.61% ariel 3:32 min 25797 no;5 0;49515gzip;15850utf-8,gzip. 805g2852o77213u10923
enwikinews 24421 rec size pre 117719040 post 77135872 65.53% ariel 0:56 min 1st 1420 no,5496gzip,17097utf-8,gzip
enwiktionary 153900 rec size pre 389431296 post 217677824 55.90% ariel 2:26 min 1st 7048 no,3 0,119327gzip,26851utf-8,gzip
zhwiki 297194 rec size pre 586022912 post 481017856 82.08% ariel 11 min 38 no;6 0;6565gzip;243090object;49149utf-9,gzip. 38/6g6562o243090u55444
eswiki 457025 rec size pre 1615773696 post 1339899904 82.93% ariel 22.5 min 5 no;13949gzip;391189object;53472utf-8,gzip. 5g13910o391179u57245
itwiki 396178 rec size pre 1466810368 post 1101873152 75.12% ariel 19.5 min 61 no;16 0;17262gzip;329671object;50929utf-8,gzip
svwiki 416266 rec size pre 797802496 post 475807744 59.64% ariel 8 min 11692 no;3 0;55069gzip;350841object
nlwiki 792247 rec size pre 2745171968 post 1545601024 56.30% ariel 26 min 12693 no;12 0;94620gzip;687934object
plwiki 551211 rec size pre 733937664 post 650002432 88.56% ariel 9 min 54 no;10 0;12825gzip;473302object;67028utf-8,gzip
frwiki 1428554 rec size pre 5574230016 post 4383047680 78.63% ariel 47 min 1303 no;44 0;105870gzip;1148205object;177333utf-8,gzip
jawiki 1390023 rec size pre 3605004288 post 2899312640 80.43% ariel 46 min 1320 no;432 0;135797gzip;1039149object;217480utf-8,gzip
dewiki 4327741 rec size pre 15771467776 post 13693353984 86.84% ariel 159:46 min (count time 2518 sec = 42 minutes)
Compression for en is ongoing. Currently needs to be resumed with:
nice php compressOld.php en wiki -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Surfers | tee -a /home/wikipedia/logs/compressOld/20050108enwiki