Text storage data

From Wikitech

Raw Data

Text row types as of 2010-02-18. All databases.

Count       Type
------------------------------------------------
9           0,external/simple pointer
435         0/[none]
1482941     [none]/[none]
74069       external,object/simple pointer
56103275    external,utf-8/CGZ pointer
329027392   external,utf-8/DHB pointer
472766      external,utf8/CGZ pointer
7409721     external,utf8/DHB pointer
2890300     external/CGZ pointer
12218       external/simple pointer
4113780     gzip,external/simple pointer
968957      gzip/[none]
178234      object,external/simple pointer
387         object,utf-8/ConcatenatedGzipHistoryBlob
1413        object,utf-8/HistoryBlobStub
216694      object/concatenatedgziphistoryblob
5842435     object/historyblobcurstub
1121994     object/historyblobstub
1           utf-8,external/simple pointer
464549188   utf-8,gzip,external/simple pointer
17076928    utf-8,gzip/[none]
1269        utf-8/[none]

Text row types as of 2011-08-31. All wikis. (See RT:1300 for details on how these numbers were generated.)

       Count    Type
------------------------------------------
           9    0,external/simple pointer
         437    0/[none] 
     1473028    [none]/[none] 
    64751783    external,utf-8/CGZ pointer
   363080206    external,utf-8/DHB pointer
      484000    external,utf8/CGZ pointer
     7579959    external,utf8/DHB pointer
     1180404    external/CGZ pointer
       12218    external/simple pointer
     9905103    gzip,external/simple pointer
      968337    gzip/[none] 
      181328    object,external/simple pointer
         387    object,utf-8/ConcatenatedGzipHistoryBlob 
        1413    object,utf-8/HistoryBlobStub 
      219570    object/concatenatedgziphistoryblob 
     5866400    object/historyblobcurstub 
     1046202    object/historyblobstub 
           1    utf-8,external/simple pointer
   797612169    utf-8,gzip,external/simple pointer
    17173393    utf-8,gzip/[none] 
        1269    utf-8/[none] 

Change from 2010-02-18 to 2011-08-31:

  2010-02-18   2011-08-31         Diff   Type
---------------------------------------------------------------------
           9            9            0   0,external/simple pointer
         435          437            2   0/[none] 
     1482941      1473028        -9913   [none]/[none] 
       74069          n/a          n/a   external,object/simple pointer
    56103275     64751783      8648508   external,utf-8/CGZ pointer
   329027392    363080206     34052814   external,utf-8/DHB pointer
      472766       484000        11234   external,utf8/CGZ pointer
     7409721      7579959       170238   external,utf8/DHB pointer
     2890300      1180404     -1709896   external/CGZ pointer
       12218        12218            0   external/simple pointer
     4113780      9905103      5791323   gzip,external/simple pointer
      968957       968337         -620   gzip/[none] 
      178234       181328         3094   object,external/simple pointer
         n/a          387          n/a   object,utf-8/ConcatenatedGzipHistoryBlob 
         n/a         1413          n/a   object,utf-8/HistoryBlobStub 
        1800          n/a          n/a   object,utf-8/[none] 
      216694       219570         2876   object/concatenatedgziphistoryblob 
     5842435      5866400        23965   object/historyblobcurstub 
     1121994      1046202       -75792   object/historyblobstub 
           1            1            0   utf-8,external/simple pointer
   464549188    797612169    333062981   utf-8,gzip,external/simple pointer
    17076928     17173393        96465   utf-8,gzip/[none] 
        1269         1269            0   utf-8/[none] 

Analysis

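On the changes from 2010 to 2011

The rise in "object/historyblobcurstub" (+23965) doesn't really make sense: these rows are created by the 1.5 upgrade script, so no new ones would be expected. The rise in "gzip,external/simple pointer" (+5791323) is concerning: by the descriptions below, this type is legacy-encoded compressed text moved by MTE, yet its count more than doubled.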


Description of fields and values

[none]/[none] 
Uncompressed text, legacy encoding
0/[none] 
Uncompressed text, wrong flags due to short-lived bug, never cleaned up
0,external/simple pointer 
As above plus MTE
gzip/[none] 
Compressed text with legacy encoding. Possibly created with Brion's original CO.
gzip,external/simple pointer 
As above plus MTE
utf-8/[none] 
Uncompressed MW 1.5+
utf-8,external/simple pointer 
As above plus MTE
utf-8,gzip/[none] 
Compressed MW 1.5+, probably generated directly by MW
utf-8,gzip,external/simple pointer 
Either as above plus MTE, or directly generated by MW (predominant non-recompressed type)
object,utf-8/ConcatenatedGzipHistoryBlob 
Presumably created by a brief enwiki-only run of CO, in MW 1.5+.
object,utf-8/HistoryBlobStub 
Stubs for the above CO run
object/concatenatedgziphistoryblob 
Object created by CO, MW<1.5
object,external/simple pointer 
As above plus MTE
object/historyblobcurstub 
Created by the 1.5 upgrade script, a reference to the cur table.
object/historyblobstub 
Pointer to a CGZ object, created by CO, MW<1.5
external,object/simple pointer 
Possibly JOMTE
external/simple pointer 
JOMTE?
external,utf-8/CGZ pointer 
Late CO, RS or RCT
external,utf-8/DHB pointer 
RCT
external,utf8/CGZ pointer 
RCT with buggy encoding name, <r45205
external,utf8/DHB pointer 
RCT <r45205
external/CGZ pointer 
RS. Perhaps CO in MW<1.5 also created these.
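
The type strings above appear to follow a flags/object pattern: comma-separated flags before the slash, and the object class or pointer kind after it, with "[none]" marking an empty part. As a minimal sketch of reading them programmatically (the function name parse_row_type and the returned field names are illustrative assumptions, not part of MediaWiki or the stats scripts):

```python
def parse_row_type(row_type):
    """Split a 'flags/object' type string as listed on this page."""
    flags_part, _, object_part = row_type.strip().partition("/")
    # "[none]" on either side of the slash means that part is empty.
    flags = [] if flags_part == "[none]" else flags_part.split(",")
    return {
        "flags": flags,
        "object": None if object_part == "[none]" else object_part,
        "external": "external" in flags,
        "gzip": "gzip" in flags,
        # "utf8" is the buggy spelling of "utf-8" written by RCT before r45205.
        "utf8": "utf-8" in flags or "utf8" in flags,
    }


if __name__ == "__main__":
    print(parse_row_type("utf-8,gzip,external/simple pointer"))
```

Rows with neither utf-8 spelling in their flags are the ones described above as legacy encoding, which is why the sketch exposes that as a separate field.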

Legend

CO 
compressOld.php.
MTE 
moveToExternal.php.
MW 
MediaWiki
JOMTE
JeLuF's original move to external. I think there was an SQL script or something similar that he used to move some text when external storage was first set up, but I can't find it now.
RCT 
recompressTracked.php. The latest and greatest recompression script.
RS 
resolveStubs.php

How these stats were generated

The scripts storageTypeStatsDiff.py and storageTypeStatsSum.py are in svn, under extensions/WikimediaMaintenance/storage/.

To collect the stats, gather info for every wiki db (this step takes about 24 hours):

 ben@hume:~$ cd /home/w/bin/
 ben@hume:bin$ ./foreachwiki maintenance/storage/storageTypeStats.php > /tmp/storageTypeStats.log
 ben@hume:bin$ scp /tmp/storageTypeStats.log fenari:

To sum the stats across all wikis, run the log through storageTypeStatsSum.py:

 ben@fenari:~$ cd svn/extensions/WikimediaMaintenance/storage/
 ben@fenari:storage$ ./storageTypeStatsSum.py ~/storageTypeStats.log > current-YYYY-MM-DD

To calculate the differences, grab the previous stats from this page, store them in a date-named file and compare them:

 ben@fenari:storage$ cat <<EOOLDSTATS > <old-date>
 <paste in content from this wiki page>
 EOOLDSTATS
 ben@fenari:storage$ ./storageTypeStatsDiff.py <old-date> <current-date> > /tmp/storageDiffs.log
 ben@fenari:storage$ rm <old-date> <current-date>

Finally, paste the new values and the diff into this wiki page.
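
For readers without access to the svn scripts, the comparison step can be approximated as follows. This is only a sketch under assumptions: it is not the code of storageTypeStatsDiff.py, the function names parse_counts and diff_counts are made up for illustration, and the input is assumed to be the "Count  Type" table layout shown in the Raw Data section.

```python
def parse_counts(text):
    """Read 'count type' lines into a dict; header, separator and blank lines are skipped."""
    counts = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        counts[parts[1].strip()] = int(parts[0])
    return counts


def diff_counts(old, new):
    """Return (type, old, new, diff) tuples; None stands for n/a."""
    rows = []
    for row_type in sorted(set(old) | set(new)):
        o = old.get(row_type)
        n = new.get(row_type)
        # A type present in only one snapshot has no meaningful difference.
        d = n - o if o is not None and n is not None else None
        rows.append((row_type, o, n, d))
    return rows


if __name__ == "__main__":
    old = parse_counts("Count  Type\n-----\n435  0/[none]\n968957  gzip/[none]\n")
    new = parse_counts("437  0/[none] \n968337  gzip/[none] \n")
    for row in diff_counts(old, new):
        print(row)
```

A type that appears in only one of the two snapshots gets no difference, which corresponds to the n/a entries in the change table.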

Bugs

  • Bug 950: botched conversion from latin1 to UTF-8 on es.wiktionary.org. See the historical worksheet compression corruption.
  • Bug 22624: compressOld.inc with CGZ may have been run as early as October 2004, but r6640, which prevented CGZ blobs from being moved to the archive table, was not committed until December 2004. The English Wikipedia archive table now has 892 CGZ blobs, 1541 HistoryBlobStub objects, and 510 "external,object" rows. These all need to be fixed urgently, since RCT will destroy them.