Data Platform/Data Lake/Edits/Public

Right now, this page is a draft where we will work out the best way to publish this dataset. With some compression, we have roughly five billion events adding up to one terabyte of data.

Ideas for splitting

Split by wiki with grouping

Split by wiki, but group all wikis with fewer than ten million events. This results in about 50 separate files, which is nice and manageable. These may be further split into 3 separate files for user, page, and revision histories, depending on the size and ease of working with the data. The downside is that as wikis grow past ten million events, they will move to their own separate file, potentially causing some confusion. A possible mitigation is a machine-readable index of where each wiki lives (a sketch of such an index appears at the end of this section). Dan Andreescu is currently investigating this approach.

  • Grouping wikis with fewer than 10 million events results in about 50 output files, or about 150 if further split by entity.
  • Grouping wikis with fewer than 30 million events results in about 25 output files, but it grows the "all others" group to almost the same size as the English and Wikidata wikis, and it leaves no individual "small" wikis, which would be useful for people who want to test their analysis before downloading a bigger set.
-- Hive query over milimetric.history_count_by_wiki: each wiki's share of all
-- events, folding wikis with 10 million events or fewer into "all others".
with with_count as (
 select wiki_db,
        sum(events) t
   from milimetric.history_count_by_wiki
  group by wiki_db

), with_label as (
 -- label wikis with 10 million events or fewer as 'all others'
 select if(t > 10000000, wiki_db, 'all others') wiki,
        t
   from with_count

)

 select wiki,
        -- 5031314059 is the total number of events across all wikis
        sum(t) / 5031314059 as ratio
   from with_label
  group by wiki
  order by ratio desc
  limit 1000
;
Wiki Ratio of total events
wikidatawiki 0.206
enwiki 0.203
all others 0.100
commonswiki 0.090
dewiki 0.041
frwiki 0.036
eswiki 0.027
itwiki 0.024
ruwiki 0.023
jawiki 0.016
viwiki 0.014
zhwiki 0.013
ptwiki 0.013
enwiktionary 0.013
plwiki 0.013
nlwiki 0.012
svwiki 0.011
metawiki 0.011
arwiki 0.009
shwiki 0.009
cebwiki 0.007
mgwiktionary 0.007
fawiki 0.007
frwiktionary 0.006
ukwiki 0.006
hewiki 0.006
kowiki 0.006
srwiki 0.005
trwiki 0.005
loginwiki 0.005
huwiki 0.005
cawiki 0.005
nowiki 0.004
mediawikiwiki 0.004
fiwiki 0.004
cswiki 0.004
idwiki 0.004
rowiki 0.003
enwikisource 0.003
frwikisource 0.003
ruwiktionary 0.002
dawiki 0.002
bgwiki 0.002
incubatorwiki 0.002
enwikinews 0.002
specieswiki 0.002
thwiki 0.002
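
As a rough illustration of the machine-readable index mentioned above, here is a minimal Python sketch that writes a JSON mapping from each wiki to the dump file that contains it. The file names, the wiki_index.json name, and the example event counts (eyeballed from the table above) are assumptions, not a decided layout.

import json

# Illustrative per-wiki event counts, roughly derived from the ratios above;
# a real index would be built from milimetric.history_count_by_wiki.
wiki_event_counts = {
    "wikidatawiki": 1_036_000_000,
    "enwiki": 1_021_000_000,
    "dewiki": 206_000_000,
    "cawiki": 25_000_000,
    "somesmallwiki": 1_200_000,   # hypothetical wiki below the threshold
}

THRESHOLD = 10_000_000  # wikis at or below this stay in the shared file

def dump_file_for(wiki, events):
    """Return the (assumed) dump file that contains this wiki's history."""
    return f"{wiki}.tsv.gz" if events > THRESHOLD else "all_others.tsv.gz"

# The index tells downloaders which file to fetch for each wiki, even after
# a growing wiki moves out of the "all others" group into its own file.
index = {wiki: dump_file_for(wiki, events)
         for wiki, events in wiki_event_counts.items()}

with open("wiki_index.json", "w") as f:
    json.dump(index, f, indent=2, sort_keys=True)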

Split by wiki, data set, and time in gzipped TSVs

In this splitting idea, the directory structure is:

base_path/<wiki_or_wikigroup>/<data_set>/<time_range_1>.tsv.gz
                                        /<time_range_2>.tsv.gz
                                        /...
  • Where <wiki_or_wikigroup> is enwiki, dewiki, etc. for the top 30 wikis, or the name of a wiki group for smaller wikis, e.g. medium_wikis (5M < events < 25M) and small_wikis (events < 5M) [the thresholds are a guess, not yet checked, just to present the idea]. This is based on Dan's idea of grouping the smaller wikis, but with two groups, so that people interested in a single smaller wiki don't have to download all wikis except the top 30.
  • Where <data_set> is: mediawiki_history, mediawiki_user_history or mediawiki_page_history.
  • Where <time_range> is either the year (YYYY) or the year and month (YYYY-MM) the events belong to. The idea is to partition dump files by time range, so that files for larger wikis are not so large. By our ballpark calculations, the full enwiki mediawiki_history would be a 200+ GB file, one year (2019) of enwiki would be around 16 GB, and one month a bit more than 1 GB. Depending on the size of the wiki or wiki group we could use YYYY or YYYY-MM partitioning, or maybe always use YYYY (and accept the 16 GB enwiki files). A path-construction sketch follows this list.
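
Here is a minimal Python sketch of how the dump job or a downloader might build these paths under the layout above; the top-wiki list, the group thresholds, and base_path are placeholders rather than decided values.

# Proposed layout: base_path/<wiki_or_wikigroup>/<data_set>/<time_range>.tsv.gz

TOP_WIKIS = {"enwiki", "wikidatawiki", "commonswiki", "dewiki"}  # ...the top 30 in reality

def wiki_or_group(wiki, events):
    """Map a wiki to its own directory or to a size-based group directory."""
    if wiki in TOP_WIKIS:
        return wiki
    if events > 5_000_000:      # 5M < events < 25M -> medium_wikis (guessed threshold)
        return "medium_wikis"
    return "small_wikis"        # events < 5M (guessed threshold)

def dump_path(base_path, wiki, events, data_set, year, month=None):
    """Build base_path/<wiki_or_wikigroup>/<data_set>/<YYYY or YYYY-MM>.tsv.gz."""
    time_range = f"{year:04d}" if month is None else f"{year:04d}-{month:02d}"
    return f"{base_path}/{wiki_or_group(wiki, events)}/{data_set}/{time_range}.tsv.gz"

# One month of enwiki's full history:
print(dump_path("base_path", "enwiki", 1_021_000_000, "mediawiki_history", 2019, 7))
# base_path/enwiki/mediawiki_history/2019-07.tsv.gz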

We thought TSV would be a good data format, because it doesn't repeat the field names the way JSON or YAML would, and it's a bit better than CSV, because commas are more likely than tabs to appear in page titles and user names (so we'd have to escape less with TSV).
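
As a small illustration of that escaping argument, here is a Python sketch using the standard csv module with a tab delimiter: only fields that actually contain a tab, quote, or newline get quoted, and since tabs are rare in titles most rows pass through untouched. The example rows are made up.

import csv
import io

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")

# A comma in a page title needs no escaping when the delimiter is a tab...
writer.writerow(["enwiki", "Paris, Texas", "2019-07-01T00:00:00Z"])
# ...but a title containing a literal tab would be quoted by the writer.
writer.writerow(["enwiki", "Title\twith a tab", "2019-07-01T00:00:01Z"])

print(out.getvalue())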

Finally, I think gzip is a good format for the dumps, because it's a pretty standard algorithm, and the files can either be decompressed separately and then joined, or concatenated into a single compressed file and then decompressed in one go. I think Parquet is too technology-specific for the main users of the data lake dumps, no?
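
A quick sketch of that concatenation property using Python's standard gzip module (the chunk contents are made up): concatenated gzip members form a valid gzip stream, and decompressing the parts separately gives the same bytes as decompressing the joined file.

import gzip

# Two independently compressed chunks, e.g. two monthly dump files.
part1 = gzip.compress(b"enwiki\tevent_1\n")
part2 = gzip.compress(b"enwiki\tevent_2\n")

# Joining the compressed members still decompresses as one stream,
# so joining files before or after decompression gives the same result.
joined = part1 + part2
assert gzip.decompress(joined) == b"enwiki\tevent_1\nenwiki\tevent_2\n"
assert gzip.decompress(part1) + gzip.decompress(part2) == gzip.decompress(joined)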