Right now, this page is a draft where we will work out the best way to publish this dataset. With some compression, we have roughly five billion events adding up to one terabyte of data.
Ideas for splitting
Split by wiki with grouping
Split by wiki, but group all wikis with fewer than ten million events. This results in about 50 separate files, which is nice and manageable. These may be further split into 3 separate files for user, page, and revision histories, depending on the size and ease of working with the data. The downside is that as a wiki grows past ten million events, it will move to its own separate file, potentially causing some confusion. A possible mitigation is a machine-readable index of where each wiki is. Dan Andreescu is currently investigating this approach.
- grouping wikis with fewer than 10 million events results in about 50 output files (150 if further split by entity)
- grouping wikis with fewer than 30 million events results in about 25 output files, but it increases the size of the "all others" group to almost the same size as the English and Wikidata wikis, and it leaves no individual "small" wikis, which could be useful for people who want to test their analysis before downloading a bigger set.
with with_count as (
    select wiki_db,
           sum(events) t
      from milimetric.history_count_by_wiki
     group by wiki_db
), with_label as (
    select if(t > 10000000, wiki_db, 'all others') wiki,
           t
      from with_count
)
select wiki,
       sum(t) / 5031314059 as percent
  from with_label
 group by wiki
 order by percent desc
 limit 1000;
|Wiki||Ratio of total events|
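The grouping rule and the machine-readable index mentioned above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `build_index`, the group name `all_others`, the JSON output format, and the sample counts are all hypothetical and not part of any existing tooling. The threshold comparison mirrors the query above (strictly more than ten million events gets its own file).

```python
import json

# Threshold from the proposal: wikis at or below it share one grouped file.
GROUP_THRESHOLD = 10_000_000


def build_index(event_counts, threshold=GROUP_THRESHOLD, group_name="all_others"):
    """Map each wiki to the name of the file group that would hold its events.

    `group_name` is a hypothetical label for the grouped file.
    """
    return {
        wiki: (wiki if events > threshold else group_name)
        for wiki, events in event_counts.items()
    }


# Made-up event counts, for illustration only.
sample_counts = {
    "bigwiki": 25_000_000,
    "midwiki": 10_000_001,
    "tinywiki": 42_000,
}

index = build_index(sample_counts)

# A JSON document like this could be published next to the dumps so that
# consumers can look up which file currently contains a given wiki.
print(json.dumps(index, indent=2, sort_keys=True))
```

Regenerating the index with every dump would let consumers notice when a wiki has moved from the grouped file into its own file.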
Split by wiki, data set and time in GZipped TSVs
In this splitting idea, the directory structure is:
base_path/<wiki_or_wikigroup>/<data_set>/<time_range_1>.tsv.gz
                                        /<time_range_2>.tsv.gz
                                        /...
- Where <wiki_or_wikigroup> is: enwiki, dewiki, etc. for the top 30 wikis, or the name of a wiki group for smaller wikis, e.g. medium_wikis (5M < events < 25M) and small_wikis (events < 5M). These thresholds are a guess and haven't been checked; they are only here to present the idea. This is based on Dan's idea of grouping the smaller wikis, but with two groups, so that people interested in a single smaller wiki don't have to download all wikis except the top 30.
- Where <data_set> is: mediawiki_history, mediawiki_user_history or mediawiki_page_history.
- Where <time_range> is either the year (YYYY) or the year and month (YYYY-MM) the events belong to. The idea is to partition dump files by time range, so that files for larger wikis are not so large. By our ballpark calculations enwiki mediawiki_history (full) would be a 200+GB file. One year (2019) of enwiki would be around 16GB and one month a bit more than 1GB. Depending on the size of the wiki/wiki-group we could use YYYY or YYYY-MM partitioning. Or maybe use YYYY always (and accept the 16GB enwiki files).
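To make the layout above concrete, here is a minimal sketch of how a dump path could be derived for one event. Everything in it is an assumption for illustration: the helper names `wiki_group` and `dump_path`, the two-entry stand-in for the top 30 wikis, and the handling of a wiki with exactly 5M events (the text leaves that boundary open). The group thresholds are the unchecked guesses from the list above.

```python
from datetime import datetime

# Stand-in for the "top 30" wikis; the real list is not defined on this page.
TOP_WIKIS = {"enwiki", "dewiki"}

DATA_SETS = {"mediawiki_history", "mediawiki_user_history", "mediawiki_page_history"}

# Guessed threshold from the text, not a verified number.
SMALL_WIKI_LIMIT = 5_000_000


def wiki_group(wiki, events):
    """Return the directory name for a wiki: itself, or its size group."""
    if wiki in TOP_WIKIS:
        return wiki
    # Assumption: a wiki with exactly 5M events falls into medium_wikis.
    return "small_wikis" if events < SMALL_WIKI_LIMIT else "medium_wikis"


def dump_path(base_path, wiki, events, data_set, timestamp, monthly=False):
    """Build base_path/<wiki_or_wikigroup>/<data_set>/<time_range>.tsv.gz."""
    if data_set not in DATA_SETS:
        raise ValueError(f"unknown data set: {data_set}")
    # YYYY-MM partitioning for large wikis, YYYY otherwise.
    time_range = timestamp.strftime("%Y-%m" if monthly else "%Y")
    return f"{base_path}/{wiki_group(wiki, events)}/{data_set}/{time_range}.tsv.gz"


print(dump_path("base_path", "enwiki", 10**9, "mediawiki_history",
                datetime(2019, 3, 5), monthly=True))
print(dump_path("base_path", "examplewiki", 120_000, "mediawiki_page_history",
                datetime(2019, 3, 5)))
```

The `monthly` flag reflects the open question in the text: it could be set per wiki or wiki group depending on size, or dropped entirely if YYYY partitioning is used everywhere.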
We thought TSV would be a good data format, because it doesn't repeat the field names in every record, like JSON or YAML would, and it's a bit better than CSV, because commas are more likely to appear in page titles and user names than tabs (so we'd have to escape less with TSV).
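The escaping argument can be illustrated with Python's standard `csv` module. This is just a sketch: the row below is made up, and the column layout is not the actual dump schema.

```python
import csv
import io


def serialize(row, delimiter):
    """Write one row with the given delimiter and return the resulting line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(row)
    return buffer.getvalue()


# Illustrative row: wiki, page title containing a comma, user name.
row = ["enwiki", "Paris, Texas", "ExampleUser"]

as_csv = serialize(row, ",")
as_tsv = serialize(row, "\t")

# With commas as the delimiter the title has to be quoted;
# with tabs it is written as-is.
print(repr(as_csv))
print(repr(as_tsv))
```

A value that does contain a tab would still need escaping in TSV, so the dump format would have to define how that rare case is handled.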
Finally, I think Gzip is a good format for the dumps, because it's a pretty standard algorithm, and you can either decompress the files separately and then join them, or join them into a single compressed file and then decompress it. I think Parquet is too technology-specific for the main users of the data lake dumps, no?
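The join-then-decompress property of Gzip can be checked with a few lines of Python. This is a small sketch with made-up rows; it relies on the fact that concatenated gzip members form a valid gzip stream, which the standard `gzip` module decompresses as a whole.

```python
import gzip

# Two made-up time-range files, compressed independently.
part_2018 = gzip.compress(b"enwiki\t2018-12-31\tcreate\n")
part_2019 = gzip.compress(b"enwiki\t2019-01-01\tcreate\n")

# Option 1: join the compressed files byte-for-byte, then decompress once.
joined_then_decompressed = gzip.decompress(part_2018 + part_2019)

# Option 2: decompress each file separately, then join the results.
decompressed_then_joined = gzip.decompress(part_2018) + gzip.decompress(part_2019)

# Both routes yield the same TSV content.
assert joined_then_decompressed == decompressed_then_joined
print(joined_then_decompressed.decode("utf-8"), end="")
```

This is what makes per-year or per-month partitioning cheap for consumers: files can be concatenated as-is without recompressing.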