MediaWiki Content File Exports

From Wikitech
This is a draft, and this data is not available yet. This work is paused.

The MediaWiki Content File Exports are datasets available for download that include the unparsed content of the public wikis hosted by the Wikimedia Foundation.

These datasets are provided on a per-wiki basis, in a compressed XML format. This XML format is compatible with MediaWiki's Special:Export, and with the legacy XML Dumps.

Project rationale

"Dumps" of the content of Wikimedia wikis in XML format have been in production for many years, and can be obtained from https://dumps.wikimedia.org/backup-index.html. These dumps enable reuse, repurposing, and analysis by both the community and internally by the Wikimedia Foundation.

However, the infrastructure that produces those XML files can no longer reliably generate dumps of the larger wikis, and it has been unmaintained for an extended period of time. Although we will continue to attempt to generate the legacy XML files, we are now deprecating that legacy path.

The Data Engineering team has reimplemented how this data is produced, making it reliably accessible internally. We are now confident enough in this pipeline to make the data available publicly as well.

Content

The MediaWiki Content File Exports consist of two datasets:

mediawiki_content_history

Contains the unparsed content of all revisions, past and present, from all Wikimedia wikis. This dataset is exported per wiki, once per month, on the 1st of the month.

mediawiki_content_current

Contains the unparsed content of the current revisions from all Wikimedia wikis. This dataset is exported per wiki, twice per month, on the 1st and on the 15th of the month.


How to download

  1. Identify the wiki to download.
  2. Identify whether you need the full history, or if the latest revision per page is sufficient.
  3. Attempt to fetch /wmf/data/exports/{dataset}/{wiki_id}/{date}/xml/{compression}/SHA256SUMS
  4. If the file is available, that particular file export is done and available. If not, retry later.
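The availability probe in the steps above can be sketched as follows. The path template comes from this page; the base URL is a placeholder assumption, since this page only documents relative paths — substitute the real host once the service is live.

```python
# Sketch of the availability probe described above. The path template is taken
# from this page; BASE_URL is a hypothetical placeholder, not the real host.
from urllib.parse import urljoin
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

BASE_URL = "https://dumps.example.org"  # assumption: replace with the actual host


def sums_path(dataset: str, wiki_id: str, date: str, compression: str = "bzip2") -> str:
    """Build the relative path of the SHA256SUMS file for one export."""
    return f"/wmf/data/exports/{dataset}/{wiki_id}/{date}/xml/{compression}/SHA256SUMS"


def export_is_ready(dataset: str, wiki_id: str, date: str) -> bool:
    """HEAD-request the SHA256SUMS file; a 200 response means the export is done."""
    url = urljoin(BASE_URL, sums_path(dataset, wiki_id, date))
    try:
        with urlopen(Request(url, method="HEAD")) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False
```

If `export_is_ready(...)` returns False, the export for that date is not finished yet; retry later.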

Example

We want to download the current content of the English Wikipedia, that is, enwiki.

Following the path template above, the URL to check is /wmf/data/exports/mediawiki_content_current/enwiki/2025-01-01/xml/bzip2/SHA256SUMS

If the files for a particular date are not ready, the SHA256SUMS file will not exist.

This file lists the SHA-256 checksum and the relative path of each file that makes up the export.

We can then iterate over each relative path in that file to download them all.

After downloading all files, you are highly encouraged to use the SHA256SUMS file again to verify each downloaded file via a command such as sha256sum --check.
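Assuming the SHA256SUMS file follows the conventional `sha256sum` format (a hex digest, whitespace, then the relative path), the parse-and-verify part of this walkthrough might look like the sketch below; it is the Python equivalent of running `sha256sum --check`.

```python
# Parse a SHA256SUMS file and re-verify downloads, assuming the conventional
# `sha256sum` line format: "<hex digest>  <relative path>".
import hashlib


def parse_sha256sums(text: str) -> list[tuple[str, str]]:
    """Return (digest, relative_path) pairs from SHA256SUMS content."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, _, relpath = line.partition(" ")
        # A leading "*" marks binary mode in sha256sum output; drop it.
        entries.append((digest.lower(), relpath.strip().lstrip("*")))
    return entries


def verify_file(path: str, expected_digest: str, chunk_size: int = 1 << 20) -> bool:
    """Recompute a file's SHA-256 in chunks and compare to the published digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest() == expected_digest.lower()
```

Iterating `parse_sha256sums` output gives you both the list of files to download and, afterwards, the digests to verify them against.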


FAQ

How do I know which URL corresponds to the wiki that I wish to download?

The URLs are indexed with what we call the wiki_id of each wiki. For example, for the English Wikipedia, that id is enwiki. For the Spanish Wikipedia, it is eswiki.

You can find a mapping between wiki_ids and the corresponding web address, site name and language at MediaWiki Content File Exports/WikiId Mappings.

I have downloaded all of the files for a specific wiki. What do the file names of individual files mean?

Most filenames look like this:

wikidatawiki-2025-08-01-p86829598p86830295.xml.bz2

The first part is the wiki_id, the second part is the publication date, and the third part is the range of page_ids contained in the file. Thus, the above file will contain all revisions of pages in the page_id range [86829598, 86830295], both inclusive.

Some files, however, look like this:

wikidatawiki-2025-08-01-p86829600r1134810922r1134810922.xml.bz2
wikidatawiki-2025-08-01-p86829600r1877403859r1877403859.xml.bz2

In these cases, a page was found to have too many revisions, so to keep file sizes and the corresponding computation cost manageable, we export it in its own set of files. In the specific example above, page_id = 86829600 was too large for a single file. The algorithm therefore exports it in two files: one containing the revisions of this page in the range [1134810922, 1134810922], and the other containing the revisions in the range [1877403859, 1877403859]. Between those two files, you can find the entirety of page_id = 86829600.