Talk:MediaWiki Content File Exports

From Wikitech
How to Download/Example: Apparent semantic error

The example given under How to Download is framed as downloading the current dump of the English Wikipedia (implying mediawiki_content_current), but the example itself uses the string mediawiki_content_history, which would download the English Wikipedia dump with complete page and edit histories. The enwiki content_current export (2026-03-01) appears to be about 40 GiB in size, while content_history (which hasn't finished exporting yet) appears to run north of 420 GiB.

Obviously we don't want people following the example that way (I daresay many will) and downloading far more data than they actually want, so I'm going to edit it accordingly to try to help out. If you revert this edit, please add a reply here with your rationale and a heads-up on my Talk page. Thank you.
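To make the distinction concrete, here is a hypothetical sketch of the choice between the two dataset names (BASE_URL and the helper are placeholders of mine, not the real endpoint or API; see the article page for actual download instructions):

```python
# Placeholder base URL -- see the article page for the real endpoint.
BASE_URL = "https://example.org/content-file-exports"

def export_url(wiki, full_history=False):
    """Build an export URL, defaulting to the ~40 GiB current-revisions
    dataset rather than the 400+ GiB full-history one."""
    dataset = "mediawiki_content_history" if full_history else "mediawiki_content_current"
    return f"{BASE_URL}/{dataset}/{wiki}"

print(export_url("enwiki"))
# -> https://example.org/content-file-exports/mediawiki_content_current/enwiki
```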

ᛒᚱᛟᚴᛂᚾ ᚢᛁᚴᛁᚾᚷ (Broken Viking|T·C) 21:01, 7 March 2026 (UTC)

Whoops, thanks for the fix! XCollazo-WMF (talk) 19:14, 19 March 2026 (UTC)

An issue in the Wikifunctions dump

I've been doing some analysis on the 2026-03-01 Wikifunctions dump and found something odd. The page f:Z20064 was deleted from the actual wiki on 19 December 2025, yet it does appear in the March 2026 dump. When I read it using the mwxml Python package, the value of .deleted.text for this page's only revision is False, even though this property is supposed to be True for deleted pages. And I might be wrong about this, but I thought deleted pages were not supposed to be in the dump at all.

Am I understanding or doing something wrong, or is there a bug in the dump infrastructure somewhere?

You can see the code I use to process it on Gitlab.
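For anyone who wants to reproduce the presence check without mwxml, here is a minimal stdlib sketch; the inline sample is a hypothetical miniature, not the real dump, and assumes the standard MediaWiki XML export layout:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature export following the standard MediaWiki
# XML export layout (<mediawiki>/<page>/<title>).
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page>
    <title>Z20064</title>
    <revision><id>1</id><text>stub</text></revision>
  </page>
</mediawiki>"""

NS = "{http://www.mediawiki.org/xml/export-0.11/}"

def page_titles(xml_text):
    """Return all page titles found in a MediaWiki XML export."""
    root = ET.fromstring(xml_text)
    return [p.findtext(f"{NS}title") for p in root.iter(f"{NS}page")]

# A deleted page should normally be absent from the dump,
# so finding its title here is what flags the bug:
print("Z20064" in page_titles(SAMPLE))  # -> True
```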

(Tagging @XCollazo-WMF, who has recently been editing this page; I'm not sure how else to report it.) Amir E. Aharoni (talk) 17:48, 19 March 2026 (UTC)

@Amire80: I have confirmed that the file export you mention does indeed contain Z20064. This is indeed a consistency bug.
Our internal dataset, from which these file exports are generated, is 'eventually consistent'. I just checked, and this particular inconsistency has already been resolved via ongoing work on https://phabricator.wikimedia.org/T415311, so you should not see it in next month's export.
If you see other inconsistencies, though, feel free to open tickets at https://phabricator.wikimedia.org and tag the #Data-Engineering team. XCollazo-WMF (talk) 19:58, 19 March 2026 (UTC)
Thank you! Amir E. Aharoni (talk) 20:06, 19 March 2026 (UTC)

parent_id None or 0

Another question about the Wikifunctions dump. When I iterate over its revisions using the mwxml Python package, the value of parent_id for page-creation revisions is sometimes 0 (type int) and sometimes None. In the database it's 0, so it would make more sense for it to be consistently 0, or maybe consistently None, but not a mix of the two. Is there a reason for this difference?
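In the meantime, consumers can collapse the two representations themselves; a minimal sketch (the helper name is mine, not part of mwxml):

```python
def normalize_parent_id(parent_id):
    """Map both a missing (None) and a 0 parent_id to int 0,
    matching the database convention for page-creation revisions."""
    return 0 if parent_id is None else int(parent_id)

# Both dump variants collapse to the same value:
print(normalize_parent_id(None), normalize_parent_id(0), normalize_parent_id(42))
# -> 0 0 42
```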

Tagging @XCollazo-WMF, who was very helpful a few days ago, again :) Amir E. Aharoni (talk) 14:33, 22 March 2026 (UTC)

This sounds like a consistency bug. Can you please open a ticket at https://phabricator.wikimedia.org and tag the #Data-Engineering team? XCollazo-WMF (talk) 14:35, 23 March 2026 (UTC)
Thanks, T420974. Amir E. Aharoni (talk) 17:38, 23 March 2026 (UTC)