Obsolete talk:Dumps/Development status 2011

Performance-friendly dump formats?


Has anybody seriously considered switching to a new dump format that's easier to both generate and consume in parallel?

The compressed XML stream was something I whipped up several years ago as something that would be:

reasonably straightforward to generate
reasonably self-explanatory to someone trying to parse it
reasonably independent of implementation details in the database schema and compressed text storage

And for that, it works ok. But we've definitely seen the limitations of this over the years:

serial nature of the stream makes it hard to parallelize dump generation
compressed XML means making a file that's only a couple % different from the last file still requires decompressing and recompressing the ENTIRE original stream (prefetch is still expensive)
serial nature of the stream makes it hard to parallelize dump consumption
compressed XML makes random access for consumers very difficult -- though there have been some very clever hacks with index files
pulling an XML dump back into a wiki takes a long time

So a general crazy idea off the top of my head: drop the XML for a structured data format with compression inside the file structure, not around it...

multiple writers can add data in parallel (even if that has to be moderated by one process with a file lock, there's a lot less central overhead if things like text compression are handled individually for each page)
writers can copy unchanged data without decompressing/recompressing it
consumers can parallelize very aggressively: no decompression & XML parsing bottleneck, so multiple processes can jump in at any point of the stream
consumers doing a full scan can skip over uninteresting pages without decompressing them
makes it easier to carve out a partial dump without recompressing everything
structure can include built-in indexes for random access
if sufficiently crazy, could be used as a sort of additional external storage for mediawiki -- importing a dump could use the *actual dump* for text storage rather than copying it all into the database. madness!
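
To make the idea a bit more concrete, here is a minimal Python sketch of "compression inside the file structure". Everything in it is an assumption for illustration only: the `MAGIC` bytes, the index and trailer layout, and the function names are made up and are not an existing or proposed dump format. Each page is compressed on its own with zlib, an index of (page id, offset, length) goes at the end of the file, and an updated dump copies unchanged pages as raw compressed bytes.

```python
import io
import struct
import zlib

MAGIC = b"PGDUMP01"  # hypothetical magic bytes, not an existing format


def _write_index(out, index):
    """Append fixed-size index entries, then a trailer pointing at them."""
    index_offset = out.tell()
    for page_id, (offset, length) in sorted(index.items()):
        out.write(struct.pack("<QQQ", page_id, offset, length))
    # Trailer: where the index starts and how many entries it holds.
    out.write(struct.pack("<QQ", index_offset, len(index)))


def write_dump(out, pages):
    """Write (page_id, text) pairs as independently compressed blobs."""
    out.write(MAGIC)
    index = {}
    for page_id, text in pages:
        blob = zlib.compress(text.encode("utf-8"))
        index[page_id] = (out.tell(), len(blob))
        out.write(blob)
    _write_index(out, index)
    return index


def read_index(f):
    """Load the index by reading the 16-byte trailer at the end of the file."""
    f.seek(-16, io.SEEK_END)
    index_offset, count = struct.unpack("<QQ", f.read(16))
    f.seek(index_offset)
    index = {}
    for _ in range(count):
        page_id, offset, length = struct.unpack("<QQQ", f.read(24))
        index[page_id] = (offset, length)
    return index


def read_page(f, index, page_id):
    """Random access: seek to one page and decompress only that page."""
    offset, length = index[page_id]
    f.seek(offset)
    return zlib.decompress(f.read(length)).decode("utf-8")


def update_dump(old, new, changed):
    """Build a new dump; unchanged pages are copied without recompression."""
    old_index = read_index(old)
    new.write(MAGIC)
    new_index = {}
    for page_id in sorted(set(old_index) | set(changed)):
        if page_id in changed:
            blob = zlib.compress(changed[page_id].encode("utf-8"))
        else:
            offset, length = old_index[page_id]
            old.seek(offset)
            blob = old.read(length)  # raw compressed bytes, copied as-is
        new_index[page_id] = (new.tell(), len(blob))
        new.write(blob)
    _write_index(new, new_index)
    return new_index
```

A consumer that only wants a handful of pages calls `read_index` once and then `read_page` for each id, and several processes could each take a slice of the index. This single-writer sketch ignores the parallel-writer and locking question entirely.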

Obvious questions then become:

stick with giant standalone files or use a directory structure at the macro level? (eg, store some metadata in this file, indexes in another file, maybe compressed pages over there?)
what surrounding structure? adapt something existing? (giant .zips? the reader dump formats?)
what page/rev metadata structure? drop in the existing XML in smaller chunks, or use a fixed-size binary format or something?
can the file format still be self-documenting in some way?
will this serve the needs of consumers or am I barking up the crazy tree? ;)
are incremental dumps still needed? would this format be rsync-friendly for updating?
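
For the fixed-size binary option in the metadata question, a rough Python sketch of what a record could look like. The field list (page id, revision id, Unix timestamp, text offset, text length) and the names `REV_RECORD`, `pack_rev` and `nth_rev` are hypothetical, chosen only to show the idea. Variable-length things like titles and comments would not fit and would need to live somewhere else.

```python
import struct

# Hypothetical record layout (little-endian, no padding):
# page_id, rev_id, timestamp (Unix seconds), text_offset as 64-bit unsigned,
# text_length as 32-bit unsigned.
REV_RECORD = struct.Struct("<QQQQI")


def pack_rev(page_id, rev_id, timestamp, text_offset, text_length):
    """Encode one revision's metadata as a fixed-size record."""
    return REV_RECORD.pack(page_id, rev_id, timestamp, text_offset, text_length)


def nth_rev(table, n):
    """Fetch record n by arithmetic on the offset, with no scanning or parsing."""
    return REV_RECORD.unpack_from(table, n * REV_RECORD.size)
```

Since every record has the same size, record n always sits at byte `n * REV_RECORD.size`, which is what makes random access and splitting work across processes cheap. The cost is that the layout is not self-documenting the way XML is.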

-- Brion 19:00, 2 February 2011 (UTC)