Obsolete talk:Dumps/Development status 2011
Performance-friendly dump formats?
Has anybody seriously considered switching to a new dump format that's easier to both generate and consume in parallel?
The compressed XML stream was something I whipped up several years ago, aiming for a format that would be:
- reasonably straightforward to generate
- reasonably self-explanatory to someone trying to parse it
- reasonably independent of implementation details in the database schema and compressed text storage
And for that, it works ok. But we've definitely seen the limitations of this over the years:
- serial nature of the stream makes it hard to parallelize dump generation
- compressed XML means making a file that's only a couple % different from the last file still requires decompressing and recompressing the ENTIRE original stream (prefetch is still expensive)
- serial nature of the stream makes it hard to parallelize dump consumption
- compressed XML makes random-access for consumers very difficult -- though there have been some very clever hacks with index files
- pulling an XML dump back into a wiki takes a long time
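The index-file hacks mentioned above generally work by compressing the dump as many small independent streams and recording the byte offset of each one; a consumer can then seek straight to the stream it wants instead of decompressing from the top. A minimal sketch of that idea, assuming a toy multistream layout (the chunking and index format here are illustrative, not what any real dump tool does):

```python
import bz2
import io

# Build a toy "multistream" file: each chunk of pages is an
# independent bz2 stream, and the streams are concatenated.
chunks = [b"<page><title>Foo</title>...</page>",
          b"<page><title>Bar</title>...</page>"]
titles = ["Foo", "Bar"]

buf = io.BytesIO()
index = []  # (byte offset of stream, first title in that stream)
for data, title in zip(chunks, titles):
    index.append((buf.tell(), title))
    buf.write(bz2.compress(data))
blob = buf.getvalue()

# Random access: look up the offset in the index, seek there, and
# decompress only that one stream -- the "Foo" stream is never touched.
offsets = {title: off for off, title in index}
decomp = bz2.BZ2Decompressor()
page = decomp.decompress(blob[offsets["Bar"]:])
```

The cost is that every consumer still needs the side-channel index file, and the writer still has to produce the whole compressed file serially.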
So a general crazy idea off the top of my head: drop the XML for a structured data format with compression inside the file structure, not around it...
- multiple writers can add data in parallel (even if that has to be moderated by one process with a file lock, there's a lot less central overhead if things like text compression are handled individually for each page)
- writers can copy unchanged data without decompressing/recompressing it
- consumers can parallelize very aggressively: no decompression & XML parsing bottleneck, so multiple processes can jump in at any point of the stream
- consumers doing a full scan can skip over uninteresting pages without decompressing them
- makes it easier to carve out a partial dump without recompressing everything
- structure can include built-in indexes for random access
- if sufficiently crazy, could be used as a sort of additional external storage for mediawiki -- importing a dump could use the *actual dump* for text storage rather than copying it all into the database. madness!
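As a concrete sketch of "compression inside the file structure, not around it": store each page's text as its own length-prefixed compressed record, with an index mapping page id to record offset. The layout, names, and use of per-page zlib below are all invented for illustration -- just a stand-in for whatever the real format would pick:

```python
import io
import struct
import zlib

def write_dump(pages):
    """pages: dict of page_id -> text. Returns (blob, index).

    Each record is [4-byte big-endian length][zlib(text)]; the index
    maps page_id -> (offset, record_size) so readers can seek directly.
    """
    buf = io.BytesIO()
    index = {}
    for pid, text in pages.items():
        rec = zlib.compress(text.encode("utf-8"))
        index[pid] = (buf.tell(), 4 + len(rec))
        buf.write(struct.pack(">I", len(rec)))
        buf.write(rec)
    return buf.getvalue(), index

def read_page(blob, index, pid):
    # Decompress only the one record we care about; everything
    # else in the file is skipped entirely.
    off, _ = index[pid]
    (n,) = struct.unpack_from(">I", blob, off)
    return zlib.decompress(blob[off + 4 : off + 4 + n]).decode("utf-8")

blob, idx = write_dump({1: "first page text", 2: "another page"})
text = read_page(blob, idx, 2)  # -> "another page"

# An incremental writer can copy an unchanged record verbatim,
# byte for byte, without ever decompressing or recompressing it:
off, size = idx[1]
unchanged_record = blob[off : off + size]
```

This one toy layout covers most of the bullets above: parallel writers can each compress their own records, unchanged records are copied as opaque bytes, consumers seek via the index, and a full scan can skip records it doesn't care about by reading only the length prefixes.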
Obvious questions then become:
- stick with giant standalone files or use a directory structure at the macro level? (eg, store some metadata in this file, indexes in another file, maybe compressed pages over there?)
- what surrounding structure? adapt something existing? (giant .zips? the reader dump formats?)
- what page/rev metadata structure? embed the existing XML in smaller chunks, or use a fixed-size binary format or something?
- can the file format still be self-documenting in some way?
- will this serve the needs of consumers or am I barking up the crazy tree? ;)
- are incremental dumps still needed? would this format be rsync-friendly for updating?
-- Brion 19:00, 2 February 2011 (UTC)