Obsolete talk:Dumps/Development status 2011

From Wikitech

Performance-friendly dump formats?

Has anybody seriously considered switching to a new dump format that's easier to both generate and consume in parallel?

The compressed XML stream was something I whipped up several years ago as something that would be:

  • reasonably straightforward to generate
  • reasonably self-explanatory to someone trying to parse it
  • reasonably independent of implementation details in the database schema and compressed text storage

And for that, it works ok. But we've definitely seen the limitations of this over the years:

  • serial nature of the stream makes it hard to parallelize dump generation
  • compressed XML means making a file that's only a couple % different from the last file still requires decompressing and recompressing the ENTIRE original stream (prefetch is still expensive)
  • serial nature of the stream makes it hard to parallelize dump consumption
  • compressed XML makes random-access for consumers very difficult -- though there have been some very clever hacks with index files
  • pulling an XML dump back into a wiki takes a long time

So a general crazy idea off the top of my head: drop the XML for a structured data format with compression inside the file structure, not around it...

  • multiple writers can add data in parallel (even if that has to be moderated by one process with a file lock, there's a lot less central overhead if things like text compression are handled individually for each page)
  • writers can copy unchanged data without decompressing/recompressing it
  • consumers can parallelize very aggressively: no decompression & XML parsing bottleneck, so multiple processes can jump in at any point of the stream
  • consumers doing a full scan can skip over uninteresting pages without decompressing them
  • makes it easier to carve out a partial dump without recompressing everything
  • structure can include built-in indexes for random access
  • if sufficiently crazy, could be used as a sort of additional external storage for mediawiki -- importing a dump could use the *actual dump* for text storage rather than copying it all into the database. madness!
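To make the "compression inside the file structure" idea concrete, here is a rough sketch, not a proposed format. Everything in it is an assumption for illustration: the `MAGIC` signature, the JSON index, the 8-byte footer, and all function names are made up. Each page is compressed on its own with zlib, blobs are written back to back, and an index of (offset, length) pairs goes at the end of the file.

```python
import io
import json
import struct
import zlib

MAGIC = b"PGDUMP01"  # hypothetical 8-byte file signature


def compress_page(text):
    """Compress one page on its own, so writers can do this step in parallel."""
    return zlib.compress(text.encode("utf-8"))


def write_dump(out, blobs):
    """Write (page_id, compressed_blob) pairs, then a JSON index and a footer."""
    out.write(MAGIC)
    index = {}
    for page_id, blob in blobs:
        index[str(page_id)] = [out.tell(), len(blob)]
        out.write(blob)
    index_offset = out.tell()
    out.write(json.dumps(index).encode("utf-8"))
    # Footer: 8-byte little-endian offset telling readers where the index starts.
    out.write(struct.pack("<Q", index_offset))


def read_index(f):
    """Load the index by reading the footer first, then the JSON before it."""
    f.seek(-8, io.SEEK_END)
    footer_pos = f.tell()
    (index_offset,) = struct.unpack("<Q", f.read(8))
    f.seek(index_offset)
    raw = f.read(footer_pos - index_offset)
    return {int(k): tuple(v) for k, v in json.loads(raw).items()}


def read_blob(f, index, page_id):
    """Fetch the still-compressed bytes for one page (random access)."""
    offset, length = index[page_id]
    f.seek(offset)
    return f.read(length)


def read_page(f, index, page_id):
    """Decompress only the page that was asked for."""
    return zlib.decompress(read_blob(f, index, page_id)).decode("utf-8")


# Previous dump with two pages.
old = io.BytesIO()
write_dump(old, [(1, compress_page("Alpha")), (2, compress_page("Beta"))])
old_index = read_index(old)

# Next dump: page 1 is unchanged, so its compressed bytes are copied verbatim;
# only page 2 is recompressed.
new = io.BytesIO()
write_dump(new, [
    (1, read_blob(old, old_index, 1)),
    (2, compress_page("Beta, edited")),
])
new_index = read_index(new)
```

With a layout like this, a consumer doing a full scan can walk the index and skip any blob it doesn't care about, and several readers can seek to different offsets at once. Whether a trailing index or a separate index file is the better choice is exactly the sort of thing that's still open.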

Obvious questions then become:

  • stick with giant standalone files or use a directory structure at the macro level? (eg, store some metadata in this file, indexes in another file, maybe compressed pages over there?)
  • what surrounding structure? adapt something existing? (giant .zips? the reader dump formats?)
  • what page/rev metadata structure? drop in the existing XML in smaller chunks, or use a fixed-size binary format or something?
  • can the file format still be self-documenting in some way?
  • will this serve the needs of consumers or am I barking up the crazy tree? ;)
  • are incremental dumps still needed? would this format be rsync-friendly for updating?
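For the "fixed-size binary format" option in the metadata question, a minimal sketch of what one revision record might look like. The field set and layout here are purely hypothetical (page id, rev id, Unix timestamp, and the offset and length of the compressed text elsewhere in the file); nothing about them comes from an existing format.

```python
import struct

# Hypothetical record: page_id, rev_id, timestamp, text_offset (8 bytes each),
# then text_length (4 bytes). "<" means little-endian with no padding.
REV_RECORD = struct.Struct("<QQQQI")


def pack_rev(page_id, rev_id, timestamp, text_offset, text_length):
    """Encode one revision's metadata as a fixed-size record."""
    return REV_RECORD.pack(page_id, rev_id, timestamp, text_offset, text_length)


def nth_rev(table, n):
    """Jump straight to record n; no scanning or parsing of earlier records."""
    return REV_RECORD.unpack_from(table, n * REV_RECORD.size)


# A tiny metadata table holding two revisions.
table = (
    pack_rev(1, 100, 1296673200, 8, 120)
    + pack_rev(2, 101, 1296673260, 128, 95)
)
second = nth_rev(table, 1)
```

The trade-off is the one raised above: fixed-size records make seeking to the nth entry trivial, but they are not self-documenting the way small XML chunks would be, so the layout would need to be described somewhere, perhaps in a header.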

-- Brion 19:00, 2 February 2011 (UTC)