Dumps/Drafts/Using bitstream

From Wikitech

Notes about replacing (parts of) mwbzutils with Python's bitstream module

To look into:

  • speed of Python's bitstream module
  • proper determination of block boundaries
  • flexibility of Python's bz2 module
  • avoiding edge cases in decompression
  • building bz2 files from arbitrary blocks (object store??)
  • ...?

Speed of bitstream module

Not every use needs the fastest possible code. We need to categorize all our use cases to see which ones are speed-critical.

Bzip2 block boundary determination

Just searching for the bit-aligned start-of-block marker is not enough: the same 48-bit pattern can occur by chance inside compressed data. Decompressing the first 8 or 16 KB of a candidate block is not enough either. What is enough to be sure we have a good block?

Cases in which this is not an issue:

  • decompressing first block in a file
  • decompressing last block in a file
  • decompressing streams that are known to have a maximum (small) size
  • decompressing in serial and keeping track of block boundaries as we go along
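
A minimal sketch of the naive marker search, written with plain Python integers rather than the bitstream module (whose API is not assumed here). The function name find_block_markers is hypothetical. It slides a 48-bit window over the data one bit at a time and reports every bit offset where the start-of-block marker 0x314159265359 appears. The offsets are only candidates; as noted above, a match alone does not prove a real block boundary.

```python
import bz2

BLOCK_MAGIC = 0x314159265359  # 48-bit start-of-block marker
MAGIC_BITS = 48


def find_block_markers(data):
    """Return the bit offsets at which the block marker pattern appears.

    Hypothetical helper: these are candidate boundaries only.
    """
    mask = (1 << MAGIC_BITS) - 1
    window = 0
    bits_seen = 0
    offsets = []
    for byte in data:
        # bzip2 packs bits most significant bit first.
        for shift in range(7, -1, -1):
            window = ((window << 1) | ((byte >> shift) & 1)) & mask
            bits_seen += 1
            if bits_seen >= MAGIC_BITS and window == BLOCK_MAGIC:
                offsets.append(bits_seen - MAGIC_BITS)
    return offsets


compressed = bz2.compress(b"hello world\n" * 100)
candidates = find_block_markers(compressed)
# The first candidate sits at bit 32, right after the 4-byte "BZh9" header.
print(candidates)
```

The per-bit loop is slow in pure Python; it is meant to show the logic, not to settle the speed question.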

Avoiding edge cases in decompression

We don't use lbzip2 for decompression anywhere: when it does not read blocks serially, it can get the block boundaries wrong.

It uses two methods to find block boundaries:

  • scan - naively find block markers, possibly bit-shifted
  • parse - TBD

(More notes soon)

Flexibility of Python's bz2 module

The Python routines, even those that incrementally decompress data from a buffer, expect a bzip2 stream header. We can construct a stream of bytes with that header at the front. With some care, we can also calculate a stream CRC in cases where we need one and append it to the end. This should cover our uses.
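
One way this could look, as a sketch: wrap_block (a hypothetical helper) takes the bits of a single block, puts a "BZh9" header in front, and appends the end-of-stream marker 0x177245385090 plus a stream CRC. For a stream holding exactly one block, the stream CRC equals the block CRC, which is stored in the 32 bits after the block marker. Bits are handled as strings of '0'/'1' for clarity, not speed, and the demo trusts a naive marker search only because the input is a small stream we compressed ourselves.

```python
import bz2

BLOCK_MAGIC = format(0x314159265359, "048b")  # start-of-block marker
EOS_MAGIC = format(0x177245385090, "048b")    # end-of-stream marker


def to_bits(data):
    """Bytes to a string of '0'/'1' characters, most significant bit first."""
    return bin(int.from_bytes(data, "big"))[2:].zfill(len(data) * 8)


def from_bits(bits):
    """Bit string back to bytes, zero-padded up to a byte boundary."""
    bits += "0" * (-len(bits) % 8)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def wrap_block(block_bits):
    """Wrap one block (starting at its marker) as a standalone bzip2 stream."""
    if not block_bits.startswith(BLOCK_MAGIC):
        raise ValueError("block must start with the block marker")
    block_crc = block_bits[48:80]  # stored block CRC follows the marker
    header = to_bits(b"BZh9")      # largest block size, so any block fits
    # With a single block, the stream CRC equals the block CRC.
    return from_bits(header + block_bits + EOS_MAGIC + block_crc)


# Demo on a stream we produced ourselves (level 1 gives several small blocks).
original = b"".join(b"line %d\n" % i for i in range(30000))
bits = to_bits(bz2.compress(original, compresslevel=1))

starts = []
pos = bits.find(BLOCK_MAGIC)
while pos != -1:
    starts.append(pos)
    pos = bits.find(BLOCK_MAGIC, pos + 1)
ends = starts[1:] + [bits.find(EOS_MAGIC, starts[-1])]

# Each block, wrapped on its own, decompresses with the stock bz2 module.
pieces = [bz2.decompress(wrap_block(bits[s:e])) for s, e in zip(starts, ends)]
```

bz2.decompress raises OSError when the stored block CRC does not match the decoded data, so decompressing a wrapped candidate in full is one (expensive) way to test it.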

CRC calculation of bz2 files written from arbitrary blocks

If we can get good block boundaries, this can be done; a proof of concept already exists in C. The "if" is the hard part.
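
For several blocks, the stream CRC is folded together from the per-block CRCs: rotate the running value left by one bit, then XOR in the next block CRC. The sketch below is in Python and is not taken from the C proof of concept; combine_crcs, build_stream and split_blocks are hypothetical names. Block boundaries are taken on trust from a naive marker search, which is only reasonable here because the demo input is a stream we compressed ourselves.

```python
import bz2

BLOCK_MAGIC = format(0x314159265359, "048b")  # start-of-block marker
EOS_MAGIC = format(0x177245385090, "048b")    # end-of-stream marker


def to_bits(data):
    """Bytes to a string of '0'/'1' characters, most significant bit first."""
    return bin(int.from_bytes(data, "big"))[2:].zfill(len(data) * 8)


def from_bits(bits):
    """Bit string back to bytes, zero-padded up to a byte boundary."""
    bits += "0" * (-len(bits) % 8)
    return int(bits, 2).to_bytes(len(bits) // 8, "big")


def combine_crcs(block_crcs):
    """Fold per-block CRCs, in block order, into the 32-bit stream CRC."""
    combined = 0
    for crc in block_crcs:
        combined = ((combined << 1) | (combined >> 31)) & 0xFFFFFFFF
        combined ^= crc
    return combined


def build_stream(blocks):
    """Assemble a bzip2 stream from bit-string blocks (each starts at its marker)."""
    crcs = [int(block[48:80], 2) for block in blocks]
    trailer = EOS_MAGIC + format(combine_crcs(crcs), "032b")
    return from_bits(to_bits(b"BZh9") + "".join(blocks) + trailer)


def split_blocks(bits):
    """Naive split at marker positions; returns (blocks, end-of-stream offset)."""
    starts = []
    pos = bits.find(BLOCK_MAGIC)
    while pos != -1:
        starts.append(pos)
        pos = bits.find(BLOCK_MAGIC, pos + 1)
    eos = bits.find(EOS_MAGIC, starts[-1])
    ends = starts[1:] + [eos]
    return [bits[s:e] for s, e in zip(starts, ends)], eos


original = b"".join(b"line %d\n" % i for i in range(30000))
bits = to_bits(bz2.compress(original, compresslevel=1))
blocks, eos = split_blocks(bits)

# The CRC we compute should match the one bzip2 stored after the end marker.
stored_crc = int(bits[eos + 48:eos + 80], 2)
computed_crc = combine_crcs(int(block[48:80], 2) for block in blocks)

# Two new files built from subsets of the blocks.
head = bz2.decompress(build_stream(blocks[:1]))
tail = bz2.decompress(build_stream(blocks[1:]))
```

Because each block carries its own CRC in its header, the stream CRC can be computed without decompressing anything, as long as the block boundaries are right.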


See also