This page describes investigations into improving the compression techniques used in columnar storage, for instance with RCFile.

Hadoop already provides BZip2 and Gzip codecs, which can be used for columnar compression in Hive/HCatalog.

Other compressors are available that use fewer resources (snappy, ...) or compress better (paq, ...), but we should be able to exploit certain aspects of the usage scenario here to provide compression that is both better and faster than the available options:

  • the data in a column is all of the same type
  • in some cases, the data may be ordered (see the delta-encoding sketch after this list)
  • block compression is always possible
  • record order may not matter in most situations
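
For example, when the values in an integer column are sorted or slowly varying, delta encoding followed by a continuation-bit (varint) encoding shrinks most values to one or two bytes. Below is a minimal sketch in Java; the class and method names are illustrative and not part of any existing Hive/Hadoop API.

    import java.io.ByteArrayOutputStream;

    // Illustrative sketch: delta + zigzag + continuation-bit (varint) encoding
    // for a column of 32-bit integers. Not an existing Hive/Hadoop API.
    public class DeltaVarIntSketch {

        // Zigzag-map a signed delta to an unsigned value so that small
        // negative deltas also encode in few bytes.
        static int zigzag(int v) {
            return (v << 1) ^ (v >> 31);
        }

        // Write an unsigned int using 7 data bits per byte; the high bit of
        // each byte is the continuation flag.
        static void writeVarInt(ByteArrayOutputStream out, int v) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }

        // Encode a column: store each value as the zigzagged delta from the
        // previous value. Sorted or slowly-varying columns encode compactly.
        static byte[] encodeColumn(int[] column) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int v : column) {
                writeVarInt(out, zigzag(v - prev));
                prev = v;
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            int[] sorted = {100, 101, 101, 105, 110, 112, 200};
            byte[] encoded = encodeColumn(sorted);
            System.out.println(sorted.length * 4 + " raw bytes -> " + encoded.length + " encoded bytes");
        }
    }

The zigzag step is only there so that occasional negative deltas stay short; for strictly sorted data the deltas are non-negative and it could be dropped.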

Initially, I am going to look into integer compression, but the lessons learned here should translate to other types as well.

Here are the compression metrics for some simulated data, expressed as the number of bits per symbol, where each symbol is a 32-bit integer. In general, bzip2 does well in terms of compression efficiency, but its bits-per-symbol is still higher than the first-order entropy, so we should be able to exploit this and produce a better compressor using an order-1 (1-level) PPM model.
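
For reference, both entropy figures can be computed by simple symbol counting over the column. A sketch of the calculation follows (illustrative names; it assumes the whole column fits in memory and is not the code used to produce the table below).

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: zero-order and first-order entropy (bits per symbol)
    // of a column of 32-bit integers, computed by simple symbol counting.
    public class ColumnEntropy {

        // H0 = -sum p(s) log2 p(s), where p(s) is the empirical symbol frequency.
        static double zeroOrderEntropy(int[] col) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int v : col) counts.merge(v, 1, Integer::sum);
            double h = 0.0, n = col.length;
            for (int c : counts.values()) {
                double p = c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // H1 = expected entropy of the next symbol given the previous symbol.
        static double firstOrderEntropy(int[] col) {
            Map<Integer, Map<Integer, Integer>> ctx = new HashMap<>();
            for (int i = 1; i < col.length; i++) {
                ctx.computeIfAbsent(col[i - 1], k -> new HashMap<>())
                   .merge(col[i], 1, Integer::sum);
            }
            double total = col.length - 1, h = 0.0;
            for (Map<Integer, Integer> next : ctx.values()) {
                double ctxCount = next.values().stream().mapToInt(Integer::intValue).sum();
                for (int c : next.values()) {
                    double p = c / ctxCount;
                    h -= (ctxCount / total) * p * Math.log(p) / Math.log(2);
                }
            }
            return h;
        }

        public static void main(String[] args) {
            int[] col = {1, 2, 2, 3, 3, 3, 4, 4, 4, 4};
            System.out.printf("H0 = %.3f bits/symbol%n", zeroOrderEntropy(col));
            System.out.printf("H1 = %.3f bits/symbol%n", firstOrderEntropy(col));
        }
    }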

Note that the first two rows of the table (entropy and 1st-order entropy) are not actually compressors but entropy calculations for the datasets.

Bits per symbol (symbol = 32-bit integer), by coding method (rows) and data set (columns):

    method              col1    col2    col3    col4d
    entropy             4.038   2.497   3.559   8.417
    1st-order entropy   3.962   2.483   1.932   0.027
    arithmetic          4.053   2.503   3.573   8.514
    diff+ppm/1          5.396   3.065   2.842   0.049
    continuation bit    13.184  8.521   39.489  40
    no-op               32      32      32      32
    elias delta         12.131  10.877  39.347  40
    elias gamma         13.247  11.761  60.694  62
    elias omega         13.357  12.699  42.347  43
    golomb              9.069   7.943   32.63   33
    bzip2               4.75    3.013   2.57    0.32
    default             6.438   4.353   3.446   0.11
    gzip                6.439   4.353   3.447   0.111
    huffman             4.088   2.565   3.625   8.675
    1st order huffman   4.552   2.614   2.348   1.335
    ppm/1               4.084   2.507   1.995   0.269
    vint                10.289  8       39.997  40
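
The ppm/1 and diff+ppm/1 rows presumably correspond to an adaptive order-1 model (optionally applied to deltas) driving an arithmetic coder. The sketch below shows one way the modelling side could work; it only accumulates the ideal code length -log2 p(symbol | previous symbol) instead of running a real arithmetic coder, and it is not the code used to produce the numbers above.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of an adaptive order-1 (PPM-like) model for 32-bit
    // integers. Instead of doing real arithmetic coding, it accumulates the
    // ideal code length -log2 p(symbol | previous) to estimate bits/symbol.
    public class Order1ModelSketch {

        static double estimateBitsPerSymbol(int[] col) {
            // Per-context counts: previous symbol -> (next symbol -> count).
            Map<Integer, Map<Integer, Integer>> ctx = new HashMap<>();
            double totalBits = 0.0;
            int prev = 0;
            for (int v : col) {
                Map<Integer, Integer> counts = ctx.computeIfAbsent(prev, k -> new HashMap<>());
                int seen = counts.values().stream().mapToInt(Integer::intValue).sum();
                Integer c = counts.get(v);
                if (c != null) {
                    // Symbol already seen in this context: charge -log2 of its
                    // adaptive probability (one count reserved for escape).
                    totalBits += -log2((double) c / (seen + 1));
                } else {
                    // Escape: charge the escape probability plus a flat 32 bits
                    // for the literal value (a crude stand-in for lower orders).
                    totalBits += -log2(1.0 / (seen + 1)) + 32;
                }
                counts.merge(v, 1, Integer::sum);  // update the adaptive model
                prev = v;
            }
            return totalBits / col.length;
        }

        static double log2(double x) {
            return Math.log(x) / Math.log(2);
        }

        public static void main(String[] args) {
            int[] col = {5, 5, 7, 5, 5, 7, 5, 5, 7, 5};
            System.out.printf("estimated %.3f bits/symbol%n", estimateBitsPerSymbol(col));
        }
    }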

To be done: analysis of CPU and memory usage.

Issues to be sorted out:

The split between SerDes and compression codecs, where SerDes convert cells into byte arrays and compression codecs later compress those byte arrays, means that we need to either (a) deserialize before compression or (b) allow cell values to fall on bit boundaries.
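
Option (b) matters for bit-aligned codes such as the Elias and Golomb variants above, where a cell value need not end on a byte boundary. A minimal bit-level writer sketch follows (illustrative only; not an existing Hadoop or Hive interface).

    import java.io.ByteArrayOutputStream;

    // Illustrative sketch of a bit-level writer: cell values written with
    // writeBits() may start and end at arbitrary bit offsets, as bit-aligned
    // codes (Elias, Golomb, ...) require. Not an existing Hadoop/Hive class.
    public class BitWriterSketch {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private int current = 0;   // bits accumulated but not yet flushed
        private int nbits = 0;     // how many bits are held in 'current'

        // Append the low 'count' bits of 'value', most significant bit first.
        void writeBits(long value, int count) {
            for (int i = count - 1; i >= 0; i--) {
                current = (current << 1) | (int) ((value >>> i) & 1);
                if (++nbits == 8) {
                    out.write(current);
                    current = 0;
                    nbits = 0;
                }
            }
        }

        // Pad the final partial byte with zeros and return the buffer.
        byte[] finish() {
            if (nbits > 0) {
                out.write(current << (8 - nbits));
                nbits = 0;
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            BitWriterSketch w = new BitWriterSketch();
            w.writeBits(0b0001001, 7);  // Elias gamma code for 9 (000 1001)
            w.writeBits(0b101, 3);      // the next value starts mid-byte
            System.out.println(w.finish().length + " bytes written");
        }
    }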
