This page describes investigations into improving the compression techniques used in columnar storage, for instance with RCFile.

Hadoop already provides BZip2 and Gzip codecs, which can be used for columnar compression in Hive/HCatalog.

Other compressors are available that use fewer resources (snappy, ...) or compress better (paq, ...), but we should be able to exploit certain aspects of the usage scenario here to provide compression that is both better and faster than the available options:

  • the data in a column is all of the same type
  • in some cases, the data may be ordered (see the delta-encoding sketch after this list)
  • block compression is always possible
  • record order may not matter in most situations
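
For example, when the values in an integer column are sorted or slowly varying, delta encoding followed by a continuation-bit (varint) encoding shrinks most values to one or two bytes. Below is a minimal sketch in Java; the class and method names are illustrative and not part of any existing Hive/Hadoop API.

    import java.io.ByteArrayOutputStream;

    // Illustrative sketch: delta + zigzag + continuation-bit (varint) encoding
    // for a column of 32-bit integers. Not an existing Hive/Hadoop API.
    public class DeltaVarIntSketch {

        // Zigzag-map a signed delta to an unsigned value so that small
        // negative deltas also encode in few bytes.
        static int zigzag(int v) {
            return (v << 1) ^ (v >> 31);
        }

        // Write an unsigned int using 7 data bits per byte; the high bit of
        // each byte is the continuation flag.
        static void writeVarInt(ByteArrayOutputStream out, int v) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7;
            }
            out.write(v);
        }

        // Encode a column: store each value as the zigzagged delta from the
        // previous value. Sorted or slowly-varying columns encode compactly.
        static byte[] encodeColumn(int[] column) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int prev = 0;
            for (int v : column) {
                writeVarInt(out, zigzag(v - prev));
                prev = v;
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            int[] sorted = {100, 101, 101, 105, 110, 112, 200};
            byte[] encoded = encodeColumn(sorted);
            System.out.println(sorted.length * 4 + " raw bytes -> " + encoded.length + " encoded bytes");
        }
    }

The zigzag step is only there so that occasional negative deltas stay short; for strictly sorted data the deltas are non-negative and it could be dropped.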

Initially, I am going to look into integer compression, but the lessons learned here should translate to other types as well.

Here are the compression metrics for some simulated data, expressed as the number of bits per symbol, where each symbol is a 32-bit integer. In general, bzip2 does well in terms of compression efficiency, but its bits-per-symbol is still higher than the first-order entropy, so we should be able to exploit this and produce a better compressor using an order-1 (1-level) PPM model.
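
For reference, both entropy figures can be computed by simple symbol counting over the column. A sketch of the calculation follows (illustrative names; it assumes the whole column fits in memory and is not the code used to produce the table below).

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch: zero-order and first-order entropy (bits per symbol)
    // of a column of 32-bit integers, computed by simple symbol counting.
    public class ColumnEntropy {

        // H0 = -sum p(s) log2 p(s), where p(s) is the empirical symbol frequency.
        static double zeroOrderEntropy(int[] col) {
            Map<Integer, Integer> counts = new HashMap<>();
            for (int v : col) counts.merge(v, 1, Integer::sum);
            double h = 0.0, n = col.length;
            for (int c : counts.values()) {
                double p = c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
            return h;
        }

        // H1 = expected entropy of the next symbol given the previous symbol.
        static double firstOrderEntropy(int[] col) {
            Map<Integer, Map<Integer, Integer>> ctx = new HashMap<>();
            for (int i = 1; i < col.length; i++) {
                ctx.computeIfAbsent(col[i - 1], k -> new HashMap<>())
                   .merge(col[i], 1, Integer::sum);
            }
            double total = col.length - 1, h = 0.0;
            for (Map<Integer, Integer> next : ctx.values()) {
                double ctxCount = next.values().stream().mapToInt(Integer::intValue).sum();
                for (int c : next.values()) {
                    double p = c / ctxCount;
                    h -= (ctxCount / total) * p * Math.log(p) / Math.log(2);
                }
            }
            return h;
        }

        public static void main(String[] args) {
            int[] col = {1, 2, 2, 3, 3, 3, 4, 4, 4, 4};
            System.out.printf("H0 = %.3f bits/symbol%n", zeroOrderEntropy(col));
            System.out.printf("H1 = %.3f bits/symbol%n", firstOrderEntropy(col));
        }
    }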

Note that the first two rows of the table (entropy and 1st-order entropy) are not actually compressors but entropy calculations for the datasets.

Bits per symbol (symbol = 32-bit integer), by coding method (rows) and data set (columns):

    method              col1    col2    col3    col4d
    entropy             4.038   2.497   3.559   8.417
    1st-order entropy   3.962   2.483   1.932   0.027
    arithmetic          4.053   2.503   3.573   8.514
    diff+ppm/1          5.396   3.065   2.842   0.049
    continuation bit    13.184  8.521   39.489  40
    no-op               32      32      32      32
    elias delta         12.131  10.877  39.347  40
    elias gamma         13.247  11.761  60.694  62
    elias omega         13.357  12.699  42.347  43
    golomb              9.069   7.943   32.63   33
    bzip2               4.75    3.013   2.57    0.32
    default             6.438   4.353   3.446   0.11
    gzip                6.439   4.353   3.447   0.111
    huffman             4.088   2.565   3.625   8.675
    1st order huffman   4.552   2.614   2.348   1.335
    ppm/1               4.084   2.507   1.995   0.269
    vint                10.289  8       39.997  40
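
The ppm/1 and diff+ppm/1 rows presumably correspond to an adaptive order-1 model (optionally applied to deltas) driving an arithmetic coder. The sketch below shows one way the modelling side could work; it only accumulates the ideal code length -log2 p(symbol | previous symbol) instead of running a real arithmetic coder, and it is not the code used to produce the numbers above.

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch of an adaptive order-1 (PPM-like) model for 32-bit
    // integers. Instead of doing real arithmetic coding, it accumulates the
    // ideal code length -log2 p(symbol | previous) to estimate bits/symbol.
    public class Order1ModelSketch {

        static double estimateBitsPerSymbol(int[] col) {
            // Per-context counts: previous symbol -> (next symbol -> count).
            Map<Integer, Map<Integer, Integer>> ctx = new HashMap<>();
            double totalBits = 0.0;
            int prev = 0;
            for (int v : col) {
                Map<Integer, Integer> counts = ctx.computeIfAbsent(prev, k -> new HashMap<>());
                int seen = counts.values().stream().mapToInt(Integer::intValue).sum();
                Integer c = counts.get(v);
                if (c != null) {
                    // Symbol already seen in this context: charge -log2 of its
                    // adaptive probability (one count reserved for escape).
                    totalBits += -log2((double) c / (seen + 1));
                } else {
                    // Escape: charge the escape probability plus a flat 32 bits
                    // for the literal value (a crude stand-in for lower orders).
                    totalBits += -log2(1.0 / (seen + 1)) + 32;
                }
                counts.merge(v, 1, Integer::sum);  // update the adaptive model
                prev = v;
            }
            return totalBits / col.length;
        }

        static double log2(double x) {
            return Math.log(x) / Math.log(2);
        }

        public static void main(String[] args) {
            int[] col = {5, 5, 7, 5, 5, 7, 5, 5, 7, 5};
            System.out.printf("estimated %.3f bits/symbol%n", estimateBitsPerSymbol(col));
        }
    }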

To be done: analysis of CPU and memory usage.

Issues to be sorted out:

The split between SerDes and compression codecs, where SerDes convert cells into byte arrays and compression codecs later compress those byte arrays, means that we need to either (a) deserialize before compression or (b) allow cell values to fall on bit boundaries.
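
Option (b) matters for bit-aligned codes such as the Elias and Golomb variants above, where a cell value need not end on a byte boundary. A minimal bit-level writer sketch follows (illustrative only; not an existing Hadoop or Hive interface).

    import java.io.ByteArrayOutputStream;

    // Illustrative sketch of a bit-level writer: cell values written with
    // writeBits() may start and end at arbitrary bit offsets, as bit-aligned
    // codes (Elias, Golomb, ...) require. Not an existing Hadoop/Hive class.
    public class BitWriterSketch {
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private int current = 0;   // bits accumulated but not yet flushed
        private int nbits = 0;     // how many bits are held in 'current'

        // Append the low 'count' bits of 'value', most significant bit first.
        void writeBits(long value, int count) {
            for (int i = count - 1; i >= 0; i--) {
                current = (current << 1) | (int) ((value >>> i) & 1);
                if (++nbits == 8) {
                    out.write(current);
                    current = 0;
                    nbits = 0;
                }
            }
        }

        // Pad the final partial byte with zeros and return the buffer.
        byte[] finish() {
            if (nbits > 0) {
                out.write(current << (8 - nbits));
                nbits = 0;
            }
            return out.toByteArray();
        }

        public static void main(String[] args) {
            BitWriterSketch w = new BitWriterSketch();
            w.writeBits(0b0001001, 7);  // Elias gamma code for 9 (000 1001)
            w.writeBits(0b101, 3);      // the next value starts mid-byte
            System.out.println(w.finish().length + " bytes written");
        }
    }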
