Performance Measurements - round 2

Setup:

Here are some sample measurements taken with a single agent.

Cluster Config: 20-node Hadoop cluster (1 name node and 19 data nodes).

Machine Config: 24 cores – Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM, 7200 rpm Hard Drive.

1. File channel with HDFS Sink (Sequence File):

Flume version: 1.4

Source: 4 x Exec Source, 100k batchSize

HDFS Sink Batch size: 500,000

Event Size: 500 byte events.

Channel: File

Sinks	1 data dirs	2 data dirs	4 data dirs	6 data dirs	8 data dirs	10 data dirs
	Events/Sec
1	14.3k
2	21.9k
4		35.8k
8			72.5k	77k	78.6 (37 MB/s)	76.6k
10			58k
12			49.3	49k

Measurements were taken to get an idea around the configuration that yields best performance. So took measurements only for all data points in the grid that made sense. For example it was not necessary to take measurements for multiple data dirs at single sink, as it was evident more sinks is better.

2. HDFS Sink:

Flume version: 1.4

Channel: Memory

Event Size: 500 byte events.

# of HDFS Sinks	Snappy BatchSz:1.2mill	Snappy BatchSz:1.4mill	Sequence File BatchSz:1.2mill
1	34.3 k	33 k	33 k
2	71 k	75 k	69 k
4	141 k	145 k	141 k
8	271 k	273 k	251 k
12	382 k	380 k	370 k
16	478 k	538 k (240MB/s)	486 k (232 MB/s)

Some simple observations:

increasing number of dataDirs helps FC perf even on single disk systems
Increasing number of sinks helps

3. Hive Sink:

Flume version: 1.5 & 1.6

Channel: Memory

BatchSz:1million

Event Size: 500 byte events.

	Flume 1.5		Flume 1.6
	Event/s	MBps	Event/s	MBps
	1 Sink
DELIMITED Text	36,885	18	138,461	66
JSON	12,735	6


	16 sinks (agent maxed out)
DELIMITED Text	209,600	100	348,214	166
JSON	25,751	12	31,135	14

Observation: JSON data much slower potentially due to higher parsing overhead of JSON in part.

4. HBase Sink:

Flume version: 1.4

Channel: Memory

Serializer: RegexHbaseEventSerializer

Total Sinks: 1

Event Size

(bytes)

BatchSz: 1

BatchSz: 100

BatchSz: 1000

BatchSz: 10,000

500

11 MB/s

1000

0.5 MB/s

14 MB/s

22 MB/s

27 MB/s

5. ASync HBase Sink:

Channel: Memory

Serializer: SimpleAsyncHbaseEventSerializer

Total Sinks: 1

Event Size

(bytes)

BatchSz: 1

BatchSz: 100

BatchSz: 1000

500

0.4 MB/s

0.5 MB/s

1000

0.8 MB/s

0.9 MB/s

Child pages

Performance Measurements - round 2

Setup:

1. File channel with HDFS Sink (Sequence File):

2. HDFS Sink:

3. Hive Sink:

4. HBase Sink:

5. ASync HBase Sink: