HCatMix is the performance testing framework for HCatalog.
The objective is:
In order to meet the above objective, the following would be measured:
PigLoader/PigStorer for various data sizes and numbers of partitions.

Test setup needs to perform two tasks:
Both of these are driven by a configuration file. The following is an example of a setup XML file.

    <database>
      <tables>
        <table>
          <namePrefix>page_views_1brandnew</namePrefix>
          <dbName>default</dbName>
          <columns>
            <column>
              <name>user</name>
              <type>STRING</type>
              <avgLength>20</avgLength>
              <distribution>zipf</distribution>
              <cardinality>1000000</cardinality>
              <percentageNull>10</percentageNull>
            </column>
            <column>
              <name>timespent</name>
              <type>INT</type>
              <distribution>zipf</distribution>
              <cardinality>5000000</cardinality>
              <percentageNull>25</percentageNull>
            </column>
            <column>
              <name>query_term</name>
              <type>STRING</type>
              <avgLength>5</avgLength>
              <distribution>zipf</distribution>
              <cardinality>10000</cardinality>
              <percentageNull>0</percentageNull>
            </column>
            . . .
          </columns>
          <partitions>
            <partition>
              <name>timestamp</name>
              <type>INT</type>
              <distribution>zipf</distribution>
              <cardinality>1000</cardinality>
            </partition>
            <partition>
              <name>action</name>
              <type>INT</type>
              <distribution>zipf</distribution>
              <cardinality>8</cardinality>
            </partition>
            . . .
          </partitions>
          <instances>
            <instance>
              <size>1000000</size>
              <count>1</count>
            </instance>
            <instance>
              <size>100000</size>
              <count>1</count>
            </instance>
            . . .
          </instances>
        </table>
      </tables>
    </database>
A column/partition has the following details:

- name: name of the column
- type: type of data (string/int etc.)
- avgLength: average length if the type is string
- distribution: distribution type. Either uniform, or zipf to generate data that follows Zipf's distribution (http://en.wikipedia.org/wiki/Zipf's_law)
- cardinality: size of the sample space
- percentageNull: what percentage of the values should be null

The instances section defines how many instances of a table with the same specification are to be created, and the number of rows for each of them.
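To make the column fields concrete, here is a minimal sketch of how one `<column>` element could be read and turned into a value generator. It is not HCatMix's data generator: the function names `parse_column` and `make_generator` are hypothetical, the Zipf exponent of 1 is an assumption, and the cardinality is reduced to 1000 so the example runs quickly.

```python
import random
import xml.etree.ElementTree as ET
from itertools import accumulate

# One <column> element in the same shape as the setup file
# (cardinality lowered for the demo; this value is an assumption).
COLUMN_XML = """
<column>
  <name>timespent</name>
  <type>INT</type>
  <distribution>zipf</distribution>
  <cardinality>1000</cardinality>
  <percentageNull>25</percentageNull>
</column>
"""


def parse_column(xml_text):
    """Read the fields of one <column> element into a plain dict."""
    col = ET.fromstring(xml_text)
    return {
        "name": col.findtext("name"),
        "type": col.findtext("type"),
        "distribution": col.findtext("distribution"),
        "cardinality": int(col.findtext("cardinality")),
        "percentage_null": float(col.findtext("percentageNull", default="0")),
    }


def make_generator(spec, seed=0):
    """Return a function producing a rank in 1..cardinality, or None for a null."""
    rng = random.Random(seed)
    ranks = list(range(1, spec["cardinality"] + 1))
    if spec["distribution"] == "zipf":
        # Assumption: Zipf exponent 1, so rank k is drawn with weight 1/k.
        weights = [1.0 / k for k in ranks]
    elif spec["distribution"] == "uniform":
        weights = [1.0] * len(ranks)
    else:
        raise ValueError("unknown distribution: " + spec["distribution"])
    cum_weights = list(accumulate(weights))

    def next_value():
        # percentageNull is the share of rows that should be null.
        if rng.random() * 100 < spec["percentage_null"]:
            return None
        return rng.choices(ranks, cum_weights=cum_weights, k=1)[0]

    return next_value


if __name__ == "__main__":
    gen = make_generator(parse_column(COLUMN_XML))
    print([gen() for _ in range(10)])
```

With `zipf`, low ranks dominate the output, while `uniform` spreads values evenly over the sample space; `cardinality` bounds the number of distinct values in both cases.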
HCatLoader() and HCatStorer for the same data:

- PigStorage to load and HCatStorer to store
- HCatLoader to load and HCatStorer to store
- HCatLoader to load and PigStorage to store
- PigStorage to load and PigStorage to store

Input Data Size | Number of partitions |
---|---|
105MB | 0, 300, 600, 900, 1200, 1500, 2000 |
1GB | 0, 300 |
10GB | 0, 300 |
100GB | 0, 300 |
These tests are driven by configuration, and a new test can be added by dropping in a configuration file.
hcatmix/src/test/resources/hcatmix_load_store_tests.yml
mvn test -Dtest=TestLoadStoreScripts -Phadoop20
mvn test -Dtest=TestLoadStoreScripts -DhcatSpecFile=src/test/resources/performance/100GB_300_parititons.xml -DnumRuns=1 -DnumDataGenMappers=30 -Phadoop20
hadoop20 or hadoop23
target/results directory

The Hadoop MapReduce framework itself is used to run the concurrency test: the map phase increases the number of tasks over time and generates statistics every minute. The reduce phase aggregates the statistics of all the maps and outputs them against the increasing number of concurrent clients. Because MapReduce is used, this tool can scale to any number of parallel clients required for a concurrency test.
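The map/reduce split described above can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions: it does not use Hadoop or call HCatalog, the names `run_mapper`, `reduce_stats`, and `timed_call` are hypothetical, and each simulated thread is assumed to make exactly one call per minute.

```python
from collections import defaultdict


def run_mapper(runtime_minutes, thread_increment, timed_call):
    """Simulate one map task: add threads every minute, record per-minute stats."""
    records = []
    threads = 0
    for minute in range(1, runtime_minutes + 1):
        threads += thread_increment  # ramp up at the end of each interval
        # Assumption: each simulated thread makes one timed call this minute.
        latencies = [timed_call() for _ in range(threads)]
        records.append((minute, threads, len(latencies), sum(latencies)))
    return records


def reduce_stats(records):
    """Aggregate the per-minute records of all mappers, keyed by minute."""
    totals = defaultdict(lambda: [0, 0, 0.0])  # threads, calls, total latency
    for minute, threads, calls, total_ms in records:
        entry = totals[minute]
        entry[0] += threads
        entry[1] += calls
        entry[2] += total_ms
    return {
        minute: {"threads": t, "calls": c, "avg_ms": ms / c}
        for minute, (t, c, ms) in sorted(totals.items())
    }


if __name__ == "__main__":
    def fake_call():
        return 10.0  # stand-in latency (ms) for a timed API call

    all_records = []
    for _ in range(3):  # three simulated mappers
        all_records.extend(run_mapper(5, 4, fake_call))
    for minute, stats in reduce_stats(all_records).items():
        print(minute, stats)
```

Keying the reduce step by minute is what lets the output show how latency changes as the total number of concurrent clients grows.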
Concurrency tests are done for the following api call:
The test is defined in a properties file:

    # For the following example the number of threads will increase from
    # 80 to 2000 over a period of 25 minutes. T25 = 4*20 + (25 - 1)*4*20 = 2000
    # The comma separated task classes which contains the getTable() call
    task.class.names=org.apache.hcatalog.hcatmix.load.tasks.HCatGetTable
    # The number of map tasks to run
    num.mappers=20
    # How many threads to increase at the end of fixed interval
    thread.increment.count=4
    # The interval at which number of threads are increased
    thread.increment.interval.minutes=1
    # For how long the map would run
    map.runtime.minutes=25
    # Extra wait time to let the individual tasks to finish
    thread.completion.buffer.minutes=1
    # The interval at which statistics would be collected
    stat.collection.interval.minutes=1
    # input directory where dummy files are created to control the number of mappers
    input.dir=/tmp/hcatmix/loadtest/input
    # The location where the collected statistics would be stored
    output.dir=/tmp/hcatmix/loadtest/output
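The thread arithmetic in the comment at the top of the properties file can be checked with a short script. This is a sketch, not part of HCatMix: `parse_properties` and `total_threads` are hypothetical helpers, and the handling of intervals longer than one minute is an assumption, since the source only shows the one-minute case.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def total_threads(props, minute):
    """Total client threads across all mappers during the given minute (1-based)."""
    mappers = int(props["num.mappers"])
    increment = int(props["thread.increment.count"])
    interval = int(props["thread.increment.interval.minutes"])
    # The first batch starts immediately; one more batch per elapsed interval.
    steps = 1 + (minute - 1) // interval
    return mappers * increment * steps


EXAMPLE = """
# Subset of the properties shown above
num.mappers=20
thread.increment.count=4
thread.increment.interval.minutes=1
map.runtime.minutes=25
"""

if __name__ == "__main__":
    props = parse_properties(EXAMPLE)
    runtime = int(props["map.runtime.minutes"])
    for minute in (1, 2, runtime):
        print(minute, total_threads(props, minute))
```

For the values above this gives 80 threads in the first minute and 2000 in minute 25, matching `T25 = 4*20 + (25 - 1)*4*20 = 2000`.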
More concurrency tests can be added by adding configuration files and a class that implements the Task interface.
mvn test -Dtest=TestHCatalogLoad -DloadTestConfFile=src/main/resources/load/hcat_get_table_load_test.properties -Phadoop20
The following environment variables need to be defined:
HADOOP_HOME
HCAT_HOME
HADOOP_CONF_DIR
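Before launching the Maven commands, it can help to confirm that these variables are set. The following is a small optional sketch; the helper name `missing_env_vars` is hypothetical and not part of HCatMix.

```python
import os
import sys

REQUIRED_VARS = ("HADOOP_HOME", "HCAT_HOME", "HADOOP_CONF_DIR")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
        sys.exit(1)
    print("All required environment variables are set.")
```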