Page History

Recently we used CarbonData to do the performance in Telecommunication filed and summarize some of the Suggestions while creating the CarbonData table. We have tables which range from 10 thousand rows to 10 billion rows and have from 100 columns to 300 columns. Following are some of the columns used in the table.

Table Column Description

Column name	Data type	Cardinality	Attribution
msisdn	String	30 million	dimension
BEGIN_TIME	bigint	10 thousand	dimension
HOST	String	1 million	dimension
Dime_1	String	1 thousand	dimension
counter_1	numeric(20,0)	NA	measure
...	...	NA	measure
counter_100	numeric(20,0)	NA	measure

We have about more than 50 test cases; according to the test case we summarize some suggestion to create the table which can have a better query performance.

...

    create table carbondata_table(
    msisdn String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');

2. If has multiple column which is frequently-use in the filter, put it to the front in the order as low cardinality to high cardinality.

...

    create table carbondata_table(
    Dime_1 String,
    HOST String,
    MSISDN String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');

3. If no column is frequent-use in filter, then can put all the dimension column order as from low cardinality to high cardinality.

...

    create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint
    HOST String,
    MSISDN String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');

4. For measure that needs no high accuracy, then no need to use numeric(20,0) data type, suggestion is to use double to replace it than will increase the query performance.

...

    create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint
    HOST String,
    MSISDN String,
    counter_1 double,
    counter_2 double,
    ...
    counter_100 double,
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

5. If the column which is always incremental like start_time.

...

    create table carbondata_table(
    Dime_1 String,
    HOST String,
    MSISDN String,
    counter_1 double,
    counter_2 double,
    BEGIN_TIME bigint,
    ...
    counter_100 double,
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

One more suggestion

For the dimension whether dictionary is needed or not, we suggest the cardinality do not put it as dictionary column if you don not have enough memory. If high cardinality column put as dictionary will impact the load performance and need more memory. CarbonData default behavior is: CarbonData will do cardinality discovery for the first data load and will do dictionary if cardinality<1millioncardinality < 1million

Page tree

Versions Compared

Old Version 5

New Version Current

Key