Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Recently we used CarbonData to do the performance in Telecommunication filed and summarize some of the Suggestions while creating the CarbonData table. We have tables which range from 10 thousand rows to 10 billion rows and have from 100 columns to 300 columns. Following are some of the columns used in the table.

Table Column Description

Column name
Data type
Cardinality
Attribution
msisdnString30 milliondimension
BEGIN_TIMEbigint10 thousanddimension
HOSTString1 milliondimension
Dime_1String1 thousanddimension
counter_1numeric(20,0)NAmeasure
......NAmeasure
counter_100numeric(20,0)NAmeasure

We have about more than 50 test cases; according to the test case we summarize some suggestion to create the table which can have a better query performance.

...

    create table carbondata_table(
    msisdn String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');

2. If has multiple column which is frequently-use in the filter, put it to the front in the order as low cardinality to high cardinality.

...

    create table carbondata_table(
    Dime_1 String,
    HOST String,
    MSISDN String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');

3. If no column is frequent-use in filter, then can put all the dimension column order as from low cardinality to high cardinality.

...

    create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint
    HOST String,
    MSISDN String,
    ...
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');

4. For measure that needs no high accuracy, then no need to use numeric(20,0) data type, suggestion is to use double to replace it than will increase the query performance.

...

    create table carbondata_table(
    Dime_1 String,
    BEGIN_TIME bigint
    HOST String,
    MSISDN String,
    counter_1 double,
    counter_2 double,
    ...
    counter_100 double,
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

5. If the column which is always incremental like start_time.

...

    create table carbondata_table(
    Dime_1 String,
    HOST String,
    MSISDN String,
    counter_1 double,
    counter_2 double,
    BEGIN_TIME bigint,
    ...
    counter_100 double,
    )STORED BYas 'org.apache.carbondata.format' 
    TBLPROPERTIES ( 'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');

One more suggestion

For the dimension whether dictionary is needed or not, we suggest the cardinality do not put it as dictionary column if you don not have enough memory. If high cardinality column put as dictionary will impact the load performance and need more memory. CarbonData default behavior is: CarbonData will do cardinality discovery for the first data load and will do dictionary if cardinality<1millioncardinality < 1million