CarbonData is a new big data file format for faster interactive queries. It uses advanced columnar storage, indexing, compression and dictionary encoding techniques to improve computing efficiency, which in turn helps speed up queries by an order of magnitude over petabytes of data.
The Apache CarbonData community is pleased to announce the availability of CarbonData 0.1.0, which is the first stable release.
We encourage everyone to download the release and give feedback through the CarbonData user mailing lists!
Some highlights of the release are listed in the following sections.
1. CarbonData file format features:
1) Supported indexed columnar Hadoop native data store.
2) Supported short, int, bigint, double, string and timestamp datatypes.
3) Supported struct and array complex datatypes.
4) Supported global dictionary generation for the selected columns.
5) Supported RLE encoding and inverted indexing on data.
6) Supported Snappy compression in file format section.
7) Supported direct dictionary key for date and time dimensions.
8) Supported column group to keep storage size more compact.
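As a rough illustration of how dictionary encoding and RLE (items 4 and 5 above) can work together on a column, here is a minimal Python sketch. It is a toy model only and not CarbonData's actual implementation or on-disk layout; the function names and the sample column are hypothetical and chosen purely for illustration.

```python
from itertools import groupby

def build_dictionary(values):
    # Assign each distinct value a small integer key.
    # Sorting is an assumption made here so the mapping is deterministic.
    return {v: i for i, v in enumerate(sorted(set(values)))}

def dictionary_encode(values, dictionary):
    # Replace every value with its integer dictionary key.
    return [dictionary[v] for v in values]

def rle_encode(keys):
    # Collapse consecutive runs of equal keys into (key, run_length) pairs.
    return [(k, len(list(g))) for k, g in groupby(keys)]

def rle_decode(runs):
    # Expand (key, run_length) pairs back into the flat key sequence.
    return [k for k, n in runs for _ in range(n)]

# Hypothetical low-cardinality column, already sorted so runs are long.
column = ["cn", "cn", "cn", "in", "in", "us", "us", "us", "us"]

dictionary = build_dictionary(column)
keys = dictionary_encode(column, dictionary)
runs = rle_encode(keys)

# Decoding reverses both steps: expand the runs, then look up each key.
inverse = {i: v for v, i in dictionary.items()}
restored = [inverse[k] for k in rle_decode(runs)]
```

In this sketch the nine string values shrink to three (key, run_length) pairs plus a three-entry dictionary, which is why the two techniques tend to pay off on sorted, low-cardinality dimension columns.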
2. CarbonData integration with Apache Spark features:
1) Supported DDL commands to create, drop, show and describe tables with Hive-compliant syntax.
2) Supported `LOAD` command to load data to CarbonData.
3) Supported data management operations such as cleaning and deleting segments.
4) Supported lazy dictionary conversion using an optimizer rule in Spark.
5) Supported push down of possible projection and filters to carbon.
6) Supported ZooKeeper-based locking for concurrent loads and updates.
7) Supported data compaction to merge multiple loads.
8) Supported Spark dataframe and datasource API integration.
3. The release also includes the following JIRA issues:
Bugs
[CARBONDATA-5] - data mismatch between the carbon Table and Hive Table for columns having \N for non numeric data type
[CARBONDATA-7] - Fortify issue fixes
[CARBONDATA-9] - Carbon data load bad record is not written into the bad record log file
[CARBONDATA-10] - Avoid too much logging of timestamp parsing exception in TimeStampDirectDictionaryGenerator
[CARBONDATA-12] - Carbon data load bad record log file not renamed from inprogress to normal .log
[CARBONDATA-14] - carbon.cutOffTimestamp and carbon.timegranularity is not added in the carbon.properties.template
[CARBONDATA-15] - filter query throwing error if the query applied over a table having no data.
[CARBONDATA-16] - BLOCK distribution in query is not correct in query when number of executors are less than the cluster size.
[CARBONDATA-17] - select count(*) from table where column_x = 'value' is not returning the correct count
[CARBONDATA-18] - DataType tinyint is not supported in carbondata
[CARBONDATA-21] - Filter query issue for >, <, <= than filter
[CARBONDATA-27] - filter expression to_date( productdate ) = '2012-12-12' not working
[CARBONDATA-33] - Test cases fails when storePath contains '-' like 'incubator-carbon/store'
[CARBONDATA-34] - Drop table fails when mentions along with db name.
[CARBONDATA-36] - fix dictionary exception when data column count less than schema
[CARBONDATA-43] - query decimal field with "select * from table where decimalFied=1234.12", the query result is empty.
[CARBONDATA-46] - Fix dataframe API of Carbon
[CARBONDATA-49] - Can not query 3 million rows data which be loaded through local store system(not HDFS)
[CARBONDATA-53] - when No dictionary column has negative number, the query result is Null.
[CARBONDATA-54] - Windows functions are not working in Carbon
[CARBONDATA-55] - Pushdown greaterthan and lessthan filters to Carbon
[CARBONDATA-56] - Exception thrown when aggregation on dimension which has decimal datatype
[CARBONDATA-57] - BLOCK distribution un wanted wait for the executor node even though the sufficient nodes are available
[CARBONDATA-58] - dataloading is launched with wrong number of task
[CARBONDATA-59] - Filter queries on columns other than string datatype cannot get the correct result when included as dictionary column
[CARBONDATA-60] - wrong result when using union all
[CARBONDATA-61] - Change Cube to Table
[CARBONDATA-62] - Values not valid for column datatype are not getting discarded while generating global dictionary
[CARBONDATA-64] - data mismatch between the carbon Table and Hive Table for data having empty lines
[CARBONDATA-65] - Data load fails if there is space in the header names provided in FILEHEADER option in load command
[CARBONDATA-66] - Filter was failing when join condition is been applied between two tables
[CARBONDATA-67] - keyword is not allowed as carbon table name
[CARBONDATA-69] - Column Group Data loading is Failing
[CARBONDATA-71] - Percentile Aggregate function is not working for carbon format
[CARBONDATA-72] - Column group count query
[CARBONDATA-73] - Disable Autodetect highcardinality on column group
[CARBONDATA-74] - Remove code for describe command, as it will be handled in spark
[CARBONDATA-75] - Dictionary file not getting clean on global dictionary failure
[CARBONDATA-76] - Not Equals filter display even the null members while filtering non null values
[CARBONDATA-77] - Delete segment folder after segment clean up.
[CARBONDATA-78] - Update the ReadMe and related documents as per the latest changes
[CARBONDATA-79] - Data load fails when complex type column with timestamp primitives
[CARBONDATA-84] - Change Locking framework to suit database related locks
[CARBONDATA-86] - Value displayed as Null after increase in precision for decimal datatype after aggregation
[CARBONDATA-87] - Temp files not getting deleted
[CARBONDATA-90] - Struct of array query execution is failing
[CARBONDATA-91] - Concurrent query returning empty result
[CARBONDATA-93] - Task not re-submitted by spark on data load failure
[CARBONDATA-95] - Columns values with numeric data types are not getting parsed when included in dictionary_include
[CARBONDATA-97] - Decimal Precision and scale is not getting applied based on schema metadata
[CARBONDATA-99] - Complex type column filters with like and not like not working
[CARBONDATA-105] - Correct precalculation of dictionary file existence
[CARBONDATA-110] - if user deletes the segment already selected for compaction then compaction need to get failed.
[CARBONDATA-111] - If compaction job is killed then need to stop the compaction tasks running.
[CARBONDATA-112] - regexp_replace filter query is failing for carbon table.
[CARBONDATA-114] - Decimal Precision and scale getting lost for Complex type columns while describing and querying
[CARBONDATA-115] - Log level updation for better maintainability of logs
[CARBONDATA-116] - major compacted segments are considered for minor also
[CARBONDATA-118] - clean up of temp files in compaction
[CARBONDATA-119] - zookeeper lock is not working at executor for dictionary locking
[CARBONDATA-120] - Explain extended carbon command is failing
[CARBONDATA-121] - Need to check the validity of segments before compaction.
[CARBONDATA-123] - Stored by 'carbondata' or 'org.apache.carbondata.format' should not be case sensitive
[CARBONDATA-124] - Exception thrown while executing drop table command in spark-sql cli
[CARBONDATA-126] - CSV file stream closing issue
[CARBONDATA-127] - Issue while type casting data read from sort temp file to big decimal type
[CARBONDATA-128] - Add block building statistics
[CARBONDATA-129] - Do null check before adding value to CarbonProperties
[CARBONDATA-134] - Temp location of data load is not getting cleared in case of exception in data load
[CARBONDATA-135] - Multiple hdfs client creation issue
[CARBONDATA-136] - Fixed Query data mismatch issue after compaction
[CARBONDATA-137] - Fixed detail limit query statistics issue
[CARBONDATA-138] - Scale up value of Avg aggregation for decimal type keeping sync with hive
[CARBONDATA-139] - Inconsistency during sortindex file reading
[CARBONDATA-140] - Fix legal
[CARBONDATA-141] - Polish Maven coordinates and define Apache parent POM
[CARBONDATA-142] - Rename package org.carbondata to org.apache.carbondata
[CARBONDATA-146] - Data loading failure using carbon-spark-sql and carbon-spark-shell
[CARBONDATA-147] - Describe formatted command failing
[CARBONDATA-150] - Aggregate function with sub query using Order by is not working
[CARBONDATA-151] - count & count distinct column on same query is not working
[CARBONDATA-154] - Block prune can not get the right blocks and query result is wrong
Improvements
[CARBONDATA-11] - Support carbon spark shell in carbondata to simplify operations for first time users
[CARBONDATA-13] - Time stamp range filters not able to prune blocks
[CARBONDATA-19] - Column Group Filter (Rowlevel and Exclude)
[CARBONDATA-38] - Cleanup carbon.properties
[CARBONDATA-40] - Make metastore_db location of derby configurable in CarbonContext
[CARBONDATA-47] - Simplified datasource format name and storage name in carbondata
[CARBONDATA-68] - Added Performance statistics for query execution in driver and executor
[CARBONDATA-92] - Remove the unnecessary intermediate conversion of key while scanning.
[CARBONDATA-96] - make zookeeper lock as default if zookeeper url is configured.
[CARBONDATA-102] - Exclude the Spark and hadoop from CarbonData assembly jar by default and reduce the jar file size
[CARBONDATA-106] - Add audit logs for DDL commands
[CARBONDATA-107] - Remove unnecessary ConverToSafe in spark planner
[CARBONDATA-122] - Provide second and third location preference to spark
[CARBONDATA-48] - Support Carbon sql cli to enhance user experience
[CARBONDATA-50] - Support Spark 1.6.2 in CarbonData
Test
[CARBONDATA-8] - Use create table instead of cube in all test cases