Apache CarbonData 1.6.0 Release

The Apache CarbonData community is pleased to announce the release of version 1.6.0 at The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments. In one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds!

We encourage you to use the release (https://archive.apache.org/dist/carbondata/1.6.0/) and to send feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.6.0?

The intention of CarbonData 1.6.0 was to move closer to unified analytics. We have added an index server to distribute the index cache. We have also added incremental loading on MV datamaps to reduce datamap loading time. CarbonData tables can now be read from Hive, and the SDK now supports the Arrow format.

In this version of CarbonData, around 75 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

Index Server to distribute the index cache and parallelise the index pruning

Before this release, Carbon pruned and cached all block/blocklet datamap index information in the driver. If the cache grows very large (70-80% of the driver memory), excessive GC in the driver can slow down queries, and the driver may even run out of memory (OutOfMemory). If multiple JDBC servers read from the same tables, each JDBC server must maintain its own copy of the cache. To solve these problems, we have introduced a distributed Index Cache Server. It is a separate, scalable server that stores only index information; all drivers can connect to it and prune data using the cached index information.
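The idea of a shared index cache can be illustrated with a small, self-contained sketch. This is not CarbonData code: `IndexCacheServer`, `BlockletIndex`, and `load_index` are hypothetical names, the index is reduced to min/max values on one integer column, and the LRU eviction is only an assumption suggested by CARBONDATA-3392. It shows two "drivers" pruning against a single cached copy of the index instead of each loading its own.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockletIndex:
    """Min/max statistics for one blocklet (hypothetical, simplified)."""
    path: str
    min_value: int
    max_value: int


class IndexCacheServer:
    """Toy stand-in for a shared index cache: one copy serves all drivers."""

    def __init__(self, loader, max_tables):
        self._loader = loader          # callable: table name -> list of BlockletIndex
        self._max_tables = max_tables  # LRU capacity, counted in tables
        self._cache = OrderedDict()
        self.loads = 0                 # how many times an index was read from storage

    def _index_for(self, table):
        if table in self._cache:
            self._cache.move_to_end(table)       # mark as most recently used
        else:
            self._cache[table] = self._loader(table)
            self.loads += 1
            if len(self._cache) > self._max_tables:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[table]

    def prune(self, table, lo, hi):
        """Return paths of blocklets whose [min, max] range overlaps [lo, hi]."""
        return [b.path for b in self._index_for(table)
                if b.max_value >= lo and b.min_value <= hi]


def load_index(table):
    # Hypothetical index contents; a real deployment would read index files.
    return [
        BlockletIndex(f"{table}/part-0", 0, 99),
        BlockletIndex(f"{table}/part-1", 100, 199),
        BlockletIndex(f"{table}/part-2", 200, 299),
    ]


server = IndexCacheServer(load_index, max_tables=2)
# Two "drivers" query the same table; the index is loaded only once.
driver_a = server.prune("sales", 150, 250)
driver_b = server.prune("sales", 0, 50)
print(driver_a, driver_b, server.loads)
```

In this sketch the drivers hold no index data themselves, which is the property that relieves driver memory pressure in the scenario described above.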

Incremental data loading on MV datamaps

Previously, every new data load on the parent table required a full reload of the MV datamap. Incremental loading on MV datamaps is now supported, so a new load on the parent table triggers a load on the MV datamap only for the incrementally added data.
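A minimal sketch of the difference between a full and an incremental load follows. It is a conceptual model only, not the CarbonData implementation: `ParentTable` and `SumByKeyView` are hypothetical names, and the view is assumed to be a simple sum grouped by key. The view remembers how many parent segments it has already processed and scans only the newer ones on refresh.

```python
from collections import Counter


class ParentTable:
    """Toy parent table: each load appends one segment of (key, amount) rows."""

    def __init__(self):
        self.segments = []

    def load(self, rows):
        self.segments.append(list(rows))


class SumByKeyView:
    """Toy materialized view: SELECT key, SUM(amount) ... GROUP BY key."""

    def __init__(self, parent):
        self.parent = parent
        self.totals = Counter()
        self.loaded_segments = 0   # number of parent segments already reflected
        self.rows_scanned = 0      # bookkeeping to show how much work was done

    def refresh(self):
        # Only segments added since the last refresh are scanned.
        for segment in self.parent.segments[self.loaded_segments:]:
            for key, amount in segment:
                self.totals[key] += amount
                self.rows_scanned += 1
        self.loaded_segments = len(self.parent.segments)


table = ParentTable()
view = SumByKeyView(table)

table.load([("a", 1), ("b", 2), ("a", 3)])
view.refresh()                      # scans the 3 rows of the first segment
table.load([("b", 10)])
view.refresh()                      # scans only the 1 newly added row

print(dict(view.totals), view.rows_scanned)
```

A full reload would have scanned all four rows again on the second refresh; the incremental version scans four rows in total across both refreshes.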

Supported Arrow format from Carbon SDK

The SDK reader now supports reading CarbonData files and filling the data into Apache Arrow vectors. This helps to avoid unnecessary intermediate serialisation when the data is accessed from other execution engines or languages.

Supported read from Hive

CarbonData files can now be read from Hive. This helps users with existing Hive deployments that use other formats to migrate to the CarbonData format easily.

Behaviour Change

None

Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344965

Sub-task

[CARBONDATA-3306] - Implement a DistributableIndexPruneRDD and IndexPruneFileFormat
[CARBONDATA-3337] - Implement a Hadoop RPC framework for communication
[CARBONDATA-3338] - Incremental data load support to datamap on single table
[CARBONDATA-3349] - add is_sorted and sort_columns information into show segments
[CARBONDATA-3350] - enhance custom compaction to support resort single segment
[CARBONDATA-3357] - Support TableProperties from single parent table and restrict alter/delete/partition on mv
[CARBONDATA-3378] - Display original query in Indexserver Job
[CARBONDATA-3381] - Large response size Exception is thrown from index server.
[CARBONDATA-3387] - Support Partition with MV datamap & Show DataMap Status
[CARBONDATA-3392] - Make use of LRU mandatory when using IndexServer
[CARBONDATA-3398] - Implement Show Cache for IndexServer and MV
[CARBONDATA-3399] - Implement Executor ID based task distribution for Index Server
[CARBONDATA-3402] - Block complex data types and validate dmproperties in mv
[CARBONDATA-3408] - CarbonSession partition support binary data type
[CARBONDATA-3409] - Fix Concurrent dataloading Issue with mv
[CARBONDATA-3423] - Validate dictionary for binary data type
[CARBONDATA-3440] - Expose a DDL to add index size and data size to tableStatus for legacy segments
[CARBONDATA-3459] - Fixed id based distribution for show cache command
[CARBONDATA-3462] - Add usage and deployment document for index server

Bug

[CARBONDATA-3247] - Support to select all columns when creating MV datamap
[CARBONDATA-3291] - MV datamap doesn't take effect when the same table join
[CARBONDATA-3294] - MV datamap throw error when using count(1) and case when expression
[CARBONDATA-3295] - MV datamap throw exception because its rewrite algorithm when multiply subquery
[CARBONDATA-3303] - MV datamap return wrong results when using coalesce and less groupby columns
[CARBONDATA-3317] - Executing 'show segments' command throws NPE when spark streaming app write data to new stream segment.
[CARBONDATA-3356] - There are some exception when carbonData DataSource read SDK files with varchar
[CARBONDATA-3364] - Support Read from Hive. Queries are giving empty results from hive.
[CARBONDATA-3367] - OOM when huge number of carbondata files are read from SDK reader
[CARBONDATA-3368] - InferSchema from datafile instead of index file
[CARBONDATA-3380] - Fix missing appName and AnalysisException bug in DirectSQLExample
[CARBONDATA-3382] - Fix compressor type displayed in desc formatted
[CARBONDATA-3384] - Delete/Update is throwing NullPointerException when index server is enabled.
[CARBONDATA-3393] - Merge Index Job Failure should not trigger the merge index job again. Exception propagation should be decided by the User.
[CARBONDATA-3395] - When same split object is passed to concurrent readers, build() fails randomly with Exception.
[CARBONDATA-3396] - Range Compaction Data mismatch
[CARBONDATA-3397] - Remove SparkUnknown Expression to Index Server
[CARBONDATA-3400] - Support IndexServer for Spark-Shell in secure Kerberos mode
[CARBONDATA-3403] - MV is not working for like and filter AND and OR queries
[CARBONDATA-3405] - SDK reader getSplits() must clear the cache.
[CARBONDATA-3406] - Support Binary, Boolean, Varchar, Complex data types read and Dictionary columns read
[CARBONDATA-3407] - distinct, count, Sum query fails when MV is created on single projection column
[CARBONDATA-3416] - When new analyzer rule added in spark, not reflecting in carbon
[CARBONDATA-3417] - Load time degrade for Range column due to cores configured
[CARBONDATA-3418] - Inherit Column Compressor Property from parent table to its child table's
[CARBONDATA-3419] - Desc Formatted not showing Range Column
[CARBONDATA-3424] - There are improper exception when query with avg(substr(binary data type)).
[CARBONDATA-3426] - Fix Load performance degrade by fixing task distribution
[CARBONDATA-3429] - CarbonCli on wrong segment path wrong error message is displayed
[CARBONDATA-3432] - Range Column compaction sending all the splits to all the executors one by one
[CARBONDATA-3433] - MV has issues when create on constant column, duplicate columns and limit queries
[CARBONDATA-3436] - update pre insert into rule as per spark
[CARBONDATA-3437] - Map Implementation not correct
[CARBONDATA-3442] - Fix creating mv datamap with column name having length more than 128
[CARBONDATA-3453] - Set segment doesn't work with adaptive execution
[CARBONDATA-3455] - Job Group ID is not displayed in the IndexServer
[CARBONDATA-3456] - Fix DataLoading on MV when Yarn-Application is killed
[CARBONDATA-3457] - [MV]Fix Column not found with Cast Expression
[CARBONDATA-3458] - Running load, insert, CTAS command on carbon table sets double Execution ID info, and ID of CTAS is null
[CARBONDATA-3460] - EOF exception is thrown when querying using index server
[CARBONDATA-3467] - Fix count(*) with filter on string value
[CARBONDATA-3474] - Fix validate mvQuery having filter expression and correct error message
[CARBONDATA-3476] - Read time and scan time stats shown wrong in executor log for filter query
[CARBONDATA-3477] - Throw out exception when use sql: 'update table select\n...'
[CARBONDATA-3478] - Fix ArrayIndexOutOfBoundsException issue on compaction after alter rename operation
[CARBONDATA-3481] - Multi-thread pruning fails when datamaps count is just near numOfThreadsForPruning
[CARBONDATA-3482] - Null pointer exception when concurrent select queries are executed from different beeline terminals.
[CARBONDATA-3483] - Can not run horizontal compaction when execute update sql
[CARBONDATA-3486] - Serialization/ deserialization issue with Datatype
[CARBONDATA-3490] - Concurrent data load failure with carbondata FileNotFound exception
[CARBONDATA-3493] - Carbon query fails when enable.query.statistics is true in specific scenario.

New Feature

[CARBONDATA-3404] - Support CarbonFile API for configuring custom file systems

Improvement

[CARBONDATA-3309] - MV datamap adapt to spark 2.1 version
[CARBONDATA-3365] - Support Apache arrow vector filling from carbondata SDK
[CARBONDATA-3447] - Index Server Performance Improvement
[CARBONDATA-3488] - Check the file size after move local file to carbon path
