The Apache CarbonData community is pleased to announce the release of version 1.5.2 under The Apache Software Foundation (ASF).

CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments, it supports queries on a single table holding 3 PB of data (more than 5 trillion records) with response times under 3 seconds!

We encourage you to use the release (https://archive.apache.org/dist/carbondata/1.5.2/) and to share feedback through the CarbonData user mailing lists!

This release note provides information on the new features, improvements, and bug fixes of this release.

What’s New in CarbonData Version 1.5.2?

The intention of CarbonData 1.5.2 was to move closer to unified analytics. We want CarbonData files to be readable from more engines and libraries to support a wider range of use cases. To that end, this release enhances and stabilizes the Presto integration and adds the features and improvements described below.

In this version of CarbonData, more than 68 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.

CarbonData Core

Support Compaction for No-sort Load Segments

During data loading, if the sort scope is set to No-sort, loading performance increases significantly because the data is not sorted and is written as it is received. However, no-sort loading degrades query performance because indexes are not built on these segments. Compacting no-sort loaded segments converts them into sorted segments and generates the indexes, which improves query performance. This feature is best suited to cases where high-speed data loading matters more than query performance during the period before compaction runs.
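The trade-off described above can be illustrated with a toy model. This is only a sketch in plain Python, not CarbonData's implementation: segments are modelled as lists of integers, and the function names (`load_no_sort`, `compact`, `contains_scan`, `contains_sorted`) are hypothetical.

```python
from bisect import bisect_left

def load_no_sort(rows):
    """Toy NO_SORT load: keep rows in arrival order (cheap write, no ordering)."""
    return list(rows)

def compact(segments):
    """Toy compaction: merge several segments into one sorted segment."""
    merged = [row for segment in segments for row in segment]
    merged.sort()
    return merged

def contains_scan(segment, value):
    """Unsorted segment: every row may have to be examined."""
    return any(row == value for row in segment)

def contains_sorted(segment, value):
    """Sorted segment: binary search locates the value directly."""
    i = bisect_left(segment, value)
    return i < len(segment) and segment[i] == value

seg_a = load_no_sort([42, 7, 19])
seg_b = load_no_sort([3, 88, 7])
compacted = compact([seg_a, seg_b])
print(compacted)  # [3, 7, 7, 19, 42, 88]
```

Before compaction, a lookup has to fall back to `contains_scan` on each segment; afterwards, `contains_sorted` can exploit the ordering, which is the kind of benefit the sorted, indexed segments provide.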

Support Rename of Column Names

Columns can be renamed to reflect the business scenario or naming conventions.

Support GZIP Compressor for CarbonData Files

GZIP compression is supported for compressing each page of a CarbonData file. GZIP offers a better compression ratio, thereby reducing the store size: on average, it reduces store size by 20-30% compared to Snappy compression. GZIP is also supported for compressing the sort temp files written during data loading. Because GZIP can be hardware-accelerated, data loading performance increases on machines where GZIP is natively supported in hardware.
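As a rough illustration of page-level compression, the sketch below compresses a column's bytes one page at a time with Python's standard `gzip` module. It is not CarbonData code: the page size, the sample data, and the helper names are assumptions made for the example, and it makes no comparison with Snappy.

```python
import gzip

PAGE_SIZE = 32_000  # hypothetical page size in bytes, chosen for illustration

def compress_pages(data, page_size=PAGE_SIZE):
    """Compress each fixed-size page independently, so pages can be read alone."""
    return [gzip.compress(data[i:i + page_size])
            for i in range(0, len(data), page_size)]

def decompress_pages(pages):
    """Decompress every page and concatenate the results."""
    return b"".join(gzip.decompress(page) for page in pages)

# Repetitive, low-cardinality column data compresses well.
column = b"".join(f"city_{i % 50},".encode() for i in range(20_000))

pages = compress_pages(column)
restored = decompress_pages(pages)
compressed_size = sum(len(page) for page in pages)
print(len(pages), len(column), compressed_size)
```

Compressing per page rather than per file keeps each page independently decodable, at the cost of a slightly lower ratio than compressing the whole column at once.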

Performance Improvements

Support Range Partitioned Sort during data load

Global Sort during data loading ensures that the data is entirely sorted, so that all rows with the same value are grouped on a particular node/machine. This helps optimise Spark scan performance and also increases concurrency. The drawback of Global Sort is that it is very slow, because the data has to be sorted globally (a heavy shuffle). Local Sort, on the other hand, partitions the data across multiple nodes/machines and ensures only that the data local to each node/machine is sorted. This improves data loading performance, but query performance degrades somewhat because more Spark tasks have to be launched to scan the data. Range Sort splits the data based on value ranges and loads each range using local sort. This gives balanced performance for both load and query.
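The idea behind range sort can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not CarbonData's partitioner: the boundaries are given up front (a real system would have to derive them from the data), and `range_partition` and `range_sort` are hypothetical names.

```python
from bisect import bisect_right

def range_partition(rows, boundaries):
    """Assign each row to a partition by value range; boundaries must be sorted.

    With boundaries [30, 60]: partition 0 holds values < 30,
    partition 1 holds 30 <= value < 60, partition 2 holds values >= 60.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, row)].append(row)
    return parts

def range_sort(rows, boundaries):
    """Partition by range, then sort each partition locally (no global shuffle-sort)."""
    return [sorted(part) for part in range_partition(rows, boundaries)]

rows = [57, 3, 91, 24, 68, 12, 45, 80]
parts = range_sort(rows, boundaries=[30, 60])
print(parts)  # [[3, 12, 24], [45, 57], [68, 80, 91]]
```

Each partition is sorted independently, yet the value ranges do not overlap, so a filter on the range column only needs to touch the partitions whose range can contain the value.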

Other Improvements

Presto Enhancements

CarbonData implemented features to better integrate with Presto. Now Presto can recognise CarbonData as a native format. Many bugs were fixed to enhance the stability.

Support Map Data Type through DDL

Version 1.5.0 supported adding the Map data type through the CarbonData SDK. This version also supports adding the Map data type through DDL.

Behaviour Change

  1. If the user doesn’t specify sort columns during table creation, the default sort scope is set to no-sort during data loading
  2. The default complex value delimiters are changed from '$' and ':' to '\001' and '\002' respectively
  3. Inverted index generation is disabled by default
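To see what the delimiter change in item 2 means in practice, the sketch below encodes and decodes a nested value with both delimiter pairs. It is an illustration in plain Python, not CarbonData's parser: the `encode` and `decode` helpers and the sample data are hypothetical, and the example assumes the first delimiter separates top-level elements while the second separates fields inside each element. One likely benefit of control characters is that, unlike '$' and ':', they rarely occur in real data.

```python
OLD_LEVEL_1, OLD_LEVEL_2 = "$", ":"        # previous defaults
NEW_LEVEL_1, NEW_LEVEL_2 = "\001", "\002"  # defaults after this release

def encode(rows, level_1, level_2):
    """Serialise a list of structs (lists of strings) into one delimited field."""
    return level_1.join(level_2.join(fields) for fields in rows)

def decode(raw, level_1, level_2):
    """Split on the level-1 delimiter first, then on level-2 inside each element."""
    return [element.split(level_2) for element in raw.split(level_1)]

# Sample values that happen to contain the old delimiter characters.
items = [["coffee", "US$3"], ["tea", "10:30"]]

old_raw = encode(items, OLD_LEVEL_1, OLD_LEVEL_2)
new_raw = encode(items, NEW_LEVEL_1, NEW_LEVEL_2)

print(decode(old_raw, OLD_LEVEL_1, OLD_LEVEL_2) == items)  # False: data collides with delimiters
print(decode(new_raw, NEW_LEVEL_1, NEW_LEVEL_2) == items)  # True
```

Input files that were prepared with '$' and ':' as delimiters would presumably need the delimiters specified explicitly when loading, since they no longer match the defaults.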

New Configuration Parameters

Configuration name                  Default Value   Range
carbon.table.load.sort.scope        LOCAL_SORT      LOCAL_SORT, NO_SORT, GLOBAL_SORT, BATCH_SORT
carbon.range.column.scale.factor    3               1-300


Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344321

Sub-task

Bug

  • [CARBONDATA-3080] - Supporting local dictionary enable by default for SDK
  • [CARBONDATA-3102] - There are some error when use thriftServer and beeline
  • [CARBONDATA-3116] - set carbon.query.directQueryOnDataMap.enabled=true not working
  • [CARBONDATA-3127] - Hive module test case has been commented off,can' t run.
  • [CARBONDATA-3147] - Preaggregate dataload fails in case of concurrent load in some cases
  • [CARBONDATA-3153] - Change of Complex Delimiters
  • [CARBONDATA-3154] - Fix spark-2.1 test error
  • [CARBONDATA-3159] - Issue with SDK Write when empty array is given
  • [CARBONDATA-3162] - Range filters doesn't remove null values for no_sort direct dictionary dimension columns.
  • [CARBONDATA-3165] - Query of BloomFilter java.lang.NullPointerException
  • [CARBONDATA-3174] - Fix trailing space issue with varchar column for SDK
  • [CARBONDATA-3181] - IllegalAccessError for BloomFilter.bits when bloom_compress is false
  • [CARBONDATA-3184] - Fix DataLoad failure with "using carbondata"
  • [CARBONDATA-3188] - Create carbon table as hive understandable metastore table needed by Presto and Hive
  • [CARBONDATA-3196] - Compaction Failing for Complex datatypes with Dictionary Include
  • [CARBONDATA-3203] - Compaction failing for table which is retstructured
  • [CARBONDATA-3205] - Fix Get Local Dictionary for empty Array of Struct
  • [CARBONDATA-3212] - Select * is failing with java.lang.NegativeArraySizeException in SDK flow
  • [CARBONDATA-3216] - There are some bugs in CSDK
  • [CARBONDATA-3221] - SDK don't support read multiple file from S3
  • [CARBONDATA-3222] - Fix dataload failure after creation of preaggregate datamap on main table with long_string_columns
  • [CARBONDATA-3224] - SDK should validate the improper value
  • [CARBONDATA-3233] - JVM is getting crashed during dataload while compressing in snappy
  • [CARBONDATA-3238] - Throw StackOverflowError exception using MV datamap
  • [CARBONDATA-3239] - Throwing ArrayIndexOutOfBoundsException in DataSkewRangePartitioner
  • [CARBONDATA-3243] - CarbonTable.getSortScope() is not considering session property CARBON.TABLE.LOAD.SORT.SCOPE
  • [CARBONDATA-3246] - SDK reader fails if vectorReader is false for concurrent read scenario and batch size is zero.
  • [CARBONDATA-3260] - Broadcast join is not properly in carbon with spark-2.3.2
  • [CARBONDATA-3262] - Failure to write merge index file results in merged segment being deleted when cleanup happens
  • [CARBONDATA-3265] - Memory Leak and Low Query Performance Issues in Range Partition
  • [CARBONDATA-3267] - Data loading is failing with OOM using range sort
  • [CARBONDATA-3268] - Query on Varchar showing as Null in Presto
  • [CARBONDATA-3269] - Range_column throwing ArrayIndexOutOfBoundsException when using KryoSerializer
  • [CARBONDATA-3273] - For table without SORT_COLUMNS, Loading data is showing SORT_SCOPE=LOCAL_SORT instead of NO_SORT
  • [CARBONDATA-3275] - There are 4 errors in CI after PR 3094 merged
  • [CARBONDATA-3282] - presto carbon doesn't work with Hadoop conf in cluster.

New Feature

Improvement

Test
