The Apache CarbonData community is pleased to announce the release of version 1.4.1 under The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and more. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3PB of data (more than 5 trillion records) with response times under 3 seconds!
We encourage you to use the release https://dist.apache.org/repos/dist/release/carbondata/1.4.1/ and to share feedback through the CarbonData user mailing lists!
This release note provides information on the new features, improvements, and bug fixes of this release.
What’s New in Version 1.4.1?
In this version of CarbonData, more than 230 JIRA tickets covering new features, improvements, and bug fixes have been resolved. The following is a summary.
Support Cloud Storage (S3)
This feature can be used to store or retrieve data on Amazon cloud, Huawei Cloud (OBS), or any other object store conforming to the S3 API. Storing data in the cloud is advantageous because there are no restrictions on the size of the data and the data can be accessed from anywhere at any time. CarbonData supports any object store that conforms to the Amazon S3 API. For more detail, please refer to the S3 Guide.
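As a minimal sketch, a table backed by S3 storage can be created by pointing its location at an S3 path (the bucket, path, and table schema below are illustrative; S3 credentials are assumed to be configured through the usual Hadoop properties such as fs.s3a.access.key and fs.s3a.secret.key):

```sql
-- Illustrative example: carbon table whose data resides in an S3 bucket
CREATE TABLE IF NOT EXISTS sales_s3 (
  id INT,
  name STRING
)
STORED BY 'carbondata'
LOCATION 's3a://my-bucket/carbon/sales_s3';
```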
Support Flat Folder
This feature allows all carbondata and index files to be kept directly under the table path. This is useful for interoperability with other execution engines and plugins such as Hive or Presto.
Support 32K Characters (Alpha Feature)
In common scenarios, the length of a string is less than 32000 characters. For cases where a string exceeds 32000 characters, CarbonData introduces a table property called LONG_STRING_COLUMNS to handle this scenario. For these columns, CarbonData internally stores the length of the content using an Integer.
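A minimal sketch of using this property at table creation time (table and column names are illustrative):

```sql
-- Illustrative example: mark 'description' as a long string column so
-- values longer than 32000 characters can be stored
CREATE TABLE long_string_table (
  id INT,
  description STRING
)
STORED BY 'carbondata'
TBLPROPERTIES ('LONG_STRING_COLUMNS'='description');
```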
Support Local Dictionary
CarbonData supports local dictionary encoding, which helps achieve better compression: only unique values are stored in the local dictionary and the corresponding data is stored as encoded data, reducing the store size and memory footprint. Filter queries and full scan queries are faster, since filtering is done on encoded data, and IO throughput is higher.
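A minimal sketch of enabling local dictionary via table properties (the table name is illustrative, and the LOCAL_DICTIONARY_ENABLE/LOCAL_DICTIONARY_THRESHOLD property names are assumptions based on this feature's description):

```sql
-- Illustrative example: enable local dictionary encoding for a table
CREATE TABLE local_dict_table (
  id INT,
  city STRING
)
STORED BY 'carbondata'
TBLPROPERTIES ('LOCAL_DICTIONARY_ENABLE'='true',
               'LOCAL_DICTIONARY_THRESHOLD'='10000');
```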
Support Merge Index
CarbonData supports merging all the index files inside a segment into a single CarbonData index merge file. This reduces the time needed to load the indexes into driver memory and thereby significantly reduces the first query response time.
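As a sketch, index files of existing segments can be merged with a compaction-style command (the 'SEGMENT_INDEX' compaction type is an assumption based on this feature's description; the table name is illustrative):

```sql
-- Illustrative example: merge the index files of each segment
ALTER TABLE sales COMPACT 'SEGMENT_INDEX';
```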
Show History Segments
CarbonData introduces a 'SHOW HISTORY SEGMENTS' command to show all segment information, including visible and invisible segments.
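A minimal usage sketch (the FOR TABLE clause and table name are illustrative assumptions following the existing SHOW SEGMENTS syntax):

```sql
-- Illustrative example: list all segments, visible and invisible
SHOW HISTORY SEGMENTS FOR TABLE sales;
```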
Support Custom Compaction
Custom compaction is a new compaction type in addition to MAJOR and MINOR compaction. In custom compaction, you can directly specify the segments to be merged.
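A minimal sketch of specifying segments to merge (the table name and segment IDs are illustrative):

```sql
-- Illustrative example: merge only segments 0, 1 and 2
ALTER TABLE sales COMPACT 'CUSTOM' WHERE SEGMENT.ID IN (0, 1, 2);
```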
Enhancement for Detail Record Analysis
Supports Bloom Filter DataMap
CarbonData introduces BloomFilter as an index datamap to enhance the performance of queries with precise values. It is well suited for queries that do an exact match on high-cardinality columns (such as Name/ID). In a concurrent filter query scenario (on a high-cardinality column), we observed a 3 to 5 times improvement in concurrent queries per second compared to the last version. For more detail, please refer to the BloomFilter DataMap Guide.
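A minimal sketch of creating a BloomFilter datamap on a high-cardinality column (the datamap/table/column names and DMPROPERTIES values are illustrative):

```sql
-- Illustrative example: bloom filter index datamap on a name column
CREATE DATAMAP dm_name ON TABLE sales
USING 'bloomfilter'
DMPROPERTIES ('INDEX_COLUMNS'='name',
              'BLOOM_SIZE'='640000',
              'BLOOM_FPP'='0.00001');
```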
Improved Complex Datatypes
Improved complex datatypes compression and performance through adaptive encoding.
Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12343148
- [CARBONDATA-2504] - Support StreamSQL for streaming job
- [CARBONDATA-2638] - Implement driver min max caching for specified columns and segregate block and blocklet cache
- [CARBONDATA-2202] - Introduce local dictionary encoding for dimensions
- [CARBONDATA-2309] - Add strategy to generate bigger carbondata files in case of small amount of data
- [CARBONDATA-2355] - Support running SQL directly on carbon files generated by SDK
- [CARBONDATA-2389] - Search mode support lucene datamap
- [CARBONDATA-2420] - Support string longer than 32000 characters
- [CARBONDATA-2428] - Support Flat folder structure in carbon.
- [CARBONDATA-2482] - Pass uuid while writing segment file if possible
- [CARBONDATA-2500] - The order is different between write and read data type of schema in SDK
- [CARBONDATA-2519] - Add document for CarbonReader
- [CARBONDATA-2521] - Support create carbonReader without tableName
- [CARBONDATA-2549] - Implement LRU cache in Bloom filter based on Carbon LRU cache interface
- [CARBONDATA-2553] - support ZSTD compression for sort temp file
- [CARBONDATA-2558] - Optimize carbon schema reader interface of SDK
- [CARBONDATA-2569] - Search mode throw exception but test case pass
- [CARBONDATA-2573] - Merge carbonstore branch to master
- [CARBONDATA-2575] - Add document to explain DataMap Management
- [CARBONDATA-2591] - SDK CarbonReader support filter
- [CARBONDATA-2593] - Add an option 'carbon.insert.storage.level' to support configuring the storage level when insert into data with 'carbon.insert.persist.enable'='true'
- [CARBONDATA-2609] - Change RPC implementation to Hadoop RPC
- [CARBONDATA-2630] - Alter table set Table comment is throwing exception in spark-2.2 cluster
- [CARBONDATA-2642] - Introduce configurable Lock Path
- [CARBONDATA-2656] - Presto Stream Readers performance Enhancement
- [CARBONDATA-2659] - Support partitioned carbon table by DataFrame.write
- [CARBONDATA-2686] - Implement left outer join in mv
- [CARBONDATA-2710] - Refactor CarbonSparkSqlParser for better code reuse.
- [CARBONDATA-2720] - Remove dead code from carbonData
- [CARBONDATA-2754] - fix failing UT for HiveMetastore
- [CARBONDATA-2760] - Reduce Memory footprint and store size for local dictionary encoded columns
- [CARBONDATA-2782] - dead code in class 'CarbonCleanFilesCommand'
- [CARBONDATA-2791] - Fix Adaptive Encoding for Double if exceeds LONG.Max_value
- [CARBONDATA-2801] - Add documentation for flat folder
- [CARBONDATA-2807] - Fixed data load performance issue with more number of records
- [CARBONDATA-2815] - Add documentation for memory spill and rebuild datamap
- [CARBONDATA-2836] - Fixed data loading performance issue
- [CARBONDATA-2481] - Adding SDV testcases for SDK Writer
- [CARBONDATA-2789] - Support Hadoop 2.8.3 eco-system integration
- [CARBONDATA-2512] - Support long_string_columns in sdk
- [CARBONDATA-2587] - Support Local dictionary in data loading
- [CARBONDATA-2589] - Support Query on Local dictionary columns
- [CARBONDATA-2590] - Support Query on Local dictionary Complex type column
- [CARBONDATA-2602] - Support Filter Query on Local dictionary generated columns
- [CARBONDATA-2606] - Projection push down for struct data type
- [CARBONDATA-2607] - Provide Adaptive Encoding and Decoding for all data type
- [CARBONDATA-2608] - SDK Support JSON data loading directly without AVRO conversion
- [CARBONDATA-2618] - Split to multiple pages if varchar column page exceeds 2GB/snappy limits
- [CARBONDATA-2624] - Add validations for Create table command for complex dataType columns for Local Dictionary Support
- [CARBONDATA-2633] - Bugs are found when bloomindex column is dictionary/sort/date column
- [CARBONDATA-2634] - Provide more information about the datamap when showing datamaps
- [CARBONDATA-2635] - Support different provider based index datamaps on same column
- [CARBONDATA-2637] - Fix bugs for deferred rebuild for bloomfilter datamap
- [CARBONDATA-2645] - Segregate block and blocklet cache
- [CARBONDATA-2647] - Add support for CACHE_LEVEL in create table and alter table properties
- [CARBONDATA-2648] - Add support for COLUMN_META_CACHE in create table and alter table properties
- [CARBONDATA-2649] - Add code for caching min/max only for specified columns
- [CARBONDATA-2650] - explain query shows negative skipped blocklets for bloomfilter datamap
- [CARBONDATA-2651] - Update IDG for COLUMN_META_CACHE and CACHE_LEVEL properties
- [CARBONDATA-2653] - Fix bugs in incorrect blocklet number in bloomfilter
- [CARBONDATA-2654] - Optimize output for explaining query with datamap
- [CARBONDATA-2655] - Support `in` operator for bloomfilter datamap
- [CARBONDATA-2657] - Loading/Filtering empty value fails on bloom index columns
- [CARBONDATA-2660] - Support filtering on longstring bloom index columns
- [CARBONDATA-2675] - Support config long_string_columns when create datamap
- [CARBONDATA-2681] - Fix loading problem using global/batch sort fails when table has long string columns
- [CARBONDATA-2682] - create table with long_string_columns property
- [CARBONDATA-2683] - Fix data conversion problem for Varchar
- [CARBONDATA-2685] - make datamap rebuild for all segments in parallel
- [CARBONDATA-2687] - update document for bloomfilter
- [CARBONDATA-2689] - Add test cases for alter statement for Local Dictionary Support and add validations for complex data Type columns
- [CARBONDATA-2693] - Fix bug for alter rename renaming the existing table on which a bloomfilter datamap exists
- [CARBONDATA-2694] - show long_string_columns in desc table command
- [CARBONDATA-2700] - Block dropping index columns for index datamap
- [CARBONDATA-2701] - Refactor code to store minimal required info in Block and Blocklet Cache
- [CARBONDATA-2702] - Fix bugs in clear bloom datamap
- [CARBONDATA-2706] - clear bloom index file after segment is deleted
- [CARBONDATA-2708] - clear index file if dataloading is failed
- [CARBONDATA-2714] - Support merge index files for the segment
- [CARBONDATA-2716] - Add validate for datamap writer while loading data
- [CARBONDATA-2719] - Table update/delete needs to be blocked on tables having datamaps
- [CARBONDATA-2723] - Failed to recreate the table which has bloomfilter on it with same table name but different bloom index
- [CARBONDATA-2727] - Support create bloom datamap on newly added column
- [CARBONDATA-2732] - Block create bloomfilter datamap index on column which its datatype is complex type
- [CARBONDATA-2734] - [BUG] support struct of date in create table
- [CARBONDATA-2745] - Add a separate Impl for AtomicFileOperations for s3
- [CARBONDATA-2746] - Fix bug for getting datamap file when table has multiple datamaps
- [CARBONDATA-2757] - Fix bug when building bloomfilter on measure column
- [CARBONDATA-2770] - Optimize code to get blocklet id when rebuilding datamap
- [CARBONDATA-2774] - Exception should be thrown if expression do not satisfy bloomFilter's requirement
- [CARBONDATA-2783] - Update document of bloom filter datamap
- [CARBONDATA-2788] - Fix bugs in incorrect query result with bloom datamap
- [CARBONDATA-2790] - Optimize default parameter for bloomfilter datamap
- [CARBONDATA-2793] - Add document for 32k feature
- [CARBONDATA-2796] - Fix data loading problem when table has complex column and long string column
- [CARBONDATA-2800] - Add useful tips for bloomfilter datamap
- [CARBONDATA-1787] - Carbon 1.3.0- Global Sort: Global_Sort_Partitions parameter doesn't work, if specified in the Tblproperties, while creating the table.
- [CARBONDATA-2339] - Array index out of bounds
- [CARBONDATA-2340] - Loading data exceeding 32000 bytes
- [CARBONDATA-2418] - Presto can't query Carbon table when carbonstore is created at s3
- [CARBONDATA-2478] - Add datamap-developer-guide.md file in readme
- [CARBONDATA-2491] - There are some error when reader twice with SDK carbonReader
- [CARBONDATA-2508] - There are some errors when I running SearchModeExample
- [CARBONDATA-2514] - Duplicate columns in CarbonWriter is throwing NullPointerException
- [CARBONDATA-2515] - Filter OR Expression not working properly in Presto integration
- [CARBONDATA-2516] - Filter Greater-than for timestamp datatype not generating Expression in PrestoFilterUtil
- [CARBONDATA-2528] - MV Datamap - When the MV is created with the order by, then when we execute the corresponding query defined in MV with order by, then the data is not accessed from the MV.
- [CARBONDATA-2529] - S3 Example not working with Hadoop 2.8.3
- [CARBONDATA-2530] - [MV] Wrong data displayed when parent table data are loaded
- [CARBONDATA-2531] - [MV] MV not hit when alias is in use
- [CARBONDATA-2534] - MV Dataset - MV creation is not working with the substring()
- [CARBONDATA-2539] - MV Dataset - Subqueries is not accessing the data from the MV datamap.
- [CARBONDATA-2540] - MV Dataset - Unionall queries are not fetching data from MV dataset.
- [CARBONDATA-2542] - MV creation is failed for other than default database
- [CARBONDATA-2546] - It will throw exception when give same column twice in projection and tries to print it.
- [CARBONDATA-2550] - [MV] Limit is ignored when data fetched from MV, Query rewrite is Wrong
- [CARBONDATA-2557] - Improve Carbon Reader Schema reading performance on S3
- [CARBONDATA-2560] - [MV] Exception in console during MV creation but MV registered successfully
- [CARBONDATA-2568] - [MV] MV datamap is not hit when ,column is in group by but not in projection
- [CARBONDATA-2571] - Calculating the carbonindex and carbondata file size of a table is wrong
- [CARBONDATA-2576] - MV Datamap - MV is not working fine if there is more than 3 aggregate function in the same datamap.
- [CARBONDATA-2604] - getting ArrayIndexOutOfBoundException during compaction after IUD in cluster
- [CARBONDATA-2614] - There are some exception when using FG in search mode and the prune result is none
- [CARBONDATA-2616] - Incorrect explain and query result while using bloomfilter datamap
- [CARBONDATA-2617] - Invalid tuple and block id getting formed for non partition table
- [CARBONDATA-2621] - Lock problem in index datamap
- [CARBONDATA-2623] - Add DataMap Pre and Post event listener
- [CARBONDATA-2626] - Logic for dictionary/nodictionary column pages in TablePage is wrong
- [CARBONDATA-2627] - remove dependency of tech.allegro.schema.json2avro
- [CARBONDATA-2629] - SDK carbon reader don't support filter in HDFS and S3
- [CARBONDATA-2632] - BloomFilter DataMap Bugs and Optimization
- [CARBONDATA-2644] - Validation not present for carbon.load.sortMemory.spill.percentage parameter
- [CARBONDATA-2646] - While loading data into a table with 'SORT_COLUMN_BOUNDS' property, 'ERROR' flag is displayed instead of 'WARN' flag.
- [CARBONDATA-2658] - Fix bug in spilling in-memory pages
- [CARBONDATA-2666] - Rename command should not rename the table directory
- [CARBONDATA-2669] - Local Dictionary Store Size optimisation and other function issues
- [CARBONDATA-2674] - Streaming with merge index enabled does not consider the merge index file while pruning.
- [CARBONDATA-2684] - Code Generator Error is thrown when Select filter contains more than one count of distinct of ComplexColumn with group by Clause
- [CARBONDATA-2703] - Fix bugs in tests
- [CARBONDATA-2704] - Index file size in describe formatted command is not updated correctly with the segment file
- [CARBONDATA-2715] - Failed to run tests for Search Mode With Lucene in Windows env
- [CARBONDATA-2717] - Table id is empty when taking drop lock which causes failure
- [CARBONDATA-2721] - [SDK] [JsonWriter] NPE when schema and data are not of same length or Data is null.
- [CARBONDATA-2722] - [SDK] [JsonWriter] Json writer is writing only first element of an array and discarding the rest of the elements
- [CARBONDATA-2724] - Unsupported create datamap on table with V1 or V2 format data
- [CARBONDATA-2729] - Schema Compatibility problem between version 1.3.0 and 1.4.0
- [CARBONDATA-2735] - Performance issue for complex array data type when number of elements in array is more
- [CARBONDATA-2738] - Block Preaggregate, Dictionary Exclude/Include for child columns for Complex datatype
- [CARBONDATA-2740] - flat folder structure is not handled for implicit column and segment file is not getting deleted after load is failed
- [CARBONDATA-2741] - Exception occurs after alter add few columns and selecting in random order
- [CARBONDATA-2742] - [MV] Wrong data displayed after MV creation.
- [CARBONDATA-2747] - Fix Lucene datamap choosing and DataMapDistributable building
- [CARBONDATA-2749] - In HDFS, empty tablestatus file is written during data load, IUD or compaction when disk is full.
- [CARBONDATA-2751] - Thread leak issue in data loading and Compatibility issue
- [CARBONDATA-2753] - Fix Compatibility issues
- [CARBONDATA-2762] - Long string column displayed as string in describe formatted
- [CARBONDATA-2763] - Create table with partition and no_inverted_index on long_string column is not blocked
- [CARBONDATA-2769] - Fix bug when getting shard name from data before version 1.4
- [CARBONDATA-2777] - NonTransactional tables, Select count(*) is not giving latest results for incremental load with same segment ID (UUID)
- [CARBONDATA-2778] - Empty result in query after IUD delete operation
- [CARBONDATA-2779] - Filter query is failing for store created with V1/V2 format
- [CARBONDATA-2781] - Add fix for Null Pointer Exception when Pre-aggregate create command is killed from UI
- [CARBONDATA-2784] - [SDK writer] Forever blocking wait with more than 20 batch of data, when consumer is dead due to data loading exception
- [CARBONDATA-2792] - Create external table fails post schema restructure
- [CARBONDATA-2795] - Add documentation on the usage of S3 as carbon store
- [CARBONDATA-2798] - Fix Dictionary_Include for ComplexDataType
- [CARBONDATA-2799] - Query failed with bloom datamap on preagg table with dictionary column
- [CARBONDATA-2802] - Creation of Bloomfilter Datamap is failing after IUD, compaction, pre-aggregate datamap creation
- [CARBONDATA-2803] - Data size is wrong in table status & handling of local dictionary for older tables is not proper, null pointer exception is thrown
- [CARBONDATA-2804] - Incorrect error message when bloom filter or preaggregate datamap tried to be created on older V1-V2 version stores
- [CARBONDATA-2805] - Wrong order in custom compaction
- [CARBONDATA-2808] - Insert into select is crashing as both are sharing the same task context
- [CARBONDATA-2812] - Implement freeMemory for complex pages
- [CARBONDATA-2813] - Major compaction on partition table created in 1.3.x store is throwing Unable to get file status error.
- [CARBONDATA-2823] - Alter table set local dictionary include after bloom creation fails throwing incorrect error
- [CARBONDATA-2829] - Fix creating merge index on older V1 V2 store
- [CARBONDATA-2831] - Support Merge index files read from non transactional table.
- [CARBONDATA-2832] - Block loading error for select query executed after merge index command executed on V1/V2 store table
- [CARBONDATA-2834] - Refactor code to remove nested for loop to extract invalidTimestampRange.