The Apache CarbonData community is pleased to announce the release of Version 1.5.0 in The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!
We encourage you to use the release (https://archive.apache.org/dist/carbondata/1.5.0/) and provide feedback through the CarbonData user mailing lists!
This release note provides information on the new features, improvements, and bug fixes of this release.
What’s New in CarbonData Version 1.5.0?
The intention of CarbonData 1.5.0 is to move closer to unified analytics. We want to enable CarbonData files to be read from more engines and libraries to support various use cases. In this regard, we have added support to read CarbonData files from C++ libraries. Additionally, CarbonData files can be read using the Java SDK, the Spark FileFormat interface, Spark, and Presto.
CarbonData also added multiple optimisations to reduce the store size so that queries can take advantage of reduced IO, and several enhancements have been made to CarbonData's streaming support.
In this version of CarbonData, more than 150 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.
Ecosystem Integration
Support Spark 2.3.2 ecosystem integration
CarbonData now supports Spark 2.3.2.
Spark 2.3.2 brings many performance improvements in addition to critical bug fixes, including improvements related to streaming and unification of interfaces. In version 1.5.0, CarbonData integrates with Spark 2.3.2 so that future versions of CarbonData can add enhancements based on Spark's new and improved capabilities.
Support Hadoop 3.1.1 ecosystem integration
CarbonData now supports Hadoop 3.1.1, the latest stable Hadoop version, which brings many new features (erasure coding, federated clusters, etc.).
Lightweight Integration with Spark
CarbonData now supports the Spark FileFormat data source APIs so that CarbonData can be integrated with Spark as an external file source. This integration enables querying CarbonData tables from a SparkSession and helps applications that need standards compliance with respect to interfaces.
Spark data source APIs support file-format-level operations such as read and write. CarbonData's enhanced features, namely IUD, Alter, Compaction, Segment Management, and Streaming, are not available when CarbonData is integrated as a Spark data source through the data source API.
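Below is a minimal sketch of using this integration from a plain SparkSession. It assumes the CarbonData jars are on the classpath and that the data source short name is "carbon"; the path and column expressions are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("carbon-fileformat").getOrCreate()

// Write a DataFrame as CarbonData files (file-format-level write, no segment management).
spark.range(0, 1000)
  .selectExpr("id", "concat('name_', id) as name")
  .write
  .format("carbon")
  .save("/tmp/carbon_fileformat_store")

// Read the files back through the same data source and apply a filter.
val df = spark.read.format("carbon").load("/tmp/carbon_fileformat_store")
df.filter("id < 10").show()
```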
CarbonData Core
Adaptive Encoding for Numeric Columns
CarbonData now supports adaptive encoding for numeric columns. Adaptive encoding stores each value of a column as a delta from the Min/Max value of that column, thereby reducing the number of bits required to store the value. This results in a smaller store size and better query performance due to reduced IO. Adaptive encoding for dictionary columns has been supported since version 1.1.0; it is now supported for all numeric columns.
Performance improvement measurements are not complete in 1.5.0. The results will be published along with the 1.5.1 release.
Configurable Column Size for Generating Min/Max
CarbonData generates a Min/Max index for all columns and uses it for effective pruning of data while querying. Generating Min/Max for wide columns (such as an address column) increases the storage size and memory footprint, thereby reducing query performance. Moreover, filters are usually not applied on such columns, so there is no need to generate the indexes; and where filters on such columns are rare, it is wiser to accept lower query performance in those scenarios than to degrade the overall performance of other filter scenarios due to the increased index size. CarbonData now supports configuring the column width limit (in characters) beyond which Min/Max generation is skipped.
By default, the Min/Max index is generated for all string columns. Users who know their data schema, and know which columns hold long values and will never be filtered on, can exclude such columns; alternatively, they can specify the maximum character length up to which Min/Max is generated, so that CarbonData skips Min/Max index generation when the column length crosses this threshold. By default, string columns longer than 200 bytes are skipped from Min/Max index generation. In Java, each character occupies 2 bytes; hence columns longer than 100 characters are skipped from Min/Max index generation.
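A minimal sketch of tightening this threshold programmatically, using the carbon.minmax.allowed.byte.count parameter listed under New Configuration Parameters below (the chosen value is illustrative):

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Skip Min/Max index generation for string columns wider than 100 bytes
// (roughly 50 characters); the allowed range is 10-1000 bytes.
CarbonProperties.getInstance()
  .addProperty("carbon.minmax.allowed.byte.count", "100")
```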
Support for Map Complex Data Type
CarbonData now supports the map complex data type. Map schemas defined in Avro can be stored into CarbonData tables. Map data types enable efficient lookup of data. With map complex data type support, users can store their Avro data directly, without writing logic to convert it into CarbonData-supported data types.
Support for Byte and Float Data Types
CarbonData now supports the Byte and Float data types, so that these types defined in an Avro schema can be stored into CarbonData tables. Columns of Byte data type can be included in sort columns.
ZSTD Compression
ZSTD compression is now supported for compressing each page of a CarbonData file. ZSTD offers a better compression ratio, thereby reducing the store size; on average, ZSTD compression reduces the store size by 20-30%. ZSTD compression is also supported for the sort temp files written during data loading.
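A minimal sketch of switching the column compressor to ZSTD before loading data. The property key carbon.column.compressor is an assumption; please verify it against the 1.5.0 configuration documentation.

```scala
import org.apache.carbondata.core.util.CarbonProperties

// Assumed property key for the column compressor; subsequent data loads
// would write ZSTD-compressed pages instead of the default compressor.
CarbonProperties.getInstance()
  .addProperty("carbon.column.compressor", "zstd")
```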
CarbonData SDK
SDK Supports C++ Interfaces to read CarbonData files
To enable integration with non-Java execution engines, CarbonData provides a C++ reader for CarbonData files. This reader can be integrated with any execution engine to query data stored in CarbonData tables without a dependency on Spark or Hadoop.
Multi-Thread Safe Writer API in SDK
To improve write performance when using the SDK, CarbonData supports thread-safe writer APIs. This enables applications to write data to a single CarbonData file from multiple threads in parallel. Thread-safe writers help generate bigger CarbonData files, thereby avoiding the small-files problem faced in HDFS.
Streaming
StreamSQL supports Kafka as streaming source
StreamSQL DDL now supports specifying Kafka as the streaming source. With this support, users no longer need to write a custom application to ingest streaming data from Kafka into CarbonData; they can simply specify 'format' as 'kafka' in the CREATE TABLE DDL.
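A minimal sketch of such a source table, run from a CarbonSession. Only 'format'='kafka' is confirmed by this note; the store path, the Kafka connection properties, and the record format shown are illustrative assumptions, so consult the streaming guide for the exact option names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("streamsql-kafka")
  .getOrCreateCarbonSession("/tmp/carbon_store")

// Streaming source table backed by a Kafka topic (option names besides 'format' are assumptions).
carbon.sql(
  """
    |CREATE TABLE kafka_source(
    |  id INT,
    |  name STRING
    |)
    |STORED AS carbondata
    |TBLPROPERTIES(
    |  'streaming'='source',
    |  'format'='kafka',
    |  'kafka.bootstrap.servers'='localhost:9092',
    |  'subscribe'='user_events',
    |  'record_format'='json'
    |)
  """.stripMargin)
```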
StreamSQL supports Json records from Kafka/socket streaming sources
StreamSQL can now accept JSON as the data format in addition to CSV. This means users do not have to write custom applications to ingest streaming data into CarbonData.
Min/Max Index Support for Streaming Segment
CarbonData now supports generating Min/Max indexes for streaming segments so that filter pruning is more efficient and query performance improves. CarbonData serves queries faster thanks to the Min/Max indexes built at various levels; adding Min/Max index support to streaming segments enables CarbonData to serve queries on them with the same performance as on other columnar segments.
Debugging and Maintenance enhancements
Data Summary Tool
CarbonData provides a CLI tool to retrieve statistical information from each CarbonData file. It can list various parameters such as the number of blocklets, pages, encoding types, and Min/Max indexes. This tool is useful for identifying why a block/blocklet was selected during pruning; by looking at the Min/Max indexes, the user can easily decide the blocklet size so as to avoid false positives. The tool also supports scan performance benchmarking, which the user can use to identify the time taken to scan each blocklet for a particular column.
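A minimal sketch of invoking the tool from code. The class name org.apache.carbondata.tool.CarbonCli and the flags shown (-cmd, -p, -a, -c) are assumptions based on the CLI module; verify them against the tool's help output.

```scala
import org.apache.carbondata.tool.CarbonCli

// Print a summary (blocklets, pages, encodings, Min/Max) for all files under the table path.
CarbonCli.main(Array("-cmd", "summary", "-p", "/path/to/carbon/table", "-a"))

// Benchmark scan time per blocklet for a single column (column name is illustrative).
CarbonCli.main(Array("-cmd", "benchmark", "-p", "/path/to/carbon/table", "-c", "name"))
```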
Other Improvements
- Code optimized to avoid unnecessary listing of CarbonData files stored in S3, resulting in improved S3 performance.
- The SDK now supports varchar columns longer than 32K characters.
- The sort_scope can now be specified during CarbonData write operations from the SDK.
- The memory footprint of data loading with local dictionary has been optimised to consume approximately 2x that of data loading with global dictionary; in earlier versions the footprint was about 10x.
- SDK APIs have been simplified to easily accommodate new input types (for example, CSV, JSON, and so on) without much modification to business code.
- Bloom filter quality has been further enhanced by fixing various bugs related to bloom index creation and clean-up. Bloom filter scans for IN expressions have been optimised to scan only once.
- MV datamap quality has been enhanced by fixing numerous bugs related to MV selection logic and by supporting various SQL constructs. Examples have been added to explain MV usage.
- Fixed a compaction bug where subsequent segments were skipped from compaction when the threshold was configured as (X,1).
- The SHOW SEGMENTS command now displays the size of each segment. This helps the user perform maintenance operations such as compaction and backup.
- The SDK has been enhanced to support long_string_columns, the map complex data type, and sort_scope.
Behavioral Changes
Renaming of Table Names
Earlier, renaming a CarbonData table renamed it in the Hive metastore as well as the folder name on HDFS. Now, the table is renamed only in the Hive metastore.
Changed Configuration Default Values
Configuration name | Old Value | New Value |
---|---|---|
bloom_size | 32000 | 640000 |
bloom_fpp | 0.01 | 0.00001 |
carbon.stream.parser | org.apache.carbondata.streaming.parser.CSVStreamParserImp | org.apache.carbondata.streaming.parser.RowStreamParserImp |
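bloom_size and bloom_fpp are DMPROPERTIES of the bloomfilter datamap. A minimal sketch of creating such a datamap with the new defaults written out explicitly, run from a CarbonSession (store path, table, and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.CarbonSession._

val carbon = SparkSession.builder()
  .appName("bloom-datamap")
  .getOrCreateCarbonSession("/tmp/carbon_store")

// Bloom filter datamap on one column, with the 1.5.0 default capacity and
// false-positive probability made explicit.
carbon.sql(
  """
    |CREATE DATAMAP dm_city
    |ON TABLE sales
    |USING 'bloomfilter'
    |DMPROPERTIES(
    |  'INDEX_COLUMNS'='city',
    |  'BLOOM_SIZE'='640000',
    |  'BLOOM_FPP'='0.00001'
    |)
  """.stripMargin)
```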
New Configuration Parameters
Configuration name | Default Value | Range |
---|---|---|
carbon.minmax.allowed.byte.count | 200 bytes (100 characters) | 10-1000 bytes |
carbon.insert.persist.enable | false | NA |
carbon.insert.storage.level | MEMORY_AND_DISK | http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence |
carbon.update.storage.level | MEMORY_AND_DISK | http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence |
carbon.global.sort.rdd.storage.level | MEMORY_ONLY | http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence |
Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12341006
Sub-task
- [CARBONDATA-2512] - Support long_string_columns in sdk
- [CARBONDATA-2633] - Bugs are found when bloomindex column is dictionary/sort/date column
- [CARBONDATA-2634] - Provide more information about the datamap when showing datamaps
- [CARBONDATA-2635] - Support different provider based index datamaps on same column
- [CARBONDATA-2637] - Fix bugs for deferred rebuild for bloomfilter datamap
- [CARBONDATA-2650] - explain query shows negative skipped blocklets for bloomfilter datamap
- [CARBONDATA-2653] - Fix bugs in incorrect blocklet number in bloomfilter
- [CARBONDATA-2654] - Optimize output for explaining query with datamap
- [CARBONDATA-2655] - Support `in` operator for bloomfilter datamap
- [CARBONDATA-2657] - Loading/Filtering empty value fails on bloom index columns
- [CARBONDATA-2660] - Support filtering on longstring bloom index columns
- [CARBONDATA-2675] - Support config long_string_columns when create datamap
- [CARBONDATA-2681] - Fix loading problem using global/batch sort fails when table has long string columns
- [CARBONDATA-2683] - Fix data convertion problem for Varchar
- [CARBONDATA-2685] - make datamap rebuild for all segments in parallel
- [CARBONDATA-2687] - update document for bloomfilter
- [CARBONDATA-2693] - Fix bug for alter rename is renameing the existing table on which bloomfilter datamp exists
- [CARBONDATA-2694] - show long_string_columns in desc table command
- [CARBONDATA-2702] - Fix bugs in clear bloom datamap
- [CARBONDATA-2706] - clear bloom index file after segment is deleted
- [CARBONDATA-2708] - clear index file if dataloading is failed
- [CARBONDATA-2790] - Optimize default parameter for bloomfilter datamap
- [CARBONDATA-2811] - Add query test case using search mode on table with bloom filter
- [CARBONDATA-2835] - Block MV datamap on streaming table
- [CARBONDATA-2844] - SK AK not getting passed to executors for global sort
- [CARBONDATA-2845] - Merge bloom index files of multi-shards for each index column
- [CARBONDATA-2851] - support zstd as column compressor
- [CARBONDATA-2852] - support zstd on legacy store
- [CARBONDATA-2853] - Add min/max index for streaming segment
- [CARBONDATA-2859] - add sdv test case for bloomfilter datamap
- [CARBONDATA-2869] - SDK support for Map DataType
- [CARBONDATA-2894] - Add support for complex map type through spark carbon file format API
- [CARBONDATA-2922] - support long string columns with spark FileFormat and SDK with "long_string_columns" TableProperties
- [CARBONDATA-2935] - Write is_sorted field in file footer
- [CARBONDATA-2942] - Add read and write support for writing min max based on configurable bytes count
- [CARBONDATA-2952] - Provide CarbonReader C++ interface for SDK
- [CARBONDATA-2957] - update document about zstd support in carbondata
Bug
- [CARBONDATA-1787] - Carbon 1.3.0- Global Sort: Global_Sort_Partitions parameter doesn't work, if specified in the Tblproperties, while creating the table.
- [CARBONDATA-2418] - Presto can't query Carbon table when carbonstore is created at s3
- [CARBONDATA-2478] - Add datamap-developer-guide.md file in readme
- [CARBONDATA-2515] - Filter OR Expression not working properly in Presto integration
- [CARBONDATA-2516] - Filter Greater-than for timestamp datatype not generating Expression in PrestoFilterUtil
- [CARBONDATA-2528] - MV Datamap - When the MV is created with the order by, then when we execute the corresponding query defined in MV with order by, then the data is not accessed from the MV.
- [CARBONDATA-2530] - [MV] Wrong data displayed when parent table data are loaded
- [CARBONDATA-2531] - [MV] MV not hit when alias is in use
- [CARBONDATA-2534] - MV Dataset - MV creation is not working with the substring()
- [CARBONDATA-2539] - MV Dataset - Subqueries is not accessing the data from the MV datamap.
- [CARBONDATA-2540] - MV Dataset - Unionall queries are not fetching data from MV dataset.
- [CARBONDATA-2542] - MV creation is failed for other than default database
- [CARBONDATA-2550] - [MV] Limit is ignored when data fetched from MV, Query rewrite is Wrong
- [CARBONDATA-2560] - [MV] Exception in console during MV creation but MV registered successfully
- [CARBONDATA-2568] - [MV] MV datamap is not hit when ,column is in group by but not in projection
- [CARBONDATA-2576] - MV Datamap - MV is not working fine if there is more than 3 aggregate function in the same datamap.
- [CARBONDATA-2610] - DataMap creation fails on null values
- [CARBONDATA-2614] - There are some exception when using FG in search mode and the prune result is none
- [CARBONDATA-2616] - Incorrect explain and query result while using bloomfilter datamap
- [CARBONDATA-2629] - SDK carbon reader don't support filter in HDFS and S3
- [CARBONDATA-2644] - Validation not present for carbon.load.sortMemory.spill.percentage parameter
- [CARBONDATA-2658] - Fix bug in spilling in-memory pages
- [CARBONDATA-2674] - Streaming with merge index enabled does not consider the merge index file while pruning.
- [CARBONDATA-2703] - Fix bugs in tests
- [CARBONDATA-2711] - carbonFileList is not initalized when updatetablelist call
- [CARBONDATA-2715] - Failed to run tests for Search Mode With Lucene in Windows env
- [CARBONDATA-2729] - Schema Compatibility problem between version 1.3.0 and 1.4.0
- [CARBONDATA-2758] - selection on local dictionary fails when column having all null values more than default batch size.
- [CARBONDATA-2769] - Fix bug when getting shard name from data before version 1.4
- [CARBONDATA-2802] - Creation of Bloomfilter Datamap is failing after UID,compaction,pre-aggregate datamap creation
- [CARBONDATA-2823] - Alter table set local dictionary include after bloom creation fails throwing incorrect error
- [CARBONDATA-2854] - Release table status file lock before delete physical files when execute 'clean files' command
- [CARBONDATA-2862] - Fix exception message for datamap rebuild command
- [CARBONDATA-2866] - Should block schema when creating external table
- [CARBONDATA-2874] - Support SDK writer as thread safe api
- [CARBONDATA-2886] - select filter with int datatype is showing incorrect result in case of table created and loaded on old version and queried in new version
- [CARBONDATA-2888] - Support multi level sdk read support for carbon tables
- [CARBONDATA-2901] - Problem: Jvm crash in Load scenario when unsafe memory allocation is failed.
- [CARBONDATA-2902] - Fix showing negative pruning result for explain command
- [CARBONDATA-2908] - the option of sort_scope don't effects while creating table by data frame
- [CARBONDATA-2910] - Support backward compatability in fileformat and support different sort colums per load
- [CARBONDATA-2924] - Fix parsing issue for map as a nested array child and change the error message in sort column validation for SDK
- [CARBONDATA-2925] - Wrong data displayed for spark file format if carbon file has mtuiple blocklet
- [CARBONDATA-2926] - ArrayIndexOutOfBoundException if varchar column is present before dictionary columns along with empty sort_columns.
- [CARBONDATA-2927] - Multiple issue fixes for varchar column and complex columns that grows more than 2MB
- [CARBONDATA-2932] - CarbonReaderExample throw some exception: Projection can't be empty
- [CARBONDATA-2933] - Fix errors in spelling
- [CARBONDATA-2940] - Fix BufferUnderFlowException for ComplexPushDown
- [CARBONDATA-2955] - bug for legacy store and compaction with zstd compressor and adaptiveDeltaIntegralCodec
- [CARBONDATA-2956] - CarbonReader can't support use configuration to read S3 data
- [CARBONDATA-2967] - Select is failing on pre-aggregate datamap when thrift server is restarted.
- [CARBONDATA-2969] - Query on local dictionary column is giving empty data
- [CARBONDATA-2974] - Bloomfilter not working when created bloom on multiple columns and queried
- [CARBONDATA-2975] - DefaultValue choosing and removeNullValues on range filters is incorrect
- [CARBONDATA-2979] - select count fails when carbondata file is written through SDK and read through sparkfileformat for complex datatype map(struct->array->map)
- [CARBONDATA-2980] - clear bloomindex cache when dropping datamap
- [CARBONDATA-2982] - CarbonSchemaReader don't support Array<string>
- [CARBONDATA-2984] - streaming throw NPE when there is no data in the task of a batch
- [CARBONDATA-2986] - Table Properties are lost when multiple driver concurrently creating table
- [CARBONDATA-2990] - JVM crashes when rebuilding the datamap.
- [CARBONDATA-2991] - NegativeArraySizeException during query execution
- [CARBONDATA-2992] - Fixed Between Query Data Mismatch issue for timestamp data type
- [CARBONDATA-2993] - Concurrent data load throwing NPE randomly.
- [CARBONDATA-2994] - Unify property name for badrecords path in create and load.
- [CARBONDATA-2995] - Queries slow down after some time due to broadcast issue
New Feature
- [CARBONDATA-2896] - Adaptive encoding for primitive data types
- [CARBONDATA-2916] - Support CarbonCli tool for data summary
- [CARBONDATA-2919] - StreamSQL support ingest from Kafka
- [CARBONDATA-2945] - Support JSON record in StreamSQL
- [CARBONDATA-2965] - Support scan performance benchmark tool
- [CARBONDATA-2976] - Support dumping column chunk meta in CarbonCli
Improvement
- [CARBONDATA-2309] - Add strategy to generate bigger carbondata files in case of small amount of data
- [CARBONDATA-2428] - Support Flat folder structure in carbon.
- [CARBONDATA-2532] - Carbon to support spark 2.3 version
- [CARBONDATA-2549] - Implement LRU cache in Bloom filter based on Carbon LRU cache interface
- [CARBONDATA-2553] - support ZSTD compression for sort temp file
- [CARBONDATA-2593] - Add an option 'carbon.insert.storage.level' to support configuring the storage level when insert into data with 'carbon.insert.persist.enable'='true'
- [CARBONDATA-2594] - Incorrect logic when set 'Encoding.INVERTED_INDEX' for each dimension column
- [CARBONDATA-2599] - Use RowStreamParserImp as default value of config 'carbon.stream.parser'
- [CARBONDATA-2656] - Presto Stream Readers performance Enhancement
- [CARBONDATA-2686] - Implement left outer join in mv
- [CARBONDATA-2801] - Add documentation for flat folder
- [CARBONDATA-2815] - Add documentation for memory spill and rebuild datamap
- [CARBONDATA-2837] - Add MV Example in examples module
- [CARBONDATA-2857] - Improvement in "How to contribute to Apache CarbonData" page
- [CARBONDATA-2876] - Support Avro datatype conversion to Carbon Format
- [CARBONDATA-2879] - Support Sort Scope for SDK
- [CARBONDATA-2884] - Should rename the methods of ByteUtil class to avoid the misuse
- [CARBONDATA-2899] - Add MV modules to assembly JAR
- [CARBONDATA-2900] - Add dynamic configuration support for some system properties
- [CARBONDATA-2903] - Fix compiler warnings
- [CARBONDATA-2905] - Should allow set stream property on streaming table
- [CARBONDATA-2906] - Show segment data size in SHOW SEGMENT command
- [CARBONDATA-2907] - Support setting blocklet size in table property
- [CARBONDATA-2909] - Support Multiple User reading and writing through SDK.
- [CARBONDATA-2911] - Remove unused BTree related code
- [CARBONDATA-2915] - Updates to CarbonData documentation and structure
- [CARBONDATA-2929] - Add block skipped info for explain command
- [CARBONDATA-2938] - Update comment of blockletId in IndexDataMapRebuildRDD
- [CARBONDATA-2947] - Adaptive encoding support for timestamp no dictionary and Refactor ColumnPageWrapper
- [CARBONDATA-2948] - Support Float and Byte Datatypes for SDK and DataSource
- [CARBONDATA-2961] - Simplify SDK API interfaces
- [CARBONDATA-2963] - Add support to add byte column as a sort column
- [CARBONDATA-2964] - Unsupported Float datatype exception for query with more than 1 page
- [CARBONDATA-2966] - Update Documentation For Avro DataType conversion
- [CARBONDATA-2972] - Debug Logs and a function for type of Adaptive Encoding
- [CARBONDATA-2973] - Add Documentation for complex Columns for Local Dictionary Support
- [CARBONDATA-2983] - Change bloom query model to proceed multiple filter values
- [CARBONDATA-2985] - Fix issues in Table level compaction and TableProperties
- [CARBONDATA-2989] - Upgrade spark integration version to 2.3.2
Task
- [CARBONDATA-2756] - Add BSD license for ZSTD external dendency
- [CARBONDATA-2839] - Add custom compaction example