The Apache CarbonData community is pleased to announce the release of Version 1.5.1 in The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest deployments it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times under 3 seconds!
We encourage you to use the release (https://archive.apache.org/dist/carbondata/1.5.1/) and provide feedback through the CarbonData user mailing lists!
These release notes provide information on the new features, improvements, and bug fixes in this release.
What’s New in CarbonData Version 1.5.1?
CarbonData 1.5.1 aims to move closer to unified analytics. We want to enable CarbonData files to be read from more engines and libraries to support various use cases. In this regard, we have added support for writing CarbonData files from C++ libraries.
CarbonData added multiple optimizations to improve query and compaction performance.
In this version of CarbonData, more than 78 JIRA tickets related to new features, improvements, and bugs have been resolved. The following is a summary.
CarbonData Core
Support Custom Column Compressor
CarbonData now supports custom column compressors so that users can plug in their own compressor implementations. To use a custom compressor, specify its fully qualified class name in the table properties while creating the table, or set it through the carbon property, as sketched below.
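A minimal sketch of both routes from Java, assuming a hypothetical compressor class com.example.MyCompressor that implements CarbonData's compressor interface; the table name and schema are illustrative:

```java
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.spark.sql.SparkSession;

public class CustomCompressorExample {
  public static void configure(SparkSession spark) {
    // Option 1: set the compressor globally through the carbon property.
    // "com.example.MyCompressor" is a hypothetical implementation class.
    CarbonProperties.getInstance()
        .addProperty("carbon.column.compressor", "com.example.MyCompressor");

    // Option 2: set it per table at creation time via a table property.
    spark.sql("CREATE TABLE t1 (id INT, name STRING) STORED AS carbondata "
        + "TBLPROPERTIES('carbon.column.compressor'='com.example.MyCompressor')");
  }
}
```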
Performance Improvements
Optimized CarbonData Scan Performance
CarbonData scan performance is improved by avoiding multiple data copies in the vector flow. This is achieved by short-circuiting the read and vector filling: data is filled directly into the vector after it is read from the file, without any intermediate copies.
Row-level filter processing is now handled by the execution engine; only blocklet and page pruning is handled in CarbonData for the vector flow. This is controlled by the property carbon.push.rowfilters.for.vector, which defaults to false (see the sketch below).
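A minimal sketch of restoring the previous behavior by setting the property before queries run; this assumes the property is toggled through CarbonProperties:

```java
import org.apache.carbondata.core.util.CarbonProperties;

public class RowFilterPushdownExample {
  public static void main(String[] args) {
    // Default in 1.5.1 is "false": CarbonData performs only blocklet and
    // page pruning, and row-level filtering is left to the execution
    // engine. Setting "true" pushes row-level filters back into CarbonData.
    CarbonProperties.getInstance()
        .addProperty("carbon.push.rowfilters.for.vector", "true");
  }
}
```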
Optimized Compaction Performance
Compaction performance is optimized by prefetching data while reading CarbonData files.
Improved Blocklet DataMap Pruning in Driver
Blocklet DataMap pruning is improved by using multi-threaded processing in the driver, controlled by the property carbon.max.driver.threads.for.block.pruning (see the configuration table below).
CarbonData SDK
SDK Supports C++ Interfaces for Writing CarbonData files
To enable integration with non-Java execution engines, CarbonData provides a C++ JNI wrapper for writing CarbonData files. It can be integrated with any execution engine to write data to CarbonData files without a dependency on Spark or Hadoop.
Multi-Thread Read API in SDK
To improve read performance when using the SDK, CarbonData supports multi-threaded read APIs, which enable applications to read data from multiple CarbonData files in parallel and significantly improve SDK read performance (see the sketch below).
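A minimal sketch, assuming the CarbonReader.split() API added for concurrent reading in this release; the path, table name, and split count of 4 are illustrative:

```java
import java.util.List;
import org.apache.carbondata.sdk.file.CarbonReader;

public class ParallelReadExample {
  public static void main(String[] args) throws Exception {
    // Build one reader over the files in a folder, then split it into up
    // to 4 readers, each covering a disjoint subset of the files.
    CarbonReader reader =
        CarbonReader.builder("/path/to/carbon/files", "_temp").build();
    List<CarbonReader> readers = reader.split(4);

    for (CarbonReader r : readers) {
      new Thread(() -> {
        try {
          while (r.hasNext()) {
            Object[] row = (Object[]) r.readNextRow();
            // process the row here ...
          }
          r.close();
        } catch (Exception e) {
          e.printStackTrace();
        }
      }).start();
    }
  }
}
```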
Other Improvements
- Enhanced the CLI with more options.
- Added a fallback mechanism: when off-heap memory is insufficient, switch to on-heap memory instead of failing the job.
- Added support for a separate audit log.
- Supported batch row reads in the CSDK to improve performance.
Behavior Change
- Local dictionary is now enabled by default.
- Inverted index is now disabled by default.
- Sort temp files generated during data loading are now compressed with Snappy by default to improve I/O (a sketch for reverting these defaults follows this list).
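A sketch of reverting these defaults where needed, assuming the documented property and table-property names; the table name and columns are illustrative:

```java
import org.apache.carbondata.core.util.CarbonProperties;
import org.apache.spark.sql.SparkSession;

public class RestoreDefaultsExample {
  public static void restore(SparkSession spark) {
    // Revert sort temp file compression (an empty value means uncompressed).
    CarbonProperties.getInstance()
        .addProperty("carbon.sort.temp.compressor", "");

    // Local dictionary and inverted index are table-level settings, e.g.:
    spark.sql("CREATE TABLE t1 (id INT, name STRING) STORED AS carbondata "
        + "TBLPROPERTIES('local_dictionary_enable'='false', "
        + "'sort_columns'='name', 'inverted_index'='name')");
  }
}
```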
New Configuration Parameters
| Configuration Name | Default Value | Range |
| --- | --- | --- |
| carbon.push.rowfilters.for.vector | false | true, false |
| carbon.max.driver.threads.for.block.pruning | 4 | 1-4 |
Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12344320
Sub-task
- [CARBONDATA-2930] - Support customize column compressor
- [CARBONDATA-2981] - Support read primitive data type in CSDK
- [CARBONDATA-2997] - Support read schema from index file and data file in CSDK
- [CARBONDATA-3000] - Provide C++ interface for writing carbon data
- [CARBONDATA-3003] - Support read batch row in CSDK
- [CARBONDATA-3004] - Fix bug in writing dataframe to carbon table while the field order is different
- [CARBONDATA-3038] - Add annotation for carbon properties and mark whether is dynamic configuration
- [CARBONDATA-3044] - Handle exception in CSDK
- [CARBONDATA-3056] - Implement concurrent reading through CarbonReader
- [CARBONDATA-3057] - Implement Vectorized CarbonReader for SDK
- [CARBONDATA-3063] - Support set carbon property in CSDK
- [CARBONDATA-3095] - Optimize the documentation of SDK/CSDK
- [CARBONDATA-3131] - Update the requested columns to the Scan
Bug
- [CARBONDATA-2996] - readSchemaInIndexFile can't read schema by folder path
- [CARBONDATA-2998] - Refresh column schema for old store (before V3) for SORT_COLUMNS option
- [CARBONDATA-3002] - Fix some spell error and remove the data after test case finished running
- [CARBONDATA-3007] - Fix error in document
- [CARBONDATA-3025] - Add SQL support for CLI, enhance CLI, and add more metadata to carbon file
- [CARBONDATA-3026] - clear expired property that may cause GC problem
- [CARBONDATA-3029] - Failed to run spark data source test cases in windows env
- [CARBONDATA-3036] - Carbon 1.5.0 B010 - Select query fails when min/max exceeds and index tree cached
- [CARBONDATA-3040] - Fix bug for merging bloom index
- [CARBONDATA-3058] - Fix some exception coding in data loading
- [CARBONDATA-3060] - Improve CLI and fix other bugs in CLI tool
- [CARBONDATA-3062] - Fix Compatibility issue with cache_level as blocklet
- [CARBONDATA-3065] - By default disable inverted index for all the dimension columns
- [CARBONDATA-3066] - ADD documentation for new APIs in SDK
- [CARBONDATA-3069] - fix bugs in setting cores for compaction
- [CARBONDATA-3077] - Fixed query failure in fileformat due to stale cache issue
- [CARBONDATA-3078] - Exception caused by explain command for count star query without filter
- [CARBONDATA-3081] - NPE when boolean column has null values with Vectorized SDK reader
- [CARBONDATA-3083] - Null values are getting replaced by 0 after update operation.
- [CARBONDATA-3084] - Data load with float datatype fails with internal error
- [CARBONDATA-3098] - Negative value exponents giving wrong results
- [CARBONDATA-3106] - Written_BY_APPNAME is not serialized in executor with GlobalSort
- [CARBONDATA-3117] - Rearrange the projection list in the Scan
- [CARBONDATA-3120] - apache-carbondata-1.5.1-rc1.tar.gz Datamap's core and plan project, pom.xml, is version 1.5.0, which results in an inability to compile properly
- [CARBONDATA-3122] - CarbonReader memory leak
- [CARBONDATA-3123] - JVM crash when reading through CarbonReader
- [CARBONDATA-3124] - Updated log message in Unsafe Memory Manager and changed faq.md accordingly.
- [CARBONDATA-3132] - Unequal distribution of tasks in case of compaction
- [CARBONDATA-3134] - Wrong result when a column is dropped and added using alter with blocklet cache.
New Feature
- [CARBONDATA-2977] - Write uncompress_size to ChunkCompressMeta in the file
Improvement
- [CARBONDATA-3008] - make yarn-local and multiple dir for temp data enable by default
- [CARBONDATA-3009] - Optimize the entry point of code for MergeIndex
- [CARBONDATA-3019] - Add error log in catch block to avoid to abort the exception which is thrown from catch block when there is an exception thrown in finally block
- [CARBONDATA-3022] - Refactor ColumnPageWrapper
- [CARBONDATA-3024] - Use Log4j directly
- [CARBONDATA-3030] - Remove no use parameter in test case
- [CARBONDATA-3031] - Find wrong description in the document for 'carbon.number.of.cores.while.loading'
- [CARBONDATA-3032] - Remove carbon.blocklet.size from properties template
- [CARBONDATA-3034] - Combining CarbonCommonConstants
- [CARBONDATA-3035] - Optimize parameters for unsafe working and sort memory
- [CARBONDATA-3039] - Fix Custom Deterministic Expression for rand() UDF
- [CARBONDATA-3041] - Optimize load minimum size strategy for data loading
- [CARBONDATA-3042] - Column Schema objects are present in Driver and Executor even after dropping table
- [CARBONDATA-3046] - remove outdated configurations in template properties
- [CARBONDATA-3047] - UnsafeMemoryManager fallback mechanism in case of memory not available
- [CARBONDATA-3048] - Added Lazy Loading For 2.2/2.1
- [CARBONDATA-3050] - Remove unused parameter doc
- [CARBONDATA-3051] - unclosed streams cause tests failure in windows env
- [CARBONDATA-3052] - Improve drop table performance by reducing the namenode RPC calls during physical deletion of files
- [CARBONDATA-3053] - Un-closed file stream found in cli
- [CARBONDATA-3054] - Dictionary file cannot be read in S3a with CarbonDictionaryDecoder.doConsume() codeGen
- [CARBONDATA-3061] - Add validation for supported format version and Encoding type to throw proper exception to the user while reading a file
- [CARBONDATA-3064] - Support separate audit log
- [CARBONDATA-3067] - Add check for debug to avoid string concat
- [CARBONDATA-3071] - Add CarbonSession Java Example
- [CARBONDATA-3074] - Change default sort temp compressor to SNAPPY
- [CARBONDATA-3075] - Select Filter fails for Legacy store if DirectVectorFill is enabled
- [CARBONDATA-3087] - Prettify DESC FORMATTED output
- [CARBONDATA-3088] - enhance compaction performance by using prefetch
- [CARBONDATA-3104] - Extra Unnecessary Hadoop Conf is getting stored in LRU (~100K) for each LRU entry
- [CARBONDATA-3112] - Optimise decompressing while filling the vector during conversion of primitive types
- [CARBONDATA-3113] - Fixed Local Dictionary Query Performance and Added reusable buffer for direct flow
- [CARBONDATA-3118] - Parallelize block pruning of default datamap in driver for filter query processing
- [CARBONDATA-3121] - CarbonReader build time is huge
- [CARBONDATA-3136] - JVM crash with preaggregate datamap