The Apache CarbonData community is pleased to announce the release of version 1.2.0 under The Apache Software Foundation (ASF). CarbonData is a big-data-native columnar file format for fast interactive queries, using advanced columnar storage, indexing, compression, and encoding techniques to improve computing efficiency. Together, these can speed up queries over petabytes of data by an order of magnitude.
We encourage everyone to download the release from https://archive.apache.org/dist/carbondata/1.2.0/ and to share feedback through the CarbonData user mailing lists!
This release note provides information on the new features, improvements, and bug fixes of this release.
What’s New in Version 1.2.0?
This version of CarbonData adds the following new features to improve performance, compatibility, and usability.
Support Presto Integration
The CarbonData Presto connector enables faster results for interactive queries. It speeds up data exploration to determine the types of records in tables, and performs especially well on queries that join a large fact table with many smaller dimension tables.
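As an illustration, once the connector is configured, such a star-schema query can be issued directly from the Presto CLI. The catalog, schema, table, and column names below are placeholders, not part of this release:

```sql
-- Hypothetical fact/dimension join via the CarbonData Presto connector;
-- all names are illustrative.
SELECT d.region, SUM(f.sales_amount) AS total_sales
FROM carbondata.default.fact_sales f
JOIN carbondata.default.dim_store d ON f.store_id = d.store_id
WHERE f.sale_date >= DATE '2017-01-01'
GROUP BY d.region;
```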
Support Hive Integration
The Hive connector for CarbonData is the best fit for batch-style data processing, large data aggregations, and large fact-to-fact joins.
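A minimal sketch of mapping a CarbonData table into Hive, assuming the CarbonData Hive connector jars are on Hive's classpath; the schema and location are illustrative and the SerDe/format class names follow the CarbonData Hive guide:

```sql
-- Map an existing CarbonData table into Hive (sketch; names are placeholders).
CREATE TABLE hive_carbon (id INT, name STRING, salary DOUBLE)
ROW FORMAT SERDE 'org.apache.carbondata.hive.CarbonHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonInputFormat'
  OUTPUTFORMAT 'org.apache.carbondata.hive.MapredCarbonOutputFormat'
LOCATION 'hdfs:///user/hive/warehouse/carbon.store/default/hive_carbon';

SELECT name, salary FROM hive_carbon WHERE salary > 50000;
```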
Optimized Measure Filter for Improved Performance
Filters on measure columns now support block and blocklet pruning and optimized filter evaluation, improving query performance.
Supports Sort Columns
You can now specify that only the required columns (those used in query filters) be sorted while loading data. This marginally improves loading speed.
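For example, sort columns are declared through the `SORT_COLUMNS` table property at creation time. The schema below is illustrative:

```sql
-- Sort only the columns commonly used in query filters (sketch).
CREATE TABLE IF NOT EXISTS sales (
  order_id BIGINT,
  country STRING,
  city STRING,
  amount DOUBLE
)
STORED BY 'carbondata'
TBLPROPERTIES ('SORT_COLUMNS'='country,city');
```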
Supports Four Types of Sort Scope
The sort scope is now defined only while creating the table and cannot be changed during loading. Four sort scopes are supported: Local Sort, Batch Sort, Global Sort, and No Sort. These scopes allow tuning for different workloads, such as data loading and point queries.
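A sort scope can be chosen via the `SORT_SCOPE` table property at creation time (values `LOCAL_SORT`, `BATCH_SORT`, `GLOBAL_SORT`, `NO_SORT`). The table below is a sketch with an illustrative schema:

```sql
-- Pick the sort scope when creating the table (sketch; schema is illustrative).
CREATE TABLE events (
  event_id BIGINT,
  user_id STRING,
  event_time TIMESTAMP
)
STORED BY 'carbondata'
TBLPROPERTIES ('SORT_COLUMNS'='user_id', 'SORT_SCOPE'='GLOBAL_SORT');
```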
Support Partition
Partitioning helps with data organization, management, and storage. It can also avoid full table scans in some scenarios, improving query performance. Three types of partitioned tables are supported: Hash Partition, Range Partition, and List Partition.
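As a sketch of a hash-partitioned table (schema and partition count are illustrative); note that per the partition design in this release, the partition column is declared in `PARTITIONED BY` rather than in the table schema:

```sql
-- Hash-partitioned CarbonData table (sketch; names and values are illustrative).
CREATE TABLE orders (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (city STRING)
STORED BY 'carbondata'
TBLPROPERTIES ('PARTITION_TYPE'='HASH', 'NUM_PARTITIONS'='9');
```

Range and list partitioning are configured analogously, with `RANGE_INFO` or `LIST_INFO` in place of `NUM_PARTITIONS`.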
Optimized Data Update & Delete for Spark 2.1
Data update and delete operations have been optimized for Spark 2.1, improving performance.
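For reference, CarbonData's update and delete statements follow this shape (table, columns, and predicates below are illustrative):

```sql
-- CarbonData update/delete syntax on Spark 2.1 (sketch; values are illustrative).
UPDATE sales SET (amount) = (amount * 1.1) WHERE country = 'US';
DELETE FROM sales WHERE city = 'unknown';
```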
Support DataMap
Adds a DataMap framework that can be used to build indexes and statistics to accelerate queries. It enables developers to add custom indexes for driver-side pruning.
Please find the detailed JIRA list: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12320220&version=12340260
Sub-task
- [CARBONDATA-807] - Add the basic presto integration code
- [CARBONDATA-813] - Fix pom issues and add the correct dependency jar to build success for integration/presto
- [CARBONDATA-815] - Add basic hive integration code
- [CARBONDATA-816] - Add examples for hive integration under /Examples
- [CARBONDATA-826] - Create carbondata-connector for query carbon data in presto
- [CARBONDATA-899] - Added Support for DecimalType and Timestamp for spark-2.1 for integration/presto
- [CARBONDATA-935] - 1. Define PartitionInfo model
- [CARBONDATA-936] - 2. Create Table with Partition
- [CARBONDATA-937] - 3. Data loading of partition table
- [CARBONDATA-938] - 4. Detail filter query on partition column
- [CARBONDATA-940] - 6. Alter table add/split partition
- [CARBONDATA-946] - TUPLEID implicit column support in spark 2.1
- [CARBONDATA-1008] - Make Hive table schema compatible with spark sql
- [CARBONDATA-1015] - Refactor write step to use ColumnarPage
- [CARBONDATA-1017] - Add interface for column encoding and compression
- [CARBONDATA-1018] - Make ColumnPage use Unsafe
- [CARBONDATA-1074] - Add TablePage for data load process
- [CARBONDATA-1098] - change statistics to use exact type instead of Object
- [CARBONDATA-1117] - Update SET & RESET command details in online help documentation
- [CARBONDATA-1124] - Use Snappy.rawCompression on unsafe data
- [CARBONDATA-1126] - Change carbon data file definition for encoding override
- [CARBONDATA-1146] - V3 format support for delete operation in IUD.
- [CARBONDATA-1158] - Hive integration code optimization
- [CARBONDATA-1163] - Use sortBy operator to load data
- [CARBONDATA-1181] - 9. show partitions
- [CARBONDATA-1209] - 12. Add partitionId in show partition result
- [CARBONDATA-1231] - Add datamap interfaces for pruning and indexing
- [CARBONDATA-1250] - 13. Change default partition id from Max to 0
- [CARBONDATA-1252] - Add BAD_RECORD_PATH option in Load options section in the Carbon Help doc
- [CARBONDATA-1268] - Add encoding selection strategy for columns
- [CARBONDATA-1270] - Documentation - Update the segment deletion syntax in documentation
- [CARBONDATA-1271] - Hive Integration Performance Improvement
- [CARBONDATA-1312] - 14. Fix comparator bug
- [CARBONDATA-1316] - 15. alter table drop partition
- [CARBONDATA-1325] - 16. create guidance documents for partition table
Bug
- [CARBONDATA-649] - Rand() function is not working while updating data
- [CARBONDATA-834] - Describe Table in Presto gives incorrect order of columns
- [CARBONDATA-835] - Null values in carbon table gives a NullPointerException when querying from Presto
- [CARBONDATA-848] - Select count(*) from table gives an exception in Presto
- [CARBONDATA-872] - Fix comment issues of integration/presto for easier reading
- [CARBONDATA-917] - count(*) doesn't work
- [CARBONDATA-950] - selecting table data having a column of "date" type throws exception in hive
- [CARBONDATA-980] - Result does not display when using the not null operator in Presto integration
- [CARBONDATA-982] - Incorrect result displayed when using the not in clause in Presto integration
- [CARBONDATA-989] - decompressing error while load 'gz' and 'bz2' data into table
- [CARBONDATA-1034] - FilterUnsupportedException thrown for select from table where = filter for int column has negative of value larger than int max range
- [CARBONDATA-1049] - avoid logging data into log file
- [CARBONDATA-1050] - int and short measures should not be considered as long.
- [CARBONDATA-1056] - Data_load failure using single_pass true with spark 2.1
- [CARBONDATA-1060] - Query statistics issue in case of multiple blocklet and block
- [CARBONDATA-1061] - If ALL_DICTIONARY_PATH is used in load options then SINGLE_PASS must be used
- [CARBONDATA-1062] - Data load fails if a column specified as sort column is of numeric data type
- [CARBONDATA-1063] - When multi user perform concurrent operations like show segments NullPointerException is getting thrown
- [CARBONDATA-1070] - Not In Filter Expression throwing NullPointer Exception
- [CARBONDATA-1075] - Close Dictionary Server when application ends
- [CARBONDATA-1076] - Join Issue caused by dictionary and shuffle exchange
- [CARBONDATA-1077] - ColumnDict and ALL_DICTIONARY_PATH must be used with SINGLE_PASS='true'
- [CARBONDATA-1078] - Query return incorrect result when selecting complex column before dictionary column in spark 2.1
- [CARBONDATA-1091] - Implicit column tupleId is not returning results if VectorReader is enabled.
- [CARBONDATA-1092] - alter table add column query should support no_inverted_index
- [CARBONDATA-1093] - User data is getting printed in logs if the server fails to respond to client
- [CARBONDATA-1094] - Wrong results returned by the query in case inverted index is not created on a column
- [CARBONDATA-1097] - describe formatted query should display no_inverted_index column
- [CARBONDATA-1104] - Query failure while using unsafe for query execution numeric data type column specified as sort column
- [CARBONDATA-1105] - Remove the fixed spark.version in submodule for supporting new spark version.
- [CARBONDATA-1107] - Multi User load on same table is failing with NullPointerException
- [CARBONDATA-1109] - Page lost in load process when last page is not be consumed at the end
- [CARBONDATA-1111] - Improve No dictionary column Include And Exclude filter
- [CARBONDATA-1113] - Add validation for partition column feature
- [CARBONDATA-1118] - Inset Pushdown in Carbondata.
- [CARBONDATA-1119] - Database drop cascade is not working in Spark 2.1 and alter table not working in vector reader
- [CARBONDATA-1121] - Restrict Sort Column Addition in Alter Table
- [CARBONDATA-1122] - When user specific operations are performed on multiple terminals, some are failing for missing privileges
- [CARBONDATA-1133] - Executor lost failure in case of data load failure due to bad records
- [CARBONDATA-1134] - Generate redundant folders under integration model when run test cases with mvn command in spark1.6
- [CARBONDATA-1138] - Exception is expected if SORT_COLUMNS have duplicate column name
- [CARBONDATA-1144] - Drop column operation failed in Alter table.
- [CARBONDATA-1145] - Single-pass loading not work on partition table
- [CARBONDATA-1149] - Fix issue of mismatch type of partition column when specify partition info and range info overlapping values issue
- [CARBONDATA-1151] - Update useful-tips-on-carbondata.md
- [CARBONDATA-1154] - Driver Side IUD Performance Optimization
- [CARBONDATA-1155] - DataLoad failure for noDictionarySortColumns with 3Lakh data
- [CARBONDATA-1156] - IUD Performance Improvement And Synchronization issue
- [CARBONDATA-1159] - Batch sort loading is not proper without synchronization
- [CARBONDATA-1166] - creating partition on decimal column is failing
- [CARBONDATA-1167] - Mismatched between class name and logger class name
- [CARBONDATA-1170] - Skip single_pass loading during first load
- [CARBONDATA-1172] - Batch load fails randomly
- [CARBONDATA-1177] - Fixed batch sort synchronization issue
- [CARBONDATA-1178] - Data loading of partitioned table is throwing NPE on badrecords
- [CARBONDATA-1179] - Improve the Object Size calculation for Objects added to LRU cache
- [CARBONDATA-1183] - Update CarbonPartitionTable because partition columns should not be specified in the schema
- [CARBONDATA-1187] - Fix Documentation links pointing to wrong urls in useful-tips-on-carbondata and faq
- [CARBONDATA-1189] - Delete with subquery is not working in spark 2.1
- [CARBONDATA-1191] - Remove carbon-spark-shell script
- [CARBONDATA-1194] - Problem in filling/processing multiple implicit columns
- [CARBONDATA-1197] - Update related docs which still use incubating such as presto integration
- [CARBONDATA-1204] - Update operation fail and generate extra records when test with big data
- [CARBONDATA-1207] - Resource leak problem in CarbonDictionaryWriter
- [CARBONDATA-1210] - Exception should be thrown on bad record logger failure to write to log file or csv file.
- [CARBONDATA-1211] - Implicit Column Projection
- [CARBONDATA-1212] - Memory leak in case of compaction when unsafe is true
- [CARBONDATA-1213] - Removed rowCountPercentage check and fixed IUD data load issue
- [CARBONDATA-1217] - Failure in data load when we first load the bad record and then valid record and bad record action is set to Fail
- [CARBONDATA-1221] - DOCUMENTATION - Remove unsupported parameter
- [CARBONDATA-1222] - Residual files created from Update are not deleted after clean operation
- [CARBONDATA-1223] - Fixing empty file creation in batch sort loading
- [CARBONDATA-1242] - Query block distribution is more time before scheduling tasks to executor.
- [CARBONDATA-1245] - NullPointerException invoked by CarbonFile.listFiles() function which returns null
- [CARBONDATA-1246] - NullPointerException in Presto Integration
- [CARBONDATA-1251] - Add test cases for IUD feature
- [CARBONDATA-1257] - Measure Filter Block Pruning and Filter Evaluation Support
- [CARBONDATA-1267] - Failure in data loading due to bugs in delta-integer-codec
- [CARBONDATA-1276] - Owner name of delta files created after update/delete records operation in Beeline is spark2x instead of login user who performed delete operation
- [CARBONDATA-1277] - Dictionary generation failure if there is failure in closing output stream in HDFS
- [CARBONDATA-1279] - Push down for some select queries not working as expected in Spark 2.1
- [CARBONDATA-1280] - Solve HiveExample dependency issues and fix CI with spark 1.6
- [CARBONDATA-1282] - Query with large no of column gives codegeneration issue
- [CARBONDATA-1283] - Carbon should continue with the default value if wrong value is set for the configurable parameter.
- [CARBONDATA-1285] - Compilation error in HiveEmbeededserver on master branch due to changes in pom.xml of hive
- [CARBONDATA-1291] - CarbonData query performance improvement when number of carbon blocks are high
- [CARBONDATA-1305] - On creating the dictionary with large dictionary csv NegativeArraySizeException is thrown
- [CARBONDATA-1306] - Carbondata select query crashes when using big data with more than million rows
- [CARBONDATA-1307] - TableInfo serialization not working in cluster mode
- [CARBONDATA-1317] - Multiple dictionary files being created in single_pass
- [CARBONDATA-1329] - The first carbonindex file needs to be deleted during clean files operation
- [CARBONDATA-1337] - Problem while intermediate merging
- [CARBONDATA-1338] - Spark can not query data when 'spark.carbon.hive.schema.store' is true
- [CARBONDATA-1339] - CarbonTableInputFormat should use serialized TableInfo
- [CARBONDATA-1345] - outdated tablemeta cache cause operation failed in multiple session
- [CARBONDATA-1348] - Sort_Column should not supported for no dictionary column having numeric data-type and measure column
- [CARBONDATA-1350] - When 'SORT_SCOPE'='GLOBAL_SORT', the verification of 'single_pass' must be false is invalid.
- [CARBONDATA-1351] - When 'SORT_SCOPE'='GLOBAL_SORT' and 'enable.unsafe.columnpage'='true', 'ThreadLocalTaskInfo.getCarbonTaskInfo()' return null
- [CARBONDATA-1353] - SDV cluster tests are failing for measure filter feature
- [CARBONDATA-1354] - When 'SORT_SCOPE'='GLOBAL_SORT', 'single_pass' can be 'true'
- [CARBONDATA-1357] - byte[] convert to UTF8String bug
- [CARBONDATA-1358] - Tests are failing in master branch Spark 2.1
- [CARBONDATA-1359] - Unable to use carbondata on hive
- [CARBONDATA-1363] - Add DataMapWriter interface
- [CARBONDATA-1366] - When sort_scope=global_sort, use 'StorageLevel.MEMORY_AND_DISK_SER' instead of 'StorageLevel.MEMORY_AND_DISK' for 'convertRDD' persisting to improve loading performance
- [CARBONDATA-1367] - Fix wrong dependency of carbondata-examples-flink
- [CARBONDATA-1375] - clean hive pom
- [CARBONDATA-1379] - Date range filter with cast not working
- [CARBONDATA-1380] - Tablestatus file is not updated in case of load failure. Insert Overwrite does not work properly
- [CARBONDATA-1386] - Fix findbugs issues in carbondata-core module
- [CARBONDATA-1392] - Fixed bug for fetching the error value of decimal type in presto
- [CARBONDATA-1393] - Throw NullPointerException when UnsafeMemoryManager.freeMemory
- [CARBONDATA-1395] - Fix Findbugs issues in carbondata-hadoop module
- [CARBONDATA-1396] - Fix findbugs issues in carbondata-hive module
- [CARBONDATA-1397] - Fix findbugs issues in carbondata-presto module
- [CARBONDATA-1399] - Enable findbugs to run by default on every build
- [CARBONDATA-1400] - Array column out of bound when writing carbondata file
- [CARBONDATA-1403] - Compaction log is not correct
- [CARBONDATA-1406] - Fix inconsistent usage of QUOTATION MARK " and LEFT DOUBLE QUOTATION MARK “ in installation.md file
- [CARBONDATA-1408] - Data loading with GlobalSort is failing in long run scenario
- [CARBONDATA-1411] - Show Segment command gives Null Pointer Exception after the table is updated
- [CARBONDATA-1412] - delete working incorrectly while using segment.starttime before '<any_date_value>'
- [CARBONDATA-1413] - Incorrect result displays after creating a partition table with incorrect range_info
- [CARBONDATA-1417] - Add Cluster tests for IUD, batch sort and Global sort features
- [CARBONDATA-1420] - Partition Feature doesn't support a Partition Column of Date Type.
- [CARBONDATA-1421] - Auto Compaction Failing in CarbonData Loading
- [CARBONDATA-1422] - Major and Minor Compaction Failing
- [CARBONDATA-1431] - Dictionary_Include working incorrectly for date and timestamp data type.
- [CARBONDATA-1432] - Default.value property is not throwing any exception when specified column name does not matches with column name in the query
- [CARBONDATA-1433] - Presto Integration - Vectorized Reader Implementation
- [CARBONDATA-1435] - Reader is not backward compatible
- [CARBONDATA-1437] - Wrong Exception Message When Number Of Buckets is Specified as zero
- [CARBONDATA-1441] - schema change does not reflect back in hive when schema is alter in carbon
- [CARBONDATA-1443] - Throwing NPE while creating table while running tests
- [CARBONDATA-1445] - if 'carbon.update.persist.enable'='false', it will fail to update data
- [CARBONDATA-1446] - Alter query throws invalid exception while executing on range partitioned table
- [CARBONDATA-1452] - Issue with loading timestamp data beyond cutoff
- [CARBONDATA-1453] - Optimize the cluster test case ID and make it more general
- [CARBONDATA-1456] - Regenerate cached hive results if cluster testcases fail
- [CARBONDATA-1458] - Error in fetching decimal type data loaded with Carbondata 1.1.0 in Carbondata 1.2.0
- [CARBONDATA-1461] - Unable to Read Date And TimeStamp Type in HIve
- [CARBONDATA-1464] - SparkSessionExample is not working
- [CARBONDATA-1465] - Hive unable to query carbondata when column names is in small letters
- [CARBONDATA-1470] - csv data should not show in error log when data column length is greater than 100000 characters
- [CARBONDATA-1471] - Replace BigDecimal to double to improve performance
- [CARBONDATA-1472] - Optimize memory and fix nosort queries
- [CARBONDATA-1477] - Wrong values shown when fetching date type values in hive
- [CARBONDATA-1482] - fix the failing integration test cases of presto
Improvement
- [CARBONDATA-773] - During parallel load multiple instances of DictionaryServer are being created.
- [CARBONDATA-882] - Add SORT_COLUMNS option support in dataframe writer
- [CARBONDATA-888] - Dictionary include / exclude option in dataframe writer
- [CARBONDATA-920] - errors while executing create table examples from docs
- [CARBONDATA-1047] - Add load options to perform batch sort and add more testcases
- [CARBONDATA-1065] - Implement set command in carbon to update carbon properties dynamically
- [CARBONDATA-1073] - Support INPUT_FILES
- [CARBONDATA-1123] - Rename interface and variable for RLE encoding
- [CARBONDATA-1132] - describe formatted query should display SORT_COLUMNS column
- [CARBONDATA-1137] - Documentation for SORT_COLUMNS should be updated in open source doc
- [CARBONDATA-1150] - Update vector reader support in documentation
- [CARBONDATA-1164] - Make Column Group feature deprecated
- [CARBONDATA-1196] - Add 3 Bytes data type support in value compression
- [CARBONDATA-1214] - Change the syntax of the Segment Delete by ID and date as per hive syntax.
- [CARBONDATA-1229] - Restrict Drop table if load is in progress
- [CARBONDATA-1236] - Support absolute path without scheme in loading
- [CARBONDATA-1238] - Decouple the datatype convert from Spark code in core module
- [CARBONDATA-1241] - Single_Pass either should be blocked with Global_Sort
- [CARBONDATA-1244] - Rewrite README.md of presto integration and add/rewrite some comments to presto integration.
- [CARBONDATA-1248] - LazyColumnPage should extend from ColumnPage
- [CARBONDATA-1255] - Remove "COLUMN_GROUPS" feature from documentation
- [CARBONDATA-1259] - CompareTest improvement
- [CARBONDATA-1281] - Disk hotspot found during data loading
- [CARBONDATA-1286] - Change query related RDD to use TableInfo
- [CARBONDATA-1287] - Remove unnecessary MDK generation in loading
- [CARBONDATA-1289] - remove a unused method in CarbonDataLoadingException
- [CARBONDATA-1301] - change command to update schema and data separately
- [CARBONDATA-1308] - Added tableProvider to supply carbonTable wherever needed
- [CARBONDATA-1310] - merge test code for AddColumnTestCases, DropColumnTestCases and ChangeDataTypeTestCases
- [CARBONDATA-1313] - Remove unnecessary statistics
- [CARBONDATA-1323] - Presto Performance Improvement at Integration Layer
- [CARBONDATA-1335] - Duplicated & time-consuming method call found in query
- [CARBONDATA-1346] - Develop framework for SDV tests to run in cluster. And add all existing SDV tests to it
- [CARBONDATA-1356] - Insert overwrite should delete files immediately
- [CARBONDATA-1364] - Add the blocklet info to index file and make the datamap distributable with job
- [CARBONDATA-1372] - Fix some errors and update the examples in documentation
- [CARBONDATA-1373] - Enhance update performance in carbondata
- [CARBONDATA-1418] - Use CarbonTableInputFormat in Presto Integration
- [CARBONDATA-1425] - Inappropriate Exception displays while creating a new partition with incorrect partition type
- [CARBONDATA-1429] - Add a value based compression for decimal data type when decimal is stored as Int or Long
- [CARBONDATA-1434] - Remove useless class para for metastore
- [CARBONDATA-1436] - optimize concurrency control for datamap
- [CARBONDATA-1438] - Unify the sort column and sort scope in create table command
- [CARBONDATA-1442] - Reformat Partition-Guide.md File
- [CARBONDATA-1447] - Add the CNCF link in the CarbonDataWebsite
- [CARBONDATA-1451] - Removing configuration for number_of_rows_per_blocklet_column_page
- [CARBONDATA-1462] - Add an option 'carbon.update.storage.level' to support configuring the storage level when updating data with 'carbon.update.persist.enable'='true'
- [CARBONDATA-1463] - Compare Test should validate result size
- [CARBONDATA-1466] - Presto Integration - Performance Improvement
- [CARBONDATA-1467] - Presto Integration - Performance Improvement
- [CARBONDATA-1468] - Presto Integration - Performance Improvement
- [CARBONDATA-1469] - Presto Integration - Performance Improvement
- [CARBONDATA-1478] - Update compaction documentation
New Feature
- [CARBONDATA-1365] - Add RLE Codec implementation
- [CARBONDATA-1450] - Support timestamp more than 68 years, Enhance NoDictionary Datatypes - int, long
Task
- [CARBONDATA-1095] - Fix rebase issues of presto and hive integration
- [CARBONDATA-1205] - Use Spark 2.1 as default from 1.2.0 onwards
- [CARBONDATA-1274] - add update and delete examples
- [CARBONDATA-1327] - write sort columns example
- [CARBONDATA-1423] - Add Integration Test Cases For Presto
Test
- [CARBONDATA-1361] - Reduce the sdv cluster test time
- [CARBONDATA-1368] - HDFS lock issue in SDV cluster