The Apache CarbonData community is pleased to announce the release of Version 2.1.0 in The Apache Software Foundation (ASF).
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL queries, fast filter lookups on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments. In one of the largest scenarios, it supports queries on a single table with 3 PB of data (more than 5 trillion records) with a response time of less than 3 seconds!
We encourage you to try the release at https://dist.apache.org/repos/dist/release/CarbonData/2.1.0/ and to share feedback through the CarbonData user mailing lists!
This release note provides information on the new features, improvements, and bug fixes of this release.
What’s New in CarbonData Version 2.1.0?
In CarbonData 2.1.0, 134 JIRA tickets related to improvements and bugs have been resolved. The following is a summary of the important features included in this release.
Transactional write support using Presto
CarbonData now supports writing in transactional mode from Presto servers. This is a positive step for the Presto integration, because the tables can now be read from the Spark and Hive engines without the need to recreate them.
Presto local dictionary and reading for complex types
CarbonData now supports local dictionary on complex types and reading of complex types (array and struct only). For now, only single-level array and struct types are supported for reading.
Make GeoID visible to the user
The generated geohash column is now included in the schema. Alter commands, indexes, MV, and other table properties are not supported on this column.
Support loading data from Parquet, ORC, CSV, Avro and JSON using CarbonData SDK
CarbonData now supports loading data from the Parquet, ORC, CSV, Avro and JSON formats directly into the Carbon format. This enables users to migrate data directly from these formats to Carbon.
Support delete and update from CarbonData SDK
Updating and deleting rows is now supported from the CarbonData SDK.
Support array<string> complex type with Secondary Index
A Secondary Index can now be created on an array<string> column to accelerate queries that have an array_contains filter. Data for the array column is stored in a flattened format in the Secondary Index.
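To illustrate why a flattened layout helps an array_contains filter, here is a minimal conceptual sketch in Python. It is not CarbonData code and does not reflect the actual on-disk format: the table contents, the function names and the in-memory dictionary are all assumptions made for illustration. The idea shown is only that storing one index entry per array element lets a lookup find matching rows without scanning every array.

```python
from collections import defaultdict

# Hypothetical main-table rows: (row_id, tags), where "tags" stands in
# for an array<string> column. The data is invented for illustration.
main_table = [
    (0, ["red", "blue"]),
    (1, ["green"]),
    (2, ["blue", "green"]),
]


def build_flattened_index(rows):
    """Create one index entry per array element: value -> set of row ids."""
    index = defaultdict(set)
    for row_id, values in rows:
        for value in values:
            index[value].add(row_id)
    return index


def array_contains_lookup(index, value):
    """Answer an array_contains(column, value) filter from the index alone."""
    # .get() avoids creating an empty entry for values that are absent.
    return sorted(index.get(value, set()))


index = build_flattened_index(main_table)
print(array_contains_lookup(index, "blue"))  # [0, 2]
```

In this sketch the lookup touches only the entries for the requested value, which is the effect the flattened Secondary Index is meant to achieve for array_contains queries.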
Support IndexServer with Presto Engine
Index caching for the Presto engine is improved by using the index server. The indexes for the table being scanned can now be cached in the index server, reducing the memory footprint of the Presto server.
Support Change Column Comment
Column comments can now be changed using the alter command.
Support global sort for Secondary index table
Using global sort for an SI table can improve query performance by accelerating the filter process.
Reorder filter according to the column storage ordinal to improve reading
Filters are now reordered according to the column storage ordinal to avoid backward seeks. This is helpful in cloud scenarios, where scanning is relatively costly.
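The reordering idea can be sketched as follows. This is an illustrative Python example only, not the CarbonData implementation: the column names, the ordinal mapping and the ColumnFilter type are hypothetical. It shows filters being sorted by the position of their column in storage, so that a reader evaluating them in sequence only moves forward through the file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnFilter:
    column: str
    predicate: str  # kept as plain text purely for illustration


# Hypothetical storage order of the columns inside a data file.
storage_ordinal = {"country": 0, "city": 1, "user_id": 2, "amount": 3}


def reorder_filters(filters, ordinals):
    """Sort filters by the storage ordinal of their column (stable for ties)."""
    return sorted(filters, key=lambda f: ordinals[f.column])


# Filters in the order the user happened to write them in the query.
query_filters = [
    ColumnFilter("amount", "> 100"),
    ColumnFilter("country", "= 'IN'"),
    ColumnFilter("user_id", "< 5000"),
]

ordered = reorder_filters(query_filters, storage_ordinal)
print([f.column for f in ordered])  # ['country', 'user_id', 'amount']
```

Evaluating the filters in this order reads column data at increasing offsets, which matters most when each seek translates into a separate remote request.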
New Reindex command to repair missing SI segments
A separate SQL reindex command (reindex [index_table] on table maintable) is now supported to call the SI repair logic without running a load or insert.
Support order by and limit push down for secondary index queries
SI scan time is improved by pushing down limit and order by, which reduces the output size. The push down applies when a limit is present and the order by column and all the filter columns are SI columns.
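The effect of this push down can be sketched in Python. This is a conceptual illustration under stated assumptions, not CarbonData's planner or scan code: the SI rows, the function names and the eligibility check are hypothetical and only mirror the condition described above (a limit is present, and the order by and filter columns are all SI columns).

```python
import heapq

# Hypothetical SI rows: (indexed_value, main_table_row_id).
si_rows = [("delhi", 7), ("agra", 3), ("pune", 1), ("agra", 9), ("kochi", 4)]


def can_push_down(limit, order_by_columns, filter_columns, si_columns):
    """Push down only if a limit exists and all columns involved are SI columns."""
    if limit is None:
        return False
    si = set(si_columns)
    return set(order_by_columns) <= si and set(filter_columns) <= si


def si_scan_without_pushdown(rows, predicate):
    """Return every matching SI row; sorting and limiting happen later."""
    return [row for row in rows if predicate(row[0])]


def si_scan_with_pushdown(rows, predicate, limit):
    """Return only the first `limit` matching rows in ascending order."""
    matching = (row for row in rows if predicate(row[0]))
    return heapq.nsmallest(limit, matching)


def starts_before_p(value):
    return value < "p"


if can_push_down(2, ["city"], ["city"], ["city"]):
    print(si_scan_with_pushdown(si_rows, starts_before_p, 2))
    # [('agra', 3), ('agra', 9)]
print(len(si_scan_without_pushdown(si_rows, starts_before_p)))  # 4
```

With the push down, the SI scan hands two row references to the main-table scan instead of four, which is the reduction in output size the feature targets.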
Please find the detailed JIRA list here.