Blog - CarbonData - Apache Software Foundation

CDC merge capability comparison of Apache CarbonData and Apache Hudi

Kunal Kapoor posted on Jan 19, 2022

Change data capture (CDC) is a process that captures changes made in a database and ensures that those changes are replicated to a destination such as a data warehouse or a data lake. More generally, this change data can come either from database changes or from custom changes applied by users.
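
To make the idea concrete, here is a minimal sketch of replaying captured changes against a target table held in memory. The `(operation, key, row)` record shape and the `apply_changes` helper are assumptions made for illustration; they are not the change format or merge API of CarbonData or Hudi.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change record: (operation, primary key, changed columns).
# This shape is an assumption for illustration only.
Change = Tuple[str, int, Optional[dict]]


def apply_changes(target: Dict[int, dict], changes: Iterable[Change]) -> Dict[int, dict]:
    """Return a copy of `target` with the change records applied in order."""
    merged = dict(target)
    for op, key, row in changes:
        if op == "delete":
            # Deleting a key that is already absent is treated as a no-op.
            merged.pop(key, None)
        elif op in ("insert", "update"):
            # Upsert: start from the existing row (if any) and overlay the changes.
            merged[key] = {**merged.get(key, {}), **row}
        else:
            raise ValueError(f"unknown operation: {op}")
    return merged


target = {
    1: {"name": "alice", "city": "Pune"},
    2: {"name": "bob", "city": "Delhi"},
}
changes = [
    ("update", 1, {"city": "Mumbai"}),
    ("delete", 2, None),
    ("insert", 3, {"name": "carol", "city": "Chennai"}),
]
result = apply_changes(target, changes)
```

A real CDC merge works on distributed tables rather than dictionaries, but the per-key decision (insert, update, or delete) is the same.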

Read the complete blog here.

Make Apache Spark better with CarbonData

Kunal Kapoor posted on Sep 21, 2021

Spark is no doubt a powerful processing engine and a distributed cluster computing framework for fast processing. Unfortunately, there are a few areas where Spark falls short. Combining Apache Spark with Apache CarbonData can overcome these drawbacks. Some of the drawbacks of Apache Spark are listed below:

No support for ACID transactions
No data quality enforcement
Small files problem
Inefficient data skipping

Read the complete blog here.

Comparative study of Apache Iceberg, Open Delta, Apache CarbonData and Hudi

Kunal Kapoor posted on Sep 21, 2021

We have seen a lot of interest in an efficient and reliable way to bring mutation and transaction capabilities to data lakes. In a data lake, it is very common for users to generate reports from a single set of data. As various types of data flow into the data lake, that data cannot remain immutable. Use cases that require mutating data include data that changes over time, late-arriving data, balancing real-time availability with backfilling, state-changing data such as CDC, data snapshotting, and data cleansing. While reports are being generated, these use cases result in writes and updates to the same set of tables.

Read the complete blog here.

Boosting CarbonData Query Performance with Materialized views

Kunal Kapoor posted on Sep 21, 2021

A materialized view is a pre-computed data set and one of the most important query performance tuning tools in big data systems, allowing users to pre-join complex views and pre-compute summaries for quick response times. In CarbonData, materialized views improve performance by pre-computing relevant query projections, filters, and expensive operations such as aggregations and joins. With materialized views on a Carbon table, queries can avoid unnecessary full scans of large tables and run faster.
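
The effect of pre-computation can be sketched in plain Python. The `sales` rows and the `build_summary` and `total_for` helpers are hypothetical names chosen for this illustration; this is not CarbonData's materialized view API, only the underlying idea of answering a query from a stored summary instead of rescanning the base data.

```python
from collections import defaultdict

# Hypothetical base table: (country, product, amount).
sales = [
    ("IN", "phone", 300.0),
    ("IN", "laptop", 900.0),
    ("US", "phone", 350.0),
    ("US", "phone", 400.0),
]


def build_summary(rows):
    """Pre-compute total sales per country once, like a stored aggregate."""
    totals = defaultdict(float)
    for country, _product, amount in rows:
        totals[country] += amount
    return dict(totals)


def total_for(country, rows, summary=None):
    """Answer from the summary when one exists, else scan every base row."""
    if summary is not None:
        return summary.get(country, 0.0)
    return sum(amount for c, _product, amount in rows if c == country)


summary = build_summary(sales)
fast = total_for("US", sales, summary)  # single lookup in the summary
slow = total_for("US", sales)           # full scan of the base rows
```

Both calls return the same total; the difference is that the summary-backed call does not touch the base rows at query time.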

Read the complete blog here.

CarbonData Distributed Cache Mechanism

Kunal Kapoor posted on Sep 21, 2021

CarbonData improves query performance by caching block/blocklet index information and using that cache to prune data. Pruning reduces the number of files that have to be read, which cuts I/O time and improves overall query performance.
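
As a rough illustration of index-based pruning with a cache, the sketch below keeps per-file min/max statistics in memory and uses them to skip files that cannot match a range filter. The file names, the `FILE_STATS` values, and the `load_index` and `prune` helpers are all assumptions for this example; CarbonData's actual block/blocklet index is considerably richer.

```python
from functools import lru_cache
from typing import List, Tuple

# Hypothetical per-file (min, max) statistics for one column.
FILE_STATS = {
    "part-0": (1, 100),
    "part-1": (101, 200),
    "part-2": (201, 300),
}

index_loads = 0  # counts how often index data is actually loaded


@lru_cache(maxsize=None)
def load_index(file_name: str) -> Tuple[int, int]:
    """Load the index entry for a file; repeated calls are served from cache."""
    global index_loads
    index_loads += 1
    return FILE_STATS[file_name]


def prune(files: List[str], lo: int, hi: int) -> List[str]:
    """Keep only the files whose [min, max] range overlaps the filter [lo, hi]."""
    kept = []
    for name in files:
        file_min, file_max = load_index(name)
        if file_max >= lo and file_min <= hi:
            kept.append(name)
    return kept


files = sorted(FILE_STATS)
first = prune(files, 150, 180)   # only part-1 can contain matching rows
second = prune(files, 190, 210)  # second query reuses the cached index entries
```

After both queries the index has been loaded once per file, and each query reads only the files that survive pruning.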

Read the complete blog here.