Authors: Jacky Li, Ravindra Pesala

Approaching CarbonData 1.6

When we started the CarbonData project in 2016, its goal was to provide a unified data solution for various data analytics scenarios. Three years have passed, and CarbonData has grown into a popular solution for a wide range of scenarios, including

  • Ad-hoc analytics on detail records
  • Near real-time streaming data analytics
  • Large-scale historical data with update capability
  • Data marts with materialized view and SQL rewrite capability

In CarbonData 1.6.0, the following features further improve CarbonData's capabilities in these scenarios:

  • Distributed index server, which serves block pruning for hyper-scale data. In a real-world use case, we have a production system that serves 10 trillion records in a single table with second-level query response time.
  • Adjustable index key: users can now change SORT_COLUMNS after creating a table, which makes it possible to tune query performance dynamically as they learn more about the business needs.
  • Formal support for Spark, Presto, and Hive, the three most popular compute engines in the big data world.
  • Production-ready Materialized View and SQL rewrite, a powerful feature to accelerate query performance in data mart scenarios (a brief usage sketch follows this list).
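
As a concrete illustration of the adjustable index key and the materialized view features, here is a short sketch of how they can be used from a Spark session. The table, column, and datamap names are made up for illustration, and the exact DDL options may vary slightly between releases:

    // Change the index key after the table has been created; subsequent loads are
    // sorted on the new SORT_COLUMNS, so filters on these columns prune more blocks.
    spark.sql(
      "ALTER TABLE sales SET TBLPROPERTIES('SORT_COLUMNS'='country,city,device_id')")

    // Create a materialized view datamap; matching aggregate queries on the base
    // table can be rewritten to read the pre-aggregated data instead.
    spark.sql(
      """CREATE DATAMAP sales_by_country
        |USING 'mv'
        |AS SELECT country, sum(amount) FROM sales GROUP BY country""".stripMargin)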

Thanks to these unique features, many open-source and commercial CarbonData deployments have gone into production, covering use cases in Internet, telecom, finance and banking, smart city, and other domains.

Towards CarbonData 2

On one hand, as Apache CarbonData approaches version 1.6.0, the technologies supporting the above scenarios have become mature and production ready. On the other hand, analytic workloads and runtime environments are becoming even more diverse than three years ago: for example, the cloud requires more complex data management for cloud storage, and AI requires data management for unstructured data. So now is the right time to discuss the future of CarbonData.

First, let's examine what is changing in the data landscape, so that we can point CarbonData 2.x in a more reasonable direction.

In my humble opinion, there are three trends in the data world:

Trend 1: Data Lake and Data Warehouse are converging

More and more projects rely on a data lake, and more and more users find that the data lake lacks some basic features of a data warehouse, such as data reliability and performance, while the data warehouse is notoriously hard to scale. So, in order to meet business goals, the data lake and the data warehouse are often used in a complementary manner.

Trend 2: Cloud makes data management more complex

With elastic and low-cost resources, the cloud is changing how enterprises store and analyze their data. More and more users employ private cloud and public cloud technology; while they enjoy the benefits the cloud brings, they also have to manage more complex scenarios such as cloud burst and data synchronization. At the end of the day, they may even find there are more data silos to manage.

Trend 3: AI raises new challenges for data management

Nowadays, AI is finding its way into every application. According to an industry study, 80% of the effort before a model can be trained is spent on data preparation, and with the popularity of deep learning, unstructured data is becoming dominant in the AI domain. All these changes in data usage lead to new data management challenges, including data transformation, data tracking, version control, etc.

The Goal of CarbonData 2

With these challenges in mind, the following goal is proposed for CarbonData 2:

Goal: Build a data fabric to manage large-scale, diverse data reliably.

By the term data fabric, I mean

  • It can be used as a Data Lake with high reliability and high performance
  • It can be used as a Data Warehouse with the scalability and flexibility of compute-storage decoupling
  • It is ready to support data management for hybrid cloud and AI applications

Thus, CarbonData becomes a unified data management solution for Data Warehouse + Data Lake + AI.

Finally, it may end up looking something like the following:

[architecture figure]

How can we achieve these goals? We must leverage the strengths of CarbonData 1.x and add new features that move towards them. So, we might ask ourselves what really makes CarbonData 1.x unique. I summarize it as follows.

  • Segment-based data organization, as a basis for large scale, faster loading, and transactional operation management.
  • Materialized and distributed metadata, as a basis for easy data migration and, again, large scale: metadata is treated entirely as data, which avoids a single metadata-holding process limited by memory.
  • Multi-level indexing, as a basis for fast query performance while offering tunable loading performance to the user.
  • Main-delta based IUD (insert/update/delete) with ACID compliance, as a basis for SCD (slowly changing dimension) scenarios while keeping minimal IO impact on an immutable file system.

These are the most important features that make CarbonData unique among so many big data solutions. When going towards CarbonData 2.x, these features should be preserved and leveraged.

CarbonData 2 Roadmap

Finally, we propose the following roadmap for CarbonData 2. This list is an initial draft of what we think CarbonData 2 should have, and the items will be implemented across multiple 2.x versions.

Segment Plugin Interface

  • Refactor the segment-related code into an abstract Segment Plugin Interface and make it format-neutral, so that plugins can be added by community developers
  • The following formats may be supported as built-in plugins in the initial iteration: carbon-row, carbon-columnar, and CSV; more plugins may be contributed by the community (a hypothetical sketch of such an interface follows this list)
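
To make the proposal more concrete, here is one possible shape of such a plugin interface, sketched in Scala. All names here (SegmentFormatPlugin, SegmentWriter, SegmentReader) are hypothetical and only illustrate the idea of a format-neutral segment abstraction; the actual interface will be defined during the refactoring:

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.Filter
    import org.apache.spark.sql.types.StructType

    // Hypothetical sketch only, not the actual CarbonData 2.x API.
    trait SegmentFormatPlugin {
      // Short name used in table properties, e.g. "carbon-columnar", "csv"
      def formatName: String

      // Writer that produces the files of one segment at the given path
      def createWriter(segmentPath: String, schema: StructType): SegmentWriter

      // Reader that scans one segment, optionally pruning with pushed-down filters
      def createReader(segmentPath: String, schema: StructType,
                       filters: Seq[Filter]): SegmentReader
    }

    trait SegmentWriter {
      def write(row: InternalRow): Unit
      def commit(): Unit   // make the segment visible atomically
      def abort(): Unit    // clean up on failure, leaving no dirty data
    }

    trait SegmentReader extends Iterator[InternalRow] {
      def close(): Unit
    }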

Transactional Segment Operation

  • Support transactional operations on segments, making data management ACID-compliant on HDFS and in the cloud. No more dirty data.
  • Make the following operations formal in the SQL and DataFrame APIs (can be done later in 2.x); a usage sketch follows this list:
    • ALTER TABLE ADD SEGMENT
    •  ALTER TABLE DROP SEGMENT
    •  ALTER TABLE MOVE SEGMENT, and support setting segment moving policy
    •  SEGMENT ITERATIVE QUERY (for pagination and sampling purposes)
    •  ALTER TABLE COMPACTION
    •  SHOW SEGMENTS
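
To give a feel for how these operations might look, here is a rough usage sketch with the proposed SQL from a Spark session. The exact keywords, option names, paths, and segment ids below are illustrative only and will be finalized in the 2.x design:

    // Register an externally written folder as a new segment of the table
    spark.sql("ALTER TABLE sales ADD SEGMENT LOCATION 's3a://bucket/staging/2019-07-01'")

    // Move a cold segment to cheaper storage (the moving policy is set separately)
    spark.sql("ALTER TABLE sales MOVE SEGMENT 12 TO 's3a://bucket/cold/'")

    // Merge small segments, then drop an obsolete one
    spark.sql("ALTER TABLE sales COMPACTION 'MINOR'")
    spark.sql("ALTER TABLE sales DROP SEGMENT 3")

    // Inspect the segments of the table
    spark.sql("SHOW SEGMENTS FOR TABLE sales").show(false)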

Cloud Ready

  • Segment location awareness, supporting cloud storage, on-prem and hybrid cloud

  • Segment replication, for caching, cloud burst, cloud data synchronization, etc

New features for Bad Record

  • specify data validation during load

  • a new way to collect bad records during loading, plus an easy-to-use tool for exploring them (the sketch below shows today's load-time options as a baseline)
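
For context, CarbonData already exposes some bad-record handling options at load time; the new collection and exploration tooling would build on this. A minimal sketch, assuming a sales table and illustrative paths:

    // Redirect rows that fail validation into a separate location instead of
    // failing the whole load; the new tooling would make exploring this output easier.
    spark.sql(
      """LOAD DATA INPATH 'hdfs://nn/raw/sales.csv' INTO TABLE sales
        |OPTIONS('BAD_RECORDS_LOGGER_ENABLE'='true',
        |        'BAD_RECORDS_ACTION'='REDIRECT',
        |        'BAD_RECORD_PATH'='hdfs://nn/badrecords/sales')""".stripMargin)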

New features for Update

  • Support MERGE syntax to simplify SCD type 2 style updates (a sketch of the intended usage follows this list)
  • Support timestamp- or version-based queries (Time Travel)
  • Support update/delete on streaming tables
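
As an illustration of the intended MERGE support, the sketch below uses the standard SQL MERGE form for a simple SCD-style upsert; the syntax CarbonData 2.x finally adopts may differ, and the table and column names are made up:

    // Upsert the latest customer records into the dimension table:
    // update rows that already exist, insert the ones that do not.
    spark.sql(
      """MERGE INTO dim_customer t
        |USING staging_customer s
        |ON t.customer_id = s.customer_id
        |WHEN MATCHED THEN UPDATE SET t.address = s.address, t.updated_at = s.updated_at
        |WHEN NOT MATCHED THEN INSERT (customer_id, address, updated_at)
        |  VALUES (s.customer_id, s.address, s.updated_at)""".stripMargin)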

Integration

  • Adapt to the Spark extension interface, removing all features that conflict with the extension mechanism
  • Support Spark 2.4 integration
  • Flink integration through the SDK to write and read CarbonData files
  • Support integration with more Hadoop distributions
  • SDK support for transactional tables

CarbonUI

  • support segment management UI

  • the backend server can act as a central server to trigger data management

  • data connection management between cloud and on-prem

Misc

  • upgrade Java to 1.8 as default
  • compiled SQL templates for higher query performance on small tables
  • support multiple Spark versions in the Maven repo, e.g. carbon-2.3.2_2.11:1.6