In this era of Big Data, traditional data processing applications are inadequate to deal with the challenges of analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy.
In most of our applications we gather data from various sources: transactional data derived from users' interactions with the application, data collected from machine sensors, sales data, data from social media, or the collected results of scientific experiments. This data can be structured, like a SQL database store or JSON data, as well as unstructured, like documents and streaming data from sensors. Since this data is generated constantly over time, it becomes enormous, and special techniques are needed to handle it.
For us as an organization this data is very important, as it gives insight into many things, e.g. root causes of failures, user habits, fraud, sales figures, geographies, service disruptions, etc. But this data is totally useless if we cannot analyse it quickly and draw conclusions. Hence analysing this data has become a prime need for us.
This need has resulted in a slew of open source innovation in Big Data tools and technologies to collect, store and analyse high-volume, high-velocity, high-variety data. Various open source tools help us store and ingest huge amounts of data (in petabytes) as well as run analyses on it by executing different kinds of queries: analytical queries, batch queries, operational queries and so on.
Most often our use cases define which specific set of data we need, and then we write different kinds of queries for it. Some examples of use cases, and the queries needed to get data for them, are as follows:
1. Operational Queries (Random Access): In an E-Commerce application we want to display all the orders placed by a particular user (see Figure 1).
In this case there may be millions of rows, but we are mainly focused on the data for the currently logged-in user, so we are interested in the specific set of rows that meet the criteria (user_id == ). These kinds of queries are Operational Queries, which involve certain rows selected by a particular filter. In the Big Data world, HBase is one of the technologies used for this kind of query.
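To make the access pattern concrete, here is a minimal sketch of such an operational query, using SQLite purely as a stand-in datastore (the table and column names are illustrative, not the application's actual schema):

```python
import sqlite3

# In-memory stand-in for the orders store; schema is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, user_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 42, 19.99), (2, 7, 5.00), (3, 42, 3.50)],
)

# Operational query: out of all rows, fetch only those for the current user.
current_user = 42
rows = conn.execute(
    "SELECT order_id, amount FROM orders WHERE user_id = ?", (current_user,)
).fetchall()
```

The store may hold millions of rows, but the query touches only the handful matching the filter, which is exactly the random-access pattern HBase is built for.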
2. Batch Queries (Sequential Access): In the same E-Commerce Application we want to generate a report which contains the number of orders per user, the amount of money spent by that user, number of returns, number of coupons used and monthly orders (see Figure 2). For this use case we have to access all the rows and then aggregate or reduce the data to generate a report.
Most Business Intelligence (BI) tools execute these queries. Earlier we used Data Warehouse technology, which was developed specifically for these needs in the relational world. So in batch queries we are focused on the data from all or a majority of rows, while the number of columns involved might be only a few. Apache Hive is quite popular for these kinds of queries.
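The batch pattern can be sketched the same way; again SQLite is only a stand-in and the schema is illustrative. Note how, unlike the operational query, every row is scanned and then reduced per user:

```python
import sqlite3

# Illustrative orders table: (user_id, amount, returned-flag).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INTEGER, amount REAL, returned INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(42, 19.99, 0), (7, 5.00, 1), (42, 3.50, 0)],
)

# Batch query: scan all rows, aggregate per user to build the report.
report = conn.execute("""
    SELECT user_id,
           COUNT(*)      AS num_orders,
           SUM(amount)   AS total_spent,
           SUM(returned) AS num_returns
    FROM orders
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()
```

Only a few columns are involved, but all rows are read, which is why columnar, scan-oriented engines like Hive suit this workload.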
3. Interactive Analysis Queries (OLAP Queries): Taking the same application as an example, we need to provide analysis of users, items, sales, warehouse inventory for the items, sizes available, vendor information and so on (see Figure 3). In this use case we need to query multiple tables, filter multiple columns, and use aggregate functions or nested queries. Such queries are often generated by interactive analysis tools like Tableau and Pentaho. Even with graph databases, which provide excellent fast retrieval for OLTP queries, we have to use a Big Data tool for OLAP. Many tools are available for these kinds of queries, such as Apache Impala and Apache Kylin.
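An OLAP-style query combines both patterns: multi-table joins, filters on several columns, and aggregation. A minimal sketch (again with SQLite as a stand-in and invented table names):

```python
import sqlite3

# Illustrative star-schema fragment: users, items, and an orders fact table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE items (item_id INTEGER, category TEXT)")
conn.execute("CREATE TABLE orders (user_id INTEGER, item_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "NA"), (2, "EU")])
conn.executemany("INSERT INTO items VALUES (?, ?)", [(10, "shoes"), (11, "hats")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10, 50.0), (1, 11, 20.0), (2, 10, 30.0)],
)

# OLAP-style query: join multiple tables, then aggregate sales
# per (region, category) slice.
result = conn.execute("""
    SELECT u.region, i.category, SUM(o.amount) AS sales
    FROM orders o
    JOIN users u ON u.user_id = o.user_id
    JOIN items i ON i.item_id = o.item_id
    GROUP BY u.region, i.category
    ORDER BY u.region, i.category
""").fetchall()
```

This is the shape of query that BI tools generate interactively, and that engines like Impala and Kylin are designed to answer quickly.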
As you can observe in the section above, there is a specific tool available for each kind of query, and we can use them to solve these problems. So what is the issue here?
Primarily, these are the challenges that we faced:
Tool proliferation: A different tool is needed for each type of query, e.g. HBase is suited for random-access queries and Hive for batch queries.
Data replication: Deploying multiple datastores/engines means that the same data has to be stored in all of these engines.
High cost: Since we replicate the data to meet our generic needs, we bear a whole lot of overhead in terms of storage, processing power and maintenance, which adds to the cost.
Maintenance: Managing and maintaining all the different frameworks is quite a headache.
Query performance: Excessive I/O scans and excessive CPU utilization in the above-mentioned frameworks slowed query performance severalfold.
Space and encoding: Space optimization, and querying on encoded data without decoding it until the last stage, were further challenges.
Row construction: Reducing the row reconstruction cost.
So we researched a solution that could define a single data format for all our specific needs, such that all types of queries can be executed on the same framework with the same efficiency.
Is CarbonData the solution?
CarbonData was envisaged to solve this problem: it provides a way to create a single data format for all kinds of queries and analysis on Big Data. Apache CarbonData can provide blazing fast performance for any query scenario described above using a single storage format. What makes CarbonData so special is its storage format, which is designed and developed from the ground up to provide fast responses for any query scenario.
Older file formats in the Hadoop ecosystem, like Parquet and ORC, fail to serve all query domains (OLAP, sequential and random access) with equal efficiency. While working with these two, we found them to work well for big scans. They also support HDFS, allowing an existing Hadoop cluster to be leveraged, but they turn out to be unsuitable for providing sub-second responses to primary-key lookups and to OLAP-style queries over big data that involve filters and fetch all columns of a record. Primary-key-based fetching needs indexes, and that is one of the key features of CarbonData.
Following are the prominent features of CarbonData.
Global Dictionary with Deferred Decoding: Allows the query engine to perform computations such as GROUP BY/aggregation and sort using encoded values, without having to decode each and every row. Decoding needs to be done only for the final query result, which in most cases contains far fewer rows than the scanned data. This helps in batch queries and interactive queries.
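The idea behind deferred (late) decoding can be sketched in a few lines; the dictionary and column values below are invented for illustration, not CarbonData's actual on-disk layout:

```python
from collections import Counter

# Illustrative global dictionary: each distinct value gets an integer code.
dictionary = {"apple": 0, "banana": 1, "cherry": 2}
reverse = {code: value for value, code in dictionary.items()}

# The column is stored and scanned as small encoded integers, not strings.
encoded_column = [0, 1, 0, 2, 0, 1]

# Group-by/count runs entirely on the encoded values...
counts = Counter(encoded_column)

# ...and only the final (small) result set is decoded back to strings.
result = {reverse[code]: n for code, n in counts.items()}
```

Because comparison and grouping happen on compact integers, only the handful of result rows pays the decoding cost rather than every scanned row.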
Index (MDK, Min-Max, Inverted Index): Enables the scanner to quickly locate the required rows without having to read and compare each and every row against the query predicates (filters). This saves a lot of I/O bandwidth and time. This helps in interactive queries and operational queries.
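As a rough sketch of how a min-max index lets a scanner skip data, consider blocks that carry min/max metadata for a column (the block structure and values here are invented, not CarbonData's actual format):

```python
# Each block carries min/max metadata for a column; an equality filter can
# skip any block whose [min, max] range cannot contain a matching value.
blocks = [
    {"min": 1,  "max": 10, "rows": [3, 7, 10]},
    {"min": 11, "max": 50, "rows": [12, 40]},
    {"min": 51, "max": 99, "rows": [60, 75]},
]

def scan(blocks, predicate_value):
    """Return rows equal to predicate_value, skipping blocks via min/max."""
    hits, blocks_read = [], 0
    for block in blocks:
        if not (block["min"] <= predicate_value <= block["max"]):
            continue  # skipped without reading any row data
        blocks_read += 1
        hits.extend(r for r in block["rows"] if r == predicate_value)
    return hits, blocks_read

hits, blocks_read = scan(blocks, 40)
```

Here only one of the three blocks is actually read; the other two are eliminated from metadata alone, which is where the I/O saving comes from.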
Column Groups: Enable CarbonData to provide row-store-like query performance for wide-table queries, in spite of CarbonData being a columnar storage format. They help eliminate the implicit join that is otherwise required to reconstruct rows from a columnar storage format. This makes operational queries several orders of magnitude faster.
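The "implicit join" and how a column group avoids it can be sketched as follows (the in-memory layouts here are illustrative, not the real file format):

```python
# Pure columnar layout: each column is stored separately, so reconstructing
# one row needs one lookup per column -- the "implicit join".
columnar = {
    "user_id": [1, 2, 3],
    "name":    ["a", "b", "c"],
    "city":    ["x", "y", "z"],
}

def row_from_columns(store, i):
    # Touches every column store once per reconstructed row.
    return {col: values[i] for col, values in store.items()}

# Column group: frequently co-accessed columns stored together row-wise,
# so a single read yields the whole group.
name_city_group = [("a", "x"), ("b", "y"), ("c", "z")]

def row_from_group(group, i):
    name, city = group[i]
    return {"name": name, "city": city}
```

For a wide table with many columns per row, fetching one tuple from a column group replaces many per-column lookups, which is the row-store-like behaviour described above.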
Advanced Push Down Optimizations: Carbon pushes as much of query processing as possible close to the data to minimize the amount of data being read, processed, converted and transmitted/shuffled.
Projection and Filters: Since Carbon uses a columnar format, it reads only the required columns from the store, and only the rows that match the filter conditions provided in the query.
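A toy reader makes the combination of projection and filter pushdown concrete; the column store and `read` function below are illustrative, not CarbonData's API:

```python
# Toy columnar store: only projected columns are materialized, and the
# filter is applied during the scan so non-matching rows never leave it.
columns = {
    "user_id": [1, 2, 1, 3],
    "amount":  [10.0, 5.0, 7.5, 2.0],
    "city":    ["x", "y", "x", "z"],  # never read for the query below
}

def read(columns, projection, filter_col, filter_val):
    # Evaluate the filter against just the filter column...
    mask = [v == filter_val for v in columns[filter_col]]
    # ...then materialize only the projected columns, only for matching rows.
    return {c: [v for v, keep in zip(columns[c], mask) if keep]
            for c in projection}

result = read(columns, projection=["amount"], filter_col="user_id", filter_val=1)
```

The `city` column is never touched and non-matching rows are dropped inside the scan, which is the "minimize data read, processed and shuffled" behaviour the push-down optimizations aim for.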
The following figures show a comparison of other tools with CarbonData:
Current Users of CarbonData
Hulu, a North American Internet video company, is one of the first users of CarbonData. They switched over to CarbonData to speed up their systems by utilising its features. They filter out 2% to 5% of the data for aggregation; the filters are mostly based on 5 to 10 columns, and the result set has 100+ columns. The fine-grained index and column groups provided in CarbonData render speedier results than Parquet/ORC in this use case.
CarbonData provides features that are very useful to us, like low latency, space efficiency, rapid processing and intelligent solutions to Big Data problems. CarbonData's indexing significantly accelerates query performance and reduces I/O scans and CPU usage when there are filters in the query. The CarbonData index consists of multiple levels of indices, allowing a processing framework to leverage them to reduce the tasks it needs to schedule and process; it also supports skip scans at a finer-grained unit (called a blocklet) during task-side scanning, instead of scanning the whole file. By supporting efficient compression and global encoding schemes, CarbonData allows queries on compressed/encoded data; the data is converted only just before the results are returned to the user, which is "late materialization". It also allows multiple columns to form a column group that is stored in row format, reducing the row reconstruction cost at query time.