...
Consider the database schema created by the following DDL statements:
Code Block | ||||
---|---|---|---|---|
| ||||
<DDL statements> |
Assume we want to obtain frequently information about employees that were hired in different period granularities after 2016 and their departments. We may create the following materialized view:
...
CREATE MATERIALIZED VIEW mv1
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01';
CREATE TABLE emps (
empid INT,
deptno INT,
name VARCHAR(256),
salary FLOAT,
hire_date TIMESTAMP)
STORED AS ORC
TBLPROPERTIES ('transactional'='true');
CREATE TABLE depts (
deptno INT,
deptname VARCHAR(256),
locationid INT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true'); |
Assume we want to obtain frequently information about employees that were hired in different period granularities after 2016 and their departments. We may create the following materialized view:
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE MATERIALIZED VIEW mv1
AS
SELECT empid, deptname, hire_date
FROM emps JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2016-01-01'; |
Then, the following query extracting information about employees that were hired in Q1 2018 is issued to Hive
Then, the following query extracting information about employees that were hired in Q1 2018 is issued to Hive:
...
SELECT empid, deptname
FROM emps
JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31';
Hive will be able to rewrite the incoming query using the materialized view, including a compensation predicate on top of the scan over the materialization. Though the rewriting happens at the algebraic level, to illustrate this example, we include the SQL statement equivalent to the rewriting using the mv
used by Hive to answer the incoming query:
Code Block | ||||
---|---|---|---|---|
| ||||
SELECT empid, deptname
FROM emps
JOIN depts
ON (emps.deptno = depts.deptno)
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31'; |
Hive will be able to rewrite the incoming query using the materialized view, including a compensation predicate on top of the scan over the materialization. Though the rewriting happens at the algebraic level, to illustrate this example, we include the SQL statement equivalent to the rewriting using the mv
used by Hive to answer the incoming query:
Code Block | ||||
---|---|---|---|---|
| ||||
SELECT empid, deptname
FROM mv1
WHERE hire_date >= '2018-01-01'
AND hire_date <= '2018-03-31'; |
Example 2
For the second example, consider the star schema based on the SSB benchmark created by the following DDL statements:
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE TABLE `customer`( `c_custkey` BIGINT, `c_name` STRING, `c_address` STRING, `c_city` STRING, `c_nation` STRING, `c_region` STRING, `c_phone` STRING, `c_mktsegment` STRING, PRIMARY KEY (`c_custkey`) DISABLE RELY) STORED AS ORC TBLPROPERTIES ('transactional'='true'); CREATE TABLE `dates`( `d_datekey` BIGINT, `d_date` STRING, `d_dayofweek` STRING, `d_month` STRING, `d_year` INT, `d_yearmonthnum` INT, `d_yearmonth` STRING, `d_daynuminweek` INT, `d_daynuminmonth` INT, `d_daynuminyear` INT, `d_monthnuminyear` INT, `d_weeknuminyear` INT, `d_sellingseason` STRING, `d_lastdayinweekfl` INT, `d_lastdayinmonthfl` INT, `d_holidayfl` INT, `d_weekdayfl`INT, PRIMARY KEY (`d_datekey`) DISABLE RELY) STORED AS ORC TBLPROPERTIES ('transactional'='true'); CREATE TABLE `part`( `p_partkey` BIGINT, `p_name` STRING, `p_mfgr` STRING, `p_category` STRING, `p_brand1` STRING, `p_color` STRING, `p_type` STRING, `p_size` INT, `p_container` STRING, PRIMARY KEY (`p_partkey`) DISABLE RELY) STORED AS ORC TBLPROPERTIES ('transactional'='true'); CREATE TABLE `supplier`( `s_suppkey` BIGINT, `s_name` STRING, `s_address` STRING, `s_city` STRING, `s_nation` STRING, `s_region` STRING, `s_phone` STRING, PRIMARY KEY (`s_suppkey`) DISABLE RELY) STORED AS ORC TBLPROPERTIES ('transactional'='true'); CREATE TABLE `lineorder`( `lo_orderkey` BIGINT, `lo_linenumber` int, `lo_custkey` BIGINT not null DISABLE RELY, `lo_partkey` BIGINT not null DISABLE RELY, `lo_suppkey` BIGINT not null DISABLE RELY, `lo_orderdate` BIGINT not null DISABLE RELY, `lo_ordpriority` STRING, `lo_shippriority` STRING, `lo_quantity` DOUBLE, `lo_extendedprice` DOUBLE, `lo_ordtotalprice` DOUBLE, `lo_discount` DOUBLE, `lo_revenue` DOUBLE, `lo_supplycost` DOUBLE, `lo_tax` DOUBLE, `lo_commitdate` BIGINT, `lo_shipmode` STRING, PRIMARY KEY (`lo_orderkey`) DISABLE RELY, CONSTRAINT fk1 FOREIGN KEY (`lo_custkey`) REFERENCES `customer_n1`(`c_custkey`) DISABLE RELY, CONSTRAINT fk2 FOREIGN KEY (`lo_orderdate`) REFERENCES `dates_n0`(`d_datekey`) DISABLE RELY, CONSTRAINT fk3 FOREIGN KEY (`lo_partkey`) REFERENCES `ssb_part_n0`(`p_partkey`) DISABLE RELY, deptname FROM mv1 WHERE hire_date >= '2018-01-01' AND hire_date <= '2018-03-31'; |
...
For the second example, consider the database schema created by the following DDL statements:
...
<DDL statements>
CONSTRAINT fk4 FOREIGN KEY (`lo_suppkey`) REFERENCES `supplier_n0`(`s_suppkey`) DISABLE RELY)
STORED AS ORC
TBLPROPERTIES ('transactional'='true'); |
As you can observe, we declare multiple integrity constraints for the database, using the RELY
keyword so they are visible to the optimizer. Now assume we want to create a materialization that denormalizes the database contents (consider dims
to be the set of dimensions that we will be querying often):
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE MATERIALIZED VIEW mv2 AS SELECT <dims>, lo_revenue, lo_extendedprice * lo_discount AS d_price, lo_revenue - lo_supplycost FROM customer, dates, lineorder, part, supplier WHERE lo_orderdate = d_datekey AND lo_partkey = p_partkey AND lo_suppkey = s_suppkey AND lo_custkey = c_custkey; |
...
For the third example, consider the database schema with a single table that stores the edit events produced by a given website:website:
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE TABLE `wiki` (
`time` TIMESTAMP,
`page` STRING,
`user` STRING,
`characters_added` BIGINT,
`characters_removed` BIGINT)
STORED AS ORC
TBLPROPERTIES ('transactional'='true'); | ||||
Code Block | ||||
sql | sql | <DDL statements> |
For this example, we will use Druid to store the materialized view. Assume we want to execute queries over the table, however we are not interested on any information about the events at a higher time granularity level than a minute. We may create the following materialized view that rolls up the events by the minute:
Code Block | ||||
---|---|---|---|---|
| ||||
CREATE MATERIALIZED VIEW mv3 STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler' AS SELECT floor(time to minute) as `__time`, page, SUM(characters_added) AS c_added, SUM(characters_removed) AS c_removed FROM wiki GROUP BY floor(time to minute), page; |
...