Apache DataSketches (https://datasketches.apache.org/) is integrated into Hive via HIVE-22939.
This enables various kind of sketch operations thru regular sql statement.
Sketch functions
Naming convention
All sketch functions are registered using the following naming convention:
ds_{sketchType}_{functionName}
For example we have a function called: ds_hll_estimate which could be used to estimate the distinct values from an hll sketch.
sketchType
For detailed info about the sketches themself please refer to the datasketches site!
- frequency
- hll
- cpc
- theta
- frequent items
- freq
- histograms
- kll
functionName
name | description |
---|---|
sketch | generates sketch data from input |
estimate | computes the estimate for frequency related sketches |
union | aggregate function to merge multiple sketches |
union_f | unions 2 sketches given in the arguments |
n | number of elements |
cdf | cumulative distribution |
rank | estimates the rank of the given element; returns a value in the range of 0~1 |
intersect | aggregate to intersect multiple sketches |
intersect_f | intersect 2 sketches given in the arguments |
stringify | returns the the sketch in a more readable form |
List declared sketch functions
Given that we have ~60 functions registered I would recommend to also consider listing/getting info about a single udf.
You could list all functions prefixed by ds_ using:
show functions like 'ds_%';
And you can access the description of a function like:
desc function ds_freq_sketch;
Integration with materialized views
Sketch aggregation(s) are exposed to Calcite by some extensions - which could enable both the usage of an MV in a smaller dimension query; or could help in incremental updates.
Examples
#0 Prepare sample table
-- prepare input data create temporary table sketch_input (id int, category char(1)); insert into table sketch_input values (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'), (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
-- prepare input data
create temporary table sketch_input (id int, category char(1));
insert into table sketch_input values
(1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'), (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
(6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'), (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
-- build sketches per category
create temporary table sketch_intermediate (category char(1), sketch binary);
insert into sketch_intermediate select category, ds_hll_sketch(id) from sketch_input group by category;
-- get unique count estimates per category
select category, ds_hll_estimate(sketch) from sketch_intermediate;
-- union sketches across categories and get overall unique count estimate
select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
#1 HLL