Apache DataSketches (https://datasketches.apache.org/) is integrated into Hive via HIVE-22939.
This enables various kinds of sketch operations through regular SQL statements.

Sketch functions

Naming convention

All sketch functions are registered using the following naming convention:

ds_{sketchType}_{functionName}

For example, the function ds_hll_estimate estimates the number of distinct values from an HLL sketch.
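
As a minimal sketch of the convention in use: the query below assumes a hypothetical table sketch_input with an integer column id, and assumes that ds_hll_sketch is the hll variant of the sketch function listed under functionName. It is meant as an illustration, not a reference.

```sql
-- sketch_input and its id column are hypothetical, for illustration only.
-- ds_hll_sketch   : sketchType = hll, functionName = sketch   (builds a sketch from the input)
-- ds_hll_estimate : sketchType = hll, functionName = estimate (reads the estimate from a sketch)
select ds_hll_estimate(ds_hll_sketch(id)) as approx_distinct_ids
from sketch_input;
```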

sketchType

For detailed information about the sketches themselves, please refer to the DataSketches site (https://datasketches.apache.org/).

  • frequency
    • hll
    • cpc
    • theta
  • frequent items
    • freq
  • histograms
    • kll

functionName

name         description
sketch       generates sketch data from input
estimate     computes the estimate for frequency-related sketches
union        aggregate function to merge multiple sketches
union_f      unions 2 sketches given in the arguments
n            number of elements
cdf          cumulative distribution
rank         estimates the rank of the given element; returns a value in the range 0 to 1
intersect    aggregate function to intersect multiple sketches
intersect_f  intersects 2 sketches given in the arguments
stringify    returns the sketch in a more readable form
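
To illustrate how the sketch, estimate and union functions might fit together, here is a hedged example for the hll sketch type. The tables sketch_input and sketch_intermediate and their columns are hypothetical. The function names follow the ds_{sketchType}_{functionName} convention, so confirm them with show functions on your installation.

```sql
-- Hypothetical intermediate table holding one serialized sketch per category.
create temporary table sketch_intermediate (category char(1), sketch binary);

-- Build one HLL sketch per category from the (hypothetical) input table.
insert into sketch_intermediate
select category, ds_hll_sketch(id)
from sketch_input
group by category;

-- Estimate the distinct count within each category.
select category, ds_hll_estimate(sketch)
from sketch_intermediate;

-- Merge all per-category sketches with the union aggregate,
-- then estimate the overall distinct count.
select ds_hll_estimate(ds_hll_union(sketch))
from sketch_intermediate;
```

Storing the sketches in an intermediate table means the overall estimate comes from merging small sketches instead of rescanning the raw rows.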

List declared sketch functions

Since roughly 60 functions are registered, it is often more practical to list them, or to look up the description of a single UDF, directly in Hive.

You can list all functions prefixed with ds_ using:

show functions like 'ds_%';

You can display the description of a single function with:

desc function ds_freq_sketch;
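
The same two statements can be narrowed to a single sketch type. In the example below, the 'ds_hll%' pattern simply follows the naming convention above, and the extended keyword is standard Hive syntax for more detailed function help. How much detail each sketch function provides is not confirmed here.

```sql
-- List only the functions of one sketch type (pattern assumes the ds_{sketchType}_ prefix).
show functions like 'ds_hll%';

-- Request the extended description of a single function.
desc function extended ds_hll_estimate;
```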
