Statistics and Data Mining in Hive
This page is the secondary documentation for the slightly more advanced statistical and data mining functions that are being integrated into Hive, and especially the functions that warrant more than one-line descriptions.
ngrams() and context_ngrams(): N-gram frequency estimation
N-grams are subsequences of length N drawn from a longer sequence. The purpose of the
ngrams() UDAF is to find the
k most frequent n-grams from one or more sequences. It can be used in conjunction with the
sentences() UDF to analyze unstructured natural language text, or the
collect() function to analyze more general string data.
Contextual n-grams are similar to n-grams, but allow you to specify a 'context' string around which n-grams are to be estimated. For example, you can specify that you're only interested in finding the most common two-word phrases in text that follow the context "I love". You could achieve the same result by manually stripping sentences of non-contextual content and then passing them to
context_ngrams() makes it much easier.
- (ngrams) Find important topics in text in conjunction with a stopword list.
2. (ngrams) Find trending topics in text.
3. (context_ngrams) Extract marketing intelligence around certain words (e.g., "Twitter is ___").
4. (ngrams) Find frequently accessed URL sequences.
5. (context_ngrams) Find frequently accessed URL sequences that start or end at a particular URL.
6. (context_ngrams) Pre-compute common search lookaheads.
The command above will return the top-100 bigrams (2-grams) from a hypothetical table called
tweet column is assumed to contain a string with arbitrary, possibly meaningless, text. The
lower() UDF first converts the text to lowercase for standardization, and then
sentences() splits up the text into arrays of words. The optional fourth argument is the precision factor that control the tradeoff between memory usage and accuracy in frequency estimation. Higher values will be more accurate, but could potentially crash the JVM with an OutOfMemory error. If omitted, sensible defaults are used.
The command above will return a list of the top 100 words that follow the phrase "i love" in a hypothetical database of Twitter tweets. Each
null specifies the position of an n-gram component to estimate; therefore, every query must contain at least one
null in the context array.
Note that the following two queries are identical, but
ngrams() will be slightly faster in practice.
histogram_numeric(): Estimating frequency distributions
Histograms represent frequency distributions from empirical data. The kind that is referred to here are histograms with variable-sized bins. Specifically, this UDAF will return a list of (x,y) pairs that represent histogram bin centers and heights. It's up to you to then plot them in Excel / Gnuplot / Matlab / Mathematica to get a nice graphical display.
- Estimating the frequency distribution of a column, possibly grouped by other attributes.
2. Choosing discretization points in a continuous valued column.
The command above is self-explanatory. Converting the output into a graphical display is a bit more involved. The following Gnuplot command should do it, assuming that you've parsed the output from
histogram() into a text file of (x,y) pairs called