Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Comment: Explain syntax of explode for maps

Hive Operators and User-Defined Functions (UDFs)

Table of Contents

Tip
titleCase-insensitive

All Hive keywords are case-insensitive, including the names of Hive operators and functions.

In the CLI, use the commands below to show the latest documentation:

No Format

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

...

Return Type

Name(Signature)

Description

varies

java_method(class, method[, arg1[, arg2..]])

Synonym for reflect (as of Hive 0.9.0)

varies

reflect(class, method[, arg1[, arg2..]])

Use this UDF to call Java methods by matching the argument signature (uses reflection). (as of Hive 0.7.0)

int

hash(a1[, a2...])

Returns a hash value of the arguments (as of Hive 0.4)

xpath

The following functions are described in LanguageManual XPathUDF:

...

Example: src_json table is a single column (json), single row table:

Code Block

+----+
                               json
+----+
{"store":
  {"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
   "bicycle":{"price":19.95,"color":"red"}
  },
 "email":"amy@only_for_json_udf_test.net",
 "owner":"amy"
}
+----+

The fields of the json object can be extracted using these queries:

Code Block

hive> SELECT get_json_object(src_json.json, '$.owner') FROM src_json;
amy

hive> SELECT get_json_object(src_json.json, '$.store.fruit\[0]') FROM src_json;
{"weight":8,"type":"apple"}

hive> SELECT get_json_object(src_json.json, '$.non_exist_key') FROM src_json;
NULL

...

explode() takes in an array (or a map) as an input and outputs the elements of the array (map) as separate rows. UDTF's can be used in the SELECT expression list and as a part of LATERAL VIEW.

...

Then running the query:

Code Block
sql
sql

SELECT explode(myCol) AS myNewCol FROM myTable;

...

(int) myNewCol

100

200

300

400

500

600

The usage with Maps is similar:

Code Block
languagesql
SELECT explode(myMap) AS myMapKey, myMapValue FROM myMapTable;

posexplode

Info
titleVersion

Available as of Hive 0.13.0. See HIVE-4943.

...

Then running the query:

Code Block
sql
sql

SELECT posexplode(myCol) AS pos, myNewCol FROM myTable;

...

A new json_tuple() UDTF is introduced in hive 0.7. It takes a set of names (keys) and a JSON string, and returns a tuple of values using one function. This is much more efficient than calling GET_JSON_OBJECT to retrieve more than one key from a single JSON string. In any case where a single JSON string would be parsed more than once, your query will be more efficient if you parse it once, which is what JSON_TUPLE is for. As JSON_TUPLE is a UDTF, you will need to use the LATERAL VIEW syntax in order to achieve the same goal.

For example,

Code Block

select a.timestamp, get_json_object(a.appevents, '$.eventid'), get_json_object(a.appenvets, '$.eventname') from log a;

should be changed to

Code Block

select a.timestamp, b.*
from log a lateral view json_tuple(a.appevent, 'eventid', 'eventname') b as f1, f2;

...

The parse_url_tuple() UDTF is similar to parse_url(), but can extract multiple parts of a given URL, returning the data in a tuple. Values for a particular key in QUERY can be extracted by appending a colon and the key to the partToExtract argument, e.g. parse_url_tuple('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY:k1', 'QUERY:k2') returns a tuple with values of 'v1','v2'. This is more efficient than calling parse_url() multiple times. All the input parameters and output column types are string.

Code Block

SELECT b.*
FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;

...

A typical OLAP pattern is that you have a timestamp column and you want to group by daily or other less granular date windows than by second. So you might want to select concat(year(dt),month(dt)) and then group on that concat(). But if you attempt to GROUP BY or SORT BY a column on which you've applied a function and alias, like this:

Code Block

select f(col) as fc, count(*) from table_name group by fc;

You will get an error:

Code Block

FAILED: Error in semantic analysis: line 1:69 Invalid Table Alias or Column Reference fc

Because you are not able to GROUP BY or SORT BY a column alias on which a function has been applied. There are two workarounds. First, you can reformulate this query with subqueries, which is somewhat complicated:

Code Block

select sq.fc,col1,col2,...,colN,count(*) from
  (select f(col) as fc,col1,col2,...,colN from table_name) sq
 group by sq.fc,col1,col2,...,colN;

Or you can make sure not to use a column alias, which is simpler:

Code Block

select f(col) as fc, count(*) from table_name group by f(col);

...

The context of a UDF's evaluate method is one row at a time. A simple invocation of a UDF like

Code Block

SELECT length(string_col) FROM table_name;

...

If you would like to control which rows get sent to the same UDF (and possibly in what order), you will have the urge to make the UDF evaluate during the reduce phase. This is achievable by making use of DISTRIBUTE BY, DISTRIBUTE BY + SORT BY, CLUSTER BY. An example query would be:

Code Block

SELECT reducer_udf(my_col, distribute_col, sort_col) FROM
(SELECT my_col, distribute_col, sort_col FROM table_name DISTRIBUTE BY distribute_col SORT BY distribute_col, sort_col) t

...