Child pages
  • LanguageManual XPathUDF
Skip to end of metadata
Go to start of metadata

Documentation for Built-In User-Defined Functions Related To XPath

UDFs

xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string

  • Functions for parsing XML data using XPath expressions.
  • Since version: 0.6.0

    Overview

The xpath family of UDFs are wrappers around the Java XPath library javax.xml.xpath provided by the JDK. The library is based on the XPath 1.0 specification. Please refer to http://java.sun.com/javase/6/docs/api/javax/xml/xpath/package-summary.html for detailed information on the Java XPath library.

All functions follow the form: xpath_*(xml_string, xpath_expression_string). The XPath expression string is compiled and cached. It is reused if the expression in the next input row matches the previous. Otherwise, it is recompiled. So, the xml string is always parsed for every input row, but the xpath expression is precompiled and reused for the vast majority of use cases.

Backward axes are supported. For example:

Each function returns a specific Hive type given the XPath expression:

  • xpath returns a Hive array of strings.
  • xpath_string returns a string.
  • xpath_boolean returns a boolean.
  • xpath_short returns a short integer.
  • xpath_int returns an integer.
  • xpath_long returns a long integer.
  • xpath_float returns a floating point number.
  • xpath_double,xpath_number returns a double-precision floating point number (xpath_number is an alias for xpath_double).

The UDFs are schema agnostic - no XML validation is performed. However, malformed xml (e.g., <a><b>1</b></aa>) will result in a runtime exception being thrown.

Following are specifics on each xpath UDF variant.

xpath

The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g., another xml node) the function will return an empty array. There are 2 primary uses for this function: to get a list of node text values or to get a list of attribute values.

Examples:

Non-matching XPath expression:

Get a list of node text values:

Get a list of values for attribute 'id':

Get a list of node texts for nodes where the 'class' attribute equals 'bb':

xpath_string

The xpath_string() function returns the text of the first matching node.

Get the text for node 'a/b':

Get the text for node 'a'. Because 'a' has children nodes with text, the result is a composite of text from the children.

Non-matching expression returns an empty string:

Gets the text of the first node that matches '//b':

Gets the second matching node:

Gets the text from the first node that has an attribute 'id' with value 'b_2':

xpath_boolean

Returns true if the XPath expression evaluates to true, or if a matching node is found.

Match found:

No match found:

Match found:

No match found:

xpath_short, xpath_int, xpath_long

These functions return an integer numeric value, or the value zero if no match is found, or a match is found but the value is non-numeric.
Mathematical operations are supported. In cases where the value overflows the return type, then the maximum value for the type is returned.

No match:

Non-numeric match:

Adding values:

Overflow:

xpath_float, xpath_double, xpath_number

Similar to xpath_short, xpath_int and xpath_long but with floating point semantics. Non-matches result in zero. However,
non-numeric matches result in NaN. Note that xpath_number() is an alias for xpath_double().

No match:

Non-numeric match:

A very large number:

UDAFs

UDTFs

  • No labels