Motivations

Hive is widely applied as a solution to numerous distinct problem types in the domain of big data. Quite clearly it is often used for the ad hoc querying of large datasets. However it is also used to implement ETL type processes. Unlike ad hoc queries, the Hive SQL written for ETLs has some distinct attributes:

Code exhibiting such properties is a strong candidate for unit test coverage as it is very prone to bugs, errors, and accidental breakage, all of which can represent risk to an organisation.

Challenges

There are a number of challenges posed by both Hive and Hive SQL that can make it difficult to construct suites of unit tests for Hive based ETL systems. These can be broadly described as follows:

Modularisation

By modularising processes implemented using Hive they become easier to test effectively and more resilient to change. Although Hive provides a number of vectors for modularisation it is not always clear how a large process can be decomposed. Features for encapsulation of query logic into components is separated into two perpendicular concerns: column level logic, and set level logic. Column level logic refers to the expressions applied to individual columns or groups of columns in the query, commonly described as ‘functions’. Set level logic concerns Hive SQL constructs that manipulate groupings of data such as: column projection with SELECT, GROUP BY aggregates, JOINs, ORDER BY sorting, etc. In either case we expect individual components to live in their own source file or deployable artifact and imported as needed by the composition. For Hive SQL-based components, the SOURCE command provides this functionality.

Encapsulation of column level logic

In the case of column level logic Hive provides both UDFs and macros that allow the user to extract and reuse the expressions applied to columns. Once defined, UDFs and Macros can be readily isolated for testing. UDFs can be simply tested with existing Java/Python unit test tools such as JUnit whereas Macros require a Hive command line interface to execute the macro declaration and then exercise it with some sample SELECT statements.

Encapsulation of set level logic

Unlike column level logic, it is much less obvious how best to encapsulate and compose collections of set based logic. Consider the following example of a single complex query comprising joins, groupings, and column projections:

SELECT ... FROM (                  -- Query 1
  SELECT ... FROM (                --  Query 2
    SELECT ... FROM (              --   Query 3
      SELECT ... FROM a WHERE ...  --    Query 4
    ) A LEFT JOIN (                --   Query 3
      SELECT ... FROM b            --    Query 5
    ) B ON (...)                   --   Query 3 
  ) ab FULL OUTER JOIN (           --  Query 2
    SELECT ... FROM c WHERE ...    --   Query 6
  ) C ON (...)                     --  Query 2
) abc LEFT JOIN (                  -- Query 1
  SELECT ... FROM d WHERE ...      --  Query 7
) D ON (...)                       -- Query 1
GROUP BY ...;                      -- Query 1

This query has a very broad set of responsibilities which cannot be easily verified in isolation. On closer inspection it appears that it is in fact formed of at least 7 distinct queries. 

To effectively unit test the process that this query represents, VIEWs can be used to separate and encapsulate each of the subqueries so that they can be tested independently. This approach seems to be more effective than using sequential execution of components with intermediate tables (both TEMPORARY and not), due to the performance implications of using intermediate tables; not only do they take longer to run, but more I/O is generated, and in using this approach the query optimization opportunities available to the query planner become limited.

Through limited testing it was also shown that VIEWs do not in fact suffer from performance issues when compared with both a single large query and chaining tables together, as often prophesied; in fact the execution plans and times for VIEWs and monolithic queries were comparable. 

Variable substitution was suggested as an approach for modularizing large queries, but upon inspection it was found to be unsuitable as an additional bash file is required which would make testing more complex. HPL/SQL was also considered, however it does not have the necessary pipelined function feature required for query modularization. 

Tools and frameworks

When constructing tests it is helpful to have a framework that simplifies the declaration and execution of tests. Typically these tools allow the specification of many of the following:

The precise details are of course framework specific, but generally speaking tools manage the full lifecycle of tests by composing the artifacts provided by the developer into a sequence such as:

  1. Configure Hive execution environment.
  2. Setup test input data.
  3. Execute SQL script under test.
  4. Extract data written by the executed script.
  5. Make assertions on the data extracted.

At this time there are are a number of concrete approaches to choose from:

Useful practices

The following Hive specific practices can be used to make processes more amenable to unit testing and assist in the simplification of individual tests.

Relevant issues

Other Hive unit testing concerns

Although not specifically related to Hive SQL, tooling exists for the testing of other aspects of the Hive ecosystem. In particular the BeeJU project provides JUnit rules to simplify the testing of integrations with the Hive Metastore and HiveServer2 services. These are useful, if for example, you are developing alternative data processing frameworks or tools that aim to leverage Hive's metadata features.