Hive is widely applied as solution to numerous distinct problem types in the domain of big data. Quite clearly it often used for the ad hoc querying of large datasets. However it is also used to implement ETL type processes. Unlike ad hoc queries, the HQL written for ETLs has some distinct attributes:
- It is executed repeatedly, usually on a schedule.
- It is frequently a large, complex body of code.
- It has much greater longevity, persisting in an organisation’s code base for long periods of time.
- It is frequently modified over time, often in an organic manner to meet changing business requirements.
- It is critical to the operation of the organisation in that it generates valuable data.
Code exhibiting such properties is a strong candidate for unit test coverage as it is very prone to bugs, errors, and accidental breakage, all of which can represent risk to an organisation.
There are a number of challenges posed by both Hive and HQL that can make it difficult to construct suites of unit tests for Hive based ETL systems. These can be broadly described as follows:
- Defining boundaries between components: How can and how should a problem be decomposed into smaller, testable units.
- Harness provision: Providing a local execution environment that seamlessly supports Hive’s features in a local IDE setting (UDFs etc.)
- Speed of execution: The goal is to have large numbers of isolated, small tests. Test isolation requires frequent setup and teardown and the costs incurred are multiplied the number of tests.
By modularising processes implemented using Hive they become easier to test effectively and more resilient to change. Although Hive provides a number of vectors for modularisation it is not always clear how a large process can be decomposed. Features for encapsulation of query logic into components is separated into two perpendicular concerns: column level logic, and set level logic. Column level logic refers to the expressions applied to individual columns or groups of columns in the query, commonly described as ‘functions’. Set level logic concerns HQL constructs that manipulate groupings of data such as: column projection with
GROUP BY aggregates,
ORDER BY sorting, etc. In either case we expect individual components to live in their own source file or deployable artifact and imported as needed by the composition. For HQL based components, the
SOURCE command provides this functionality.
Encapsulation of column level logic
In the case of column level logic Hive provides both UDFs and macros that allow the user to extract and reuse the expressions applied to columns. Once defined, UDFs and Macros can be readily isolated for testing. UDFs can be simply tested with existing Java/Python unit test tools such as JUnit whereas Macros require a Hive command line interface to execute the macro declaration and then exercise it with some sample
Encapsulation of set level logic
Unlike column level logic, it is much less obvious how best to encapsulate and compose collections of set based logic. Consider the following example of a single complex query comprising joins, groupings, and column projections:
This query has a very broad set of responsibilities which cannot be easily verified in isolation. On closer inspection it appears that it is in fact formed of at least 7 distinct queries. To effectively unit test the process that this query represents an approach must be applied that separates and encapsulates each of the subqueries so that they can be tested independently. Possibly approaches to this include:
VIEWs, sequential execution of components with intermediate (possibly
TEMPORARY) tables, and even variable substitution of query fragments.
This section requires some definitive guidance.
The following tools provide test harnesses for the local execution of Hive tests.
- HiveRunner: Test cases are declared using Java, HQL and JUnit and can execute locally in your IDE. This library focuses on ease of use and execution speed. No local Hive/Hadoop installation required. Full test isolations and seamless UDF integration (they need only be on the project classpath).
- beetest: Test cases are declared using HQL and 'expected' data files. Test suites are executed using a script on the command line.
- hive_test: Test cases are declared using Java, HQL and JUnit and can execute locally in your IDE.
- How to utilise the Hive project's internal test framework
Declaration / Conflict of interest
These tools are listed according to level of experience I have with each tool, HiveRunner being the tool with which I have the most exposure. Furthermore, I have committed code to the HiveRunner project.