This Confluence has been LDAP enabled, if you are an ASF Committer, please use your LDAP Credentials to login. Any problems file an INFRA jira ticket please.

Child pages
  • Unit Testing Hive SQL

Versions Compared


  • This line was added.
  • This line was removed.
  • Formatting was changed.


This query has a very broad set of responsibilities which cannot be easily verified in isolation. On closer inspection it appears that it is in fact formed of at least 7 distinct queries. To  To effectively unit test this process we must the process that this query represents, views can be used to separate and encapsulate set level components that comprise the process we must encapsulate each of the subqueries so that they can be tested independently. This approach seems to be A number of approaches are open to us including:

  • Sequential execution of components with intermediate tables.
  • Views.
  • Variable substitution of query fragments.
  • Functional/procedural SQL approaches.

Limited testing suggests that the VIEW approach is more effective than using sequential execution of components with intermediate tables (both TEMPORARY and not), due to the performance implications of using intermediate tables; not only do they , both in terms of elegance and performance. Intermediate table solutions (including TEMPORARY tables) take longer to run, but generate more I/O is generated, and in using this approach the restrict query optimization opportunities available to the query planner become limited.Through limited testing it was also shown that VIEWs do not in fact . It should also be noted that views do not appear to suffer from performance issues when compared with both a single large query and chaining tables together, as often prophesied; in fact the execution plans and times for views and monolithic queries were comparable. 

Variable substitution was also suggested as an approach for modularizing large queries, but upon inspection it was found to be unsuitable as an additional bash file is required which would make testing more complex. HPL/SQL was also considered, however it does not have the necessary pipelined function feature required for query modularization.