Impala's tests depend on a large number of test databases. This page introduces the test data and offers some tips for working with it.
Data Sets
The Impala test data infrastructure has a concept of a data set, which is essentially a collection of tables in a database. A data set can be loaded for a range of different file formats, e.g. uncompressed text, gzip-compressed text, Kudu, snappy-compressed Parquet, etc. Each format is loaded into a separate database. For example, the functional data set is loaded into one database per file format: functional_text_gzip, functional_kudu, functional_parquet, etc.
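The naming pattern in these examples can be sketched as a small shell helper. This is only an illustration: db_name_for is a hypothetical function, and the rule it encodes (uncompressed text uses the bare data set name, other formats append the format and, if compressed, the codec) is an assumption inferred from the database names listed above, not taken from Impala's scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a database name from a data set name and a
# "format/compression" pair such as text/gzip or kudu/none.
# Assumed rule (inferred from the example names, not from Impala's code):
#   text/none        -> <dataset>
#   <format>/none    -> <dataset>_<format>
#   <format>/<codec> -> <dataset>_<format>_<codec>
db_name_for() {
  local dataset="$1" table_format="$2"
  local format="${table_format%%/*}"
  local codec="none"
  # Only read a codec if the pair actually contains a slash.
  if [[ "$table_format" == */* ]]; then
    codec="${table_format#*/}"
  fi
  if [ "$format" = "text" ] && [ "$codec" = "none" ]; then
    echo "$dataset"
  elif [ "$codec" = "none" ]; then
    echo "${dataset}_${format}"
  else
    echo "${dataset}_${format}_${codec}"
  fi
}

db_name_for functional text/gzip     # prints functional_text_gzip
db_name_for functional kudu/none     # prints functional_kudu
db_name_for functional parquet/none  # prints functional_parquet
```

If a database name in your environment does not match this pattern, trust the databases that actually exist over this sketch.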
Impala's tests generally depend on the "exhaustive" set of file formats from the functional data set and the "core" set of file formats from the other data sets.
Data sets are defined under testdata/datasets. The schema_constraints.csv file controls which tables are generated for which file formats. testdata/bin/generate-schema-statements.py determines which file formats a data set is generated for based on the workloads in testdata/workloads.
Incrementally Loading Data
Sometimes during development it is useful to load a new table or reload a modified table without redoing the whole data load. Incremental loads are often possible with bin/load-data.py. Note that the Impala minicluster must be running before this script is executed, i.e., run $IMPALA_HOME/bin/start-impala-cluster.py first.
# Reload a specific table for specific file formats.
# -f forces reloading the table even if it exists.
./bin/load-data.py -f -w functional-query --table_names=decimal_rtf_tiny_tbl --table_formats=text/none,kudu/none --exploration_strategy=exhaustive

# Load any missing tables from the functional data set (which is used by the functional-query workload)
# Omitting -f means that data is not reloaded if the script detects that it is present.
# We specify exhaustive because exhaustive tables are always used for the functional data set.
./bin/load-data.py -w functional-query --exploration_strategy=exhaustive

# Reload all versions of the TPC-H nation table for "core" file formats.
./bin/load-data.py -w tpch --table_names=nation -f
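When the same kind of reload is repeated often, it may help to assemble the command line from its parts. The sketch below is a dry run only: build_load_cmd is a hypothetical helper that prints a load-data.py invocation rather than executing it, and it uses only the flags that appear in the examples above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a load-data.py command line without running it.
# Arguments: workload, table names, table formats, exploration strategy, and
# a force marker. Any argument except the workload may be the empty string.
build_load_cmd() {
  local workload="$1" tables="$2" formats="$3" strategy="$4" force="$5"
  local cmd=(./bin/load-data.py -w "$workload")
  if [ -n "$tables" ]; then
    cmd+=("--table_names=$tables")
  fi
  if [ -n "$formats" ]; then
    cmd+=("--table_formats=$formats")
  fi
  if [ -n "$strategy" ]; then
    cmd+=("--exploration_strategy=$strategy")
  fi
  # Any non-empty fifth argument adds -f to force a reload.
  if [ -n "$force" ]; then
    cmd+=(-f)
  fi
  echo "${cmd[*]}"
}

build_load_cmd tpch nation "" "" yes
# prints ./bin/load-data.py -w tpch --table_names=nation -f
```

The printed line can be reviewed and then run by hand from $IMPALA_HOME once the minicluster is up.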