Impala's tests depend on a significant number of test databases that are used by various tests. This page aims to provide an introduction and some tips for working with this test data.
Data Sets
The Impala test data infrastructure has a concept of a data set, which is essentially a collection of tables in a database. A data set can be loaded for a range of different file formats, e.g. uncompressed text, gzip-compressed text, Kudu, snappy-compressed Parquet, etc. Each of the different formats is loaded into a separate database. E.g. the functional data set is loaded into various databases for different file formats: functional_text_gzip, functional_kudu, functional_parquet, etc.
Impala's tests generally depend on the "exhaustive" set of file formats from the functional data set and the "core" set of file formats from the other data sets.
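As a minimal sketch (not the actual Impala loading code), the mapping from a table format string to a per-format database name can be illustrated like this; the `format_to_db` helper is hypothetical, but the resulting names match the functional databases listed above:

```shell
# Hypothetical helper: map a <file_format>/<codec> table format string to the
# database name used for the functional data set. "none" components are dropped.
format_to_db() {
  echo "functional_$(echo "$1" | tr '/' '_' | sed 's/_none//g')"
}
format_to_db "text/gzip"     # functional_text_gzip
format_to_db "parquet/none"  # functional_parquet
```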
Code
Data sets are defined under testdata/datasets.
- Currently there are 4 data sets: functional, tpch, tpcds, and tpcds_partitioned.
- The schema_constraints.csv file in each data set controls which tables are generated for which file formats.
- Check testdata/datasets/README for more details.
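For orientation, constraints in schema_constraints.csv are expressed per table and table format. The line below is an illustrative fragment only (the table name is hypothetical and the exact syntax may differ; testdata/datasets/README is the authoritative reference):

```
table_name:decimal_tbl, constraint:restrict_to, table_format:parquet/none/none
```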
Table schemas are defined in template files like testdata/datasets/functional/functional_schema_template.sql.
testdata/bin/generate-schema-statements.py determines which file formats a data set is generated for based on the workloads in testdata/workloads. It generates SQL files from the template files and writes them to ${IMPALA_DATA_LOADING_SQL_DIR}/${workload} (e.g. ${IMPALA_HOME}/logs/data_loading/sql/tpch). When adding a new table, you might want to check the generated SQL files first. Example command to use generate-schema-statements.py:
# Generate SQL files for Parquet format based on testdata/datasets/functional/functional_schema_template.sql
testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive --table_formats=parquet/none/none
This generates the following SQL files under logs/data_loading/sql/functional/:
- create-functional-query-exhaustive-impala-generated-parquet-none-none.sql
- load-functional-query-exhaustive-impala-generated-parquet-none-none.sql
- load-functional-query-exhaustive-hive-generated-parquet-none-none.sql
- invalidate-functional-query-exhaustive-impala-generated.sql
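The generated file names follow a pattern built from the workload, exploration strategy, generator, and table format. The sketch below simply reconstructs the first file name from the example above; the variable names are illustrative, not part of the script's interface:

```shell
# Compose the name of the generated "create" SQL file from its components
# (workload, exploration strategy, generator, file format, codec, partitioning).
workload="functional-query"
strategy="exhaustive"
fmt="parquet"; codec="none"; part="none"
create_sql="create-${workload}-${strategy}-impala-generated-${fmt}-${codec}-${part}.sql"
echo "$create_sql"  # create-functional-query-exhaustive-impala-generated-parquet-none-none.sql
```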
The entry point for loading test data is bin/load-data.py. It invokes generate-schema-statements.py and then creates and loads the tables.
Incrementally Loading Data
Sometimes while developing it is useful to load a new table or reload a modified table without redoing the whole data load. It is often possible to do incremental loads using bin/load-data.py. Note that the Impala minicluster must be running before executing load-data.py, i.e., execute $IMPALA_HOME/bin/start-impala-cluster.py first.
# Reload a specific table for specific file formats.
# -f forces reloading the table even if it exists.
./bin/load-data.py -f -w functional-query --table_names=decimal_rtf_tiny_tbl --table_formats=text/none,kudu/none --exploration_strategy=exhaustive

# Load any missing tables from the functional data set (which is used by the functional-query workload).
# Omitting -f means that data is not reloaded if the script detects that it is present.
# We specify exhaustive because exhaustive tables are always used for the functional data set.
./bin/load-data.py -w functional-query --exploration_strategy=exhaustive

# Reload all versions of the TPC-H nation table for "core" file formats.
./bin/load-data.py -w tpch --table_names=nation -f