Impala's tests depend on a significant number of test databases that are used by various tests. This page aims to provide an introduction and some tips for working with this test data.

Data Sets

The Impala test data infrastructure has a concept of a data set, which is essentially a collection of tables in a database. A data set can be loaded in a range of different file formats, e.g. uncompressed text, gzip-compressed text, Kudu, and snappy-compressed Parquet. Each file format is loaded into a separate database. For example, the functional data set is loaded into databases such as functional_text_gzip, functional_kudu, and functional_parquet.
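The database naming convention can be sketched with a short shell loop (a minimal illustration; the format suffixes are the examples listed above, not an exhaustive list):

```shell
# Each (data set, file format) pair gets its own database, named
# <dataset>_<format suffix>. The suffixes below are illustrative examples.
dataset=functional
for fmt in text_gzip kudu parquet; do
  db="${dataset}_${fmt}"
  echo "$db"
done
```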

Impala's tests generally depend on the "exhaustive" set of file formats from the functional data set and the "core" set of file formats from the other data sets.

Code

Data sets are defined under testdata/datasets.

  • There are currently four data sets: functional, tpch, tpcds, and tpcds_partitioned.
  • schema_constraints.csv in each dataset controls which tables are generated for which file formats.
  • Check testdata/datasets/README for more details.
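To get a feel for how a constraints file restricts tables to file formats, here is a hypothetical, simplified sample and a query over it. The table names, column layout, and values are invented for illustration; see testdata/datasets/README for the real schema_constraints.csv syntax:

```shell
# Hypothetical, simplified constraints file: each row restricts one table
# to one table format. This is NOT the exact real syntax.
cat > /tmp/constraints.csv <<'EOF'
table_name, constraint, table_format
my_parquet_only_tbl, restrict_to, parquet/none/none
my_kudu_only_tbl, restrict_to, kudu/none/none
EOF

# List the tables restricted to the Parquet format
awk -F', ' '$3 == "parquet/none/none" {print $1}' /tmp/constraints.csv
```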

Table schemas are defined in template files such as testdata/datasets/functional/functional_schema_template.sql.
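The template files define each table in a delimited section. The sketch below is a hypothetical, simplified example of that layout (the section markers and the sample_tbl table are illustrative; consult the real template file for the exact section names and format):

```shell
# Hypothetical sketch of a template-file section. Markers like
# ---- BASE_TABLE_NAME delimit per-table metadata; the table is invented.
cat > /tmp/sample_template.sql <<'EOF'
====
---- DATASET
functional
---- BASE_TABLE_NAME
sample_tbl
---- COLUMNS
id int
name string
====
EOF

# Extract the table names defined in the sample template
awk '/^---- BASE_TABLE_NAME/ {getline; print}' /tmp/sample_template.sql
```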

testdata/bin/generate-schema-statements.py determines which file formats a data set is generated for based on the workloads in testdata/workloads. It generates SQL files based on the template files and puts them in the directory ${IMPALA_DATA_LOADING_SQL_DIR}/${workload} (e.g. ${IMPALA_HOME}/logs/data_loading/sql/tpch). When adding a new table, you might want to check the SQL files first. Example command to use generate-schema-statements.py:

# Generate SQL files for Parquet format based on testdata/datasets/functional/functional_schema_template.sql
testdata/bin/generate-schema-statements.py -w functional-query -e exhaustive --table_formats=parquet/none/none

This command generates the following SQL files under logs/data_loading/sql/functional/:

  • create-functional-query-exhaustive-impala-generated-parquet-none-none.sql
  • load-functional-query-exhaustive-impala-generated-parquet-none-none.sql
  • load-functional-query-exhaustive-hive-generated-parquet-none-none.sql
  • invalidate-functional-query-exhaustive-impala-generated.sql
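As the list above shows, the generated file names encode the workload, the exploration strategy, the generating engine, and the table format triple (e.g. parquet/none/none) with its slashes turned into dashes. A small sketch of how one such name is composed:

```shell
# Compose a generated-SQL file name from its parts. The values come from the
# example generate-schema-statements.py command above.
workload=functional-query
strategy=exhaustive
fmt=$(echo "parquet/none/none" | tr '/' '-')
echo "create-${workload}-${strategy}-impala-generated-${fmt}.sql"
```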

The entry point for loading test data is bin/load-data.py, which invokes generate-schema-statements.py and then creates and loads the tables.

Incrementally Loading Data

Sometimes while developing it is useful to load a new table or reload a modified table without redoing the whole data load. It is often possible to do incremental loads using bin/load-data.py. Note that the Impala minicluster must be running before executing load-data.py, i.e., run $IMPALA_HOME/bin/start-impala-cluster.py first.

# Reload a specific table for specific file formats.
# -f forces reloading the table even if it exists.
./bin/load-data.py -f -w functional-query --table_names=decimal_rtf_tiny_tbl --table_formats=text/none,kudu/none --exploration_strategy=exhaustive


# Load any missing tables from the functional data set (which is used by the functional-query workload)
# Omitting -f means that data is not reloaded if the script detects that it is present.
# We specify exhaustive because exhaustive tables are always used for the functional data set.
./bin/load-data.py -w functional-query --exploration_strategy=exhaustive


# Reload all versions of the TPC-H nation table for "core" file formats.
./bin/load-data.py -w tpch --table_names=nation -f