This page documents how to build Docker containers for the various Impala daemon processes, suitable for deployment in Docker or another system like Kubernetes that can deploy Docker containers. If you want to do Impala development inside a Docker container, see Impala Development Environment inside Docker instead.
Prerequisites
These instructions assume that you have an Impala development environment set up and that you can build Impala and run Impala tests. If not, see "Getting Started" on Impala Home.
You will also need docker installed on your system. I recommend docker-ce on Ubuntu 22.04. See the Jenkins bootstrap script or the Docker documentation for steps to install docker-ce.
Building containers and publishing to your local Docker repo
Building and publishing containers is integrated into the Impala CMake files, with dependencies on all the build artifacts that go into the containers. Building is as simple as:
./buildall.sh -release -skiptests -ninja -noclean -notests
ninja docker_images
You may want to set USE_CDP_HIVE in bin/impala-config-local.sh to build containers with Hive 3 support.
Note: in the future, it would be nice to have a buildall.sh flag that also builds this target.
You can check whether the images are in your local repository with the following command:
$ docker image ls
REPOSITORY            TAG      IMAGE ID       CREATED       SIZE
impala_base           latest   ffb1347a05b7   2 weeks ago   2.91GB
impalad_coord_exec    latest   dbea9fb26675   4 weeks ago   2.39GB
impalad_coordinator   latest   bd213630acce   4 weeks ago   2.39GB
statestored           latest   59372e24a8f7   4 weeks ago   2.39GB
catalogd              latest   d56a5adf86d0   4 weeks ago   2.39GB
impalad_executor      latest   a735455d0c83   4 weeks ago   2.39GB
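If you would rather not scan the listing by eye, the check can be scripted. The following is a minimal sketch, not part of the Impala scripts: the function name missing_images is hypothetical, and the list of expected images is simply copied from the listing above, so adjust it if your build produces a different set.

```shell
# Hypothetical helper: reads repository names on stdin (one per line) and
# prints every expected image name that is absent. Prints nothing if all
# expected images are present.
missing_images() (
  # Assumption: this list mirrors the images shown in the listing above.
  expected="impala_base impalad_coord_exec impalad_coordinator impalad_executor catalogd statestored"
  present="$(cat)"
  for name in $expected; do
    # -x matches whole lines only, -F treats the name as a fixed string.
    if ! printf '%s\n' "$present" | grep -qxF -- "$name"; then
      echo "$name"
    fi
  done
)

# Usage (requires a running Docker daemon):
#   docker image ls --format '{{.Repository}}' | missing_images
```

An empty result means every expected image was found in your local repository.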
Pushing Images to a Repository
./docker/push-images.sh is a script that can push the built images to a docker repository. See that script's help for more information.
Running Dockerized Minicluster
As an initial step, you will need to set up a docker bridge network for the dockerized daemons to communicate over. We have a script ./docker/configure_test_network.sh to automate the setup. See that script for more details. You need to run it with the desired network name as the first argument, as follows:
$ ./docker/configure_test_network.sh impala-cluster-network
Removing existing network 'impala-cluster-network'
Error: No such network: impala-cluster-network
Create network 'impala-cluster-network'
ad24ced93c4b0b3cda288294d0a6bee35e05d9b1e4150b503f2adec325810280
Gateway is '172.18.0.1'
Updating impala-config-local.sh
$ docker network ls
NETWORK ID     NAME                     DRIVER   SCOPE
3a11d08c3503   bridge                   bridge   local
ef2c6011323d   host                     host     local
ad24ced93c4b   impala-cluster-network   bridge   local
a345b0b987e0   none                     null     local
The network setup changes some settings in impala-config-local.sh. You need to regenerate cluster configs and restart your minicluster services before starting Impala so that services like HDFS, HMS, etc. will listen for connections on your new docker bridge network.
. bin/impala-config.sh && ./bin/create-test-configuration.sh && ./testdata/bin/run-all.sh
Once those services are up, you should be able to run dockerized Impala. If you previously had an Impala minicluster running, you must kill any non-dockerized Impala processes so they are not listening on the same ports used by the dockerized daemons. If the cluster starts up successfully, you should be able to run some queries via impala-shell.
start-impala-cluster.py --kill
start-impala-cluster.py --docker_network=impala-cluster-network
impala-shell.sh
Note that querying any existing tables is likely to fail because "localhost" is baked into a lot of metadata. You will need to load or reload data before running end-to-end tests.
You can see the running docker containers with "docker ps":
$ docker ps
CONTAINER ID   IMAGE                COMMAND                  CREATED         STATUS         PORTS                                                                          NAMES
af295db2f818   impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago   Up 2 minutes   0.0.0.0:21002->21000/tcp, 0.0.0.0:21052->21050/tcp, 0.0.0.0:25002->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-2
57011eda4ded   impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago   Up 2 minutes   0.0.0.0:21001->21000/tcp, 0.0.0.0:21051->21050/tcp, 0.0.0.0:25001->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-1
7a3caac8ccc0   impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago   Up 2 minutes   0.0.0.0:21000->21000/tcp, 0.0.0.0:21050->21050/tcp, 0.0.0.0:25000->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-0
e54887a3bc2c   catalogd             "/opt/impala/bin/dae…"   2 minutes ago   Up 2 minutes   0.0.0.0:25020->25020/tcp                                                       impala-test-cluster-impala-cluster-network-catalogd
57cd35033b15   statestored          "/opt/impala/bin/dae…"   2 minutes ago   Up 2 minutes   0.0.0.0:25010->25010/tcp                                                       impala-test-cluster-impala-cluster-network-statestored
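The host ports in the listing above appear to follow a simple pattern: the impalad with index N publishes container ports 21000, 21050, and 25000 on host ports 21000+N, 21050+N, and 25000+N. A minimal sketch of that arithmetic follows. The function name impalad_host_ports is hypothetical and not part of the Impala scripts, and the pattern is inferred from this one listing rather than from start-impala-cluster.py itself.

```shell
# Hypothetical helper: prints the three host ports published for the impalad
# with the given index, in the order they appear in the "docker ps" listing.
# Assumption: host port = container port + index, as observed above.
impalad_host_ports() (
  idx="$1"
  echo "$((21000 + idx)) $((21050 + idx)) $((25000 + idx))"
)

# Usage:
#   impalad_host_ports 1
```

The last number printed is the host port of that impalad's debug webserver, so for index 1 the debug page would be at localhost:25001.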
Running Tests
Automated end-to-end tests are run with this job: https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/. The script at https://github.com/apache/impala/blob/master/bin/jenkins/dockerized-impala-bootstrap-and-test.sh bootstraps Ubuntu from scratch and runs the tests.
Tips for Working with Dockerized Minicluster
- The Impala debug pages are mostly exposed on the same ports as the regular minicluster, i.e. localhost:25000, localhost:25001, localhost:25002, localhost:25010, localhost:25020. This is achieved by mapping the default webserver port inside each container to the desired port outside the container. For example, every impalad exposes its webserver on port 25000 inside its container, and those ports are published on the host as 25000, 25001, and 25002.
- If you want to look at logs or other state in a running container, you can use "docker exec" to run a bash process inside a container (using the name or ID from "docker ps"). Inside the container is a very stripped down Ubuntu environment, so it may be missing many commands you're accustomed to!
docker exec -it impala-test-cluster-impala-cluster-network-catalogd /bin/bash
cat /tmp/catalogd.INFO
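For a quick look at a log without opening an interactive shell, the container name can be built from the network and daemon names. This is a rough sketch under one assumption: that containers are always named impala-test-cluster-<network>-<daemon>, as in the "docker ps" output on this page. The function name container_name is hypothetical.

```shell
# Hypothetical helper: builds a container name from the docker network name
# and the daemon name (e.g. catalogd, statestored, impalad_coord_exec-0).
# Assumption: the naming pattern matches the "docker ps" output shown earlier.
container_name() {
  printf 'impala-test-cluster-%s-%s\n' "$1" "$2"
}

# Usage (requires the dockerized minicluster to be running); "tail" runs on
# the host, so only "cat" has to exist inside the stripped-down container:
#   docker exec "$(container_name impala-cluster-network catalogd)" \
#     cat /tmp/catalogd.INFO | tail -n 50
```

If the name does not match a running container, check the NAMES column of "docker ps" and use that value directly.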
Switching between Dockerized and Non-Dockerized Minicluster
This is not totally streamlined. Here are some rough notes about the issues you may run into:
- HMS and Kudu metadata have hostnames embedded in various places, so if you did data load with the non-dockerized cluster, you will likely not be able to access any tables with a dockerized cluster.
- If you load data with a dockerized cluster, you can generally access tables with the non-dockerized cluster so long as you keep your bridge network around and all of the processes are listening on the bridge network's gateway IP.
- You will likely need to explicitly kill running Impala processes with start-impala-cluster.py --kill to avoid port conflicts.