This page documents how to build Docker containers for the various Impala daemon processes, suitable for deployment in Docker or another system like Kubernetes that can deploy Docker containers. If you want to do Impala development inside a Docker container, see Impala Development Environment inside Docker instead.

Prerequisites

These instructions assume that you have an Impala development environment set up and that you can build Impala and run Impala tests. If not, see "Getting Started" on Impala Home.

You will also need Docker installed on your system; docker-ce on Ubuntu 22.04 is recommended. See the Jenkins bootstrap script or the Docker documentation for steps to install docker-ce.

Building containers and publishing to your local Docker repo

Building and publishing containers is integrated into the Impala CMake files, with dependencies on all the build artifacts that go into the containers. Building is as simple as:

./buildall.sh -release -skiptests -ninja -noclean -notests


ninja docker_images

You may want to set USE_CDP_HIVE in bin/impala-config-local.sh to build containers with Hive 3 support.


Note: in the future, it would be useful to have a buildall.sh flag that also builds this target.

You can check whether the images are in your local repository with the following command:

$ docker image ls
REPOSITORY                                                                            TAG                 IMAGE ID            CREATED             SIZE
impala_base                                                                           latest              ffb1347a05b7        2 weeks ago         2.91GB
impalad_coord_exec                                                                    latest              dbea9fb26675        4 weeks ago         2.39GB
impalad_coordinator                                                                   latest              bd213630acce        4 weeks ago         2.39GB
statestored                                                                           latest              59372e24a8f7        4 weeks ago         2.39GB
catalogd                                                                              latest              d56a5adf86d0        4 weeks ago         2.39GB
impalad_executor                                                                      latest              a735455d0c83        4 weeks ago         2.39GB
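If you want to script this check, the following is a minimal sketch. It assumes the six image names shown in the sample listing above are the full set you expect; the function name missing_images is hypothetical and not part of the Impala repository.

```shell
# Image names taken from the sample "docker image ls" listing above (an assumption:
# adjust the list if your build produces a different set).
EXPECTED_IMAGES="impala_base impalad_coord_exec impalad_coordinator impalad_executor catalogd statestored"

# Reads repository names (one per line) on stdin and prints every expected
# image that is absent. Prints nothing when all expected images are present.
missing_images() {
  local present img
  present="$(cat)"
  for img in $EXPECTED_IMAGES; do
    # -x matches the whole line, -F treats the name as a fixed string.
    if ! printf '%s\n' "$present" | grep -qxF -- "$img"; then
      echo "$img"
    fi
  done
}

# Typical use against the local Docker daemon:
#   docker image ls --format '{{.Repository}}' | missing_images
```

An empty result means every expected image was found; any name printed is an image that still needs to be built.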


Pushing Images to a Repository

./docker/push-images.sh is a script that can push the built images to a docker repository. See that script's help for more information.

Running Dockerized Minicluster

As an initial step, you will need to set up a docker bridge network for the dockerized daemons to communicate over. We have a script ./docker/configure_test_network.sh to automate the setup. See that script for more details. You need to run it with the desired network name as the first argument, as follows:

$ ./docker/configure_test_network.sh impala-cluster-network
Removing existing network 'impala-cluster-network'
Error: No such network: impala-cluster-network
Create network 'impala-cluster-network'
ad24ced93c4b0b3cda288294d0a6bee35e05d9b1e4150b503f2adec325810280
Gateway is '172.18.0.1'
Updating impala-config-local.sh

You can then confirm that the network was created:

$ docker network ls
NETWORK ID          NAME                     DRIVER              SCOPE
3a11d08c3503        bridge                   bridge              local
ef2c6011323d        host                     host                local
ad24ced93c4b        impala-cluster-network   bridge              local
a345b0b987e0        none                     null                local
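The gateway IP reported by the script is the address the minicluster services end up listening on, so it can be handy to capture it. The sketch below assumes the script keeps printing a line of the form Gateway is '<ip>', as in the sample output above; the function name gateway_from_output is hypothetical.

```shell
# Extracts the gateway IP from configure_test_network.sh output read on stdin.
# Assumes a line of the form: Gateway is '<ip>' (as in the sample output).
gateway_from_output() {
  sed -n "s/^Gateway is '\(.*\)'\$/\1/p"
}

# Example with a line copied from the sample output:
printf '%s\n' "Gateway is '172.18.0.1'" | gateway_from_output   # prints 172.18.0.1

# Alternative that asks Docker directly instead of parsing script output:
#   docker network inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' impala-cluster-network
```

Asking Docker directly is likely the more robust option, since it does not depend on the wording of the script's messages.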

The network setup changes some settings in impala-config-local.sh. You need to regenerate the cluster configs and restart your minicluster services before starting Impala, so that services such as HDFS and HMS listen for connections on your new docker bridge network.

. bin/impala-config.sh && ./bin/create-test-configuration.sh && ./testdata/bin/run-all.sh

Once those services are up, you should be able to run dockerized Impala. If you previously had an Impala minicluster running, you must kill any non-dockerized Impala processes so that they are not listening on the ports used by the dockerized daemons. If the cluster starts up successfully, you should be able to run some queries via impala-shell.

start-impala-cluster.py --kill
start-impala-cluster.py --docker_network=impala-cluster-network
impala-shell.sh

Note that querying any existing tables is likely to fail because "localhost" is baked into a lot of metadata. You will need to load or reload data before running end-to-end tests.

You can see the running docker containers with "docker ps":

$ docker ps
CONTAINER ID        IMAGE                COMMAND                  CREATED             STATUS              PORTS                                                                          NAMES
af295db2f818        impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago       Up 2 minutes        0.0.0.0:21002->21000/tcp, 0.0.0.0:21052->21050/tcp, 0.0.0.0:25002->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-2
57011eda4ded        impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago       Up 2 minutes        0.0.0.0:21001->21000/tcp, 0.0.0.0:21051->21050/tcp, 0.0.0.0:25001->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-1
7a3caac8ccc0        impalad_coord_exec   "/opt/impala/bin/dae…"   2 minutes ago       Up 2 minutes        0.0.0.0:21000->21000/tcp, 0.0.0.0:21050->21050/tcp, 0.0.0.0:25000->25000/tcp   impala-test-cluster-impala-cluster-network-impalad_coord_exec-0
e54887a3bc2c        catalogd             "/opt/impala/bin/dae…"   2 minutes ago       Up 2 minutes        0.0.0.0:25020->25020/tcp                                                       impala-test-cluster-impala-cluster-network-catalogd
57cd35033b15        statestored          "/opt/impala/bin/dae…"   2 minutes ago       Up 2 minutes        0.0.0.0:25010->25010/tcp                                                       impala-test-cluster-impala-cluster-network-statestored

Running Tests

Automated end-to-end tests are run with this job: https://jenkins.impala.io/job/ubuntu-20.04-dockerised-tests/

The script https://github.com/apache/impala/blob/master/bin/jenkins/dockerized-impala-bootstrap-and-test.sh bootstraps Ubuntu from scratch and runs the tests.

Tips for Working with Dockerized Minicluster

  • The Impala debug pages are mostly exposed on the same ports as the regular minicluster, i.e. localhost:25000, localhost:25001, localhost:25002, localhost:25010, localhost:25020. This is achieved by mapping the default webserver port inside each container to the desired port outside the container. For example, every impalad exposes its webserver on port 25000 inside its container.
  • If you want to look at logs or other state in a running container, you can use "docker exec" to run a bash process inside a container (using the name or ID from "docker ps"). Inside the container is a very stripped down Ubuntu environment, so it may be missing many commands you're accustomed to!
docker exec -it impala-test-cluster-impala-cluster-network-catalogd /bin/bash
cat /tmp/catalogd.INFO
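
The impalad port mapping described in the first tip can be sketched as follows. The pattern (host port = container port + impalad index) is inferred from the sample "docker ps" output on this page, not taken from start-impala-cluster.py itself, and the function name impalad_port_mappings is hypothetical.

```shell
# Prints the host:container port pairs for the impalad with the given
# 0-based index. Assumes the pattern seen in the sample "docker ps" output:
# each host port is the container port plus the impalad index.
impalad_port_mappings() {
  local index="$1"
  local container_port
  for container_port in 21000 21050 25000; do
    echo "$((container_port + index)):${container_port}"
  done
}

# Mappings for impalad_coord_exec-2, one pair per line.
impalad_port_mappings 2
```

For index 2 this yields 21002:21000, 21052:21050 and 25002:25000, which matches the PORTS column for impalad_coord_exec-2 in the listing above, so its debug page is reachable at localhost:25002.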


Switching between Dockerized and Non-Dockerized Minicluster

Switching is not fully streamlined. Here are some rough notes about the issues you may run into:

  • HMS and Kudu metadata have hostnames embedded in various places, so if you loaded data with the non-dockerized cluster, you will likely not be able to access any tables with a dockerized cluster.
  • If you load data with a dockerized cluster, you can generally access tables with the non-dockerized cluster so long as you keep your bridge network around and all of the processes are listening on the bridge network's gateway IP.
  • You will likely need to explicitly kill running Impala processes with start-impala-cluster.py --kill to avoid port conflicts.

