Impala depends on the ORC C++ library to read ORC files. This document describes how to compile Impala against a customized ORC branch, so that you can test Impala's integration with the latest ORC library.

Compile ORC library

Check out your customized ORC branch and compile it with the same compiler that the Impala toolchain uses. Here we use the master branch as an example:

git clone https://github.com/apache/orc.git
cd orc
mkdir build && cd build

# Export CC and CXX to let cmake use Impala's gcc
# Note: IMPALA-9760 changed the toolchain location in Impala 4.0. For earlier versions, use:
#   export CC="${IMPALA_HOME}/toolchain/gcc-${IMPALA_GCC_VERSION}/bin/gcc"
#   export CXX="${IMPALA_HOME}/toolchain/gcc-${IMPALA_GCC_VERSION}/bin/g++"
export CC="${IMPALA_TOOLCHAIN_PACKAGES_HOME}/gcc-${IMPALA_GCC_VERSION}/bin/gcc"
export CXX="${IMPALA_TOOLCHAIN_PACKAGES_HOME}/gcc-${IMPALA_GCC_VERSION}/bin/g++"

# Use Impala's cmake. Don't build the Java library or libhdfspp.
# For the latest command used by the toolchain build, see https://github.com/cloudera/native-toolchain/blob/master/source/orc/build.sh
${IMPALA_TOOLCHAIN_PACKAGES_HOME}/cmake-${IMPALA_CMAKE_VERSION}/bin/cmake .. -DBUILD_JAVA=OFF -DBUILD_LIBHDFSPP=OFF -DINSTALL_VENDORED_LIBS=OFF -DBUILD_SHARED_LIBS=ON
# Then compile in parallel. $(nproc) is the number of available CPU cores.
CFLAGS="-fPIC -DPIC" make -j $(nproc)

# If the build succeeds, the library is at c++/src/liborc.a
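
Before moving on, it can help to confirm that the build produced the two generated files that the next section copies. The following is only a minimal sketch: check_orc_build is a hypothetical helper name, not part of ORC or Impala, and it assumes the build directory layout shown above.

```shell
# Hypothetical helper: check that the build directory contains the static
# library and the generated config header that are copied into the toolchain.
check_orc_build() {
  build_dir="$1"
  status=0
  for f in c++/src/liborc.a c++/include/orc/orc-config.hh; do
    if [ -s "${build_dir}/${f}" ]; then
      echo "found: ${f}"
    else
      echo "missing or empty: ${f}" >&2
      status=1
    fi
  done
  return "${status}"
}

# Usage, from inside the ORC build directory:
#   check_orc_build "$(pwd)"
```

A non-zero return status means at least one of the files is missing, in which case the copy commands in the next section would fail.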

Link Impala with your customized ORC library

Manually replace the ORC library in Impala's toolchain directory with your customized one, then recompile Impala. In the commands below, ${ORC_HOME} is the directory where you cloned the ORC repo.

# Before IMPALA-9760, the location was $IMPALA_HOME/toolchain instead.
cd $IMPALA_TOOLCHAIN_PACKAGES_HOME

# Back up the existing library
cp -r orc-${IMPALA_ORC_VERSION} orc-${IMPALA_ORC_VERSION}-bak
cd orc-${IMPALA_ORC_VERSION}

# Replace the library
cp ${ORC_HOME}/build/c++/src/liborc.a lib/liborc.a
# Replace the header files. Use -rf because include/orc may already contain a subdirectory (e.g. sargs).
rm -rf include/orc/*
cp ${ORC_HOME}/build/c++/include/orc/orc-config.hh include/orc/
cp ${ORC_HOME}/c++/include/orc/*.hh include/orc/
# ORC-751 adds another header subdir 'sargs'. Copy it as well.
cp -r ${ORC_HOME}/c++/include/orc/sargs include/orc/

# Recompile Impala
cd $IMPALA_HOME
make -j $(nproc) impalad
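
To go back to the stock ORC library afterwards, the backup made above can be copied back into place. This is a sketch under the assumption that the backup directory is named orc-${IMPALA_ORC_VERSION}-bak as in the commands above; restore_orc_backup is a hypothetical helper name.

```shell
# Hypothetical helper: replace the customized ORC directory with the backup copy.
# $1 is the toolchain packages directory, $2 is the ORC version string.
restore_orc_backup() {
  toolchain_dir="$1"
  orc_version="$2"
  if [ -z "${toolchain_dir}" ] || [ -z "${orc_version}" ]; then
    echo "usage: restore_orc_backup <toolchain_dir> <orc_version>" >&2
    return 1
  fi
  orc_dir="${toolchain_dir}/orc-${orc_version}"
  backup_dir="${orc_dir}-bak"
  if [ ! -d "${backup_dir}" ]; then
    echo "no backup found at ${backup_dir}" >&2
    return 1
  fi
  # Remove the customized copy, then recreate it from the backup.
  rm -rf "${orc_dir}"
  cp -r "${backup_dir}" "${orc_dir}"
}

# Usage:
#   restore_orc_backup "${IMPALA_TOOLCHAIN_PACKAGES_HOME}" "${IMPALA_ORC_VERSION}"
```

After restoring, recompile Impala as above so that impalad links against the original library again.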

Troubleshooting

1. version GLIBCXX not found

CMake Error at /root/orc/build/protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-build-RELWITHDEBINFO.cmake:49 (message):
  Command failed: 2

   'make'

  See also

    /root/orc/build/protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-build-*.log


make[2]: *** [protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-build] Error 1
make[1]: *** [CMakeFiles/protobuf_ep.dir/all] Error 2
make: *** [all] Error 2

$ cat protobuf_ep-prefix/src/protobuf_ep-stamp/protobuf_ep-build-err.log
./js_embed: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by ./js_embed)
make[5]: *** [/root/orc/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/compiler/js/well_known_types_embed.cc] Error 1
make[5]: *** Deleting file `/root/orc/build/protobuf_ep-prefix/src/protobuf_ep/src/google/protobuf/compiler/js/well_known_types_embed.cc'
make[4]: *** [CMakeFiles/libprotoc.dir/all] Error 2
make[3]: *** [all] Error 2

The build is picking up the system-provided libstdc++.so, but it should use the one in the Impala toolchain. Fix it by adding a toolchain directory that contains libstdc++.so to LD_LIBRARY_PATH, e.g.

export LD_LIBRARY_PATH=/root/Impala/toolchain/toolchain-packages-gcc7.5.0/kudu-f486f0813a/debug/lib
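
To see which libstdc++.so.6 actually provides the required version, one option is to search the library file for the version string. This is a rough sketch: has_glibcxx is a hypothetical helper name, and the gcc lib64 path in the usage comment is an assumption about the toolchain layout that may not match your checkout.

```shell
# Hypothetical helper: succeed if the given library file contains the given
# GLIBCXX version string. grep -a treats the binary file as text.
has_glibcxx() {
  lib="$1"
  version="$2"
  grep -aqF -- "${version}" "${lib}"
}

# Usage (the second path is an assumed location, adjust it to your toolchain):
#   has_glibcxx /lib64/libstdc++.so.6 GLIBCXX_3.4.21 && echo "system lib is new enough"
#   has_glibcxx "${IMPALA_TOOLCHAIN_PACKAGES_HOME}/gcc-${IMPALA_GCC_VERSION}/lib64/libstdc++.so.6" GLIBCXX_3.4.21 && echo "toolchain lib is new enough"
```

A directory whose libstdc++.so.6 passes this check is a candidate for LD_LIBRARY_PATH.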