Publishing MXNet-Scala packages to Maven is currently a multi-step manual process that can take 1-2 days, and a new package is published with every Apache MXNet release. Because the process runs frequently, automating it will save manual effort; because it is manual, it can also introduce unintended human errors. In addition, the packages are not built daily today; a daily build published to the SNAPSHOT repository would let users pick up the latest changes without having to build from source.
Another problem is that the current Scala package build relies on dynamic loading of dependencies, so users have to know which versions of the various dependencies the package was built with and reproduce them in their own environment.
Please see the discussion drafted here and share your opinion there as well.
This document describes a step-by-step process for delivering the Scala release to the Apache Maven SNAPSHOT repository, as well as guidance toward a continuous build pipeline for the other languages.
Proposed MXNet package release architecture in CI
In order to keep the design flexible and extensible, the whole build is broken into three stages:
- Dependency stage: install all the dependencies the build requires on the system and set up the links to them before running the backend build. The dependencies and installation methods differ per platform.
- Build stage: build the MXNet backend for the different packages. The difference between packages is the set of build flags enabled for each language.
- Publish stage: this stage runs on a restricted node in CI that provides the credentials needed to publish packages to the public repositories (PyPI and Maven).
The final artifact for every package bundles all dependencies (OpenCV, BLAS) in one place, so users can import the package without installing these dependencies themselves.
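The three stages above could be sketched as CI steps roughly like the following. This is a minimal sketch: the helper script names (`ci/build_deps.sh`, `ci/publish_maven.sh`) and build flags are illustrative assumptions, not the actual pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the three-stage release pipeline.
# Script names and flags are assumptions for illustration only.

stage_deps() {
  # Stage 1: install/build all dependencies for this platform and
  # collect them under deps/ so later stages can link against them.
  ./ci/build_deps.sh --platform linux-x86_64 --out deps/
}

stage_build() {
  # Stage 2: build the MXNet backend with the flags needed by the
  # target language package (here: the Scala package).
  make -j"$(nproc)" USE_OPENCV=1 USE_BLAS=openblas
  make scalapkg
}

stage_publish() {
  # Stage 3: runs only on the restricted CI node that holds the
  # Maven/PyPI credentials.
  ./ci/publish_maven.sh --repo snapshot
}
```

In a real CI configuration each function would map to a separate pipeline stage, so the publish step can be restricted to a credentialed node.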
Step 1: Build the backend and prepare the binary
Currently, the Scala package follows the build process in the installation guide, which produces the compiled MXNet backend (.so) but does not gather its dependencies into one place. To address this, we need to link all dependencies statically when building the backend code and wrap everything up in a single place.
Option 1: Build MXNet backend with dynamic loading of dependencies (what we are using today)
Build the MXNet backend with dynamic loading of the latest versions of all dependencies and package the resulting libmxnet.so in the Scala jar.
This could be the first step in automating the release pipeline; later we can consider one of the two options below.
Option 2: Use the Python pre-built binary
Since MXNet-Python already has a pip package that bundles the dependencies, we can reuse the MXNet binary from the pip package. However, that package is still published from Ubuntu 14.xx; we need to move it to 16.xx and bring it into CI.
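Reusing the Python binary could look roughly like this. The wheel layout (`mxnet/libmxnet.so`) and the target directory are assumptions based on how the pip package currently ships its native library; treat the paths as illustrative.

```shell
# Sketch: pull the prebuilt libmxnet.so out of the mxnet pip wheel.
# The in-wheel path mxnet/libmxnet.so and the scala-package target
# directory are assumptions for illustration.
extract_from_wheel() {
  pip download mxnet --no-deps -d wheel/
  # a wheel is just a zip archive
  unzip -o wheel/mxnet-*.whl 'mxnet/libmxnet.so' -d extracted/
  cp extracted/mxnet/libmxnet.so scala-package/native/
}
```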
Option 3: Build a statically linked MXNet binary for Scala
Build an MXNet library by statically linking its dependencies and package it for use by MXNet-Scala.
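A static-link build could be sketched as below. The dependency list (only OpenBLAS is shown), directory layout, and linker flags are assumptions; `ADD_LDFLAGS` is the MXNet Makefile hook for extra linker options.

```shell
# Sketch of a statically linked backend build (Option 3).
# Paths and the single OpenBLAS dependency are illustrative; a real
# build would repeat step 1 for every dependency (OpenCV, etc.).
build_static() {
  # 1. build the dependency as a position-independent static archive
  (cd deps/openblas && make -j"$(nproc)" NO_SHARED=1 CFLAGS=-fPIC)
  # 2. point the MXNet build at the static archive instead of the
  #    system shared library
  make -j"$(nproc)" USE_BLAS=openblas \
       ADD_LDFLAGS="-L$(pwd)/deps/openblas -Wl,-Bstatic -lopenblas -Wl,-Bdynamic"
  # 3. verify no unwanted shared dependencies remain
  ldd lib/libmxnet.so
}
```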
The pros and cons of bundling all dependencies (as in Options 2 and 3) are listed below.
Pros:
- User does not have to build/install any dependencies.
- No dependency conflicts.
Cons:
- A large jar containing all the dependencies.
- The user has to use the same set of dependencies the package ships with.
Step 2: Make scalapkg and pass the tests
This part is already done and runs on CI.
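For reference, this step drives the existing make targets for the Scala package; a minimal sketch, wrapped in a function so it can be invoked as a CI stage:

```shell
# Build and test the Scala package using the make targets that
# already exist in the MXNet repository.
build_scala() {
  make scalapkg    # assemble the Scala jar
  make scalatest   # run the Scala unit tests
}
```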
Step 3: Automate Maven release and send a demo PR
We need to follow what is explained here.
- Env build stage: install the extra dependencies needed for Maven publishing (gnupg, gnupg2 and gnupg-agent)
- Key generation stage: obtain the signing key (this can be stored in AWS Secrets Manager) and ship it with the package. We will mock a fake key when sending out a PR.
- gpg communication stage: update the gnupg-agent cache TTL
- Maven publish stage: build the package and sign the jar file
These stages can run in a single Ubuntu VM built on CI. The outcome of this step is a runnable CI job that generates the signed package.
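The four stages above could be sketched as shell functions like the following. The secret id `maven-gpg-key`, the TTL value, and the file paths are assumptions for illustration, not the actual configuration.

```shell
# Sketch of the four publish stages; secret name, TTL value and file
# paths are assumptions for illustration.
publish_env() {
  # Env build stage: tooling needed for signing
  sudo apt-get install -y gnupg gnupg2 gnupg-agent
}

publish_key() {
  # Key generation stage: fetch the private signing key from
  # AWS Secrets Manager (secret id is hypothetical)
  aws secretsmanager get-secret-value --secret-id maven-gpg-key \
      --query SecretString --output text | gpg --import
}

publish_gpg_agent() {
  # gpg communication stage: raise the agent cache TTL so mvn can
  # sign repeatedly without re-prompting for the passphrase
  printf 'default-cache-ttl 14400\nmax-cache-ttl 14400\n' \
      >> ~/.gnupg/gpg-agent.conf
  gpg-connect-agent reloadagent /bye
}

publish_maven() {
  # Maven publish stage: build, sign and deploy the jar
  mvn deploy -DskipTests
}
```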
Step 4: Test with CI maintainers to integrate credentials
This step requires the CI maintainers to store the credentials in AWS Secrets Manager and test them through a restricted slave.
We successfully finished Steps 1 to 4. The current pipeline can be found here.
All dependencies are built in the first stage and placed in the deps/ folder. During the second stage, we test the package on six platforms. Finally, we inject the credentials and deploy the package to Maven.
Static build instructions
We provide comprehensive build instructions for all users to build on 14.04.
We provide a license file for all packages we publish.
Change the publish OS (Severe)
As of April 2019, Ubuntu 14.04 is no longer supported by Canonical. Although the package archive stays online, no patches or upgrades will be released. If we continue to rely on it for building MXNet, there are potential security risks when we publish the package, and we would need considerably more changes to keep using the public Ubuntu 14.04 images.
Given this, the plan is to use the next LTS release, 16.04, to publish all of the packages. However, when testing a package built there, Sheng found that the GLIBC version was not compatible with CentOS 7, with the following error:
/lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.23' not found (required by /tmp/mxnet6145590735071079280/libmxnet.so)
GLIBC ships at a fixed version with each system. It cannot easily be upgraded or downgraded, because every package distributed with the system could become unstable. As a result, we may lose CentOS 7 and Amazon Linux support entirely if we build on 16.04. We also cannot statically link GLIBC, since it is LGPL-licensed.
The following is the list of GLIBC versions used by the different systems:

Ubuntu 14.04:
ubuntu@ip-172-31-19-57:~$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6.14) 2.19

Ubuntu 16.04:
ubuntu@ip-172-31-37-210:~$ ldd --version
ldd (Ubuntu GLIBC 2.23-0ubuntu10) 2.23

CentOS 7:
[centos@ip-172-31-13-196 ~]$ ldd --version
ldd (GNU libc) 2.17

Amazon Linux 1:
$ ldd --version
ldd (GNU libc) 2.17

Amazon Linux 2:
$ ldd --version
ldd (GNU libc) 2.26
To solve this issue, I propose the solutions listed below:
Build with different GLIBC
https://www.tldp.org/HOWTO/Glibc2-HOWTO-6.html — It may be worthwhile to configure a specific GLIBC on the system that all builds are based on. This could be the ideal solution, since we could keep an up-to-date system while staying compatible with all the older ones.
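Whichever route we take, we can verify what a given binary actually requires by inspecting its versioned GLIBC symbols, for example:

```shell
# Print the highest GLIBC symbol version a binary requires. If this
# prints GLIBC_2.17 or lower, the binary should load on CentOS 7 and
# Amazon Linux 1 regardless of which OS built it.
max_glibc_required() {
  objdump -T "$1" | grep -o 'GLIBC_[0-9.]*' | sort -uV | tail -1
}
# usage (path is illustrative): max_glibc_required lib/libmxnet.so
```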
Keep using 14.04
As mentioned above, we can keep using 14.04 even though its support life-cycle has ended. Adding the archive repository to the system keeps packages installable with apt-get. A safer approach is a Docker image that contains the fully configured system, so the package build no longer needs apt-get install at all. Moreover, 14.04 should not be used for the publish step itself, since that carries potential security problems; a system that has not reached end of life should be used specifically for publishing. In our case, only the backend build is constrained by the GLIBC version, so we can keep the PyPI and Maven publish steps off 14.04.
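A minimal sketch of the Docker approach, assuming Canonical's old-releases archive hosts the 14.04 packages after end of life; the container and image names are illustrative:

```shell
# Sketch: keep a 14.04 build environment alive inside Docker by
# pointing apt at Canonical's archive of end-of-life releases
# (old-releases.ubuntu.com). Names are illustrative.
setup_trusty_image() {
  docker run --name mxnet-trusty ubuntu:14.04 bash -c '
    sed -i -e "s|archive.ubuntu.com|old-releases.ubuntu.com|g" \
           -e "s|security.ubuntu.com|old-releases.ubuntu.com|g" \
           /etc/apt/sources.list
    apt-get update
  '
  # snapshot the configured system so future builds skip apt-get
  docker commit mxnet-trusty mxnet/build-trusty:latest
}
```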
Using CentOS 7
Since we still need to support CentOS and Amazon Linux, another option is to build on an OS that is still supported; CentOS 7 could be the best one to migrate our build scripts to. However, all of the current GPU build scripts would become unusable, since NVIDIA does not provide corresponding rpm packages; we would need to use NVIDIA Docker for CentOS 7, which only provides a limited set of CUDA versions. Another concern is possible performance and stability differences in the backend, since we would move GLIBC from 2.19 down to 2.17.
List of CUDA versions NVIDIA supports for CentOS 7:
CUDA 10, 9.2, 9.1, 9.0, 8.0, 7.5
Drop support for CentOS 7 and Amazon Linux and keep the 16.04 build
We would still provide build-from-source instructions for users on these two systems.
gcc/gfortran version upgrade (Important)
Currently, we use GCC 4.8 to build all of our dependencies in order to stay compatible with different CUDA versions. However, some newer components, such as Horovod, require GCC 5.0 or above, so we need to make them compatible. There may be unforeseen problems, such as backward-compatibility or stability issues.
We can simply upgrade our GCC version from 4.8 to 5.x to make them compatible.
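On Ubuntu the upgrade could look like this; package names match Ubuntu 16.04 (on 14.04 the toolchain PPA would be needed first), and the alternatives priority is arbitrary:

```shell
# Sketch: install GCC 5 alongside 4.8 and select it system-wide via
# update-alternatives. Package names per Ubuntu 16.04; priority 50
# is arbitrary.
install_gcc5() {
  sudo apt-get install -y gcc-5 g++-5 gfortran-5
  sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 50
  sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-5 50
  gcc --version   # should now report 5.x
}
```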
Static library version control (Improvement)
As Frank Liu discovered here (MXNet build dependencies), we are facing issues with the static libraries. The versions chosen there are questionable and hard to maintain. In addition, some dependencies such as libzmq carry a GPL license, which Apache Legal forbids us from using. We therefore need to find an alternative way to build these dependencies.
For example, we currently use a beta version of lib-turbo, and a non-stable OpenBLAS that should be downgraded to a stable release. We should choose stable releases that give the best performance, and we need to dig in and document the reasons behind our choice of each package version.
There is no ideal way to automate this process; it requires manual checks and benchmarks to choose the best-performing set. We also need to remove the use of libzmq, or consult the legal team about alternatives; that change has to be made on the PS-LITE side.
Number of packages supported (Good to have)
We are currently one of the 'beasts' on PyPI: together with TensorFlow we take over 40% of the total package size. This is due to the support matrix of our packages: we offer many CUDA versions combined with MKL variants as well as multiple Python versions. It is a trade-off between wide version support and a maintenance nightmare, and there is no clear solution yet on whether to reduce the number of packages we publish or keep it as it is. The matrix also grows with every new CUDA release.
- How do we automate the pom file injection?
The GPG publish requires a user-input section in the Maven configuration; we use a Python script to inject these credentials.
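The actual implementation is a Python script; the same idea can be sketched in shell — keep a settings.xml template with placeholders and fill them from environment variables at publish time. The placeholder and variable names are assumptions:

```shell
# Sketch of the credential-injection idea. Template path, placeholder
# tokens and variable names are hypothetical.
inject_credentials() {
  sed -e "s|__OSSRH_USER__|$OSSRH_USER|" \
      -e "s|__OSSRH_PASS__|$OSSRH_PASS|" \
      -e "s|__GPG_PASSPHRASE__|$GPG_PASSPHRASE|" \
      ci/settings.template.xml > ~/.m2/settings.xml
  mvn deploy --settings ~/.m2/settings.xml
}
```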
- What are the license requirements of the various dependencies?
Currently, these licenses are added by this PR.
- How can I test the build procedure?
You can test it by following this instruction.