This page tracks proposals from the community for speeding up PR verification in our Jenkins CI system. Speeding up test runs provides a lot of value: it (1) lowers the cost of the CI system, and (2) decreases the amount of time it takes to get feedback on code changes. This page was created to discuss the pros and cons of different test speed improvement approaches, and to capture proposals from the community. It also serves as a call to action: if any community member is interested in DevOps or performance and would like to help us research or implement one of these improvements, please feel free to do so. If you would like to suggest a different approach for speeding up test runs, please add it to this page. Contributions from the community are welcome.
Current Testing Timings
Date: Oct 4, 2018
Time spent in Sanity Check: 1m47s
Time spent in Build: 29m43s
Time spent in Tests: 1h11m29s
Proposals for Speeding up Builds
Use ccache with nvcc
Builds are currently (as of Oct 4, 2018) bottlenecked by Linux GPU compilation. Adding ccache with support for nvcc should dramatically reduce Linux GPU build times. Note: ccache support is already present for ARM / Linux CPU builds, which is why those build times are as low as 50s. This feature is currently WIP at https://github.com/apache/incubator-mxnet/pull/11520
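One common way to wire a compiler into ccache is a thin wrapper placed ahead of the real compiler on the PATH. The sketch below illustrates the idea for nvcc; it assumes a ccache build that understands nvcc and a default CUDA install path, and the linked PR remains the authoritative implementation.

```python
#!/usr/bin/env python
# Hypothetical nvcc wrapper: routes every compiler invocation through
# ccache so that unchanged CUDA sources are served from the cache
# instead of being recompiled.
import subprocess
import sys

REAL_NVCC = "/usr/local/cuda/bin/nvcc"  # assumption: default CUDA location

# ccache hashes the preprocessed input; on a hit it returns the cached
# object file, skipping the (slow) nvcc compilation entirely.
sys.exit(subprocess.call(["ccache", REAL_NVCC] + sys.argv[1:]))
```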
Proposals for Speeding up Tests
Run some python tests in parallel
Many (but not all) Python tests can be run in parallel. Such tests could be annotated and then executed with nose's multiprocess plugin, which should speed up test runs dramatically. For stability reasons, some tests (for example, non-idempotent tests) will still need to run sequentially. We will have to identify all of these tests, erring on the side of caution, and mark them as not safe to run in parallel.
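Here is a minimal sketch using nose's attrib and multiprocess plugins; the 'serial' tag name is an assumption for illustration, not an existing convention in our code base.

```python
import numpy as np
from nose.plugins.attrib import attr

def test_elementwise_add():
    # Idempotent and stateless: safe to run in a parallel worker process.
    a = np.array([1, 2, 3])
    b = np.array([4, 5, 6])
    np.testing.assert_array_equal(a + b, np.array([5, 7, 9]))

@attr('serial')
def test_writes_fixed_path():
    # Hypothetical non-idempotent test: it writes to a fixed path on
    # disk, so two workers running it concurrently would collide.
    with open('/tmp/mxnet_profile_output.json', 'w') as f:
        f.write('{}')

# Invocation sketch:
#   nosetests -a '!serial' --processes=4 --process-timeout=600   # parallel
#   nosetests -a serial                                          # sequential
```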
Move some tests to nightly
On a case-by-case basis, as approved by core developers/committers, we could move some of the integration tests to nightly builds. This could include tests such as test_continuous_profile_and_instant_marker that currently take a lot of resources to run and are unlikely to break, but should still be run periodically to ensure compatibility. Ideally these tests would be replaced by faster-running tests to maintain coverage.
Statically assert expected results rather than dynamically computing them in numpy/scipy
We currently have several long-running numpy/scipy computations that are recalculated on every test run and then used as the reference for asserting correct behavior of MXNet code. This was a good pattern in the past, but it causes issues for GPU tests: GPU instances spend a lot of time running numpy calculations on their relatively slow CPUs just to verify results. We should instead store the results of these calculations in a readable format and assert against the stored values.
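A minimal sketch of the pattern, assuming a fixture file committed to the repository (the path, keys, and the operator under test are illustrative assumptions):

```python
import numpy as np
import mxnet as mx

# One-off generator: run locally, then commit the resulting file.
# The slow reference computation happens here, exactly once.
def generate_fixture(path='tests/data/expected_results.npz'):
    x = np.random.RandomState(0).rand(3, 4)
    np.savez(path, input=x, exp=np.exp(x))

# The CI test only loads and compares -- no heavy CPU work on GPU hosts.
def test_exp_against_stored_results():
    fixture = np.load('tests/data/expected_results.npz')
    out = mx.nd.exp(mx.nd.array(fixture['input'])).asnumpy()
    np.testing.assert_allclose(out, fixture['exp'], rtol=1e-5)
```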
Run tests conditionally
It can be frustrating for developers to make changes that affect one part of the codebase (say documentation, or Python) and then trigger a full regression test of the entire codebase. Ideally we could work backwards from code coverage reports and understand exactly which tests are required to ensure quality for a given code change. This is difficult in MXNet with its wide support of different languages. However, it is likely that some basic heuristic would allow us to cut back on tests in many cases.
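As a sketch of such a heuristic (the suite names and the path-to-suite mapping are assumptions for illustration), note that it errs on the side of running everything whenever a change falls outside the known mappings:

```python
import subprocess

# Hypothetical mapping from changed top-level paths to required suites.
SUITES_FOR_PREFIX = {
    'docs/':          [],                  # doc-only changes: no regression tests
    'python/':        ['unittest-python'],
    'scala-package/': ['unittest-scala'],
}

def suites_for_change(base='origin/master'):
    changed = subprocess.check_output(
        ['git', 'diff', '--name-only', base]).decode().splitlines()
    suites = set()
    for path in changed:
        for prefix, required in SUITES_FOR_PREFIX.items():
            if path.startswith(prefix):
                suites.update(required)
                break
        else:
            return ['full-suite']  # unmapped path: be conservative
    return sorted(suites)
```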
Execute jobs in the correct stage
At the moment, various jobs mix their main task with the creation of prerequisites instead of separating concerns into different stages. Some examples are:
- Doc generation: documentation is compiled during the publish stage instead of the build stage; its duration has increased from 1 min to 9 min (critical path).
- Scala/Julia: MXNet native library is compiled during test stage. Dependencies are downloaded every time. This adds about 5 minutes each.
- R: Dependencies are downloaded every time. This adds about 8 minutes each.
This can be solved by installing dependencies in the Docker install stage (which we are caching) and precompiling during the build stage. This is especially important because CPU-heavy tasks should not be executed on GPU instances.
Speed up Windows slaves
Windows slaves have a high start-up time (about 30 minutes) and are slower at executing tests. Python 3 GPU, for example, takes 28 minutes on Ubuntu while the raw execution time on Windows is 45 minutes. The former can be resolved by keeping a larger warm pool of slaves; the latter points to a performance bottleneck that has to be investigated.
7 Comments
Naveen Swamy
Thanks for profiling and finding concrete action items. Implementing these will greatly help in reducing the PR cycle while reducing the cost of CI.
Marco de Abreu
Thanks for setting up this document! I have added a few ideas and comments.
Kellen Sunderland
Awesome, thanks for the addition. Great that you're already looking at nvccache.
Qing Lan
About "Run tests conditionally": there was a conversation between Marco de Abreu and me about the possibility of creating a Jenkins plugin that finds the folder diff of the code. We can make use of that to create a policy for tests, so some of them can be skipped to save time.
Steffen Rochel
Have we looked at open source or commercial solutions to run tests conditionally? Do we have to invent our own?
Qing Lan
I think we can try some third-party plugins first:
https://wiki.jenkins.io/display/JENKINS/Conditional+BuildStep+Plugin
If these solutions don't work, I think we should invent our own toolkit.
Marco de Abreu
I think we should not underestimate the effort here. Upstream and downstream dependencies like test coverage detection, build dependencies, and test dependencies would have to be remodeled, and the procedure could be error-prone, resulting in possible test misses. We have to consider whether these risks and investments are worth it.