Objective

Facilitate tracking of unit-tests on CloudWatch CI/CD dashboard

Current State

Currently, there is no dashboard that tracks `nosetests` (or similar) runs. As of now, every time the CI pipeline is triggered, unit-test artifacts (nosetest_unittest.xml, nosetest_train.xml, etc.) are generated on the build workers, but nothing further is done with them.

Proposal

Unit test tracking can be achieved in 3 stages:

  1. Upload unittest artifacts to S3
  2. Pull the S3 artifacts into CloudWatch via Lambda
  3. Create a dashboard in CloudWatch

Stage 1: Upload unittest artifacts to S3

Unittest-specific timing information is stored in per-job XML files using the `archiveArtifact` utility (provided by the Jenkins Job DSL plugin: https://jenkinsci.github.io/job-dsl-plugin/).

Uploading artifacts to S3 is provided by the Pipeline: AWS Steps plugin (https://github.com/jenkinsci/pipeline-aws-plugin).
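As a point of reference, a minimal hypothetical snippet combining the two (the standard `archiveArtifacts` Pipeline step is used here for illustration; the bucket name, prefix, and artifact pattern are placeholders, not the production values):

```groovy
// Keep the nosetests XML reports as Jenkins build artifacts...
archiveArtifacts artifacts: 'nosetest_*.xml'
// ...and push the same reports to S3 via the Pipeline: AWS Steps plugin.
s3Upload(bucket: 'example-unittest-artifact-bucket',
         path: 'example/prefix/',
         includePathPattern: 'nosetest_*.xml')
```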

Steps followed:

  1. Add the Pipeline: AWS Steps plugin to the Jenkins CI prod account
  2. Create a bucket in the S3 CI prod account
  3. Handle permissions
    1. Create a write policy specific to this bucket
      Currently, we only want to write to this specific bucket. To keep it secure, only the "PutObject" permission is granted, scoped to this bucket (via its ARN).
    2. Attach the policy to the bucket
    3. Attach the policy to the IAM role

      IAM Role : jenkins_slave_role
      Policy Name : s3-ci-prod-upload-unit-test-artifact
      Action : s3:Write
      Resource (ARN of the S3 bucket) : arn:aws:s3:::mxnet-ci-unittest-artifact-repository/*


  4. Set a global property in the Jenkins console
    1. Jenkins → Manage Jenkins → Configure System → Global Properties (this sets up global environment variables visible in every job)
      Name : Value
      MXNET_CI_UNITTEST_ARTIFACT_BUCKET: mxnet-ci-unittest-artifact-repository

  5. Call S3 upload with the required parameters (a sketch of this wiring follows this list)
    1. Functions:
      1. ci/Jenkinsfile_utils.groovy → collect_test_results_unix()
      2. ci/Jenkinsfile_utils.groovy → collect_test_results_windows()
    2. PR : https://github.com/apache/incubator-mxnet/pull/16336
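
A minimal sketch of how such a helper might wire these pieces together; the real implementation lives in the PR above, and the method name, file placement, and artifact pattern here are illustrative only:

```groovy
// ci/Jenkinsfile_utils.groovy (sketch only, not the actual implementation)
def upload_unittest_artifacts(String artifactPattern) {
    // Bucket name comes from the global property configured in step 4.
    def bucket = env.MXNET_CI_UNITTEST_ARTIFACT_BUCKET
    // Key layout follows the Job / Branch / Build / File structure chosen
    // under "Design Choices" below.
    def prefix = "${env.JOB_NAME}/${env.BRANCH_NAME}/${env.BUILD_NUMBER}"
    // s3Upload is provided by the Pipeline: AWS Steps plugin.
    s3Upload(bucket: bucket,
             path: "${prefix}/",
             includePathPattern: artifactPattern)
}
```

Callers such as `collect_test_results_unix()` / `collect_test_results_windows()` would invoke this after the nosetests reports have been written, e.g. `upload_unittest_artifacts('nosetest_*.xml')`.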


Design Choices

  • S3 Upload will be triggered for every build, every job, every branch.

Choice : Specific branches vs All branches

Decision : All branches

Instead of uploading unittest data to S3 only once per PR (the PR-merge commit), we upload it for every commit on every PR.

Why?
More data uploads → larger sample size → more basis for identifying the cause of a unit-test slowdown.
More data means better granularity and hence more clarity.

Instead of limiting the upload to PRs plus the master branch, we let it run for all branches (leaving any filtering to be done in CloudWatch).

The alternative considered was to upload unittest data only on PR-merge (when a PR is merged into master, i.e. a commit is made to master). The argument for it: tracking every commit of every PR could be infeasible and unstable, whereas PR merges are generally stable and "relevant" to tracking the health of the master build. The every-commit, all-branches approach above was preferred nonetheless, because of the value of more data.

  • S3 Directory structure

We have 2 approaches available:

  1. Job / Branch / Build / File
  2. Branch / Build / Job / File

Decision : Job / Branch / Build / File

The aim is to have per-job metrics; per-PR metrics are not relevant here. Grouping by job ensures that comparable values lie in the same folder.
Basically, individual PR runs act as sample points, while ultimately we rely on job-level metrics. (This is also useful for the CloudWatch dashboard design.)
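For example, under this layout a report from a (hypothetical) job `unix-cpu`, branch `PR-1234`, build `7` would be stored under the key `unix-cpu/PR-1234/7/nosetest_unittest.xml`, so all builds of `unix-cpu` can be compared by listing a single prefix.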

  • Context file

The context file would store context specific to the PR run: user, source branch, target branch, timestamp, etc.
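
A hypothetical sketch of producing such a context file; the field names come from this page, while the file name, environment variables, and serialization are assumptions:

```groovy
// Build the per-run context and write it next to the nosetests reports so it
// can be uploaded to S3 together with them (sketch only).
import groovy.json.JsonOutput

def context = [
    user          : env.CHANGE_AUTHOR,   // PR author (multibranch pipeline env var)
    source_branch : env.CHANGE_BRANCH,   // PR source branch
    target_branch : env.CHANGE_TARGET,   // PR target branch
    timestamp     : System.currentTimeMillis()
]
writeFile file: 'context.json', text: JsonOutput.toJson(context)
```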


FAQ

Q. Why upload to S3 instead of pushing the data directly into CloudWatch?

A. The Jenkins master and slaves are already burdened by the load of running CI pipelines for branches and PRs. Moreover, storing the unit-test data on S3 makes it permanent. Offloading the data to S3 addresses both the storage and the load concerns.

Q. What happens to the S3 upload if the job fails?

A. As it stands, if the job fails, the S3 upload fails as well. We catch the exception, since we don't want the CI job to fail just because data did not get uploaded to S3 (a minimal sketch of this guard follows below). The master build will simply be re-triggered on the next PR merge.
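
A minimal illustration of that guard, reusing the hypothetical helper sketched in Stage 1 (names assumed, not the actual code):

```groovy
try {
    // Attempt the S3 upload of the nosetests reports.
    upload_unittest_artifacts('nosetest_*.xml')
} catch (Exception e) {
    // Log and move on: a failed upload must not fail the CI job itself.
    echo "S3 upload of unittest artifacts failed: ${e}"
}
```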

Stage 2: Pull the S3 artifacts into CloudWatch via Lambda (WIP)

Stage 3: Create a dashboard in CloudWatch (WIP)

