WIP staging repo - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

Link to dev list discussion

https://lists.apache.org/thread.html/a7a2b27ff2bd3069eecac1b84d6a11e5a1a7845c3d6e90e9b84a1654@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

Lin Yuan (https://github.com/apeforest)

Problem Statement

A deep learning framework like MXNet supports hundreds of operators (~250). Benchmarking and profiling a standard neural network and use case, such as ResNet-50 based image classification, is not sufficient on its own and does not guarantee the health and performance of all the supported operators under different settings (hardware, accelerator, data, etc.). We need an easy-to-use utility to run benchmarks and profile each operator individually. Operator-level benchmarks give us a fine-grained understanding of operator performance under different settings (hardware, accelerator, data, etc.), enable automated CI/CD performance tests, help plan performance optimization tasks, and more. In this document, we present a utility for MXNet operator benchmarking and profiling.

Motivation

A deep learning framework like MXNet supports hundreds of operators (~250). Some operators are used as a layer in a neural network (ex: Conv2D), some work in combination to form a layer (ex: dot, sum => Dense), and many are used independently outside a neural network (ex: tensor creation/shape change/indexing, logical operators), mostly for data processing and tensor manipulation.

Operators are highly heterogeneous w.r.t. supported precisions (fp32, fp64, int64, etc.), accelerators (MKL-DNN, CUDA, cuDNN, MXNet native only), and behavior based on data (ex: broadcast sum behaves differently on a large square (1024, 1024) tensor than on a skewed (10, 10000) tensor). Below are a few areas where we believe operator benchmarks are useful:

  1. Users use many operators that are not part of a standard network like ResNet. Example: tensor manipulation operators like mean, max, topk, argmax, sort etc.
  2. A standard network architecture like ResNet-50 is made up of many operators, ex: Convolution2D, Softmax, Dense, Pooling etc... Observing only the end-to-end performance can hide individual operator regressions for a long time.
  3. We need to know how different operators perform on different hardware infrastructure (ex: CPU with MKLDNN, GPU with NVIDIA CUDA and cuDNN). With these details, we can plan optimization work at the operator level, which could significantly boost end-to-end performance.
  4. Operator behavior varies based on different data load:
    1. For example, MXNet's reduction operations work seamlessly with balanced tensor like (1024, 1024), however, performance behavior changes when the input tensor is skewed (1024, 10). Similar observations can be made when comparing Int32 v/s Int64 indexing of Tensor.
    2. See this issue - #14725 which talks about performance regression in FC layer backward pass with CUDA 10 based on input tensor shape - https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229
  5. We want nightly performance tests across all operators in a deep learning framework to catch regressions early.
  6. We can integrate this framework with a CI/CD system to run per operator performance tests for PRs. Ex: When a PR modifies the kernel of TransposeConv2D, we can run benchmarks of TransposeConv2D operator to verify performance.
  7. Useful insights can be derived to plan for operator performance improvements. Example - Argmax is much slower compared to max operator on a GPU. This is an area that we can work on to improve the performance of Argmax operator.  https://github.com/apache/incubator-mxnet/issues/11337
  8. Benchmarking operator performance in MXNet compared with other deep learning frameworks such as PyTorch. (Not in current scope)

Hence, this utility provides the functionality for users and developers of deep learning frameworks to easily run benchmarks for individual operators across varying settings.

Requirements

  1. Benchmarks for Apache MXNet operators.
  2. Individual operator benchmarks to capture - time for operator execution (speed), memory usage.
  3. Fine grained individual operator benchmarks to capture - time for forward pass, time for backward pass and both.
  4. Ability to run operator benchmarks with default inputs, randomly generated inputs or customize with user specific inputs.
  5. Ability to run operator benchmarks on CPU/GPU with different flavors of MXNet (mxnet-mkl, mxnet-cu90mkl etc.)
  6. Benchmarks for operators with varying inputs to uncover any performance issues due to skewed input data. Ex: Measuring operator performance on small input tensors, large input tensors along with average normally used tensor sizes.
  7. Ability to run one, a group of, or all operator benchmarks.
  8. Ability to extract results in multiple usable formats - Python dictionary, JSON, CSV, MD
  9. Statistics:
    1. Mean, Median, P50, P90, P99
  10. Reproducible tests
  11. Common combinations (fused) of operators, e.g. Conv + Relu, Conv + BatchNorm. (Not in current scope)

Design Tenets

  1. Defaults => Common use cases should be extremely easy, customized complex use cases should be possible.
    1. Example: I should be able to run Add operator benchmarks without specifying any inputs and library should provide benchmarks on valid default inputs. At the same time, as a power user, I should be able to provide my own inputs such as Tensor Shapes and context to run the benchmarks.
  2. Minimum Learning Curve => Keep APIs the same as, or close to, the native NDArray / Gluon operators being benchmarked.
    1. Example: If I am doing benchmarks on nd.add(lhs, rhs) operator, interface in the benchmark utility should be similar with zero learning curve.
  3. Modular and Reusable
  4. For a programmer or an automated system
    1. Example: Developer using the library or integration with CI/CD

Proposed Approach

  1. Provide a generic utility for executing operator benchmarks and performance tests.
    1. This is responsible for creating input tensors of the required shape with a given dtype and context.
    2. Execute the provided operator - forward or forward + backward.
    3. This generic utility will be integrated with MXNet profiler.
    4. Captures the profile output from MXNet profiler - time, memory.
    5. Return a dictionary of results.
  2. Input for the performance tests will be a key/value config.

Below is an example of performance runs for operators. It uses a base utility `run_performance_test`.

"""
MXNet operator performance benchmarks.

NOTE:
1. You can pass a list of input dictionaries to run benchmarks for an operator with different input configurations.
2. Results are a dictionary of time and memory metrics for the benchmark runs.
"""

# Run performance test for Add operator
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float32"}])

# Run performance test for Conv2D operator
results += run_performance_test(F=mx.gluon.nn.Conv2D, ctx=mx.cpu(), warmup=10, runs=50,
                                inputs=[{"data": (32, 3, 256, 256),
                                         "data_initializer": nd.normal,
                                         "channels": 64,
                                         "kernel_size": (3, 3),
                                         "strides": (1, 1),
                                         "padding": (0, 0),
                                         "dilation": (1, 1),
                                         "layout": "NCHW",
                                         "activation": None,
                                         "run_backward": True,
                                         "dtype": "float32"}])

What does the backend profiling utility code look like?

Below we take an example of profiling Add operator.

import mxnet as mx
from mxnet import profiler

# Configurations
warmup = 25
runs = 50
run_backward = True

# Operator to benchmark
F = mx.nd.add

# Prepare data for the operator
lhs = mx.nd.ones(shape=(1024, 1024))
rhs = mx.nd.ones(shape=(1024, 1024))
lhs.attach_grad()
rhs.attach_grad()
mx.nd.waitall()

# Warmup
print("Warming up....")
for _ in range(warmup):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()
print("Done warming up....")

# Run Performance Runs
print("Running performance runs....")
profiler.set_config(profile_all=True, aggregate_stats=True)
# Start Profiler
profiler.set_state('run')
for _ in range(runs):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()

# Stop Profiler 
profiler.set_state('stop')

# Fetch Results from Profiler
# We will add a new API in Profiler - profiler.get_summary(reset=True)
# profiler.get_summary() => Return a JSON string representing the output as shown below.
#                        => Resets all the counter in the current profiler.

print("Done Running performance runs....")
print(profiler.dumps(reset=True))
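
For reference, below is a minimal sketch of how the generic `run_performance_test` utility could wrap the above logic for the NDArray operator case. This is illustrative only: the helper `_prepare_inputs` and the returned result format are assumptions, not the final implementation.

import mxnet as mx
from mxnet import nd, profiler


def _prepare_inputs(config, ctx, dtype):
    """Hypothetical helper: create NDArrays for every tuple-valued (shape) entry in the config."""
    initializer = config.get("initializer", nd.normal)
    run_backward = config.get("run_backward", False)
    tensors = {}
    for name, value in config.items():
        if isinstance(value, tuple):  # treat tuples as tensor shapes
            tensor = initializer(shape=value, ctx=ctx).astype(dtype)
            if run_backward:
                tensor.attach_grad()
            tensors[name] = tensor
    return tensors


def run_performance_test(F, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
    """Run warmup + profiled runs of operator F for each input config; return a list of results."""
    results = []
    for config in inputs:
        dtype = config.get("dtype", "float32")
        run_backward = config.get("run_backward", False)
        tensors = _prepare_inputs(config, ctx, dtype)

        # Warmup runs (not profiled)
        for _ in range(warmup):
            with mx.autograd.record():
                res = F(**tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()

        # Profiled runs - time and memory come from the MXNet profiler
        profiler.set_config(profile_all=True, aggregate_stats=True)
        profiler.set_state('run')
        for _ in range(runs):
            with mx.autograd.record():
                res = F(**tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()
        profiler.set_state('stop')

        # The raw profile would be parsed into time/memory metrics
        # (see the proposed profiler.get_summary API below).
        results.append({"operator": getattr(F, "__name__", str(F)),
                        "profile": profiler.dumps(reset=True)})
    return results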


Pros

  1. No need to write one class per operator to set up a performance test. Whenever a new operator is created, a developer only needs to add a `run_performance_test(..)` call with a list of inputs to run performance tests. A generic utility handles the execution.
  2. Less code, easy to maintain.
  3. More control for users - default inputs, random inputs, specific user defined inputs.
  4. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  5. More accurate benchmark results - time and memory - because we use the MXNet profiler.
  6. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (Majority users using Python interface)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.

Cons

  1. Different operators have different input names. For example, as seen above, the add operator requires tensors named lhs and rhs, whereas the Conv2D operator requires a tensor named data. The base performance executor utility needs to understand this and create tensors appropriately; i.e., with one single executor, generalizing across operators may make the logic complex to manage.
  2. Not easily extensible:
    1. Hard to integrate with property based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.

Addition of new Module

We propose to add this utility as a new module (opperf) under incubator-mxnet/benchmark as "incubator-mxnet/benchmark/opperf". Note that this does not add any user-facing APIs; it is a utility under the incubator-mxnet/benchmark folder for general use by the community.

Addition of new API

We propose to add a new API to MXNet Profiler for easily fetching operator profile for processing programmatically.

1) mxnet.profiler.get_summary(reset=False)

Current Behavior:

Users can either use `mxnet.profiler.dump()` to write the profiler output to a JSON file, or use the `mxnet.profiler.dumps(reset=False)` API to print the summary to the console.

Suggested Addition:

In order to enable easy programmatic usage of the MXNet profiler output, we propose to introduce a new API that returns the summary as a JSON string. This enables users to run the profiler, get the summary output, and perform analysis programmatically.


mxnet.profiler.get_summary(reset=False)
    """Gets the current profiler summary as a JSON string. If reset is True, resets all the
    aggregate statistics collected up to this point, i.e., clears all the profiler counters.

    Parameters
    ----------
    reset : boolean
        If True, resets all profiler statistics collected up to this point.
    """

Output:

We can visualize the output of this API as a JSON representation of the output from the `mxnet.profiler.dumps(reset=False)` API as shown below.

However, please note that the memory profile output below is not the total bytes allocated; the current output from dumps() provides the number of memory allocation calls made.

In the new suggested API, we will be adding an additional summary - Memory => Total Bytes Allocated (Per Device).
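
Assuming the proposed API returns the same information as `dumps()` but as a JSON string, programmatic usage could look like the sketch below; the JSON field names used here are hypothetical, since the exact schema is not yet defined.

import json
from mxnet import profiler

# ... run the profiled operator loop as shown earlier ...

summary = json.loads(profiler.get_summary(reset=True))  # proposed API, not yet in MXNet

# Hypothetical schema: extract per-operator time and memory entries
for op_name, stats in summary.get("operator_summary", {}).items():
    print(op_name, stats.get("total_time_ms"), stats.get("total_bytes_allocated"))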

API / User Experience

We can define 2 types of users of the library and describe API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized Inputs

Use Case 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final result as JSON in the provided file.

python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json or md for markdown file output or csv.
  2. ctx : By default, cpu on CPU machine, gpu(0) on GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

Output for the above benchmark run, on a CPU machine, would look something like below:

{
    "MX_Multiply_Forward_Backward_Time": 0.025911798477172853,
    "MX_Gluon_Imperative_RNN_Forward_Backward_Time": 0.011011338233947754,
    "MX_Gluon_Imperative_MaxPool2D_Forward_Backward_Time": 0.1580966854095459,
    "MX_Gluon_Imperative_Conv1D_Forward_Backward_Time": 0.03413449287414551,
    "MX_Ones_Forward_Time": 0.002405076026916504,
    "MX_Modulo_Forward_Backward_Time": 0.049943366050720216,
    "MX_Subtract_Forward_Backward_Time": 0.01635995864868164,
    "MX_ArgMin_Forward_Backward_Time": 0.01545732021331787,
    "MX_Logical_Xor_Forward_Backward_Time": 0.018084139823913575,
    "MX_Zeros_Like_Forward_Time": 0.0027973604202270507,
    "MX_Inplace_Multiply_Forward_Time": 0.005555639266967774,
    "MX_ArgSort_Forward_Time": 0.13972537994384765,
    "MX_Arange_Forward_Time": 0.00010946273803710938,
........
........
}

Use Case 2 - Power user - Run benchmarks for specific operator

As a power user, let us assume you want to run benchmarks for the Add operator on float64 tensors instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Use Case 2.1 - Customize Inputs for Operators

results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float64"}])

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.025401 seconds

Use Case 3 - Nightly CI Tests

  1. We will maintain a JSON file of expected performance for each operator under "incubator-mxnet/benchmark/opperf".
  2. These expected results are captured on different configuration such as - FP32/64/16, MKL, No MKL, CUDA10, instances (c5.16x, p3.8x).
  3. Run all the operator performance tests and get the results JSON.
  4. Compare with the expected results within a +/- % threshold (see the sketch below).
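
A minimal sketch of the comparison step is shown below, assuming the expected and current result files share the same metric names; the file names and the 10% threshold are illustrative only.

import json

THRESHOLD_PCT = 10  # illustrative tolerance

# Expected results maintained under incubator-mxnet/benchmark/opperf (file name illustrative)
with open("expected_operator_benchmark_results.json") as f:
    expected = json.load(f)
# Results produced by the nightly benchmark run
with open("mxnet_operator_benchmark_results.json") as f:
    current = json.load(f)

regressions = []
for metric, expected_time in expected.items():
    current_time = current.get(metric)
    if current_time is None:
        continue  # operator not run in this configuration
    if current_time > expected_time * (1 + THRESHOLD_PCT / 100.0):
        regressions.append((metric, expected_time, current_time))

for metric, exp, cur in regressions:
    print("Regression in %s: expected %.6f s, got %.6f s" % (metric, exp, cur))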

Development Plan / Milestones

Phase 1

  1. ~150 most commonly used operators will be tested on CPU (with and without MKL), GPU, with FP32 and FP64. See Appendix 1 for the list of operators.
  2. Operators will be tested with NDArray and Gluon interface only i.e., symbol interface is not used for testing owing to plans of deprecation.
  3. Python interface is used along with MXNet profiler.
  4. Time and Memory usage are measured to start with.
  5. Statistics - Mean of the metric.

Phase 2

  1. Cover remaining operators left out from Phase 1.
  2. Add more statistics - p50, p90, p99, min, max.

Phase 3

  1. Explore and add C++ performance tests for the most commonly used operators. This will give measurements closer to the kernel compared to using the Python interface.
  2. Integrate with property based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes and inputs.

Current Status

See this repo for more details - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

  1. 134 operators are supported:
    1. All Gluon Layers - Activation, Loss, Normalization, Basic like Dense, Convolutions, Recurrent (RNN, LSTM, GRU)
    2. NDArray operators like creation, random sampling, arithmetic, logical, comparison etc...
  2. Able to run individual operator benchmarks or use high level drivers to run all tests.
  3. Able to generate results as JSON.
  4. Timing metric - forward only, forward+backward operation.

Alternate Solutions

Alternate Solution 1 - Use Python Classes for each Operator instead of Config

Approach

  1. This benchmark utility will be built on top of MXNet's ND and Gluon interface.
  2. For each operator in ND and Gluon Block, there will be a corresponding Benchmarking operator in the library with a list of default inputs, functionality to process results. See below example for Add operator benchmarks.
  3. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks() etc.
  4. Results can be generated as a python dictionary/JSON/CSV for upstream system (Ex: CI, Automated Performance Monitoring System) consumption.

class Add(MXNetOperatorBenchmarkBase):
    """Helps to Benchmark Tensor Add operation.

    By default, benchmarks both forward and backward element-wise tensor addition
    of a 1024*1024 tensor of precision 'float32'.

    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs, default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs

API / User Experience

We can define 2 types of users of the library and describe API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized inputs

USE CASE 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final result as JSON in the provided file.

python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json or md for markdown file output or csv.
  2. ctx : By default, cpu on CPU machine, gpu(0) on GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library provides drivers to easily run benchmarks on operators of specific categories.

from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

Use Case 3 - Power user - Run benchmarks for specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume you want to run benchmarks on a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with a custom dtype (float64)
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.025405 seconds

NOTE: You can print the input parameters used for a benchmark as shown below.

from mxnet_benchmarks.nd import Add
# Create the Add benchmark with custom inputs and inspect the inputs used
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Pros

  1. More control for users - default inputs, random inputs, specific user defined inputs.
  2. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  3. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (Majority users using Python interface)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.
    4. Ability to run and compare benchmarks from other deep learning frameworks.
  4. Extensible:
    1. Can be integrated with property based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes.

Cons

  1. Need to write base tests for every new operator. If a new operator is added to MXNet, then a new performance test class for the operator needs to be added in this library with default inputs for that new operator to run performance tests.
  2. It is ideal to capture performance close to the kernel. Calls through Python operator APIs may hide performance regressions when the operator computation is small.

Alternate Solution 2 - Autogenerate test with Property Based Testing Technique

(Credits - Thanks to Pedro Larroy for this suggestion)

Approach

  1. Automatically query all operators registered with MXNet engine.
  2. Infer the inputs and outputs for the operators.
  3. Use a property-based testing technique and a library such as Hypothesis to generate random inputs and run the tests (see the illustrative sketch below).
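
For illustration, a property-based sketch using Hypothesis could look like the following; the `benchmark_add` function, the shape bounds, and the example count are assumptions made for this sketch, not part of the proposal.

import mxnet as mx
from hypothesis import given, settings, strategies as st

# Strategy that generates random 2D tensor shapes (bounds are illustrative)
shapes = st.tuples(st.integers(min_value=1, max_value=1024),
                   st.integers(min_value=1, max_value=1024))

@settings(max_examples=20, deadline=None)
@given(shape=shapes)
def benchmark_add(shape):
    # For add, lhs and rhs must have the same (or broadcastable) shape -
    # exactly the kind of per-operator constraint discussed in the cons below.
    lhs = mx.nd.random.normal(shape=shape)
    rhs = mx.nd.random.normal(shape=shape)
    res = mx.nd.add(lhs, rhs)
    mx.nd.waitall()
    # Timing/profiling of this call would be recorded here.

benchmark_add()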

Pros

  1. Any new operator added to MXNet will be automatically queried. Hence, no need to write tests explicitly for every operator.
  2. Inputs are randomly generated. Hence, better suited to capture performance regression on corner cases.

Cons

  1. Non-deterministic inputs. Hence, better suited for functionality testing. It will be hard to use this technique for performance tests.
  2. Still requires us to write many custom strategies or conditional property files. Example:
    1. For testing Add operator, we need to set conditions on input to generate same shapes or broadcastable shapes for lhs and rhs.
    2. For Convolution operator, we need to match Kernel, Padding and other parameter shapes appropriately.
  3. Querying operators and inferring the input conditions may require hard and complex logic.
    1. Example: add is an operator that takes 2 input tensors - lhs, rhs. We need to infer that the lhs and rhs tensors should be of the same size or broadcastable. Logic to handle such conditions may soon become complex enough to negate the advantage of auto-generated operator benchmarks.
    2. MXNet currently does not support a standard way of querying the registered operators. It would be ideal if MXNet could expose NNVM APIs for querying registered operators and their expected inputs, outputs, types and more.
  4. Complex and time consuming. We do not have any operator performance tests for MXNet. It would be ideal to revisit this approach for future enhancement.

Alternate Solution 3 - Extend existing unit tests to cover performance parameters

<To add more details> In summary, it is hard and complex to modify all unit tests to also measure performance, since the current tests are designed towards consistency across contexts, correctness, and gradient checks.

Appendix

Phase 1

Functionality supported:

  1. Able to run benchmarks for critical MXNet operators (see below for the list) with:
    1. Default inputs - Varying input shapes. Ex: Small Tensors, Large Tensors, Skewed Tensor, common tensor shapes.
    2. Custom inputs - Users should be able to specify input tensors to run benchmarks
    3. Random inputs - Randomly generate input tensors for operator benchmarks
  2. Able to save results as JSON or get results as Python Dictionary
  3. Able to run benchmarks on CPU/GPU.

Below are the operators covered in Phase 1

  1. Hardware
    1. CPU - C5.18X
    2. GPU - P3.2X (Single GPU)
  2. NDArray Operations (Precision - FP32)
    1. Conversion Operations: Copy, CopyTo, as_in_context, asnumpy, asscalar, astype
    2. Creation Operations: zeros, zeros_like, ones, ones_like, full, arange
    3. Shape (view) change Operations: Transpose (T), shape_array, size_array, reshape, reshape_like, flatten, expand_dims, split, diag, tile, pad
    4. Reduction Operations: sum, nansum, prod, nanprod, mean, max, min, norm
    5. Sorting and Searching Operations: sort, argsort, topk, argmax, argmin, argmax_channel
    6. Arithmetic Operations: add, sub, neg, mul, div, mod, pow
    7. Inplace Arithmetic Operations: iadd (+=), isub (-=), imul (*=), idiv (/=), imod (%=)
    8. Comparison Operations: lesser, lesser_equal, greater, greater_equal, equal, not_equal
    9. Indexing Operations: get_item (x[i]), set_item (x[i]=), slice, slice_axis, take, batch_take, pick, one_hot
    10. Exponents and Logarithms: exp, log
    11. Powers Operations: sqrt, square
    12. Join and Split Operations: concat, split, stack
    13. GEMM: dot, batch_dot
    14. Random Sampling: normal, poisson, uniform, random, randn, randint, shuffle
    15. Others: clip, where, abs
  3. Gluon Layers (Neural Network Operations) - Mode: Imperative (Hybrid will be added in the next phase)
    1. Basic: Dense, Lambda, Flatten, Embedding, Dropout, BatchNorm
    2. Convolutions: Conv1D, Conv2D, Conv1DTranspose, Conv2DTranspose
    3. Pooling: MaxPool1D, MaxPool2D, AvgPool1D, AvgPool2D, GlobalMaxPool1D, GlobalMaxPool2D, GlobalAvgPool1D, GlobalAvgPool2D
    4. Activations: LeakyRelu, PRelu, Sigmoid, Softmax, Log_Softmax, Activation
    5. Recurrent Cells: RNNCell, LSTMCell, GRUCell, RecurrentCell, SequentialRNNCell, BiDirectionalCell
    6. Loss: L1Loss, L2Loss, SigmoidBinaryCrossEntropyLoss, SoftmaxCrossEntropyLoss, KLDivLoss, HuberLoss, HingeLoss, SquaredHingeLoss, LogisticLoss, TripletLoss, CTCLoss
  4. Custom Operator Benchmark

Phase 2

  1. Other DataTypes
    1. INT64, INT8, FP64, FP16
  2. NDArray Operations
    1. Sparse: tostype
    2. View Operations: swapaxes, flip, depth_to_space, space_to_depth
    3. Rounding Operations: round, rint, fix, floor, ceil, trunc
    4. Trigonometric Operations: sin, cos, tan, arcsin, arccos, arctan, degrees, radians, sinh, cosh, tanh, arcsinh, arccosh, arctanh
    5. Exponent and Logarithmic Operations: expm1, log10, log2, log1p
    6. Logical Operations: logical_and, logical_or, logical_xor, logical_not
    7. Powers Operations: rsqrt, cbrt, rcbrt, reciprocal
    8. Random Operations: exponential, gamma, generalized_negative_binomial, multinomial, negative_binomial
    9. Sequence Operations: SequenceLast, SequenceMask, SequenceReverse
    10. Others: unravel_index, ravel_multi_index
  3. Gluon - Mode: Hybrid mode for all layers covered in Phase 1, plus additional coverage of layers as below
    1. Fused Operators: Conv + Relu, Conv + BatchNorm (more to be added when we start this work)
    2. Basic: HybridLambda, InstanceNorm, LayerNorm
    3. Convolutions: Conv3D, Conv3DTranspose
    4. Pooling: MaxPool3D, AvgPool3D, GlobalMaxPool3D, GlobalAvgPool3D
    5. Activations: Elu, Selu, Swish
    6. Recurrent: ZoneOutCell, ResidualCell, DropoutCell
    7. Loss: CosineEmbeddingLoss, PoissonNLLoss
    8. Important Contrib Layers
  4. Other Items to be explored
    1. Image APIs
    2. Data APIs
    3. Metric APIs
    4. Initializers and Optimizers

Phase 3 (To be discussed/scoped)

Benchmark PyTorch operators to have a neutral baseline for comparing MXNet operator performance. This is yet to be discussed and finalized.

FAQs

Q1) Why not use check_speed(..) utility in MXNet test_util?
A) It supports Symbol APIs only and does not support benchmarking NDArray or Gluon blocks. It is a lightweight, simple symbol executor and expects users to create the symbol graph to execute along with inputs. The proposed library in this document is more sophisticated: it supports benchmarking operators with NDArray and Gluon blocks, provides various default inputs, provides high-level drivers, provides an interface for users to specify different inputs, and prepares results in different formats - Python dictionary, CSV, JSON.

Q2) Why not Symbol execution? Why only NDArray and Gluon?
A) MXNet users are encouraged to use, and mainly use, the NDArray and Gluon APIs. The MXNet community is moving towards deprecating the Symbol APIs with the work on numpy-compatible operators. To measure what our users actually use and observe, we propose to use NDArray and Gluon blocks for benchmarking operators in this library.

Q3) Why not extend/repurpose current MXNet unit tests?

Q4) Why Python? Shouldn't we benchmark operators as close to Kernel as possible i.e., C++ benchmarks?




22 Comments

  1. Good proposal! 

    Several suggestions:

    • Add the compute utilization or memory bandwidth for tested OP in order to know if it is good enough
    • Import the OPs from the given topology so we can compare the runtime of pure OPs 
    • Compared with other frameworks, like pytorch, tensorflow, for the very common OP, like conv2d
    • Add these OP benchmarking into CI to make sure NO performance regression from new PR
    1. Thanks Patric. 

      Compare with other Frameworks => yes, in the plan.

      Integration to nightly CI, PR builds => yes, in the plan.

      compute utilization, memory usage => Will be added in phase 2.

      Can you please provide additional details on what you meant by - "Import the OPs from the given topology so we can compare the runtime of pure OPs "? 



  2. class Add(MXNetOperatorBenchmarkBase):
    I don't see a need for inheritance; all of this should be in the base class and the arguments can be customized.
    1. Agreed. Updated the proposal. Thank you for valuable discussion.

  3. Can we integrate this solution to the profiler ? I think it will be useful to identify the problem when we output as a trace.

    1. We should. Will add more details about it. 

  4. """MXNet operator performance benchmarks. NOTE:1. You can pass list of input dictionary to run benchmarks for an operator with different input configuration.2. Results are dictionary of time, memory for the benchmark runs.""" # Run performance test for Add operatorresults = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50, inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024), "initializer": nd.normal, "run_backward": True, "dtype": "float32"}]) # Run performance test for Conv2D operatorresults += run_performance_test(F=nn.gluon.Conv2D, ctx=mx.cpu(), warmup=10, runs=50, inputs = [{"data": (32, 3, 256, 256), "data_initializer": nd.normal, "channels": 64, "kernel_size": (3, 3), "strides": (1, 1), "padding": (0, 0), "dilation": (1, 1), "layout": "NCHW", "activation": None, "run_backward": True, "dtype": "float32"]}
    Will this capture metrics for both forward and backward operations?
    1. Yes. It captures both forward and backward timing.

      However, users will have an option to say backward=False and expect the benchmark to run forward only. This is mostly usable for operators like "iadd()" which don't support backward.

  5. Great document and initiative Sandeep. One question, how do you plan to measure memory usage? I think is a very difficult problem with many ramifications.

    1. I plan to use MXNet's memory profiler to get the bytes allocated - https://github.com/apache/incubator-mxnet/blob/master/src/profiler/storage_profiler.h#L51

      From my initial observation - It is consistent for a given operation and input across runs.

      Also, I understand it might not be exact true memory used by an operator. However, we can take it as a proxy is my assumption.

  6. With regards to Alternative 2 and property based testing, I think they don't need to be mutually exclusive. The level of abstraction that you propose on Usecase 2.1 looks very good to me and is something one can build on top to automate and abstract further when desired.

  7. mxnet.profiler.get_summary(reset=False)
    Is this not just json loads at the end?
  8. F
    why F? maybe use "op" instead?

    1. Makes sense. F was used with same terminology used in Gluon. But, as you pointed 'op' is better. Will change.

  9. mx.nd.add
    shouldn't this be "F(lhs, rhs)"?
  10. Are there common abstractions that we can utilize for model level benchmarks? it would be good to have a common API and tooling for both operator level benchmark and model level benchmark, or what are your thoughts on this one?

  11. run_all_arithmetic_operations_benchmarks()
    This feature seems to already exist in profiler, as an example
  12. +=
    Does this mean appending to previous result?
  13. Have you thought about having the benchmark loop in native code to avoid python overheads? while I like that the benchmark is driven from python, do you think the call overhead is going to impact the benchmarks too much?

  14. I want to add a note on random input. We would need random shapes, the values of the tensors usually don't have impact on performance excluding corner cases like denormalized floats, unless I'm mistaken.

  15. Summarizing other feedbacks obtained:

    1. We need to make tests reproducible. That is, knowing the seed used for running the benchmarks.
    2. Hypothesis testing listed in alternate solution is very useful for a developer. We should have that support.
    3. Separate profiler.get_summary(reset=False) API into 2 APIs. profiler.get_summary(), profiler.reset()
    4. No need to name a metric. Just return JSON and upstream consumers can use it however they want.
    5. For some operators, the input values impact computation. Ex: short-circuiting on certain inputs v/s the regular computation path. Though the majority of operators do not fall into this category, we need to prepare a list of operators that are impacted and describe how to handle such situations.
    6. It would be good idea to think and document at-least one use case (Ex: Nightly benchmark CI/CD system) using this utility to clarify the interface.