WIP staging repo - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

Link to dev list discussion

https://lists.apache.org/thread.html/a7a2b27ff2bd3069eecac1b84d6a11e5a1a7845c3d6e90e9b84a1654@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

Lin Yuan (https://github.com/apeforest)

Problem Statement

A deep learning framework like MXNet supports hundreds of operators (~250). Benchmarking and profiling a standard neural network and use case, such as ResNet-50 based image classification, is not sufficient on its own and does not guarantee the health and performance of all the supported operators under different settings (hardware, accelerator, data, etc.). We need an easy-to-use utility to run benchmarks and profile each operator individually. Operator-level benchmarks give us a fine-grained understanding of operator performance under different settings (hardware, accelerator, data, etc.), enable automated CI/CD performance tests, help plan performance optimization tasks, and more. In this document, we present a utility for MXNet operator benchmarking and profiling.

Motivation

A deep learning framework like MXNet supports hundreds of operators (~250). Some operators are used as a layer in a neural network (ex: Conv2D), some work in combination to form a layer (ex: dot, sum => Dense), and many are used independently outside a neural network (ex: tensor creation/shape change/indexing, logical operators), mostly for data processing and tensor manipulation.

Operators are highly heterogeneous w.r.t. supported precisions (fp32, fp64, int64, etc.), accelerators (MKL-DNN, CUDA, cuDNN, MXNet native only), and behavior based on data (ex: broadcast sum behaves differently on a large square (1024, 1024) tensor than on a skewed (10, 10000) tensor). Below are a few areas where we believe operator benchmarks are useful:

  1. Users use many operators that are not part of a standard network like ResNet. Example: tensor manipulation operators like mean, max, topk, argmax, sort etc.
  2. A standard network architecture like ResNet-50 is made up of many operators, ex: Convolution2D, Softmax, Dense, Pooling etc... Observing only the end-to-end performance can hide individual operator regressions for a long time.
  3. We need to know how different operators perform on different hardware infrastructure (ex: CPU with MKLDNN, GPU with NVIDIA CUDA and cuDNN). With these details, we can plan optimization work at the operator level, which could significantly boost end-to-end performance.
  4. Operator behavior varies based on different data load:
    1. For example, MXNet's reduction operations work seamlessly with balanced tensor like (1024, 1024), however, performance behavior changes when the input tensor is skewed (1024, 10). Similar observations can be made when comparing Int32 v/s Int64 indexing of Tensor.
    2. See this issue - #14725 which talks about performance regression in FC layer backward pass with CUDA 10 based on input tensor shape - https://github.com/apache/incubator-mxnet/issues/14725#issuecomment-486016229
  5. We want nightly performance tests across all operators in a deep learning framework to catch regressions early.
  6. We can integrate this framework with a CI/CD system to run per operator performance tests for PRs. Ex: When a PR modifies the kernel of TransposeConv2D, we can run benchmarks of TransposeConv2D operator to verify performance.
  7. Useful insights can be derived to plan for operator performance improvements. Example - Argmax is much slower compared to max operator on a GPU. This is an area that we can work on to improve the performance of Argmax operator.  https://github.com/apache/incubator-mxnet/issues/11337
  8. Benchmarking operator performance in MXNet compared with other deep learning frameworks such as PyTorch. (Not in current scope)

Hence, this utility provides the functionality for users and developers of deep learning frameworks to easily run benchmarks for individual operators across varying settings.

Requirements

  1. Benchmarks for Apache MXNet operators.
  2. Individual operator benchmarks to capture - time for operator execution (speed), memory usage.
  3. Fine grained individual operator benchmarks to capture - time for forward pass, time for backward pass and both.
  4. Ability to run operator benchmarks with default inputs, randomly generated inputs or customize with user specific inputs.
  5. Ability to run operator benchmarks on CPU/GPU with different flavors of MXNet (mxnet-mkl, mxnet-cu90mkl etc.)
  6. Benchmarks for operators with varying inputs to uncover any performance issues due to skewed input data. Ex: Measuring operator performance on small input tensors, large input tensors along with average normally used tensor sizes.
  7. Ability to run one, a group of, or all operator benchmarks.
  8. Ability to extract results in multiple usable formats - Python dictionary, JSON, CSV, MD
  9. Statistics:
    1. Mean, Median, P50, P90, P99
  10. Reproducible tests
  11. Common combinations (fused) of operators, e.g. Conv + Relu, Conv + BatchNorm. (Not in current scope)

Design Tenets

  1. Defaults => Common use cases should be extremely easy, customized complex use cases should be possible.
    1. Example: I should be able to run Add operator benchmarks without specifying any inputs and library should provide benchmarks on valid default inputs. At the same time, as a power user, I should be able to provide my own inputs such as Tensor Shapes and context to run the benchmarks.
  2. Minimum Learning Curve => Keep APIs the same as, or close to, the native NDArray / Gluon operators being benchmarked.
    1. Example: If I am doing benchmarks on nd.add(lhs, rhs) operator, interface in the benchmark utility should be similar with zero learning curve.
  3. Modular and Reusable
  4. For a programmer or an automated system
    1. Example: Developer using the library or integration with CI/CD

Proposed Approach

  1. Provide a generic utility for executing operator benchmarks and performance tests.
    1. This is responsible for creating input tensors of the required shape with a given dtype and context.
    2. Execute the provided operator - forward or forward + backward.
    3. This generic utility will be integrated with MXNet profiler.
    4. Captures the profile output from MXNet profiler - time, memory.
    5. Return a dictionary of results.
  2. Input for the performance tests will be a key/value config.

Below is an example of performance runs for operators. It uses a base utility `run_performance_test`.

"""
MXNet operator performance benchmarks.

NOTE:
1. You can pass a list of input dictionaries to run benchmarks for an operator with different input configurations.
2. Results are a dictionary of time and memory metrics for the benchmark runs.
"""

# Run performance test for Add operator
results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float32"}])

# Run performance test for Conv2D operator
results += run_performance_test(F=mx.gluon.nn.Conv2D, ctx=mx.cpu(), warmup=10, runs=50,
                                inputs=[{"data": (32, 3, 256, 256),
                                         "data_initializer": nd.normal,
                                         "channels": 64,
                                         "kernel_size": (3, 3),
                                         "strides": (1, 1),
                                         "padding": (0, 0),
                                         "dilation": (1, 1),
                                         "layout": "NCHW",
                                         "activation": None,
                                         "run_backward": True,
                                         "dtype": "float32"}])

What does the backend profiling utility code look like?

Below we take an example of profiling Add operator.

import mxnet as mx
from mxnet import profiler

# Configurations
warmup = 25
runs = 50
run_backward = True

# Operator to benchmark
F = mx.nd.add

# Prepare data for the operator
lhs = mx.nd.ones(shape=(1024, 1024))
rhs = mx.nd.ones(shape=(1024, 1024))
lhs.attach_grad()
rhs.attach_grad()
mx.nd.waitall()

# Warmup
print("Warming up....")
for _ in range(warmup):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()
print("Done warming up....")

# Run Performance Runs
print("Running performance runs....")
profiler.set_config(profile_all=True, aggregate_stats=True)
# Start Profiler
profiler.set_state('run')
for _ in range(runs):
    with mx.autograd.record():
        res = mx.nd.add(lhs, rhs)
    res.backward()
    mx.nd.waitall()

# Stop Profiler 
profiler.set_state('stop')

# Fetch Results from Profiler
# We will add a new API in Profiler - profiler.get_summary(reset=True)
# profiler.get_summary() => Return a JSON string representing the output as shown below.
#                        => Resets all the counter in the current profiler.

print("Done Running performance runs....")
print(profiler.dumps(reset=True))
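
For reference, below is a minimal sketch of how the generic `run_performance_test` utility could wrap the above logic for the NDArray operator case. This is illustrative only: the helper `_prepare_inputs` and the returned result format are assumptions, not the final implementation.

import mxnet as mx
from mxnet import nd, profiler


def _prepare_inputs(config, ctx, dtype):
    """Hypothetical helper: create NDArrays for every tuple-valued (shape) entry in the config."""
    initializer = config.get("initializer", nd.normal)
    run_backward = config.get("run_backward", False)
    tensors = {}
    for name, value in config.items():
        if isinstance(value, tuple):  # treat tuples as tensor shapes
            tensor = initializer(shape=value, ctx=ctx).astype(dtype)
            if run_backward:
                tensor.attach_grad()
            tensors[name] = tensor
    return tensors


def run_performance_test(F, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
    """Run warmup + profiled runs of operator F for each input config; return a list of results."""
    results = []
    for config in inputs:
        dtype = config.get("dtype", "float32")
        run_backward = config.get("run_backward", False)
        tensors = _prepare_inputs(config, ctx, dtype)

        # Warmup runs (not profiled)
        for _ in range(warmup):
            with mx.autograd.record():
                res = F(**tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()

        # Profiled runs - time and memory come from the MXNet profiler
        profiler.set_config(profile_all=True, aggregate_stats=True)
        profiler.set_state('run')
        for _ in range(runs):
            with mx.autograd.record():
                res = F(**tensors)
            if run_backward:
                res.backward()
            mx.nd.waitall()
        profiler.set_state('stop')

        # The raw profile would be parsed into time/memory metrics
        # (see the proposed profiler.get_summary API below).
        results.append({"operator": getattr(F, "__name__", str(F)),
                        "profile": profiler.dumps(reset=True)})
    return results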


Pros

  1. No need to write one class per operator to set up a performance test. Whenever a new operator is created, a developer only needs to add a `run_performance_test(..)` call with a list of inputs to run performance tests. A generic utility handles the execution.
  2. Less code, easy to maintain.
  3. More control for users - default inputs, random inputs, specific user defined inputs.
  4. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  5. More accurate benchmark results - time and memory - because we use the MXNet profiler.
  6. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (Majority users using Python interface)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.

Cons

  1. Different operators have different input names. For example, as seen above, the add operator requires tensors named lhs and rhs, whereas the Conv2D operator requires a tensor named data. The base performance executor utility needs to understand this and create tensors appropriately; i.e., with one single executor, generalizing across operators may make the logic complex to manage.
  2. Not easily extensible:
    1. Hard to integrate with property based testing libraries like Hypothesis to randomly generate test cases with different tensor shapes.

Addition of new Module

We propose to add this utility as a new module (opperf) under incubator-mxnet/benchmark as "incubator-mxnet/benchmark/opperf". Note that this does not add any user-facing APIs; it is a utility under the incubator-mxnet/benchmark folder for general use by the community.

Addition of new API

We propose to add a new API to MXNet Profiler for easily fetching operator profile for processing programmatically.

1) mxnet.profiler.get_summary(reset=False)

Current Behavior:

Users can either use `mxnet.profiler.dump()` to write the profiler output to a JSON file, or use the `mxnet.profiler.dumps(reset=False)` API to print the summary to the console.

Suggested Addition:

In order to enable easy programmatic usage of the MXNet profiler output, we propose to introduce a new API that returns the summary as a JSON string. This enables users to run the profiler, get the summary output, and perform analysis programmatically.


mxnet.profiler.get_summary(reset=False)
    """Gets the current profiler summary as a JSON string. If reset is True, resets all the
    aggregate statistics collected up to this point, i.e., clears all the profiler counters.

    Parameters
    ----------
    reset : boolean
        If True, resets all profiler statistics collected up to this point.
    """

Output:

We can visualize the output of this API as a JSON representation of the output from the `mxnet.profiler.dumps(reset=False)` API as shown below.

However, please note that the memory profile output below is not the total bytes allocated; the current output from dumps() provides the number of memory allocation calls made.

In the new suggested API, we will be adding an additional summary - Memory => Total Bytes Allocated (Per Device).
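
Assuming the proposed API returns the same information as `dumps()` but as a JSON string, programmatic usage could look like the sketch below; the JSON field names used here are hypothetical, since the exact schema is not yet defined.

import json
from mxnet import profiler

# ... run the profiled operator loop as shown earlier ...

summary = json.loads(profiler.get_summary(reset=True))  # proposed API, not yet in MXNet

# Hypothetical schema: extract per-operator time and memory entries
for op_name, stats in summary.get("operator_summary", {}).items():
    print(op_name, stats.get("total_time_ms"), stats.get("total_bytes_allocated"))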

API / User Experience

We can define 2 types of users of the library and describe API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized Inputs

Use Case 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final result as JSON in the provided file.

python incubator-mxnet/benchmark/opperf/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json or md for markdown file output or csv.
  2. ctx : By default, cpu on CPU machine, gpu(0) on GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

Output for the above benchmark run, on a CPU machine, would look something like below:

{
    "MX_Multiply_Forward_Backward_Time": 0.025911798477172853,
    "MX_Gluon_Imperative_RNN_Forward_Backward_Time": 0.011011338233947754,
    "MX_Gluon_Imperative_MaxPool2D_Forward_Backward_Time": 0.1580966854095459,
    "MX_Gluon_Imperative_Conv1D_Forward_Backward_Time": 0.03413449287414551,
    "MX_Ones_Forward_Time": 0.002405076026916504,
    "MX_Modulo_Forward_Backward_Time": 0.049943366050720216,
    "MX_Subtract_Forward_Backward_Time": 0.01635995864868164,
    "MX_ArgMin_Forward_Backward_Time": 0.01545732021331787,
    "MX_Logical_Xor_Forward_Backward_Time": 0.018084139823913575,
    "MX_Zeros_Like_Forward_Time": 0.0027973604202270507,
    "MX_Inplace_Multiply_Forward_Time": 0.005555639266967774,
    "MX_ArgSort_Forward_Time": 0.13972537994384765,
    "MX_Arange_Forward_Time": 0.00010946273803710938,
........
........
}

Use Case 2 - Power user - Run benchmarks for specific operator

As a power user, let us assume you want to run benchmarks for the Add operator on float64 tensors instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

Use Case 2.1 - Customize Inputs for Operators

results = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50,
                               inputs=[{"lhs": (1024, 1024),
                                        "rhs": (1024, 1024),
                                        "initializer": nd.normal,
                                        "run_backward": True,
                                        "dtype": "float64"}])

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.025401 seconds

Use Case 3 - Nightly CI Tests

  1. We will maintain a JSON file of expected performance for each operator under "incubator-mxnet/benchmark/opperf".
  2. These expected results are captured on different configuration such as - FP32/64/16, MKL, No MKL, CUDA10, instances (c5.16x, p3.8x).
  3. Run all the operator performance tests and get the results JSON.
  4. Compare with the expected results within a +/- % threshold (see the sketch below).
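
A minimal sketch of the comparison step is shown below, assuming the expected and current result files share the same metric names; the file names and the 10% threshold are illustrative only.

import json

THRESHOLD_PCT = 10  # illustrative tolerance

# Expected results maintained under incubator-mxnet/benchmark/opperf (file name illustrative)
with open("expected_operator_benchmark_results.json") as f:
    expected = json.load(f)
# Results produced by the nightly benchmark run
with open("mxnet_operator_benchmark_results.json") as f:
    current = json.load(f)

regressions = []
for metric, expected_time in expected.items():
    current_time = current.get(metric)
    if current_time is None:
        continue  # operator not run in this configuration
    if current_time > expected_time * (1 + THRESHOLD_PCT / 100.0):
        regressions.append((metric, expected_time, current_time))

for metric, exp, cur in regressions:
    print("Regression in %s: expected %.6f s, got %.6f s" % (metric, exp, cur))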

Development Plan / Milestones

Phase 1

  1. ~150 most commonly used operators will be tested on CPU (with and without MKL), GPU, with FP32 and FP64. See Appendix 1 for the list of operators.
  2. Operators will be tested with NDArray and Gluon interface only i.e., symbol interface is not used for testing owing to plans of deprecation.
  3. Python interface is used along with MXNet profiler.
  4. Time and Memory usage are measured to start with.
  5. Statistics - Mean of the metric.

Phase 2

  1. Cover remaining operators left out from Phase 1.
  2. Add more statistics - p50, p90, p99, min, max.

Phase 3

  1. Explore and add C++ performance tests for the most commonly used operators. This will give measurements closer to the kernel compared to using the Python interface.
  2. Integrate with property based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes and inputs.

Current Status

See this repo for more details - https://github.com/sandeep-krishnamurthy/dl-operator-benchmark

  1. 134 operators are supported:
    1. All Gluon Layers - Activation, Loss, Normalization, Basic like Dense, Convolutions, Recurrent (RNN, LSTM, GRU)
    2. NDArray operators like creation, random sampling, arithmetic, logical, comparison etc...
  2. Able to run individual operator benchmarks or use high level drivers to run all tests.
  3. Able to generate results as JSON.
  4. Timing metric - forward only, forward+backward operation.

Alternate Solutions

Alternate Solution 1 - Use Python Classes for each Operator instead of Config

Approach

  1. This benchmark utility will be built on top of MXNet's ND and Gluon interface.
  2. For each operator in ND and Gluon Block, there will be a corresponding Benchmarking operator in the library with a list of default inputs, functionality to process results. See below example for Add operator benchmarks.
  3. High-level drivers are provided to run operator benchmarks in bulk. Example: run_all_mxnet_operator_benchmarks(), run_all_arithmetic_operations_benchmarks() etc.
  4. Results can be generated as a python dictionary/JSON/CSV for upstream system (Ex: CI, Automated Performance Monitoring System) consumption.

class Add(MXNetOperatorBenchmarkBase):
    """Helps to Benchmark Tensor Add operation.

    By default, benchmarks both forward and backward element-wise tensor addition
    of a 1024*1024 tensor of precision 'float32'.

    """

    def __init__(self, ctx=mx.cpu(), warmup=10, runs=50, inputs=None):
        # Set the default Inputs
        default_parameters = {"lhs": (1024, 1024),
                              "rhs": (1024, 1024),
                              "initializer": nd.normal,
                              "run_backward": True,
                              "dtype": "float32"}

        super().__init__(ctx=ctx, warmup=warmup, runs=runs, default_parameters=default_parameters,
                         custom_parameters=inputs)

        self.lhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["lhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])
        self.rhs = get_mx_ndarray(ctx=self.ctx, in_tensor=self.inputs["rhs"],
                                  dtype=self.inputs["dtype"],
                                  initializer=self.inputs["initializer"],
                                  attach_grad=self.inputs["run_backward"])

    def run_benchmark(self):
        # Warm up, ignore execution time value
        _, _ = nd_forward_backward_and_time(F=nd.add, runs=self.warmup, lhs=self.lhs, rhs=self.rhs)
        # Run Benchmarks
        exe_time, _ = nd_forward_backward_and_time(F=nd.add, runs=self.runs, lhs=self.lhs, rhs=self.rhs)

        self.results["MX_Add_Forward_Backward_Time"] = exe_time / self.runs

API / User Experience

We can define 2 types of users of the library and describe API interface for each of these users.

  1. General User, Automated Nightly tests
    1. Run benchmarks on all the operators or on specific categories of operators. Use default inputs provided by the library.
  2. Power User, PR validation tests
    1. Run benchmark with customized inputs

USE CASE 1 - Run benchmarks for all the operators

A driver runs all the MXNet operator (NDArray and Gluon) benchmarks with default inputs and saves the final result as JSON in the provided file.

python dl-operator-benchmark/run_all_mxnet_operator_benchmarks.py --output-format json --output-file mxnet_operator_benchmark_results.json

Other Driver Script CLI Options:

  1. output-format : json or md for markdown file output or csv.
  2. ctx : By default, cpu on CPU machine, gpu(0) on GPU machine. You can override and set the global context for all operator benchmarks. Example: --ctx gpu(2).
  3. dtype : By default, float32. You can override and set the global dtype for all operator benchmarks. Example: --dtype float64.

USE CASE 2 - Run benchmarks for all the operators in a specific category

For example, if you want to run benchmarks for all NDArray arithmetic operators, the library provides drivers to easily run benchmarks on operators of specific categories.

from mxnet_benchmarks.nd import run_all_arithmetic_operations_benchmarks
# Run all Arithmetic operations benchmarks with default input values
run_all_arithmetic_operations_benchmarks()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds
MX_Multiply_Forward_Backward_Time - 0.021678 seconds
MX_Subtract_Forward_Backward_Time - 0.016154 seconds
MX_Divide_Forward_Backward_Time - 0.024327 seconds
MX_Modulo_Forward_Backward_Time - 0.045726 seconds
MX_Power_Forward_Backward_Time - 0.077152 seconds
MX_Negative_Forward_Backward_Time - 0.014472 seconds
MX_Inplace_Add_Forward_Time - 0.003824 seconds
MX_Inplace_Subtract_Forward_Time - 0.004137 seconds
MX_Inplace_Multiply_Forward_Time - 0.006589 seconds
MX_Inplace_Division_Forward_Time - 0.003869 seconds
MX_Inplace_Modulo_Forward_Time - 0.018180 seconds

Use Case 3 - Power user - Run benchmarks for specific operator

As a power user, if you want to run benchmarks for the nd.add operator in MXNet, you just run the following Python script.
Note that we maintain the same name and spec as the underlying MXNet operator. For example, to benchmark nd.add, we use mxnet_benchmarks.nd.Add().

USE CASE 3.1 - Default Inputs for Operators

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with default input values
add_benchmark = Add()
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.015201 seconds

USE CASE 3.2 - Customize Inputs for Operators

As a power user, let us assume you want to run benchmarks on a float64 tensor instead of the default float32.
NOTE: Similarly, you could also specify the input tensors to use for benchmarking.

from mxnet_benchmarks.nd import Add
# Run the Add operator benchmark with a custom dtype (float64)
add_benchmark = Add(inputs={"dtype": "float64"})
add_benchmark.run_benchmark()
add_benchmark.print_benchmark_results()

Output for the above benchmark run, on a CPU machine, would look something like below:

MX_Add_Forward_Backward_Time - 0.025405 seconds

NOTE: You can print the input parameters used for a benchmark as shown below.

from mxnet_benchmarks.nd import Add
# Create the Add benchmark with custom inputs and inspect the inputs used
add_benchmark = Add(inputs={"dtype": "float64"})
print(add_benchmark.inputs)

Output


{'lhs': (1024, 1024), 'rhs': (1024, 1024), 'initializer': <function normal at 0x117b607b8>, 'run_backward': True, 'dtype': 'float64'}

Pros

  1. More control for users - default inputs, random inputs, specific user defined inputs.
  2. Deterministic and better suited for performance benchmarks, reproducibility and CI integration.
  3. With Python interface:
    1. Easy to maintain and develop.
    2. Reflects the performance as seen by the users. (Majority users using Python interface)
    3. Fastest way to get performance tests in place. We do not have any tests in place as of today.
    4. Ability to run and compare benchmarks from other deep learning frameworks.
  4. Extensible:
    1. Can be integrated with property based testing libraries like Hypothesis, to randomly generate test cases with different tensor shapes.

Cons

  1. Need to write base tests for every new operator. If a new operator is added to MXNet, then a new performance test class for the operator needs to be added in this library with default inputs for that new operator to run performance tests.
  2. It is ideal to capture performance close to the kernel. Calls through Python operator APIs may hide performance regressions when the operator computation is small.

Alternate Solution 2 - Autogenerate test with Property Based Testing Technique

(Credits - Thanks to Pedro Larroy for this suggestion)

Approach

  1. Automatically query all operators registered with MXNet engine.
  2. Infer the inputs and outputs for the operators.
  3. Use a property-based testing technique and a library such as Hypothesis to generate random inputs and run the tests (see the illustrative sketch below).
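
For illustration, a property-based sketch using Hypothesis could look like the following; the `benchmark_add` function, the shape bounds, and the example count are assumptions made for this sketch, not part of the proposal.

import mxnet as mx
from hypothesis import given, settings, strategies as st

# Strategy that generates random 2D tensor shapes (bounds are illustrative)
shapes = st.tuples(st.integers(min_value=1, max_value=1024),
                   st.integers(min_value=1, max_value=1024))

@settings(max_examples=20, deadline=None)
@given(shape=shapes)
def benchmark_add(shape):
    # For add, lhs and rhs must have the same (or broadcastable) shape -
    # exactly the kind of per-operator constraint discussed in the cons below.
    lhs = mx.nd.random.normal(shape=shape)
    rhs = mx.nd.random.normal(shape=shape)
    res = mx.nd.add(lhs, rhs)
    mx.nd.waitall()
    # Timing/profiling of this call would be recorded here.

benchmark_add()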

Pros

  1. Any new operator added to MXNet will be automatically queried. Hence, no need to write tests explicitly for every operator.
  2. Inputs are randomly generated. Hence, better suited to capture performance regression on corner cases.

Cons

  1. Non-deterministic inputs. Hence, better suited for functionality testing. It will be hard to use this technique for performance tests.
  2. Still requires us to write many custom strategies or conditional property files. Example:
    1. For testing Add operator, we need to set conditions on input to generate same shapes or broadcastable shapes for lhs and rhs.
    2. For Convolution operator, we need to match Kernel, Padding and other parameter shapes appropriately.
  3. Querying operators and inferring the input conditions may require hard and complex logic.
    1. Example: add is an operator that takes 2 input tensors - lhs, rhs. We need to infer that the lhs and rhs tensors should be of the same size or broadcastable. Logic to handle such conditions may soon become complex enough to negate the advantage of auto-generated operator benchmarks.
    2. MXNet currently does not support a standard way of querying the registered operators. It would be ideal if MXNet could expose NNVM APIs for querying registered operators and their expected inputs, outputs, types and more.
  4. Complex and time consuming. We do not have any operator performance tests for MXNet. It would be ideal to revisit this approach for future enhancement.

Alternate Solution 3 - Extend existing unit tests to cover performance parameters

<To add more details> In summary, it is hard and complex to modify all unit tests to also measure performance, since the current tests are designed towards consistency across contexts, correctness, and gradient checks.

Appendix

Phase 1

Functionality supported:

  1. Able to run benchmarks for critical MXNet operators (see below for the list) with:
    1. Default inputs - Varying input shapes. Ex: Small Tensors, Large Tensors, Skewed Tensor, common tensor shapes.
    2. Custom inputs - Users should be able to specify input tensors to run benchmarks
    3. Random inputs - Randomly generate input tensors for operator benchmarks
  2. Able to save results as JSON or get results as Python Dictionary
  3. Able to run benchmarks on CPU/GPU.

Below are the operators covered in Phase 1

  1. Hardware
    1. CPU - C5.18X
    2. GPU - P3.2X (Single GPU)
  2. NDArray Operations (Precision - FP32)
    1. Conversion Operations: Copy, CopyTo, as_in_context, asnumpy, asscalar, astype
    2. Creation Operations: zeros, zeros_like, ones, ones_like, full, arange
    3. Shape (view) change Operations: Transpose (T), shape_array, size_array, reshape, reshape_like, flatten, expand_dims, split, diag, tile, pad
    4. Reduction Operations: sum, nansum, prod, nanprod, mean, max, min, norm
    5. Sorting and Searching Operations: sort, argsort, topk, argmax, argmin, argmax_channel
    6. Arithmetic Operations: add, sub, neg, mul, div, mod, pow
    7. Inplace Arithmetic Operations: iadd (+=), isub (-=), imul (*=), idiv (/=), imod (%=)
    8. Comparison Operations: lesser, lesser_equal, greater, greater_equal, equal, not_equal
    9. Indexing Operations: get_item (x[i]), set_item (x[i]=), slice, slice_axis, take, batch_take, pick, one_hot
    10. Exponents and Logarithms: exp, log
    11. Powers Operations: sqrt, square
    12. Join and Split Operations: concat, split, stack
    13. GEMM: dot, batch_dot
    14. Random Sampling: normal, poisson, uniform, random, randn, randint, shuffle
    15. Others: clip, where, abs
  3. Gluon Layers (Neural Network Operations) - Mode: Imperative (Hybrid will be added in the next phase)
    1. Basic: Dense, Lambda, Flatten, Embedding, Dropout, BatchNorm
    2. Convolutions: Conv1D, Conv2D, Conv1DTranspose, Conv2DTranspose
    3. Pooling: MaxPool1D, MaxPool2D, AvgPool1D, AvgPool2D, GlobalMaxPool1D, GlobalMaxPool2D, GlobalAvgPool1D, GlobalAvgPool2D
    4. Activations: LeakyRelu, PRelu, Sigmoid, Softmax, Log_Softmax, Activation
    5. Recurrent Cells: RNNCell, LSTMCell, GRUCell, RecurrentCell, SequentialRNNCell, BiDirectionalCell
    6. Loss: L1Loss, L2Loss, SigmoidBinaryCrossEntropyLoss, SoftmaxCrossEntropyLoss, KLDivLoss, HuberLoss, HingeLoss, SquaredHingeLoss, LogisticLoss, TripletLoss, CTCLoss
  4. Custom Operator Benchmark

Phase 2

  1. Other DataTypes
    1. INT64, INT8, FP64, FP16
  2. NDArray Operations
    1. Sparse: tostype
    2. View Operations: swapaxes, flip, depth_to_space, space_to_depth
    3. Rounding Operations: round, rint, fix, floor, ceil, trunc
    4. Trigonometric Operations: sin, cos, tan, arcsin, arccos, arctan, degrees, radians, sinh, cosh, tanh, arcsinh, arccosh, arctanh
    5. Exponent and Logarithmic Operations: expm1, log10, log2, log1p
    6. Logical Operations: logical_and, logical_or, logical_xor, logical_not
    7. Powers Operations: rsqrt, cbrt, rcbrt, reciprocal
    8. Random Operations: exponential, gamma, generalized_negative_binomial, multinomial, negative_binomial
    9. Sequence Operations: SequenceLast, SequenceMask, SequenceReverse
    10. Others: unravel_index, ravel_multi_index
  3. Gluon - Mode: Hybrid mode for all layers covered in Phase 1, plus additional coverage of layers as below
    1. Fused Operators: Conv + Relu, Conv + BatchNorm (more to be added when we start this work)
    2. Basic: HybridLambda, InstanceNorm, LayerNorm
    3. Convolutions: Conv3D, Conv3DTranspose
    4. Pooling: MaxPool3D, AvgPool3D, GlobalMaxPool3D, GlobalAvgPool3D
    5. Activations: Elu, Selu, Swish
    6. Recurrent: ZoneOutCell, ResidualCell, DropoutCell
    7. Loss: CosineEmbeddingLoss, PoissonNLLoss
    8. Important Contrib Layers
  4. Other Items to be explored
    1. Image APIs
    2. Data APIs
    3. Metric APIs
    4. Initializers and Optimizers

Phase 3 (To be discussed/scoped)

Benchmark PyTorch operators to have a neutral baseline for comparing MXNet operator performance. This is yet to be discussed and finalized.

FAQs

Q1) Why not use check_speed(..) utility in MXNet test_util?
A) It supports Symbol APIs only and does not support benchmarking NDArray or Gluon blocks. It is a lightweight, simple symbol executor and expects users to create the symbol graph to execute along with inputs. The proposed library in this document is more sophisticated: it supports benchmarking operators with NDArray and Gluon blocks, provides various default inputs, provides high-level drivers, provides an interface for users to specify different inputs, and prepares results in different formats - Python dictionary, CSV, JSON.

Q2) Why not Symbol execution? Why only NDArray and Gluon?
A) MXNet users are encouraged to use, and mainly use, the NDArray and Gluon APIs. The MXNet community is moving towards deprecating the Symbol APIs with the work on numpy-compatible operators. To measure what our users actually use and observe, we propose to use NDArray and Gluon blocks for benchmarking operators in this library.

Q3) Why not extend/repurpose current MXNet unit tests?

Q4) Why Python? Shouldn't we benchmark operators as close to Kernel as possible i.e., C++ benchmarks?




22 Comments

  1. Good proposal! 

    Several suggestions:

    • Add the compute utilization or memory bandwidth for tested OP in order to know if it is good enough
    • Import the OPs from the given topology so we can compare the runtime of pure OPs 
    • Compared with other frameworks, like pytorch, tensorflow, for the very common OP, like conv2d
    • Add these OP benchmarking into CI to make sure NO performance regression from new PR
    1. Thanks Patric. 

      Compare with other Frameworks => yes, in the plan.

      Integration to nightly CI, PR builds => yes, in the plan.

      compute utilization, memory usage => Will be added in phase 2.

      Can you please provide additional details on what you meant by - "Import the OPs from the given topology so we can compare the runtime of pure OPs "? 



  2. class Add(MXNetOperatorBenchmarkBase):
    I don't see a need for inheritance; all of this should be in the base class and the arguments can be customized.
    1. Agreed. Updated the proposal. Thank you for valuable discussion.

  3. Can we integrate this solution to the profiler ? I think it will be useful to identify the problem when we output as a trace.

    1. We should. Will add more details about it. 

  4. """MXNet operator performance benchmarks. NOTE:1. You can pass list of input dictionary to run benchmarks for an operator with different input configuration.2. Results are dictionary of time, memory for the benchmark runs.""" # Run performance test for Add operatorresults = run_performance_test(F=mx.nd.add, ctx=mx.cpu(), warmup=10, runs=50, inputs=[{"lhs": (1024, 1024), "rhs": (1024, 1024), "initializer": nd.normal, "run_backward": True, "dtype": "float32"}]) # Run performance test for Conv2D operatorresults += run_performance_test(F=nn.gluon.Conv2D, ctx=mx.cpu(), warmup=10, runs=50, inputs = [{"data": (32, 3, 256, 256), "data_initializer": nd.normal, "channels": 64, "kernel_size": (3, 3), "strides": (1, 1), "padding": (0, 0), "dilation": (1, 1), "layout": "NCHW", "activation": None, "run_backward": True, "dtype": "float32"]}
    Will this capture metrics for both forward and backward operations?
    1. Yes. It captures both forward and backward timing.

      However, users will have an option to say backward=False and expect the benchmark to run forward only. This is mostly usable for operators like "iadd()" which don't support backward.

  5. Great document and initiative Sandeep. One question, how do you plan to measure memory usage? I think is a very difficult problem with many ramifications.

    1. I plan to use MXNet's memory profiler to get the bytes allocated - https://github.com/apache/incubator-mxnet/blob/master/src/profiler/storage_profiler.h#L51

      From my initial observation - It is consistent for a given operation and input across runs.

      Also, I understand it might not be exact true memory used by an operator. However, we can take it as a proxy is my assumption.

  6. With regards to Alternative 2 and property based testing, I think they don't need to be mutually exclusive. The level of abstraction that you propose on Usecase 2.1 looks very good to me and is something one can build on top to automate and abstract further when desired.

  7. mxnet.profiler.get_summary(reset=False)
    Is this not just json loads at the end?
  8. F
    why F? maybe use "op" instead?

    1. Makes sense. F was used with same terminology used in Gluon. But, as you pointed 'op' is better. Will change.

  9. mx.nd.add
    shouldn't this be "F(lhs, rhs)"?
  10. Are there common abstractions that we can utilize for model level benchmarks? it would be good to have a common API and tooling for both operator level benchmark and model level benchmark, or what are your thoughts on this one?

  11. run_all_arithmetic_operations_benchmarks()
    This feature seems to already exist in profiler, as an example
  12. +=
    Does this mean appending to previous result?
  13. Have you thought about having the benchmark loop in native code to avoid python overheads? while I like that the benchmark is driven from python, do you think the call overhead is going to impact the benchmarks too much?

  14. I want to add a note on random input. We would need random shapes, the values of the tensors usually don't have impact on performance excluding corner cases like denormalized floats, unless I'm mistaken.

  15. Summarizing other feedbacks obtained:

    1. We need to make tests reproducible. That is, knowing the seed used for running the benchmarks.
    2. Hypothesis testing listed in alternate solution is very useful for a developer. We should have that support.
    3. Separate profiler.get_summary(reset=False) API into 2 APIs. profiler.get_summary(), profiler.reset()
    4. No need to name a metric. Just return JSON and upstream consumers can use it however they want.
    5. For some operators, the input values impact computation. Ex: short-circuiting on certain inputs v/s the regular computation path. Though the majority of operators do not fall into this category, we need to prepare a list of operators that are impacted and describe how to handle such situations.
    6. It would be good idea to think and document at-least one use case (Ex: Nightly benchmark CI/CD system) using this utility to clarify the interface.