Note: Please feel free to comment on the Google Doc and we will merge the revised proposal back here in the end.

https://docs.google.com/document/d/1gDtzXbpK79PLgNyvFIzaRVIBFYRRjLiD7Km2uQcuhGQ/edit?usp=sharing

Introduction

This proposal aims to enable users to visualize MXNet data using TensorFlow's TensorBoard. We plan to develop a logging tool, bundled in the MXNet Python package, that lets users log data in a format that TensorBoard can render in browsers. The typical workflow of using the logging tool is explained in the following figure. Users would need to install MXNet and TensorFlow's TensorBoard to visualize the data. The project will be divided into two phases:

1. Synchronized logger. This is a straightforward implementation in Python. The downside is that logging NDArrays blocks the main Python thread, because the logger internally calls asnumpy() to convert NDArrays to numpy.ndarrays for logging.

2. Asynchronized logger. We can leverage the MXNet dependency engine to schedule logging operations when data is ready. This implementation requires much more engineering work in C++ and still has some unresolved difficulties to be discussed.

We will focus our efforts on the first phase and further explore the possibility of implementing an asynchronized logger.

Synchronized Logger

This work will be based upon the contributions from the following three GitHub repositories, whose authors deserve credit.

TeamHG-Memex/tensorboard_logger. The author of this repo implemented the encoding algorithm for writing data into the event files that TensorBoard loads for rendering. This is the key component that enables us to develop a logger independent of TensorFlow.
dmlc/tensorboard. Zihao Zheng is the primary author of this repo and also a DMLC member. The idea of making a simple logging tool comes from our multiple discussions with him. He carved out from TensorFlow necessary protobuf definitions and designed low level logging interfaces for building a standalone logging and rendering tool.
lanpa/tensorboard-pytorch. The author of this repo adopted the idea from dmlc/tensorboard and implemented a standalone logging tool for PyTorch users. Our synchronized logger will be implemented based upon the basic design of this tool to support MXNet data types.
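
To make the event file encoding mentioned above more concrete, here is a minimal sketch of the record framing that TensorFlow event files use, as we understand it: each record is a little-endian 64-bit payload length, a masked CRC32C of that length, the payload itself, and a masked CRC32C of the payload. The function names are ours and hypothetical, and the code is not taken from any of the repositories above. In a real logger the payload would be a serialized Event protobuf; here it is just bytes.

```python
import struct

_CRC32C_POLY = 0x82F63B78  # reflected Castagnoli polynomial


def _make_table():
    # Precompute the byte-wise CRC32C lookup table.
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ _CRC32C_POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data):
    """Plain CRC32C checksum of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc32c(data):
    """CRC32C rotated right by 15 bits plus a constant, truncated to 32 bits."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload):
    """Wrap a payload as: length, masked CRC of length, payload, masked CRC of payload."""
    header = struct.pack('<Q', len(payload))
    return (header + struct.pack('<I', masked_crc32c(header))
            + payload + struct.pack('<I', masked_crc32c(payload)))
```

An event file is then a concatenation of such records, which is why the logger needs no TensorFlow dependency beyond the protobuf definitions.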

High Level Design

We plan to support most of the data types that are already supported in TensorBoard: audio, embedding, histogram, image, scalar, text, and graph. The interface for logging graphs is TBD, since it depends on the implementation of the conversion between MXNet symbols and the ONNX format. The user-level APIs are defined in the following figure. The naming follows the convention in TensorFlow.

summary: A placeholder for any NDArray, scalar, or symbol that is loggable in MXNet, including its metadata.
event: A placeholder for objects to be written to an event file. It may contain summary, LogMessage, SessionLog, etc. In our use case we only care about summary data types, and the last two are TensorFlow specific, but we keep the naming consistent with TensorFlow.

It works as follows:

The user creates a SummaryWriter object instance by providing the constructor with a path representing the location where data is going to be logged. For example: sw = SummaryWriter(logdir='./logs').
The user calls the corresponding API to push the data to be logged into the event queue. For example, sw.add_histogram(tag='my_hist', values=grad, bins=100). Once the loggable is pushed into the event queue, the function returns and the Python main thread continues to run the rest of the code. It blocks only when asnumpy() is called or the event queue is full.
In parallel, a logging thread is constantly checking whether the event queue is empty or not. If it's not empty, it pops the item from the queue and starts writing it to the event file; if empty, it blocks until there are new items pushed into the queue.
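
The queue and logging thread described above can be sketched with the Python standard library alone. This is a minimal illustration under our own assumptions, not the prototype's code: EventWriterSketch, add_event, and max_queue are hypothetical names, and records are written as raw bytes rather than encoded events.

```python
import queue
import threading


class EventWriterSketch:
    """Hypothetical stand-in for the event queue plus logging thread."""

    _STOP = object()  # sentinel telling the logging thread to exit

    def __init__(self, path, max_queue=10):
        self._queue = queue.Queue(maxsize=max_queue)
        self._file = open(path, 'ab')
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add_event(self, record):
        # Returns as soon as the record is queued; blocks only if the queue is full.
        self._queue.put(record)

    def _run(self):
        while True:
            item = self._queue.get()  # blocks while the queue is empty
            if item is self._STOP:
                break
            self._file.write(item)

    def close(self):
        # Drain everything queued so far, then stop the thread and close the file.
        self._queue.put(self._STOP)
        self._thread.join()
        self._file.close()
```

Because a single thread pops from a FIFO queue, records reach the file in the order they were pushed, while the main thread only pays the cost of the enqueue.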

One can visualize the process in the following diagram for better understanding.

User-level Python APIs

mxnet.summary.SummaryWriter(logdir='./runs')
- logdir : str, path where the data is logged
SummaryWriter.add_audio(tag, audio, sample_rate=44100, global_step=None)
- tag : str, unique id for distinguishing different logging data
- audio : NDArray or np.ndarray, audio data to be logged
- sample_rate : int, sampling rate
- global_step : int, used as timestamp for logging the data
SummaryWriter.add_embedding(tag, embedding, labels=None, images=None, global_step=None)
- tag : str, unique id for distinguishing different logging data, users can choose the desired data for visualization using this str in browser
- embedding : NDArray or np.ndarray, a 2D matrix representing a batch of embeddings
- labels : list, NDArray, or np.ndarray, labels corresponding to the examples in the embedding matrix. Its content will be converted before logging. If provided, users can toggle on labels for labeling data points in visualization.
- images : NDArray or np.ndarray, images corresponding to the examples in the embedding matrix. If provided, users can toggle on images for labeling data points in visualization.
- global_step : int, timestamp for logging the data
SummaryWriter.add_histogram(tag, values, bins=10, global_step=None)
- tag : str, unique id for the histogram data
- values : NDArray or np.ndarray, data used for building the histogram
- bins : int or str, same meaning as in np.histogram
- global_step : int, timestamp for logging the data
SummaryWriter.add_image(tag, image, global_step=None)
- tag : str, unique id for the image
- image : NDArray or np.ndarray, image data to be logged
- global_step : int, timestamp for logging the data
SummaryWriter.add_scalar(tag, value, global_step=None)
- tag : str, unique id for the scalar value, the scalar values with the same name will be in the same plot
- value : scalar, value to be logged
- global_step : int, timestamp
SummaryWriter.add_text(tag, text, global_step=None)
- tag : str, unique id for the text
- text : str, text to be logged
- global_step : int, timestamp for the text data
SummaryWriter.add_graph : Interface TBD as it depends on mxnet-onnx
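
To show how these calls would look in a training loop, here is a small sketch. Since mxnet.summary.SummaryWriter is only proposed at this point, the example uses RecordingWriter, a hypothetical in-memory stand-in that mirrors the add_scalar and add_text signatures above and simply keeps the logged values in dictionaries.

```python
from collections import defaultdict


class RecordingWriter:
    """Hypothetical in-memory stand-in mirroring two of the proposed signatures."""

    def __init__(self, logdir='./runs'):
        self.logdir = logdir
        self.scalars = defaultdict(list)
        self.texts = defaultdict(list)

    def add_scalar(self, tag, value, global_step=None):
        # Values logged under the same tag form one series (one plot).
        self.scalars[tag].append((global_step, float(value)))

    def add_text(self, tag, text, global_step=None):
        self.texts[tag].append((global_step, text))


sw = RecordingWriter(logdir='./logs')
for step, acc in enumerate([0.81, 0.90, 0.94]):
    sw.add_scalar(tag='train_acc', value=acc, global_step=step)
sw.add_text(tag='notes', text='lr=0.1', global_step=0)
```

The point of the sketch is the calling convention: the tag groups values into one plot, and global_step orders them along the horizontal axis.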

Prototype Demo

We have made a standalone logger prototype based upon the repo tensorboard-pytorch. One can follow the steps below to try the examples we provide in the repo: https://github.com/reminisce/tensorboard-mxnet-logger/tree/add_mxnet_logging_apis.

Install Apache MXNet.
Install TensorFlow's TensorBoard. If you don't want to install them by yourself, the easiest way is to use AWS Deep Learning AMI. https://aws.amazon.com/machine-learning/amis/
Download this repo branch: https://github.com/reminisce/tensorboard-mxnet-logger/tree/add_mxnet_logging_apis.
cd into the tensorboard-mxnet-logger you just downloaded and type "sh compile.sh". This will generate the protobuf python classes for serializing data to be logged.
Under the folder tensorboard-mxnet-logger, type "python setup.py install". This will install the logger package under your Python path.
Under the folder tensorboard-mxnet-logger, type "python demo_mxnet_training.py". This will run the Gluon MNIST training script and log the training and validation accuracy, the gradient distributions, and the first batch of images under the log folder "./logs".
Launch tensorboard by typing "tensorboard --logdir=./logs --host=127.0.0.1 --port=8888"
Open the browser, type in the address 127.0.0.1:8888, and you should be able to navigate through the visualizations.
To visualize embeddings, just type "python demo_mxnet_embedding.py". This will treat 2000 MNIST images as embeddings and log them under the folder "./logs". You can visualize them in the browser after the execution finishes.

TODO

If this design proposal is approved by the community, we will start integrating the prototype logging tool into Apache MXNet.
We need to align with the developers working on the mxnet-onnx converter to define the interface for logging network structures.

Asynchronized Logger

As there are still many unresolved technical difficulties in implementing an asynchronized logger, we will just put our thoughts here for discussion. We welcome any comments, suggestions, corrections, and contributions from the community.

One option of making an async logger is implementing functions of logging as operators to leverage the MXNet dependency engine. This would require us to:

Move all the protobuf classes from frontend (Python) to the backend (C++). (medium risk)
Since the MXNet NDArray in C++ is not serializable to protobuf, we would need to define an NDArrayProto class capturing the data and metadata to be logged in an NDArray instance for serialization. (medium risk)
Implement the encoding algorithm in C++ for writing protobuf objects into event files. (low risk)
Certain types of data to be logged, such as histogram, image, and audio, are currently preprocessed in the frontend (Python) by third-party libs before being passed to protobuf object creation. For example, to make a histogram from an NDArray, we call numpy.histogram. If we move the logic to the backend (C++), we would need to implement a histogram operator in the backend. This seems straightforward for this data type, but for image or audio, more specialized libs such as PIL and wave are used. We are not sure whether we could implement all of them in the backend C++. (high risk)
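
As a rough picture of what a backend histogram operator would have to compute, the sketch below implements equal-width binning in plain Python. It is intended to follow the behavior of numpy.histogram for an integer bins argument (the last bin includes its right edge), but it is our own illustration rather than the prototype's code, and floating-point edge cases may differ from NumPy.

```python
def histogram(values, bins=10):
    """Equal-width histogram; returns (counts, bin_edges)."""
    values = [float(v) for v in values]
    if not values:
        raise ValueError('values must not be empty')
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: widen it so the bins have nonzero width.
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins)] + [hi]
    counts = [0] * bins
    for v in values:
        # The maximum value falls into the last bin instead of overflowing.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges
```

The computation itself is small, which supports the view that the histogram case is the easy one; image and audio encoding are where most of the backend effort would go.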

In the meantime, there is one technical difficulty to be resolved:

When writing certain data types into the event file, some config files may be created in the log directory at the same time and they may be shared with other logging threads. For example, when the engine is logging two NDArrays independent of each other at the same time, two threads may be writing to the same config file concurrently. At operator level, we don't have the control of locking/unlocking the config files.
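
For discussion only, the sketch below shows what per-file locking looks like when it lives in the frontend, where we do control the writer code. The names lock_for and append_config_line are hypothetical. It serializes writers that are threads of one Python process; it does not by itself resolve the operator-level problem described above, where the engine schedules the writes.

```python
import os
import threading

_registry_guard = threading.Lock()  # protects the lock registry itself
_path_locks = {}                    # absolute path -> lock for that file


def lock_for(path):
    """Return the single lock associated with a config file path."""
    key = os.path.abspath(path)
    with _registry_guard:
        return _path_locks.setdefault(key, threading.Lock())


def append_config_line(path, line):
    """Append one line to a shared config file while holding its lock."""
    with lock_for(path):
        with open(path, 'a') as f:
            f.write(line + '\n')
```

An equivalent mechanism in the backend would need the logging operators to share such a registry, which is the part that remains open.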

Acknowledgements and References

Special thanks to Zihao Zheng, author of dmlc/tensorboard, who provided insightful ideas, technical support, and many hours of discussion. We would also like to thank the authors of the following two repos, who pioneered making logging independent of TensorFlow.
