Link to Dev List discussion

Feature Shepherd

Naveen Swamy(@nswamy), Lai Wei


Currently, Apache MXNet does not provide an out-of-the-box feature to load audio (wav) files into NDArrays and build an AudioDataset for simple audio tasks such as multi-class classification.


The user experience is limited for machine learning tasks on audio data, such as multi-class classification. Frameworks like PyTorch support loading audio files and performing these tasks. Enabling this feature for at least one audio format would be a good start; it can later be extended to support multiple audio file formats and richer audio transforms, such as splitting an audio file into randomized chunks.

Goals/Use cases

Phase 1 

As a user, I would like MXNet to offer an out-of-the-box audio data loader and some popular audio transforms that would allow me to:

  • load audio files (only .wav files are supported currently) and make a Gluon AudioDataset (NDArrays),

  • apply some popular audio transforms to the audio data (for example scaling, MEL, MFCC),

  • load the dataset using Gluon's DataLoader and train a neural network (e.g. an MLP) on the transformed audio dataset,

  • perform a simple audio task such as sound classification, with 1 label per audio clip (a multi-class sound classification problem),

  • have an end to end example for a task (Urban Sounds Classification) including:

    • reading audio files from a folder location (can be extended to an S3 bucket later) and loading them into the AudioDataset

    • apply audio transforms

    • train a model - neural network with the AudioDataset or DataLoader

    • perform the multi-class classification and run inference

     Note: The plan is to land a working version of this model, with an example, in the contrib package of MXNet before deciding whether to move the implementation to the module.

Open Questions

There are a few open questions on which I would like suggestions from the Apache MXNet community:

  1. Is there a library that performs better than librosa at loading audio (wav files) into numpy arrays?
  2. Is there a library more lightweight than librosa that also supports other popular formats such as .mp3, .aac and .m4r?
  3. In my POC, which uses librosa both to load audio and to extract features (say MFCC, Mel frequency cepstral coefficients), loading takes the bulk of the time (70% to 90%). Is it advisable to create the AudioDataset with pre-loaded numpy arrays as data, instead of storing only audio filenames and labels and loading the files into arrays later using Gluon transforms?
  4. Can we add a preprocessing function that reads all the wav files into numpy arrays, saves them as a compressed .npz file of arrays, and from then on operates on the .npz file instead of the wav files? This may avoid loading the wav files every time.
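The idea behind questions 3 and 4 (decode every wav file once, then work from a cache) can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the helper names are hypothetical, the loader handles just 16-bit mono PCM, and `pickle` stands in for the compressed `.npz` file that `numpy.savez_compressed` would produce.

```python
import os
import pickle
import struct
import tempfile
import wave


def load_wav_mono16(path):
    """Decode a 16-bit mono PCM wav file into floats in [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav_file:
        assert wav_file.getsampwidth() == 2 and wav_file.getnchannels() == 1
        raw = wav_file.readframes(wav_file.getnframes())
    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return [value / 32768.0 for value in ints]


def build_cache(wav_paths, cache_path):
    """Decode every wav file once and store all arrays in one cache file."""
    decoded = {os.path.basename(p): load_wav_mono16(p) for p in wav_paths}
    with open(cache_path, "wb") as cache_file:
        pickle.dump(decoded, cache_file)


def load_cache(cache_path):
    """Read the pre-decoded arrays back without touching the wav files."""
    with open(cache_path, "rb") as cache_file:
        return pickle.load(cache_file)


def write_demo_wav(path, samples, rate=8000):
    """Write a tiny 16-bit mono wav file so the sketch is self-contained."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(struct.pack("<%dh" % len(samples), *samples))


workdir = tempfile.mkdtemp()
demo_path = os.path.join(workdir, "clip0.wav")
write_demo_wav(demo_path, [0, 16384, -16384, 32767])

cache_file_path = os.path.join(workdir, "audio_cache.pkl")
build_cache([demo_path], cache_file_path)
os.remove(demo_path)  # later runs no longer need the wav file
cached = load_cache(cache_file_path)
print(cached["clip0.wav"])
```

The trade-off is the usual one: decoding cost is paid once, at the price of extra disk space and a cache that has to be rebuilt whenever the source files change.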

Proposed Approach

  1. Extend the API with an audio package that includes AudioFolderDataset; this class makes a dataset from audio data stored in the local file system.
  2. Add a module that defines multiple transform functions to extract various kinds of features from the audio file (passed as data from the dataset).
  3. Develop an end-to-end training and prediction example that uses the pre-downloaded Urban Sounds classification dataset, makes an AudioDataset, applies some audio transforms, and performs multi-class classification on the audio dataset.

      Class Design:


  •  A package audio is introduced inside package, to separate audio datasets from the vision-related ones.

  •  This also keeps the package naming consistent with the vision package, which houses vision-related datasets such as ImageFolderDataset, MNIST, FashionMNIST, ImageRecordDataset and so on.

     2. : 
This package will house some commonly used Audio data transforms like MFCC, Scale, PadOrTrim, Stereo2Mono and so on. This list can be extended with new transforms coming in. 


  •  Each transform is implemented as a class that overrides the forward() method of its base class; forward() carries out the required transform computation and returns an NDArray.
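The class-per-transform pattern described above can be illustrated without MXNet. In this minimal sketch the `Transform` base class is a hypothetical stand-in for Gluon's `Block`, plain Python lists stand in for NDArrays, and the parameter names `scale_factor` and `max_len` are assumptions rather than the proposed API.

```python
class Transform:
    """Hypothetical stand-in for a Gluon Block: calling it runs forward()."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError


class Scale(Transform):
    """Divide every sample by a constant factor."""

    def __init__(self, scale_factor):
        self.scale_factor = float(scale_factor)

    def forward(self, x):
        return [sample / self.scale_factor for sample in x]


class PadOrTrim(Transform):
    """Force a clip to exactly max_len samples by zero-padding or truncating."""

    def __init__(self, max_len):
        self.max_len = max_len

    def forward(self, x):
        if len(x) >= self.max_len:
            return list(x[:self.max_len])
        return list(x) + [0.0] * (self.max_len - len(x))


class Compose(Transform):
    """Apply a sequence of transforms in order."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def forward(self, x):
        for transform in self.transforms:
            x = transform(x)
        return x


pipeline = Compose([Scale(scale_factor=2 ** 15), PadOrTrim(max_len=4)])
short_clip = pipeline([16384, -32768])               # padded with zeros
long_clip = pipeline([0, 8192, 16384, 24576, 32767])  # trimmed to 4 samples
print(short_clip, long_clip)
```

Because every transform exposes the same call interface, new transforms can be added to a pipeline without changing the dataset or the loader.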

     Design Considerations:

      The following design considerations were taken into account:

  1. The AudioFolderDataset has a function that initializes the dataset with each audio filename and its corresponding label, in line with the ImageFolderDataset.
  2. The design permits passing a transform function (available in the package) when creating a dataset; it is called with lazy = False to apply the transform to all the data in one shot.
  3. Every transform is of type Block; it extracts features (for example, 'mfcc') from its input and returns the transformed NDArray.
              a) For now, the transform classes do not extend HybridBlock, because librosa is used to load audio and compute features. This requires converting the input NDArray to numpy for librosa's feature extraction and then back to an NDArray for iteration using the DataLoader.

Addition of New APIs

class AudioFolderDataset( :

"""A dataset for loading Audio files stored in a folder structure like::




Files(wav) and a csv file that has filename and associated label

Parameters
----------
root : str
    Path to root directory.
transform : callable, default None
    A function that takes data and label and transforms them:
    transform = lambda data, label
has_csv: default True
    If True, it means that a csv file has filename and its corresponding label
    If False, we have folder like structure
train_csv: str, default None
    If has_csv is True, train_csv should be populated by the training csv filename

Attributes
----------
synsets : list
    List of class names. `synsets[i]` is the name for the integer label `i`
items : list of tuples
    List of all audio in (filename, label) pairs.
"""

    def _list_audio_files(self, root):
        """Populates the data in the dataset, making tuples of (filename, label)"""

    def __getitem__(self, idx):
        """Retrieves the item (data, label) stored at idx in items

        Note: data is a filename, label is the class to which it belongs,
        encoded using sklearn's label encoder.
        """

    def __len__(self):
        """Retrieves the number of items in the dataset"""
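To make the csv-backed behaviour concrete, here is a plain-Python sketch of the same three methods. It is not the proposed class: `CsvAudioDataset` is a hypothetical name, it does not inherit from any Gluon dataset, it assumes a header-less two-column csv (filename, label), and it encodes labels by sorting the unique class names, which mirrors what sklearn's label encoder does.

```python
import csv
import os
import tempfile


class CsvAudioDataset:
    """Sketch of a dataset holding (filename, integer label) pairs."""

    def __init__(self, root, train_csv):
        self._train_csv = train_csv
        self.synsets = []
        self.items = []
        self._list_audio_files(root)

    def _list_audio_files(self, root):
        """Read (filename, label) rows and encode the labels as integers."""
        with open(self._train_csv, newline="") as csv_file:
            rows = [(row[0], row[1]) for row in csv.reader(csv_file) if row]
        # synsets[i] is the class name for the integer label i.
        self.synsets = sorted({label for _, label in rows})
        label_to_index = {name: i for i, name in enumerate(self.synsets)}
        for filename, label in rows:
            path = os.path.join(root, filename)
            self.items.append((path, label_to_index[label]))

    def __getitem__(self, idx):
        return self.items[idx]

    def __len__(self):
        return len(self.items)


# Build a tiny csv so the sketch runs on its own.
root = tempfile.mkdtemp()
csv_path = os.path.join(root, "train.csv")
with open(csv_path, "w", newline="") as handle:
    csv.writer(handle).writerows(
        [["a.wav", "siren"], ["b.wav", "dog_barking"], ["c.wav", "siren"]]
    )

dataset = CsvAudioDataset(root, train_csv=csv_path)
print(dataset.synsets, len(dataset), dataset[0])
```

Note that the items hold filenames only; decoding the audio is left to the transforms, which is the design choice the open questions above revisit.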


class MFCC(Block):
    """Extracts Mel frequency cepstrum coefficients from the audio data file
    More details :

    returns: An NDArray after extracting mfcc features from the input
    """
    def __init__(self, **kwargs):
        super(MFCC, self).__init__(**kwargs)

    def forward(self, x):
        # Applying the transform logic on x the input.
        return nd.array(audio_tmp)

API usage

# creating a Dataset specifying the source folder of audio

tick = time.time()

aud_dataset = AudioFolderDataset('./Train', has_csv=True, train_csv='./train.csv')

tock = time.time()

print("Loading the dataset taking ",(tock-tick), " seconds.")

#Defining the transforms to use

audio_transforms = Compose([ ])

#Defining the Data Loader specifying the transform composed above

audio_train_loader =, lazy=False), batch_size=32, shuffle=True)

Backward compatibility

All proposed classes and functions are added on top of the existing APIs, so they should not block existing functionality or cause regression tests to fail.

Performance Considerations

There are some performance considerations in this design too. They are:

  1. The reason for passing lazy = False is that calling the transform item by item during training currently slows training down heavily. Instead, the transforms passed to the Dataset are applied when the dataset is handed to Gluon's DataLoader, before the dataset is actually iterated for training.

    Example Scenario: 
    For now, given a dataset of 5435 audio files (.wav extension), with lazy = False loading the audio data into the DataLoader takes 4 to 7 minutes and training an MLP (2 Dense layers with 256 nodes each) takes 1 minute. So the end-to-end process, from data loading to a trained model, takes between 7 and 8 minutes.
    The same transform applied lazily (lazy = True) takes 3 to 4 hours to produce a trained model, even though the data load itself is momentary.

  2. The current approach is therefore to always call the transform_first() function with the argument lazy set to False.
  3. Since librosa operates on numpy arrays while our tasks need NDArrays, the conversion on every sample takes a toll on performance. This may need to be addressed as the size of the data scales.
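The gap between the two settings comes down to how often the transform runs. The sketch below mimics the lazy and eager semantics of `transform_first()` with hypothetical stand-in classes (`ListDataset`, `_LazyView`); it does not use Gluon and only counts calls to a dummy transform.

```python
class ListDataset:
    """Hypothetical stand-in for a Gluon dataset of (data, label) pairs."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, idx):
        return self._items[idx]

    def __len__(self):
        return len(self._items)

    def transform_first(self, fn, lazy=True):
        if lazy:
            # Defer the work: fn runs on every item access.
            return _LazyView(self, fn)
        # Eager: fn runs once per item, right now, and results are kept.
        return ListDataset([(fn(data), label) for data, label in self._items])


class _LazyView:
    """Applies fn to the data element each time an item is fetched."""

    def __init__(self, base, fn):
        self._base = base
        self._fn = fn

    def __getitem__(self, idx):
        data, label = self._base[idx]
        return self._fn(data), label

    def __len__(self):
        return len(self._base)


def count_transform_calls(lazy, num_items=5, num_epochs=3):
    """Iterate the dataset for several epochs and count transform calls."""
    calls = {"n": 0}

    def expensive_transform(data):
        calls["n"] += 1
        return data * 2

    dataset = ListDataset([(i, i % 2) for i in range(num_items)])
    transformed = dataset.transform_first(expensive_transform, lazy=lazy)
    for _ in range(num_epochs):
        for idx in range(len(transformed)):
            transformed[idx]
    return calls["n"]


lazy_calls = count_transform_calls(lazy=True)    # 5 items x 3 epochs
eager_calls = count_transform_calls(lazy=False)  # 5 items, once
print(lazy_calls, eager_calls)
```

In this sketch the eager variant keeps every transformed sample in memory, so the time saved is paid for in memory footprint.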

Test Plan

I plan to test the new AudioFolderDataset class and transforms on a simple but popular audio task, Urban Sounds Classification, using a dataset of about 5435 audio samples labelled with sounds such as 'siren', 'children_playing', 'dog_barking', 'street_music' and similar classes (10 labels in total).

Alternative Approaches

  1. An alternative was to use another library to load the files. However, those libraries did not provide audio feature extraction out of the box, so I ended up using librosa, a popular framework for music and audio analysis.
  2. Another tweak to the data loading that was considered, and is open for discussion, is to load the audio into numpy arrays with librosa when the AudioFolderDataset is initialized, instead of loading the files in the transforms.

Technical Challenges

  1. Finding a library that does both audio loading and audio feature extraction. Other libraries such as pySox and libsox provide a way to load audio, but their support for extracting audio features or applying operators is limited.

  2. Finding ways to apply transforms such as resampling, scaling and stereo-to-mono conversion to audio data. This was solved by using librosa, which does the following by default (all configurable):
    a) resamples the audio it loads to 22.05 kHz, which is important when loading audio files from different sources,
    b) scales the sample values to the range -1.0 to 1.0, which is important for weighing the audio files on the same scale,
    c) converts stereo audio (recorded with 2 or more channels) to mono audio.
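For readers unfamiliar with these three steps, the following standard-library sketch shows what each one does to raw PCM samples. The function names are hypothetical, and the resampler uses naive linear interpolation purely for illustration; it is not the method librosa uses, which relies on higher-quality resampling.

```python
def deinterleave(samples, num_channels):
    """Split interleaved PCM samples into one list per channel."""
    return [samples[ch::num_channels] for ch in range(num_channels)]


def to_mono(channels):
    """Downmix by averaging the channels frame by frame."""
    return [sum(frame) / len(frame) for frame in zip(*channels)]


def scale_int16(samples):
    """Map 16-bit sample values to floats in the range -1.0 to 1.0."""
    return [sample / 32768.0 for sample in samples]


def resample_linear(samples, src_rate, dst_rate):
    """Naive linear-interpolation resampler (illustration only)."""
    if not samples:
        return []
    num_out = int(len(samples) * dst_rate / src_rate)
    resampled = []
    for i in range(num_out):
        position = i * src_rate / dst_rate   # fractional index in the input
        left = int(position)
        right = min(left + 1, len(samples) - 1)
        frac = position - left
        resampled.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return resampled


# Interleaved stereo frames: (L, R), (L, R), (L, R)
stereo_pcm = [16384, 0, -16384, -32768, 8192, 8192]
mono = to_mono(deinterleave(stereo_pcm, num_channels=2))
scaled = scale_int16(mono)
halved = resample_linear([0.0, 1.0, 2.0, 3.0], src_rate=44100, dst_rate=22050)
print(mono, scaled, halved)
```

Bringing every clip to one sample rate, one channel and one value range is what lets clips from different sources share a single feature extraction pipeline.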


The milestones could be split into two phases:

1) Phase 1  - where we have this design implemented and have an example to use the AudioFolderDataset to perform the Sounds classification task with minimal transforms applied to the audio dataset - Done

2) Phase 2 - a richer variety of audio transforms, usable with some important loss functions that are instrumental in dealing with audio data - Done

Current Implementation status:

Currently, the same design is used to build an example in MXNet Gluon to perform an audio data related task - Multi class classification. The PR that was opened is

PR Status - Merged.

Future Scope:

  • This design can be iterated upon to build an AudioDataset API in the MXNet Gluon contrib package, giving a generic audio dataset that can be loaded using Gluon's out-of-the-box DataLoader.
  • Although librosa is used by internal Amazon customers to load audio and extract features from it, once MXNet's FFT operator is supported on CPU, some transforms (like MEL and MFCC) can be implemented as a HybridBlock, which will improve overall feature extraction performance.
  • Additional transforms can be added, which could later be made into operators in MXNet.
    Examples: mu-law encoding, mu-law decoding and so on.
  • A DataLoader can be designed, or extended from Gluon's existing DataLoader, to create chunks (collections of audio samples) with some overlap. This is needed when splitting long audio files into chunks that retain some context, and can be applied to ASR (Automatic Speech Recognition) and TTS (Text to Speech) applications.
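Two of the ideas above, overlapping chunks and mu-law companding, are small enough to sketch in plain Python. Both functions are illustrative assumptions (hypothetical names, lists instead of NDArrays, 8-bit mu-law with mu = 255), not a proposed MXNet operator or DataLoader.

```python
import math


def split_with_overlap(samples, chunk_size, overlap):
    """Split a clip into chunks that share `overlap` samples with the next one.

    The last chunk may be shorter than chunk_size.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not samples:
        return []
    hop = chunk_size - overlap
    last_start = max(len(samples) - overlap, 1)
    return [samples[s:s + chunk_size] for s in range(0, last_start, hop)]


def mu_law_encode(x, mu=255):
    """Compand a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(math.floor((compressed + 1.0) / 2.0 * mu + 0.5))


def mu_law_decode(code, mu=255):
    """Invert mu_law_encode, up to quantization error."""
    compressed = 2.0 * code / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


chunks = split_with_overlap(list(range(10)), chunk_size=4, overlap=2)
codes = [mu_law_encode(x) for x in (-1.0, 0.0, 1.0)]
round_trip = mu_law_decode(mu_law_encode(0.5))
print(chunks, codes, round_trip)
```

The overlap gives each chunk some of the context that precedes it, while mu-law companding spends more of the quantization levels on quiet samples than on loud ones.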


  1. Librosa library to load audio (wav) files and extract features out of it.
  2. MXNet Documentation.



  1. 1. Is there any analysis comparing LibROSA with other libraries? w.r.t features, performance, community usage in audio data domain.
    2. What is the recommendation of LibROSA dependency? Part of MXNet PyPi or ask the user to install if required? I prefer the latter, similar to protobuf in ONNX-MXNet.
    3. I see LibROSA is a fully Python-based library. Are we getting blocked on the dependency for future use cases when we want to make transformations as operators and allow for cross-language support?
    4. In performance design considerations, with lazy=True / False the performance difference is too scary (8 minutes to 4 hours!!). This requires some more analysis. If we know that turning a flag off/on has a 24X performance degradation, should we provide that control to the user? What is the impact of this on memory usage?
    5. I see LibROSA has ISC license ( which says free to use with same license notification. I am not sure if this is ok. I request other committers/mentors to suggest.

    1. Hi Sandeep!

      Thank you for taking time to review the design. I have replied to these in the dev thread here:

  2. Additional points:
    1. Can we include some simple classic audio dataset for users to directly import and try out? like MNIST in vision. (e.g.:
    2. Librosa provides some good audio feature extractions, we can use it for now. But it's slow as you have to do conversions between ndarray and numpy. In the long term, can we make transforms to use mxnet operators and change your transforms to hybrid blocks? For example, mxnet FFT operator can be used in a hybrid block transformer, which will be a lot faster.

    Some additional references on users already using mxnet on audio, we should aim to make it easier and automate the file load/preprocess/transform process.