Gluon - Audio

Link to Dev List discussion

https://lists.apache.org/thread.html/e2568b8b492fbeeafbe73a8abe9ca814e66d288977635c7a62cfa121@%3Cdev.mxnet.apache.org%3E

Feature Shepherd

Naveen Swamy(@nswamy), Lai Wei

Problem

Currently, Apache MXNet provides no out-of-the-box way to load audio (wav) files into NDArrays and build an AudioDataset for simple audio tasks such as multi-class classification.

User Experience

The user experience is limited for machine learning tasks on audio data, such as multi-class classification. Frameworks like PyTorch support loading audio files and performing these tasks. Enabling this feature for at least one audio format is a good start. It can later be extended to support multiple audio file formats and more extensive audio transforms, such as splitting an audio file into randomized chunks.

Goals/Use cases

Phase 1

As a user, I would like an out-of-the-box audio data loader and some popular audio transforms in MXNet that allow me to:

load audio files (only .wav files are supported currently) and make a Gluon AudioDataset (NDArrays),
apply some popular audio transforms to the audio data (for example scaling, MEL, MFCC),
load the dataset using Gluon's DataLoader and train a neural network (e.g. an MLP) on the transformed audio dataset,
perform a simple audio task such as sound classification, with one label per audio clip (a multi-class sound classification problem),
have an end-to-end example for a task (Urban Sounds Classification) that includes:

reading audio files from a folder location (can be extended to an S3 bucket later) and loading them into the AudioDataset,
applying audio transforms,
training a model (a neural network) with the AudioDataset or DataLoader,
performing multi-class classification (running inference).

Note: The plan is to land a working version of this design, with an example, in the contrib package of MXNet before deciding whether to move the implementation to the gluon.data module.

Open Questions

There are a few open questions on which I would like suggestions from the Apache MXNet community:

Is there any library that performs better than librosa at loading audio (wav files) into numpy arrays?
Is there any library more lightweight than librosa that also supports other popular formats such as .mp3, .aac and .m4r?
In my POC, which uses librosa both to load the audio and to extract some features (say MFCC, Mel frequency cepstral coefficients), loading takes the bulk of the time (70% - 90%). Is it advisable to create an AudioDataset with these pre-loaded numpy arrays as data, instead of storing only audio filenames and labels and loading the files into arrays later using Gluon transforms?
Can we think of a preprocessing function that reads all the wav files into numpy arrays, saves them as a compressed .npz file of arrays, and from then on operates on the .npz file instead of the wav files? This may avoid loading the wav files every time.
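The preprocessing idea in the last question can be sketched as follows. This is only an illustration under stated assumptions: it decodes 16-bit PCM mono wav files with Python's standard wave module rather than librosa, and the function names (read_wav_as_float, cache_wavs_to_npz, demo_cache_roundtrip) are hypothetical, not part of the proposal.

```python
import os
import tempfile
import wave

import numpy as np


def read_wav_as_float(path):
    """Read a 16-bit PCM wav file into a float32 array scaled to [-1.0, 1.0)."""
    with wave.open(path, "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wav_file.readframes(wav_file.getnframes())
    return np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0


def cache_wavs_to_npz(wav_paths, npz_path):
    """Decode every wav file once and store all arrays in one compressed .npz."""
    arrays = {os.path.basename(p): read_wav_as_float(p) for p in wav_paths}
    np.savez_compressed(npz_path, **arrays)


def write_test_tone(path, num_samples, sample_rate=8000):
    """Write a short synthetic 16-bit mono wav file (only used for the demo)."""
    t = np.arange(num_samples) / sample_rate
    samples = (0.5 * np.sin(2 * np.pi * 440.0 * t) * 32767).astype("<i2")
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(samples.tobytes())


def demo_cache_roundtrip():
    """Create two wav files, cache them, and load the arrays back from the .npz."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = []
        for name, length in [("0.wav", 800), ("23.wav", 1600)]:
            path = os.path.join(tmp_dir, name)
            write_test_tone(path, length)
            paths.append(path)
        npz_path = os.path.join(tmp_dir, "audio_cache.npz")
        cache_wavs_to_npz(paths, npz_path)
        # Later runs would start here and never touch the wav files again.
        with np.load(npz_path) as cached:
            return {key: cached[key].copy() for key in cached.files}
```

With such a cache, the decoding cost is paid once per dataset rather than once per run. Whether this beats loading with librosa on every run would still need to be measured.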

Proposed Approach

Extend the gluon.contrib.data API with an audio package that includes AudioFolderDataset, a class for making a dataset from audio data stored in the local file system.
Add a module that defines multiple transform functions to extract various kinds of features from the audio (passed as data from the dataset).
Develop an end-to-end training and prediction example that uses the pre-downloaded Urban Sounds classification dataset, makes an AudioDataset, applies some audio transforms, and performs multi-class classification on the audio dataset.

Class Design:

1. gluon.contrib.data.audio.AudioFolderDataset :

A package audio is introduced inside the gluon.contrib.data package to separate audio datasets from the vision-related ones.
This also keeps the package naming consistent with the vision package, which houses vision-related datasets such as ImageFolderDataset, MNIST, FashionMNIST, ImageRecordDataset and so on.

2. gluon.contrib.data.audio.transforms :
This package will house some commonly used audio data transforms such as MFCC, Scale, PadOrTrim, Stereo2Mono and so on. This list can be extended as new transforms come in.

Each transform is implemented as a class that inherits from gluon.Block and overrides the forward() method, which carries out the required transform computation and returns an NDArray.
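A minimal sketch of this class-per-transform pattern is shown below. To stay runnable without MXNet it uses numpy arrays and a small stand-in base class (Transform) in place of gluon's Block, and a plain compose() helper in place of gluon's Compose. The constructor parameters max_len and scale_factor are assumptions made for illustration, not the proposed signatures.

```python
import numpy as np


class Transform:
    """Stand-in for gluon's Block: calling the object runs forward()."""

    def __call__(self, x):
        return self.forward(x)

    def forward(self, x):
        raise NotImplementedError


class PadOrTrim(Transform):
    """Zero-pad or cut a 1-D signal so that it has exactly max_len samples."""

    def __init__(self, max_len):
        self._max_len = max_len

    def forward(self, x):
        x = np.asarray(x, dtype=np.float32)
        if x.shape[0] >= self._max_len:
            return x[: self._max_len]
        return np.pad(x, (0, self._max_len - x.shape[0]), mode="constant")


class Scale(Transform):
    """Divide every sample by a constant factor."""

    def __init__(self, scale_factor):
        self._scale_factor = float(scale_factor)

    def forward(self, x):
        return np.asarray(x, dtype=np.float32) / self._scale_factor


def compose(transforms):
    """Chain transforms left to right, similar in spirit to gluon's Compose."""

    def apply(x):
        for transform in transforms:
            x = transform(x)
        return x

    return apply
```

For example, compose([PadOrTrim(4), Scale(2.0)]) pads a two-sample clip to four samples and then halves every value. In the actual design, each forward() would return an NDArray instead of a numpy array.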

Design Considerations:

The following considerations were taken into account in this design:

The AudioFolderDataset has a function that initializes the dataset with the (audio) filename and the corresponding label, in line with the ImageFolderDataset.
The design permits passing a transform function (available in the gluon.contrib.data.audio.transforms package) while creating a dataset. It is called with lazy=False so that the transform is performed in one shot for all the data.
Every transform defined is of type Block; it extracts features (for example 'mfcc') from its input and returns the transformed NDArray.
a) For now, the transform classes do not extend HybridBlock, because librosa is used to load the audio and compute the features. This requires converting the input NDArray to numpy for librosa's feature extraction and then back to an NDArray for iteration using the DataLoader.

Addition of New APIs

from mxnet import gluon


class AudioFolderDataset(gluon.data.dataset.Dataset):
    """A dataset for loading audio files stored in a folder structure like::

        root/children_playing/0.wav
        root/siren/23.wav

    OR

    Files (wav) and a csv file that has filename and associated label

    Parameters
    ----------
    root : str
        Path to root directory.
    transform : callable, default None
        A function that takes data and label and transforms them:
        transform = lambda data, label
    has_csv : bool, default True
        If True, a csv file holds each filename and its corresponding label.
        If False, the folder structure shown above is expected.
    train_csv : str, default None
        If has_csv is True, train_csv should be set to the training csv filename.

    Attributes
    ----------
    synsets : list
        List of class names. `synsets[i]` is the name for the integer label `i`
    items : list of tuples
        List of all audio in (filename, label) pairs.
    """

    def _list_audio_files(self, root):
        """
        Populates the data in the dataset, making tuples of (filename, label)
        """

    def __getitem__(self, idx):
        """
        Retrieves the item (data, label) stored at idx in items
        Note: data is a filename, label is the class to which it belongs,
        encoded using sklearn's label encoder.
        """

    def __len__(self):
        """
        Retrieves the number of items in the dataset
        """

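import librosa
import numpy as np
from mxnet import nd
from mxnet.gluon import Block


class MFCC(Block):
    """
    Extracts Mel frequency cepstral coefficients from the audio data.
    More details: https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html

    returns: An NDArray after extracting mfcc features from the input
    """
    def __init__(self, **kwargs):
        super(MFCC, self).__init__(**kwargs)

    def forward(self, x):
        """
        Applies the transform logic to the input x.
        """
        # librosa works on numpy arrays, so convert an NDArray input first.
        y = x.asnumpy() if isinstance(x, nd.NDArray) else np.asarray(x)
        # Assumption: librosa's default sampling rate and number of coefficients.
        audio_tmp = librosa.feature.mfcc(y=y)
        return nd.array(audio_tmp)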
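The two ways of populating items described in the AudioFolderDataset docstring (a folder per class, or a csv of filename and label) can be sketched with the standard library alone. The helper names below are hypothetical, the csv is assumed to have no header row, and labels are mapped to integers by sorting the class names, which is similar to what a label encoder produces.

```python
import csv
import os
import tempfile


def list_audio_from_folders(root):
    """Return (synsets, items) for a layout like root/<label>/<file>.wav."""
    synsets = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    items = []
    for label, class_name in enumerate(synsets):
        class_dir = os.path.join(root, class_name)
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith(".wav"):
                items.append((os.path.join(class_dir, filename), label))
    return synsets, items


def list_audio_from_csv(root, csv_path):
    """Return (synsets, items) from a csv whose rows are filename,label."""
    with open(csv_path, newline="") as csv_file:
        rows = [(row[0], row[1]) for row in csv.reader(csv_file) if row]
    synsets = sorted({class_name for _, class_name in rows})
    index = {class_name: i for i, class_name in enumerate(synsets)}
    items = [(os.path.join(root, name), index[class_name])
             for name, class_name in rows]
    return synsets, items


def demo_listing():
    """Build a tiny folder tree and csv on disk, then list both."""
    with tempfile.TemporaryDirectory() as root:
        for class_name, filename in [("siren", "23.wav"),
                                     ("children_playing", "0.wav")]:
            os.makedirs(os.path.join(root, class_name), exist_ok=True)
            open(os.path.join(root, class_name, filename), "wb").close()
        csv_path = os.path.join(root, "train.csv")
        with open(csv_path, "w", newline="") as csv_file:
            csv.writer(csv_file).writerows(
                [["a.wav", "siren"], ["b.wav", "dog_bark"]])
        folder_synsets, folder_items = list_audio_from_folders(root)
        csv_synsets, csv_items = list_audio_from_csv(root, csv_path)
        # Keep only basenames so the result does not depend on the temp path.
        return (
            folder_synsets,
            [(os.path.basename(p), label) for p, label in folder_items],
            csv_synsets,
            [(os.path.basename(p), label) for p, label in csv_items],
        )
```

In both cases synsets[i] is the class name for integer label i, matching the attribute described in the docstring.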

API usage

import time

from mxnet import gluon
from mxnet.gluon.data.vision.transforms import Compose
# Package proposed in this design
from mxnet.gluon.contrib.data.audio import AudioFolderDataset
from mxnet.gluon.contrib.data.audio.transforms import MFCC

# Creating a Dataset, specifying the source folder of the audio
tick = time.time()
aud_dataset = AudioFolderDataset('./Train', has_csv=True, train_csv='./train.csv')
tock = time.time()
print("Loading the dataset took", (tock - tick), "seconds.")

# Defining the transforms to use (each transform is passed as an instance)
audio_transforms = Compose([MFCC()])

# Defining the DataLoader, specifying the transform composed above
audio_train_loader = gluon.data.DataLoader(aud_dataset.transform_first(audio_transforms, lazy=False), batch_size=32, shuffle=True)

Backward compatibility

All the proposed APIs, classes and functions alike, are added on top of the existing APIs. They should not block or break any regression tests.

Performance Considerations

There are some performance considerations in this design too:

The reason for passing lazy=False is that calling the transform item by item during training currently slows training down heavily. Instead, the transforms passed to the Dataset are applied when it is handed to Gluon's DataLoader, before the dataset is iterated for training.

Example Scenario:
For now, given a dataset of 5435 audio files (.wav), with lazy=False, loading the audio data into the DataLoader takes 4 to 7 minutes and training an MLP (2 Dense layers with 256 nodes each) takes 1 minute. So the end-to-end process, from data loading to a trained model, takes between 7 and 8 minutes.
The same transform applied lazily (lazy=True) takes 3 to 4 hours to produce a trained model, even though the data load itself is momentary. The current approach is therefore to always call transform_first() with lazy set to False.
Since librosa operates on numpy arrays and we need NDArrays for our tasks, the conversion on every sample takes a toll on performance. This may need to be addressed as the size of the data scales.
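The difference between the two settings of lazy can be illustrated with a toy model in plain Python. The classes below are stand-ins written for this sketch, not Gluon's implementation: the lazy variant runs the transform on every access, while the eager variant runs it once per sample up front and stores the result.

```python
class CountingTransform:
    """Pretend-expensive transform that counts how often it runs."""

    def __init__(self):
        self.calls = 0

    def __call__(self, sample):
        self.calls += 1
        return sample * 2


class LazyDataset:
    """Applies the transform on every access (the lazy=True behaviour)."""

    def __init__(self, data, transform):
        self._data = data
        self._transform = transform

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._transform(self._data[idx])


class EagerDataset:
    """Applies the transform once up front (the lazy=False behaviour)."""

    def __init__(self, data, transform):
        self._data = [transform(sample) for sample in data]

    def __len__(self):
        return len(self._data)

    def __getitem__(self, idx):
        return self._data[idx]


def count_transform_calls(dataset_cls, num_samples, num_epochs):
    """Iterate the dataset for several epochs and report transform calls."""
    transform = CountingTransform()
    dataset = dataset_cls(list(range(num_samples)), transform)
    for _ in range(num_epochs):
        for idx in range(len(dataset)):
            dataset[idx]
    return transform.calls
```

With 5 samples and 3 epochs, the lazy variant runs the transform 15 times and the eager one 5 times. The eager variant pays for this with memory, since every transformed sample is kept.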

Test Plan

I plan to test the new AudioFolderDataset class and transforms on a simple but popular audio task, Urban Sounds Classification, using a dataset of about 5435 audio samples labelled with sounds such as 'siren', 'children_playing', 'dog_barking', 'street_music' and similar classes (10 labels in total).

Alternative Approaches

An alternative approach was to load the files with another library, scipy.io.wavfile. However, it does not provide audio feature extraction out of the box, so I ended up using librosa, a popular library for music and audio analysis.
Another tweak to the data loading that was considered, and is also open for discussion, is loading the audio into a numpy array with librosa when the AudioFolderDataset is initialized, instead of loading the file in the transforms.

Technical Challenges

Finding a library that handles both audio loading and feature extraction. Other libraries such as pySox and libsox provide a way to load audio, but their support for extracting audio features and applying operators is limited.
Finding ways to apply transforms such as resampling, scaling and stereo-to-mono conversion to audio data. This was solved by using librosa, which does the following by default (all configurable):
a) resamples the audio it loads to 22.05 kHz, which is important when loading multiple audio files from different sources,
b) scales the sample values to the range -1.0 to 1.0, which is important for weighing the audio files on the same scale,
c) converts stereo audio (2 or more recording channels) to mono audio.
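What steps b) and c) amount to can be sketched in a few lines of numpy. This is an illustration of the idea, not librosa's implementation; the helper names are hypothetical, 16-bit PCM input is assumed, and resampling (step a) is left out. In librosa itself, these defaults correspond to calling librosa.load(path) with sr=22050 and mono=True.

```python
import numpy as np


def int16_to_float(samples):
    """Scale 16-bit PCM samples to floats in [-1.0, 1.0)."""
    return np.asarray(samples, dtype=np.float32) / 32768.0


def to_mono(samples):
    """Average the channels of a (num_channels, num_samples) array."""
    samples = np.asarray(samples, dtype=np.float32)
    if samples.ndim == 1:
        # Already mono, nothing to mix down.
        return samples
    return samples.mean(axis=0)
```

For a two-channel clip, to_mono(int16_to_float(stereo)) yields one float value per sample position, each within the -1.0 to 1.0 range.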

Milestones

The milestones could be split into two phases:

1) Phase 1 - where we have this design implemented and have an example to use the AudioFolderDataset to perform the Sounds classification task with minimal transforms applied to the audio dataset - Done

2) Phase 2 - a richer variety of audio transforms, usable with some important loss functions that are instrumental in dealing with audio data - Done

Current Implementation status:

Currently, the same design is used to build an example in MXNet Gluon to perform an audio data related task - Multi class classification. The PR that was opened is https://github.com/apache/incubator-mxnet/pull/13325.

PR Status - Merged.

Future Scope:

This design can be iterated upon to build an API for AudioDataset in the MXNet Gluon contrib package, giving a generic audio dataset that can be loaded using Gluon's OOTB DataLoader.
Although librosa is used by internal Amazon customers to load audio and extract features from it, once MXNet's FFT operator is supported on CPU, some transforms (like MEL, MFCC) can be implemented as a HybridBlock, which will improve overall feature extraction performance.
Additional transforms can be added, which could later be made into operators in MXNet.
Examples: Mu-law encoding, Mu-law decoding and so on.
A DataLoader can be designed, or extended from Gluon's existing DataLoader, to create chunks (collections of audio samples) with some overlap. This will be required when splitting long audio files into chunks that retain some context. It can be applied to some ASR (Automatic Speech Recognition) and TTS (Text to Speech) applications.
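Two of the ideas above, Mu-law encoding/decoding and chunking with overlap, can be sketched in plain Python. The function names and parameters are hypothetical; the Mu-law functions use the standard companding formula with mu = 255 and operate on a single sample in the range -1.0 to 1.0.

```python
import math


def mu_law_encode(x, mu=255):
    """Compand a sample in [-1, 1] and quantize it to an integer in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    companded = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(math.floor((companded + 1.0) / 2.0 * mu + 0.5))


def mu_law_decode(code, mu=255):
    """Map an integer code in [0, mu] back to an approximate sample value."""
    y = 2.0 * code / mu - 1.0
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def split_with_overlap(samples, chunk_size, overlap):
    """Split a sequence into chunks where neighbours share `overlap` samples."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(samples), step):
        chunks.append(samples[start:start + chunk_size])
        # Stop once a chunk has reached the end; the last one may be shorter.
        if start + chunk_size >= len(samples):
            break
    return chunks
```

In this sketch the final chunk is left shorter when the audio length is not a multiple of the step. A real loader would have to decide whether to pad or drop it.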

References

1. Librosa library, used to load audio (wav) files and extract features from them.

2. MXNet Documentation.
