Naveen Swamy(@nswamy), Lai Wei
Currently, Apache MXNet does not provide an out-of-the-box way to load audio (wav) files into NDArrays and build an AudioDataset for simple audio tasks such as multi-class classification.
The user experience is therefore limited for machine learning tasks such as multi-class classification on audio data. Frameworks like PyTorch already support loading audio files and performing these tasks. Enabling this feature for at least one audio format would be a good start; it can later be extended to support multiple audio file formats and more extensive audio transforms, such as splitting an audio file and making randomized chunks from it.
Phase 1
As a user, I would like MXNet to provide an out-of-the-box audio data loader and some popular audio transforms that would allow me to:
load audio files (only .wav files supported currently) and make a Gluon AudioDataset (NDArrays),
apply some popular audio transforms on the audio data (for example scaling, MEL, MFCC, etc.),
load the Dataset using Gluon's DataLoader and train a neural network (e.g. an MLP) with this transformed audio dataset,
perform a simple audio task such as sound classification - 1 audio clip with 1 label (a multi-class sound classification problem).
have an end-to-end example for a task (Urban Sounds Classification) including:
reading audio files from a folder location (can be extended to an S3 bucket later) and loading them into the AudioDataset
applying audio transforms
training a model - a neural network - with the AudioDataset or DataLoader
performing the multi-class classification - conducting inference
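To make the folder-based loading step concrete, here is a minimal, pure-Python sketch (not the actual MXNet implementation, and the function name is hypothetical) of how a folder of class-named subdirectories can be enumerated into (filename, integer-label) pairs, which is the core of what the proposed AudioFolderDataset does:

```python
import os

def list_audio_files(root, ext=".wav"):
    """Walk root/<class_name>/<file>.wav and return (synsets, items).

    synsets is the sorted list of class-folder names; items is a list of
    (path, label) pairs where label is the index of the class in synsets.
    """
    # Each subdirectory of root is treated as one class.
    synsets = sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    )
    items = []
    for label, klass in enumerate(synsets):
        folder = os.path.join(root, klass)
        for fname in sorted(os.listdir(folder)):
            if fname.endswith(ext):
                items.append((os.path.join(folder, fname), label))
    return synsets, items
```

The real dataset class would additionally load each file into an NDArray on access; this sketch only shows the indexing scheme.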
Note: The plan is to have a working piece of this model, with an example, in the contrib package of MXNet before it is agreed upon to move this implementation to the gluon.data module.
There are a few questions/points on which I would like suggestions from the Apache MXNet community:
Class Design:
1. gluon.contrib.data.audio :
A package audio is introduced inside the gluon.contrib.data package to separate audio datasets from the vision-related ones. It also keeps the package naming consistent with the vision package, which houses vision-related datasets such as ImageFolderDataset, MNIST, FashionMNIST, ImageRecordDataset and so on.
2. gluon.contrib.data.audio.transforms :
This package will house some commonly used audio data transforms such as MFCC, Scale, PadOrTrim, Stereo2Mono and so on. The list can be extended as new transforms come in.
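As an illustration of one such transform, here is a hedged NumPy sketch of what PadOrTrim could do: fix every clip to the same length so clips can be batched together. The function name and exact behaviour are assumptions for illustration; the actual transform would be a Gluon Block operating on NDArrays.

```python
import numpy as np

def pad_or_trim(samples, max_len):
    """Zero-pad (or truncate) a 1-D array of audio samples to max_len."""
    if len(samples) >= max_len:
        # Clip is too long: keep only the first max_len samples.
        return samples[:max_len]
    # Clip is too short: right-pad with zeros (silence).
    padded = np.zeros(max_len, dtype=samples.dtype)
    padded[:len(samples)] = samples
    return padded
```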
Design Considerations:
The design considerations followed in this design are as follows:
```python
class AudioFolderDataset(gluon.data.dataset.Dataset):
    """A dataset for loading audio files stored in a folder structure like::

        root/children_playing/0.wav
        root/siren/23.wav

    OR audio (wav) files and a csv file that has each filename and its
    associated label.

    Parameters
    ----------
    root : str
        Path to root directory.
    transform : callable, default None
        A function that takes data and label and transforms them:
        transform = lambda data, label
    has_csv : bool, default True
        If True, a csv file holds each filename and its corresponding label.
        If False, a folder-like structure is expected.
    train_csv : str, default None
        If has_csv is True, train_csv should be populated by the training
        csv filename.

    Attributes
    ----------
    synsets : list
        List of class names. `synsets[i]` is the name for the integer label `i`.
    items : list of tuples
        List of all audio in (filename, label) pairs.
    """

    def _list_audio_files(self, root):
        """Populate the dataset, making tuples of (filename, label)."""

    def __getitem__(self, idx):
        """Retrieve the item (data, label) stored at idx in items.

        Note: data is a filename; label is the class to which it belongs,
        encoded using sklearn's label encoder.
        """

    def __len__(self):
        """Retrieve the number of items in the dataset."""
```
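The has_csv=True path described above can be sketched in pure Python. This is an illustrative stand-in, not the actual implementation: a train.csv holding "filename,label" rows is parsed into (filename, integer-label) items, with string labels encoded to integers (the design mentions sklearn's label encoder; a stdlib equivalent is used here for self-containment).

```python
import csv

def items_from_csv(train_csv):
    """Parse a csv of filename,label rows into (synsets, items)."""
    with open(train_csv, newline="") as f:
        rows = [(r[0], r[1]) for r in csv.reader(f)]
    # Encode string labels as integers, analogous to sklearn's LabelEncoder.
    synsets = sorted({label for _, label in rows})
    label_to_idx = {label: i for i, label in enumerate(synsets)}
    items = [(fname, label_to_idx[label]) for fname, label in rows]
    return synsets, items
```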
```python
class MFCC(Block):
    """Extracts Mel-frequency cepstral coefficients from the audio input."""

    def forward(self, x):
        """Apply the transform logic on the input x."""
```
```python
# Create a Dataset specifying the source folder of audio
tick = time.time()
aud_dataset = AudioFolderDataset('./Train', has_csv=True, train_csv='./train.csv')
tock = time.time()
print("Loading the dataset took ", (tock - tick), " seconds.")
```
```python
# Define the transforms to use
audio_transforms = Compose([gluon.contrib.data.audio.transforms.MFCC()])
```
```python
# Define the DataLoader, applying the transform composed above
audio_train_loader = gluon.data.DataLoader(
    aud_dataset.transform_first(audio_transforms, lazy=False),
    batch_size=32, shuffle=True)
```
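Conceptually, the DataLoader above shuffles the dataset indices once per epoch and yields fixed-size batches. A plain-Python sketch of that behaviour (illustrative only; the real DataLoader additionally stacks the samples of a batch into NDArrays and can prefetch with workers):

```python
import random

def iter_batches(dataset, batch_size, shuffle=True, seed=None):
    """Yield lists of up to batch_size items, optionally in shuffled order."""
    idx = list(range(len(dataset)))
    if shuffle:
        # Reshuffle the index order; a fixed seed makes the order repeatable.
        random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [dataset[i] for i in idx[start:start + batch_size]]
```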
Every proposed API, whether a class or a function, is built on top of the existing APIs and should not block or cause failures in existing regression tests.
There are also some performance considerations in this design:
The reason for passing lazy=False is that, currently, applying the transform one item at a time during training slows training down heavily. Instead, the transforms passed to the Dataset are applied when the dataset is handed to Gluon's DataLoader, before it is actually iterated for training.
Example Scenario:
For now, given a dataset of 5,435 audio files (.wav), with lazy=False the loading of audio data into the DataLoader takes 4 to 7 minutes and training an MLP (2 Dense layers with 256 nodes each) takes about 1 minute, so the end-to-end process from data loading to a trained model takes 7 to 8 minutes. The same transform applied lazily (lazy=True) takes 3 to 4 hours to produce a trained model, even though the data load itself is momentary.
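The cost difference comes down to how many times the transform runs. This toy pure-Python counter (illustrative only, no MXNet involved) shows that an eager transform runs once per item total, while a lazy transform re-runs on every pass over the data, i.e. once per item per epoch:

```python
def count_transform_calls(n_items, epochs, lazy):
    """Count how often a transform runs under eager vs lazy application."""
    calls = [0]

    def transform(x):
        calls[0] += 1          # record every invocation
        return x * 2           # stand-in for an expensive transform like MFCC

    dataset = list(range(n_items))
    # Eager (lazy=False): apply the transform once, up front.
    data = dataset if lazy else [transform(x) for x in dataset]
    for _ in range(epochs):
        for x in data:
            # Lazy (lazy=True): re-apply the transform on every iteration.
            _ = transform(x) if lazy else x
    return calls[0]
```

With 100 items and 10 epochs, the eager path runs the transform 100 times and the lazy path 1,000 times, which mirrors the minutes-versus-hours gap reported above.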
I am planning to test the new AudioFolderDataset class and transforms using a simple but popular audio task, Urban Sounds Classification, on a dataset of about 5,435 audio samples labelled with sounds such as 'siren', 'children_playing', 'dog_barking', 'street_music' and similar classes (10 labels in total).
One risk is the ability to find a library that handles both audio loading and feature extraction. Some libraries, such as pySox and libsox, provide a way to load audio, but their support for extracting audio features and applying operators is uneven.
The milestones could be split into two phases:
1) Phase 1 - implement this design and provide an example that uses the AudioFolderDataset to perform the sounds classification task, with minimal transforms applied to the audio dataset - Done
2) Phase 2 - a richer variety of audio transforms, usable with some important loss functions that are instrumental in dealing with audio data - Done
Current Implementation status:
Currently, this design has been used to build an example in MXNet Gluon that performs an audio task, multi-class classification. The PR that was opened is https://github.com/apache/incubator-mxnet/pull/13325.
PR Status - Merged.
Future Scope:
1. Librosa library to load audio (wav) files and extract features from them - here
2. MXNet Documentation - here