Resources for the Audio Application
Resources for Session 1
Dataset
For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.
- The two classes are "Dog barking" and "Fireworks".
- The duration of each sound is 5 seconds
- The sampling rate is 44.1 kHz
More details can be found in the original paper.
Visualisation - Listening to a few sounds
Audio examples: Dog, Fireworks
Latent Space
The 80 sounds have been embedded in a latent space using CNN14, a model from this paper: PANNs. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab1.npz file.
Work to do
Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.
What can you conclude about this matrix, and about the embeddings?
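As a starting point, here is a minimal sketch of the loading and distance-matrix steps, assuming scikit-learn and matplotlib are available. The key used to access the array is a guess: inspect `data.files` to see what the archive actually contains.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

# Load the embeddings; the key name is an assumption, check data.files first.
data = np.load("embeddings-audio-lab1.npz")
print(data.files)              # keys actually stored in the archive
X = data[data.files[0]]        # (80, embedding_dim) array of CNN14 embeddings

# Pairwise Euclidean distance matrix between all sounds
D = pairwise_distances(X, metric="euclidean")

# Visualize the matrix; if the sounds are ordered by class, a block structure should appear
plt.imshow(D, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix of CNN14 embeddings")
plt.show()
```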
Resources for Session 2
Dataset
For Lab Session 2, we use a subset of the Audio MNIST dataset. We have extracted only two digits, and the task we propose here is to classify the accent of the speaker.
Latent Space
We have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab2.npz file. This file is a dictionary whose entries are indexed by the keys "X_train", "y_train", "X_test", and "y_test". The class names (accents) are stored directly in the y vectors.
The file can be loaded using the following code snippet:
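The original snippet is not reproduced here; below is a minimal loading sketch, assuming only the keys listed above (allow_pickle is set in case the label vectors are stored as object arrays):

```python
import numpy as np

# np.load on a .npz archive returns a dictionary-like NpzFile object
data = np.load("embeddings-audio-lab2.npz", allow_pickle=True)

X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

print(X_train.shape, X_test.shape)   # wav2vec2 embeddings
print(np.unique(y_train))            # accent names stored directly in the label vectors
```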
Work to do
Perform classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.
As an example, using the K-Nearest Neighbours algorithm with K=1, you should obtain the following results:
| Accent | Precision | Recall | F1-score |
|---|---|---|---|
| Brasilian | 0.77 | 0.89 | 0.83 |
| Chinese | 0.63 | 0.80 | 0.71 |
| Egyptian_American? | 0.45 | 0.65 | 0.53 |
| English | 0.53 | 0.53 | 0.53 |
| French | 0.71 | 0.62 | 0.67 |
| German/Spanish | 0.60 | 0.63 | 0.62 |
| Italian | 0.60 | 0.60 | 0.60 |
| Levant | 0.64 | 0.45 | 0.53 |
| South African | 0.92 | 0.46 | 0.61 |
| South Korean | 1.00 | 0.80 | 0.89 |
| Spanish | 0.61 | 0.64 | 0.62 |
| Tamil | 0.65 | 0.71 | 0.68 |
You should be able to replicate this table using the classification_report function from scikit-learn:
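The original snippet is not reproduced here; a minimal sketch of the baseline, assuming scikit-learn's KNeighborsClassifier with K=1 and the X_train, y_train, X_test, y_test arrays loaded as above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# 1-nearest-neighbour baseline on the wav2vec2 embeddings
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Per-accent precision / recall / F1, as in the table above
print(classification_report(y_test, y_pred))
```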
Resources for Session 3
For Lab Session 3, we use another subset of the Audio MNIST dataset, as in Lab Session 2.
We have extracted only two digits, but not the same ones as in Lab Session 2, so there is no point trying to reuse the previous labels!
Your task is to perform unsupervised learning on the data.
Latent Space
As in Lab Session 2, we have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab3.npz file. This file is a dictionary whose entries are indexed by the keys "X_train" and "X_test" (no labels).
The file can be loaded using the following code snippet:
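The original snippet is not reproduced here; a minimal loading sketch, analogous to Session 2 but without labels:

```python
import numpy as np

# Load the unlabelled embeddings
data = np.load("embeddings-audio-lab3.npz")
X_train, X_test = data["X_train"], data["X_test"]
print(X_train.shape, X_test.shape)
```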
Using K-means clustering with K=20 clusters and random_state=0, you should obtain the following (unsupervised) clustering metrics.
| Name of subset | Inertia | Silhouette |
|---|---|---|
| X_train | 1066.602 | 0.128 |
| X_test | ~ | 0.098 |
Note that with scikit-learn, inertia is only computed by default when fitting the k-means model, which is why we do not report it for the test set.
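A minimal sketch of how these metrics could be reproduced, assuming scikit-learn's KMeans and silhouette_score, and X_train / X_test loaded as above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-means with 20 clusters and a fixed seed for reproducibility
kmeans = KMeans(n_clusters=20, random_state=0)
train_labels = kmeans.fit_predict(X_train)
test_labels = kmeans.predict(X_test)

# Inertia is computed on the training set during fitting
print("train inertia:", kmeans.inertia_)

# Silhouette scores on both subsets
print("train silhouette:", silhouette_score(X_train, train_labels))
print("test silhouette:", silhouette_score(X_test, test_labels))
```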
Resources for Session 4
Follow the link given on the main course page (An introduction to dealing with audio data).
Resources for Session 5
Goals for Session 5:
- manipulate raw audio data (visualization, listening, ...)
- generate embeddings using a pretrained convolutional neural network
- reproduce results from session 3
We have uploaded the audio files corresponding to the embeddings used in Session 3 in this zip file.
Warning
The size of the audios_lab3.zip file is 6 GB. Downloading it might take quite a while depending on your connection, so it is better to use a wired / Ethernet connection (e.g. the computers in the classroom).
Note
You can download the file directly from Python code, given its URL, by using the following code snippet. This is particularly useful if you use Google Colab, as it should be faster and the file will be placed directly on your Google Drive.
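The original snippet is not reproduced here; a minimal sketch using urllib from the standard library. The URL is a placeholder (use the link from the course page), and on Colab you would mount your Drive first and point the destination there:

```python
import urllib.request
import zipfile

# Placeholder URL: replace with the actual link to audios_lab3.zip from the course page
url = "https://example.com/audios_lab3.zip"
destination = "audios_lab3.zip"   # e.g. "/content/drive/MyDrive/audios_lab3.zip" on Colab

# Download the archive to disk (it is large, ~6 GB)
urllib.request.urlretrieve(url, destination)
print("Downloaded to", destination)

# Extract the audio files
with zipfile.ZipFile(destination) as zf:
    zf.extractall("audios_lab3")
```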
Things to do:
- Visualize the waveform and the spectrogram of a few audio files. You can use the librosa functions waveshow and specshow.
- Generate embeddings using panns_inference; you can use the inference method of the AudioTagging class and get the embedding output.
- Concatenate all embeddings into a numpy array, and reproduce the results from Session 3. A sketch of the first two steps is given below.
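A minimal sketch of the visualization and embedding steps for a single file, assuming librosa and panns_inference are installed. The file path is a placeholder, and the 32 kHz sampling rate reflects what PANNs models were trained on:

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from panns_inference import AudioTagging

# Placeholder path: replace with one of the extracted audio files
path = "audios_lab3/some_file.wav"

# PANNs (CNN14) models expect 32 kHz audio
audio, sr = librosa.load(path, sr=32000)

# Waveform and log-magnitude spectrogram
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(audio, sr=sr, ax=ax1)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
plt.show()

# Embedding extraction: inference expects a (batch, samples) array
at = AudioTagging(checkpoint_path=None, device="cpu")   # downloads the pretrained CNN14 checkpoint
clipwise_output, embedding = at.inference(audio[None, :])
print(embedding.shape)   # (1, embedding_dim), to be stacked over all files
```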
Warning
Depending on your setup, if this is too long to download / run, you can also follow the same process to regenerate embeddings for Session 1. For this, you need to:
- Clone the ESC-50 repository.
- Read the meta.csv to select only the files corresponding to the target classes Dog Bark and Fireworks.
- Generate embeddings from the selected audio files.
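A possible sketch of the selection step, assuming the metadata file sits at meta/esc50.csv in the cloned repository and that the two target classes appear as "dog" and "fireworks" in its category column (check the actual file to confirm both assumptions):

```python
import pandas as pd

# ESC-50 metadata: path and category names are assumptions, verify them in the cloned repo
meta = pd.read_csv("ESC-50/meta/esc50.csv")
subset = meta[meta["category"].isin(["dog", "fireworks"])]

# Paths of the audio files to embed
files = ["ESC-50/audio/" + name for name in subset["filename"]]
print(len(files), "files selected")
```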