# Resources for the Audio Application
## Resources for Session 1
### Dataset
For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.
- The two classes are "Dog barking" and "Fireworks".
- Each sound lasts 5 seconds.
- The sampling rate is 44.1 kHz.
More details can be found in the original paper.
### Visualisation - Listening to a few sounds
- Dog (audio example)
- Fireworks (audio example)
### Latent Space
The 80 sounds have been mapped to a latent space using CNN14, a model from the PANNs paper. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.
For now, you can just open the numpy array containing all samples in the latent space, stored in embeddings-audio-lab1.npz.
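For example, a minimal loading sketch (the file is assumed to be in your working directory; the key inside the archive is discovered at run time rather than hard-coded):

```python
import numpy as np

# Open the archive and list the arrays it contains
archive = np.load("embeddings-audio-lab1.npz")
print(archive.files)

# Grab the first (and presumably only) array: one embedding per sound
X = archive[archive.files[0]]
print(X.shape)   # expected: (80, embedding_dimension)
```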
### Work to do
Compute, visualize, and interpret the distance matrix, as explained on the Lab Session 1 main page.
What can you conclude about this matrix, and about the embeddings?
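As a starting point, a possible sketch using scipy and matplotlib (the Euclidean metric is an assumption; the embeddings are reloaded here so the snippet stands on its own):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

# Reload the embeddings (same file as above)
archive = np.load("embeddings-audio-lab1.npz")
X = archive[archive.files[0]]

# Pairwise Euclidean distances between all 80 embeddings
D = cdist(X, X, metric="euclidean")

plt.imshow(D, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Distance matrix of the 80 embeddings")
plt.show()
```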
## Resources for Session 2
### Dataset
For Lab Session 2, we use a subset of the Audio MNIST dataset. We have extracted only two digits, and the task we propose here is to classify the accent of the speaker.
### Latent Space
We have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space, stored in embeddings-audio-lab2.npz. This file is a dictionary whose values are indexed by "X_train", "y_train", "X_test", and "y_test". The names of the different classes (accents) are directly the ones stored in the y vectors.
The file can be loaded with a snippet along the following lines (a minimal sketch; adapt the path to wherever you saved the archive):
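```python
import numpy as np

# allow_pickle is only needed if the label vectors are stored as Python objects
data = np.load("embeddings-audio-lab2.npz", allow_pickle=True)

X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

print(X_train.shape, X_test.shape)
print(np.unique(y_train))   # the accent names are stored directly in the y vectors
```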
### Work to do
Perform the classification on this data, using the technique of your choice. Please refer to the Lab Session 2 main page for details.
As an example, using the K-Nearest Neighbours algorithm with K=1, you should obtain the following results:
| Accent             | Precision | Recall | F1-score |
|--------------------|-----------|--------|----------|
| Brasilian          | 0.77      | 0.89   | 0.83     |
| Chinese            | 0.63      | 0.80   | 0.71     |
| Egyptian_American? | 0.45      | 0.65   | 0.53     |
| English            | 0.53      | 0.53   | 0.53     |
| French             | 0.71      | 0.62   | 0.67     |
| German/Spanish     | 0.60      | 0.63   | 0.62     |
| Italian            | 0.60      | 0.60   | 0.60     |
| Levant             | 0.64      | 0.45   | 0.53     |
| South African      | 0.92      | 0.46   | 0.61     |
| South Korean       | 1.00      | 0.80   | 0.89     |
| Spanish            | 0.61      | 0.64   | 0.62     |
| Tamil              | 0.65      | 0.71   | 0.68     |
You should be able to replicate this table using the classification_report function from scikit-learn:
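For instance, a minimal sketch (the 1-nearest-neighbour classifier mirrors the baseline above; the archive path is an assumption):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

data = np.load("embeddings-audio-lab2.npz", allow_pickle=True)
X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

# 1-nearest-neighbour baseline on the wav2vec2 embeddings
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Per-accent precision / recall / F1, as in the table above
print(classification_report(y_test, y_pred))
```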
## Resources for Session 3
For Lab Session 3, we use another subset of the Audio MNIST dataset already used in Lab Session 2.
We have extracted only two digits, but not the same ones as in Lab Session 2, so there is no point in trying to reuse the previous labels!
Your task is to perform unsupervised learning on the data.
### Latent Space
As in Lab Session 2, we have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space, stored in embeddings-audio-lab3.npz. This file is a dictionary whose values are indexed by "X_train" and "X_test" (no labels).
The file can be loaded with a snippet along the following lines (a minimal sketch; adapt the path to wherever you saved the archive):
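```python
import numpy as np

data = np.load("embeddings-audio-lab3.npz")

X_train = data["X_train"]
X_test = data["X_test"]    # no label vectors for this session

print(X_train.shape, X_test.shape)
```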
Using K-means clustering with K=20 clusters and random_state=0, you should obtain the following (unsupervised) clustering metrics.
| Name of subset | Inertia  | Silhouette |
|----------------|----------|------------|
| X_train        | 1066.602 | 0.128      |
| X_test         | ~        | 0.098      |
Note that with scikit-learn, the inertia is only computed when fitting the k-means model, which is why we do not report it for the test set.
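A possible sketch for computing these metrics (the silhouette is obtained with scikit-learn's silhouette_score; test points are assigned to their nearest learned centroid):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = np.load("embeddings-audio-lab3.npz")
X_train, X_test = data["X_train"], data["X_test"]

# Fit K-means with 20 clusters on the training embeddings
km = KMeans(n_clusters=20, random_state=0)
train_labels = km.fit_predict(X_train)

print("train inertia   :", km.inertia_)
print("train silhouette:", silhouette_score(X_train, train_labels))

# Assign test points to the nearest centroid; inertia is not reported here
# since scikit-learn only computes it during fitting
test_labels = km.predict(X_test)
print("test silhouette :", silhouette_score(X_test, test_labels))
```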
## Resources for Session 4
Follow the link given on the main course page (An introduction to dealing with audio data).
## Resources for Session 5
Goals for Session 5:
- Download the MNIST-audio dataset using the HuggingFace datasets wrapper
- Visualize the waveform and the spectrogram of a few audio files. You can use the librosa functions waveshow and specshow (see the sketch after this list)
- Generate the embeddings using the HuggingFace models wrapper
- Concatenate all embeddings into a numpy array, and reproduce the results from Session 2.
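For the visualisation step, a minimal sketch with librosa (the file name is a placeholder; in practice the waveform and sampling rate would come from one entry of the dataset):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Placeholder: load one recording; with the HuggingFace dataset you would instead
# take ds["train"][0]["audio"]["array"] and ["sampling_rate"] (field names assumed)
waveform, sr = librosa.load("one_digit_recording.wav", sr=None)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(8, 6))

# Time-domain view
librosa.display.waveshow(waveform, sr=sr, ax=ax_wave)
ax_wave.set_title("Waveform")

# Log-magnitude spectrogram
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(waveform)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax_spec)
ax_spec.set_title("Spectrogram (dB)")

plt.tight_layout()
plt.show()
```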
In order to reproduce conditions similar to Session 2, there are a few additional steps to take:
- Once you obtain the embeddings, average each one over its first dimension, which corresponds to the temporal dimension.
- Keep only two digits out of ten. We won't tell you which ones, to prevent students from overfitting for Session 3.
- Remove the indices that correspond to the accent 'German', because this accent is over-represented.
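A possible end-to-end sketch is given below. The dataset identifier, the column names (audio, digit, accent), the pair of digits kept, and the wav2vec2 checkpoint are all assumptions to adapt from the links above.

```python
import numpy as np
import torch
from datasets import load_dataset, Audio
from transformers import AutoFeatureExtractor, AutoModel

# Placeholder identifier: use the Audio MNIST dataset linked from the main course page
ds = load_dataset("audio-mnist-identifier-from-the-course-page")

# Filter out the over-represented 'German' accent and keep two digits
# (column names and the chosen digits are illustrative)
ds = ds.filter(lambda ex: ex["accent"] != "German")
ds = ds.filter(lambda ex: ex["digit"] in (0, 1))

# wav2vec2 expects 16 kHz input; resample on the fly if needed
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = AutoModel.from_pretrained("facebook/wav2vec2-base")
model.eval()

embeddings, labels = [], []
for ex in ds["train"]:
    audio = ex["audio"]
    inputs = extractor(audio["array"], sampling_rate=audio["sampling_rate"],
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state      # shape (1, time, features)
    # Average over the temporal dimension, as required above
    embeddings.append(hidden.mean(dim=1).squeeze(0).numpy())
    labels.append(ex["accent"])

X = np.stack(embeddings)   # one row per recording, ready for the Session 2 pipeline
y = np.array(labels)
```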
Info
Use the different fields of the ds data structure in order to filter the relevant columns. Check the links of Lab 4 about audio.
Info
To listen to some audios, there are several possibilities:

- write an audio file as mp3 or another format using the soundfile.write function,
- use IPython Audio inside a Jupyter Notebook on the 1-D vector,
- use a dataset viewer directly from the HuggingFace dataset structure, such as the one on the main HF MNIST audio page.
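For the first two options, a small sketch (assuming ds is the dataset loaded above and that each entry exposes an audio field with array and sampling_rate):

```python
import numpy as np
import soundfile as sf
from IPython.display import Audio

# Take one recording from the (assumed) HuggingFace dataset structure
example = ds["train"][0]["audio"]
waveform = np.asarray(example["array"])
sr = example["sampling_rate"]

# Option 1: write the sample to disk (wav here; mp3 requires a libsndfile build that supports it)
sf.write("example_digit.wav", waveform, sr)

# Option 2: listen directly inside a Jupyter notebook
Audio(waveform, rate=sr)
```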