Resources for the Audio Application
Resources for Session 1
Dataset
For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.
- The two classes are "Dog barking" and "Fireworks".
- The duration of each sound is 5 seconds
- The sampling rate is 44.1 kHz
More details can be found in the original paper.
Visualisation - Listening to a few sounds
Audio examples: Dog, Fireworks
Latent Space
The 80 sounds have been embedded in a latent space using CNN14, a model from this paper: PANNs. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab1.npz file.
Work to do
Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.
What can you conclude about this matrix, and about the embeddings?
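As a starting point, here is a minimal sketch of the loading and distance-matrix steps, assuming scikit-learn and matplotlib are available. The key used to access the array is a guess: inspect `data.files` to see what the archive actually contains.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

# Load the embeddings; the key name is an assumption, check data.files first.
data = np.load("embeddings-audio-lab1.npz")
print(data.files)              # keys actually stored in the archive
X = data[data.files[0]]        # (80, embedding_dim) array of CNN14 embeddings

# Pairwise Euclidean distance matrix between all sounds
D = pairwise_distances(X, metric="euclidean")

# Visualize the matrix; if the sounds are ordered by class, a block structure should appear
plt.imshow(D, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix of CNN14 embeddings")
plt.show()
```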
Resources for Session 2
Dataset
For Lab Session 2, we use a subset of the Audio MNIST dataset. We have extracted only two digits, and the task we propose here is to classify the accent of the speaker.
Latent Space
We have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab2.npz file. This file is a dictionary whose entries are indexed by the keys "X_train", "y_train", "X_test", and "y_test". The class names (accents) are stored directly in the y vectors.
The file can be loaded using the following code snippet:
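The original snippet is not reproduced here; below is a minimal loading sketch, assuming only the keys listed above (allow_pickle is set in case the label vectors are stored as object arrays):

```python
import numpy as np

# np.load on a .npz archive returns a dictionary-like NpzFile object
data = np.load("embeddings-audio-lab2.npz", allow_pickle=True)

X_train, y_train = data["X_train"], data["y_train"]
X_test, y_test = data["X_test"], data["y_test"]

print(X_train.shape, X_test.shape)   # wav2vec2 embeddings
print(np.unique(y_train))            # accent names stored directly in the label vectors
```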
Work to do
Perform classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.
As an example, using the K-Nearest Neighbours algorithm with K=1, you should obtain the following results:
| Accent | Precision | Recall | F1-score |
|---|---|---|---|
| Brasilian | 0.77 | 0.89 | 0.83 |
| Chinese | 0.63 | 0.80 | 0.71 |
| Egyptian_American? | 0.45 | 0.65 | 0.53 |
| English | 0.53 | 0.53 | 0.53 |
| French | 0.71 | 0.62 | 0.67 |
| German/Spanish | 0.60 | 0.63 | 0.62 |
| Italian | 0.60 | 0.60 | 0.60 |
| Levant | 0.64 | 0.45 | 0.53 |
| South African | 0.92 | 0.46 | 0.61 |
| South Korean | 1.00 | 0.80 | 0.89 |
| Spanish | 0.61 | 0.64 | 0.62 |
| Tamil | 0.65 | 0.71 | 0.68 |
You should be able to replicate this table using the classification_report function from scikit-learn:
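The original snippet is not reproduced here; a minimal sketch of the baseline, assuming scikit-learn's KNeighborsClassifier with K=1 and the X_train, y_train, X_test, y_test arrays loaded as above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# 1-nearest-neighbour baseline on the wav2vec2 embeddings
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Per-accent precision / recall / F1, as in the table above
print(classification_report(y_test, y_pred))
```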
Resources for Session 3
For Lab Session 3, we use another subset of the Audio MNIST dataset, as in Lab Session 2.
We have extracted only two digits, but not the same ones as in Lab Session 2, so there is no point trying to reuse the previous labels!
Your task is to perform unsupervised learning on the data.
Latent Space
As in Lab Session 2, we have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can just open the numpy array containing all samples in the latent space from the embeddings-audio-lab3.npz file. This file is a dictionary whose entries are indexed by the keys "X_train" and "X_test" (no labels).
The file can be loaded using the following code snippet:
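The original snippet is not reproduced here; a minimal loading sketch, analogous to Session 2 but without labels:

```python
import numpy as np

# Load the unlabelled embeddings
data = np.load("embeddings-audio-lab3.npz")
X_train, X_test = data["X_train"], data["X_test"]
print(X_train.shape, X_test.shape)
```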
Using K-means clustering with K=20 clusters and random_state=0, you should obtain the following (unsupervised) clustering metrics.
| Name of subset | Inertia | Silhouette |
|---|---|---|
| X_train | 1066.602 | 0.128 |
| X_test | ~ | 0.098 |
Note that with scikit-learn, inertia is only computed by default when fitting the k-means model, which is why we do not report it for the test set.
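A minimal sketch of how these metrics could be reproduced, assuming scikit-learn's KMeans and silhouette_score, and X_train / X_test loaded as above:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# K-means with 20 clusters and a fixed seed for reproducibility
kmeans = KMeans(n_clusters=20, random_state=0)
train_labels = kmeans.fit_predict(X_train)
test_labels = kmeans.predict(X_test)

# Inertia is computed on the training set during fitting
print("train inertia:", kmeans.inertia_)

# Silhouette scores on both subsets
print("train silhouette:", silhouette_score(X_train, train_labels))
print("test silhouette:", silhouette_score(X_test, test_labels))
```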
Resources for Session 4
Follow the link given on the main course page (An introduction to dealing with audio data).
Resources for Session 5
Goals for Session 5:
- manipulate raw audio data (visualization, listening, ...)
- generate embeddings using a pretrained convolutional neural network
- reproduce results from session 3
We have uploaded the audio files corresponding to the embeddings used in Session 3 in this zip file.
Warning
The size of the audios_lab3.zip file is 6 GB. Downloading it might take quite a while depending on your connection, so it is better to use a wired / Ethernet connection (e.g. the computers in the classroom).
Note
You can download the file directly from Python code, given its URL, by using the following code snippet. This is particularly useful if you use Google Colab, as it should be faster and the file will be placed directly on your Google Drive.
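The original snippet is not reproduced here; a minimal sketch using urllib from the standard library. The URL is a placeholder (use the link from the course page), and on Colab you would mount your Drive first and point the destination there:

```python
import urllib.request
import zipfile

# Placeholder URL: replace with the actual link to audios_lab3.zip from the course page
url = "https://example.com/audios_lab3.zip"
destination = "audios_lab3.zip"   # e.g. "/content/drive/MyDrive/audios_lab3.zip" on Colab

# Download the archive to disk (it is large, ~6 GB)
urllib.request.urlretrieve(url, destination)
print("Downloaded to", destination)

# Extract the audio files
with zipfile.ZipFile(destination) as zf:
    zf.extractall("audios_lab3")
```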
Things to do:
- Visualize the waveform and the spectrogram of a few audio files. You can use the librosa functions waveshow and specshow.
- Generate embeddings using panns_inference; you can use the inference method of the AudioTagging class and get the embedding output.
- Concatenate all embeddings into a numpy array, and reproduce the results from Session 3. A sketch of the first two steps is given below.
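A minimal sketch of the visualization and embedding steps for a single file, assuming librosa and panns_inference are installed. The file path is a placeholder, and the 32 kHz sampling rate reflects what PANNs models were trained on:

```python
import librosa
import librosa.display
import numpy as np
import matplotlib.pyplot as plt
from panns_inference import AudioTagging

# Placeholder path: replace with one of the extracted audio files
path = "audios_lab3/some_file.wav"

# PANNs (CNN14) models expect 32 kHz audio
audio, sr = librosa.load(path, sr=32000)

# Waveform and log-magnitude spectrogram
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(audio, sr=sr, ax=ax1)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(audio)), ref=np.max)
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax2)
plt.show()

# Embedding extraction: inference expects a (batch, samples) array
at = AudioTagging(checkpoint_path=None, device="cpu")   # downloads the pretrained CNN14 checkpoint
clipwise_output, embedding = at.inference(audio[None, :])
print(embedding.shape)   # (1, embedding_dim), to be stacked over all files
```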
Warning
Depending on your setup, if this is too long to download / run, you can also follow the same process to regenerate embeddings for Session 1. For this, you need to:
- Clone the ESC-50 repository.
- Read the meta.csv to select only the files corresponding to the target classes Dog Bark and Fireworks.
- Generate embeddings from the selected audio files.
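A possible sketch of the selection step, assuming the metadata file sits at meta/esc50.csv in the cloned repository and that the two target classes appear as "dog" and "fireworks" in its category column (check the actual file to confirm both assumptions):

```python
import pandas as pd

# ESC-50 metadata: path and category names are assumptions, verify them in the cloned repo
meta = pd.read_csv("ESC-50/meta/esc50.csv")
subset = meta[meta["category"].isin(["dog", "fireworks"])]

# Paths of the audio files to embed
files = ["ESC-50/audio/" + name for name in subset["filename"]]
print(len(files), "files selected")
```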