
Resources for the Audio Application

Resources for Session 1

Dataset

For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.

  • The two classes are "Dog barking" and "Fireworks".
  • Each sound is 5 seconds long
  • Sampling rate is 44.1 kHz

More details can be found in the original paper.

Visualisation - Listening to a few sounds

Audio examples: Dog, Fireworks

Latent Space

The 80 sounds have been embedded in a latent space using CNN14, a model from this paper: PANNs. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.

For now, you can just open the numpy array containing all samples in the latent space from the file embeddings-audio-lab1.npz.
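If you want to check what the archive contains before working with it, here is a minimal sketch (assuming the file sits in your working directory):

import numpy as np

# Load the .npz archive and list the arrays it contains
data = np.load("embeddings-audio-lab1.npz")
print(data.files)

# Print the shape of each stored array
for key in data.files:
    print(key, data[key].shape)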

Work to do

Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.

What can you conclude about this matrix, and about the embeddings?
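As a starting point, here is a minimal sketch of the distance matrix computation, assuming the embeddings have been loaded into an array X of shape (80, embedding dimension); it uses scikit-learn's pairwise_distances and matplotlib, but any equivalent tools work:

import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

# X: (80, d) array with one embedding per sound (loaded from the .npz above)
D = pairwise_distances(X, metric="euclidean")  # (80, 80) distance matrix

# Visualize the matrix; a clear block structure would suggest that the two
# classes are well separated in the latent space
plt.imshow(D, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distances between the 80 embeddings")
plt.show()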

Resources for Session 2

Dataset

For Lab Session 2, we use a subset of the Audio MNIST dataset. We have extracted only two digits, and the task we propose here is to classify the accent of the speaker.

Latent Space

We have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.

For now, you can just open the numpy array containing all samples in the latent space from the file embeddings-audio-lab2.npz. This file is a dictionary whose entries are indexed by the keys "X_train", "y_train", "X_test", and "y_test". The names of the different classes (accents) are directly the values stored in the y vectors.

In that regard, the files are to be loaded using the following code snippet:

import numpy as np

# Load the .npz archive containing the train/test embeddings
train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# The class names (accents) are directly stored in the label vectors
class_names = np.unique(y_train)
print(f"Class names : {class_names}")

# X_train should have a shape of
# (1200, 768), i.e. (number of samples x embedding dimension)
# y_test should have a shape of (300,), i.e. the number of test samples.
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Work to do

Perform the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.

As an example, using the K-Nearest Neighbour algorithm with K=1, you should obtain the following results:

Accent              Precision  Recall  F1-score
Brasilian                0.77    0.89      0.83
Chinese                  0.63    0.80      0.71
Egyptian_American?       0.45    0.65      0.53
English                  0.53    0.53      0.53
French                   0.71    0.62      0.67
German/Spanish           0.60    0.63      0.62
Italian                  0.60    0.60      0.60
Levant                   0.64    0.45      0.53
South African            0.92    0.46      0.61
South Korean             1.00    0.80      0.89
Spanish                  0.61    0.64      0.62
Tamil                    0.65    0.71      0.68
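A minimal sketch of this K=1 baseline, assuming scikit-learn's KNeighborsClassifier (the exact numbers may differ slightly depending on your preprocessing):

from sklearn.neighbors import KNeighborsClassifier

# Fit a 1-nearest-neighbour classifier on the training embeddings
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Predict the accent of each test embedding
y_pred = knn.predict(X_test)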

You should be able to replicate this table using the function classification_report from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Resources for Session 3

For Lab Session 3, we use another subset of the Audio MNIST dataset, as in Lab Session 2.

We have extracted only two digits, but not the same ones as in Lab Session 2, so it is useless to try to reuse the previous labels!

Your task is to perform unsupervised learning on the data.

Latent Space

As in Lab Session 2, we have put the data into the latent space of an audio foundation model, wav2vec2, using the HuggingFace models wrapper. We will delve into the details of deep learning and feature extraction from course 4 onwards.

For now, you can just open the numpy array containing all samples in the latent space from the file embeddings-audio-lab3.npz. This file is a dictionary whose entries are indexed by the keys "X_train" and "X_test" (no labels).

In that regard, the files are to be loaded using the following code snippet:

import numpy as np

# Load the .npz archive containing the train/test embeddings (no labels)
train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB3)

X_train, X_test = train_test_dataset['X_train'], train_test_dataset['X_test']

# X_train should have a shape of
# (1200, 768), i.e. (number of samples x embedding dimension)
# X_test should have a shape of
# (300, 768), i.e. (number of samples x embedding dimension)
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")

Using K-means clustering with K=20 clusters and random_state=0, you should obtain the following (unsupervised) clustering metrics.

Name of subset  Inertia    Silhouette
X_train         1066.602   0.128
X_test          ~          0.098

Note that scikit-learn only computes inertia on the data used to fit the K-means model, which is why we do not report it for the test set.
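A minimal sketch of this clustering baseline, assuming scikit-learn's KMeans and silhouette_score (the exact values may vary slightly with the scikit-learn version, e.g. because of the default n_init):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-means with 20 clusters on the training embeddings
kmeans = KMeans(n_clusters=20, random_state=0)
train_labels = kmeans.fit_predict(X_train)

# Inertia is computed on the data used for fitting
print(f"Train inertia: {kmeans.inertia_:.3f}")
print(f"Train silhouette: {silhouette_score(X_train, train_labels):.3f}")

# For the test set, only assign samples to the learned clusters
test_labels = kmeans.predict(X_test)
print(f"Test silhouette: {silhouette_score(X_test, test_labels):.3f}")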

Resources for Session 4

Follow the link given on the main course page (An introduction to dealing with audio data).

Resources for Session 5

Goals for Session 5:

  • manipulate raw audio data (visualization, listening, ...)
  • generate embeddings using a pretrained convolutional neural network
  • reproduce results from session 3

We have uploaded the audio files corresponding to the embeddings used in Session 3 in this zip file.

Warning

The audios_lab3.zip file is 6 GB. It might take quite long to download depending on your connection; prefer a wired / Ethernet connection (e.g. the computers in the classroom).

Note

You can download the file directly from Python code, given a URL, using the following code snippet. This is particularly useful if you use Google Colab, as it should be faster, and you can store the file directly on your Google Drive.

import requests

url = 'https://filesender.renater.fr/?s=download&token=4d56a385-97d5-41b5-927c-ab2f7d71bb05'
local_filename = 'file.zip'

def download_file(url, local_filename):
    # Send a GET request to the URL
    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # Check if the request was successful
        # Open a local file with write-binary mode
        with open(local_filename, 'wb') as file:
            # Write the content of the response to the local file in chunks
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

download_file(url, local_filename)
print(f"Downloaded {local_filename}") 

Things to do:

  • Visualize the waveform and the spectrogram of a few audio files. You can use the librosa functions waveshow and specshow.
  • Generate embeddings using panns_inference; you can use the inference method of the AudioTagging class and take the embedding output.
  • Concatenate all embeddings into a numpy array, and reproduce the results from Session 3 (a sketch of these steps is given after this list).
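A minimal sketch of these steps is given below. The file paths are placeholders, and it assumes that the audio files from the zip are available locally and that panns_inference is installed; AudioTagging.inference returns both clip-wise predictions and an embedding:

import glob

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
from panns_inference import AudioTagging

# --- Visualize one file (placeholder path) ---
path = "audios_lab3/example.wav"
y, sr = librosa.load(path, sr=32000, mono=True)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(y, sr=sr, ax=ax1)  # waveform
S = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(S, sr=sr, x_axis="time", y_axis="log", ax=ax2)  # spectrogram
plt.show()

# --- Extract embeddings with PANNs (CNN14) ---
at = AudioTagging(checkpoint_path=None, device="cpu")  # downloads the default checkpoint

embeddings = []
for path in sorted(glob.glob("audios_lab3/*.wav")):
    y, _ = librosa.load(path, sr=32000, mono=True)          # PANNs expects 32 kHz audio
    clipwise_output, embedding = at.inference(y[None, :])   # add the batch dimension
    embeddings.append(embedding[0])

# Stack all embeddings into a single array of shape (n_files, 2048) for CNN14
X = np.stack(embeddings)
print(X.shape)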

Warning

Depending on your setup, if this takes too long to download / run, you can also apply the same process to regenerate the embeddings for Session 1. For this, you need to:

  • Clone the ESC-50 repository.
  • Read the metadata CSV to select only the files corresponding to the target classes Dog Bark and Fireworks.
  • Generate embeddings from the audio of the selected files (a sketch of the selection step is given below).
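A minimal sketch of the file selection step, assuming the repository has been cloned into ./ESC-50 (in the public repository the metadata file is meta/esc50.csv and the category names are lowercase, e.g. "dog" and "fireworks"):

import pandas as pd

# Read the ESC-50 metadata and keep only the two target classes
meta = pd.read_csv("ESC-50/meta/esc50.csv")
subset = meta[meta["category"].isin(["dog", "fireworks"])]

# Build the list of audio paths to feed to the embedding extraction above
paths = ["ESC-50/audio/" + name for name in subset["filename"]]
print(len(paths))  # 40 files per class, i.e. 80 in total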