
Resources for the Audio Application

This application is based on sound recordings, mostly from the Silent Cities Dataset (except for Session 1), from which we extract sub-datasets specifically for this course.

Resources for Session 5

Goals for Session 5:

  • manipulate raw audio data (visualization, listening, ...)
  • generate embeddings using a pretrained convolutional neural network
  • reproduce results from Session 3

We have uploaded the audio files corresponding to the embeddings used in Session 3 in this zip file.

Warning

The audios_lab3.zip file is 6 GB. Depending on your connection, the download may take quite a while; prefer a wired/Ethernet connection (e.g. the computers in the classroom).

Note

You can download the file directly from Python code, given its URL, by using the following code snippet. This is particularly useful if you use Google Colab, as the download should be faster and the file can be saved directly to your Google Drive.

import requests

url = 'https://filesender.renater.fr/?s=download&token=4d56a385-97d5-41b5-927c-ab2f7d71bb05'
local_filename = 'file.zip'

def download_file(url, local_filename):
    # Send a GET request to the URL
    with requests.get(url, stream=True) as response:
        response.raise_for_status()  # Check if the request was successful
        # Open a local file with write-binary mode
        with open(local_filename, 'wb') as file:
            # Write the content of the response to the local file in chunks
            for chunk in response.iter_content(chunk_size=8192):
                file.write(chunk)

download_file(url, local_filename)
print(f"Downloaded {local_filename}") 

Things to do:

  • Visualize the waveform and the spectrogram of a few audio files. You can use the librosa functions waveshow and specshow.
  • Generate embeddings using panns_inference; you can use the inference method of the AudioTagging class and retrieve the embedding output (see the sketch below).
  • Concatenate all embeddings into a numpy array, and reproduce the results from Session 3.
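
Below is a minimal sketch of the first two steps, assuming the audio files from audios_lab3.zip have been extracted locally (the file name used here is a placeholder, and the resampling to 32 kHz matches what CNN14 expects; the inference method of AudioTagging takes a (batch, samples) array):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
from panns_inference import AudioTagging

path = 'audios_lab3/example.wav'        # placeholder: any extracted audio file
y, sr = librosa.load(path, sr=32000)    # CNN14 expects 32 kHz audio

# Waveform
plt.figure()
librosa.display.waveshow(y, sr=sr)
plt.title('Waveform')

# Spectrogram (dB scale)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
plt.figure()
librosa.display.specshow(S_db, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Spectrogram')
plt.show()

# Embedding extraction with panns_inference (CNN14 backbone)
at = AudioTagging(checkpoint_path=None, device='cpu')   # or 'cuda'
clipwise_output, embedding = at.inference(y[None, :])   # input shape: (batch, samples)
print(embedding.shape)                                  # expected: (1, 2048)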

Warning

Depending on your setup, if the download or the processing takes too long, you can apply the same process to regenerate the embeddings for Session 1 instead. For this, you need to:

  • Clone the ESC-50 repository.
  • Read the meta.csv to select only the files corresponding to the target classes Dog Bark and Fireworks (see the sketch below).
  • Generate embeddings from the selected audio files.
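
As a minimal sketch of the selection step (the path to the metadata file and the exact category strings are assumptions; check the category column of your local copy of the repository):

import pandas as pd

meta = pd.read_csv('ESC-50/meta/esc50.csv')     # path inside the cloned repository
print(meta['category'].unique())                # check the exact class names

selected = meta[meta['category'].isin(['dog', 'fireworks'])]
file_list = ['ESC-50/audio/' + f for f in selected['filename']]
print(f"{len(file_list)} files selected")

Embeddings can then be generated from file_list exactly as in the Session 5 sketch above.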

Resources for Session 3

The data for Lab Session 3 is based on sound recordings from the Silent Cities Dataset, site 059. A few features:

  • Audio was recorded automatically one minute every 10 minutes at 48 kHz using an Open Acoustics Audiomoth device
  • We kept all sounds recorded during daytime, excluding nighttime
  • One-minute excerpts are divided into 10-second segments
  • Total: 14250 segments

As for Lab 1, the sounds have been put in a latent space using CNN14, a deep learning model from this paper: PANNs.

Open the numpy array containing all samples in the latent space from the file embeddings-audio-lab3.npz. There is a single key in the dictionary, "X_train", yielding a 14250 x 2048 array of embeddings.
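
For instance (assuming the file sits in your working directory):

import numpy as np

data = np.load('embeddings-audio-lab3.npz')
X = data['X_train']
print(X.shape)    # expected: (14250, 2048)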

We also provide a metadata csv file that contains many other descriptors of the sounds, including date and time, "tags" (automatic recognition from another deep learning algorithm), as well as other indices (described in the paper). We recommend using Pandas to open the csv file and easily read its columns, for example:

import pandas as pd 

Df = pd.read_csv(path_to_csvfile)
y = Df['dB'] ## corresponds to sound pressure level in the sounds 

Check out the different columns, and use them to help visualize the results of your unsupervised learning techniques. One possible visualization is sketched below.
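
For example, assuming X (the embeddings) and Df (the metadata) were loaded as above, you can project the embeddings to 2D with PCA and color the points by a metadata column such as 'dB' (a sketch; PCA is just one possible projection):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

X_2d = PCA(n_components=2).fit_transform(X)    # X: the (14250, 2048) embedding array
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=Df['dB'], s=2, cmap='viridis')
plt.colorbar(label='dB (sound pressure level)')
plt.show()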

You can compute pseudo-labels, and use them to compute metrics against a ground truth, using the following code:

def find_largest_index(row):
    """
    Find the index of the largest value in the row.
    Parameters:
    - row (pandas.Series): A row of a DataFrame.
    Returns:
    - str: The index of the largest value in the row.
    """
    # Subset of the tags
    tags = ['tag_wind', 'tag_rain', 'tag_river', 'tag_wave', 'tag_thunder', 'tag_bird', 'tag_amphibian',
            'tag_insect', 'tag_mammal', 'tag_reptile', 'tag_walking', 'tag_cycling', 'tag_beep', 'tag_car',
            'tag_carhonk', 'tag_motorbike', 'tag_plane', 'tag_helicoptere', 'tag_boat', 'tag_others_motors',
            'tag_shoot', 'tag_bell', 'tag_talking', 'tag_music', 'tag_dog_bark', 'tag_kitchen_sounds',
            'tag_rolling_shutter', 'geophony', 'biophony', 'anthropophony']
    # Find the index of the largest value in the row
    max_index = row[tags].idxmax()
    # Return the index
    return max_index

# Apply the function to the DataFrame
Df['largest_index'] = Df.apply(find_largest_index, axis=1)

# Generate the pseudo-labels, based on the largest index
pseudo_labels = Df['largest_index'].values

Note that you may also compute other pseudo-labels, to see whether they align better with your clustering results; one example is sketched below.
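
For instance, a coarser pseudo-label can be obtained from the three broad categories that already appear among the metadata columns (a sketch, assuming Df was loaded as above):

# Coarse pseudo-labels: keep only the three broad categories
broad = ['geophony', 'biophony', 'anthropophony']
coarse_pseudo_labels = Df[broad].idxmax(axis=1).values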

Results can be estimated using several metrics (see the sklearn documentation for details):

  • Using labels, you can use the Rand index, homogeneity, completeness and v-measure.
  • Without labels, you may use the inertia and the silhouette score.

It is generally informative to use both kinds of metrics; a sketch of how to compute them is shown below.
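
Here is a sketch of both kinds of evaluation, assuming X (the embeddings) and pseudo_labels were obtained with the snippets above; the silhouette score on 14250 points of dimension 2048 can be slow, hence the subsampling:

from sklearn.cluster import KMeans
from sklearn.metrics import (rand_score, homogeneity_completeness_v_measure,
                             silhouette_score)

kmeans = KMeans(n_clusters=30, random_state=0).fit(X)
clusters = kmeans.labels_

# With ground truth (here: the pseudo-labels)
print('Rand index:', rand_score(pseudo_labels, clusters))
print('Homogeneity / completeness / v-measure:',
      homogeneity_completeness_v_measure(pseudo_labels, clusters))

# Without ground truth
print('Inertia:', kmeans.inertia_)
print('Silhouette:', silhouette_score(X, clusters, sample_size=2000, random_state=0))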

In the end, using the KMeans algorithm on these embeddings with \(K=30\) results in:

  • inertia: 101457.14
  • Rand index: 0.70

You should try varying \(K\) and see what happens!
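
For example, a simple sweep over a few values of K, tracking the inertia (a sketch, reusing X from above):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Ks = [5, 10, 20, 30, 50]
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in Ks]
plt.plot(Ks, inertias, marker='o')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.show()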

Resources for Session 2

Dataset

The data for Lab Session 2 is based on sound recordings from the Silent Cities Dataset, site 059. A few features:

  • Audio was recorded automatically one minute every 10 minutes at 48 kHz using an Open Acoustics Audiomoth device
  • We selected 100 sounds that include a bird song, and 100 sounds that do NOT include a bird song
  • The data was split for train and test (80/20)

Latent Space

As for Lab 1, the sounds have been put in a latent space using CNN14, a deep learning model from this paper: PANNs. We will delve into the details of Deep Learning and feature extraction starting from Course 4.

For now, you can just open the numpy array containing all samples in the latent space from the file embeddings-audio-lab2.npz. This file behaves like a dictionary, whose values are indexed by the keys "X_train", "y_train", "X_test", and "y_test". Class names are also included under the key "class_names".

The file can be loaded using the following code snippet:

import numpy as np

train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# Class names are also included
class_names = train_test_dataset['class_names']
print(f"Class names: {class_names}")

# X_train should have a shape of (160, 2048),
# i.e. (number of samples x embedding dimension).
# y_test should have a shape of (40,), i.e. the number of test samples.
# Each sample's label is encoded as an integer.
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Work to do

Perform the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details. A minimal sketch of one possible baseline is given below.
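
This sketch uses the k-nearest-neighbour classifier with K=10 that produces the reference results below (any classifier seen in the course works just as well); it assumes the arrays were loaded with the snippet above:

from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=10)
clf.fit(X_train, y_train)       # embeddings and integer labels loaded as above
y_pred = clf.predict(X_test)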

As an example, using the K-Nearest Neighbour algorithm with K=10 gives the following (poor!) results:

Class   Precision   Recall   F1-score
bird    0.52        0.48     0.50
other   0.37        0.41     0.17

You should be able to replicate these results using the function classification_report from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

This is quite a difficult example, as a default method gives results that may not be better than chance. Try to see how far you can get, especially for the "bird" class, as the "other" class may be very difficult. Good luck!

Resources for Session 1

Dataset

For Lab Session 1, we use all training examples from two classes of the ESC-50 dataset.

  • The two classes are "Dog barking" and "Fireworks".
  • Sounds are 5 seconds long
  • Sampling rate is 44.1 kHz

More details can be found in the original paper.

Visualisation - Listening to a few sounds

Audio examples of the "Dog" and "Fireworks" classes are embedded as players on the original course page.

Latent Space

The 80 sounds have been put in a latent space using CNN14, a deep learning model from this paper: PANNs. We will delve into the details of Deep Learning and feature extraction starting from Course 4.

For now, you can just open the numpy array containing all samples in the latent space from the file embeddings-audio-lab1.npz.

Work to do

Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page. A minimal sketch is given below.
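
This sketch assumes the embeddings are stored as a single array in embeddings-audio-lab1.npz (inspect data.files to find the actual key name in your copy):

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import cdist

data = np.load('embeddings-audio-lab1.npz')
print(data.files)                       # list the available keys
X = data[data.files[0]]                 # e.g. the (80, 2048) embedding array

D = cdist(X, X, metric='euclidean')     # pairwise distance matrix
plt.imshow(D, cmap='viridis')
plt.colorbar(label='Euclidean distance')
plt.title('Pairwise distance matrix')
plt.show()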