
Resources for the Text Application

This application is based on Wikipedia, from which a set of articles was extracted. It is a Kaggle dataset, not a dataset we created for this class.

Resources for Session 5

First, you should have a look at the DBpedia Classes dataset, which is based on Wikipedia articles. It is a Kaggle dataset and a popular baseline for text classification tasks.

You can download it using kagglehub

!pip install kagglehub
import kagglehub
# Download latest version
path = kagglehub.dataset_download("danofer/dbpedia-classes")
print("Path to dataset files:", path)

Using pandas.read_csv, you can read the dataset and visualize its structure (for example with data.head()).

import pandas as pd
import numpy as np
import sklearn
import matplotlib

# !! TO DO
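For example, a sketch of how this could look; the file name DBPEDIA_train.csv is an assumption, so check the actual files listed under path:

import os

# See which files kagglehub downloaded, then load one split with pandas
print(os.listdir(path))
data = pd.read_csv(os.path.join(path, "DBPEDIA_train.csv"))  # adjust the file name if needed
print(data.head())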

Here is an example of how to select the high-level classes:

high_level_classes = data["l1"].unique()
high_level_classes

The DBpedia dataset has three levels of granularity, going from high (l1) to low (l3). We are interested in extracting 100 examples for each of the four classes of Labs 2 and 3, belonging to the high-level class Species.

level = 'l3'
l1_class= 'Species'
nb_samples = 100

class_one = "Conifer"
class_two = "Fern"
class_three = "Moss"
class_four = "Grape"

# !!! TO DO
# Select all the "Species" Classes using the .unique() built in function
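One possible way to do this (a sketch, using the column names shown above):

# All l3 classes whose high-level (l1) class is 'Species'
species_classes = data[data["l1"] == l1_class][level].unique()
print(species_classes)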
We now have to perform a bit of preprocessing on the raw text in order to remove some words from the descriptions that would make the classification task too trivial (e.g. the name of the class itself), using the following function.

def remove_words_from_descriptions(descriptions, words_to_remove, max_length=None):
    """
    Remove specific words from a list of descriptions and limit the length of the final description.

    Parameters:
    descriptions (list of str): List of descriptions.
    words_to_remove (list of str): List of words to remove from the descriptions.
    max_length (int, optional): Maximum length of the final description. Defaults to None.

    Returns:
    list of str: List of descriptions with the specified words removed and limited in length.
    """
    words_to_remove_set = set(word.lower() for word in words_to_remove)  # lowercase so the match is case-insensitive
    cleaned_descriptions = []

    for description in descriptions:
        cleaned_description = ' '.join(
            word for word in description.split() 
            if word.lower().strip('.,!?;:()[]{}"\'') not in words_to_remove_set
        )
        if max_length is not None:
            cleaned_description = ' '.join(cleaned_description.split()[:max_length])
        cleaned_descriptions.append(cleaned_description)

    return cleaned_descriptions
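For instance, on a toy description (the expected output is shown as a comment):

example = ["The Conifer is a cone-bearing seed plant found in many forests."]
print(remove_words_from_descriptions(example, ["Conifer"], max_length=5))
# ['The is a cone-bearing seed']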
max_length = 30  # keep only the first 30 words of each description, as in the Lab 2 dataset

class_one_stop_words = [class_one, class_two, class_three, class_four]
class_two_stop_words = [class_one, class_two, class_three, class_four]
class_three_stop_words = [class_one, class_two, class_three, class_four]
class_four_stop_words = [class_one, class_two, class_three, class_four]

all_class_one = data[data[level] == class_one]
random_class_one_subset = all_class_one.sample(n=nb_samples, random_state=42)
class_one_description = random_class_one_subset["text"].values
class_one_description = remove_words_from_descriptions(class_one_description, class_one_stop_words, max_length)

all_class_two = data[data[level] == class_two]
random_class_two_subset = all_class_two.sample(n=nb_samples, random_state=42)
class_two_description = random_class_two_subset["text"].values
class_two_description = remove_words_from_descriptions(class_two_description, class_two_stop_words, max_length)

all_class_three = data[data[level] == class_three]
random_class_three_subset = all_class_three.sample(n=nb_samples, random_state=42)
class_three_description = random_class_three_subset["text"].values
class_three_description = remove_words_from_descriptions(class_three_description, class_three_stop_words, max_length)
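The fourth class is handled the same way; here is a sketch mirroring the code above:

all_class_four = data[data[level] == class_four]
random_class_four_subset = all_class_four.sample(n=nb_samples, random_state=42)
class_four_description = random_class_four_subset["text"].values
class_four_description = remove_words_from_descriptions(class_four_description, class_four_stop_words, max_length)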

Now that you have preprocessed the raw texts for the selected classes, you can load the pre-trained RoBERTa model from Hugging Face.

from transformers import AutoTokenizer, RobertaModel
import torch
import tqdm

checkpoint_name = "FacebookAI/roberta-base"

# Run on GPU if one is available, otherwise fall back to the CPU
device_general = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
model = RobertaModel.from_pretrained(checkpoint_name).to(device_general)
Define a function embed_text that takes as input the text to embed, the model and the tokenizer, and returns the CLS token embedding (outputs.last_hidden_state[:, 0, :]) as well as the full embeddings (outputs.last_hidden_state). What is the difference between them? What is the CLS token useful for?

# !! TO DO
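A minimal sketch of what such a function could look like (it assumes the model and tokenizer loaded above; the exact return format is up to you):

def embed_text(texts, model, tokenizer):
    # Tokenize a batch of texts and move the tensors to the same device as the model
    inputs = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs)
    full_embeddings = outputs.last_hidden_state          # one vector per token: (batch, seq_len, 768)
    cls_embeddings = outputs.last_hidden_state[:, 0, :]  # the first (<s>/CLS) token: (batch, 768)
    return cls_embeddings.cpu().numpy(), full_embeddings.cpu().numpy()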

Compute the embeddings for the four classes using the compute_embeddings function, then concatenate the embeddings of the different classes and save them.

def compute_embeddings(list_text, name_file_to_save, batch_size=10):

    # load_model() is assumed to be a small helper wrapping the loading step above,
    # returning the RoBERTa model, its tokenizer and the checkpoint name
    model, tokenizer, checkpoint_name = load_model()
    all_embeddings = []
    for i in tqdm.tqdm(range(0, len(list_text), batch_size)):

        # slicing past the end of the list is safe, so this also covers the last (smaller) batch;
        # embed_text returns (CLS embeddings, full embeddings) and we keep only the CLS vectors
        text_embeddings, _ = embed_text(list(list_text[i:i + batch_size]), model, tokenizer)

        all_embeddings.append(text_embeddings)

    # folder_store_data_path should point to the directory where you store your data
    np.savez_compressed(f'{folder_store_data_path}/{name_file_to_save}.npz', embeddings=np.vstack(all_embeddings))

    return np.vstack(all_embeddings)

batch_size = 1
# Compute the embeddings for class one, two and three
class_one_embeddings = compute_embeddings(class_one_description, f"{class_one.lower()}_description_embeddings", batch_size=batch_size)

# ......
# !! TO DO
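One possible way to finish this step (a sketch; it assumes class_four_description has been built like the other three, and encodes the labels as 0-3):

# Embeddings for the remaining classes, then one big array with matching labels
class_two_embeddings = compute_embeddings(class_two_description, f"{class_two.lower()}_description_embeddings", batch_size=batch_size)
class_three_embeddings = compute_embeddings(class_three_description, f"{class_three.lower()}_description_embeddings", batch_size=batch_size)
class_four_embeddings = compute_embeddings(class_four_description, f"{class_four.lower()}_description_embeddings", batch_size=batch_size)

X = np.vstack([class_one_embeddings, class_two_embeddings, class_three_embeddings, class_four_embeddings])
y = np.concatenate([np.full(nb_samples, label) for label in range(4)])
np.savez_compressed(f'{folder_store_data_path}/all_description_embeddings.npz', X=X, y=y)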

Now, using the same approach as in Lab 2, split your embeddings into a training and a test set (test_size=0.2) and check that you obtain similar classification performance. You can also use an unsupervised approach (UMAP, t-SNE) to visualize them.
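A possible sketch, reusing the X and y arrays built above (the stratified split and the t-SNE settings are just reasonable defaults):

from sklearn.model_selection import train_test_split
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Supervised check: same protocol as Lab 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Unsupervised check: 2D visualization of all the embeddings with t-SNE
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=10)
plt.title("t-SNE of the RoBERTa description embeddings")
plt.show()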

Resources for Session 4

TODO: Define a dataset and a dataloader. The rest should follow.
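A minimal sketch of such a Dataset/DataLoader pair, assuming you reuse the Lab 2 embeddings file described in the Session 2 section below (the class name and batch size are illustrative):

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class EmbeddingDataset(Dataset):
    """Wraps precomputed RoBERTa embeddings and their integer labels."""
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

data = np.load("embeddings-text-lab2.npz")
train_set = EmbeddingDataset(data["X_train"], data["y_train"])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

for batch_X, batch_y in train_loader:
    ...  # forward pass, loss, backward, optimizer step go here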

Resources for Session 3

Use the same data as for Session 2 in order to perform unsupervised learning. If you use clustering, you can use the labels to evaluate the quality of your clustering.

Because this is unsupervised, you should use both X_train and X_test in your model, concatenated into one large X. If you want to compare results between the supervised model of Lab 2 and the unsupervised model of Lab 3, though, don't forget to compare on the same data (i.e. the test set)!

Results can be estimated using several metrics (see the sklearn documentation for details):

  • Using labels, you can use the Rand index, homogeneity, completeness and V-measure.
  • Without labels, you may use the inertia and the silhouette score.

It is generally informative to use both these metrics.

In the end, using the KMeans algorithm on all the embeddings with \(K=3\) results in:

  • inertia: 84.78
  • Rand index: 0.55

You should try varying \(K\) and see what happens!
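For reference, a minimal sketch of this baseline (it assumes X_train, X_test, y_train and y_test from Lab 2 are already loaded):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, homogeneity_score, silhouette_score

# Unsupervised: use all the embeddings (train + test concatenated)
X = np.concatenate([X_train, X_test])
y = np.concatenate([y_train, y_test])

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)

print("inertia:", kmeans.inertia_)                           # no labels needed
print("silhouette:", silhouette_score(X, kmeans.labels_))    # no labels needed
print("Rand index:", rand_score(y, kmeans.labels_))          # uses the true labels
print("homogeneity:", homogeneity_score(y, kmeans.labels_))  # uses the true labels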

Resources for Session 2

Dataset

Main features:

  • 300 texts, containing only the first 30 words of each description.
  • Of the 300 texts, 100 are from the "Conifer" class, 100 from the "Fern" class, and 100 from the "Moss" class.
  • "Conifer", "Fern" and "Moss" are respectively labelled as class "0", "1", and "2".
  • For each class, 80 out of the 100 examples are used for training, and 20 for testing. The training/test split was made randomly, and examples are already given to you as training and testing ones.

Latent Space

As for Lab 1, the 300 texts have been put in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of Deep Learning in courses 4 and 5.

The training and testing datasets are available in the embeddings-text-lab2.npz file. This file is a dictionary, whose values are indexed as "X_train", "y_train", "X_test", and "y_test".

In that regard, the files are to be loaded using the following code snippet:

import numpy as np

train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# X_train should have a shape of (240, 768), i.e. (number of samples x embedding dimension)
# y_test should have a shape of (60,), i.e. the number of test samples. Each sample's label is given as an integer.
print(f"Shape of X_train: {X_train.shape}"), print(f"Shape of y_test: {y_test.shape}")

Work to do

Compute the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.

As an example, using the K-Nearest Neighbour algorithm with K=10 for classification already achieves good performance, with an accuracy of 0.7. Detailed results can be found in the following table:

Class  Precision  Recall  F1-score
0      0.71       0.75    0.73
1      0.71       0.60    0.65
2      0.68       0.75    0.71

You should be able to replicate these results using the function classification_report from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
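For instance, a sketch of the K-NN baseline mentioned above (K=10), using the arrays loaded earlier:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(classification_report(y_test, y_pred))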

Resources for Session 1

Dataset

Main features:

  • 200 texts
  • Of the 200 texts, 100 are from the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.

Visualisation of a few examples

These are some samples in the dataset. They are extracted from Wikipedia pages:

  • 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'

  • 'Joey Altman is an American chef, restaurateur, TV host and writer.'

  • 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'

Latent Space

The 200 texts have been put in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of Deep Learning in courses 4 and 5.

For now, you can just open the NumPy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key in the dictionary.

Work to do

Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.
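As a starting point, a possible sketch (the Euclidean metric is an assumption; any distance can be used):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

# Load the Lab 1 embeddings and compute the pairwise distance matrix
embeddings = np.load("embeddings-text-lab1.npz")["embeddings"]
dist = pairwise_distances(embeddings, metric="euclidean")

plt.imshow(dist, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distances between the 200 text embeddings")
plt.show()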