
Resources for the Text Application

This application is based on Wikipedia articles, from which a subset of short descriptions was extracted. It is a Kaggle dataset, not a dataset we created for this class.

Resources for Session 1

Dataset

Main features:

  • 200 texts
  • Of the 200 texts, 100 are from the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.

Visualisation of a few examples

These are some samples in the dataset. They are extracted from Wikipedia pages:

  • 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'

  • 'Joey Altman is an American chef, restaurateur, TV host and writer.'

  • 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'

Model used for the embeddings

The 200 texts have been embedded in a latent space using RoBERTa, a foundation model introduced in this paper. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.

For now, you can simply open the numpy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key of the dictionary.
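
For example, a minimal loading sketch (assuming the .npz file sits in your working directory; adapt the path otherwise):

import numpy as np

# Load the archive and read the array stored under the "embeddings" key
data = np.load("embeddings-text-lab1.npz")
embeddings = data["embeddings"]

# One embedding vector per text, so the first dimension should be 200
print(embeddings.shape)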

Work to do

Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.

What can you conclude about this matrix, and about the embeddings?
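
As a hedged sketch (one possible way, assuming the embeddings loaded above; Euclidean distance is used here, but cosine distance is another common choice):

from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt

# Pairwise distances between all 200 embeddings: a 200 x 200 symmetric matrix
dist_matrix = pairwise_distances(embeddings, metric="euclidean")

plt.imshow(dist_matrix, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix of the text embeddings")
plt.show()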

Resources for Session 2

Dataset

Main features:

  • 300 texts, containing only the first 30 words of each description.
  • Of the 300 texts, 100 are from the "Conifer" class, 100 from the "Fern" class, and 100 from the "Moss" class.
  • "Conifer", "Fern" and "Moss" are labelled as classes "0", "1", and "2", respectively.
  • For each class, 80 of the 100 examples are used for training and 20 for testing. The train/test split was made randomly, and the examples are already provided to you as training and testing sets.

Model used for the embeddings

As in Lab 1, the 300 texts have been embedded in a latent space using RoBERTa, a deep learning model introduced in this paper. We will delve into the details of deep learning in courses 4 and 5.

The training and testing datasets are available in the embeddings-text-lab2.npz file. This file is a dictionary whose values are indexed by the keys "X_train", "y_train", "X_test", and "y_test".

The file can therefore be loaded using the following code snippet:

import numpy as np

train_test_dataset = np.load(PATH_TO_EMBEDDINGS_LAB2)

X_train, X_test, y_train, y_test = train_test_dataset['X_train'], train_test_dataset['X_test'], train_test_dataset['y_train'], train_test_dataset['y_test']

# X_train should have a shape of (240, 768), i.e. (number of samples x embedding dimension)
# y_test should have a shape of (60,), i.e. the number of test samples. Each sample's label is an integer.
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Work to do

Perform the classification on this data, using the technique you chose. Please refer to the Lab Session 2 main page for details.

As an example, using the K-Nearest Neighbours algorithm with K=10 for classification already achieves good performance, with an accuracy of 0.70. Detailed results can be found in the following table:

Class   Precision   Recall   F1-score
0       0.71        0.75     0.73
1       0.71        0.60     0.65
2       0.68        0.75     0.71
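
As a hedged sketch (one possible way, assuming the X_train, y_train and X_test arrays loaded above), such a classifier can be fitted with scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

# Fit a K-Nearest Neighbours classifier with K=10 on the training embeddings
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)

# Predict the class of each test embedding
y_pred = knn.predict(X_test)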

You should be able to replicate these results using the function classification_report from scikit-learn:

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

Resources for Session 3

Dataset & Model used for the embeddings

You should use the same data as for Session 2. As a reminder, the data consists of 300 texts processed by the RoBERTa model.

Note on train and test data

Because this setting is unsupervised, the model does not need to be trained before inference. As a consequence, you may (and should) use both X_train and X_test in your model, concatenated into a single, larger X.
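
For instance, a minimal sketch (assuming the arrays loaded in Session 2; the concatenated labels y are kept only for evaluation later on):

import numpy as np

# Concatenate training and test embeddings into a single array for unsupervised learning
X = np.concatenate([X_train, X_test], axis=0)   # shape (300, 768)
y = np.concatenate([y_train, y_test], axis=0)   # class labels, used only for evaluation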

Warning

If you want to compare results between the supervised model of Lab 2 and the unsupervised model of Lab 3, though, don't forget to compare on the same data (i.e. the test dataset)!

Work to do

Perform unsupervised learning on the aforementioned dataset.

Note

If you use clustering, you can use the labels to evaluate the quality of your clustering.

Results can be estimated using several metrics (see the sklearn documentation for details):

  • Using labels, you can use the Rand index, homogeneity, completeness and v-measure.
  • Without labels, you may use the inertia and the silhouette score.

It is generally informative to use both kinds of metrics.
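
As a hedged sketch (one possible setup, assuming the concatenated X and y built above), KMeans and these metrics are all available in scikit-learn:

from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, homogeneity_score, silhouette_score

kmeans = KMeans(n_clusters=3, random_state=42)
cluster_labels = kmeans.fit_predict(X)

# Label-free metrics
print("Inertia:", kmeans.inertia_)
print("Silhouette:", silhouette_score(X, cluster_labels))

# Label-based metrics (the class labels are used only for evaluation)
print("Rand index:", rand_score(y, cluster_labels))
print("Homogeneity:", homogeneity_score(y, cluster_labels))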

In the end, using the KMeans algorithm on all the embeddings with \(K=3\) results in:

          Inertia   Rand Index
\(K=3\)   84.78     0.55

You should try varying \(K\) and see what happens!

Resources for Session 4

Follow the link given on the main course page (An introduction to Tokenization).

Resources for Session 5

Dataset

The dataset of interest is called DBpedia Classes, and is based on Wikipedia articles. It is a Kaggle dataset and a popular baseline for text classification tasks.

Note

This is the dataset that was used in Labs 2 and 3.

You can download it using kagglehub:

!pip install kagglehub
import kagglehub

# Download latest version
path = kagglehub.dataset_download("danofer/dbpedia-classes")
print("Path to dataset files:", path)

Using pandas.read_csv, you can read the dataset and visualize its structure (for example with data.head()):

import pandas as pd
import numpy as np
import sklearn
import matplotlib

# !! TO DO
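
One possible way to do this (a sketch; the CSV file name below is an assumption, so check the exact names in the downloaded folder first):

import os
import pandas as pd

# List the files shipped with the dataset
print(os.listdir(path))

# Load one split into a DataFrame (replace the file name with the one you need)
data = pd.read_csv(os.path.join(path, "DBPEDIA_train.csv"))

# Inspect the structure: the columns include the labels l1, l2, l3 and the raw text
print(data.head())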

Granularity of the data

The DBpedia dataset has three levels of granularity going from high (l1) to low (l3).

Here is an example of how to select the high-level classes:

high_level_classes = data["l1"].unique()
high_level_classes

For Labs 2 and 3, we extracted 100 examples for each of the three classes "Conifer", "Fern" and "Moss", which are low-level (l3) classes belonging to the high-level (l1) class Species.

level = 'l3'
l1_class = 'Species'
nb_samples = 100

class_one = "Conifer"
class_two = "Fern"
class_three = "Moss"

# !!! TO DO
# Select all the "Species" classes using the .unique() built-in function
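
For example, a minimal sketch (one possible way, assuming the DataFrame data loaded above):

# All low-level (l3) classes belonging to the high-level class "Species"
species_classes = data[data["l1"] == l1_class][level].unique()
print(species_classes)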

Work to do

Preprocessing the data

We will now perform a bit of preprocessing on the raw text in order to remove some words from the descriptions that would make the classification task too trivial (e.g. the name of the class itself), using the following function.

def remove_words_from_descriptions(descriptions, words_to_remove, max_length=None):
    """
    Remove specific words from a list of descriptions and limit the length of the final description.

    Parameters:
    descriptions (list of str): List of descriptions.
    words_to_remove (list of str): List of words to remove from the descriptions.
    max_length (int, optional): Maximum length of the final description. Defaults to None.

    Returns:
    list of str: List of descriptions with the specified words removed and limited in length.
    """
    words_to_remove_set = set(words_to_remove)
    cleaned_descriptions = []

    for description in descriptions:
        cleaned_description = ' '.join(
            word for word in description.split() 
            if word.lower().strip('.,!?;:()[]{}"\'') not in words_to_remove_set
        )
        if max_length is not None:
            cleaned_description = ' '.join(cleaned_description.split()[:max_length])
        cleaned_descriptions.append(cleaned_description)

    return cleaned_descriptions
stop_words = [class_one, class_two, class_three]
max_length = 30  # keep only the first 30 words of each description, as in Labs 2 and 3

all_class_one = data[data[level]==class_one]
random_class_one_subset = all_class_one.sample(n=nb_samples, random_state=42)
class_one_description = random_class_one_subset["text"].values
class_one_description = remove_words_from_descriptions(class_one_description, stop_words, max_length)

all_class_two = data[data[level]==class_two]
random_class_two_subset = all_class_two.sample(n=nb_samples, random_state=42)
class_two_description = random_class_two_subset["text"].values
class_two_description = remove_words_from_descriptions(class_two_description, stop_words, max_length)

all_class_three = data[data[level]==class_three]
random_class_three_subset = all_class_three.sample(n=nb_samples, random_state=42)
class_three_description = random_class_three_subset["text"].values
class_three_description = remove_words_from_descriptions(class_three_description, stop_words, max_length)

Load the Foundation model

Now that you have preprocessed the raw texts for the three selected classes, you can load the pre-trained RoBERTa model from Hugging Face.

from transformers import AutoTokenizer, RobertaModel
import torch
import tqdm

checkpoint_name = "FacebookAI/roberta-base"
device_general = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(checkpoint_name)
model = RobertaModel.from_pretrained(checkpoint_name).to(device_general)

Define a function embed_text that takes as input the text to embed, the model and the tokenizer, and returns the CLS token embedding (outputs.last_hidden_state[:, 0, :]) and the full embeddings (outputs.last_hidden_state). What is the difference between them? What is the CLS token useful for?

# !! TO DO
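
A minimal sketch of such a function, assuming the model, tokenizer and device_general defined above (this is one possible implementation, not necessarily the reference one):

def embed_text(text, model, tokenizer):
    # Tokenize a string or a list of strings into padded input tensors
    inputs = tokenizer(text, padding=True, truncation=True, return_tensors="pt").to(device_general)

    # Forward pass without gradient computation
    with torch.no_grad():
        outputs = model(**inputs)

    full_embeddings = outputs.last_hidden_state              # one vector per token of each text
    cls_embeddings = full_embeddings[:, 0, :].cpu().numpy()  # one vector per text (the CLS/<s> token), as numpy for stacking later
    return cls_embeddings, full_embeddings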

Generate the embeddings

Compute the embeddings for the three classes using the compute_embeddings function, then concatenate the embeddings of the different classes and save them.

def compute_embeddings(list_text, name_file_to_save, batch_size=10):

    model, tokenizer, checkpoint_name = load_model()  # load_model() is assumed to wrap the loading code above
    all_embeddings = []
    for i in tqdm.tqdm(range(0, len(list_text), batch_size)):

        # embed_text returns (CLS embeddings, full embeddings); only the CLS vectors are kept here
        if i + batch_size > len(list_text):
            text_embeddings, _ = embed_text(list(list_text[i:]), model, tokenizer)
        else:
            text_embeddings, _ = embed_text(list(list_text[i:i+batch_size]), model, tokenizer)

        all_embeddings.append(text_embeddings)

    # Save the stacked embeddings under the "embeddings" key, as in the previous labs
    np.savez_compressed(f'{folder_store_data_path}/{name_file_to_save}.npz', embeddings=np.vstack(all_embeddings))

    return np.vstack(all_embeddings)

batch_size = 10

# Compute the embeddings for class one, two and three
class_one_embeddings = compute_embeddings(class_one_description, f"{class_one.lower()}_description_embeddings", batch_size=batch_size)

# ......
# !! TO DO
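
For instance, a sketch of the remaining steps (the combined file name is only a suggestion, following the same naming scheme as for class one):

class_two_embeddings = compute_embeddings(class_two_description, f"{class_two.lower()}_description_embeddings", batch_size=batch_size)
class_three_embeddings = compute_embeddings(class_three_description, f"{class_three.lower()}_description_embeddings", batch_size=batch_size)

# Concatenate the three classes into a single array of shape (300, 768) and save it
all_class_embeddings = np.vstack([class_one_embeddings, class_two_embeddings, class_three_embeddings])
np.savez_compressed(f'{folder_store_data_path}/all_description_embeddings.npz', embeddings=all_class_embeddings)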

Reproduce results from Lab 2

Now, using the same approach as in Lab 2, divide your embeddings into a training and a test set (test_size=0.2) and check that you obtain similar classification performance. You can also use an unsupervised approach (UMAP, t-SNE) to visualize or cluster them.
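
A minimal sketch of the split (one possible way, assuming the concatenated embeddings built above; the 0/1/2 class indices follow the Lab 2 convention, and random_state/stratify are optional choices):

from sklearn.model_selection import train_test_split

# Build the label vector: 0 for Conifer, 1 for Fern, 2 for Moss
labels = np.array([0] * nb_samples + [1] * nb_samples + [2] * nb_samples)

X_train, X_test, y_train, y_test = train_test_split(
    all_class_embeddings, labels, test_size=0.2, random_state=42, stratify=labels
)

# Any classifier from Lab 2 (e.g. K-Nearest Neighbours) can now be fitted on X_train / y_train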