Resources for the Text Application⚓
This application is based on a collection of Wikipedia articles, from which a subset was extracted. The data comes from a Kaggle dataset; it is not a dataset we created for this class.
Resources for Session 1⚓
Dataset⚓
Main features:
- 200 texts
- Of the 200 texts, 100 belong to the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.
Visualisation of a few examples⚓
These are some samples in the dataset. They are extracted from Wikipedia pages:
- 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'
- 'Joey Altman is an American chef, restaurateur, TV host and writer.'
- 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'
Model used for the embeddings⚓
The 200 texts have been embedded into a latent space using RoBERTa, a foundation model from this paper. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.
For now, you can simply open the numpy array containing all samples in the latent space from the embeddings-text-lab1.npz file. The data is stored under the "embeddings" key of the dictionary.
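A minimal loading sketch, assuming the file sits in the current working directory:

```python
import numpy as np

# Open the archive and retrieve the embeddings stored under the "embeddings" key
data = np.load("embeddings-text-lab1.npz")
embeddings = data["embeddings"]
print(embeddings.shape)  # one row per text, e.g. (200, embedding_dim)
```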
Work to do⚓
Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.
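One possible way to compute and display the distance matrix (the Euclidean metric is an assumption; other metrics are worth trying):

```python
from scipy.spatial.distance import cdist
import matplotlib.pyplot as plt

# Pairwise Euclidean distances between all 200 embeddings
dist = cdist(embeddings, embeddings, metric="euclidean")

plt.imshow(dist, cmap="viridis")
plt.colorbar(label="Euclidean distance")
plt.title("Pairwise distance matrix")
plt.show()
```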
What can you conclude about this matrix, and about the embeddings?
Resources for Session 2⚓
Dataset⚓
Main features:
- 300 texts, containing only the first 30 words of each description.
- Of the 300 texts, 100 are from the "Conifer" class, 100 from the "Fern" class, and 100 from the "Moss" class.
- "Conifer", "Fern" and "Moss" are labelled as classes "0", "1", and "2", respectively.
- For each class, 80 out of the 100 examples are used for training, and 20 for testing. The training/test split was made randomly, and the examples are already provided to you as training and testing sets.
Model used for the embeddings⚓
As in Lab 1, the 300 texts have been embedded into a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of deep learning in courses 4 and 5.
The training and testing datasets are available in the embeddings-text-lab2.npz file. This file is a dictionary whose entries are indexed by the keys "X_train", "y_train", "X_test", and "y_test".
The file can therefore be loaded using the following code snippet:
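A minimal sketch of the intended loading code, assuming the file is in the current working directory:

```python
import numpy as np

# Load the pre-computed train/test embeddings and their labels
data = np.load("embeddings-text-lab2.npz")
X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]
```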
Work to do⚓
Perform classification on this data, using the technique of your choice. Please refer to the Lab Session 2 main page for details.
As an example, using the K-Nearest Neighbour algorithm with \(K=10\) for classification already achieves good performance, with an accuracy of 0.7. Detailed results can be found in the following table:
| Class | Precision | Recall | F1-score |
|---|---|---|---|
| 0 | 0.71 | 0.75 | 0.73 |
| 1 | 0.71 | 0.60 | 0.65 |
| 2 | 0.68 | 0.75 | 0.71 |
You should be able to replicate these results using the `classification_report` function from scikit-learn:
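A sketch of one way to reproduce the baseline above (variable names follow the loading snippet of the previous section):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Fit a K-Nearest Neighbour classifier with K=10 and evaluate on the test set
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```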
Resources for Session 3⚓
Dataset & Model used for the embeddings⚓
You should use the same data as for Session 2. As a reminder, the data consists of 300 texts processed by the RoBERTa model.
Note on train and test data⚓
Because this is unsupervised learning, the model does not need to be trained before inference. As a consequence, you may (and should) use both X_train and X_test in your model, concatenated into a single large X.
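For instance (variable names taken from the Session 2 loading snippet):

```python
import numpy as np

# Merge train and test embeddings into a single array for unsupervised learning
X = np.concatenate([X_train, X_test], axis=0)
# Labels are kept aside; they are only used to evaluate the clustering afterwards
y = np.concatenate([y_train, y_test], axis=0)
```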
Warning
If you want to compare results between the supervised model of Lab 2 and the unsupervised model of Lab 3, though, don't forget to compare on the same data (i.e. the test dataset)!
Work to do⚓
Perform unsupervised learning on the aforementioned dataset.
Note
If you use clustering, you can use the labels to evaluate the quality of your clustering.
Results can be estimated using several metrics (see the sklearn documentation for details):
- Using labels, you can use the Rand index, homogeneity, completeness and V-measure.
- Without labels, you may use the inertia and the silhouette score.
It is generally informative to use both kinds of metrics.
In the end, using the KMeans algorithm on all the embeddings with \(K=3\) results in:
| | Inertia | Rand index |
|---|---|---|
| \(K=3\) | 84.78 | 0.55 |
You should try varying \(K\) and see what happens!
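A possible starting point (hyperparameters such as n_init and random_state are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, silhouette_score

# Cluster all embeddings with K=3
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Metrics without labels (internal)
print("Inertia:", kmeans.inertia_)
print("Silhouette:", silhouette_score(X, kmeans.labels_))

# Metric using the labels (external)
print("Rand index:", rand_score(y, kmeans.labels_))
```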
Resources for Session 4⚓
Follow the link given on the main course page (An introduction to Tokenization).
Resources for Session 5⚓
Dataset⚓
The dataset of interest is called DBpedia Classes and is based on Wikipedia articles. It is a Kaggle dataset and a popular baseline for text classification tasks.
Note
This is the dataset that was used in Labs 2 and 3.
You can download it using kagglehub:
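A minimal sketch (the dataset slug danofer/dbpedia-classes is an assumption):

```python
import kagglehub

# Download the DBpedia Classes dataset from Kaggle (slug is an assumption)
path = kagglehub.dataset_download("danofer/dbpedia-classes")
print("Files downloaded to:", path)
```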
Using `pandas.read_csv`, you can read the dataset and visualize its structure (for example with `data.head()`):
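For example (the CSV file name and column layout are assumptions about this Kaggle dataset):

```python
import os
import pandas as pd

# Read the training split and inspect its structure
data = pd.read_csv(os.path.join(path, "DBPEDIA_train.csv"))
print(data.shape)
print(data.head())
```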
Granularity of the data⚓
The DBpedia dataset has three levels of granularity, going from high (l1) to low (l3).
Here is an example of how to select the high-level classes:
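A sketch, assuming the high-level classes live in a column named l1:

```python
# List the unique high-level (l1) classes of the dataset
print(data["l1"].unique())
```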
For Labs 2 and 3, we extracted 100 examples for each of the three low-level (l3) classes "Conifer", "Fern" and "Moss", which all belong to the high-level class Species.
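A possible way to perform this extraction (column and label names are assumptions):

```python
# Keep 100 examples per class and attach an integer label (0: Conifer, 1: Fern, 2: Moss)
classes = ["Conifer", "Fern", "Moss"]
subsets = []
for label, name in enumerate(classes):
    subset = data[data["l3"] == name].head(100).copy()
    subset["label"] = label
    subsets.append(subset)

texts = pd.concat(subsets, ignore_index=True)
print(texts["label"].value_counts())  # should show 100 examples per class
```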
Work to do⚓
Preprocessing the data⚓
We will now perform a bit of preprocessing on the raw text in order to remove some words from the descriptions that would make the classification task too trivial (e.g. the name of the class itself), using the function below.
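A minimal sketch of this kind of preprocessing (the forbidden-word list and the 30-word truncation are assumptions, the latter based on the Session 2 dataset description):

```python
import re

def preprocess(text, forbidden_words, n_words=30):
    """Remove class-revealing words, then keep only the first n_words words."""
    # Drop any occurrence of a forbidden word (case-insensitive, whole words only)
    pattern = r"\b(" + "|".join(forbidden_words) + r")\b"
    text = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Collapse whitespace and truncate to the first n_words words
    return " ".join(text.split()[:n_words])

# Apply the preprocessing to every description (column names are assumptions)
forbidden = ["conifer", "conifers", "fern", "ferns", "moss", "mosses"]
texts["clean_text"] = texts["text"].apply(lambda t: preprocess(t, forbidden))
```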
Load the Foundation model⚓
Now that you have preprocessed the raw texts for the three selected classes, you can load the pre-trained RoBERTa model from Hugging Face.
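A minimal sketch using the transformers library (the roberta-base checkpoint is an assumption):

```python
from transformers import RobertaTokenizer, RobertaModel

# Load the pre-trained RoBERTa tokenizer and model from Hugging Face
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
model.eval()  # inference only: we extract embeddings, we do not fine-tune
```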
Write a function `embed_text` that takes as input the text to embed, the model and the tokenizer, and returns the CLS token embedding (`outputs.last_hidden_state[:, 0, :]`) as well as the full embeddings (`outputs.last_hidden_state`).
What is the difference between them? What is the CLS token useful for?
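One possible implementation sketch, building on the model and tokenizer loaded above:

```python
import torch

def embed_text(text, model, tokenizer):
    """Return the CLS embedding and the full token embeddings of a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # CLS token: first position of the last hidden state; full embeddings: all positions
    return outputs.last_hidden_state[:, 0, :], outputs.last_hidden_state
```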
Generate the embeddings⚓
Compute the embeddings for the three classes using the `compute_embeddings` function, then concatenate the embeddings of the different classes and save them.
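A minimal sketch built on `embed_text` (the loop structure, column names and output file name are assumptions):

```python
import numpy as np

def compute_embeddings(text_list, model, tokenizer):
    """Embed a list of texts and stack their CLS embeddings into one array."""
    cls_rows = [embed_text(t, model, tokenizer)[0].squeeze(0).numpy() for t in text_list]
    return np.stack(cls_rows)

# Compute the embeddings class by class, then concatenate and save everything
X_parts, y_parts = [], []
for label in range(3):
    class_texts = texts.loc[texts["label"] == label, "clean_text"].tolist()
    emb = compute_embeddings(class_texts, model, tokenizer)
    X_parts.append(emb)
    y_parts.append(np.full(len(emb), label))

X = np.concatenate(X_parts)
y = np.concatenate(y_parts)
np.savez("embeddings-text-lab5.npz", X=X, y=y)  # file name is an assumption
```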
Reproduce results from Lab 2⚓
Now, using the same approach as in Lab 2, divide your embeddings into a training and a test set (test_size=0.2) and check that you obtain similar classification performance. You can also use an unsupervised approach (UMAP, t-SNE) to visualize or cluster them.
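A sketch of the supervised part (random_state and stratification are assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# Reproduce the Lab 2 setup: 80/20 split followed by a KNN classification
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=10).fit(X_train, y_train)
print(classification_report(y_test, knn.predict(X_test)))
```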