Resources for the Text Application⚓
This application uses texts extracted from Wikipedia articles. The dataset comes from Kaggle; we did not create it for this class.
Resources for Session 1⚓
Dataset⚓
Main features:
- 200 texts
- Of the 200 texts, 100 belong to the "bird" class and come from Wikipedia pages about birds. The other 100 are selected at random.
Visualisation of a few examples⚓
Here are a few samples from the dataset, each extracted from a Wikipedia page:
- 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'
- 'Joey Altman is an American chef, restaurateur, TV host and writer.'
- 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'
Latent Space⚓
The 200 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. We will delve into the details of Deep Learning in courses 4 and 5.
For now, you can simply load the NumPy array containing all samples in the latent space from the file embeddings-text-lab1.npz. The array is stored under the "embeddings" key of the loaded archive.
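As a minimal sketch of the loading step, the snippet below reads the array with NumPy. The helper name load_embeddings is hypothetical, and the snippet assumes the file sits in the current working directory. The array is expected to hold one row per text, but this is an assumption worth checking by printing its shape.

```python
from pathlib import Path

import numpy as np


def load_embeddings(path="embeddings-text-lab1.npz", key="embeddings"):
    """Return the array stored under `key` in the .npz archive at `path`.

    Hypothetical helper: the default path assumes the file is in the
    current working directory.
    """
    with np.load(path) as archive:
        return archive[key]


# Only runs if the file is actually present next to the script or notebook.
if Path("embeddings-text-lab1.npz").exists():
    embeddings = load_embeddings()
    # Assumption to verify: one row per text, one column per latent dimension.
    print(embeddings.shape, embeddings.dtype)
```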
Work to do⚓
Compute, visualize, and interpret the distance matrix, as explained on the Lab Session 1 main page.
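One possible starting point is sketched below. It assumes the Euclidean distance, whereas the Lab Session 1 main page may ask for a different metric. The function names pairwise_euclidean and show_distance_matrix are hypothetical, and the plotting helper assumes matplotlib is installed. The demo uses random stand-in data, not the lab embeddings.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 <x, y>, computed for all pairs at once.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Rounding errors can produce tiny negative values; clip them before the sqrt.
    np.maximum(sq_dists, 0.0, out=sq_dists)
    D = np.sqrt(sq_dists)
    # The distance from a sample to itself is exactly zero.
    np.fill_diagonal(D, 0.0)
    return D


def show_distance_matrix(D):
    """Display D as a heatmap (matplotlib is imported only when plotting)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    image = ax.imshow(D, cmap="viridis")
    fig.colorbar(image, ax=ax, label="Euclidean distance")
    ax.set_xlabel("sample index")
    ax.set_ylabel("sample index")
    plt.show()


# Stand-in random data, NOT the lab embeddings: 6 samples in 4 dimensions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
D_demo = pairwise_euclidean(X_demo)
print(D_demo.shape)
```

With the real data, pass the loaded embeddings array instead of X_demo, then call show_distance_matrix on the result. When interpreting the plot, check how the rows are ordered: block patterns only reflect the classes if the samples are grouped by class, which is not stated here.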