
Resources for the Text Application

This application is based on a set of texts extracted from Wikipedia articles. The dataset comes from Kaggle; it was not created for this class.

Resources for Session 1

Dataset

Main features:

  • 200 texts
  • Of the 200 texts, 100 belong to the "bird" class and are Wikipedia pages about birds. The other 100 are selected at random.

Visualisation of a few examples

Here are a few samples from the dataset, all extracted from Wikipedia pages:

  • 'The pygmy white-eye (Oculocincta squamifrons), also known as the pygmy ibon, is a species of bird in the white-eye family Zosteropidae. It is monotypic within the genus Oculocincta.'

  • 'Joey Altman is an American chef, restaurateur, TV host and writer.'

  • 'GSAT-5P, or GSAT-5 Prime, was an Indian communications satellite which was lost in a launch failure in December 2010. Part of the Indian National Satellite System, it was intended to operate in geosynchronous orbit as a replacement for INSAT-3E.'

Latent Space

The 200 texts have been embedded in a latent space using RoBERTa, a deep learning model from this paper. The details of deep learning are covered in courses 4 and 5.

For now, you can simply load the NumPy array containing all samples in the latent space from the file embeddings-text-lab1.npz. The array is stored under the "embeddings" key of the archive.
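
As an illustration of the loading step, here is a minimal sketch. The helper name `load_embeddings` is hypothetical (it is not part of the course material), and the default path assumes the file sits in the current working directory.

```python
import numpy as np


def load_embeddings(path="embeddings-text-lab1.npz", key="embeddings"):
    """Return the array stored under `key` in the .npz archive at `path`.

    Hypothetical helper: the default path assumes the file is in the
    current working directory.
    """
    with np.load(path) as archive:
        # archive.files lists the names of the arrays stored in the archive
        if key not in archive.files:
            raise KeyError(f"{key!r} not found; available keys: {archive.files}")
        # Indexing reads the array into memory, so it stays valid after closing
        return archive[key]
```

Calling `load_embeddings()` and printing the `.shape` of the result is a quick sanity check: one would expect one row per text, so 200 rows, with the number of columns set by the embedding dimension.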

Work to do

Compute, visualize and interpret the distance matrix, as explained on the Lab Session 1 main page.
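
One possible way to approach this is sketched below. It assumes the Euclidean distance, which this page does not specify; the Lab Session 1 main page remains the reference for the metric to use. The function names `pairwise_euclidean` and `show_distance_matrix` are hypothetical, and the plotting helper assumes matplotlib is installed.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n, n) matrix of Euclidean distances between the rows of X."""
    X = np.asarray(X, dtype=float)
    sq_norms = np.sum(X ** 2, axis=1)
    # Uses the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clip small negative values caused by floating-point round-off
    np.maximum(sq_dists, 0.0, out=sq_dists)
    return np.sqrt(sq_dists)


def show_distance_matrix(D):
    """Display the distance matrix D as a heatmap (assumes matplotlib)."""
    import matplotlib.pyplot as plt

    plt.imshow(D, cmap="viridis")
    plt.colorbar(label="distance")
    plt.xlabel("sample index")
    plt.ylabel("sample index")
    plt.show()
```

With the embeddings loaded into an array `X`, `D = pairwise_euclidean(X)` gives a symmetric matrix with a (near-)zero diagonal, and `show_distance_matrix(D)` displays it. When interpreting the plot, it helps to check whether texts from the "bird" class are closer to each other than to the randomly selected texts.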