
Lab Session 1 - Tools and first applications

Note

Before you begin, make sure you have downloaded the latest update of the course slides from here, and keep them close while doing the lab.

Objectives of the lab

This is the first lab session of the Introduction to AI course. The goal is to introduce important concepts that we will use throughout the course, set up the development environment, and demonstrate the application you will choose.

Prerequisites

The following skills are strongly recommended before starting the work for the lab session. Make sure of the following:

  • You have a working Python installation and you know how to install new packages.
  • You are using an IDE. We strongly recommend VSCode.
  • We recommend creating an environment specifically for this course using virtualenv. Install a recent Python version (3.10 or higher).
  • For this session we will need the following Python packages: numpy and scipy. Install them in the newly created virtual environment.

Tip

If you need a step-by-step guide to Python, IDEs, and environments, have a look at this course.

In this course, we provide code snippets as examples, and you need to build your own code to fulfill the objectives. We can provide advice on how to do so, but coding is not an objective by itself.

Table of contents of the lab

  1. Introduction to Numpy
  2. Embeddings, latent space, and foundation models
  3. Application to your modality

Introduction to Numpy

Machine learning involves manipulating datasets as arrays.

Arrays are sets of numbers (often, floating point or integers), and can be as simple as a single number (a singleton), or a vector of length \(N\), or a \(N\times M\) matrix, or more generally a tensor (e.g. a \(N\times M\times P\) array would be called a tensor with three modes).

In this course, we will use two different python packages to manipulate arrays: numpy and pytorch. We will introduce only numpy now, as pytorch is only needed to train deep learning solutions (covered in detail starting from session 4).

You will find in this section a very short introduction to what is needed to begin the course. Feel free to skip this part if you are already comfortable with the notions.

Note

If you need more in-depth explanations about numpy, the best place to start is the official numpy website, and in particular the guide for absolute beginners (highly recommended!).

Numpy Tutorial : generating synthetic data

This tutorial has two steps:

  • To familiarize yourself with numpy functions, we give below a list of concepts needed and functions that can be used. Check the numpy documentation using your IDE or online to have a look at detailed parameters and outputs for those functions,
  • Once you are ready, your task is to generate synthetic data, visualize it, and compute a pairwise distance matrix (details below).

Numpy concepts and functions

Creating arrays

  • Goal: Get familiar with NumPy and create basic arrays.
  • Functions to use:
    • Creating arrays with np.array(), np.zeros(), np.ones(), np.arange().
    • Use the .shape attribute or the np.shape() function on any numpy array to get its shape.

The np.array() function converts a Python list (or a list of lists) into a NumPy array. You can also create arrays filled with zeros or ones using np.zeros() and np.ones().

import numpy as np

# Creating a 1D array from a list
array_1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", array_1d)

# Creating a 2D array (list of lists)
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("2D Array:\n", array_2d)

# Creating a 1D array of zeros
zeros_1d = np.zeros(5)
print("1D Array of zeros:", zeros_1d)

# Creating a 2D array of zeros
zeros_2d = np.zeros((3, 4))  # 3 rows, 4 columns
print("2D Array of zeros:\n", zeros_2d)

# Creating a 2D array of ones
ones_2d = np.ones((2, 3))  # 2 rows, 3 columns (a 2 by 3 matrix)
print("2D Array of ones:\n", ones_2d)

# Creating a 3D array of ones
ones_3d = np.ones((4, 2, 3))  # a 4 by 2 by 3 tensor
print("3D Array of ones:\n", ones_3d)

# The .shape attribute gives the size along each dimension
print(ones_3d.shape)  # (4, 2, 3)

Generating random arrays

  • Goal: Generate arrays of any shape by sampling from random distributions.
  • Functions to use:
    • Creating arrays with np.random.normal(), np.random.uniform(), np.random.randint()
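As an illustration, here is a minimal sketch of the three functions listed above. The seed value, the shapes, and the distribution parameters are arbitrary choices made for this example, and np.random.seed() is only used to make the run reproducible.

```python
import numpy as np

# Fix the seed so the run is reproducible (the value 0 is arbitrary)
np.random.seed(0)

# 100 samples in dimension 2, drawn from a normal distribution
# with mean 0 and standard deviation 1
gaussian = np.random.normal(loc=0.0, scale=1.0, size=(100, 2))

# 5 values drawn uniformly in [0, 1)
uniform = np.random.uniform(low=0.0, high=1.0, size=5)

# A 3 by 4 array of integers in [0, 10) (the upper bound is excluded)
integers = np.random.randint(low=0, high=10, size=(3, 4))

print(gaussian.shape, uniform.shape, integers.shape)
```

Changing loc moves the center of the normal samples, which is one possible way to obtain clouds of points with different centers.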

Mathematical Operations with NumPy

  • Goal: Perform basic operations on arrays.
  • Concepts:
    • Addition, multiplication, np.sum(), np.mean().
    • Check the numpy tutorial here about basic array operations.
    • Note that element-wise operations are possible between arrays that have the same shape. Check also the concept of broadcasting, which extends these operations to arrays with compatible shapes.
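The short sketch below illustrates these operations on small arrays. The values are arbitrary and only serve to make the results easy to check by hand.

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])    # shape (2, 3)
b = np.ones((2, 3))                # same shape as a

elementwise_sum = a + b            # element-wise addition
elementwise_prod = a * a           # element-wise product (not a matrix product)

total = np.sum(a)                  # sum of all entries
column_means = np.mean(a, axis=0)  # mean over the rows, one value per column

# Broadcasting: the vector of shape (3,) is added to every row of a
offset = np.array([10.0, 20.0, 30.0])
shifted = a + offset

print(total, column_means)
print(shifted)
```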

Array Manipulation and Indexing

  • Goal: Learn the basics of array manipulation and indexing.
  • Concepts:
    • Simple indexing and slicing (array[0], array[1:3]).
    • Boolean masking (array[array > 5]).
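A minimal sketch of these indexing patterns, using an arbitrary 1D array:

```python
import numpy as np

array = np.array([2, 9, 4, 7, 1, 8])

first = array[0]     # first element
middle = array[1:3]  # elements at positions 1 and 2 (position 3 is excluded)
mask = array > 5     # boolean array, True where the condition holds
large = array[mask]  # keeps only the entries greater than 5

print(first, middle, large)
```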

Note

Indexing will not be needed for Lab Session 1, but we will heavily use it from Session 2.

Saving and loading arrays

  • Goal: Save and load arrays on disk.
  • Concepts:
    • np.save(), np.savez(), np.savez_compressed() and np.load().

We recommend saving with np.savez_compressed(), so that you can put multiple arrays in the same file, with added compression to save space.
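As a sketch of this recommendation, the example below stores two arrays in one compressed file and reads them back. The file name and the array names (points, labels) are hypothetical, and the file is written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile
import numpy as np

points = np.random.normal(size=(100, 2))
labels = np.zeros(100, dtype=int)

# Hypothetical file name, placed in a temporary directory for this example
path = os.path.join(tempfile.mkdtemp(), "session1_data.npz")

# Each keyword is the name under which the array is stored in the file
np.savez_compressed(path, points=points, labels=labels)

# np.load on a .npz file returns a dictionary-like object
with np.load(path) as data:
    loaded_points = data["points"]
    loaded_labels = data["labels"]

print(loaded_points.shape, loaded_labels.shape)
```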

Generate synthetic data, visualize it and compute the pairwise distance matrix

Pairwise Distance Matrix

Euclidean Distance

Because each data point can be represented as a vector in a vector space, the difference between two points can be measured using a distance function.

The simplest and most commonly used distance is the Euclidean distance. In an \( n \)-dimensional space, the Euclidean distance between two vectors is defined as:

\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]

where \( p = (p_1, p_2, ..., p_n) \) and \( q = (q_1, q_2, ..., q_n) \) are two vectors.
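The formula translates almost literally into numpy. The sketch below uses two arbitrary points in dimension 2, chosen so that the result is easy to check by hand.

```python
import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Direct translation of the formula:
# square root of the sum of squared coordinate differences
distance = np.sqrt(np.sum((q - p) ** 2))

# np.linalg.norm computes the same quantity
distance_norm = np.linalg.norm(q - p)

print(distance, distance_norm)  # both are 5.0
```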

Distance Between Many Points

When dealing with a large dataset, one may want to analyze the differences between all data points to identify patterns or groups within the data.

A useful tool for such applications is the pairwise distance matrix, which computes the distances between every pair of data points and displays them in a matrix format. Given n samples, there are at most n(n - 1)/2 unique pairwise distances (since distances are symmetric).

In this matrix, groups of points with relatively small distances will appear as similar groups, whereas points with high relative distances will stand out as dissimilar. The diagonal of this matrix is equal to 0, because it corresponds to the distance of each point with itself.
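To make this concrete, here is a small sketch on four hand-picked points (two near the origin, two near (10, 10)), using the pdist and squareform functions from scipy. The points are an assumption made for illustration only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Four points in dimension 2: two near the origin, two near (10, 10)
points = np.array([[0.0, 0.0],
                   [0.0, 1.0],
                   [10.0, 10.0],
                   [10.0, 11.0]])

condensed = pdist(points, metric="euclidean")  # the n(n - 1)/2 unique distances
matrix = squareform(condensed)                 # full symmetric n by n matrix

print(condensed.shape)  # (6,)
print(matrix.shape)     # (4, 4)
print(np.round(matrix, 2))
```

The two groups show up as blocks of small values around the diagonal, while entries comparing points from different groups are much larger.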

Application

As an application exercise for numpy, generate synthetic data with the following specifications:

  • The goal is to generate two clouds of points with different centers, in dimension 2.
  • Each cloud has 100 points
  • Visualize the two point clouds using the scatter function from matplotlib
  • Compute the pairwise distance matrix using the Euclidean distance, using the pdist and squareform functions from scipy.
  • Visualize the distance matrix using the matshow function from matplotlib pyplot.

Once you have plotted the matrix, interpret it and call the teacher to talk about it!

Embeddings

Now that you have generated some synthetic data we will do the same for real-world data.

Pairwise distance matrix for real-world data

A first intuitive idea would be to compute the pairwise distance matrix between points of real-world data. Here, as an example, we have taken 200 images from 2 different classes and computed the pairwise distance matrix.

[Figure: pairwise distance matrix computed on the raw images]

Can you explain this result? If not, talk about this result to a teacher!

You would get a similar result for most of the data we will study in this class, regardless of the modality (image, audio, or text).

Important

This is why we do not work directly with raw data: we first pre-process it.

Latent space

The latent space is the vector space obtained by applying a foundation model.

Note

In general, any deep learning model can admit a latent space, but in this module we restrict latent spaces to foundation models to make it easier to understand.

As a reminder from the introductory course, a foundation model is a certain type of AI algorithm, characterized by its use of a gigantic collection of data (we talk about internet-size datasets). In this context, the latent space presents interesting properties, such as semantic similarity.

Semantic similarity

Semantic similarity is a property that states that vectors representing similar concepts will be close to each other in a vector space.

For instance, two pictures of a cat may look extremely different (cats of different colors, in front of different backgrounds), but still represent the same thing (a cat). In an ideal vector space, these two images of a cat would be close to each other. Conversely, a picture of a cat and a picture of a dog may look relatively similar (same background, animals of the same color), but represent different concepts. In an ideal vector space, these two images would be far from each other.

Semantic similarity is a property that can be observed in latent spaces of foundation models, mostly because they were trained on gigantic amounts of data.

Hence, while measuring the distance between raw images does not reveal interesting patterns, measuring the distance in a latent space may reveal patterns and concepts in the data. We would therefore like to project the data into this latent space. These projected vectors are called embeddings.

Embeddings: mapping from inputs to the latent space.

We stated before that instead of working with the input/raw data (i.e. with raw text, image, or audio), we will work with their projection in a latent space, called embeddings of the data.

Important

An embedding is the projection of the input data into a latent space, obtained by using a foundation model.

Foundation model and embeddings

Formally, this projection consists of applying a function \(f\) that performs a mapping between two spaces (the input space, and the latent space). The function \(f\) is computed by means of a foundation model, which we will explore in detail in courses 4 and 5 (for the moment, consider that these models are black boxes). Both the input data and embeddings may be represented by arrays (vectors for 1D arrays, matrices for 2D arrays, tensors for higher-dimensional arrays).

In particular:

  • if we consider \(ℝ^d\) as the input space and \(ℝ^{l}\) the latent space,
  • \(f\) is defined as a mapping \(f: ℝ^{d} \rightarrow ℝ^{l}\),
  • a set of \(N\) samples \(x_i \in ℝ^{d}\) can be represented as a matrix \(X \in ℝ^{N \times d}\) with \(N\) rows and \(d\) columns, and the corresponding samples in the latent spaces as a matrix \(Y \in ℝ^{N \times l}\) obtained by applying \(f\) to all \(x_i\).
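The shapes involved can be illustrated with a toy stand-in for \(f\). In the sketch below, f is a fixed random linear map, which is an assumption made purely for illustration: it is not a foundation model and its outputs carry no semantic meaning. The sizes N, d and l are also arbitrary small values.

```python
import numpy as np

N, d, l = 8, 64, 16  # small illustrative sizes, not realistic ones

rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))  # N input samples, one per row

# Stand-in for f: a fixed random linear map. This is NOT a foundation model,
# it only reproduces the shapes of the inputs and outputs.
W = rng.normal(size=(d, l))


def f(x):
    """Hypothetical mapping from the input space to the latent space."""
    return x @ W


Y = f(X)       # applied to all N samples in one call
y_0 = f(X[0])  # applied to a single sample

print(X.shape, Y.shape, y_0.shape)  # (8, 64) (8, 16) (16,)
```

Row i of Y is the image of row i of X, so applying f to the whole matrix gives the same result as applying it sample by sample.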

Note

For example, an image can be encoded as a set of pixels, so the input space is of dimension \(width \times height \times 3\) (one value each for Red, Green and Blue). For a square image with \(height=width=256\), the input space dimension is \(d = 256 \times 256 \times 3 = 196608\).

A latent space for images using a foundation model will typically be of dimension \(l = 512\) or \(l = 1024\).
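The input dimension quoted above can be checked quickly in numpy. The image below is a placeholder filled with zeros, used here only for its shape.

```python
import numpy as np

height, width, channels = 256, 256, 3
image = np.zeros((height, width, channels))  # placeholder image filled with zeros

flat = image.reshape(-1)  # flatten into a single vector of the input space
print(flat.shape)         # (196608,)
```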

Note

In practice, the software implementations that we will use can directly be called on such sets of inputs with one function call. This is also called batch processing, and is often computed in parallel or optimized.

Observing semantic similarity

An intuitive way to have a first look at samples in the latent space is to investigate the distances between samples. The assumption we make here is that samples that are close to each other in the latent space are semantically similar in the input space, as explained before.

Important

You will now compute the pairwise distance matrix on some data of your modality (for which we have already computed the embeddings), and draw conclusions about the difference between using the raw data as input (as presented above) and using embeddings.

Application

You have to choose one of the three proposed applications:

  • Computer Vision : biodiversity images from the iNaturalist project,
  • Audio : soundscapes from the ESC50 dataset,
  • Text : based on Wikipedia.

For each application, we provided a small sample dataset (specific for session 1).

Task for session 1

We have already mapped those samples to a latent space. Your task for Session 1 is to compute and visualize a distance matrix between all pairs of samples in the latent space, using the Euclidean distance, exactly as you did for the synthetic problem above:

  1. Compute the distance matrix
  2. Visualize the distance matrix, for example using matplotlib.pyplot.matshow
  3. Use the information provided in the application specific pages (links above) to interpret the result