Resources for the Vision Application⚓
This application is based on iNaturalist, from which we extract subdatasets specifically for this course.
Resources for Session 1⚓
Dataset⚓
Main features:
- 200 images
- Of these 200 images, 100 are insects and 100 are plants.
Visualisation of a few examples⚓
(Example images of plants and insects from the dataset.)
Model used for the embeddings⚓
The 200 images have been embedded in a latent space using the ViT-H/14 vision encoder from OpenCLIP, a foundation model introduced in this paper. We will dive into the details of feature extraction, deep learning, and foundation models in courses 4 and 5.
For now, you can simply open the NumPy array containing all samples in the latent space from the file embeddings-cv-lab1.npz.
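As a minimal sketch, the archive can be opened with NumPy; since the key names stored inside are not listed above, we first inspect them:

```python
import numpy as np

# Open the archive and list the arrays it contains
data = np.load("embeddings-cv-lab1.npz")
print(data.files)

# Grab the first stored array; expected shape: (200, embedding_dimension)
embeddings = data[data.files[0]]
print(embeddings.shape)
```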
Work to do⚓
Compute, visualize, and interpret the distance matrix, as explained on the Lab Session 1 main page.
What can you conclude about this matrix, and about the embeddings?
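A minimal sketch for this step, assuming the embeddings are loaded in a `(200, d)` array named `embeddings` as above (cosine distance is used here as one reasonable choice; Euclidean distance works as well):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances

# Pairwise distances between all 200 embeddings -> a (200, 200) matrix
dist_matrix = pairwise_distances(embeddings, metric="cosine")

plt.imshow(dist_matrix, cmap="viridis")
plt.colorbar(label="cosine distance")
plt.title("Pairwise distance matrix of the embeddings")
plt.show()
```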
Resources for Session 2⚓
Dataset⚓
Main features:
- Seven classes were sampled from the iNaturalist taxonomy.
- There are 100 samples for each class, split into approximately 80 for training and 20 for testing.
- Class names can be fetched from the embedding file (see below), and you can find example images on the iNaturalist website.
Model used for the embeddings⚓
As for Lab 1, the images have been embedded in a latent space using the ViT-H/14 vision encoder from OpenCLIP, a deep learning model from this paper. We will delve into the details of deep learning and feature extraction from course 4 onwards.
For now, you can simply open the NumPy array containing all samples in the latent space from the file embeddings-cv-lab2.npz. This file behaves like a dictionary whose arrays are indexed by the keys "X_train", "y_train", "X_test", and "y_test".
In that regard, the file can be loaded along the following lines (a minimal sketch; adjust the path to where you saved the archive):

```python
import numpy as np

# Load the archive containing embeddings and labels
data = np.load("embeddings-cv-lab2.npz")
print(data.files)  # class names may also be stored here; inspect the keys

X_train = data["X_train"]
y_train = data["y_train"]
X_test = data["X_test"]
y_test = data["y_test"]

print(X_train.shape, X_test.shape)
```
Work to do⚓
Perform classification on this data, using the technique of your choice. Please refer to the Lab Session 2 main page for details.
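As a possible starting point, here is a minimal K-Nearest Neighbours baseline with scikit-learn (a sketch assuming the arrays loaded above; K=10 matches the example results below):

```python
from sklearn.neighbors import KNeighborsClassifier

# KNN classifier in the embedding space
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```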
As an example, here are the results obtained using the K-Nearest Neighbours algorithm with K=10:
|               | Precision | Recall | F1-score |
|---------------|-----------|--------|----------|
| Eriogonum     | 0.70      | 0.88   | 0.78     |
| Rubus         | 0.82      | 0.78   | 0.80     |
| Quercus       | 0.74      | 1.00   | 0.85     |
| Ericales      | 0.80      | 0.63   | 0.71     |
| Lamioideae    | 0.95      | 0.87   | 0.91     |
| Ranunculeae   | 0.62      | 0.95   | 0.75     |
| Ranunculaceae | 0.67      | 0.20   | 0.31     |
You should be able to replicate these results using the function `classification_report` from scikit-learn:

```python
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))  # y_pred: predictions of your classifier on X_test
```
Resources for Session 3⚓
Dataset & Model used for the embeddings⚓
You should use the same data as for Session 2. As a reminder, the data consists of 700 images processed by the ViT-H/14 vision encoder from OpenCLIP.
Note on train and test data⚓
Because this setting is unsupervised, the model does not need to be trained before inference. As a consequence, you may (and should) use both X_train and X_test in your model, concatenated into a single array X.
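For instance (a minimal sketch assuming the arrays from the Lab 2 embedding file are already loaded):

```python
import numpy as np

# Use all 700 embeddings for unsupervised learning
X = np.concatenate([X_train, X_test], axis=0)
y = np.concatenate([y_train, y_test], axis=0)  # labels kept aside for evaluation only
```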
Warning
If you want to compare results between the supervised model of Lab 2 and the unsupervised model of Lab 3, though, don't forget to compare on the same data (i.e. the test dataset)!
Work to do⚓
Perform unsupervised learning on the aforementioned dataset.
Note
If you use clustering, you can use the labels to evaluate the quality of your clustering.
Results can be estimated using several metrics (see the scikit-learn documentation for details):
- With labels, you can use the Rand index, homogeneity, completeness, and V-measure.
- Without labels, you may use the inertia and the silhouette score.
It is generally informative to use both kinds of metrics.
In the end, using the KMeans algorithm on all the embeddings with \(K=6\) results in:
|         | Inertia | Rand index |
|---------|---------|------------|
| \(K=6\) | 193.07  | 0.86       |
You should try varying \(K\) and see what happens!
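A minimal sketch of such an experiment, assuming the concatenated arrays `X` and `y` from above (the range of `K` values and the chosen metrics are only suggestions):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import rand_score, homogeneity_score, silhouette_score

for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    clusters = kmeans.labels_
    print(
        f"K={k:2d}  inertia={kmeans.inertia_:8.2f}  "
        f"Rand={rand_score(y, clusters):.2f}  "
        f"homogeneity={homogeneity_score(y, clusters):.2f}  "
        f"silhouette={silhouette_score(X, clusters):.2f}"
    )
```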
Resources for Session 4⚓
Follow the link given on the main course page (Preprocessing for Computer Vision).
Resources for Session 5⚓
Dataset⚓
Load the images extracted from iNaturalist for the 7 classes of Labs 2 and 3. Please download them following this link.
Work to do⚓
Setting up the environment⚓
We will work with the `torch` library, which you should already know from Lab 4; we also need the `transformers` library from Hugging Face and `PIL`, which is useful to process images. A couple of other libraries can be useful for downloading the foundation model. Simply install them with pip, for instance (the exact package list below is a suggestion):

```bash
pip install torch torchvision
pip install transformers
pip install pillow
pip install open_clip_torch  # provides create_model_and_transforms used below
```
Load the Foundation model⚓
Import the foundation model CLIP from the open-clip GitHub repository. CLIP (Contrastive Language–Image Pretraining) is a multimodal foundation model from OpenAI that has been pretrained to match images with their corresponding text (you'll learn more about this in the upcoming lesson on multimodal foundation models).
Warning
The pretrained OpenCLIP model is 4 GB, so ensure you have enough space in your cache directory. You can specify a different download location by setting the `cache_dir` argument in the `create_model_and_transforms` function.
For instance (the model name and the `laion2b_s32b_b79k` pretrained weights tag are one possible choice, matching the ViT-H/14 encoder used in the previous labs):

```python
import torch
import open_clip

# Download the pretrained ViT-H/14 vision encoder (~4 GB on first run)
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14",
    pretrained="laion2b_s32b_b79k",
    cache_dir=None,  # set to a directory with enough free space if needed
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()  # inference only: disables dropout, freezes normalisation statistics
```
Why are we using the model in eval mode?
Load the data⚓
You now have to make the necessary imports and build the dataloader from the dataset path (it turns out to be quite easy using the torchvision `datasets` module). For instance (a sketch; adjust `data_path` to where you extracted the images):

```python
from torch.utils.data import DataLoader, random_split
from torchvision import datasets

data_path = "path/to/inaturalist_images"  # one sub-folder per class

# ImageFolder yields (image, label) pairs and applies the CLIP preprocessing
dataset = datasets.ImageFolder(data_path, transform=preprocess)
print(len(dataset), dataset.classes)
```
Find the train and test sizes using `len(dataset)`, setting the test set equal to 20% of it. Then use `random_split` to generate the train and test datasets. For instance:

```python
test_size = int(0.2 * len(dataset))
train_size = len(dataset) - test_size
```

```python
train_set, test_set = random_split(dataset, [train_size, test_size])
train_loader = DataLoader(train_set, batch_size=32, shuffle=False)  # batch size is a free choice
test_loader = DataLoader(test_set, batch_size=32, shuffle=False)
```
Generate the embeddings⚓
Following the usage example given on GitHub, use the function `model.encode_image()` to generate embeddings for each iteration over the dataloader. A possible sketch:

```python
import numpy as np

def generate_embeddings(dataloader, model, device):
    """Encode all images served by the dataloader; return embeddings and labels as arrays."""
    all_features, all_labels = [], []
    with torch.no_grad():  # no gradients needed at inference time
        for images, labels in dataloader:
            features = model.encode_image(images.to(device))
            # Normalise the embeddings, as in the open_clip usage example
            features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features.cpu().numpy())
            all_labels.append(labels.numpy())
    return np.concatenate(all_features), np.concatenate(all_labels)
```

Generate embeddings for the train and test sets using the function `generate_embeddings`.
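For instance (the variable names below are suggestions):

```python
X_train_emb, y_train_emb = generate_embeddings(train_loader, model, device)
X_test_emb, y_test_emb = generate_embeddings(test_loader, model, device)
```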
Reproduce results from Lab 2⚓
Now, using the same approach as for Lab 2, check that you obtain similar classification performance on these embeddings. You can also use an unsupervised approach (UMAP, t-SNE) to visualize them.
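For example, a minimal t-SNE visualization of the test embeddings with scikit-learn and matplotlib (a sketch assuming the arrays generated above):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the test embeddings to 2D for visual inspection
coords = TSNE(n_components=2, random_state=0).fit_transform(X_test_emb)

plt.scatter(coords[:, 0], coords[:, 1], c=y_test_emb, cmap="tab10", s=10)
plt.title("t-SNE projection of the CLIP embeddings (test set)")
plt.show()
```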