Lab Session 3 - Unsupervised Learning
Note
Before you begin, make sure you have downloaded the latest update of the course slides from here, and keep them close while doing the lab.
Objective of the lab
At the end of this session, you will be able to:
- Understand unsupervised and supervised feature selection and feature preprocessing
- Fit a clustering model
- Fit a basic decomposition (Principal Components Analysis)
- Visualize a low-dimensional manifold using t-SNE
- Apply unsupervised learning to your application
All examples are based on sklearn:
| from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA, MiniBatchDictionaryLearning, DictionaryLearning, NMF
from sklearn.manifold import TSNE
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
np.random.seed(0)
|
Digits dataset
As a tutorial, we are going to be using the DIGITS dataset. The first thing we are going to do is load the dataset.
As this is unsupervised, we will mostly ignore y_digits
(it will only be used for visualization).
Load the dataset
| x_digits, y_digits = load_digits(n_class=10, return_X_y=True)
|
Visualize some examples using the following plot_digits_examples()
function
| def plot_digits_examples(x_digits, y_digits, n_class=10, n_examples=3):
"""
Plot some examples of digits from the given dataset.
Parameters:
- x_digits (numpy.ndarray): Array of digit images.
- y_digits (numpy.ndarray): Array of digit labels. Used for selecting examples.
- n_class (int): Number of classes. Should be the same as in load_digits. Default is 10.
- n_examples (int): Number of examples plotted per class. Default is 3.
"""
fig = plt.figure(figsize=(10,5))
for idx_class in range(n_class): # Parse all classes
for idx_example in range(1,n_examples+1): # Select n_examples examples
# Index of the current example in the final patchwork
plt.subplot(n_examples, n_class, n_examples*idx_class+idx_example)
# pick a random digit in the current category
curX = x_digits[y_digits==idx_class]
r = np.random.randint(curX.shape[0])
# Reshape the flat 64-dimensional vector into an 8x8 image
curim = curX[r, :].reshape((8,8))
# Display the digit (and remove ticks)
plt.imshow(curim, cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
plt.show()
|
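For example, you can call it directly on the arrays loaded above:
| # Show a few random examples of each digit class
plot_digits_examples(x_digits, y_digits, n_class=10, n_examples=3)
|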
Unsupervised Feature Selection
In this part, we will see how to use the VarianceThreshold to remove features with low variance.
The principle is to compute variance for each feature, and remove the features that have a variance smaller than a threshold. By default, the method removes only features with a variance of zero (their values are always identical).
A word of caution, though: the variances are not computed in the same way depending on whether the threshold is 0 (the default) or a higher value, since checking for variances of exactly zero is done with a method that has better numerical precision (see the code here for a more in-depth explanation).
In the following, we will start with a small non-zero threshold, which you will then vary.
| print(f'Original features shape: {x_digits.shape}')
from sklearn.feature_selection import VarianceThreshold
thr = 0.1 # TODO: you should vary this threshold, and see what happens!
selector = VarianceThreshold(threshold=thr)
# Apply the threshold on the digits data
x_digits_thresholded = selector.fit_transform(x_digits)
print(f'Features shape after threshold: {x_digits_thresholded.shape}')
print(f"{x_digits.shape[1] - x_digits_thresholded.shape[1]} features were removed")
|
Let's visualize which features have been removed! Generate a boolean mask named mask
that equals True if a feature must be kept, and False otherwise. This can be done by thresholding the variances vector.
| ### - Compute a boolean mask based on the previous VarianceThreshold selector.
### Hint: you should use the variances computed by the selector.
### You can use the following attribute:
variances = selector.variances_
|
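If you are unsure where to start, here is a minimal sketch (it mirrors what selector.get_support() would return, since VarianceThreshold keeps features whose variance is strictly greater than the threshold):
| # Keep features whose variance exceeds the threshold
mask = variances > thr
# Sanity check: the number of kept features should match the transform above
print(mask.sum(), x_digits_thresholded.shape[1])
|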
We provide the function plot_digits_examples_thresholded()
which, given a mask, plots the original digits, the mask, and the digits without the removed features. The final figure is organized as follows:
- the first row shows original examples of the digits dataset
- the second row shows the mask that selects features (selected features are in white)
- the third row shows the result of masking the original samples from the first row with the mask
| def plot_digits_examples_thresholded(x_digits, y_digits, mask, n_class=10):
"""
Plots examples of digits along with their corresponding masks.
Parameters:
- x_digits (numpy.ndarray): Array of digit images.
- y_digits (numpy.ndarray): Array of digit labels.
- mask (numpy.ndarray): Array representing the mask to be applied to the digits.
- n_class (int): Number of classes. Should be the same as in load_digits. Default is 10.
"""
fig = plt.figure(figsize=(10,5))
for idx_class in range(n_class): # Parse all classes
# Index of the current example in the final patchwork
plt.subplot(3, n_class, idx_class+1)
# Pick a random digit in the current category
curX = x_digits[y_digits==idx_class]
r = np.random.randint(curX.shape[0])
curim = curX[r, :].reshape((8,8))
# Display the digit (and remove ticks)
plt.imshow(curim, cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
# Do the same, but for the mask
plt.subplot(3, 10, idx_class+11)
plt.imshow(mask.reshape((8,8)), cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
# Do the same, but for the masked digit
plt.subplot(3,10,idx_class+21)
curim_masked = curim*mask.reshape((8,8))
plt.imshow(curim_masked,cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
plt.show()
|
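For instance, assuming your boolean mask is stored in a variable named mask:
| # Visualize the original digits, the selection mask, and the masked digits
plot_digits_examples_thresholded(x_digits, y_digits, mask)
|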
By changing the threshold, you can see more or fewer features being selected (equivalently, features being removed). What can you conclude about the features which are removed? What happens when the threshold is very high?
Selecting the threshold
In order to know how to select the threshold, you should examine the distribution of feature variances. First, let's visualize it using a histogram.
| ### - Compute a histogram of the feature variances.
|
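A minimal sketch, assuming variances was retrieved from the selector as above:
| # Distribution of the per-feature variances
plt.hist(variances, bins=30)
plt.xlabel('Feature variance')
plt.ylabel('Number of features')
plt.show()
|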
You could choose the threshold directly from this plot, but it does not directly tell you how many features will be removed.
We can use the numpy percentile function (or the equivalent nanpercentile if your features may contain NaN) in order to find the threshold that removes a certain percentage of features. For instance, you can start by removing 75% of the features.
| ### - Use the numpy percentile function to find the threshold that removes a given percentage of features.
|
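One possible sketch (removing roughly 75% of the features; the exact count may differ slightly because of ties):
| # Threshold below which 75% of the feature variances lie
thr = np.percentile(variances, 75)
selector = VarianceThreshold(threshold=thr)
x_digits_thresholded = selector.fit_transform(x_digits)
print(f'{x_digits_thresholded.shape[1]} features kept out of {x_digits.shape[1]}')
|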
Up to what percentage of removed features can you still recognize the digits? (This is qualitative; there is no single "good answer".)
| ### - Find what is, empirically, the best threshold in your opinion.
|
This part was just to demonstrate a very simple way to remove features in an unsupervised way (you don't need labels).
You can also check out other feature selection methods for which you will need the labels. Feel free to experiment.
K-means Clustering
In this part, we demonstrate the use of k-means clustering.
First, please have another look at the course notes.
Done? We will use the KMeans estimator from sklearn. Here are the key concepts:
- We define a KMeans estimator like this: kmeans = KMeans(n_clusters). The number of clusters has to be decided in advance, using the n_clusters parameter.
- The KMeans clustering is estimated by calling the fit method on the input samples.
Estimate a KMeans clustering on the digit dataset.
| ### - Estimate the Kmeans clustering.
|
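A minimal sketch; here n_clusters=10 (one cluster per digit class) and random_state=0 are choices made for illustration and reproducibility:
| # Fit a KMeans model with 10 clusters on the raw 64-dimensional digits
kmeans = KMeans(n_clusters=10, random_state=0)
kmeans.fit(x_digits)
|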
Now we are going to visualize the centroids of the 10 clusters. First we have to get the center of each cluster.
After calling the method fit, the coordinates of the found centroids are stored in an attribute named cluster_centers_, which you can access as kmeans.cluster_centers_.
| ### - Find the cluster centroids.
|
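For example, assuming the fitted estimator is named kmeans:
| # One 64-dimensional centroid per cluster
centroids = kmeans.cluster_centers_
print(centroids.shape)  # expected: (10, 64)
|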
We will now plot the cluster centroids, using the provided function plot_centroids(). Centroids are points in the same space as the original samples, so we can plot them in the same way as the samples of the dataset.
| def plot_centroids(centroids):
"""
Plot the centroids of the clusters.
Parameters:
- centroids (numpy.ndarray): Array of centroids.
"""
fig = plt.figure(figsize=(10,5))
for i,curcen in enumerate(centroids):
plt.subplot(2, 5, i+1) # Suppose that you have ten centroids.
im_cen = curcen.reshape((8,8))
plt.imshow(im_cen, cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
plt.show()
|
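For example:
| # Display each centroid as an 8x8 image
plot_centroids(centroids)
|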
We are now going to take a look at our reconstructions using our KMeans model.
First we take a sample from each class.
| # This code selects one random example from each of the 10 classes
whichex = np.random.randint(low=0,high=100,size=1)
X_samp = np.concatenate([x_digits[y_digits==i][whichex] for i in range(10)])
# You can take a look at the shape of the examples:
X_samp.shape
|
- The transform method can be used to compute the distances of the samples to the centroids
- The predict method can be used to assign to each sample the label of its closest centroid
Don't hesitate to look up the help page of these methods using the '?' magic
| ### TODO:
### - Use the transform method from the kmeans object on X_samp.
### You will obtain an array containing the distances to the centroids.
### - Use the argmin method from numpy to generate an array containing the index corresponding to the closest centroid to the samples
### - Fetch the corresponding centroid in another array closest_centroids.
### - And finally calculate the distance of each sample to its closest centroid using np.min
### TODO
distances = # TODO (10,10) array containing the distances of each sample to each centroid
idx_closest_centroids = # TODO (10,) array containing the index closest centroid to the samples
closest_centroids= # TODO (10,64) array containing the closest centroid to each sample
smallest_distances = # TODO (10,) array containing the distance of each sample to its closest centroid
print(f'Distances array shape: {distances.shape}, should be (10,10).')
print(f'Indexes of closest centroids array shape {idx_closest_centroids.shape}, should be (10,).')
print(f'Closest centroids array shape {closest_centroids.shape}, should be (10,64).')
print(f'Smallest distances array shape {smallest_distances.shape}, should be (10,).')
### - Check that the indices of your closest centroids are the same as the ones obtained using the predict method.
print(f'Closest centroids according to your code\t: {idx_closest_centroids}')
print(f'Closest centroids according to sklearn\t\t: {kmeans.predict(X_samp)}')
|
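If you get stuck, here is one possible way to fill in the blanks (a sketch, assuming your fitted estimator is named kmeans):
| # Distances of each of the 10 samples to each of the 10 centroids
distances = kmeans.transform(X_samp)
# Index of the closest centroid for each sample
idx_closest_centroids = np.argmin(distances, axis=1)
# Coordinates of the closest centroid for each sample
closest_centroids = kmeans.cluster_centers_[idx_closest_centroids]
# Distance of each sample to its closest centroid
smallest_distances = np.min(distances, axis=1)
|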
Now, plot the samples and their closest centroids, for a visual confirmation. You may use the plot_digits_and_closest_centroids()
provided below.
| def plot_digits_and_closest_centroids(X_samp, closest_centroids, smallest_distances):
"""
Plot the original digits and their closest centroids.
Parameters:
- X_samp (array-like): Array of original digits.
- closest_centroids (array-like): Array of closest centroids for each digit.
- smallest_distances (array-like): Array of smallest distances between each digit and its closest centroid.
"""
plt.figure(figsize=(20,10))
for i,(im,im_cen,distance) in enumerate(zip(X_samp, closest_centroids, smallest_distances)):
plt.subplot(4, 6, 1+2*i)
plt.imshow(im.reshape(8, 8), cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
plt.title("Original")
plt.subplot(4, 6, 2+2*i)
plt.imshow(im_cen.reshape(8, 8), cmap=plt.cm.gray)
plt.xticks([])
plt.yticks([])
plt.title("Closest centroid, distance %.2E"%distance)
plt.show()
|
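For example, with the arrays computed above:
| # Compare each sample with its closest centroid
plot_digits_and_closest_centroids(X_samp, closest_centroids, smallest_distances)
|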
Visualize the elbow method using inertia: fit KMeans models with a number of clusters ranging from 1 to 99, using random_state=0. The inertia of a fitted model is stored in its inertia_ attribute.
| ### TODO:
### - Generate KMeans models with varying n_clusters
### - Fit each model to the data
### - Add its inertia to a dedicated list
### - Finally, plot it.
|
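A minimal sketch of the elbow curve (fitting 99 models can take a minute or two):
| # Fit one KMeans model per number of clusters and record its inertia
inertias = []
n_clusters_range = range(1, 100)
for k in n_clusters_range:
    model = KMeans(n_clusters=k, random_state=0).fit(x_digits)
    inertias.append(model.inertia_)
plt.plot(list(n_clusters_range), inertias)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
|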
Principal Components Analysis
We will now illustrate Principal Components Analysis on the Digits dataset. The first thing is to generate a model using PCA.
Use n_components=16 to instantiate your object.
The method fit is used to compute the principal components. You can then retrieve the components (denoted V in the course) using the attribute components_.
The method fit_transform returns the transformed version generated by the PCA (denoted U in the course), i.e. the weights associated with the components.
| ### TODO
### - Compute PCA.
### - Find the components (V) of PCA.
### - Plot the components using the plot_pca_components function.
# TODO
def plot_pca_components(components):
"""
Plot the components of the PCA.
Parameters:
- components (numpy.ndarray): Array of components.
"""
fig, axis = plt.subplots(4, 4)
for i, d in enumerate(components):
ax = axis[i//4][i%4]
ax.imshow(d.reshape((8, 8)), cmap=plt.cm.gray, vmin=np.min(components), vmax=np.max(components))
ax.set_xticks([])
ax.set_yticks([])
plt.show()
|
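A minimal sketch of the three TODO steps above:
| # Fit PCA with 16 components and keep both the weights (U) and the components (V)
pca = PCA(n_components=16)
weights = pca.fit_transform(x_digits)  # U: one 16-dimensional weight vector per sample
components = pca.components_           # V: 16 components, each of dimension 64
plot_pca_components(components)
|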
Here we can see that the components retain a sense of the structure of the digits: the digits are always centered in the image, and the rest is a roughly uniform background.
Now we want to generate some reconstructions, so first we are going to generate some samples.
| # Randomly select some examples.
whichex = np.random.randint(low=0, high=100, size=1)
samples = list()
indexes = list()
for i in range(10):
index = np.where(y_digits==i)[0][whichex]
samples.append(x_digits[index])
indexes.append(index)
X_samp = np.concatenate(samples)
indexes = np.array(indexes)
|
We will now use the weights obtained with PCA, together with the indexes used to extract the samples, to reconstruct the data: use the np.dot function to perform the matrix multiplication between the weights and the components.
| ### TODO:
### - Generate the reconstructions array using the weights and the components.
### Recall that PCA is a matrix decomposition, hence the result of the decomposition may be retrieved using a matrix product.
|
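A possible sketch, assuming weights, components and pca come from the PCA fitted above. Note that PCA centers the data, so the mean has to be added back (pca.inverse_transform would do the same thing):
| # Reconstruct the selected samples from their PCA weights
reconstructions = np.dot(weights[indexes.ravel()], components) + pca.mean_
print(reconstructions.shape)  # expected: (10, 64)
|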
Finally, we compare each reconstructed sample with the original one. You may use plot_original_digits_and_pca_reconstruction()
for plotting the results.
| def plot_original_digits_and_pca_reconstruction(X_samp, reconstructions):
"""
Plot the original digits and their PCA reconstructions.
Parameters:
- X_samp (numpy.ndarray): Array of original digits.
- reconstructions (numpy.ndarray): Array of reconstructed digits.
"""
plt.figure(figsize=(20,5))
for plot_index,(digit,reconstruction) in enumerate(zip(X_samp,reconstructions)):
plt.subplot(2,10,plot_index*2+1)
plt.imshow(digit.reshape((8,8)),cmap=plt.cm.gray,vmin=x_digits.min(),vmax=x_digits.max())
plt.xticks([])
plt.yticks([])
plt.title('$x$')
plt.subplot(2,10,plot_index*2+2)
plt.imshow(reconstruction.reshape((8,8)),cmap=plt.cm.gray,vmin=reconstructions.min(),vmax=reconstructions.max())
plt.xticks([])
plt.yticks([])
error = np.sum((reconstruction-digit)**2)
        plt.title(r'$\tilde{x}$, error %.2E' % error)
plt.show()
|
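For example:
| # Compare the original samples with their PCA reconstructions
plot_original_digits_and_pca_reconstruction(X_samp, reconstructions)
|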
T-SNE
To give an example of the importance of manifold learning, we train a t-SNE model and plot the 2D visualization. We will not go into more detail here, but note that t-SNE is a very common tool for studying data in an unsupervised manner.
| from sklearn.manifold import TSNE
unsup = TSNE(random_state=0)
examples = unsup.fit_transform(x_digits)
plt.scatter(examples[:,0], examples[:,1], c=y_digits)
plt.colorbar()
|
Application
There are three things to do:
- Choose an unsupervised learning algorithm from the following list:
- Use synthetic data generation, or a simple "toy" dataset, to explain how the algorithm works, and the influence of its hyperparameters.
- Apply this algorithm to your chosen application, by referring to the application pages (section "Ressources for Session 3")
Note
If you choose a clustering technique, you can use the ground truth labels to evaluate the quality of your clustering. Remember that you can have a different number of clusters than the "true" number of classes.
In Session 4, you will have to present these three things in a 7-minute presentation.