
Adapting a Deep Learning model

TODO!

Outline of the methods

The course was segmented into three techniques:

  1. Transfer Learning
  2. Fine-Tuning
  3. Parameter-Efficient Fine-Tuning (PEFT)

Requirements

You need to be able to load a HuggingFace model. If you don't know how to do it, refer to the last Lab. You also need to create a Dataloader with your data, refer to the PyTorch tutorial if you need to.
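As a quick reminder, a DataLoader can be built directly from in-memory tensors. The feature dimension, number of classes, and batch size below are placeholders to adapt to your own data; the commented checkpoint name is only an example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical example data: 100 samples of 16-dim features with labels in {0, 1, 2}.
features = torch.randn(100, 16)
labels = torch.randint(0, 3, (100,))

dataset = TensorDataset(features, labels)
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# The model itself would come from HuggingFace, e.g.:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("bert-base-uncased")
```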

import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

model = # Load from HuggingFace

train_dataloader = # Your PyTorch DataLoader with your custom data.

# 0. PARAMETERS
# Get the dimension of the last layer of the original model, i.e. the dimension of the outputs of the model.
# Note that the syntax may change depending on your model.
hidden_size = model.config.hidden_size 

num_classes = # TODO: the number of classes you want to estimate in your adapted model.

# 0.5 Intermediate layer
# Alternatively, you could decide not to use the last layer but an intermediate one.
# You can either redefine the forward function for stopping at the desired layer, or use the following code:
## Get the embeddings from the model
outputs = model(**inputs, output_hidden_states=True)
## Get the embeddings for the specified layer
intermediate_layer = outputs.hidden_states[layer_num]
# Change the hidden size accordingly.

In the following, we will assume that you have filled out the code presented above.

Transfer Learning

Below is a snippet of code for a transfer learning task:

# 1. FREEZE THE PRE-TRAINED MODEL (BACKBONE) ❄️
# Here, we are freezing the original parameters of the pre-trained model.
# This is the most important step for this method.
for param in model.parameters():
    param.requires_grad = False # PyTorch syntax for freezing weights

# 2. ADD A NEW, TRAINABLE HEAD 🔥
# As an example, we provide a simple linear layer. 
# Its parameters are trainable because requires_grad=True by default in PyTorch.
classification_head = nn.Linear(hidden_size, num_classes)

# 3. TRAINING LOOP (simplified)
# CRITICAL: We only pass the parameters of the *new head* to the optimizer.
# The frozen backbone parameters are not included.
optimizer = optim.AdamW(classification_head.parameters(), lr=1e-3) # Set up the optimizer
criterion = nn.CrossEntropyLoss() # Loss for classification

# Set your head to train mode
classification_head.train()
# Move the model to GPU if available, otherwise CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
classification_head.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)

    # 0. Clear gradients
    optimizer.zero_grad()

    # 1. Get backbone outputs (features)
    # Use torch.no_grad() for the backbone pass to save memory
    with torch.no_grad():
        outputs = model(**inputs) # Get prediction
        features = outputs.last_hidden_state # Extract only the last layer embeddings

        # In some models you may only want the [CLS] token's output for classification
        # features = outputs.last_hidden_state[:, 0, :] # [batch_size, sequence_length, hidden_size] -> [batch_size, hidden_size]

    # 2. Pass features to the new head
    logits = classification_head(features)

    # 3. Calculate loss
    loss = criterion(logits, labels)

    # 4. Backpropagate (only calculates grads for the head)
    loss.backward()

    # 5. Update weights (only for the head)
    optimizer.step()

print("Training complete! Only the head was trained.")
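To check that the freeze worked, you can count trainable parameters before training. The sketch below uses small `nn.Linear` modules as stand-ins for the backbone and head; the dimensions are arbitrary.

```python
import torch.nn as nn

# Toy stand-ins for the pre-trained backbone and the new head.
backbone = nn.Linear(32, 32)
head = nn.Linear(32, 3)

# Freeze the backbone, exactly as in the snippet above.
for param in backbone.parameters():
    param.requires_grad = False

def count_trainable(module):
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(count_trainable(backbone))  # 0: the backbone is frozen
print(count_trainable(head))      # 32 * 3 weights + 3 biases = 99
```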

Fine-Tuning

# 1. DEFINE A NEW, TRAINABLE HEAD 🔥
# As an example, we provide a simple linear layer. 
# Its parameters are trainable because requires_grad=True by default in PyTorch.
classification_head = nn.Linear(hidden_size, num_classes) 

# 2. COMBINE THE BACKBONE AND HEAD INTO ONE MODEL
# This is the "clean" PyTorch way to do it.
class FullModel(nn.Module):
    def __init__(self, backbone, classification_head):
        super(FullModel, self).__init__()
        self.backbone = backbone # The pre-trained part 🔥
        self.head = classification_head # Your new part 🔥

    def forward(self, input_ids, attention_mask):
        # 1. Get features from the backbone
        # The backbone's requires_grad is True (by default)
        outputs = self.backbone(input_ids=input_ids, 
                                attention_mask=attention_mask)

        features = outputs.last_hidden_state # Extract only the last layer embeddings

        # In some models you may only want the [CLS] token's output for classification
        # features = outputs.last_hidden_state[:, 0, :] # [batch_size, sequence_length, hidden_size] -> [batch_size, hidden_size]

        # 2. Pass features to our custom head
        logits = self.head(features)

        return logits

# 3. DEFINE THE NEW MODEL
fine_tuned_model = FullModel(model, classification_head)

# 4. TRAINING LOOP (simplified)
# CRITICAL: We pass *all* parameters to the optimizer.
# We must use a *very low* learning rate to avoid destroying the pre-trained knowledge ("catastrophic forgetting").
optimizer = optim.AdamW(fine_tuned_model.parameters(), lr=2e-5) # 2e-5 is a common default
criterion = nn.CrossEntropyLoss() # Loss for classification

# Set the model to train mode
fine_tuned_model.train()
# Move the model to GPU if available, otherwise CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fine_tuned_model.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)

    # 0. Clear gradients
    optimizer.zero_grad()

    # 1. Forward pass
    logits = fine_tuned_model(**inputs)

    # 2. Calculate loss
    loss = criterion(logits, labels)

    # 3. Backpropagate (calculates grads for the backbone *and* the head)
    loss.backward()

    # 4. Update weights (all parameters are updated)
    optimizer.step()

print("Training complete! The backbone and custom head were fine-tuned.")
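A common refinement (not required for this lab) is to give the backbone a smaller learning rate than the head, using optimizer parameter groups: the pre-trained weights then move slowly while the randomly initialized head learns quickly. The modules and rates below are illustrative placeholders.

```python
import torch.nn as nn
import torch.optim as optim

# Toy stand-ins for the pre-trained backbone and the new head.
backbone = nn.Linear(32, 32)
head = nn.Linear(32, 3)

# One parameter group per module, each with its own learning rate.
optimizer = optim.AdamW([
    {"params": backbone.parameters(), "lr": 2e-5},  # small LR preserves pre-trained knowledge
    {"params": head.parameters(), "lr": 1e-3},      # larger LR for the fresh head
])
```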

Parameter-Efficient Fine-Tuning (PEFT) -- LoRA

LoRA can be used easily with HuggingFace. See the HuggingFace documentation for more details.

# Required imports
from peft import LoraConfig, get_peft_model

## 1. CONFIGURE THE LORA MODULE
config = LoraConfig(
            r=lora_rank, # Rank of the LoRA matrices
            target_modules=["query", "key", "value"], # Modules to apply LoRA on. Their names will likely differ depending on the model you use!

            # Below are parameters set to their default values; adjust as needed.
            lora_dropout=0,
            lora_alpha=8,
            bias="none",
        )

## 2. SET UP YOUR MODEL WITH LORA ADAPTERS
lora_model = get_peft_model(model, config)

## 3. DEFINE A NEW, TRAINABLE CLASSIFICATION HEAD
# Generally, we set a simple linear layer.
lora_model.classifier = nn.Linear(hidden_size, num_classes) # This assumes your model exposes a `classifier` attribute; adapt the name to your model's architecture.

# 4. TRAINING LOOP (simplified)
# CRITICAL: We pass *all* parameters to the optimizer.
optimizer = optim.AdamW(lora_model.parameters(), lr=1e-3) # 1e-3 is a common default
criterion = nn.CrossEntropyLoss() # Loss for classification

# Set the model to train mode
lora_model.train()
# Move the model to GPU if available, otherwise CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lora_model.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)

    # 0. Clear gradients
    optimizer.zero_grad()

    # 1. Forward pass
    outputs = lora_model(**inputs)
    logits = outputs.logits # HuggingFace models return an output object; extract the logits

    # 2. Calculate loss
    loss = criterion(logits, labels)

    # 3. Backpropagate (only calculates grads for the LoRA adapters and the head)
    loss.backward()

    # 4. Update weights (LoRA adapters and head; the frozen backbone receives no gradients)
    optimizer.step()

print("Training complete! The backbone (with LoRA) and custom head were fine-tuned.")
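To see why LoRA is parameter-efficient, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer: the base weight is frozen and only a low-rank update `B @ A` is trained. This is purely illustrative (in practice `peft` builds the adapters for you), and the dimensions and rank below are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        # Low-rank factors: A is (r, in), B is (out, r). B starts at zero,
        # so the wrapped layer initially behaves exactly like the base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(trainable, total)  # 6144 trainable out of 596736: ~1% of the parameters
```

With rank 4, the trainable update costs only `2 * 4 * 768 = 6144` parameters, versus roughly 590k for the full layer, which is the whole point of the method.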