Adapting a Deep Learning model
TODO!
Outline of the methods
The course was segmented into three techniques:
- Transfer Learning
- Fine-Tuning
- Parameter-Efficient Fine-Tuning (PEFT)
Requirements
You need to be able to load a HuggingFace model; if you don't know how, refer to the last Lab. You also need to create a DataLoader with your data; refer to the PyTorch tutorial if you need to.
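As a reminder, the data side of these requirements can be sketched as below. This is a minimal, self-contained example: the random tensors, the shapes, and the batch size are placeholder assumptions, and the commented-out checkpoint name is only an illustration of the loading step covered in the last Lab.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical toy data standing in for your own features and labels.
features = torch.randn(100, 16)        # 100 samples, 16 features each
labels = torch.randint(0, 3, (100,))   # 3 classes

dataset = TensorDataset(features, labels)
train_dataloader = DataLoader(dataset, batch_size=8, shuffle=True)

# Loading a HuggingFace backbone (requires `transformers`), as in the last Lab:
# from transformers import AutoModel
# model = AutoModel.from_pretrained("bert-base-uncased")

batch_features, batch_labels = next(iter(train_dataloader))
print(batch_features.shape)  # torch.Size([8, 16])
```

Each iteration of the DataLoader yields one `(inputs, labels)` batch, which is exactly what the training loops below consume.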
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm

model = ...             # TODO: load from HuggingFace (see the last Lab).
train_dataloader = ...  # TODO: your PyTorch DataLoader with your custom data.

# 0. PARAMETERS
# Get the dimension of the last layer of the original model, i.e. the dimension of its outputs.
# Note that the syntax may change depending on your model.
hidden_size = model.config.hidden_size
num_classes = ...  # TODO: the number of classes you want to estimate in your adapted model.

# 0.5 INTERMEDIATE LAYER
# Alternatively, you could decide not to use the last layer but an intermediate one.
# You can either redefine the forward function to stop at the desired layer, or use the following:
## Get the embeddings from the model
outputs = model(**inputs, output_hidden_states=True)
## Get the embeddings for the specific layer (layer_num is the index into hidden_states)
intermediate_layer = outputs.hidden_states[layer_num]
# Change hidden_size accordingly.
In the following, we will assume that you have filled in the code presented above.
Transfer Learning
Below is a snippet of code for a transfer learning task:
# 1. FREEZE THE PRE-TRAINED MODEL (BACKBONE) ❄️
# Here, we freeze the original parameters of the pre-trained model.
# This is the most important step for this method.
for param in model.parameters():
    param.requires_grad = False  # PyTorch syntax for freezing weights

# 2. ADD A NEW, TRAINABLE HEAD 🔥
# As an example, we use a simple linear layer.
# Its parameters are trainable because requires_grad=True by default in PyTorch.
classification_head = nn.Linear(hidden_size, num_classes)

# 3. TRAINING LOOP (simplified)
# CRITICAL: we only pass the parameters of the *new head* to the optimizer.
# The frozen backbone parameters are not included.
optimizer = optim.AdamW(classification_head.parameters(), lr=1e-3)  # Set up the optimizer
criterion = nn.CrossEntropyLoss()  # Loss for classification

# Set the frozen backbone to eval mode and the head to train mode.
model.eval()
classification_head.train()

# Move the model and head to CPU or GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
classification_head.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)
    # 0. Clear gradients
    optimizer.zero_grad()
    # 1. Get backbone outputs (features)
    # Use torch.no_grad() for the backbone pass to save memory
    with torch.no_grad():
        outputs = model(**inputs)
        # For sequence classification, keep only the [CLS] token's embedding:
        # [batch_size, sequence_length, hidden_size] -> [batch_size, hidden_size]
        features = outputs.last_hidden_state[:, 0, :]
        # For token-level tasks, keep the full sequence instead:
        # features = outputs.last_hidden_state
    # 2. Pass features to the new head
    logits = classification_head(features)
    # 3. Calculate loss
    loss = criterion(logits, labels)
    # 4. Backpropagate (only computes grads for the head)
    loss.backward()
    # 5. Update weights (only the head)
    optimizer.step()

print("Training complete! Only the head was trained.")
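A quick sanity check before launching training, not part of the snippet above, is to count trainable parameters and confirm the freeze worked. The toy `backbone` and `head` below are hypothetical stand-ins for the real model:

```python
import torch.nn as nn

def count_trainable(module: nn.Module) -> int:
    """Number of parameters that will receive gradient updates."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# Hypothetical toy modules standing in for the pre-trained backbone and new head.
backbone = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
head = nn.Linear(16, 3)

for param in backbone.parameters():
    param.requires_grad = False  # freeze, as in the snippet above

print(count_trainable(backbone))  # 0
print(count_trainable(head))      # 16*3 weights + 3 biases = 51
```

If the backbone count is not zero, the optimizer would silently behave correctly here (it only received the head's parameters), but gradients would still be computed for the backbone, wasting memory.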
Fine-Tuning
Below is a snippet of code for fine-tuning the backbone together with a new head:
# 1. DEFINE A NEW, TRAINABLE HEAD 🔥
# As an example, we use a simple linear layer.
# Its parameters are trainable because requires_grad=True by default in PyTorch.
classification_head = nn.Linear(hidden_size, num_classes)

# 2. COMBINE THE BACKBONE AND HEAD INTO ONE MODEL
# This is the "clean" PyTorch way to do it.
class FullModel(nn.Module):
    def __init__(self, backbone, classification_head):
        super().__init__()
        self.backbone = backbone          # The pre-trained part 🔥
        self.head = classification_head   # Your new part 🔥

    def forward(self, input_ids, attention_mask):
        # 1. Get features from the backbone.
        # The backbone's parameters keep requires_grad=True (the default), so it is trained too.
        outputs = self.backbone(input_ids=input_ids,
                                attention_mask=attention_mask)
        # For sequence classification, keep only the [CLS] token's embedding:
        # [batch_size, sequence_length, hidden_size] -> [batch_size, hidden_size]
        features = outputs.last_hidden_state[:, 0, :]
        # For token-level tasks, keep the full sequence instead:
        # features = outputs.last_hidden_state
        # 2. Pass features to our custom head
        logits = self.head(features)
        return logits

# 3. DEFINE THE NEW MODEL
fine_tuned_model = FullModel(model, classification_head)

# 4. TRAINING LOOP (simplified)
# CRITICAL: we pass *all* parameters to the optimizer.
# We must use a *very low* learning rate to avoid destroying the pre-trained
# knowledge ("catastrophic forgetting").
optimizer = optim.AdamW(fine_tuned_model.parameters(), lr=2e-5)  # 2e-5 is a common default
criterion = nn.CrossEntropyLoss()  # Loss for classification

# Set the model to train mode
fine_tuned_model.train()

# Move the model to CPU or GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
fine_tuned_model.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)
    # 0. Clear gradients
    optimizer.zero_grad()
    # 1. Forward pass
    logits = fine_tuned_model(**inputs)
    # 2. Calculate loss
    loss = criterion(logits, labels)
    # 3. Backpropagate (computes grads for the backbone *and* the head)
    loss.backward()
    # 4. Update all weights
    optimizer.step()

print("Training complete! The backbone and custom head were fine-tuned.")
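A common refinement of full fine-tuning, not covered in the snippet above, is to give the pre-trained backbone a much smaller learning rate than the freshly initialized head, using PyTorch optimizer parameter groups. The modules below are hypothetical stand-ins:

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical toy modules standing in for the backbone and head.
backbone = nn.Linear(16, 16)
head = nn.Linear(16, 3)

# One optimizer, two parameter groups: the backbone moves slowly to preserve
# pre-trained knowledge, while the head learns quickly from scratch.
optimizer = optim.AdamW([
    {"params": backbone.parameters(), "lr": 2e-5},
    {"params": head.parameters(), "lr": 1e-3},
])

print([g["lr"] for g in optimizer.param_groups])  # [2e-05, 0.001]
```

This sits between pure transfer learning (backbone lr of 0, i.e. frozen) and uniform fine-tuning (one lr everywhere).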
Parameter-Efficient Fine-Tuning (PEFT) -- LoRA
LoRA can be easily used with HuggingFace's peft library. See the HuggingFace documentation for more details.
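Before reaching for the library, it may help to see what LoRA actually does: it freezes a weight matrix W and learns a low-rank update B @ A, scaled by alpha / r. The sketch below is pure PyTorch; the dimensions and initialization follow the usual LoRA formulation (A random, B zero, so training starts from the unchanged model), and are illustrative assumptions rather than course requirements.

```python
import torch

d, r = 768, 8    # hidden size and LoRA rank (assumed values)
alpha = 8        # scaling factor

W = torch.randn(d, d)    # frozen pre-trained weight
A = torch.randn(r, d)    # LoRA "down" matrix (random init)
B = torch.zeros(d, r)    # LoRA "up" matrix (zero init)

# Effective weight seen by the forward pass: W + (alpha / r) * B @ A.
# With B initialized to zero, the update starts at zero and the adapted
# model is exactly the pre-trained model.
W_adapted = W + (alpha / r) * (B @ A)

full_params = W.numel()              # 768 * 768 = 589_824
lora_params = A.numel() + B.numel()  # 2 * 8 * 768 = 12_288
print(lora_params / full_params)     # ~0.021: about 2% of the full matrix
```

Only A and B are trained, which is where the "parameter-efficient" in PEFT comes from.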
# Required imports
from peft import LoraConfig, get_peft_model

## 1. CONFIGURE THE LORA MODULE
lora_rank = 8  # Rank of the LoRA matrices; small values (4-16) are typical.
config = LoraConfig(
    r=lora_rank,
    target_modules=["query", "key", "value"],  # Modules to apply LoRA on.
    # Their names will surely change depending on the model you use!
    # Below are parameters with their default values; they may be changed.
    lora_dropout=0.0,
    lora_alpha=8,
    bias="none",
)

## 2. SET UP YOUR MODEL WITH LORA ADAPTERS
# get_peft_model freezes the base weights; only the LoRA adapters stay trainable.
lora_model = get_peft_model(model, config)

## 3. DEFINE A NEW, TRAINABLE CLASSIFICATION HEAD
# Generally, we set a simple linear layer.
# This assumes your base model is a classification model with a `classifier`
# attribute; change accordingly.
lora_model.classifier = nn.Linear(hidden_size, num_classes)

# 4. TRAINING LOOP (simplified)
# We can pass *all* parameters to the optimizer: only the LoRA adapters and
# the head have requires_grad=True, so only they will be updated.
optimizer = optim.AdamW(lora_model.parameters(), lr=1e-3)  # 1e-3 is a common default
criterion = nn.CrossEntropyLoss()  # Loss for classification

# Set the model to train mode
lora_model.train()

# Move the model to CPU or GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
lora_model.to(device)

# Actual training loop
for inputs, labels in tqdm(train_dataloader, desc="Training", leave=False):
    # Move data to the device
    inputs, labels = inputs.to(device), labels.to(device)
    # 0. Clear gradients
    optimizer.zero_grad()
    # 1. Forward pass (HuggingFace classification models return an object with .logits)
    logits = lora_model(**inputs).logits
    # 2. Calculate loss
    loss = criterion(logits, labels)
    # 3. Backpropagate (computes grads for the LoRA adapters and the head)
    loss.backward()
    # 4. Update weights (only the LoRA adapters and the head change)
    optimizer.step()

print("Training complete! The LoRA adapters and custom head were fine-tuned.")
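One practical consequence of the low-rank formulation, not shown above: after training, the update B @ A can be folded into the base weight once, so inference costs exactly the same as the original model (this is what peft's `merge_and_unload` does). A pure-PyTorch sketch under that formulation, with illustrative dimensions and random matrices standing in for trained adapters:

```python
import torch

torch.manual_seed(0)
d, r, alpha = 32, 4, 8

W = torch.randn(d, d, dtype=torch.float64)  # frozen base weight
A = torch.randn(r, d, dtype=torch.float64)  # "trained" LoRA matrices
B = torch.randn(d, r, dtype=torch.float64)  # (random here for illustration)

x = torch.randn(5, d, dtype=torch.float64)  # a batch of inputs

# Adapter path at inference: base output plus the scaled low-rank update.
y_adapter = x @ W.T + (alpha / r) * (x @ A.T @ B.T)

# Merged path: fold the update into the weight once, then use a plain linear map.
W_merged = W + (alpha / r) * (B @ A)
y_merged = x @ W_merged.T

print(torch.allclose(y_adapter, y_merged))  # True
```

The two paths are algebraically identical since (B @ A).T == A.T @ B.T, which is why LoRA adds no latency at deployment time once merged.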