🎯 Lesson Objectives
After this lesson, you will:
✅ Understand how BERT and GPT work
✅ Know how to fine-tune pretrained models
✅ Use Hugging Face Transformers
✅ Apply them to real-world NLP tasks
Recap of the Previous Lesson
We covered Attention and the Transformer architecture. Today we learn how to use them in practice!
🎯 Transfer Learning Recap
Why Transfer Learning?
Transfer Learning = reusing knowledge learned on one task to solve a new one.
Pretrained models (BERT, GPT, ResNet) have already learned:
- Language understanding (NLP)
- Visual patterns (Vision)
- Common-sense knowledge
→ No need to train from scratch, saving both time and compute.
Benefits
| Benefit | Details |
|---|---|
| Less data | 100-1000 samples can be enough |
| Faster | Fine-tune in hours instead of weeks |
| Better | Pretrained features are already strong |
| Cheaper | No GPU clusters required |
Checkpoint
Do you understand the benefits of Transfer Learning?
🤗 Hugging Face Transformers
Introduction
Hugging Face is the most popular library for using pretrained Transformer models.
- 200,000+ models
- Easy to use
- Supports PyTorch, TensorFlow, and JAX
Installation
```python
# Install
# pip install transformers datasets

from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TFAutoModel,  # For TensorFlow
    pipeline
)
```
Pipeline (Easiest way)
```python
from transformers import pipeline

# Sentiment Analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this movie! It's fantastic.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(text)

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
print(result)
# {'answer': 'Paris', 'score': 0.98, ...}

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Elon Musk is the CEO of Tesla in California")
print(result)
```
Available Pipelines
| Pipeline | Task |
|---|---|
| sentiment-analysis | Text classification |
| text-generation | Generate text |
| question-answering | Extract answers |
| ner | Named entity recognition |
| fill-mask | Fill the [MASK] token |
| summarization | Summarize text |
| translation | Translate text |
| zero-shot-classification | Classify without training |
Checkpoint
Do you know how to use a Hugging Face pipeline?
🔧 Fine-tuning BERT
Load Model và Tokenizer
```python
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    AutoModelForSequenceClassification
)

# Model name
model_name = "bert-base-uncased"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for classification (TensorFlow)
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # Binary classification
)

# Or PyTorch
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name,
#     num_labels=2
# )
```
Tokenization
```python
# Single sentence
text = "This movie is great!"
tokens = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="tf"  # or "pt" for PyTorch
)

print("Input IDs:", tokens["input_ids"].shape)
print("Attention Mask:", tokens["attention_mask"].shape)
print("Decoded:", tokenizer.decode(tokens["input_ids"][0]))

# Batch tokenization
texts = ["I love this!", "This is terrible.", "Pretty good movie."]
batch_tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf"
)
```
```
Input IDs: (1, 128)
Attention Mask: (1, 128)
Decoded: [CLS] this movie is great! [SEP] [PAD] [PAD] ...
```
Fine-tuning with Keras
```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Prepare data
train_texts = ["Great movie!", "Terrible film", "I loved it", "Waste of time"]
train_labels = [1, 0, 1, 0]  # 1=positive, 0=negative

# Tokenize
train_encodings = tokenizer(
    train_texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf"
)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(2)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

# Compile
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# Train
model.fit(train_dataset, epochs=3)
```
Checkpoint
Do you know how to fine-tune BERT?
📊 Fine-tuning Strategies
Strategies
| Strategy | Description | When to use |
|---|---|---|
| Feature Extraction | Freeze BERT, train only the classifier | Very little data (fewer than 100 samples) |
| Full Fine-tuning | Train the entire model | Standard (more than 1000 samples) |
| Gradual Unfreezing | Unfreeze layers gradually | Medium-sized data |
| Discriminative LR | Different LR per layer | Best performance |
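A common way to realize the Discriminative LR row above is layer-wise learning-rate decay: the top layer gets the base LR, and each layer below it gets the base LR scaled by a decay factor per level of depth. A minimal framework-free sketch (the decay value 0.95 is an assumption for illustration, not from the lesson):

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.95) -> list:
    """Per-layer learning rates: index 0 is the bottom (first) encoder layer.
    The top layer gets base_lr; each layer below is scaled down by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# BERT-base has 12 encoder layers
lrs = layerwise_lrs(base_lr=2e-5, num_layers=12)
print(f"bottom layer LR: {lrs[0]:.2e}, top layer LR: {lrs[-1]:.2e}")
```

The intuition: lower layers encode general language features that should change slowly, while upper layers are more task-specific and can move faster.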
Feature Extraction
```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from transformers import TFAutoModel

# Load BERT without a classification head
base_model = TFAutoModel.from_pretrained("bert-base-uncased")

# Freeze BERT
base_model.trainable = False

# Build a classifier on top
inputs = {
    "input_ids": layers.Input(shape=(128,), dtype=tf.int32),
    "attention_mask": layers.Input(shape=(128,), dtype=tf.int32)
}

bert_output = base_model(inputs)
pooled = bert_output.last_hidden_state[:, 0, :]  # [CLS] token

x = layers.Dense(256, activation='relu')(pooled)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation='softmax')(x)

model = Model(inputs, outputs)

# Only the classifier head is trainable
print(f"Trainable params: {sum(tf.size(w).numpy() for w in model.trainable_weights):,}")
```
Discriminative Learning Rates
```python
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

# Different LR for different layers
def get_optimizer_with_discriminative_lr(model, base_lr=2e-5):
    """
    Lower LR for lower layers, higher LR for top layers
    """
    # Group parameters
    embeddings = []
    encoder_layers = [[] for _ in range(12)]  # BERT-base has 12 layers
    classifier = []

    for var in model.trainable_variables:
        name = var.name.lower()

        if 'embedding' in name:
            embeddings.append(var)
        elif 'classifier' in name or 'pooler' in name:
            classifier.append(var)
        else:
            # Find layer number
            for i in range(12):
                if f'layer_._{i}' in name or f'layer/{i}' in name:
                    encoder_layers[i].append(var)
                    break

    # LR multipliers: embeddings, then layers 0..11, then classifier
    lr_multipliers = [0.1] + [0.1 + 0.9 * i / 11 for i in range(12)] + [1.0]

    # Simplified: use a single optimizer with base_lr
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_lr)

    return optimizer

# Usage
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = get_optimizer_with_discriminative_lr(model)
```
Gradual Unfreezing
```python
import tensorflow as tf

def train_with_gradual_unfreezing(model, train_data, epochs_per_stage=2):
    """
    Unfreeze layers gradually from top to bottom
    """
    # Stage 1: Only the classifier
    model.bert.trainable = False

    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 1: Training classifier only")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 2: Unfreeze the top 4 encoder layers
    model.bert.trainable = True
    model.bert.embeddings.trainable = False
    for layer in model.bert.encoder.layer[:-4]:
        layer.trainable = False

    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 2: Training top 4 layers + classifier")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 3: Full fine-tuning
    model.bert.trainable = True

    model.compile(
        optimizer=tf.keras.optimizers.Adam(2e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 3: Full fine-tuning")
    model.fit(train_data, epochs=epochs_per_stage)
```
Checkpoint
Do you understand the fine-tuning strategies?
💻 Complete Fine-tuning Pipeline
Full Pipeline với Hugging Face Trainer
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1. Load dataset
dataset = load_dataset("imdb")
# Subset for demo
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(200))

# 2. Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# 3. Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

train_tokenized = train_data.map(tokenize_function, batched=True)
test_tokenized = test_data.map(tokenize_function, batched=True)

# 4. Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions)
    }

# 5. Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Evaluate
results = trainer.evaluate()
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1: {results['eval_f1']:.4f}")

# 9. Save model
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
```
Inference with the saved model
```python
from transformers import pipeline

# Load the saved model
classifier = pipeline(
    "sentiment-analysis",
    model="./my_model",
    tokenizer="./my_model"
)

# Predict
texts = [
    "This movie was absolutely fantastic!",
    "I wasted two hours of my life on this garbage.",
    "It was okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text[:50]}...")
    print(f"  → {result['label']}: {result['score']:.4f}")
```
Checkpoint
Can you build a full fine-tuning pipeline?
🎯 Best Practices
Hyperparameters
| Parameter | Recommended Value |
|---|---|
| Learning Rate | 1e-5 to 5e-5 |
| Batch Size | 16, 32 |
| Epochs | 2-4 |
| Max Length | Task dependent (128, 256, 512) |
| Warmup | 6-10% of total steps |
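The warmup row above is a fraction of total optimizer steps, so the actual step count depends on dataset size, batch size, and epochs. A small helper sketching that arithmetic (the function name and 10% default are illustrative assumptions):

```python
def warmup_steps(num_samples: int, batch_size: int, epochs: int,
                 warmup_ratio: float = 0.1) -> int:
    """Number of LR-warmup steps as a fraction of total optimizer steps."""
    steps_per_epoch = -(-num_samples // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    return int(total_steps * warmup_ratio)

# e.g. 1000 training samples, batch size 16, 3 epochs, 10% warmup
print(warmup_steps(1000, 16, 3))  # 63 steps/epoch -> 189 total -> 18
```

This is the same quantity that `TrainingArguments(warmup_ratio=...)` computes internally, or that you would pass explicitly as `warmup_steps`.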
Tips
Fine-tuning tips:
- Start small: Use DistilBERT before BERT-large
- LR matters: 2e-5 is a good starting point
- Don't overtrain: 2-4 epochs usually enough
- Validate often: Watch for overfitting
- Use mixed precision: Faster training
- Gradient accumulation: For larger effective batch size
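The gradient-accumulation tip above works by summing gradients over several micro-batches and applying one optimizer step on their mean, which matches training on one larger batch. A framework-free numeric sketch of the idea (plain SGD on a single scalar weight, names are illustrative):

```python
def sgd_with_accumulation(grads, lr, accum_steps):
    """Apply one SGD update per `accum_steps` micro-batch gradients,
    using their mean -- equivalent to one step on a batch that size."""
    w = 0.0
    buffer, count = 0.0, 0
    for g in grads:
        buffer += g
        count += 1
        if count == accum_steps:
            w -= lr * (buffer / accum_steps)  # mean gradient, one optimizer step
            buffer, count = 0.0, 0
    return w

# 8 micro-batch gradients, accumulate 4 -> only 2 optimizer steps are taken
print(sgd_with_accumulation([1.0] * 8, lr=0.1, accum_steps=4))
```

In Hugging Face this is just `gradient_accumulation_steps` in `TrainingArguments`; memory usage stays at the micro-batch size while the effective batch grows.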
Common Mistakes
| Mistake | Solution |
|---|---|
| LR too high | Reduce to 1e-5 or lower |
| Too many epochs | Use early stopping |
| Wrong tokenizer | Always use matching tokenizer |
| Truncation issues | Increase max_length or use sliding window |
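For the truncation row above, the sliding-window idea splits a long token sequence into overlapping chunks instead of dropping the tail (Hugging Face tokenizers expose this via `return_overflowing_tokens` and `stride`). A framework-free sketch of the same idea, with illustrative parameter values:

```python
def sliding_window(tokens, max_length=8, stride=2):
    """Split a token list into windows of up to `max_length` tokens,
    each overlapping the previous window by `stride` tokens."""
    if len(tokens) <= max_length:
        return [tokens]
    step = max_length - stride
    return [tokens[i:i + max_length] for i in range(0, len(tokens) - stride, step)]

for chunk in sliding_window(list(range(12)), max_length=8, stride=2):
    print(chunk)
```

Each chunk is then classified separately and the predictions aggregated (e.g. by averaging logits).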
```python
# Mixed precision training
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,  # Enable mixed precision
    gradient_accumulation_steps=4,  # Effective batch = 16 * 4 = 64
    warmup_ratio=0.1,  # 10% warmup
    # ...
)
```
Checkpoint
Have you grasped the best practices?
🎯 Summary
Transfer Learning with Transformers
| Approach | When to use |
|---|---|
| Pipeline | Quick prototyping, standard tasks |
| Feature Extraction | Very little data |
| Fine-tuning | Best performance |
| Gradual Unfreezing | Medium-sized data, prevents overfitting |
Key Libraries
```python
from transformers import (
    AutoTokenizer,       # Tokenization
    AutoModel,           # Base model
    AutoModelForXXX,     # Task-specific head (XXX = task name)
    Trainer,             # Training loop
    TrainingArguments,   # Config
    pipeline             # Easy inference
)
```
Model Selection
| Size | Model | Use case |
|---|---|---|
| Small | DistilBERT, MiniLM | Fast, resource-limited settings |
| Medium | BERT-base, RoBERTa | Standard |
| Large | BERT-large, DeBERTa | Best accuracy |
Next Lesson
Pretrained Models for Vision (if applicable) or Optimization & Deployment:
- Model compression (pruning, quantization)
- Knowledge distillation
- Deployment strategies
🎉 Transfer Learning module complete! You are ready to apply pretrained models in production.
