
Transfer Learning with Pretrained Models

Using pretrained models effectively: fine-tuning BERT, GPT, and common strategies


🎯 Lesson Objectives

5 min

After this lesson, you will:

✅ Understand how BERT and GPT work

✅ Know how to fine-tune pretrained models

✅ Use Hugging Face Transformers

✅ Apply them to real NLP tasks

Recap of the previous lesson

We covered Attention and the Transformer. Today we learn how to use them in practice!


🎯 Transfer Learning Recap


Why Transfer Learning?

Transfer learning = reusing knowledge learned on one task to solve a new one.

Pretrained models (BERT, GPT, ResNet) have already learned:

  • Language understanding (NLP)
  • Visual patterns (Vision)
  • Common sense knowledge

→ No need to train from scratch, which saves both time and compute.

Benefits

| Benefit | Details |
| --- | --- |
| Less data | 100-1000 samples can be enough |
| Faster | Fine-tune in hours instead of weeks |
| Better results | Pretrained features are already strong |
| Cheaper | No GPU clusters needed |

Checkpoint

Do you understand the benefits of transfer learning?


🤗 Hugging Face Transformers


Introduction

Hugging Face Transformers is the most popular library for working with pretrained Transformer models.

  • 200,000+ models
  • Easy to use
  • Supports PyTorch, TensorFlow, and JAX

Installation

```python
# Install:
# pip install transformers datasets

from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TFAutoModel,  # TensorFlow variant
    pipeline,
)
```

Pipeline (Easiest way)

```python
from transformers import pipeline

# Sentiment analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this movie! It's fantastic.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(text)

# Question answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France.",
)
print(result)
# {'answer': 'Paris', 'score': 0.98, ...}

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Elon Musk is the CEO of Tesla in California")
print(result)
```

Available Pipelines

| Pipeline | Task |
| --- | --- |
| sentiment-analysis | Text classification |
| text-generation | Generate text |
| question-answering | Extract answers |
| ner | Named entity recognition |
| fill-mask | Fill the [MASK] token |
| summarization | Summarize text |
| translation | Translate |
| zero-shot-classification | Classify without training |

Checkpoint

Do you know how to use a Hugging Face pipeline?


🔧 Fine-tuning BERT


Loading the Model and Tokenizer

```python
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    AutoModelForSequenceClassification,
)

# Model name
model_name = "bert-base-uncased"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for classification (TensorFlow)
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,  # binary classification
)

# Or PyTorch:
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name,
#     num_labels=2,
# )
```

Tokenization

```python
# Single sentence
text = "This movie is great!"
tokens = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="tf",  # or "pt" for PyTorch
)

print("Input IDs:", tokens["input_ids"].shape)
print("Attention Mask:", tokens["attention_mask"].shape)
print("Decoded:", tokenizer.decode(tokens["input_ids"][0]))

# Batch tokenization
texts = ["I love this!", "This is terrible.", "Pretty good movie."]
batch_tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf",
)
```

Expected output:

```
Input IDs: (1, 128)
Attention Mask: (1, 128)
Decoded: [CLS] this movie is great! [SEP] [PAD] [PAD] ...
```
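Under the hood, padding and the attention mask are simple: real tokens get mask 1 and padding positions get mask 0, so attention layers can ignore the padding. A minimal pure-Python sketch of the idea (illustrative names, not the Hugging Face implementation):

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Pad token-id sequences to max_length and build attention masks."""
    input_ids, attention_masks = [], []
    for seq in sequences:
        seq = seq[:max_length]                    # truncate long sequences
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)  # pad short sequences
        attention_masks.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_masks

# Two BERT-style sequences: [CLS]=101, [SEP]=102, [PAD]=0
ids, masks = pad_and_mask([[101, 2023, 3185, 102], [101, 102]], max_length=6)
# ids   → [[101, 2023, 3185, 102, 0, 0], [101, 102, 0, 0, 0, 0]]
# masks → [[1, 1, 1, 1, 0, 0],           [1, 1, 0, 0, 0, 0]]
```

This is exactly what `tokenizer(..., padding=True, truncation=True)` returns as `input_ids` and `attention_mask`.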

Fine-tuning with Keras

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Prepare data
train_texts = ["Great movie!", "Terrible film", "I loved it", "Waste of time"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Tokenize
train_encodings = tokenizer(
    train_texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf",
)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels,
)).batch(2)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Compile (HF TF models output logits, hence from_logits=True)
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'],
)

# Train
model.fit(train_dataset, epochs=3)
```

Checkpoint

Can you fine-tune BERT?


📊 Fine-tuning Strategies


Strategies

| Strategy | Description | When to use |
| --- | --- | --- |
| Feature Extraction | Freeze BERT, train only the classifier | Very little data (<100 samples) |
| Full Fine-tuning | Train the entire model | Standard (>1000 samples) |
| Gradual Unfreezing | Unfreeze layers gradually | Medium-sized data |
| Discriminative LR | Different LR for different layers | Best performance |
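The "Discriminative LR" row is usually implemented as layer-wise learning-rate decay: the encoder layer closest to the task head trains at the base LR, and each layer below is scaled down by a constant factor. A plain-Python sketch of the schedule (the 0.9 decay factor is an illustrative assumption, not a fixed rule):

```python
def layerwise_lrs(n_layers=12, base_lr=2e-5, decay=0.9):
    """Learning rate per encoder layer, lowest layer first.

    Layer i gets base_lr * decay**(n_layers - 1 - i): the top layer
    changes fastest, the general-purpose bottom layers change slowest.
    """
    return [base_lr * decay ** (n_layers - 1 - i) for i in range(n_layers)]

lrs = layerwise_lrs()  # 12 values for BERT-base
# lrs[0] (bottom layer) ≈ 6.3e-6 < ... < lrs[-1] (top layer) == 2e-5
```

These per-layer rates would then be assigned to the corresponding parameter groups, as outlined in the "Discriminative Learning Rates" snippet below.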

Feature Extraction

```python
import tensorflow as tf
from transformers import TFAutoModel
from tensorflow.keras import layers, Model

# Load BERT without a classification head
base_model = TFAutoModel.from_pretrained("bert-base-uncased")

# Freeze BERT
base_model.trainable = False

# Build a classifier on top
inputs = {
    "input_ids": layers.Input(shape=(128,), dtype=tf.int32),
    "attention_mask": layers.Input(shape=(128,), dtype=tf.int32),
}

bert_output = base_model(inputs)
pooled = bert_output.last_hidden_state[:, 0, :]  # [CLS] token

x = layers.Dense(256, activation='relu')(pooled)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation='softmax')(x)

model = Model(inputs, outputs)

# Only the classifier head is trainable
print(f"Trainable params: {sum(tf.size(w).numpy() for w in model.trainable_weights):,}")
```

Discriminative Learning Rates

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

def get_optimizer_with_discriminative_lr(model, base_lr=2e-5):
    """Lower LR for lower layers, higher LR for top layers."""
    # Group parameters
    embeddings = []
    encoder_layers = [[] for _ in range(12)]  # BERT-base has 12 layers
    classifier = []

    for var in model.trainable_variables:
        name = var.name.lower()
        if 'embedding' in name:
            embeddings.append(var)
        elif 'classifier' in name or 'pooler' in name:
            classifier.append(var)
        else:
            # Find the encoder layer number
            for i in range(12):
                if f'layer_._{i}' in name or f'layer/{i}' in name:
                    encoder_layers[i].append(var)
                    break

    # LR multipliers: embeddings lowest, classifier highest
    lr_multipliers = [0.1] + [0.1 + 0.9 * i / 11 for i in range(12)] + [1.0]

    # Simplified: a full implementation would apply one multiplier per
    # parameter group (e.g. one optimizer per group); here we fall back
    # to a single optimizer with base_lr.
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_lr)
    return optimizer

# Usage
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = get_optimizer_with_discriminative_lr(model)
```

Gradual Unfreezing

```python
def train_with_gradual_unfreezing(model, train_data, epochs_per_stage=2):
    """Unfreeze layers gradually, from the top down."""
    # Stage 1: freeze all of BERT, train only the classifier head
    model.bert.trainable = False
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    print("Stage 1: training classifier only")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 2: unfreeze the top 4 encoder layers.
    # Note: the parent layer must be trainable, otherwise Keras treats
    # all of its sublayers as frozen regardless of their own flag.
    model.bert.trainable = True
    model.bert.embeddings.trainable = False
    for layer in model.bert.encoder.layer[:-4]:
        layer.trainable = False
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    print("Stage 2: training top 4 layers + classifier")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 3: full fine-tuning with a small LR
    model.bert.embeddings.trainable = True
    for layer in model.bert.encoder.layer:
        layer.trainable = True
    model.compile(
        optimizer=tf.keras.optimizers.Adam(2e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy'],
    )
    print("Stage 3: full fine-tuning")
    model.fit(train_data, epochs=epochs_per_stage)
```

Checkpoint

Do you understand the fine-tuning strategies?


💻 Complete Fine-tuning Pipeline


Full pipeline with the Hugging Face Trainer

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1. Load dataset (subset for demo)
dataset = load_dataset("imdb")
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(200))

# 2. Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
)

# 3. Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256,
    )

train_tokenized = train_data.map(tokenize_function, batched=True)
test_tokenized = test_data.map(tokenize_function, batched=True)

# 4. Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }

# 5. Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Evaluate
results = trainer.evaluate()
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1: {results['eval_f1']:.4f}")

# 9. Save model
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
```

Inference with the saved model

```python
from transformers import pipeline

# Load the saved model
classifier = pipeline(
    "sentiment-analysis",
    model="./my_model",
    tokenizer="./my_model",
)

# Predict
texts = [
    "This movie was absolutely fantastic!",
    "I wasted two hours of my life on this garbage.",
    "It was okay, nothing special.",
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text[:50]}...")
    print(f"  → {result['label']}: {result['score']:.4f}")
```

Checkpoint

Can you build a full fine-tuning pipeline?


🎯 Best Practices


Hyperparameters

| Parameter | Recommended value |
| --- | --- |
| Learning rate | 1e-5 to 5e-5 |
| Batch size | 16 or 32 |
| Epochs | 2-4 |
| Max length | Task-dependent (128, 256, 512) |
| Warmup | 6-10% of total steps |
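The warmup row means the learning rate ramps up linearly over the first ~6-10% of steps, then decays; with the Trainer this is controlled by `warmup_ratio` and the (default) `linear` schedule. A pure-Python sketch of that schedule, with illustrative names:

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))  # decay

# With total_steps=1000 and warmup_ratio=0.1:
# step 0   → 0.0, step 100 → 2e-5 (peak), step 1000 → 0.0
```

Warmup matters for Transformers because a full-size LR on randomly initialized task heads (and Adam's cold second-moment estimates) can destabilize the pretrained weights in the first steps.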

Tips

Fine-tuning tips:

  1. Start small: Use DistilBERT before BERT-large
  2. LR matters: 2e-5 is a good starting point
  3. Don't overtrain: 2-4 epochs usually enough
  4. Validate often: Watch for overfitting
  5. Use mixed precision: Faster training
  6. Gradient accumulation: For larger effective batch size

Common Mistakes

| Mistake | Solution |
| --- | --- |
| LR too high | Reduce to 1e-5 or lower |
| Too many epochs | Use early stopping |
| Wrong tokenizer | Always use the matching tokenizer |
| Truncation issues | Increase max_length or use a sliding window |
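"Use early stopping" means: stop training when the validation metric has not improved for `patience` evaluations. With the Trainer, `transformers.EarlyStoppingCallback` (combined with `load_best_model_at_end=True`) does this for you; the core logic is just a counter, sketched here in plain Python:

```python
class EarlyStopping:
    """Stop when a metric (higher is better) fails to improve."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def should_stop(self, metric):
        if metric > self.best:
            self.best = metric      # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1     # no improvement
        return self.bad_evals >= self.patience

stopper = EarlyStopping(patience=2)
history = [0.70, 0.75, 0.74, 0.73]  # validation F1 per epoch
stops = [stopper.should_stop(f1) for f1 in history]
# stops → [False, False, False, True]: two epochs without improvement
```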
```python
# Mixed precision + gradient accumulation
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,                      # enable mixed precision
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,  # effective batch = 16 * 4 = 64
    warmup_ratio=0.1,               # 10% warmup
    # ...
)
```
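For the "sliding window" fix from the table above: split a long token sequence into overlapping chunks so nothing is lost to truncation. Hugging Face tokenizers support this via `return_overflowing_tokens=True` with a `stride`; the chunking itself, sketched with illustrative names:

```python
def sliding_windows(token_ids, max_length=128, overlap=64):
    """Split token_ids into chunks of max_length overlapping by `overlap` tokens."""
    step = max_length - overlap
    windows = []
    # max(1, ...) guarantees at least one window for short inputs
    for start in range(0, max(1, len(token_ids) - overlap), step):
        windows.append(token_ids[start:start + max_length])
    return windows

chunks = sliding_windows(list(range(300)), max_length=128, overlap=64)
# 4 chunks; consecutive chunks share 64 tokens, and no token is dropped.
```

At inference time the per-chunk predictions are then aggregated (e.g. max or mean over chunk scores).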

Checkpoint

Do you know the best practices?


🎯 Summary


Transfer Learning with Transformers

| Approach | When to use |
| --- | --- |
| Pipeline | Quick prototyping, standard tasks |
| Feature extraction | Very little data |
| Fine-tuning | Best performance |
| Gradual unfreezing | Medium data, helps prevent overfitting |

Key Libraries

```python
from transformers import (
    AutoTokenizer,        # tokenization
    AutoModel,            # base model
    AutoModelForXXX,      # task-specific head (XXX is a placeholder,
                          # e.g. AutoModelForSequenceClassification)
    Trainer,              # training loop
    TrainingArguments,    # config
    pipeline,             # easy inference
)
```

Model Selection

| Size | Model | Use case |
| --- | --- | --- |
| Small | DistilBERT, MiniLM | Fast, resource-limited |
| Medium | BERT-base, RoBERTa | Standard |
| Large | BERT-large, DeBERTa | Best accuracy |

Next lesson

Pretrained models for vision (if available), or Optimization & Deployment:

  • Model compression (pruning, quantization)
  • Knowledge distillation
  • Deployment strategies

🎉 Transfer Learning module complete! You are ready to apply pretrained models in production.