🎯 Lesson Objectives
After this lesson, you will:
✅ Understand how BERT and GPT work
✅ Know how to fine-tune pretrained models
✅ Use Hugging Face Transformers
✅ Apply them to real-world NLP tasks
Recap of the Previous Lesson
We covered Attention and the Transformer architecture. Today we learn how to use them in practice!
🎯 Transfer Learning Recap
Why Transfer Learning?
Transfer Learning = reusing knowledge learned on one task to solve a new one.
Pretrained models (BERT, GPT, ResNet) have already learned:
- Language understanding (NLP)
- Visual patterns (Vision)
- Common-sense knowledge
→ No need to train from scratch, saving both time and compute.
Benefits
| Benefit | Details |
|---|---|
| Less data | 100-1000 samples can be enough |
| Faster | Fine-tune in hours instead of weeks |
| Better | Pretrained features are already strong |
| Cheaper | No GPU clusters required |
Checkpoint
Do you understand the benefits of Transfer Learning?
🤗 Hugging Face Transformers
Introduction
Hugging Face is the most popular library for using pretrained Transformer models.
- 200,000+ models
- Easy to use
- Supports PyTorch, TensorFlow, and JAX
Installation
```python
# Install
# pip install transformers datasets

from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TFAutoModel,  # For TensorFlow
    pipeline
)
```
Pipeline (Easiest way)
```python
from transformers import pipeline

# Sentiment Analysis
sentiment = pipeline("sentiment-analysis")
result = sentiment("I love this movie! It's fantastic.")
print(result)
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Text Generation
generator = pipeline("text-generation", model="gpt2")
text = generator("Once upon a time", max_length=50, num_return_sequences=1)
print(text)

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="What is the capital of France?",
    context="Paris is the capital and largest city of France."
)
print(result)
# {'answer': 'Paris', 'score': 0.98, ...}

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Elon Musk is the CEO of Tesla in California")
print(result)
```
Available Pipelines
| Pipeline | Task |
|---|---|
| sentiment-analysis | Text classification |
| text-generation | Generate text |
| question-answering | Extract answers |
| ner | Named entity recognition |
| fill-mask | Fill the [MASK] token |
| summarization | Summarize text |
| translation | Translate text |
| zero-shot-classification | Classify without training |
Checkpoint
Do you know how to use a Hugging Face pipeline?
🔧 Fine-tuning BERT
Load Model và Tokenizer
```python
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
    AutoModelForSequenceClassification
)

# Model name
model_name = "bert-base-uncased"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load model for classification (TensorFlow)
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2  # Binary classification
)

# Or PyTorch
# model = AutoModelForSequenceClassification.from_pretrained(
#     model_name,
#     num_labels=2
# )
```
Tokenization
```python
# Single sentence
text = "This movie is great!"
tokens = tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="tf"  # or "pt" for PyTorch
)

print("Input IDs:", tokens["input_ids"].shape)
print("Attention Mask:", tokens["attention_mask"].shape)
print("Decoded:", tokenizer.decode(tokens["input_ids"][0]))

# Batch tokenization
texts = ["I love this!", "This is terrible.", "Pretty good movie."]
batch_tokens = tokenizer(
    texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf"
)
```
```
Input IDs: (1, 128)
Attention Mask: (1, 128)
Decoded: [CLS] this movie is great! [SEP] [PAD] [PAD] ...
```
Fine-tuning with Keras
```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Prepare data
train_texts = ["Great movie!", "Terrible film", "I loved it", "Waste of time"]
train_labels = [1, 0, 1, 0]  # 1=positive, 0=negative

# Tokenize
train_encodings = tokenizer(
    train_texts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="tf"
)

# Create dataset
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
)).batch(2)

# Load model
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2
)

# Compile
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy']
)

# Train
model.fit(train_dataset, epochs=3)
```
Checkpoint
Do you know how to fine-tune BERT?
📊 Fine-tuning Strategies
Strategies
| Strategy | Description | When to use |
|---|---|---|
| Feature Extraction | Freeze BERT, train only the classifier | Very little data (fewer than 100 samples) |
| Full Fine-tuning | Train the entire model | Standard (more than 1000 samples) |
| Gradual Unfreezing | Unfreeze layers gradually | Medium-sized data |
| Discriminative LR | Different LR per layer | Best performance |
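A common way to realize the Discriminative LR row above is layer-wise learning-rate decay: the top layer gets the base LR, and each layer below it gets the base LR scaled by a decay factor per level of depth. A minimal framework-free sketch (the decay value 0.95 is an assumption for illustration, not from the lesson):

```python
def layerwise_lrs(base_lr: float, num_layers: int, decay: float = 0.95) -> list:
    """Per-layer learning rates: index 0 is the bottom (first) encoder layer.
    The top layer gets base_lr; each layer below is scaled down by `decay`."""
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# BERT-base has 12 encoder layers
lrs = layerwise_lrs(base_lr=2e-5, num_layers=12)
print(f"bottom layer LR: {lrs[0]:.2e}, top layer LR: {lrs[-1]:.2e}")
```

The intuition: lower layers encode general language features that should change slowly, while upper layers are more task-specific and can move faster.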
Feature Extraction
```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from transformers import TFAutoModel

# Load BERT without a classification head
base_model = TFAutoModel.from_pretrained("bert-base-uncased")

# Freeze BERT
base_model.trainable = False

# Build a classifier on top
inputs = {
    "input_ids": layers.Input(shape=(128,), dtype=tf.int32),
    "attention_mask": layers.Input(shape=(128,), dtype=tf.int32)
}

bert_output = base_model(inputs)
pooled = bert_output.last_hidden_state[:, 0, :]  # [CLS] token

x = layers.Dense(256, activation='relu')(pooled)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(2, activation='softmax')(x)

model = Model(inputs, outputs)

# Only the classifier head is trainable
print(f"Trainable params: {sum(tf.size(w).numpy() for w in model.trainable_weights):,}")
```
Discriminative Learning Rates
```python
from transformers import TFAutoModelForSequenceClassification
import tensorflow as tf

# Different LR for different layers
def get_optimizer_with_discriminative_lr(model, base_lr=2e-5):
    """
    Lower LR for lower layers, higher LR for top layers
    """
    # Group parameters
    embeddings = []
    encoder_layers = [[] for _ in range(12)]  # BERT-base has 12 layers
    classifier = []

    for var in model.trainable_variables:
        name = var.name.lower()

        if 'embedding' in name:
            embeddings.append(var)
        elif 'classifier' in name or 'pooler' in name:
            classifier.append(var)
        else:
            # Find layer number
            for i in range(12):
                if f'layer_._{i}' in name or f'layer/{i}' in name:
                    encoder_layers[i].append(var)
                    break

    # LR multipliers: embeddings, then layers 0..11, then classifier
    lr_multipliers = [0.1] + [0.1 + 0.9 * i / 11 for i in range(12)] + [1.0]

    # Simplified: use a single optimizer with base_lr
    optimizer = tf.keras.optimizers.Adam(learning_rate=base_lr)

    return optimizer

# Usage
model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = get_optimizer_with_discriminative_lr(model)
```
Gradual Unfreezing
```python
import tensorflow as tf

def train_with_gradual_unfreezing(model, train_data, epochs_per_stage=2):
    """
    Unfreeze layers gradually from top to bottom
    """
    # Stage 1: Only the classifier
    model.bert.trainable = False

    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-3),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 1: Training classifier only")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 2: Unfreeze the top 4 encoder layers
    model.bert.trainable = True
    model.bert.embeddings.trainable = False
    for layer in model.bert.encoder.layer[:-4]:
        layer.trainable = False

    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 2: Training top 4 layers + classifier")
    model.fit(train_data, epochs=epochs_per_stage)

    # Stage 3: Full fine-tuning
    model.bert.trainable = True

    model.compile(
        optimizer=tf.keras.optimizers.Adam(2e-5),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )

    print("Stage 3: Full fine-tuning")
    model.fit(train_data, epochs=epochs_per_stage)
```
Checkpoint
Do you understand the fine-tuning strategies?
💻 Complete Fine-tuning Pipeline
Full Pipeline với Hugging Face Trainer
```python
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 1. Load dataset
dataset = load_dataset("imdb")
# Subset for demo
train_data = dataset["train"].shuffle(seed=42).select(range(1000))
test_data = dataset["test"].shuffle(seed=42).select(range(200))

# 2. Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2
)

# 3. Tokenize
def tokenize_function(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=256
    )

train_tokenized = train_data.map(tokenize_function, batched=True)
test_tokenized = test_data.map(tokenize_function, batched=True)

# 4. Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions)
    }

# 5. Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)

# 6. Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics,
)

# 7. Train
trainer.train()

# 8. Evaluate
results = trainer.evaluate()
print(f"Test Accuracy: {results['eval_accuracy']:.4f}")
print(f"Test F1: {results['eval_f1']:.4f}")

# 9. Save model
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")
```
Inference with the saved model
```python
from transformers import pipeline

# Load the saved model
classifier = pipeline(
    "sentiment-analysis",
    model="./my_model",
    tokenizer="./my_model"
)

# Predict
texts = [
    "This movie was absolutely fantastic!",
    "I wasted two hours of my life on this garbage.",
    "It was okay, nothing special."
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"{text[:50]}...")
    print(f"  → {result['label']}: {result['score']:.4f}")
```
Checkpoint
Can you build a full fine-tuning pipeline?
🎯 Best Practices
Hyperparameters
| Parameter | Recommended Value |
|---|---|
| Learning Rate | 1e-5 to 5e-5 |
| Batch Size | 16, 32 |
| Epochs | 2-4 |
| Max Length | Task dependent (128, 256, 512) |
| Warmup | 6-10% of total steps |
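The warmup row above is a fraction of total optimizer steps, so the actual step count depends on dataset size, batch size, and epochs. A small helper sketching that arithmetic (the function name and 10% default are illustrative assumptions):

```python
def warmup_steps(num_samples: int, batch_size: int, epochs: int,
                 warmup_ratio: float = 0.1) -> int:
    """Number of LR-warmup steps as a fraction of total optimizer steps."""
    steps_per_epoch = -(-num_samples // batch_size)  # ceiling division
    total_steps = steps_per_epoch * epochs
    return int(total_steps * warmup_ratio)

# e.g. 1000 training samples, batch size 16, 3 epochs, 10% warmup
print(warmup_steps(1000, 16, 3))  # 63 steps/epoch -> 189 total -> 18
```

This is the same quantity that `TrainingArguments(warmup_ratio=...)` computes internally, or that you would pass explicitly as `warmup_steps`.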
Tips
Fine-tuning tips:
- Start small: Use DistilBERT before BERT-large
- LR matters: 2e-5 is a good starting point
- Don't overtrain: 2-4 epochs usually enough
- Validate often: Watch for overfitting
- Use mixed precision: Faster training
- Gradient accumulation: For larger effective batch size
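The gradient-accumulation tip above works by summing gradients over several micro-batches and applying one optimizer step on their mean, which matches training on one larger batch. A framework-free numeric sketch of the idea (plain SGD on a single scalar weight, names are illustrative):

```python
def sgd_with_accumulation(grads, lr, accum_steps):
    """Apply one SGD update per `accum_steps` micro-batch gradients,
    using their mean -- equivalent to one step on a batch that size."""
    w = 0.0
    buffer, count = 0.0, 0
    for g in grads:
        buffer += g
        count += 1
        if count == accum_steps:
            w -= lr * (buffer / accum_steps)  # mean gradient, one optimizer step
            buffer, count = 0.0, 0
    return w

# 8 micro-batch gradients, accumulate 4 -> only 2 optimizer steps are taken
print(sgd_with_accumulation([1.0] * 8, lr=0.1, accum_steps=4))
```

In Hugging Face this is just `gradient_accumulation_steps` in `TrainingArguments`; memory usage stays at the micro-batch size while the effective batch grows.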
Common Mistakes
| Mistake | Solution |
|---|---|
| LR too high | Reduce to 1e-5 or lower |
| Too many epochs | Use early stopping |
| Wrong tokenizer | Always use matching tokenizer |
| Truncation issues | Increase max_length or use sliding window |
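For the truncation row above, the sliding-window idea splits a long token sequence into overlapping chunks instead of dropping the tail (Hugging Face tokenizers expose this via `return_overflowing_tokens` and `stride`). A framework-free sketch of the same idea, with illustrative parameter values:

```python
def sliding_window(tokens, max_length=8, stride=2):
    """Split a token list into windows of up to `max_length` tokens,
    each overlapping the previous window by `stride` tokens."""
    if len(tokens) <= max_length:
        return [tokens]
    step = max_length - stride
    return [tokens[i:i + max_length] for i in range(0, len(tokens) - stride, step)]

for chunk in sliding_window(list(range(12)), max_length=8, stride=2):
    print(chunk)
```

Each chunk is then classified separately and the predictions aggregated (e.g. by averaging logits).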
```python
# Mixed precision training
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,  # Enable mixed precision
    gradient_accumulation_steps=4,  # Effective batch = 16 * 4 = 64
    warmup_ratio=0.1,  # 10% warmup
    # ...
)
```
Checkpoint
Have you grasped the best practices?
🎯 Summary
Transfer Learning with Transformers
| Approach | When to use |
|---|---|
| Pipeline | Quick prototyping, standard tasks |
| Feature Extraction | Very little data |
| Fine-tuning | Best performance |
| Gradual Unfreezing | Medium-sized data, prevents overfitting |
Key Libraries
```python
from transformers import (
    AutoTokenizer,       # Tokenization
    AutoModel,           # Base model
    AutoModelForXXX,     # Task-specific head (XXX = task name)
    Trainer,             # Training loop
    TrainingArguments,   # Config
    pipeline             # Easy inference
)
```
Model Selection
| Size | Model | Use case |
|---|---|---|
| Small | DistilBERT, MiniLM | Fast, resource-limited settings |
| Medium | BERT-base, RoBERTa | Standard |
| Large | BERT-large, DeBERTa | Best accuracy |
Next Lesson
Pretrained Models for Vision (if applicable) or Optimization & Deployment:
- Model compression (pruning, quantization)
- Knowledge distillation
- Deployment strategies
🎉 Transfer Learning module complete! You are ready to apply pretrained models in production.
