🎯 Lesson Objectives
After this lesson, you will:
✅ Understand the optimizers SGD, Adam, and AdamW
✅ Know learning rate scheduling
✅ Understand gradient clipping and other techniques
✅ Be able to pick the right optimizer for each problem
Knowledge recap
This lesson brings together the optimization techniques that make models train faster and better!
⚡ The Importance of Optimization
Training Deep Learning = Optimization Problem
Goal: find parameters θ that minimize Loss(θ).
Deep learning optimization is hard because:
- Non-convex: many local minima
- High-dimensional: Millions of parameters
- Noisy gradients: Mini-batch approximation
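The noisy-gradients point can be made concrete with a toy linear-regression problem (synthetic data, made up for this sketch): a gradient estimated on a mini-batch scatters around the full-batch gradient.

```python
import numpy as np

# Toy example: the mini-batch gradient is a noisy estimate of the
# full-batch gradient. Data and constants are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
y = 3.0 * X + rng.normal(scale=0.1, size=10_000)

def grad_w(w, xs, ys):
    # d/dw of mean((w*x - y)^2)
    return np.mean(2 * xs * (w * xs - ys))

full = grad_w(0.0, X, y)  # full-batch gradient, close to -6 here
batch_grads = [grad_w(0.0, X[i:i + 32], y[i:i + 32])
               for i in range(0, 3200, 32)]
print(np.std(batch_grads))  # nonzero spread: mini-batch gradient noise
```

SGD works despite this noise because the mini-batch gradient is an unbiased estimate; the noise only makes individual steps jittery.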
Factors that influence training
| Factor | Impact |
|---|---|
| Optimizer | Speed and direction of updates |
| Learning Rate | Step size of each update |
| Batch Size | Noise level of the gradient |
| Regularization | Prevent overfitting |
| Initialization | Starting point |
Checkpoint
Do you understand why optimization matters?
🔧 Optimizers
SGD (Stochastic Gradient Descent)
```python
import tensorflow as tf

# Basic SGD
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# SGD with Momentum
sgd_momentum = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9  # Accumulate past gradients
)

# SGD with Nesterov Momentum (look-ahead)
sgd_nesterov = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)
```

Momentum formula:
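With learning rate $\eta$ and momentum coefficient $\gamma$ (the `momentum=0.9` argument), the standard update is:

$$v_t = \gamma\,v_{t-1} + \eta\,\nabla_\theta L(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$$

Nesterov momentum evaluates the gradient at the look-ahead point $\theta_{t-1} - \gamma\,v_{t-1}$ instead, which often gives slightly faster convergence.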
Adam (Adaptive Moment Estimation)
```python
# Adam = Momentum + RMSprop
adam = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # Default
    beta_1=0.9,           # Momentum term
    beta_2=0.999,         # RMSprop term
    epsilon=1e-7          # Numerical stability
)

# AdamW (Adam with decoupled weight decay)
adamw = tf.keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=0.01  # Decoupled weight decay (not plain L2 added to the loss)
)
```

Adam formulas:
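With gradient $g_t$, Adam keeps exponential moving averages of the gradient and its square, bias-corrects them, and scales the step per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$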
Optimizer Comparison
```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential
import numpy as np
import matplotlib.pyplot as plt

def create_model():
    return Sequential([
        layers.Dense(64, activation='relu', input_shape=(10,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])

# Create synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
y = np.sum(X[:, :3], axis=1, keepdims=True) + np.random.randn(1000, 1) * 0.1

# Test different optimizers
optimizers = {
    'SGD': tf.keras.optimizers.SGD(0.01),
    'SGD+Momentum': tf.keras.optimizers.SGD(0.01, momentum=0.9),
    'Adam': tf.keras.optimizers.Adam(0.001),
    'AdamW': tf.keras.optimizers.AdamW(0.001, weight_decay=0.01),
    'RMSprop': tf.keras.optimizers.RMSprop(0.001),
}

histories = {}
for name, opt in optimizers.items():
    model = create_model()
    model.compile(optimizer=opt, loss='mse')
    history = model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)
    histories[name] = history.history['val_loss']

# Plot comparison
plt.figure(figsize=(10, 5))
for name, losses in histories.items():
    plt.plot(losses, label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Optimizer Comparison')
plt.legend()
plt.yscale('log')
plt.show()
```

When to use which?
| Optimizer | Best for |
|---|---|
| Adam | Default choice; works well in most cases |
| AdamW | Transformers, fine-tuning |
| SGD+Momentum | Computer Vision, final fine-tuning |
| RMSprop | RNNs, non-stationary problems |
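The objectives also mention gradient clipping. In Keras it is just an optimizer argument (`clipnorm`, `clipvalue`, or `global_clipnorm` on any optimizer); the clip-by-global-norm mechanic behind it can be sketched in NumPy (the helper name below is ours, for illustration, not a TF API):

```python
import numpy as np

# Sketch of clip-by-global-norm (what Keras's `global_clipnorm`
# optimizer argument does under the hood).
def clip_by_global_norm(grads, max_norm):
    # Global L2 norm across all gradient tensors
    gnorm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale only when the norm exceeds the threshold
    scale = min(1.0, max_norm / (gnorm + 1e-12))
    return [g * scale for g in grads], gnorm

grads = [np.array([3.0, 4.0]), np.array([0.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # 5.0 — after clipping, the global norm is 1.0
```

In practice you would simply pass `global_clipnorm=1.0` (or `clipnorm=`, `clipvalue=`) to e.g. `tf.keras.optimizers.Adam`; clipping is especially useful against exploding gradients in RNNs and Transformers.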
Checkpoint
Do you understand the optimizers?
📈 Learning Rate Scheduling
Why schedule the learning rate?
The learning rate is the most important hyperparameter!
- Too high: training diverges, never converges
- Too low: slow convergence, risk of getting stuck in poor local minima
- Scheduling: gradually lower the LR for better convergence
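These regimes can be seen on a toy 1-D quadratic loss (a sketch with made-up constants; real networks need much smaller LRs):

```python
# Gradient descent on the toy loss L(θ) = θ², whose gradient is 2θ.
def run_gd(lr, steps=50, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta  # θ ← θ − lr · dL/dθ
    return theta

print(abs(run_gd(0.01)))  # too low: still far from the minimum at 0
print(abs(run_gd(0.4)))   # well chosen: essentially 0
print(abs(run_gd(1.1)))   # too high: |θ| grows every step — divergence
```

Scheduling combines the best of both: large early steps for speed, small late steps for a clean final convergence.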
Common Schedules
```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

total_steps = 1000
initial_lr = 0.1

# 1. Step Decay
step_decay = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[300, 600, 800],
    values=[0.1, 0.01, 0.001, 0.0001]
)

# 2. Exponential Decay
exp_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.96
)

# 3. Cosine Annealing
cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=total_steps,
    alpha=0.0  # Final LR = initial_lr * alpha
)

# 4. Cosine with Warmup
warmup_steps = 100

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr, warmup_steps, total_steps):
        super().__init__()
        self.initial_lr = initial_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.initial_lr * (step / self.warmup_steps)

        decay_steps = self.total_steps - self.warmup_steps
        decay_step = step - self.warmup_steps
        cosine_lr = self.initial_lr * 0.5 * (
            1 + tf.cos(np.pi * decay_step / decay_steps)
        )

        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)

warmup_cosine = WarmupCosineDecay(0.1, warmup_steps, total_steps)

# Visualize
steps = range(total_steps)
schedules = {
    'Step Decay': [step_decay(s).numpy() for s in steps],
    'Exponential': [exp_decay(s).numpy() for s in steps],
    'Cosine': [cosine_decay(s).numpy() for s in steps],
    'Warmup+Cosine': [warmup_cosine(s).numpy() for s in steps],
}

plt.figure(figsize=(12, 4))
for name, lrs in schedules.items():
    plt.plot(lrs, label=name)
plt.xlabel('Step')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.show()
```

Callbacks for LR Scheduling
```python
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# 1. Custom schedule function
def scheduler(epoch, lr):
    if epoch < 10:
        return lr
    return float(lr * tf.math.exp(-0.1))

lr_scheduler = LearningRateScheduler(scheduler)

# 2. Reduce LR on Plateau (automatic)
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,   # New LR = LR * factor
    patience=5,   # Wait 5 epochs without improvement
    min_lr=1e-7,
    verbose=1
)

# Usage — pick one: LearningRateScheduler overrides the LR every epoch,
# so combining it with ReduceLROnPlateau would undo the plateau reductions
model.fit(
    X_train, y_train,
    epochs=100,
    callbacks=[reduce_lr]
)
```

Checkpoint
Do you know the LR scheduling strategies?
🛡️ Regularization Techniques
L1 and L2 Regularization
```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# L2 Regularization (Weight Decay)
model = tf.keras.Sequential([
    layers.Dense(
        256,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)  # λ = 0.01
    ),
    layers.Dense(
        128,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)
    ),
    layers.Dense(10, activation='softmax')
])

# L1 Regularization (Sparsity)
l1_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1(0.001)
)

# L1 + L2 (Elastic Net)
elastic_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01)
)
```

Loss functions:
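With data loss $L_{\text{data}}$ and regularization strength $\lambda$ (the arguments to `l2`/`l1` above), the penalized objectives are:

$$L_{L2} = L_{\text{data}} + \lambda \sum_i w_i^2, \qquad L_{L1} = L_{\text{data}} + \lambda \sum_i |w_i|$$

L2 shrinks weights smoothly toward zero; L1 pushes some weights exactly to zero, which is why it produces sparsity.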
Dropout
```python
import tensorflow as tf
from tensorflow.keras import layers

# Standard Dropout
model = tf.keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),  # Drop 50% of activations during training
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

# Spatial Dropout (for CNNs)
cnn_model = tf.keras.Sequential([
    layers.Conv2D(64, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),  # Drop entire feature maps
    layers.Conv2D(128, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),
])

# Dropout in RNNs (note: recurrent_dropout disables the fast cuDNN kernel)
rnn_model = tf.keras.Sequential([
    layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(10)
])
```

Batch Normalization
```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Conv block with BatchNorm
    layers.Conv2D(64, 3, padding='same'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(2),

    # Dense block with BatchNorm
    layers.Flatten(),
    layers.Dense(256),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.5),

    layers.Dense(10, activation='softmax')
])

# Layer Normalization (used in Transformers instead of BatchNorm).
# MultiHeadAttention takes separate query/value inputs, so it cannot
# live inside a Sequential — use the functional API:
inputs = layers.Input(shape=(None, 64))
attn_out = layers.MultiHeadAttention(num_heads=8, key_dim=64)(inputs, inputs)
x = layers.LayerNormalization()(inputs + attn_out)  # residual + LayerNorm
```

Early Stopping
```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,  # Wait 10 epochs without improvement
    restore_best_weights=True,
    verbose=1
)

checkpoint = ModelCheckpoint(
    'best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop, checkpoint]
)
```

Checkpoint
Do you understand the regularization techniques?
🔀 Data Augmentation
Image Augmentation
```python
import tensorflow as tf
from tensorflow.keras import layers

# Keras preprocessing layers (GPU accelerated)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# Apply in the model
model = tf.keras.Sequential([
    # Augmentation (active only during training)
    data_augmentation,

    # Model
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax')
])

# Or apply in the dataset pipeline
def augment(image, label):
    image = data_augmentation(image, training=True)
    return image, label

augmented_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```

CutMix and MixUp
```python
import tensorflow as tf
import numpy as np

def mixup(images, labels, alpha=0.2):
    """MixUp augmentation"""
    batch_size = tf.shape(images)[0]

    # Sample lambda from a Beta distribution
    lam = np.random.beta(alpha, alpha)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Mix images and labels
    mixed_images = lam * images + (1 - lam) * tf.gather(images, indices)
    mixed_labels = lam * labels + (1 - lam) * tf.gather(labels, indices)

    return mixed_images, mixed_labels


def cutmix(images, labels, alpha=1.0):
    """CutMix augmentation"""
    batch_size = tf.shape(images)[0]
    img_h, img_w = tf.shape(images)[1], tf.shape(images)[2]

    # Sample lambda
    lam = np.random.beta(alpha, alpha)

    # Get a random box whose area fraction is (1 - lambda)
    cut_ratio = tf.sqrt(1.0 - lam)
    cut_h = tf.cast(tf.cast(img_h, tf.float32) * cut_ratio, tf.int32)
    cut_w = tf.cast(tf.cast(img_w, tf.float32) * cut_ratio, tf.int32)

    cx = tf.random.uniform([], 0, img_w, dtype=tf.int32)
    cy = tf.random.uniform([], 0, img_h, dtype=tf.int32)

    x1 = tf.clip_by_value(cx - cut_w // 2, 0, img_w)
    y1 = tf.clip_by_value(cy - cut_h // 2, 0, img_h)
    x2 = tf.clip_by_value(cx + cut_w // 2, 0, img_w)
    y2 = tf.clip_by_value(cy + cut_h // 2, 0, img_h)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Paste the box from the shuffled images using a binary mask
    shuffled = tf.gather(images, indices)
    yy = tf.range(img_h)[:, None]
    xx = tf.range(img_w)[None, :]
    mask = tf.cast((yy >= y1) & (yy < y2) & (xx >= x1) & (xx < x2),
                   images.dtype)
    mask = mask[None, :, :, None]  # broadcast over batch and channels
    mixed_images = images * (1 - mask) + shuffled * mask

    # Adjust lambda to the actual (clipped) box area
    box_area = tf.cast((y2 - y1) * (x2 - x1), tf.float32)
    lam_adj = 1.0 - box_area / tf.cast(img_h * img_w, tf.float32)
    mixed_labels = lam_adj * labels + (1 - lam_adj) * tf.gather(labels, indices)

    return mixed_images, mixed_labels


# Usage in a training loop
for images, labels in train_ds:
    # Apply MixUp 50% of the time
    if np.random.random() > 0.5:
        images, labels = mixup(images, labels)

    # Training step...
```

Text Augmentation
```python
import random

def text_augmentation(text, aug_prob=0.1):
    """Simple text augmentation techniques"""
    words = text.split()

    # Random deletion
    words = [w for w in words if random.random() > aug_prob]

    # Random swap
    if len(words) > 2 and random.random() < aug_prob:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]

    return ' '.join(words)

# Synonym replacement using WordNet (requires nltk)
# from nltk.corpus import wordnet
# def synonym_replacement(text, n=1):
#     words = text.split()
#     for _ in range(n):
#         word = random.choice(words)
#         synonyms = wordnet.synsets(word)
#         if synonyms:
#             synonym = synonyms[0].lemmas()[0].name()
#             words = [synonym if w == word else w for w in words]
#     return ' '.join(words)
```

Checkpoint
Do you know the data augmentation techniques?
🚀 Training Best Practices
Complete Training Setup
```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    TensorBoard
)

def create_optimized_model(input_shape, num_classes):
    """Model with all optimization techniques"""

    # Data augmentation
    augmentation = tf.keras.Sequential([
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ])

    # Build model
    inputs = layers.Input(shape=input_shape)
    x = augmentation(inputs)

    # Conv blocks with BatchNorm
    for filters in [64, 128, 256]:
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D()(x)
        x = layers.SpatialDropout2D(0.2)(x)

    # Head
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs)


def train_with_best_practices(model, train_ds, val_ds):
    """Training with all best practices"""

    # Optimizer with schedule
    initial_lr = 0.001
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_lr, decay_steps=1000
    )

    optimizer = tf.keras.optimizers.AdamW(
        learning_rate=lr_schedule,
        weight_decay=0.01
    )

    # Compile
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Callbacks (ReduceLROnPlateau is omitted here: the LR is already
    # driven by the CosineDecay schedule, which callbacks cannot override)
    callbacks = [
        EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        ),
        ModelCheckpoint(
            'best_model.keras',
            monitor='val_accuracy',
            save_best_only=True
        ),
        TensorBoard(log_dir='./logs')
    ]

    # Train
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=100,
        callbacks=callbacks
    )

    return history


# Usage
model = create_optimized_model((224, 224, 3), 10)
# history = train_with_best_practices(model, train_ds, val_ds)
```

Training Checklist
Before Training:
- Data augmentation configured
- Proper train/val/test split
- Learning rate schedule set
- Regularization (dropout, weight decay)
- Callbacks (early stopping, checkpointing)
- Logging (TensorBoard)
During Training:
- Monitor loss curves
- Check for overfitting
- Validate LR is appropriate
After Training:
- Evaluate on test set
- Save best model
- Document hyperparameters
Checkpoint
Have you mastered the training best practices?
🎯 Optimization Wrap-up
Summary Table
| Technique | Purpose | When to use |
|---|---|---|
| Adam | Adaptive LR per parameter | Default choice |
| AdamW | Adam + weight decay | Transformers |
| Cosine Schedule | Smooth LR decay | Most training |
| Warmup | Avoid early instability | Large models |
| Dropout | Regularization | Dense layers |
| BatchNorm | Stabilize training | CNNs |
| Early Stopping | Prevent overfitting | Always |
| Data Augmentation | Effectively more data | Limited data |
Key Hyperparameters
| Hyperparameter | Start with | Tune |
|---|---|---|
| Learning Rate | 1e-3 (CNN), 1e-5 (fine-tune) | Grid search |
| Batch Size | 32 | Based on memory |
| Dropout | 0.5 | 0.1 - 0.7 |
| Weight Decay | 0.01 | 0.001 - 0.1 |
| Warmup | 10% of steps | 5-15% |
Next Lesson
Deployment & Production:
- Model compression
- Inference optimization
- Deployment options
🎉 Optimization complete! You now know how to optimize training for deep learning models.
