
Optimization Techniques

Training optimization techniques: optimizers, learning rate scheduling, regularization

🎯 Lesson Objectives

Medium · 5 min

After this lesson, you will:

✅ Understand the optimizers: SGD, Adam, AdamW

✅ Know learning rate scheduling

✅ Understand gradient clipping and other techniques

✅ Choose the right optimizer for each problem

Recap lesson

This lesson pulls together the optimization techniques that make models train faster and better!

⚡ Why Optimization Matters

Training Deep Learning = Optimization Problem

Goal: find parameters θ that make Loss(θ) as small as possible.

Deep learning optimization is hard because:

  • Non-convex: many local minima
  • High-dimensional: millions of parameters
  • Noisy gradients: mini-batch approximation
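The "noisy gradients" point can be made concrete: a mini-batch gradient is only an estimate of the full-batch gradient, and the smaller the batch, the noisier the estimate. A minimal NumPy sketch (data and helper names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=10_000)
w = np.zeros(5)  # current parameters

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)  # "true" full-batch gradient

def noise(batch_size, trials=200):
    """Average distance between mini-batch and full-batch gradients."""
    errs = []
    for _ in range(trials):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        errs.append(np.linalg.norm(grad(X[idx], y[idx], w) - full))
    return np.mean(errs)

# Smaller batches -> noisier gradient estimates:
# noise(8) > noise(64) > noise(512)
```

The average error shrinks roughly like 1/√batch_size, which is why batch size controls the gradient's noise level.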

Factors that affect training

| Factor | Impact |
| --- | --- |
| Optimizer | Speed and direction of updates |
| Learning Rate | Step size of each update |
| Batch Size | Noise level of the gradient |
| Regularization | Prevents overfitting |
| Initialization | Starting point |

Checkpoint

Do you understand why optimization matters?

🔧 Optimizers

SGD (Stochastic Gradient Descent)

```python
import tensorflow as tf

# Basic SGD
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# SGD with Momentum
sgd_momentum = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9  # Accumulate past gradients
)

# SGD with Nesterov Momentum (look-ahead)
sgd_nesterov = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)
```

Momentum formula:

$$v_t = \beta v_{t-1} + (1-\beta)\nabla L, \qquad \theta = \theta - \alpha v_t$$

Adam (Adaptive Moment Estimation)

```python
# Adam = Momentum + RMSprop
adam = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # Default
    beta_1=0.9,           # Momentum term
    beta_2=0.999,         # RMSprop term
    epsilon=1e-7          # Numerical stability
)

# AdamW (Adam with decoupled weight decay)
adamw = tf.keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=0.01  # Decoupled weight decay (similar effect to L2, but applied outside the Adam update)
)
```

Adam formulas:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t \quad \text{(1st moment)}$$

$$v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2 \quad \text{(2nd moment)}$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

$$\theta = \theta - \alpha\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$
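The four formulas translate line-for-line into NumPy. A single-step sketch (illustrative values, not Keras's actual implementation):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update, transcribing the formulas above directly."""
    m = beta1 * m + (1 - beta1) * g        # 1st moment
    v = beta2 * v + (1 - beta2) * g**2     # 2nd moment
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
g = np.array([0.5, -0.5])  # pretend gradient
theta, m, v = adam_step(theta, g, m, v, t=1)
# -> theta ≈ [0.999, -1.999]
```

On the first step, bias correction makes m̂ = g and v̂ = g², so the update is ≈ α·sign(g) regardless of gradient scale; this is why Adam is fairly insensitive to gradient magnitude early in training.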

Optimizer Comparison

```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential
import numpy as np
import matplotlib.pyplot as plt

def create_model():
    return Sequential([
        layers.Dense(64, activation='relu', input_shape=(10,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])

# Create synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
y = np.sum(X[:, :3], axis=1, keepdims=True) + np.random.randn(1000, 1) * 0.1

# Test different optimizers
optimizers = {
    'SGD': tf.keras.optimizers.SGD(0.01),
    'SGD+Momentum': tf.keras.optimizers.SGD(0.01, momentum=0.9),
    'Adam': tf.keras.optimizers.Adam(0.001),
    'AdamW': tf.keras.optimizers.AdamW(0.001, weight_decay=0.01),
    'RMSprop': tf.keras.optimizers.RMSprop(0.001),
}

histories = {}
for name, opt in optimizers.items():
    model = create_model()
    model.compile(optimizer=opt, loss='mse')
    history = model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)
    histories[name] = history.history['val_loss']

# Plot comparison
plt.figure(figsize=(10, 5))
for name, losses in histories.items():
    plt.plot(losses, label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Optimizer Comparison')
plt.legend()
plt.yscale('log')
plt.show()
```

When to use which?

| Optimizer | Best for |
| --- | --- |
| Adam | Default choice, works well in most cases |
| AdamW | Transformers, fine-tuning |
| SGD+Momentum | Computer vision, final fine-tuning |
| RMSprop | RNNs, non-stationary problems |
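The lesson objectives also mention gradient clipping: capping gradient magnitude to stabilize training (common for RNNs and Transformers). Keras optimizers accept `clipvalue`, `clipnorm` (per variable), and `global_clipnorm` arguments; the global-norm rule itself can be sketched in NumPy (the helper name is illustrative):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down together if their combined L2 norm exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads, global_norm

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
# After clipping, the global norm is exactly max_norm; directions are preserved
```

In Keras the equivalent is e.g. `tf.keras.optimizers.Adam(1e-3, global_clipnorm=1.0)`.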

Checkpoint

Do you understand the optimizers?

📈 Learning Rate Scheduling

Why schedule the learning rate?

The learning rate is the most important hyperparameter!

  • Too high: training diverges and never converges
  • Too low: convergence is slow and can get stuck in local minima
  • Scheduling: gradually lower the LR for better convergence
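A toy example makes the first two bullets visible. For gradient descent on f(x) = x² (gradient 2x), each step multiplies x by (1 - 2*lr), so an lr above 1 diverges while a tiny lr barely moves (values chosen purely for illustration):

```python
def run_gd(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2; each step does x -> (1 - 2*lr) * x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

too_high = run_gd(lr=1.1)    # |1 - 2.2| = 1.2 > 1: |x| blows up
too_low  = run_gd(lr=0.001)  # shrinks by 0.998 per step: barely moves
good     = run_gd(lr=0.1)    # shrinks by 0.8 per step: x is near 0
```

Schedules try to get the best of both: a large lr early to move fast, then a small lr late to settle into the minimum.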

Common Schedules

```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

total_steps = 1000
initial_lr = 0.1

# 1. Step Decay
step_decay = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[300, 600, 800],
    values=[0.1, 0.01, 0.001, 0.0001]
)

# 2. Exponential Decay
exp_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.96
)

# 3. Cosine Annealing
cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=total_steps,
    alpha=0.0  # Final LR = initial_lr * alpha
)

# 4. Cosine with Warmup
class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr, warmup_steps, total_steps):
        super().__init__()
        self.initial_lr = initial_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.initial_lr * (step / self.warmup_steps)

        decay_steps = self.total_steps - self.warmup_steps
        decay_step = step - self.warmup_steps
        cosine_lr = self.initial_lr * 0.5 * (
            1 + tf.cos(np.pi * decay_step / decay_steps)
        )

        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)

warmup_cosine = WarmupCosineDecay(0.1, warmup_steps=100, total_steps=total_steps)

# Visualize
steps = range(total_steps)
schedules = {
    'Step Decay': [step_decay(s).numpy() for s in steps],
    'Exponential': [exp_decay(s).numpy() for s in steps],
    'Cosine': [cosine_decay(s).numpy() for s in steps],
    'Warmup+Cosine': [warmup_cosine(s).numpy() for s in steps],
}

plt.figure(figsize=(12, 4))
for name, lrs in schedules.items():
    plt.plot(lrs, label=name)
plt.xlabel('Step')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.show()
```

Callbacks for LR Scheduling

```python
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# 1. Custom schedule function
def scheduler(epoch, lr):
    if epoch < 10:
        return lr
    else:
        return lr * tf.math.exp(-0.1)

lr_scheduler = LearningRateScheduler(scheduler)

# 2. Reduce LR on Plateau (automatic)
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,   # New LR = LR * factor
    patience=5,   # Wait 5 epochs
    min_lr=1e-7,
    verbose=1
)

# Usage (assumes model, X_train, y_train are defined)
model.fit(
    X_train, y_train,
    epochs=100,
    callbacks=[lr_scheduler, reduce_lr]
)
```

Checkpoint

Do you know the LR scheduling strategies?

🛡️ Regularization Techniques

L1 and L2 Regularization

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# L2 Regularization (Weight Decay)
model = tf.keras.Sequential([
    layers.Dense(
        256,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)  # λ = 0.01
    ),
    layers.Dense(
        128,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)
    ),
    layers.Dense(10, activation='softmax')
])

# L1 Regularization (Sparsity)
l1_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1(0.001)
)

# L1 + L2 (Elastic Net)
elastic_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01)
)
```

Loss functions:

$$L_{L2} = L_{original} + \lambda \sum_i w_i^2$$

$$L_{L1} = L_{original} + \lambda \sum_i |w_i|$$
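Plugging illustrative numbers into these penalty terms shows what each adds to the loss (λ and the weights here are made up):

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])  # illustrative weight vector
lam = 0.01

l2_penalty = lam * np.sum(w**2)       # 0.01 * (0.25 + 1 + 4) = 0.0525
l1_penalty = lam * np.sum(np.abs(w))  # 0.01 * (0.5 + 1 + 2)  = 0.035
```

Because the L1 penalty has constant slope, it keeps pushing small weights to exactly zero (sparsity), while the L2 penalty mainly shrinks large weights.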

Dropout

```python
import tensorflow as tf
from tensorflow.keras import layers

# Standard Dropout
model = tf.keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),  # Drop 50% of activations during training
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

# Spatial Dropout (for CNNs)
cnn_model = tf.keras.Sequential([
    layers.Conv2D(64, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),  # Drop entire feature maps
    layers.Conv2D(128, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),
])

# Dropout in an RNN
rnn_model = tf.keras.Sequential([
    layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(10)
])
```

Batch Normalization

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Conv block with BatchNorm
    layers.Conv2D(64, 3, padding='same'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(2),

    # Dense block with BatchNorm
    layers.Flatten(),
    layers.Dense(256),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.5),

    layers.Dense(10, activation='softmax')
])

# Layer Normalization (for Transformers)
# MultiHeadAttention takes (query, value) inputs, so it needs the functional API:
inputs = layers.Input(shape=(None, 64))
attn = layers.MultiHeadAttention(num_heads=8, key_dim=64)(inputs, inputs)
x = layers.LayerNormalization()(inputs + attn)  # residual + LayerNorm instead of BatchNorm
transformer_block = tf.keras.Model(inputs, x)
```

Early Stopping

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,  # Wait 10 epochs
    restore_best_weights=True,
    verbose=1
)

checkpoint = ModelCheckpoint(
    'best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop, checkpoint]
)
```

Checkpoint

Do you understand the regularization techniques?

🔀 Data Augmentation

Image Augmentation

```python
import tensorflow as tf
from tensorflow.keras import layers

# Keras preprocessing layers (GPU accelerated)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# Apply in the model
model = tf.keras.Sequential([
    # Augmentation (only active during training)
    data_augmentation,

    # Model
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax')
])

# Or apply in the dataset pipeline
def augment(image, label):
    image = data_augmentation(image, training=True)
    return image, label

augmented_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```

CutMix and MixUp

```python
import tensorflow as tf
import numpy as np

def mixup(images, labels, alpha=0.2):
    """MixUp augmentation"""
    batch_size = tf.shape(images)[0]

    # Sample lambda from a Beta distribution
    lam = np.random.beta(alpha, alpha)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Mix images and labels
    mixed_images = lam * images + (1 - lam) * tf.gather(images, indices)
    mixed_labels = lam * labels + (1 - lam) * tf.gather(labels, indices)

    return mixed_images, mixed_labels


def cutmix(images, labels, alpha=1.0):
    """CutMix augmentation (simplified: mixes labels but does not paste the patch)"""
    batch_size = tf.shape(images)[0]
    img_h, img_w = tf.shape(images)[1], tf.shape(images)[2]

    # Sample lambda
    lam = np.random.beta(alpha, alpha)

    # Get a random box
    cut_ratio = tf.sqrt(1.0 - lam)
    cut_h = tf.cast(tf.cast(img_h, tf.float32) * cut_ratio, tf.int32)
    cut_w = tf.cast(tf.cast(img_w, tf.float32) * cut_ratio, tf.int32)

    cx = tf.random.uniform([], 0, img_w, dtype=tf.int32)
    cy = tf.random.uniform([], 0, img_h, dtype=tf.int32)

    x1 = tf.clip_by_value(cx - cut_w // 2, 0, img_w)
    y1 = tf.clip_by_value(cy - cut_h // 2, 0, img_h)
    x2 = tf.clip_by_value(cx + cut_w // 2, 0, img_w)
    y2 = tf.clip_by_value(cy + cut_h // 2, 0, img_h)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Mix (simplified - a full implementation would paste the (x1, y1, x2, y2) patch)
    mixed_labels = lam * labels + (1 - lam) * tf.gather(labels, indices)

    return images, mixed_labels  # Simplified


# Usage in a training loop
for images, labels in train_ds:
    # Apply MixUp 50% of the time
    if np.random.random() > 0.5:
        images, labels = mixup(images, labels)

    # Training step...
```
Text Augmentation

```python
import random

def text_augmentation(text, aug_prob=0.1):
    """Simple text augmentation techniques"""
    words = text.split()

    # Random deletion
    words = [w for w in words if random.random() > aug_prob]

    # Random swap
    if len(words) > 2 and random.random() < aug_prob:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]

    return ' '.join(words)

# Synonym replacement using WordNet (needs nltk)
# from nltk.corpus import wordnet
# def synonym_replacement(text, n=1):
#     words = text.split()
#     for _ in range(n):
#         word = random.choice(words)
#         synonyms = wordnet.synsets(word)
#         if synonyms:
#             synonym = synonyms[0].lemmas()[0].name()
#             words = [synonym if w == word else w for w in words]
#     return ' '.join(words)
```

Checkpoint

Do you know the data augmentation techniques?

🚀 Training Best Practices

Complete Training Setup

```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    TensorBoard
)

def create_optimized_model(input_shape, num_classes):
    """Model with all optimization techniques"""

    # Data augmentation
    augmentation = tf.keras.Sequential([
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ])

    # Build model
    inputs = layers.Input(shape=input_shape)
    x = augmentation(inputs)

    # Conv blocks with BatchNorm
    for filters in [64, 128, 256]:
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D()(x)
        x = layers.SpatialDropout2D(0.2)(x)

    # Head
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs)


def train_with_best_practices(model, train_ds, val_ds):
    """Training with all best practices"""

    # Optimizer with schedule
    initial_lr = 0.001
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_lr, decay_steps=1000
    )

    optimizer = tf.keras.optimizers.AdamW(
        learning_rate=lr_schedule,
        weight_decay=0.01
    )

    # Compile
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Callbacks
    # Note: ReduceLROnPlateau is omitted here - it cannot adjust a learning rate
    # that is already driven by a schedule. Pick one mechanism or the other.
    callbacks = [
        EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        ),
        ModelCheckpoint(
            'best_model.keras',
            monitor='val_accuracy',
            save_best_only=True
        ),
        TensorBoard(log_dir='./logs')
    ]

    # Train
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=100,
        callbacks=callbacks
    )

    return history


# Usage
model = create_optimized_model((224, 224, 3), 10)
# history = train_with_best_practices(model, train_ds, val_ds)
```

Training Checklist

Before Training:

  • Data augmentation configured
  • Proper train/val/test split
  • Learning rate schedule set
  • Regularization (dropout, weight decay)
  • Callbacks (early stopping, checkpointing)
  • Logging (TensorBoard)

During Training:

  • Monitor loss curves
  • Check for overfitting
  • Validate LR is appropriate

After Training:

  • Evaluate on test set
  • Save best model
  • Document hyperparameters

Checkpoint

Have you mastered the training best practices?

🎯 Optimization Summary

Summary Table

| Technique | Purpose | When to use |
| --- | --- | --- |
| Adam | Adaptive LR per parameter | Default choice |
| AdamW | Adam + weight decay | Transformers |
| Cosine Schedule | Smooth LR decay | Most training |
| Warmup | Avoid early instability | Large models |
| Dropout | Regularization | Dense layers |
| BatchNorm | Stabilize training | CNNs |
| Early Stopping | Prevent overfitting | Always |
| Data Augmentation | More data | Limited data |

Key Hyperparameters

| Hyperparameter | Start with | Tune |
| --- | --- | --- |
| Learning Rate | 1e-3 (CNN), 1e-5 (fine-tune) | Grid search |
| Batch Size | 32 | Based on memory |
| Dropout | 0.5 | 0.1 - 0.7 |
| Weight Decay | 0.01 | 0.001 - 0.1 |
| Warmup | 10% of steps | 5-15% |

Next Lesson

Deployment & Production:

  • Model compression
  • Inference optimization
  • Deployment options

🎉 Optimization complete! You now know how to optimize training for deep learning models.