🎯 Lesson Objectives
After this lesson, you will:
✅ Understand the optimizers SGD, Adam, and AdamW
✅ Know learning rate scheduling
✅ Understand gradient clipping and other techniques
✅ Be able to pick the right optimizer for each problem
Knowledge recap
This lesson brings together the optimization techniques that make models train faster and better!
⚡ The Importance of Optimization
Training Deep Learning = Optimization Problem
Goal: find parameters θ that minimize Loss(θ).
Deep learning optimization is hard because:
- Non-convex: many local minima
- High-dimensional: Millions of parameters
- Noisy gradients: Mini-batch approximation
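The noisy-gradients point can be made concrete with a toy linear-regression problem (synthetic data, made up for this sketch): a gradient estimated on a mini-batch scatters around the full-batch gradient.

```python
import numpy as np

# Toy example: the mini-batch gradient is a noisy estimate of the
# full-batch gradient. Data and constants are made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
y = 3.0 * X + rng.normal(scale=0.1, size=10_000)

def grad_w(w, xs, ys):
    # d/dw of mean((w*x - y)^2)
    return np.mean(2 * xs * (w * xs - ys))

full = grad_w(0.0, X, y)  # full-batch gradient, close to -6 here
batch_grads = [grad_w(0.0, X[i:i + 32], y[i:i + 32])
               for i in range(0, 3200, 32)]
print(np.std(batch_grads))  # nonzero spread: mini-batch gradient noise
```

SGD works despite this noise because the mini-batch gradient is an unbiased estimate; the noise only makes individual steps jittery.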
Factors that influence training
| Factor | Impact |
|---|---|
| Optimizer | Speed and direction of updates |
| Learning Rate | Step size of each update |
| Batch Size | Noise level of the gradient |
| Regularization | Prevent overfitting |
| Initialization | Starting point |
Checkpoint
Do you understand why optimization matters?
🔧 Optimizers
SGD (Stochastic Gradient Descent)
```python
import tensorflow as tf

# Basic SGD
sgd = tf.keras.optimizers.SGD(learning_rate=0.01)

# SGD with Momentum
sgd_momentum = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9  # Accumulate past gradients
)

# SGD with Nesterov Momentum (look-ahead)
sgd_nesterov = tf.keras.optimizers.SGD(
    learning_rate=0.01,
    momentum=0.9,
    nesterov=True
)
```

Momentum formula:
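With learning rate $\eta$ and momentum coefficient $\gamma$ (the `momentum=0.9` argument), the standard update is:

$$v_t = \gamma\,v_{t-1} + \eta\,\nabla_\theta L(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$$

Nesterov momentum evaluates the gradient at the look-ahead point $\theta_{t-1} - \gamma\,v_{t-1}$ instead, which often gives slightly faster convergence.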
Adam (Adaptive Moment Estimation)
```python
# Adam = Momentum + RMSprop
adam = tf.keras.optimizers.Adam(
    learning_rate=0.001,  # Default
    beta_1=0.9,           # Momentum term
    beta_2=0.999,         # RMSprop term
    epsilon=1e-7          # Numerical stability
)

# AdamW (Adam with decoupled weight decay)
adamw = tf.keras.optimizers.AdamW(
    learning_rate=0.001,
    weight_decay=0.01  # Decoupled weight decay (not plain L2 added to the loss)
)
```

Adam formulas:
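With gradient $g_t$, Adam keeps exponential moving averages of the gradient and its square, bias-corrects them, and scales the step per parameter:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2$$

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\,\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$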
Optimizer Comparison
```python
import tensorflow as tf
from tensorflow.keras import layers, Sequential
import numpy as np
import matplotlib.pyplot as plt

def create_model():
    return Sequential([
        layers.Dense(64, activation='relu', input_shape=(10,)),
        layers.Dense(32, activation='relu'),
        layers.Dense(1)
    ])

# Create synthetic data
np.random.seed(42)
X = np.random.randn(1000, 10)
y = np.sum(X[:, :3], axis=1, keepdims=True) + np.random.randn(1000, 1) * 0.1

# Test different optimizers
optimizers = {
    'SGD': tf.keras.optimizers.SGD(0.01),
    'SGD+Momentum': tf.keras.optimizers.SGD(0.01, momentum=0.9),
    'Adam': tf.keras.optimizers.Adam(0.001),
    'AdamW': tf.keras.optimizers.AdamW(0.001, weight_decay=0.01),
    'RMSprop': tf.keras.optimizers.RMSprop(0.001),
}

histories = {}
for name, opt in optimizers.items():
    model = create_model()
    model.compile(optimizer=opt, loss='mse')
    history = model.fit(X, y, epochs=50, validation_split=0.2, verbose=0)
    histories[name] = history.history['val_loss']

# Plot comparison
plt.figure(figsize=(10, 5))
for name, losses in histories.items():
    plt.plot(losses, label=name)
plt.xlabel('Epoch')
plt.ylabel('Validation Loss')
plt.title('Optimizer Comparison')
plt.legend()
plt.yscale('log')
plt.show()
```

When to use which?
| Optimizer | Best for |
|---|---|
| Adam | Default choice; works well in most cases |
| AdamW | Transformers, fine-tuning |
| SGD+Momentum | Computer Vision, final fine-tuning |
| RMSprop | RNNs, non-stationary problems |
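The objectives also mention gradient clipping. In Keras it is just an optimizer argument (`clipnorm`, `clipvalue`, or `global_clipnorm` on any optimizer); the clip-by-global-norm mechanic behind it can be sketched in NumPy (the helper name below is ours, for illustration, not a TF API):

```python
import numpy as np

# Sketch of clip-by-global-norm (what Keras's `global_clipnorm`
# optimizer argument does under the hood).
def clip_by_global_norm(grads, max_norm):
    # Global L2 norm across all gradient tensors
    gnorm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale only when the norm exceeds the threshold
    scale = min(1.0, max_norm / (gnorm + 1e-12))
    return [g * scale for g in grads], gnorm

grads = [np.array([3.0, 4.0]), np.array([0.0])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print(norm_before)  # 5.0 — after clipping, the global norm is 1.0
```

In practice you would simply pass `global_clipnorm=1.0` (or `clipnorm=`, `clipvalue=`) to e.g. `tf.keras.optimizers.Adam`; clipping is especially useful against exploding gradients in RNNs and Transformers.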
Checkpoint
Do you understand the optimizers?
📈 Learning Rate Scheduling
Why schedule the learning rate?
The learning rate is the most important hyperparameter!
- Too high: training diverges, never converges
- Too low: slow convergence, risk of getting stuck in poor local minima
- Scheduling: gradually lower the LR for better convergence
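These regimes can be seen on a toy 1-D quadratic loss (a sketch with made-up constants; real networks need much smaller LRs):

```python
# Gradient descent on the toy loss L(θ) = θ², whose gradient is 2θ.
def run_gd(lr, steps=50, theta0=1.0):
    theta = theta0
    for _ in range(steps):
        theta -= lr * 2 * theta  # θ ← θ − lr · dL/dθ
    return theta

print(abs(run_gd(0.01)))  # too low: still far from the minimum at 0
print(abs(run_gd(0.4)))   # well chosen: essentially 0
print(abs(run_gd(1.1)))   # too high: |θ| grows every step — divergence
```

Scheduling combines the best of both: large early steps for speed, small late steps for a clean final convergence.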
Common Schedules
```python
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

total_steps = 1000
initial_lr = 0.1

# 1. Step Decay
step_decay = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[300, 600, 800],
    values=[0.1, 0.01, 0.001, 0.0001]
)

# 2. Exponential Decay
exp_decay = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.1,
    decay_steps=100,
    decay_rate=0.96
)

# 3. Cosine Annealing
cosine_decay = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.1,
    decay_steps=total_steps,
    alpha=0.0  # Final LR = initial_lr * alpha
)

# 4. Cosine with Warmup
warmup_steps = 100

class WarmupCosineDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, initial_lr, warmup_steps, total_steps):
        super().__init__()
        self.initial_lr = initial_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup_lr = self.initial_lr * (step / self.warmup_steps)

        decay_steps = self.total_steps - self.warmup_steps
        decay_step = step - self.warmup_steps
        cosine_lr = self.initial_lr * 0.5 * (
            1 + tf.cos(np.pi * decay_step / decay_steps)
        )

        return tf.where(step < self.warmup_steps, warmup_lr, cosine_lr)

warmup_cosine = WarmupCosineDecay(0.1, warmup_steps, total_steps)

# Visualize
steps = range(total_steps)
schedules = {
    'Step Decay': [step_decay(s).numpy() for s in steps],
    'Exponential': [exp_decay(s).numpy() for s in steps],
    'Cosine': [cosine_decay(s).numpy() for s in steps],
    'Warmup+Cosine': [warmup_cosine(s).numpy() for s in steps],
}

plt.figure(figsize=(12, 4))
for name, lrs in schedules.items():
    plt.plot(lrs, label=name)
plt.xlabel('Step')
plt.ylabel('Learning Rate')
plt.title('Learning Rate Schedules')
plt.legend()
plt.show()
```

Callbacks for LR Scheduling
```python
import tensorflow as tf
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# 1. Custom schedule function
def scheduler(epoch, lr):
    if epoch < 10:
        return lr
    return float(lr * tf.math.exp(-0.1))

lr_scheduler = LearningRateScheduler(scheduler)

# 2. Reduce LR on Plateau (automatic)
reduce_lr = ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.2,   # New LR = LR * factor
    patience=5,   # Wait 5 epochs without improvement
    min_lr=1e-7,
    verbose=1
)

# Usage — pick one: LearningRateScheduler overrides the LR every epoch,
# so combining it with ReduceLROnPlateau would undo the plateau reductions
model.fit(
    X_train, y_train,
    epochs=100,
    callbacks=[reduce_lr]
)
```

Checkpoint
Do you know the LR scheduling strategies?
🛡️ Regularization Techniques
L1 and L2 Regularization
```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# L2 Regularization (Weight Decay)
model = tf.keras.Sequential([
    layers.Dense(
        256,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)  # λ = 0.01
    ),
    layers.Dense(
        128,
        activation='relu',
        kernel_regularizer=regularizers.l2(0.01)
    ),
    layers.Dense(10, activation='softmax')
])

# L1 Regularization (Sparsity)
l1_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1(0.001)
)

# L1 + L2 (Elastic Net)
elastic_layer = layers.Dense(
    128,
    activation='relu',
    kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01)
)
```

Loss functions:
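With data loss $L_{\text{data}}$ and regularization strength $\lambda$ (the arguments to `l2`/`l1` above), the penalized objectives are:

$$L_{L2} = L_{\text{data}} + \lambda \sum_i w_i^2, \qquad L_{L1} = L_{\text{data}} + \lambda \sum_i |w_i|$$

L2 shrinks weights smoothly toward zero; L1 pushes some weights exactly to zero, which is why it produces sparsity.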
Dropout
```python
import tensorflow as tf
from tensorflow.keras import layers

# Standard Dropout
model = tf.keras.Sequential([
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),  # Drop 50% of activations during training
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax')
])

# Spatial Dropout (for CNNs)
cnn_model = tf.keras.Sequential([
    layers.Conv2D(64, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),  # Drop entire feature maps
    layers.Conv2D(128, 3, activation='relu'),
    layers.SpatialDropout2D(0.2),
])

# Dropout in RNNs (note: recurrent_dropout disables the fast cuDNN kernel)
rnn_model = tf.keras.Sequential([
    layers.LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    layers.Dense(10)
])
```

Batch Normalization
```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Conv block with BatchNorm
    layers.Conv2D(64, 3, padding='same'),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.MaxPooling2D(2),

    # Dense block with BatchNorm
    layers.Flatten(),
    layers.Dense(256),
    layers.BatchNormalization(),
    layers.Activation('relu'),
    layers.Dropout(0.5),

    layers.Dense(10, activation='softmax')
])

# Layer Normalization (used in Transformers instead of BatchNorm).
# MultiHeadAttention takes separate query/value inputs, so it cannot
# live inside a Sequential — use the functional API:
inputs = layers.Input(shape=(None, 64))
attn_out = layers.MultiHeadAttention(num_heads=8, key_dim=64)(inputs, inputs)
x = layers.LayerNormalization()(inputs + attn_out)  # residual + LayerNorm
```

Early Stopping
```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,  # Wait 10 epochs without improvement
    restore_best_weights=True,
    verbose=1
)

checkpoint = ModelCheckpoint(
    'best_model.keras',
    monitor='val_loss',
    save_best_only=True
)

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=[early_stop, checkpoint]
)
```

Checkpoint
Do you understand the regularization techniques?
🔀 Data Augmentation
Image Augmentation
```python
import tensorflow as tf
from tensorflow.keras import layers

# Keras preprocessing layers (GPU accelerated)
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.1),
    layers.RandomTranslation(0.1, 0.1),
])

# Apply in the model
model = tf.keras.Sequential([
    # Augmentation (active only during training)
    data_augmentation,

    # Model
    layers.Conv2D(32, 3, activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation='softmax')
])

# Or apply in the dataset pipeline
def augment(image, label):
    image = data_augmentation(image, training=True)
    return image, label

augmented_ds = train_ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```

CutMix and MixUp
```python
import tensorflow as tf
import numpy as np

def mixup(images, labels, alpha=0.2):
    """MixUp augmentation"""
    batch_size = tf.shape(images)[0]

    # Sample lambda from a Beta distribution
    lam = np.random.beta(alpha, alpha)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Mix images and labels
    mixed_images = lam * images + (1 - lam) * tf.gather(images, indices)
    mixed_labels = lam * labels + (1 - lam) * tf.gather(labels, indices)

    return mixed_images, mixed_labels


def cutmix(images, labels, alpha=1.0):
    """CutMix augmentation"""
    batch_size = tf.shape(images)[0]
    img_h, img_w = tf.shape(images)[1], tf.shape(images)[2]

    # Sample lambda
    lam = np.random.beta(alpha, alpha)

    # Get a random box whose area fraction is (1 - lambda)
    cut_ratio = tf.sqrt(1.0 - lam)
    cut_h = tf.cast(tf.cast(img_h, tf.float32) * cut_ratio, tf.int32)
    cut_w = tf.cast(tf.cast(img_w, tf.float32) * cut_ratio, tf.int32)

    cx = tf.random.uniform([], 0, img_w, dtype=tf.int32)
    cy = tf.random.uniform([], 0, img_h, dtype=tf.int32)

    x1 = tf.clip_by_value(cx - cut_w // 2, 0, img_w)
    y1 = tf.clip_by_value(cy - cut_h // 2, 0, img_h)
    x2 = tf.clip_by_value(cx + cut_w // 2, 0, img_w)
    y2 = tf.clip_by_value(cy + cut_h // 2, 0, img_h)

    # Random shuffle
    indices = tf.random.shuffle(tf.range(batch_size))

    # Paste the box from the shuffled images using a binary mask
    shuffled = tf.gather(images, indices)
    yy = tf.range(img_h)[:, None]
    xx = tf.range(img_w)[None, :]
    mask = tf.cast((yy >= y1) & (yy < y2) & (xx >= x1) & (xx < x2),
                   images.dtype)
    mask = mask[None, :, :, None]  # broadcast over batch and channels
    mixed_images = images * (1 - mask) + shuffled * mask

    # Adjust lambda to the actual (clipped) box area
    box_area = tf.cast((y2 - y1) * (x2 - x1), tf.float32)
    lam_adj = 1.0 - box_area / tf.cast(img_h * img_w, tf.float32)
    mixed_labels = lam_adj * labels + (1 - lam_adj) * tf.gather(labels, indices)

    return mixed_images, mixed_labels


# Usage in a training loop
for images, labels in train_ds:
    # Apply MixUp 50% of the time
    if np.random.random() > 0.5:
        images, labels = mixup(images, labels)

    # Training step...
```

Text Augmentation
```python
import random

def text_augmentation(text, aug_prob=0.1):
    """Simple text augmentation techniques"""
    words = text.split()

    # Random deletion
    words = [w for w in words if random.random() > aug_prob]

    # Random swap
    if len(words) > 2 and random.random() < aug_prob:
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]

    return ' '.join(words)

# Synonym replacement using WordNet (requires nltk)
# from nltk.corpus import wordnet
# def synonym_replacement(text, n=1):
#     words = text.split()
#     for _ in range(n):
#         word = random.choice(words)
#         synonyms = wordnet.synsets(word)
#         if synonyms:
#             synonym = synonyms[0].lemmas()[0].name()
#             words = [synonym if w == word else w for w in words]
#     return ' '.join(words)
```

Checkpoint
Do you know the data augmentation techniques?
🚀 Training Best Practices
Complete Training Setup
```python
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.callbacks import (
    EarlyStopping,
    ModelCheckpoint,
    TensorBoard
)

def create_optimized_model(input_shape, num_classes):
    """Model with all optimization techniques"""

    # Data augmentation
    augmentation = tf.keras.Sequential([
        layers.RandomFlip("horizontal"),
        layers.RandomRotation(0.1),
        layers.RandomZoom(0.1),
    ])

    # Build model
    inputs = layers.Input(shape=input_shape)
    x = augmentation(inputs)

    # Conv blocks with BatchNorm
    for filters in [64, 128, 256]:
        x = layers.Conv2D(filters, 3, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D()(x)
        x = layers.SpatialDropout2D(0.2)(x)

    # Head
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs)


def train_with_best_practices(model, train_ds, val_ds):
    """Training with all best practices"""

    # Optimizer with schedule
    initial_lr = 0.001
    lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
        initial_lr, decay_steps=1000
    )

    optimizer = tf.keras.optimizers.AdamW(
        learning_rate=lr_schedule,
        weight_decay=0.01
    )

    # Compile
    model.compile(
        optimizer=optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )

    # Callbacks (ReduceLROnPlateau is omitted here: the LR is already
    # driven by the CosineDecay schedule, which callbacks cannot override)
    callbacks = [
        EarlyStopping(
            monitor='val_loss',
            patience=10,
            restore_best_weights=True
        ),
        ModelCheckpoint(
            'best_model.keras',
            monitor='val_accuracy',
            save_best_only=True
        ),
        TensorBoard(log_dir='./logs')
    ]

    # Train
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=100,
        callbacks=callbacks
    )

    return history


# Usage
model = create_optimized_model((224, 224, 3), 10)
# history = train_with_best_practices(model, train_ds, val_ds)
```

Training Checklist
Before Training:
- Data augmentation configured
- Proper train/val/test split
- Learning rate schedule set
- Regularization (dropout, weight decay)
- Callbacks (early stopping, checkpointing)
- Logging (TensorBoard)
During Training:
- Monitor loss curves
- Check for overfitting
- Validate LR is appropriate
After Training:
- Evaluate on test set
- Save best model
- Document hyperparameters
Checkpoint
Have you mastered the training best practices?
🎯 Optimization Wrap-up
Summary Table
| Technique | Purpose | When to use |
|---|---|---|
| Adam | Adaptive LR per parameter | Default choice |
| AdamW | Adam + weight decay | Transformers |
| Cosine Schedule | Smooth LR decay | Most training |
| Warmup | Avoid early instability | Large models |
| Dropout | Regularization | Dense layers |
| BatchNorm | Stabilize training | CNNs |
| Early Stopping | Prevent overfitting | Always |
| Data Augmentation | Effectively more data | Limited data |
Key Hyperparameters
| Hyperparameter | Start with | Tune |
|---|---|---|
| Learning Rate | 1e-3 (CNN), 1e-5 (fine-tune) | Grid search |
| Batch Size | 32 | Based on memory |
| Dropout | 0.5 | 0.1 - 0.7 |
| Weight Decay | 0.01 | 0.001 - 0.1 |
| Warmup | 10% of steps | 5-15% |
Next Lesson
Deployment & Production:
- Model compression
- Inference optimization
- Deployment options
🎉 Optimization complete! You now know how to optimize training for deep learning models.
