Theory
4-5 hours
Lesson 12/15

Cross-Validation and Hyperparameter Tuning

Overfitting, Cross-Validation, GridSearchCV, and Best Practices


Learning Objectives

After this lesson, learners will:

  • Understand what overfitting is and how to detect it
  • Master the main cross-validation methods
  • Know how to use GridSearchCV and RandomizedSearchCV
  • Practice tuning hyperparameters

1. Overfitting vs Underfitting

1.1 Definitions

| State | Train Error | Test Error | Problem |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good fit | Low | Low | Ideal |
| Overfitting | Very low | High | Model too complex |

Figure: Underfitting (left) - Good fit (center) - Overfitting (right)

1.2 Bias-Variance Tradeoff

$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$

| Component | Meaning |
|---|---|
| Bias | Error from overly simple assumptions |
| Variance | Error from sensitivity to the training data |
| Irreducible Error | Error that cannot be reduced (noise) |
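
The tradeoff is easy to see by sweeping model complexity. A minimal sketch on synthetic data (everything here is illustrative, not part of the lesson's dataset): as max_depth grows, training error keeps falling while test error eventually rises again.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative synthetic data: sin curve plus noise (the noise is the irreducible error)
rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Shallow trees -> high bias; very deep trees -> high variance
for depth in [1, 3, 10, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    print(f"max_depth={depth}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")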

1.3 How to Detect Overfitting

Python
# Train the model, then compare train and test scores
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"Train Score: {train_score:.4f}")
print(f"Test Score: {test_score:.4f}")
print(f"Gap: {train_score - test_score:.4f}")

# If the gap is greater than ~0.1, the model may be overfitting

2. Cross-Validation

2.1 Why Cross-Validation?

Problems with a single train/test split:

  • The test score depends on how the data happens to be split
  • High variance between different random splits

Solution: K-Fold Cross-Validation.

2.2 K-Fold Cross-Validation

Algorithm:

| Step | Description |
|---|---|
| 1 | Split the data into K equal folds |
| 2 | For each fold i: train on the other K-1 folds, test on fold i |
| 3 | Average the K scores |

Example with 5 folds:

| Fold | Data 1 | Data 2 | Data 3 | Data 4 | Data 5 |
|---|---|---|---|---|---|
| 1 | Test | Train | Train | Train | Train |
| 2 | Train | Test | Train | Train | Train |
| 3 | Train | Train | Test | Train | Train |
| 4 | Train | Train | Train | Test | Train |
| 5 | Train | Train | Train | Train | Test |

Figure: K-Fold Cross-Validation
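
The fold assignment in the table above can be reproduced directly with sklearn's KFold; a small sketch with 10 illustrative samples:

Python
import numpy as np
from sklearn.model_selection import KFold

# Each sample index lands in the test fold exactly once
X_demo = np.arange(10).reshape(-1, 1)
kf = KFold(n_splits=5)
for i, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    print(f"Fold {i}: train={train_idx}, test={test_idx}")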

2.3 Types of Cross-Validation

| Method | Use case |
|---|---|
| K-Fold | General purpose |
| Stratified K-Fold | Classification, especially with imbalanced classes |
| Leave-One-Out | Small datasets |
| Time Series Split | Time series data |
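
These splitters differ mainly in how they assign indices. For example, TimeSeriesSplit only ever trains on data that precedes the test fold; a minimal sketch with 6 ordered samples:

Python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Training windows grow forward in time; the test fold is always later
X_ts = np.arange(6).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X_ts):
    print(f"train={train_idx}, test={test_idx}")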

3. A Worked Example by Hand

3.1 5-Fold CV with Accuracy

Scores for the 5 folds:

  • Fold 1: 0.85
  • Fold 2: 0.82
  • Fold 3: 0.88
  • Fold 4: 0.84
  • Fold 5: 0.86

Mean: $\bar{x} = \frac{0.85 + 0.82 + 0.88 + 0.84 + 0.86}{5} = 0.85$

Standard Deviation: $\sigma = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n}} = 0.02$

Result: $0.85 \pm 0.02$
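
The same numbers can be checked with numpy (note that np.std defaults to the population formula, ddof=0, matching the equation above):

Python
import numpy as np

fold_scores = np.array([0.85, 0.82, 0.88, 0.84, 0.86])
print(f"Mean: {fold_scores.mean():.2f}")  # 0.85
print(f"Std:  {fold_scores.std():.3f}")   # 0.020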


4. Hands-On Cross-Validation

4.1 Basic Cross-Validation
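
The snippets in this lesson assume X, y and a train/test split already exist. A minimal setup (the breast cancer dataset here is just an illustrative stand-in, not prescribed by the lesson):

Python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Any classification dataset works; this one ships with sklearn
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)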

Python
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
import numpy as np

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Simple K-Fold CV
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(f"CV Scores: {scores}")
print(f"Mean: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

# Stratified K-Fold (for classification)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring='f1')
print(f"Stratified CV F1: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")

4.2 Multiple Metrics

Python
from sklearn.model_selection import cross_validate

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(model, X, y, cv=5, scoring=scoring)

for metric in scoring:
    scores = cv_results[f'test_{metric}']
    print(f"{metric}: {scores.mean():.4f} (+/- {scores.std()*2:.4f})")
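
cross_validate can also return training scores, which gives the per-fold overfitting gap from Section 1.3; a short sketch reusing the same model, X, y as above:

Python
from sklearn.model_selection import cross_validate

# return_train_score=True adds 'train_score' alongside 'test_score'
cv_results = cross_validate(model, X, y, cv=5, scoring='f1',
                            return_train_score=True)
gap = cv_results['train_score'].mean() - cv_results['test_score'].mean()
print(f"Train-validation gap: {gap:.4f}")  # a large gap suggests overfitting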

5. Hyperparameter Tuning

5.1 GridSearchCV

An exhaustive search over every combination of hyperparameters in a given grid.

Python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,  # parallel processing
    verbose=1
)

grid_search.fit(X_train, y_train)

# Results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best CV Score: {grid_search.best_score_:.4f}")

# Best model
best_model = grid_search.best_estimator_
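
The full search history, not just the winner, is worth inspecting; a small sketch (assuming pandas is available) that ranks all the grid's combinations by mean CV score:

Python
import pandas as pd

# cv_results_ has one row per parameter combination
results = pd.DataFrame(grid_search.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score').head())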

5.2 RandomizedSearchCV

Faster when the parameter space is large. The grid above already has 3 × 4 × 3 × 3 = 108 combinations, i.e. 540 fits with 5-fold CV; RandomizedSearchCV instead samples a fixed number of settings from the given distributions (here 50 settings, i.e. 250 fits).

Python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Define distributions to sample from
param_dist = {
    'n_estimators': randint(50, 300),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

# RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=50,  # number of sampled parameter settings
    cv=5,
    scoring='f1',
    n_jobs=-1,
    random_state=42
)

random_search.fit(X_train, y_train)
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best CV Score: {random_search.best_score_:.4f}")

5.3 GridSearchCV vs. RandomizedSearchCV

| Aspect | GridSearchCV | RandomizedSearchCV |
|---|---|---|
| Search | Exhaustive | Random sampling |
| Speed | Slow | Fast |
| Large parameter space | Infeasible | Feasible |
| Guaranteed to find the best combination in the grid | Yes | No |

6. Learning Curves

6.1 Detecting Overfitting/Underfitting

Python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring='accuracy'
)

# Mean and std across the CV folds
train_mean = train_scores.mean(axis=1)
train_std = train_scores.std(axis=1)
val_mean = val_scores.mean(axis=1)
val_std = val_scores.std(axis=1)

# Plot the learning curve
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label='Training score')
plt.fill_between(train_sizes, train_mean - train_std,
                 train_mean + train_std, alpha=0.1)
plt.plot(train_sizes, val_mean, label='Validation score')
plt.fill_between(train_sizes, val_mean - val_std,
                 val_mean + val_std, alpha=0.1)
plt.xlabel('Training Size')
plt.ylabel('Score')
plt.title('Learning Curve')
plt.legend()
plt.grid(True)
plt.show()

Figure: Learning curve - detecting overfitting

6.2 Interpreting the Learning Curve

| Pattern | Problem | Remedy |
|---|---|---|
| Train high, validation low, large gap | Overfitting | Regularization, more data |
| Both low | Underfitting | More complex model, more features |
| Both high, small gap | Good fit | OK! |
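
A learning curve varies the training-set size; to diagnose over/underfitting against a single hyperparameter instead, sklearn's validation_curve follows the same pattern. A minimal sketch over max_depth, reusing X and y from Section 4:

Python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# One column of scores per CV fold, one row per max_depth value
depths = [2, 4, 6, 8, 10]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=42), X, y,
    param_name='max_depth', param_range=depths,
    cv=5, scoring='accuracy')

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d}: train={tr:.3f}, val={va:.3f}")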

7. Best Practices

7.1 A Complete Pipeline

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Build the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Parameter grid for a pipeline: prefix each name with its step name
param_grid = {
    'classifier__n_estimators': [50, 100],
    'classifier__max_depth': [3, 5, 10]
}

# Searching over the whole pipeline refits the scaler inside each fold,
# which avoids data leakage from the validation fold
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)
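
Following the tips below, the held-out test set is touched exactly once, after tuning is done; a short sketch evaluating the tuned pipeline:

Python
from sklearn.metrics import f1_score

# Final, one-time evaluation on the held-out test set
best_pipeline = grid_search.best_estimator_
y_pred = best_pipeline.predict(X_test)
print(f"Best params: {grid_search.best_params_}")
print(f"Test F1: {f1_score(y_test, y_pred):.4f}")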

7.2 Key Tips

| Tip | Description |
|---|---|
| Always use CV | Instead of a single train/test split |
| Stratify | For classification tasks |
| Use a Pipeline | Avoids data leakage |
| RandomizedSearch first | Coarse search before a finer GridSearch |
| Test set | Use it only once, at the very end |

Exercises

  1. Exercise 1: Implement 5-fold CV from scratch (without sklearn)
  2. Exercise 2: Compare GridSearchCV and RandomizedSearchCV on the same dataset
  3. Exercise 3: Plot a learning curve and analyze overfitting
