Theory
30 minutes
Lesson 3/4

AutoML - Automated Machine Learning

Automating the ML workflow with AutoML frameworks


AutoML automates the steps of an ML pipeline: feature engineering, model selection, and hyperparameter tuning. This lesson covers the most popular frameworks.

What is AutoML?

Diagram
graph LR
    D[Data] --> FE[Feature Engineering]
    FE --> MS[Model Selection]
    MS --> HPT[Hyperparameter Tuning]
    HPT --> E[Ensemble]
    E --> M[Final Model]
    
    style FE fill:#f9f,stroke:#333
    style MS fill:#f9f,stroke:#333
    style HPT fill:#f9f,stroke:#333
    style E fill:#f9f,stroke:#333
AutoML automates
  • Feature Engineering: Selection, transformation, creation
  • Model Selection: Try multiple algorithms
  • Hyperparameter Tuning: Find optimal params
  • Ensemble: Combine best models
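Before reaching for a framework, it helps to see what these four steps look like when done by hand. The sketch below wires up model selection, hyperparameter tuning, and a small ensemble with plain scikit-learn; the candidate models and grids are illustrative choices, not tuned recommendations.

```python
# What AutoML automates, done manually with scikit-learn:
# model selection + hyperparameter tuning per candidate, then an ensemble.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Model selection + hyperparameter tuning: one small grid per algorithm
candidates = {
    "logreg": (Pipeline([("scale", StandardScaler()),
                         ("clf", LogisticRegression(max_iter=5000))]),
               {"clf__C": [0.1, 1.0, 10.0]}),
    "rf": (Pipeline([("clf", RandomForestClassifier(random_state=42))]),
           {"clf__n_estimators": [50, 100]}),
}
best = {}
for name, (pipe, grid) in candidates.items():
    search = GridSearchCV(pipe, grid, cv=3).fit(X_train, y_train)
    best[name] = search.best_estimator_

# Ensemble: combine the tuned models via soft voting
ensemble = VotingClassifier([(n, m) for n, m in best.items()], voting="soft")
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.3f}")
```

An AutoML framework runs this loop over far more algorithms and a far larger search space, within a time budget.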

Popular AutoML Frameworks

| Framework | Pros | Use Case |
| --- | --- | --- |
| Auto-sklearn | Built on sklearn | General ML |
| H2O AutoML | Fast, production-ready | Enterprise |
| TPOT | Genetic algorithm | Exploration |
| AutoGluon | State-of-the-art | Best accuracy |
| PyCaret | Easy to use | Rapid prototyping |

Auto-sklearn

Installation

Bash
pip install auto-sklearn

Basic Usage

Python
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import sklearn.metrics

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# AutoML classifier
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes total
    per_run_time_limit=60,        # 1 minute per model
    n_jobs=-1,
    ensemble_size=10,
    memory_limit=4096
)

# Fit
automl.fit(X_train, y_train)

# Evaluate
y_pred = automl.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Show models
print(automl.leaderboard())

H2O AutoML

Setup

Python
import h2o
from h2o.automl import H2OAutoML

# Start H2O
h2o.init()

# Load data
df = h2o.import_file("data/train.csv")

# Define features and target
y = "target"
x = df.columns
x.remove(y)

# Split
train, test = df.split_frame(ratios=[0.8], seed=42)

# Run AutoML
aml = H2OAutoML(
    max_runtime_secs=300,
    max_models=20,
    seed=42,
    sort_metric="AUC"
)

aml.train(x=x, y=y, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb)

# Best model
best = aml.leader
perf = best.model_performance(test)
print(perf)

# Predictions
preds = best.predict(test)

Model Explainability

Python
# Explain a single prediction (SHAP-based plots for supported models)
h2o.explain_row(aml.leader, train, row_index=0)

# Variable importance heatmap across the AutoML models
h2o.varimp_heatmap(aml)

AutoGluon

Installation

Bash
pip install autogluon

Tabular Data

Python
from autogluon.tabular import TabularDataset, TabularPredictor

# Load data
train_data = TabularDataset('train.csv')
test_data = TabularDataset('test.csv')

# Train AutoML
predictor = TabularPredictor(
    label='target',
    eval_metric='accuracy',
    problem_type='binary'
).fit(
    train_data,
    time_limit=300,
    presets='best_quality'  # or 'medium_quality' for faster training
)

# Leaderboard
predictor.leaderboard(test_data)

# Predict
predictions = predictor.predict(test_data)

# Feature importance
predictor.feature_importance(test_data)

Presets Options

Python
# Best quality (slow, most accurate)
predictor.fit(train_data, presets='best_quality')

# Medium quality (balanced; named 'medium_quality_faster_train' in older releases)
predictor.fit(train_data, presets='medium_quality')

# Deployment-oriented (trims artifacts after training to shrink disk footprint)
predictor.fit(train_data, presets='optimize_for_deployment')

PyCaret

Easy AutoML Pipeline

Python
from pycaret.classification import *

# Initialize
clf = setup(
    data=train_df,
    target='target',
    session_id=42,
    normalize=True,
    feature_selection=True
)

# Compare all models
best = compare_models()

# Tune best model
tuned = tune_model(best)

# Ensemble
ensembled = ensemble_model(tuned, method='Bagging')

# Blend multiple models
blend = blend_models(estimator_list=[best, tuned, ensembled])

# Finalize
final = finalize_model(blend)

# Predict
predictions = predict_model(final, data=test_df)

# Save model
save_model(final, 'my_model')

Custom Metrics

Python
from pycaret.classification import setup, add_metric, compare_models
from sklearn.metrics import f1_score

# Initialize first
clf = setup(
    data=train_df,
    target='target',
    session_id=42
)

# Register weighted F1 as an additional scoring metric
add_metric('f1_weighted', 'F1 Weighted', f1_score, average='weighted')

# compare_models() now also reports the custom metric
best = compare_models()

TPOT (Genetic Algorithm)

Python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# TPOT AutoML
tpot = TPOTClassifier(
    generations=10,
    population_size=50,
    offspring_size=50,
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbosity=2
)

tpot.fit(X_train, y_train)

# Score
print(f"Score: {tpot.score(X_test, y_test):.4f}")

# Export best pipeline
tpot.export('best_pipeline.py')

Choosing AutoML Framework

Diagram
graph TD
    Start[Start] --> Q1{Need?}
    Q1 -->|Best Accuracy| AG[AutoGluon]
    Q1 -->|Fast Prototyping| PC[PyCaret]
    Q1 -->|Production| H2O[H2O AutoML]
    Q1 -->|Sklearn Compatible| AS[Auto-sklearn]
    Q1 -->|Interpretable Pipeline| TP[TPOT]

Best Practices

AutoML Tips
  1. Set time limits to control resources
  2. Split data properly before running AutoML
  3. Handle imbalanced data first
  4. Review the generated pipelines to understand them
  5. Don't blindly trust the results - validate them
  6. Consider the interpretability vs accuracy trade-off
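Tips 2 and 3 can be sketched concretely: split with stratification before any AutoML run, and inspect and re-weight class imbalance using the training set only. The synthetic dataset and weighting scheme below are illustrative; each framework accepts class or sample weights in its own way.

```python
# Stratified split *before* AutoML, plus an imbalance check on train only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic 90/10 imbalanced binary dataset (stand-in for real data)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split first, stratified, so the AutoML tool never sees the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Measure imbalance and compute balancing weights from the training set
classes, counts = np.unique(y_train, return_counts=True)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
print("class counts:", dict(zip(classes.tolist(), counts.tolist())))
print("class weights:", dict(zip(classes.tolist(), np.round(weights, 2).tolist())))
```

The minority class receives the larger weight; pass these (or the framework's own balancing option) before launching the search.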

Complete Example

End-to-End AutoML Pipeline

Python
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# Basic EDA
print(df.info())
print(df.describe())

# Check target distribution
print(df['target'].value_counts())

# Split (stratified so train/test keep the same class balance)
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['target']
)

# Convert to AutoGluon format
train_data = TabularDataset(train_df)
test_data = TabularDataset(test_df)

# Train with AutoGluon (path is where the predictor is persisted)
predictor = TabularPredictor(
    label='target',
    path='my_automl_model',
    eval_metric='roc_auc',
    problem_type='binary'
).fit(
    train_data,
    time_limit=600,  # 10 minutes
    presets='best_quality',
    verbosity=2
)

# Results
print("\n" + "=" * 50)
print("RESULTS")
print("=" * 50)

# Leaderboard
lb = predictor.leaderboard(test_data)
print("\nLeaderboard:")
print(lb)

# Best model performance
perf = predictor.evaluate(test_data)
print(f"\nTest Performance: {perf}")

# Feature importance
importance = predictor.feature_importance(test_data)
print("\nFeature Importance:")
print(importance)

# Plot feature importance
importance['importance'].head(20).plot(kind='barh', figsize=(10, 8))
plt.title('Top 20 Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')

# Reload the saved predictor and score new data
predictor = TabularPredictor.load('my_automl_model')
new_data = TabularDataset('new_data.csv')  # unseen data with the same columns
new_predictions = predictor.predict(new_data)

Practice Exercise

Hands-on Exercise

AutoML Challenge:

  1. Pick a dataset (Titanic, House Prices, etc.)
  2. Run 3 AutoML frameworks:
    • PyCaret (easiest)
    • AutoGluon (best accuracy)
    • H2O AutoML (production-ready)
  3. Compare:
    • Accuracy
    • Training time
    • Ease of use
  4. Document your findings

Target: reach competitive accuracy with minimal code
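One way to structure the comparison is a small framework-agnostic harness that times each candidate through a uniform fit-and-predict callable. The two scikit-learn baselines below are stand-ins: the idea is that you swap in thin wrappers around PyCaret, AutoGluon, and H2O with the same `(X_train, y_train, X_test) -> y_pred` signature.

```python
# Framework-agnostic benchmark harness for the exercise.
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def benchmark(name, fit_predict, X_train, y_train, X_test, y_test):
    """Run one framework callable, returning (name, accuracy, seconds)."""
    start = time.perf_counter()
    y_pred = fit_predict(X_train, y_train, X_test)
    elapsed = time.perf_counter() - start
    return name, accuracy_score(y_test, y_pred), elapsed

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Replace these baselines with wrappers around the real AutoML frameworks
frameworks = {
    "baseline-logreg": lambda Xa, ya, Xb:
        LogisticRegression(max_iter=5000).fit(Xa, ya).predict(Xb),
    "baseline-rf": lambda Xa, ya, Xb:
        RandomForestClassifier(random_state=42).fit(Xa, ya).predict(Xb),
}
results = [benchmark(n, f, X_tr, y_tr, X_te, y_te) for n, f in frameworks.items()]
for name, acc, secs in sorted(results, key=lambda r: -r[1]):
    print(f"{name:20s} acc={acc:.3f} time={secs:.2f}s")
```

Recording accuracy and wall-clock time in one table makes the trade-offs from step 3 easy to document.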


References