🤖 AutoML - Automated Machine Learning

AutoML automates the steps of an ML pipeline: feature engineering, model selection, and hyperparameter tuning. This lesson covers the most popular frameworks.

What is AutoML?

AutoML automates

Feature Engineering: Selection, transformation, creation
Model Selection: Try multiple algorithms
Hyperparameter Tuning: Find optimal params
Ensemble: Combine best models
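
The model selection and hyperparameter tuning steps listed above can be sketched as a plain search loop. This is a toy illustration using only the standard library, not how any particular framework implements its search; the dataset, the two model families (`threshold_model`, `knn_model`), and the search space are all hypothetical.

```python
import random

# Toy 1-D dataset (hypothetical): points above 5 belong to class 1.
random.seed(0)
data = [(x, int(x > 5)) for x in [random.uniform(0, 10) for _ in range(200)]]
train, valid = data[:150], data[150:]


def threshold_model(train_rows, threshold):
    """Predict class 1 when x exceeds a fixed threshold (ignores training data)."""
    return lambda x: int(x > threshold)


def knn_model(train_rows, k):
    """Predict by majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(votes * 2 > k)
    return predict


def accuracy(predict, rows):
    """Fraction of rows whose label is predicted correctly."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


# Search space: each model family with the hyperparameter values to try.
search_space = [
    ("threshold", threshold_model, "threshold", [2.0, 4.0, 5.0, 6.0, 8.0]),
    ("knn", knn_model, "k", [1, 3, 5, 7]),
]

results = []
for name, build, param, values in search_space:   # model selection
    for value in values:                           # hyperparameter tuning
        model = build(train, **{param: value})
        results.append((accuracy(model, valid), name, {param: value}))

# Rank every (model, hyperparameter) candidate by validation accuracy.
results.sort(key=lambda r: r[0], reverse=True)
best_score, best_name, best_params = results[0]
print(f"Best: {best_name} {best_params} (validation accuracy {best_score:.3f})")
```

Real frameworks replace the exhaustive loop with smarter strategies (Bayesian optimization, genetic search, time budgets), but the overall shape of "try candidates, score on held-out data, keep the best" is the same.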

Popular AutoML Frameworks

Framework	Pros	Use Case
Auto-sklearn	Built on sklearn	General ML
H2O AutoML	Fast, production-ready	Enterprise
TPOT	Genetic algorithm	Exploration
AutoGluon	Multi-layer stack ensembling	Best accuracy
PyCaret	Easy to use	Rapid prototyping

Auto-sklearn

Installation

Bash

pip install auto-sklearn

Basic Usage

Python

import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import sklearn.metrics

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# AutoML classifier
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes
    per_run_time_limit=60,         # 1 minute per model
    n_jobs=-1,
    ensemble_size=10,
    memory_limit=4096
)

# Fit
automl.fit(X_train, y_train)

# Evaluate
y_pred = automl.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Show models
print(automl.leaderboard())
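
The `ensemble_size` parameter above controls how many members go into the final ensemble. Auto-sklearn builds that ensemble with greedy ensemble selection; the following is a simplified standard-library sketch of the idea, not Auto-sklearn's actual code. The function name `ensemble_selection` and the validation probabilities are hypothetical.

```python
def ensemble_selection(val_probs, y_val, ensemble_size):
    """Greedily pick models (with replacement) whose averaged probabilities
    maximize validation accuracy. Returns the list of chosen model names."""
    def accuracy(probs):
        preds = [int(p >= 0.5) for p in probs]
        return sum(p == y for p, y in zip(preds, y_val)) / len(y_val)

    chosen = []
    running_sum = [0.0] * len(y_val)
    for _ in range(ensemble_size):
        best_name, best_score = None, -1.0
        for name, probs in val_probs.items():
            # Average of the current ensemble plus this candidate.
            candidate = [(s + p) / (len(chosen) + 1)
                         for s, p in zip(running_sum, probs)]
            score = accuracy(candidate)
            if score > best_score:
                best_name, best_score = name, score
        chosen.append(best_name)
        running_sum = [s + p for s, p in zip(running_sum, val_probs[best_name])]
    return chosen


# Hypothetical validation-set probabilities from three fitted models.
y_val = [1, 0, 1, 1, 0, 0]
val_probs = {
    "model_a": [0.9, 0.2, 0.45, 0.8, 0.65, 0.1],  # wrong on rows 2 and 4
    "model_b": [0.6, 0.4, 0.7, 0.3, 0.3, 0.4],    # wrong on row 3
    "model_c": [0.3, 0.7, 0.6, 0.6, 0.2, 0.8],    # wrong on rows 0, 1, 5
}
members = ensemble_selection(val_probs, y_val, ensemble_size=3)

# A model's weight is how often it was selected.
weights = {name: members.count(name) / len(members) for name in val_probs}
print(members, weights)
```

Because selection is with replacement, a strong model can be picked several times and so receives a larger weight. In this toy data no single model is perfect, but the averaged ensemble classifies every validation row correctly.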

H2O AutoML

Setup

Python

import h2o
from h2o.automl import H2OAutoML

# Start H2O
h2o.init()

# Load data
df = h2o.import_file("data/train.csv")

# Define features and target
y = "target"
x = df.columns
x.remove(y)

# Treat the target as categorical so H2O runs classification (needed for AUC)
df[y] = df[y].asfactor()

# Split
train, test = df.split_frame(ratios=[0.8], seed=42)

# Run AutoML
aml = H2OAutoML(
    max_runtime_secs=300,
    max_models=20,
    seed=42,
    sort_metric="AUC"
)

aml.train(x=x, y=y, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb)

# Best model
best = aml.leader
perf = best.model_performance(test)
print(perf)

# Predictions
preds = best.predict(test)
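
The run above ranks models with `sort_metric="AUC"`. As a reminder of what that metric measures, here is a minimal standard-library sketch that computes AUC as the fraction of positive/negative pairs ranked in the right order. The helper `roc_auc` and the sample scores are illustrative assumptions, not part of H2O.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs where the positive
    example gets the higher score; tied scores count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(f"AUC: {auc:.2f}")
```

Unlike accuracy, AUC depends only on how examples are ordered by score, not on a decision threshold, which is why it is a common choice for ranking models on binary problems.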

Model Explainability

Python

# Explanations for the AutoML run (uses `aml` and `test` from the Setup block)

# Explain a single prediction (includes SHAP contributions for tree-based models)
aml.explain_row(test, row_index=0)

# Variable importance compared across the AutoML models
aml.varimp_heatmap()

AutoGluon

Installation

Bash

pip install autogluon

Tabular Data

Python

from autogluon.tabular import TabularDataset, TabularPredictor

# Load data
train_data = TabularDataset('train.csv')
test_data = TabularDataset('test.csv')

# Train AutoML
predictor = TabularPredictor(
    label='target',
    eval_metric='accuracy',
    problem_type='binary'
).fit(
    train_data,
    time_limit=300,
    presets='best_quality'  # or 'medium_quality_faster_train'
)

# Leaderboard
predictor.leaderboard(test_data)

# Predict
predictions = predictor.predict(test_data)

# Feature importance
predictor.feature_importance(test_data)
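
`feature_importance` above is based on permutation importance: shuffle one feature's values and measure how much the score drops. A minimal standard-library sketch of that idea follows; the helper `permutation_importance`, the toy model, and the data are hypothetical and much simpler than AutoGluon's implementation.

```python
import random


def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Importance of feature j = accuracy drop after shuffling column j."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(shuffled))
    return importances


# Hypothetical model that only looks at feature 0; feature 1 is pure noise.
def predict(row):
    return int(row[0] > 0.5)


rng = random.Random(1)
rows = [(rng.random(), rng.random()) for _ in range(200)]
labels = [int(r[0] > 0.5) for r in rows]

importances = permutation_importance(predict, rows, labels, n_features=2)
print(importances)
```

A feature the model ignores gets an importance of zero, while shuffling a feature the model relies on causes a clear drop in accuracy.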

Presets Options

Python

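# A fitted predictor cannot be fit again, so each preset gets a fresh predictor

# Best quality (slow, most accurate)
TabularPredictor(label='target').fit(train_data, presets='best_quality')

# Medium quality (balanced; fast to train, suits quick prototyping)
TabularPredictor(label='target').fit(train_data, presets='medium_quality_faster_train')

# Optimize for deployment (drops unused models after training to save disk space)
TabularPredictor(label='target').fit(train_data, presets='optimize_for_deployment')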

PyCaret

Easy AutoML Pipeline

Python

from pycaret.classification import *

# train_df / test_df: pandas DataFrames that contain the 'target' column

# Initialize
clf = setup(
    data=train_df,
    target='target',
    session_id=42,
    normalize=True,
    feature_selection=True
)

# Compare all models
best = compare_models()

# Tune best model
tuned = tune_model(best)

# Ensemble
ensembled = ensemble_model(tuned, method='Bagging')

# Blend multiple models
blend = blend_models(estimator_list=[best, tuned, ensembled])

# Finalize
final = finalize_model(blend)

# Predict
predictions = predict_model(final, data=test_df)

# Save model
save_model(final, 'my_model')

Custom Metrics

Python

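from pycaret.classification import *
from sklearn.metrics import f1_score

# Setup first; custom metrics are registered after setup()
clf = setup(
    data=train_df,
    target='target',
    session_id=42
)

# Register a custom metric (weighted F1); extra keyword arguments
# such as average='weighted' are passed on to the score function
add_metric('f1_weighted', 'F1 Weighted', f1_score, average='weighted')

# The custom metric now appears in the score grid
best = compare_models()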
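
To make clear what the `average='weighted'` option of `f1_score` computes, here is a small standard-library sketch: per-class F1 scores averaged with each class's support (number of true examples) as the weight. The helper `weighted_f1` and the sample labels are illustrative only.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with class support as the weight."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical imbalanced labels: four examples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
score = weighted_f1(y_true, y_pred)
print(f"Weighted F1: {score:.4f}")
```

Here class 0 has F1 = 0.75 and class 1 has F1 = 0.5, so the weighted result sits closer to the majority class's score. On imbalanced data this can hide poor minority-class performance, which is worth keeping in mind when picking a metric.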

TPOT (Genetic Algorithm)

Python

from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data (same dataset as the Auto-sklearn example)
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# TPOT AutoML (parameters follow the classic TPOT 0.x API)
tpot = TPOTClassifier(
    generations=10,
    population_size=50,
    offspring_size=50,
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbosity=2
)

tpot.fit(X_train, y_train)

# Score
print(f"Score: {tpot.score(X_test, y_test):.4f}")

# Export best pipeline
tpot.export('best_pipeline.py')
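
The `generations`, `population_size`, and `offspring_size` parameters above map onto a generic evolutionary loop. The sketch below is a deliberately simplified version (mutation only, no crossover, and a made-up fitness function over two hypothetical hyperparameters); TPOT itself evolves whole pipelines with genetic programming, so treat this only as an illustration of the loop structure.

```python
import random


def evolve(fitness, generations=10, population_size=20, offspring_size=20, seed=42):
    """Evolve (learning_rate, depth) pairs toward higher fitness."""
    rng = random.Random(seed)

    def random_individual():
        return (rng.uniform(0.01, 1.0), rng.randint(1, 10))

    def mutate(individual):
        lr, depth = individual
        lr = min(1.0, max(0.01, lr + rng.gauss(0, 0.1)))
        depth = min(10, max(1, depth + rng.choice([-1, 0, 1])))
        return (lr, depth)

    population = [random_individual() for _ in range(population_size)]
    for _ in range(generations):
        # Create offspring by mutating randomly chosen parents.
        offspring = [mutate(rng.choice(population)) for _ in range(offspring_size)]
        # Keep the fittest individuals among parents and offspring.
        population = sorted(population + offspring, key=fitness, reverse=True)
        population = population[:population_size]
    return population[0]


def toy_fitness(individual):
    """Hypothetical score that peaks at learning_rate=0.3, depth=6."""
    lr, depth = individual
    return -((lr - 0.3) ** 2) - 0.01 * (depth - 6) ** 2


best = evolve(toy_fitness)
print(f"Best individual: lr={best[0]:.3f}, depth={best[1]}")
```

In a real run the fitness function would be a cross-validated score (the `scoring` and `cv` arguments above), which is why TPOT's runtime grows roughly with generations × offspring size × number of CV folds.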

Choosing AutoML Framework

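Rule of thumb, based on the comparison table above: PyCaret for rapid prototyping, AutoGluon when accuracy matters most, H2O AutoML for enterprise and production use, TPOT for exploring pipelines, and Auto-sklearn for general scikit-learn workflows.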

Best Practices

AutoML Tips

Set time limits to control resource usage
Split the data properly before running AutoML
Handle imbalanced data first
Review the generated pipelines to understand them
Don't trust results blindly; validate them
Consider the interpretability vs. accuracy trade-off
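
Two of the tips above (split properly, validate results) can be sketched with the standard library alone: a stratified hold-out split that preserves class proportions, and a majority-class baseline that any AutoML result should beat. The helper names and the imbalanced toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


# Hypothetical imbalanced target: 10% positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
y_train = [labels[i] for i in train_idx]
y_test = [labels[i] for i in test_idx]

baseline = majority_baseline_accuracy(y_train, y_test)
print(f"Test positives: {sum(y_test)}/{len(y_test)}")
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

With 10% positives, a model that always predicts the negative class already reaches 90% accuracy, so an AutoML leaderboard score near that level means the model has learned little. This is also why metrics such as AUC or F1 are often preferred on imbalanced data.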

Complete Example

End-to-End AutoML Pipeline

Python

import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# Basic EDA
df.info()  # prints its summary directly
print(df.describe())

# Check target distribution
print(df['target'].value_counts())

# Split (stratified so both parts keep the class balance)
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['target'])

# Convert to AutoGluon format
train_data = TabularDataset(train_df)
test_data = TabularDataset(test_df)

# Train with AutoGluon; `path` is the directory where the predictor is stored
predictor = TabularPredictor(
    label='target',
    eval_metric='roc_auc',
    problem_type='binary',
    path='my_automl_model'
).fit(
    train_data,
    time_limit=600,  # 10 minutes
    presets='best_quality',
    verbosity=2
)

# Results
print("\n" + "="*50)
print("RESULTS")
print("="*50)

# Leaderboard
lb = predictor.leaderboard(test_data, silent=True)
print("\nLeaderboard:")
print(lb)

# Best model performance
perf = predictor.evaluate(test_data)
print(f"\nTest Performance: {perf}")

# Feature importance
importance = predictor.feature_importance(test_data)
print("\nFeature Importance:")
print(importance)

# Plot feature importance (the 'importance' column only)
importance['importance'].head(20).plot(kind='barh', figsize=(10, 8))
plt.title('Top 20 Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')

# Save predictor (written to the `path` given above)
predictor.save()

# Load and predict on new data
# (the test features stand in for unseen rows here)
predictor = TabularPredictor.load('my_automl_model')
new_data = test_data.drop(columns=['target'])
new_predictions = predictor.predict(new_data)

Practice Exercise

Hands-on Exercise

AutoML Challenge:

Choose a dataset (Titanic, House Prices, etc.)
Compare 3 AutoML frameworks:
- PyCaret (easy)
- AutoGluon (best accuracy)
- H2O AutoML (production)
Compare them on:
- Accuracy
- Training time
- Ease of use
Document your findings

Target: reach competitive accuracy with minimal code
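
One possible way to organize the comparison is a small harness that times each framework and records its accuracy. This is a sketch under assumptions: `benchmark` and the `fit_predict(X_train, y_train, X_test)` wrapper convention are hypothetical, and the majority-class function below is only a placeholder where a real PyCaret, AutoGluon, or H2O wrapper would go.

```python
import time


def benchmark(name, fit_predict, X_train, y_train, X_test, y_test):
    """Time one framework wrapper and score its predictions on the test set."""
    start = time.perf_counter()
    y_pred = fit_predict(X_train, y_train, X_test)
    elapsed = time.perf_counter() - start
    accuracy = sum(p == y for p, y in zip(y_pred, y_test)) / len(y_test)
    return {"framework": name, "accuracy": accuracy, "train_seconds": elapsed}


def majority_class(X_train, y_train, X_test):
    """Placeholder 'framework': always predicts the most frequent class."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)


# Tiny hypothetical dataset, just to exercise the harness.
X_train, y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
X_test, y_test = [[4], [5]], [0, 1]

rows = [
    benchmark("majority baseline", majority_class, X_train, y_train, X_test, y_test),
    # Add one entry per framework wrapper here.
]

for row in sorted(rows, key=lambda r: r["accuracy"], reverse=True):
    print(f"{row['framework']:<20} accuracy={row['accuracy']:.3f} "
          f"time={row['train_seconds']:.3f}s")
```

Giving every framework the same train/test split and the same time budget keeps the comparison fair; ease of use still has to be judged by hand.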
