🤖 AutoML - Automated Machine Learning
AutoML automates the steps of an ML pipeline: feature engineering, model selection, and hyperparameter tuning. This lesson covers the most popular frameworks.
What is AutoML?
Diagram
graph LR
D[Data] --> FE[Feature Engineering]
FE --> MS[Model Selection]
MS --> HPT[Hyperparameter Tuning]
HPT --> E[Ensemble]
E --> M[Final Model]
style FE fill:#f9f,stroke:#333
style MS fill:#f9f,stroke:#333
style HPT fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
AutoML automates
- Feature Engineering: Selection, transformation, creation
- Model Selection: Try multiple algorithms
- Hyperparameter Tuning: Find optimal params
- Ensemble: Combine best models
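To make the list above concrete, here is a minimal pure-Python sketch of the search loop that AutoML tools automate: try several model families, try several hyperparameters for each, and rank everything on held-out data. The toy 1-D dataset, the two model families (`threshold_model`, `knn_model`), and the `auto_select` helper are all hypothetical names invented for illustration; real frameworks add feature engineering, smarter search, and ensembling on top of this idea.

```python
import random

def make_data(n, seed):
    """Toy 1-D binary data: class 1 values are centred higher than class 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        x = rng.gauss(2.0 if label else 0.0, 1.0)
        data.append((x, label))
    return data

def threshold_model(train, threshold):
    """Model family 1: predict 1 when x exceeds a fixed threshold (train is unused)."""
    return lambda x: int(x > threshold)

def knn_model(train, k):
    """Model family 2: majority vote among the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        votes = sum(label for _, label in nearest)
        return int(votes * 2 > k)
    return predict

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def auto_select(train, valid, search_space):
    """Score every (family, hyperparameter) pair on validation data, best first."""
    results = []
    for name, builder, params in search_space:
        for p in params:
            score = accuracy(builder(train, p), valid)
            results.append((score, name, p))
    results.sort(key=lambda r: r[0], reverse=True)
    return results

train = make_data(200, seed=0)
valid = make_data(100, seed=1)
search_space = [
    ("threshold", threshold_model, [0.0, 0.5, 1.0, 1.5, 2.0]),  # model selection
    ("knn", knn_model, [1, 5, 15]),                             # + hyperparameter tuning
]
leaderboard = auto_select(train, valid, search_space)
best_score, best_name, best_param = leaderboard[0]
print(f"best: {best_name}({best_param}) accuracy={best_score:.3f}")
```

The sorted `leaderboard` plays the same role as the leaderboards that the frameworks in this lesson print after training.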
Popular AutoML Frameworks
| Framework | Pros | Use Case |
|---|---|---|
| Auto-sklearn | Built on sklearn | General ML |
| H2O AutoML | Fast, production-ready | Enterprise |
| TPOT | Genetic algorithm | Exploration |
| AutoGluon | State-of-the-art | Best accuracy |
| PyCaret | Easy to use | Rapid prototyping |
Auto-sklearn
Installation
Bash
pip install auto-sklearn
Basic Usage
Python
import autosklearn.classification
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import sklearn.metrics

# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# AutoML classifier
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # 5 minutes
    per_run_time_limit=60,  # 1 minute per model
    n_jobs=-1,
    ensemble_size=10,
    memory_limit=4096
)

# Fit
automl.fit(X_train, y_train)

# Evaluate
y_pred = automl.predict(X_test)
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Show models
print(automl.leaderboard())

H2O AutoML
Setup
Python
import h2o
from h2o.automl import H2OAutoML

# Start H2O
h2o.init()

# Load data
df = h2o.import_file("data/train.csv")

# Define features and target
y = "target"
x = df.columns
x.remove(y)

# Treat the target as categorical so H2O runs classification
# (a numeric 0/1 column would otherwise be handled as regression, and AUC would not apply)
df[y] = df[y].asfactor()

# Split
train, test = df.split_frame(ratios=[0.8], seed=42)

# Run AutoML
aml = H2OAutoML(
    max_runtime_secs=300,
    max_models=20,
    seed=42,
    sort_metric="AUC"
)

aml.train(x=x, y=y, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb)

# Best model
best = aml.leader
perf = best.model_performance(test)
print(perf)

# Predictions
preds = best.predict(test)

Model Explainability
Python
# Explain a single prediction of the best model
# (includes SHAP contributions when the model type supports them)
aml.leader.explain_row(train, row_index=0)

# Variable importance heatmap across the models trained by AutoML
aml.varimp_heatmap()

AutoGluon
Installation
Bash
pip install autogluon
Tabular Data
Python
from autogluon.tabular import TabularDataset, TabularPredictor

# Load data
train_data = TabularDataset('train.csv')
test_data = TabularDataset('test.csv')

# Train AutoML
predictor = TabularPredictor(
    label='target',
    eval_metric='accuracy',
    problem_type='binary'
).fit(
    train_data,
    time_limit=300,
    presets='best_quality'  # or 'medium_quality' ('medium_quality_faster_train' in older releases)
)

# Leaderboard
predictor.leaderboard(test_data)

# Predict
predictions = predictor.predict(test_data)

# Feature importance
predictor.feature_importance(test_data)

Presets Options
Python
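# A fitted TabularPredictor cannot be fit again, so each preset gets a fresh predictor.

# Best quality (slow, accurate)
TabularPredictor(label='target').fit(train_data, presets='best_quality')

# Medium quality (fast training, lower accuracy)
TabularPredictor(label='target').fit(train_data, presets='medium_quality')

# Optimize for deployment: not a speed preset; it deletes models that are not
# needed for prediction to save disk space, and is combined with a quality preset
TabularPredictor(label='target').fit(
    train_data, presets=['good_quality', 'optimize_for_deployment']
)

PyCaret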
Easy AutoML Pipeline
Python
from pycaret.classification import (
    setup, compare_models, tune_model, ensemble_model,
    blend_models, finalize_model, predict_model, save_model
)

# train_df and test_df are assumed to be pandas DataFrames with a 'target' column

# Initialize
clf = setup(
    data=train_df,
    target='target',
    session_id=42,
    normalize=True,
    feature_selection=True
)

# Compare all models
best = compare_models()

# Tune best model
tuned = tune_model(best)

# Ensemble
ensembled = ensemble_model(tuned, method='Bagging')

# Blend multiple models (the argument is estimator_list, not top3)
blend = blend_models(estimator_list=[best, tuned, ensembled])

# Finalize (refit on the full dataset)
final = finalize_model(blend)

# Predict
predictions = predict_model(final, data=test_df)

# Save model
save_model(final, 'my_model')

Custom Metrics
Python
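from pycaret.classification import setup, add_metric, compare_models
from sklearn.metrics import f1_score

# Setup first; custom metrics are registered afterwards with add_metric
# (setup has no custom_metric argument)
clf = setup(
    data=train_df,
    target='target',
    session_id=42
)

# Register weighted F1 as a custom metric; extra kwargs go to the score function
add_metric('f1_weighted', 'F1 Weighted', f1_score, average='weighted')

# Rank models by the custom metric
best = compare_models(sort='F1 Weighted')

TPOT (Genetic Algorithm)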
Python
from tpot import TPOTClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Example dataset (any feature matrix X and label vector y will do)
X, y = load_breast_cancer(return_X_y=True)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# TPOT AutoML
tpot = TPOTClassifier(
    generations=10,
    population_size=50,
    offspring_size=50,
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbosity=2
)

tpot.fit(X_train, y_train)

# Score
print(f"Score: {tpot.score(X_test, y_test):.4f}")

# Export best pipeline
tpot.export('best_pipeline.py')

Choosing AutoML Framework
Diagram
graph TD
Start[Start] --> Q1{Need?}
Q1 -->|Best Accuracy| AG[AutoGluon]
Q1 -->|Fast Prototyping| PC[PyCaret]
Q1 -->|Production| H2O[H2O AutoML]
Q1 -->|Sklearn Compatible| AS[Auto-sklearn]
Q1 -->|Interpretable Pipeline| TP[TPOT]
Best Practices
AutoML Tips
- Set time limits to keep resource usage under control
- Split the data properly before running AutoML
- Handle imbalanced data first
- Review the generated pipelines so you understand them
- Don't blindly trust - validate results
- Consider interpretability vs accuracy trade-off
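Two of the tips above (split properly, validate results) can be sketched without any AutoML library. The example below is an illustration under stated assumptions: `stratified_split` and `majority_baseline_accuracy` are hypothetical helpers, and the 90/10 dataset is made up. It shows why accuracy alone can mislead on imbalanced data: a model that always predicts the majority class already scores 0.90 here, so an AutoML result has to beat that number to be meaningful.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, label_of, test_size=0.2, seed=42):
    """Split rows so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_size))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

def majority_baseline_accuracy(train, test, label_of):
    """Accuracy of always predicting the most frequent training class."""
    majority = Counter(label_of(r) for r in train).most_common(1)[0][0]
    return sum(label_of(r) == majority for r in test) / len(test)

# Hypothetical imbalanced dataset: 10 positives, 90 negatives.
rows = [{"id": i, "target": int(i < 10)} for i in range(100)]

def get_label(row):
    return row["target"]

train, test = stratified_split(rows, get_label)
baseline = majority_baseline_accuracy(train, test, get_label)
print(f"train={len(train)} test={len(test)} baseline accuracy={baseline:.2f}")
# prints: train=80 test=20 baseline accuracy=0.90
```

With pandas and scikit-learn, `train_test_split(..., stratify=...)` does the same job as `stratified_split`; the pure-Python version is only there to show the mechanics.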
Complete Example
End-to-End AutoML Pipeline
Python
import pandas as pd
from autogluon.tabular import TabularDataset, TabularPredictor
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data.csv')

# Basic EDA
df.info()  # info() prints directly and returns None
print(df.describe())

# Check target distribution
print(df['target'].value_counts())

# Split
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['target']
)

# Convert to AutoGluon format
train_data = TabularDataset(train_df)
test_data = TabularDataset(test_df)

# Train with AutoGluon; models are written to `path` during fit
predictor = TabularPredictor(
    label='target',
    eval_metric='roc_auc',
    problem_type='binary',
    path='my_automl_model'
).fit(
    train_data,
    time_limit=600,  # 10 minutes
    presets='best_quality',
    verbosity=2
)

# Results
print("\n" + "="*50)
print("RESULTS")
print("="*50)

# Leaderboard
lb = predictor.leaderboard(test_data)
print("\nLeaderboard:")
print(lb)

# Best model performance
perf = predictor.evaluate(test_data)
print(f"\nTest Performance: {perf}")

# Feature importance
importance = predictor.feature_importance(test_data)
print("\nFeature Importance:")
print(importance)

# Plot feature importance (only the 'importance' column, not the statistics columns)
importance['importance'].head(20).plot(kind='barh', figsize=(10, 8))
plt.title('Top 20 Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')

# Load the saved predictor and predict on new data
# (new_data stands in for unseen rows; here the test features are reused)
predictor = TabularPredictor.load('my_automl_model')
new_data = test_data.drop(columns=['target'])
new_predictions = predictor.predict(new_data)

Practice Exercise
Hands-on Exercise
AutoML Challenge:
- Choose a dataset (Titanic, House Prices, etc.)
- Compare three AutoML frameworks:
  - PyCaret (easy)
  - AutoGluon (best accuracy)
  - H2O AutoML (production)
- Compare them on:
  - Accuracy
  - Training time
  - Ease of use
- Document your findings
Target: reach competitive accuracy with minimal code.
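One possible way to collect the accuracy and training-time numbers for the exercise is a small harness like the one below. It is a sketch, not a prescribed solution: `benchmark` and `MajorityClassModel` are hypothetical names, and the harness assumes every candidate exposes `fit(X, y)` and `predict(X)`. PyCaret, AutoGluon, and H2O each have their own data formats and entry points, so in practice each would need a thin wrapper class that adapts it to this interface.

```python
import time

def benchmark(frameworks, X_train, y_train, X_test, y_test):
    """Fit each candidate, then record accuracy and wall-clock training time."""
    report = []
    for name, model in frameworks.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        train_seconds = time.perf_counter() - start
        preds = model.predict(X_test)
        acc = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
        report.append(
            {"framework": name, "accuracy": acc, "train_seconds": train_seconds}
        )
    # Highest accuracy first
    return sorted(report, key=lambda r: r["accuracy"], reverse=True)

class MajorityClassModel:
    """Stand-in model so the harness runs without any AutoML library installed."""
    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]

# Tiny made-up dataset, only to exercise the harness
X_train, y_train = [[0], [1], [2], [3]], [0, 0, 0, 1]
X_test, y_test = [[4], [5]], [0, 1]

report = benchmark({"baseline": MajorityClassModel()}, X_train, y_train, X_test, y_test)
for row in report:
    print(f"{row['framework']}: accuracy={row['accuracy']:.2f}, "
          f"time={row['train_seconds']:.4f}s")
```

Ease of use is not measurable this way; lines of code per framework wrapper is one rough proxy to note alongside the table.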
