Decision Tree: Regression & Classification
🎯 Learning Objectives
After this lesson, learners will be able to:
- Understand a Decision Tree as a flowchart of if-else questions
- Master Gini Impurity, Entropy, and MSE for choosing the best split
- Compute Information Gain by hand
- Understand overfitting, pruning, and hyperparameters
- Implement Classification & Regression Trees with Scikit-learn
- Compare Decision Trees with Linear/Logistic Regression
1. Core Idea - What is a Decision Tree?
Decision Tree Structure
1.1 Definition
A Decision Tree is a supervised learning algorithm that learns simple decision rules from data features in order to make predictions.
Mechanism: it works like a flowchart of if-else questions:
```
IF Age < 30 THEN
    IF Income > 50k THEN
        Prediction: Buy
    ELSE
        Prediction: Don't Buy
ELSE
    Prediction: Buy
```
1.2 Key Terms
| Term | Definition | Example |
|---|---|---|
| Root Node | Starting point; the entire dataset | "All 1000 customers" |
| Internal Node | Decision point on one feature | "Age > 18?" |
| Split | Dividing a node into sub-nodes | Split by "Income < 50k" |
| Leaf Node | Terminal node with the final output | "Class = Spam" or "Price = $200k" |
2. Types of Trees
2.1 Classification Tree
Output: Categorical (Class label)
Example: Spam or Not Spam? (binary), Flower Species (multiclass)
Prediction at a leaf: the majority class in that leaf
```python
from sklearn.tree import DecisionTreeClassifier

# Create classifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)
```
2.2 Regression Tree
Output: Continuous (Number)
Example: House Price Prediction, Sales Forecast
Prediction at a leaf: the mean (average) of the target values in that leaf
```python
from sklearn.tree import DecisionTreeRegressor

# Create regressor
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(X_train, y_train)

# Predict
y_pred = reg.predict(X_test)
```
(Figure: Leaf Node Prediction Strategy)
3. Splitting Criteria - Choosing the Best Split
Goal: the best split is the one that minimizes impurity (disorder/chaos).
Tree Building Process - Recursive Partitioning
3.1 Classification Metrics
Gini Impurity (Default - Faster)
Measures the probability of misclassifying a randomly chosen element. Range: [0, 0.5] for binary classification.
Example: a node with 60% Class A and 40% Class B
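Plugging these proportions into the standard Gini formula:

$$\text{Gini} = 1 - \sum_i p_i^2 = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$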
Entropy (Slower - log calculation)
Measures information/disorder. Range: [0, 1] for binary classification.
Example: a node with 60% Class A and 40% Class B
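Plugging the same proportions into the entropy formula:

$$H = -\sum_i p_i \log_2 p_i = -\left(0.6\log_2 0.6 + 0.4\log_2 0.4\right) \approx 0.971$$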
Information Gain
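Information Gain is the drop in impurity achieved by a split: the parent's impurity minus the weighted impurity of its children. The split with the highest gain is chosen.

$$IG = I_{\text{parent}} - \sum_k \frac{N_k}{N}\, I_{\text{child}_k}$$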
```python
# Compare Gini vs Entropy
from sklearn.tree import DecisionTreeClassifier

# Using Gini (default, faster)
clf_gini = DecisionTreeClassifier(criterion='gini')

# Using Entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy')

# Using Log Loss (similar to Entropy)
clf_log = DecisionTreeClassifier(criterion='log_loss')
```
3.2 Regression Metrics
MSE (Mean Squared Error) - Default
Variance reduction:
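With the squared-error criterion, each leaf predicts the mean $\bar{y}$ of its target values, and a candidate split is scored by the weighted within-leaf MSE:

$$\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2$$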
MAE (Mean Absolute Error)
Robust to outliers:
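With the absolute-error criterion, each leaf predicts the median $\tilde{y}$ of its target values:

$$\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \tilde{y}\right|$$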
```python
from sklearn.tree import DecisionTreeRegressor

# Using MSE (default)
reg_mse = DecisionTreeRegressor(criterion='squared_error')

# Using MAE (robust to outliers)
reg_mae = DecisionTreeRegressor(criterion='absolute_error')
```
4. Detailed Manual Calculation Example
4.1 The Problem
Predict whether a customer will BUY or NOT BUY:
| ID | Age | Income (millions) | Student | Buy |
|---|---|---|---|---|
| 1 | 25 | 30 | Yes | No |
| 2 | 35 | 80 | No | Yes |
| 3 | 28 | 50 | Yes | No |
| 4 | 45 | 90 | No | Yes |
| 5 | 32 | 60 | No | Yes |
| 6 | 22 | 25 | Yes | No |
| 7 | 40 | 70 | No | Yes |
| 8 | 30 | 55 | Yes | Yes |
Dataset: 5 Buy, 3 Don't Buy
4.2 Step 1: Compute the root Gini
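With 5 Buy and 3 Don't Buy among the 8 samples:

$$\text{Gini}_{\text{root}} = 1 - \left(\tfrac{5}{8}\right)^2 - \left(\tfrac{3}{8}\right)^2 = 1 - \tfrac{25}{64} - \tfrac{9}{64} = \tfrac{30}{64} = 0.46875$$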
4.3 Step 2: Try splitting on "Student"
Branch "Yes" (4 samples): 1 Buy, 3 Don't Buy → Gini = 1 − (1/4)² − (3/4)² = 0.375
Branch "No" (4 samples): 4 Buy, 0 Don't Buy → Gini = 0
Weighted Gini: (4/8) × 0.375 + (4/8) × 0 = 0.1875
Information Gain: 0.46875 − 0.1875 = 0.28125
4.4 Step 3: Try splitting on Age ≤ 30
Branch "Yes" (4 samples): IDs 1, 3, 6, 8 → 1 Buy, 3 Don't Buy → Gini = 0.375
Branch "No" (4 samples): IDs 2, 4, 5, 7 → 4 Buy, 0 Don't Buy → Gini = 0
Weighted Gini: (4/8) × 0.375 + (4/8) × 0 = 0.1875
Information Gain: 0.46875 − 0.1875 = 0.28125
4.5 Step 4: Compare and choose the best split
| Feature | Weighted Gini | Information Gain |
|---|---|---|
| Student | 0.1875 | 0.28125 |
| Age ≤ 30 | 0.1875 | 0.28125 |
Both splits are equally good - choose "Student" (the simpler categorical feature).
4.6 Resulting Tree
Final decision tree:
| Condition | Result |
|---|---|
| Student = Yes | Gini = 0.375, 1 Buy, 3 Don't Buy - Prediction: DON'T BUY |
| Student = No | Gini = 0, 4 Buy, 0 Don't Buy - Prediction: BUY |
5. Entropy and Information Gain (Alternative)
5.1 Entropy Formula
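For a node with class proportions $p_i$:

$$H = -\sum_i p_i \log_2 p_i$$

For the root node of the example above, $H = -\left(\tfrac{5}{8}\log_2\tfrac{5}{8} + \tfrac{3}{8}\log_2\tfrac{3}{8}\right) \approx 0.954$.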
5.2 Gini vs Entropy Comparison
| Metric | Range | Notes |
|---|---|---|
| Gini | 0 - 0.5 | Faster; sklearn default |
| Entropy | 0 - 1 | Rooted in information theory |
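To see the two metrics side by side, here is a minimal sketch (the helper functions gini and entropy are hypothetical, not part of scikit-learn) applied to the 60/40 node from Section 3.1:

```python
import numpy as np

def gini(p):
    """Gini impurity for a vector of class proportions."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy (in bits) for a vector of class proportions."""
    p = np.asarray(p)
    p = p[p > 0]                        # avoid log2(0)
    return -np.sum(p * np.log2(p))

node = [0.6, 0.4]                       # 60% Class A, 40% Class B
print(f"Gini:    {gini(node):.3f}")     # 0.480
print(f"Entropy: {entropy(node):.3f}")  # 0.971
```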
6. Hands-on with Scikit-learn
6.1 Train a Decision Tree
```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Data
data = {
    'Tuoi': [25, 35, 28, 45, 32, 22, 40, 30],
    'Thu_nhap': [30, 80, 50, 90, 60, 25, 70, 55],
    'Sinh_vien': [1, 0, 1, 0, 0, 1, 0, 1],
    'Mua': [0, 1, 0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)

X = df[['Tuoi', 'Thu_nhap', 'Sinh_vien']]
y = df['Mua']

# Train model
model = DecisionTreeClassifier(
    criterion='gini',   # or 'entropy'
    max_depth=3,
    random_state=42
)
model.fit(X, y)

# Visualize tree
plt.figure(figsize=(15, 10))
plot_tree(model,
          feature_names=['Tuoi', 'Thu_nhap', 'Sinh_vien'],
          class_names=['Khong mua', 'Mua'],
          filled=True,
          rounded=True)
plt.title('Decision Tree')
plt.show()

# Feature Importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)
```
7. Hyperparameters - Controlling Complexity
7.1 Key Hyperparameters
Control complexity to prevent overfitting:
```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,            # Max levels (maximum depth)
    min_samples_split=10,   # Min samples required to split a node
    min_samples_leaf=5,     # Min samples required in a leaf node
    max_features='sqrt',    # Features to consider per split
    criterion='gini',       # 'gini', 'entropy', or 'log_loss'
    random_state=42
)

model.fit(X_train, y_train)
```
| Parameter | Description | Default | Effect |
|---|---|---|---|
| max_depth | Max number of levels | None (unlimited) | ↑ = More complex, risk overfitting |
| min_samples_split | Min samples to split | 2 | ↑ = Less splits, simpler tree |
| min_samples_leaf | Min samples in leaf | 1 | ↑ = Larger leaves, smoother |
| max_features | Features to consider | None (all) | 'sqrt', 'log2', or int |
| ccp_alpha | Pruning parameter | 0.0 | ↑ = More pruning |
8. Over/Underfitting & Pruning
8.1 Bias-Variance Tradeoff
Underfitting (Too Simple)
- Tree is too shallow
- Fails to capture patterns
- High Bias: both train and test scores are low
Overfitting (Too Complex)
- Tree is too deep
- Memorizes noise in the training data
- High Variance: near-perfect train score, poor test score (illustrated in the sketch below)
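To make the tradeoff concrete, here is a minimal sketch (using a synthetic make_classification dataset purely for illustration) comparing a shallow tree with an unrestricted one; the deep tree typically scores near 1.0 on train but noticeably lower on test:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, only for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

for depth in [2, None]:  # shallow (may underfit) vs unlimited (may overfit)
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={clf.score(X_train, y_train):.2f}, "
          f"test={clf.score(X_test, y_test):.2f}")
```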
8.2 Pruning - Trimming the Tree
Pruning trims the tree to improve generalization.
Pre-Pruning (Early Stopping)
Stop growing during training:
```python
# Control via hyperparameters
clf = DecisionTreeClassifier(
    max_depth=5,            # Stop at depth 5
    min_samples_split=10,   # Need 10+ samples to split
    min_samples_leaf=5      # Each leaf must have 5+ samples
)
```
Pros: Fast ⚡
Cons: Might stop too early ❌
Post-Pruning (Cost Complexity)
Grow full tree, then cut branches that add little predictive power:
```python
# Cost Complexity Pruning
clf = DecisionTreeClassifier(ccp_alpha=0.01)
clf.fit(X_train, y_train)

# Find optimal alpha
from sklearn.model_selection import cross_val_score
import numpy as np

alphas = np.arange(0, 0.1, 0.01)
scores = []

for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha)
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    scores.append(score)

optimal_alpha = alphas[np.argmax(scores)]
print(f"Optimal ccp_alpha: {optimal_alpha}")
```
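Instead of scanning an arbitrary alpha grid, scikit-learn can also enumerate the effective alphas of the fully grown tree via cost_complexity_pruning_path. A minimal sketch, assuming the same X_train and y_train as above:

```python
from sklearn.tree import DecisionTreeClassifier

# Enumerate the candidate alphas produced by growing the full tree
clf = DecisionTreeClassifier(random_state=42)
path = clf.cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)   # cross-validate over these values instead of a fixed grid
```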
9. Advantages & Limitations (in Detail)
9.1 Pros ✅
| Advantage | Explanation |
|---|---|
| Interpretability | Easy to visualize and explain ("White Box") |
| No Scaling Required | Works without normalization/standardization |
| Non-Linear | Handles complex non-linear relationships well |
| Mixed Data Types | Works with numerical & categorical features |
| Feature Importance | Automatically ranks the most important features |
```python
# Get feature importance
importances = clf.feature_importances_
feature_names = ['Feature1', 'Feature2', 'Feature3']

for name, imp in zip(feature_names, importances):
    print(f"{name}: {imp:.4f}")
```
9.2 Cons ❌
| Limitation | Explanation | Solution |
|---|---|---|
| High Variance | Small data change → Completely different tree | Use Random Forest (ensemble) |
| Overfitting | Easy to build overly complex trees | Pruning, set max_depth |
| Instability | Sensitive to noise | Use ensemble methods |
| Greedy Algorithm | Locally optimal splits, not globally optimal | Accept or use ensemble |
9.3 When to Use?
✅ Use a Decision Tree when:
- Explainability is critical (Banking, Medical, Legal)
- Baseline model để so sánh
- Small to medium structured datasets
- Features are mixed (numerical & categorical)
❌ Don't use it when:
- Need highest accuracy → Use Random Forest, XGBoost
- Data is high-dimensional và sparse → Use Linear models
- Want stable predictions → Use ensemble methods
10. Quick Comparison with Other Algorithms
| Aspect | Decision Tree | Linear Reg | Logistic Reg | Random Forest |
|---|---|---|---|---|
| Non-linear data | ✅ Excellent | ❌ Poor | ❌ Poor | ✅ Excellent |
| Interpretability | ✅ High | ✅ High | ✅ High | ❌ Low |
| Scaling needed? | ✅ No | ⚠️ Yes | ⚠️ Yes | ✅ No |
| Outliers | ✅ Handles well | ❌ Sensitive | ❌ Sensitive | ✅ Handles well |
| Variance | ❌ High | ✅ Low | ✅ Low | ✅ Low (ensemble) |
| Overfitting risk | ❌ High | ✅ Low | ✅ Low | ✅ Low |
11. Real-World Examples
11.1 Classification Use Cases
Loan Approval: Risk level (Low/Medium/High)
```python
# Features: Income, Credit Score, Employment Status, Debt Ratio
# Target: Approved (Yes/No)
```
Medical Diagnosis: Disease present? (Yes/No)
```python
# Features: Age, Blood Pressure, Cholesterol, BMI
# Target: Heart Disease (0/1)
```
Customer Churn: Will the customer leave?
```python
# Features: Usage, Support Tickets, Contract Type, Tenure
# Target: Churn (Yes/No)
```
11.2 Regression Use Cases
Pricing: Estimating used car prices
```python
# Features: Year, Mileage, Brand, Condition
# Target: Price ($)
```
Demand Forecasting: Predicting daily sales volume
```python
# Features: Day of Week, Season, Promotions, Weather
# Target: Sales Units
```
12. Next Steps - Upgrading to Ensembles
⚠️ Why upgrade? Single trees are weak learners. To improve performance:
Random Forest (Bagging)
- Trains many trees on bootstrap samples (bagging) and aggregates their predictions to reduce variance
- Runs trees in parallel
- Reduces overfitting dramatically
Gradient Boosting (XGBoost)
- Trains trees sequentially to fix errors of previous ones
- High performance but risk of overfitting
- Careful tuning required (see the sketch below)
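As a quick preview, here is a minimal sketch (assuming the same X_train/X_test split as in earlier examples, and using scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost) showing that both ensembles are drop-in replacements for a single tree:

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Bagging: many decorrelated trees, predictions combined by voting
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print("Random Forest test accuracy:   ", rf.score(X_test, y_test))

# Boosting: shallow trees trained sequentially, each correcting the previous ones
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                random_state=42)
gb.fit(X_train, y_train)
print("Gradient Boosting test accuracy:", gb.score(X_test, y_test))
```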
We will study these in detail in the Ensemble Methods lesson!
Practice Exercises
- Exercise 1: Compute the Gini impurity for a node with [3 Red, 5 Blue, 2 Green]
- Exercise 2: Train a Decision Tree on the Titanic dataset and visualize the tree
- Exercise 3: Compare performance with max_depth = [3, 5, 10, None]
- Exercise 4: Implement Cost Complexity Pruning and find the optimal alpha
