Theory
5-6 hours
Lesson 8/15

Decision Tree: Regression & Classification

From If-Else Rules to Machine Learning Models - Gini, MSE, Pruning


🎯 Learning Objectives

After this lesson, learners will be able to:

  • Understand a Decision Tree as a flowchart of if-else questions
  • Master Gini Impurity, Entropy, and MSE for choosing the best split
  • Compute Information Gain by hand
  • Understand overfitting, pruning, and hyperparameters
  • Implement classification & regression trees with Scikit-learn
  • Compare Decision Trees with Linear/Logistic Regression

1. Core Idea - What is a Decision Tree?

Decision Tree Structure

[Diagram: Root Node (All Data) splits on X₁ < 5 (internal node); one branch ends in Leaf A (Class Red), the other splits on X₂ > 10 into Leaf B (Class Blue) and Leaf C (Class Red).]

1.1 Definition

A Decision Tree is a supervised learning algorithm that learns simple decision rules from data features in order to make predictions.

Mechanism: it works like a flowchart of if-else questions:

Text
IF Age < 30 THEN
    IF Income > 50k THEN
        Prediction: Buy
    ELSE
        Prediction: Don't Buy
ELSE
    Prediction: Buy
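
The same rules can be written as a plain Python function; a minimal sketch (the thresholds are simply the ones from the pseudocode above, not learned from data):

Python
def predict_buy(age, income_k):
    """Hand-written rules mirroring the flowchart; a fitted tree encodes the same kind of logic."""
    if age < 30:
        if income_k > 50:
            return "Buy"
        return "Don't Buy"
    return "Buy"

print(predict_buy(25, 60))  # Buy
print(predict_buy(25, 40))  # Don't Buy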

1.2 Key Terms

Term | Definition | Example
Root Node | Starting point; the entire dataset | "All 1000 customers"
Internal Node | Decision point on one feature | "Age > 18?"
Split | Dividing a node into sub-nodes | Split by "Income < 50k"
Leaf Node | Terminal node with the final output | "Class = Spam" or "Price = $200k"

2. Types of Trees

2.1 Classification Tree

Output: Categorical (Class label)

Example: Spam or Not Spam? (binary), Flower Species (multiclass)

Prediction at a leaf: the majority class in that leaf

Python
from sklearn.tree import DecisionTreeClassifier

# Create classifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

2.2 Regression Tree

Output: Continuous (Number)

Example: House Price Prediction, Sales Forecasting

Prediction at a leaf: the mean (average) of the target values in that leaf

Python
from sklearn.tree import DecisionTreeRegressor

# Create regressor
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(X_train, y_train)

# Predict
y_pred = reg.predict(X_test)

Leaf node prediction strategy:

  • Classification: majority vote
  • Regression: mean
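
A quick way to see the regression case (a small sketch with made-up numbers): a depth-1 regression tree has exactly two leaves, and each leaf predicts the mean of the training targets that land in it.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made-up numbers): the target jumps between x <= 3 and x >= 6
X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

# A depth-1 tree ("stump") has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Each leaf predicts the mean of the targets that fall into it
print(stump.predict([[2], [7]]))    # ~[11. 31.]
print(y[:3].mean(), y[3:].mean())   # 11.0 31.0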

3. Splitting Criteria - Choosing the "Best Split"

Goal: the "best split" is the one that minimizes impurity (disorder/chaos).

Tree Building Process - Recursive Partitioning

1. Select: iterate over every feature
2. Find: evaluate the metric at every candidate split point
3. Split: choose the split that reduces impurity the most
4. Repeat: apply the procedure recursively to the child nodes
5. Stop: when a stopping condition is met

3.1 Classification Metrics

Gini Impurity (default - faster)

Measures the probability of misclassifying a randomly chosen element. Range: [0, 0.5] for two classes.

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

Example: a node with 60% Class A and 40% Class B: $$Gini = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$

Entropy (Slower - log calculation)

Measures information/disorder. Range: [0, 1] for two classes.

$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

Example: a node with 60% Class A and 40% Class B: $$Entropy = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) = 0.971$$
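
To make the formulas concrete, here is a small helper sketch that recomputes both example numbers from the class proportions:

Python
import numpy as np

def gini(p):
    """Gini impurity from a list of class proportions."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (base 2) from a list of class proportions."""
    p = np.asarray(p)
    p = p[p > 0]                 # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([0.6, 0.4]))     # 0.48
print(entropy([0.6, 0.4]))  # ~0.971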

Information Gain

$$IG = Entropy(\text{Parent}) - \text{Weighted Avg } Entropy(\text{Children})$$

Python
# Compare Gini vs Entropy
from sklearn.tree import DecisionTreeClassifier

# Using Gini (default, faster)
clf_gini = DecisionTreeClassifier(criterion='gini')

# Using Entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy')

# Using Log Loss (equivalent to Entropy)
clf_log = DecisionTreeClassifier(criterion='log_loss')

3.2 Regression Metrics

MSE (Mean Squared Error) - Default

Variance reduction:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

MAE (Mean Absolute Error)

Robust to outliers:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \bar{y}|$$

Python
from sklearn.tree import DecisionTreeRegressor

# Using MSE (default)
reg_mse = DecisionTreeRegressor(criterion='squared_error')

# Using MAE (robust to outliers)
reg_mae = DecisionTreeRegressor(criterion='absolute_error')
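
For regression, the "best split" is the one that most reduces the weighted MSE of the children relative to the parent (variance reduction). A minimal sketch with made-up target values:

Python
import numpy as np

def node_mse(y):
    """MSE of a node when it predicts its own mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

# Made-up target values reaching a node
y_node  = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
y_left  = y_node[:3]   # candidate split: first three samples go left
y_right = y_node[3:]

n, n_l, n_r = len(y_node), len(y_left), len(y_right)
weighted = (n_l / n) * node_mse(y_left) + (n_r / n) * node_mse(y_right)
print(node_mse(y_node) - weighted)   # variance reduction achieved by this split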

4. Worked Example: Manual Calculation

4.1 The Problem

Predict whether a customer will BUY or NOT BUY:

ID | Age | Income (million VND) | Student | Buy
1 | 25 | 30 | Yes | No
2 | 35 | 80 | No | Yes
3 | 28 | 50 | Yes | No
4 | 45 | 90 | No | Yes
5 | 32 | 60 | No | Yes
6 | 22 | 25 | Yes | No
7 | 40 | 70 | No | Yes
8 | 30 | 55 | Yes | Yes

Dataset: 5 Buy, 3 Don't buy

4.2 Step 1: Compute the Root Gini

$$Gini_{root} = 1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2 = 1 - 0.391 - 0.141 = 0.468$$

4.3 Step 2: Try Splitting on "Student"

Nhanh "Co" (4 samples): 1 Mua, 3 Khong mua GiniSV=Co=1(14)2(34)2=10.06250.5625=0.375Gini_{SV=Co} = 1 - \left(\frac{1}{4}\right)^2 - \left(\frac{3}{4}\right)^2 = 1 - 0.0625 - 0.5625 = 0.375

Nhanh "Khong" (4 samples): 4 Mua, 0 Khong mua GiniSV=Khong=1(44)2(04)2=110=0Gini_{SV=Khong} = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 1 - 1 - 0 = 0

Weighted Gini: $$Gini_{split} = \frac{4}{8} \times 0.375 + \frac{4}{8} \times 0 = 0.1875$$

Information Gain: $$IG = Gini_{root} - Gini_{split} = 0.468 - 0.1875 = 0.2805$$

4.4 Step 3: Try Splitting on Age ≤ 30

"Yes" branch (4 samples, IDs 1, 3, 6, 8): 1 Buy, 3 Don't buy $$Gini_{\le 30} = 1 - (0.25)^2 - (0.75)^2 = 0.375$$

"No" branch (4 samples, IDs 2, 4, 5, 7): 4 Buy, 0 Don't buy $$Gini_{> 30} = 0$$

Weighted Gini: $$Gini_{split} = 0.5 \times 0.375 + 0.5 \times 0 = 0.1875$$

Information Gain: $$0.468 - 0.1875 = 0.2805$$

4.5 Step 4: Compare and Choose the Best Split

Feature | Weighted Gini | Information Gain
Student | 0.1875 | 0.2805
Age ≤ 30 | 0.1875 | 0.2805

Both are equal - choose "Student" (a simpler categorical split)

4.6 Resulting Tree

The final decision tree:

Condition | Result
Student = Yes | Gini = 0.375; 1 Buy, 3 Don't buy - Prediction: DON'T BUY
Student = No | Gini = 0; 4 Buy, 0 Don't buy - Prediction: BUY
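
As a sanity check, a short sketch that redoes the arithmetic for the "Student" split programmatically (same counts as in the tables above):

Python
def gini(counts):
    """Gini impurity from class counts, e.g. [buy, not_buy]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

gini_root    = gini([5, 3])   # root: 5 Buy, 3 Don't buy
gini_student = gini([1, 3])   # Student = Yes branch
gini_other   = gini([4, 0])   # Student = No branch

weighted = (4 / 8) * gini_student + (4 / 8) * gini_other
print(round(gini_root, 3), round(weighted, 4), round(gini_root - weighted, 4))
# 0.469 0.1875 0.2812  (the lesson rounds the root Gini to 0.468, hence IG = 0.2805)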

5. Entropy and Information Gain (Alternative)

5.1 Entropy Formula

$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

5.2 Gini vs Entropy

Metric | Range | Characteristics
Gini | 0 - 0.5 | Faster; the sklearn default
Entropy | 0 - 1 | Grounded in information theory

6. Hands-on with Scikit-learn

6.1 Train a Decision Tree

Python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Data (the same 8 customers as in the worked example)
data = {
    'Tuoi': [25, 35, 28, 45, 32, 22, 40, 30],
    'Thu_nhap': [30, 80, 50, 90, 60, 25, 70, 55],
    'Sinh_vien': [1, 0, 1, 0, 0, 1, 0, 1],
    'Mua': [0, 1, 0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)

X = df[['Tuoi', 'Thu_nhap', 'Sinh_vien']]
y = df['Mua']

# Train model
model = DecisionTreeClassifier(
    criterion='gini',  # or 'entropy'
    max_depth=3,
    random_state=42
)
model.fit(X, y)

# Visualize tree
plt.figure(figsize=(15, 10))
plot_tree(model,
          feature_names=['Tuoi', 'Thu_nhap', 'Sinh_vien'],
          class_names=['Khong mua', 'Mua'],
          filled=True,
          rounded=True)
plt.title('Decision Tree')
plt.show()

# Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)

7. Hyperparameters - Controlling Complexity

7.1 Key Hyperparameters

Control complexity to prevent overfitting:

Python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,            # Maximum depth (number of levels)
    min_samples_split=10,   # Min samples required to split a node
    min_samples_leaf=5,     # Min samples required in a leaf node
    max_features='sqrt',    # Features to consider for each split
    criterion='gini',       # 'gini', 'entropy', or 'log_loss'
    random_state=42
)

model.fit(X_train, y_train)
Parameter | Description | Default | Effect
max_depth | Max number of levels | None (unlimited) | ↑ = more complex, risk of overfitting
min_samples_split | Min samples to split | 2 | ↑ = fewer splits, simpler tree
min_samples_leaf | Min samples in a leaf | 1 | ↑ = larger leaves, smoother predictions
max_features | Features considered per split | None (all) | 'sqrt', 'log2', or an int
ccp_alpha | Pruning parameter | 0.0 | ↑ = more pruning
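
In practice these hyperparameters are usually tuned together. One common way is a grid search; a minimal sketch (the grid values are purely illustrative, and X_train/y_train are assumed to exist from earlier):

Python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid of candidate values
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)   # assumes X_train, y_train from earlier

print(search.best_params_)
print(search.best_score_)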

8. Over/Underfitting & Pruning

8.1 Bias-Variance Tradeoff

[Figure: the bias-variance tradeoff — Underfitting (High Bias), Good Fit (Balanced), Overfitting (High Variance).]

Underfitting (Too Simple)

  • The tree is too shallow
  • Fails to capture patterns
  • High bias: both train and test scores are low

Overfitting (Too Complex)

  • The tree is too deep
  • Memorizes noise in the training data
  • High variance: near-perfect train score, poor test score (see the sketch below)
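
A quick way to see this in code (a sketch, assuming X_train/X_test and y_train/y_test already exist): as the tree gets deeper, the train score keeps rising while the test score typically stalls or drops.

Python
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 3, 5, 10, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")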

8.2 Pruning - Trimming the Tree

Pruning trims the tree to improve generalization.

Pre-Pruning (Early Stopping)

Stop growing during training:

Python
# Control via hyperparameters
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,            # Stop at depth 5
    min_samples_split=10,   # Need 10+ samples to split
    min_samples_leaf=5      # Each leaf must have 5+ samples
)

Pros: Fast ⚡
Cons: Might stop too early ❌

Post-Pruning (Cost Complexity)

Grow the full tree, then cut back branches that add little predictive power:

Python
# Cost Complexity Pruning
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(ccp_alpha=0.01)
clf.fit(X_train, y_train)

# Find the optimal alpha by cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np

alphas = np.arange(0, 0.1, 0.01)
scores = []

for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha)
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    scores.append(score)

optimal_alpha = alphas[np.argmax(scores)]
print(f"Optimal ccp_alpha: {optimal_alpha}")
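
Instead of scanning an arbitrary grid, scikit-learn also exposes cost_complexity_pruning_path, which enumerates the effective alphas of the fully grown tree; a short sketch (same X_train/y_train assumption as above):

Python
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas produced while pruning the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)    # effective alphas of the subtrees along the pruning path
print(path.impurities)    # corresponding total leaf impurities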

9. Advantages & Limitations

9.1 Pros ✅

Advantage | Explanation
Interpretability | Easy to visualize and explain ("white box")
No scaling required | Works without normalization/standardization
Non-linear | Handles complex non-linear relationships well
Mixed data types | Works with numerical & categorical features
Feature importance | Automatically ranks important features
Python
# Get feature importance
importances = clf.feature_importances_
feature_names = ['Feature1', 'Feature2', 'Feature3']

for name, imp in zip(feature_names, importances):
    print(f"{name}: {imp:.4f}")

9.2 Cons ❌

Limitation | Explanation | Solution
High variance | A small change in the data → a completely different tree | Use Random Forest (ensemble)
Overfitting | Easy to build overly complex trees | Prune, set max_depth
Instability | Sensitive to noise | Use ensemble methods
Greedy algorithm | Splits are locally optimal, not globally optimal | Accept it, or use an ensemble
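
The usual remedy for the variance and instability issues is an ensemble. As a preview of the next lesson, a Random Forest is nearly a drop-in replacement (sketch, assuming the same X_train/y_train/X_test/y_test as before):

Python
from sklearn.ensemble import RandomForestClassifier

# Averages many trees trained on bootstrap samples -> lower variance than a single tree
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))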

9.3 When to Use?

✅ Use a Decision Tree when:

  • Explainability is critical (banking, medical, legal)
  • You need a baseline model for comparison
  • Small to medium structured datasets
  • Features are mixed (numerical & categorical)

❌ Don't use it when:

  • You need the highest accuracy → use Random Forest or XGBoost
  • Data is high-dimensional and sparse → use linear models
  • You want stable predictions → use ensemble methods

10. Quick Comparison with Other Algorithms

Aspect | Decision Tree | Linear Reg | Logistic Reg | Random Forest
Non-linear data | ✅ Excellent | ❌ Poor | ❌ Poor | ✅ Excellent
Interpretability | ✅ High | ✅ High | ✅ High | ❌ Low
Scaling needed? | ✅ No | ⚠️ Yes | ⚠️ Yes | ✅ No
Outliers | ✅ Handles well | ❌ Sensitive | ❌ Sensitive | ✅ Handles well
Variance | ❌ High | ✅ Low | ✅ Low | ✅ Low (ensemble)
Overfitting risk | ❌ High | ✅ Low | ✅ Low | ✅ Low

11. Real-World Examples

11.1 Classification Use Cases

Loan Approval: Should the loan be approved? (Yes/No)

Python
# Features: Income, Credit Score, Employment Status, Debt Ratio
# Target: Approved (Yes/No)

Medical Diagnosis: Disease present? (Yes/No)

Python
# Features: Age, Blood Pressure, Cholesterol, BMI
# Target: Heart Disease (0/1)

Customer Churn: Will customer leave?

Python
# Features: Usage, Support Tickets, Contract Type, Tenure
# Target: Churn (Yes/No)

11.2 Regression Use Cases

Pricing: Estimating used car prices

Python
# Features: Year, Mileage, Brand, Condition
# Target: Price ($)

Demand Forecasting: Predicting daily sales volume

Python
# Features: Day of Week, Season, Promotions, Weather
# Target: Sales Units

12. Next Steps - Upgrade to Ensembles

⚠️ Why upgrade? Single trees are weak learners. To improve performance:

Random Forest (Bagging)

  • Bags multiple trees to reduce variance
  • Runs trees in parallel
  • Reduces overfitting dramatically

Gradient Boosting (XGBoost)

  • Trains trees sequentially to fix errors of previous ones
  • High performance but risk of overfitting
  • Tuning required

We will cover these in detail in the Ensemble Methods lesson!


Practice Exercises

  1. Exercise 1: Compute the Gini impurity for a node with [3 Red, 5 Blue, 2 Green]
  2. Exercise 2: Train a Decision Tree on the Titanic dataset and visualize the tree
  3. Exercise 3: Compare performance with max_depth = [3, 5, 10, None]
  4. Exercise 4: Apply cost complexity pruning and find the optimal alpha

References