Theory
5-6 hours
Lesson 8/15

Decision Tree: Regression & Classification

From If-Else Rules to Machine Learning Models - Gini, MSE, Pruning


🎯 Learning Objectives

After this lesson, learners will be able to:

  • Understand a Decision Tree as a flowchart of if-else questions
  • Master Gini Impurity, Entropy, and MSE for choosing the best split
  • Compute Information Gain by hand
  • Understand overfitting, pruning, and hyperparameters
  • Implement classification & regression trees with Scikit-learn
  • Compare Decision Trees with Linear/Logistic Regression

1. Core Idea - What is a Decision Tree?

Decision Tree Structure

[Diagram: Root Node (All Data) splits on X₁ < 5 (internal node); one branch ends in Leaf A (Class Red), the other splits on X₂ > 10 into Leaf B (Class Blue) and Leaf C (Class Red).]

1.1 Definition

A Decision Tree is a supervised learning algorithm that learns simple decision rules from data features in order to make predictions.

Mechanism: it works like a flowchart of if-else questions:

Text
IF Age < 30 THEN
    IF Income > 50k THEN
        Prediction: Buy
    ELSE
        Prediction: Don't Buy
ELSE
    Prediction: Buy
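
The same rules can be written as a plain Python function; a minimal sketch (the thresholds are simply the ones from the pseudocode above, not learned from data):

Python
def predict_buy(age, income_k):
    """Hand-written rules mirroring the flowchart; a fitted tree encodes the same kind of logic."""
    if age < 30:
        if income_k > 50:
            return "Buy"
        return "Don't Buy"
    return "Buy"

print(predict_buy(25, 60))  # Buy
print(predict_buy(25, 40))  # Don't Buy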

1.2 Key Terms

Term | Definition | Example
Root Node | Starting point; the entire dataset | "All 1000 customers"
Internal Node | Decision point on one feature | "Age > 18?"
Split | Dividing a node into sub-nodes | Split by "Income < 50k"
Leaf Node | Terminal node with the final output | "Class = Spam" or "Price = $200k"

2. Types of Trees

2.1 Classification Tree

Output: Categorical (Class label)

Example: Spam or Not Spam? (binary), Flower Species (multiclass)

Prediction at a leaf: the majority class in that leaf

Python
from sklearn.tree import DecisionTreeClassifier

# Create classifier
clf = DecisionTreeClassifier(max_depth=3)
clf.fit(X_train, y_train)

# Predict
y_pred = clf.predict(X_test)

2.2 Regression Tree

Output: Continuous (Number)

Example: House Price Prediction, Sales Forecasting

Prediction at a leaf: the mean (average) of the target values in that leaf

Python
from sklearn.tree import DecisionTreeRegressor

# Create regressor
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(X_train, y_train)

# Predict
y_pred = reg.predict(X_test)

Leaf node prediction strategy:

  • Classification: majority vote
  • Regression: mean
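
A quick way to see the regression case (a small sketch with made-up numbers): a depth-1 regression tree has exactly two leaves, and each leaf predicts the mean of the training targets that land in it.

Python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy data (made-up numbers): the target jumps between x <= 3 and x >= 6
X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])

# A depth-1 tree ("stump") has exactly two leaves
stump = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Each leaf predicts the mean of the targets that fall into it
print(stump.predict([[2], [7]]))    # ~[11. 31.]
print(y[:3].mean(), y[3:].mean())   # 11.0 31.0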

3. Splitting Criteria - Choosing the "Best Split"

Goal: the "best split" is the one that minimizes impurity (disorder/chaos).

Tree Building Process - Recursive Partitioning

1. Select: iterate over every feature
2. Find: evaluate the metric at every candidate split point
3. Split: choose the split that reduces impurity the most
4. Repeat: apply the procedure recursively to the child nodes
5. Stop: when a stopping condition is met

3.1 Classification Metrics

Gini Impurity (default - faster)

Measures the probability of misclassifying a randomly chosen element. Range: [0, 0.5] for two classes.

$$Gini = 1 - \sum_{i=1}^{C} p_i^2$$

Example: a node with 60% Class A and 40% Class B: $$Gini = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$$

Entropy (Slower - log calculation)

Measures information/disorder. Range: [0, 1] for two classes.

$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

Example: a node with 60% Class A and 40% Class B: $$Entropy = -(0.6 \log_2 0.6 + 0.4 \log_2 0.4) = 0.971$$
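
To make the formulas concrete, here is a small helper sketch that recomputes both example numbers from the class proportions:

Python
import numpy as np

def gini(p):
    """Gini impurity from a list of class proportions."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (base 2) from a list of class proportions."""
    p = np.asarray(p)
    p = p[p > 0]                 # avoid log2(0)
    return -np.sum(p * np.log2(p))

print(gini([0.6, 0.4]))     # 0.48
print(entropy([0.6, 0.4]))  # ~0.971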

Information Gain

$$IG = Entropy(\text{Parent}) - \text{Weighted Avg } Entropy(\text{Children})$$

Python
# Compare Gini vs Entropy
from sklearn.tree import DecisionTreeClassifier

# Using Gini (default, faster)
clf_gini = DecisionTreeClassifier(criterion='gini')

# Using Entropy
clf_entropy = DecisionTreeClassifier(criterion='entropy')

# Using Log Loss (equivalent to Entropy)
clf_log = DecisionTreeClassifier(criterion='log_loss')

3.2 Regression Metrics

MSE (Mean Squared Error) - Default

Variance reduction:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$$

MAE (Mean Absolute Error)

Robust to outliers:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \bar{y}|$$

Python
from sklearn.tree import DecisionTreeRegressor

# Using MSE (default)
reg_mse = DecisionTreeRegressor(criterion='squared_error')

# Using MAE (robust to outliers)
reg_mae = DecisionTreeRegressor(criterion='absolute_error')
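
For regression, the "best split" is the one that most reduces the weighted MSE of the children relative to the parent (variance reduction). A minimal sketch with made-up target values:

Python
import numpy as np

def node_mse(y):
    """MSE of a node when it predicts its own mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

# Made-up target values reaching a node
y_node  = np.array([10.0, 12.0, 11.0, 30.0, 32.0, 31.0])
y_left  = y_node[:3]   # candidate split: first three samples go left
y_right = y_node[3:]

n, n_l, n_r = len(y_node), len(y_left), len(y_right)
weighted = (n_l / n) * node_mse(y_left) + (n_r / n) * node_mse(y_right)
print(node_mse(y_node) - weighted)   # variance reduction achieved by this split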

4. Worked Example: Manual Calculation

4.1 The Problem

Predict whether a customer will BUY or NOT BUY:

ID | Age | Income (million VND) | Student | Buy
1 | 25 | 30 | Yes | No
2 | 35 | 80 | No | Yes
3 | 28 | 50 | Yes | No
4 | 45 | 90 | No | Yes
5 | 32 | 60 | No | Yes
6 | 22 | 25 | Yes | No
7 | 40 | 70 | No | Yes
8 | 30 | 55 | Yes | Yes

Dataset: 5 Buy, 3 Don't buy

4.2 Step 1: Compute the Root Gini

$$Gini_{root} = 1 - \left(\frac{5}{8}\right)^2 - \left(\frac{3}{8}\right)^2 = 1 - 0.391 - 0.141 = 0.468$$

4.3 Step 2: Try Splitting on "Student"

Nhanh "Co" (4 samples): 1 Mua, 3 Khong mua GiniSV=Co=1(14)2(34)2=10.06250.5625=0.375Gini_{SV=Co} = 1 - \left(\frac{1}{4}\right)^2 - \left(\frac{3}{4}\right)^2 = 1 - 0.0625 - 0.5625 = 0.375

Nhanh "Khong" (4 samples): 4 Mua, 0 Khong mua GiniSV=Khong=1(44)2(04)2=110=0Gini_{SV=Khong} = 1 - \left(\frac{4}{4}\right)^2 - \left(\frac{0}{4}\right)^2 = 1 - 1 - 0 = 0

Weighted Gini: $$Gini_{split} = \frac{4}{8} \times 0.375 + \frac{4}{8} \times 0 = 0.1875$$

Information Gain: $$IG = Gini_{root} - Gini_{split} = 0.468 - 0.1875 = 0.2805$$

4.4 Step 3: Try Splitting on Age ≤ 30

"Yes" branch (4 samples, IDs 1, 3, 6, 8): 1 Buy, 3 Don't buy $$Gini_{\le 30} = 1 - (0.25)^2 - (0.75)^2 = 0.375$$

"No" branch (4 samples, IDs 2, 4, 5, 7): 4 Buy, 0 Don't buy $$Gini_{> 30} = 0$$

Weighted Gini: $$Gini_{split} = 0.5 \times 0.375 + 0.5 \times 0 = 0.1875$$

Information Gain: $$0.468 - 0.1875 = 0.2805$$

4.5 Step 4: Compare and Choose the Best Split

Feature | Weighted Gini | Information Gain
Student | 0.1875 | 0.2805
Age ≤ 30 | 0.1875 | 0.2805

Both are equal - choose "Student" (a simpler categorical split)

4.6 Resulting Tree

The final decision tree:

Condition | Result
Student = Yes | Gini = 0.375; 1 Buy, 3 Don't buy - Prediction: DON'T BUY
Student = No | Gini = 0; 4 Buy, 0 Don't buy - Prediction: BUY
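
As a sanity check, a short sketch that redoes the arithmetic for the "Student" split programmatically (same counts as in the tables above):

Python
def gini(counts):
    """Gini impurity from class counts, e.g. [buy, not_buy]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

gini_root    = gini([5, 3])   # root: 5 Buy, 3 Don't buy
gini_student = gini([1, 3])   # Student = Yes branch
gini_other   = gini([4, 0])   # Student = No branch

weighted = (4 / 8) * gini_student + (4 / 8) * gini_other
print(round(gini_root, 3), round(weighted, 4), round(gini_root - weighted, 4))
# 0.469 0.1875 0.2812  (the lesson rounds the root Gini to 0.468, hence IG = 0.2805)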

5. Entropy and Information Gain (Alternative)

5.1 Entropy Formula

$$Entropy = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

5.2 Gini vs Entropy

Metric | Range | Characteristics
Gini | 0 - 0.5 | Faster; the sklearn default
Entropy | 0 - 1 | Grounded in information theory

6. Hands-on with Scikit-learn

6.1 Train a Decision Tree

Python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

# Data (the same 8 customers as in the worked example)
data = {
    'Tuoi': [25, 35, 28, 45, 32, 22, 40, 30],
    'Thu_nhap': [30, 80, 50, 90, 60, 25, 70, 55],
    'Sinh_vien': [1, 0, 1, 0, 0, 1, 0, 1],
    'Mua': [0, 1, 0, 1, 1, 0, 1, 1]
}
df = pd.DataFrame(data)

X = df[['Tuoi', 'Thu_nhap', 'Sinh_vien']]
y = df['Mua']

# Train model
model = DecisionTreeClassifier(
    criterion='gini',  # or 'entropy'
    max_depth=3,
    random_state=42
)
model.fit(X, y)

# Visualize tree
plt.figure(figsize=(15, 10))
plot_tree(model,
          feature_names=['Tuoi', 'Thu_nhap', 'Sinh_vien'],
          class_names=['Khong mua', 'Mua'],
          filled=True,
          rounded=True)
plt.title('Decision Tree')
plt.show()

# Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': model.feature_importances_
}).sort_values('Importance', ascending=False)
print(importance)

7. Hyperparameters - Controlling Complexity

7.1 Key Hyperparameters

Control complexity to prevent overfitting:

Python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(
    max_depth=5,            # Maximum depth (number of levels)
    min_samples_split=10,   # Min samples required to split a node
    min_samples_leaf=5,     # Min samples required in a leaf node
    max_features='sqrt',    # Features to consider for each split
    criterion='gini',       # 'gini', 'entropy', or 'log_loss'
    random_state=42
)

model.fit(X_train, y_train)
Parameter | Description | Default | Effect
max_depth | Max number of levels | None (unlimited) | ↑ = more complex, risk of overfitting
min_samples_split | Min samples to split | 2 | ↑ = fewer splits, simpler tree
min_samples_leaf | Min samples in a leaf | 1 | ↑ = larger leaves, smoother predictions
max_features | Features considered per split | None (all) | 'sqrt', 'log2', or an int
ccp_alpha | Pruning parameter | 0.0 | ↑ = more pruning
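
In practice these hyperparameters are usually tuned together. One common way is a grid search; a minimal sketch (the grid values are purely illustrative, and X_train/y_train are assumed to exist from earlier):

Python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative grid of candidate values
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
)
search.fit(X_train, y_train)   # assumes X_train, y_train from earlier

print(search.best_params_)
print(search.best_score_)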

8. Over/Underfitting & Pruning

8.1 Bias-Variance Tradeoff

[Figure: the bias-variance tradeoff — Underfitting (High Bias), Good Fit (Balanced), Overfitting (High Variance).]

Underfitting (Too Simple)

  • The tree is too shallow
  • Fails to capture patterns
  • High bias: both train and test scores are low

Overfitting (Too Complex)

  • The tree is too deep
  • Memorizes noise in the training data
  • High variance: near-perfect train score, poor test score (see the sketch below)
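
A quick way to see this in code (a sketch, assuming X_train/X_test and y_train/y_test already exist): as the tree gets deeper, the train score keeps rising while the test score typically stalls or drops.

Python
from sklearn.tree import DecisionTreeClassifier

for depth in [1, 3, 5, 10, None]:
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    print(f"max_depth={depth}: "
          f"train={clf.score(X_train, y_train):.3f}, "
          f"test={clf.score(X_test, y_test):.3f}")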

8.2 Pruning - Trimming the Tree

Pruning trims the tree to improve generalization.

Pre-Pruning (Early Stopping)

Stop growing during training:

Python
# Control via hyperparameters
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=5,            # Stop at depth 5
    min_samples_split=10,   # Need 10+ samples to split
    min_samples_leaf=5      # Each leaf must have 5+ samples
)

Pros: Fast ⚡
Cons: Might stop too early ❌

Post-Pruning (Cost Complexity)

Grow the full tree, then cut back branches that add little predictive power:

Python
# Cost Complexity Pruning
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(ccp_alpha=0.01)
clf.fit(X_train, y_train)

# Find the optimal alpha by cross-validation
from sklearn.model_selection import cross_val_score
import numpy as np

alphas = np.arange(0, 0.1, 0.01)
scores = []

for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha)
    score = cross_val_score(clf, X_train, y_train, cv=5).mean()
    scores.append(score)

optimal_alpha = alphas[np.argmax(scores)]
print(f"Optimal ccp_alpha: {optimal_alpha}")
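
Instead of scanning an arbitrary grid, scikit-learn also exposes cost_complexity_pruning_path, which enumerates the effective alphas of the fully grown tree; a short sketch (same X_train/y_train assumption as above):

Python
from sklearn.tree import DecisionTreeClassifier

# Candidate alphas produced while pruning the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)
print(path.ccp_alphas)    # effective alphas of the subtrees along the pruning path
print(path.impurities)    # corresponding total leaf impurities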

9. Advantages & Limitations

9.1 Pros ✅

Advantage | Explanation
Interpretability | Easy to visualize and explain ("white box")
No scaling required | Works without normalization/standardization
Non-linear | Handles complex non-linear relationships well
Mixed data types | Works with numerical & categorical features
Feature importance | Automatically ranks important features
Python
# Get feature importance
importances = clf.feature_importances_
feature_names = ['Feature1', 'Feature2', 'Feature3']

for name, imp in zip(feature_names, importances):
    print(f"{name}: {imp:.4f}")

9.2 Cons ❌

Limitation | Explanation | Solution
High variance | A small change in the data → a completely different tree | Use Random Forest (ensemble)
Overfitting | Easy to build overly complex trees | Prune, set max_depth
Instability | Sensitive to noise | Use ensemble methods
Greedy algorithm | Splits are locally optimal, not globally optimal | Accept it, or use an ensemble
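
The usual remedy for the variance and instability issues is an ensemble. As a preview of the next lesson, a Random Forest is nearly a drop-in replacement (sketch, assuming the same X_train/y_train/X_test/y_test as before):

Python
from sklearn.ensemble import RandomForestClassifier

# Averages many trees trained on bootstrap samples -> lower variance than a single tree
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))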

9.3 When to Use?

✅ Use a Decision Tree when:

  • Explainability is critical (banking, medical, legal)
  • You need a baseline model for comparison
  • Small to medium structured datasets
  • Features are mixed (numerical & categorical)

❌ Don't use it when:

  • You need the highest accuracy → use Random Forest or XGBoost
  • Data is high-dimensional and sparse → use linear models
  • You want stable predictions → use ensemble methods

10. Quick Comparison with Other Algorithms

Aspect | Decision Tree | Linear Reg | Logistic Reg | Random Forest
Non-linear data | ✅ Excellent | ❌ Poor | ❌ Poor | ✅ Excellent
Interpretability | ✅ High | ✅ High | ✅ High | ❌ Low
Scaling needed? | ✅ No | ⚠️ Yes | ⚠️ Yes | ✅ No
Outliers | ✅ Handles well | ❌ Sensitive | ❌ Sensitive | ✅ Handles well
Variance | ❌ High | ✅ Low | ✅ Low | ✅ Low (ensemble)
Overfitting risk | ❌ High | ✅ Low | ✅ Low | ✅ Low

11. Real-World Examples

11.1 Classification Use Cases

Loan Approval: Should the loan be approved? (Yes/No)

Python
# Features: Income, Credit Score, Employment Status, Debt Ratio
# Target: Approved (Yes/No)

Medical Diagnosis: Disease present? (Yes/No)

Python
# Features: Age, Blood Pressure, Cholesterol, BMI
# Target: Heart Disease (0/1)

Customer Churn: Will customer leave?

Python
# Features: Usage, Support Tickets, Contract Type, Tenure
# Target: Churn (Yes/No)

11.2 Regression Use Cases

Pricing: Estimating used car prices

Python
# Features: Year, Mileage, Brand, Condition
# Target: Price ($)

Demand Forecasting: Predicting daily sales volume

Python
# Features: Day of Week, Season, Promotions, Weather
# Target: Sales Units

12. Next Steps - Upgrade to Ensembles

⚠️ Why upgrade? Single trees are weak learners. To improve performance:

Random Forest (Bagging)

  • Bags multiple trees to reduce variance
  • Runs trees in parallel
  • Reduces overfitting dramatically

Gradient Boosting (XGBoost)

  • Trains trees sequentially to fix errors of previous ones
  • High performance but risk of overfitting
  • Tuning required

We will cover these in detail in the Ensemble Methods lesson!


Practice Exercises

  1. Exercise 1: Compute the Gini impurity for a node with [3 Red, 5 Blue, 2 Green]
  2. Exercise 2: Train a Decision Tree on the Titanic dataset and visualize the tree
  3. Exercise 3: Compare performance with max_depth = [3, 5, 10, None]
  4. Exercise 4: Apply cost complexity pruning and find the optimal alpha

References