Predictive Analytics Basics
🎯 Lesson Objectives
- Build Simple and Multiple Linear Regression models
- Implement Binary Classification for churn prediction
- Use Decision Trees and interpret feature importance
- Perform simple forecasting with moving averages and trend extrapolation
- Evaluate model performance with cross-validation and proper metrics
- Explain results to non-technical stakeholders
⏱️ Duration: 2.5 hours | 📊 Level: Advanced | 🛠️ Tools: Python, scikit-learn
📖 Key Terms
| Term | Vietnamese | Description |
|---|---|---|
| Linear Regression | Hồi quy tuyến tính | Predicts a continuous value from a linear relationship |
| Multiple Regression | Hồi quy đa biến | Regression with multiple independent variables (features) |
| Classification | Phân loại | Predicts categories (churn yes/no, fraud/legit) |
| Logistic Regression | Hồi quy logistic | Classification model that returns a probability |
| Decision Tree | Cây quyết định | Model that splits data by rules; easy to interpret |
| Feature Importance | Độ quan trọng feature | Measures each variable's influence on the prediction |
| R² Score | Hệ số xác định | Fraction of variance explained by the model (0-1) |
| RMSE | Căn bậc hai sai số bình phương trung bình | Root Mean Squared Error — measures prediction error |
| Cross-Validation | Kiểm chứng chéo | Evaluates a model across multiple data splits |
| Confusion Matrix | Ma trận nhầm lẫn | Table comparing predicted vs actual classifications |
Checkpoint
Predictive analytics has three main types: Regression (predict a number), Classification (predict a category), and Time Series Forecasting (predict the future). Can you give an example of a business problem suited to regression vs classification?
📈 Linear Regression
Simple Linear Regression
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Sample data: Marketing spend vs Sales
np.random.seed(42)
n = 100
marketing_spend = np.random.uniform(1000, 10000, n)
noise = np.random.normal(0, 500, n)
sales = 500 + 3.5 * marketing_spend + noise

df = pd.DataFrame({'marketing_spend': marketing_spend, 'sales': sales})

# Prepare & train
X = df[['marketing_spend']]
y = df['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept: ${model.intercept_:.2f}")
print(f"Coefficient: ${model.coef_[0]:.2f} per $1 marketing spend")

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"\nRMSE: ${rmse:.2f}")
print(f"R² Score: {r2:.4f} → Model explains {r2*100:.1f}% of variance")
```
Multiple Regression
```python
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'marketing_spend': np.random.uniform(1000, 10000, n),
    'store_traffic': np.random.uniform(500, 5000, n),
    'avg_price': np.random.uniform(20, 100, n),
    'num_promotions': np.random.randint(0, 10, n),
    'is_weekend': np.random.choice([0, 1], n)
})

df['sales'] = (500 + 2.5 * df['marketing_spend'] + 1.2 * df['store_traffic']
               - 3.0 * df['avg_price'] + 200 * df['num_promotions']
               + 1500 * df['is_weekend'] + np.random.normal(0, 500, n))

features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[features]
y = df['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_):
    direction = "↑" if coef > 0 else "↓"
    print(f"  {feature}: {coef:.2f} {direction}")

y_pred = model.predict(X_test)
print(f"\nR² Score: {r2_score(y_test, y_pred):.4f}")
```
Feature Importance (Standardized)
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y_train)

importance = pd.DataFrame({
    'feature': features,
    'importance': np.abs(model_scaled.coef_)
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Absolute Standardized Coefficient')
plt.title('Feature Importance')
plt.show()
```
Raw coefficients are not comparable because features are on different scales ($1 of marketing ≠ 1 promotion). Standardize before comparing importance.
Checkpoint
Linear Regression predicts continuous values — R² tells you what share of the variance the model explains, while RMSE measures prediction error in the original units. What does R² = 0.85 mean in a business context?
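To connect R² back to its definition, a minimal sketch (using tiny illustrative arrays, not the lesson's data) computes it by hand as 1 − SS_res/SS_tot and checks the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.6])

# R² = 1 - (sum of squared residuals / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_true, y_pred))
print(f"R²: {r2_manual:.4f}")  # → 0.9850
```

In business terms, R² = 0.985 here means the predictions account for 98.5% of the variation in the outcome; the remaining 1.5% is unexplained noise.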
🏷️ Classification (Churn Prediction)
Binary Classification
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_spend': np.random.uniform(20, 200, n),
    'support_tickets': np.random.poisson(2, n),
    'last_login_days': np.random.exponential(30, n),
    'contract_type': np.random.choice(['monthly', 'annual'], n)
})

churn_prob = (0.1 - 0.01 * customers['tenure_months']
              - 0.002 * customers['monthly_spend']
              + 0.1 * customers['support_tickets']
              + 0.01 * customers['last_login_days']
              + 0.2 * (customers['contract_type'] == 'monthly'))
churn_prob = 1 / (1 + np.exp(-churn_prob))
customers['churned'] = (np.random.random(n) < churn_prob).astype(int)

print(f"Churn Rate: {customers['churned'].mean()*100:.1f}%")
```
Train & Evaluate
```python
customers['is_monthly'] = (customers['contract_type'] == 'monthly').astype(int)
features = ['tenure_months', 'monthly_spend', 'support_tickets', 'last_login_days', 'is_monthly']

X = customers[features]
y = customers['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
```
Confusion Matrix & Risk Scoring
```python
from sklearn.metrics import ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Stay', 'Churn']).plot(ax=axes[0])
axes[0].set_title('Confusion Matrix')

axes[1].hist(y_prob[y_test == 0], bins=20, alpha=0.5, label='Stayed')
axes[1].hist(y_prob[y_test == 1], bins=20, alpha=0.5, label='Churned')
axes[1].set_xlabel('Predicted Churn Probability')
axes[1].set_title('Probability Distribution')
axes[1].legend()
plt.tight_layout()
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (correctly predicted Stay): {tn}")
print(f"False Positives (wrongly predicted Churn): {fp}")
print(f"False Negatives (missed Churns): {fn}")
print(f"True Positives (correctly predicted Churn): {tp}")

# Risk scoring
test_customers = X_test.copy()
test_customers['churn_probability'] = y_prob
test_customers['risk_score'] = pd.cut(y_prob, bins=[0, 0.3, 0.6, 1.0],
                                      labels=['Low', 'Medium', 'High'])
print("\nChurn Rate by Risk Segment:")
test_customers['actual_churn'] = y_test.values
print(test_customers.groupby('risk_score')['actual_churn'].agg(['count', 'sum', 'mean']).round(3))
```
- High Precision: few false positives (retention budget isn't wasted)
- High Recall: few false negatives (churners aren't missed)
- Choose based on business cost: what does missing one churner cost vs spending retention budget on one non-churner?
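One way to make the precision-vs-recall trade-off concrete is to price the confusion matrix. A minimal sketch with assumed costs — the $500 lost lifetime value per missed churner and $50 per wasted retention offer are illustrative numbers, as are the two models' error counts:

```python
# Assumed business costs — illustrative, not from real data
COST_FN = 500  # revenue lost when a churner is missed (false negative)
COST_FP = 50   # retention offer wasted on a loyal customer (false positive)

def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of a model's mistakes on a test set."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical confusion-matrix counts for two models
model_a = total_cost(fp=40, fn=10)   # recall-focused: more false alarms
model_b = total_cost(fp=10, fn=35)   # precision-focused: misses more churners
print(f"Model A (recall-focused):    ${model_a:,}")   # → $7,000
print(f"Model B (precision-focused): ${model_b:,}")   # → $18,000
```

Under these assumed costs the recall-focused model is far cheaper; flip the cost ratio (e.g. in fraud blocking, where a false positive angers a customer) and the conclusion flips too.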
Checkpoint
Classification returns a probability → convert it into risk scores → prioritize actions for High Risk customers first. False Negative (missed churner) vs False Positive (mislabeled churner) — which is more costly for the business?
🌳 Decision Trees
Train & Visualize
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=features, class_names=['Stay', 'Churn'],
          filled=True, rounded=True, fontsize=10)
plt.title('Decision Tree for Churn Prediction')
plt.tight_layout()
plt.show()

y_pred_tree = tree.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree)*100:.1f}%")
```
Feature Importance
```python
importance = pd.DataFrame({
    'feature': features,
    'importance': tree.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Decision Tree):")
print(importance)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance from Decision Tree')
plt.gca().invert_yaxis()
plt.show()
```
Decision Trees are easy to explain to stakeholders: "If tenure < 12 months AND support_tickets > 3 → 78% chance of churn." That is a major advantage over black-box models.
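Those if-then rules can also be extracted as plain text rather than a plot, using scikit-learn's `export_text`. A standalone sketch — the tiny synthetic fit and its feature names are illustrative, not the lesson's churn data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny synthetic stand-in: tenure in months, number of support tickets
rng = np.random.default_rng(42)
X_demo = np.column_stack([rng.integers(1, 72, 200), rng.integers(0, 10, 200)])
y_demo = ((X_demo[:, 0] < 12) & (X_demo[:, 1] > 3)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_demo, y_demo)

# Text rules — easy to paste into a slide or an email to the business team
rules = export_text(tree, feature_names=['tenure_months', 'support_tickets'])
print(rules)
```

Each line of the output is a threshold split (e.g. `tenure_months <= 11.50`), indented to show the path from root to leaf.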
Checkpoint
Decision Trees provide interpretable rules plus feature importance — perfect for an analyst who needs to explain a model to the business team. Why is max_depth=3 often better than max_depth=20 for business presentations?
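The depth trade-off can be checked empirically: deeper trees memorize training data but their held-out performance plateaus or drops. A sketch on synthetic data (generated with `make_classification`, so it runs standalone; the data is illustrative, not the lesson's churn set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=42)

# Cross-validate trees of increasing depth
for depth in [2, 3, 5, 10, 20]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5, scoring='accuracy')
    print(f"max_depth={depth:>2}: CV accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```

A shallow tree is typically within a point or two of the deep one on held-out folds — while still fitting on a single slide.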
📅 Simple Forecasting
Moving Average
```python
dates = pd.date_range('2022-01-01', periods=365*2, freq='D')
np.random.seed(42)

trend = np.linspace(1000, 1500, len(dates))
seasonality = 200 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)
noise = np.random.normal(0, 50, len(dates))
sales = trend + seasonality + noise

ts = pd.DataFrame({'date': dates, 'sales': sales}).set_index('date')

ts['MA_7'] = ts['sales'].rolling(7).mean()
ts['MA_30'] = ts['sales'].rolling(30).mean()

last_ma30 = ts['MA_30'].iloc[-1]
print(f"30-day MA Forecast: ${last_ma30:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index[-90:], ts['sales'][-90:], alpha=0.5, label='Actual')
plt.plot(ts.index[-90:], ts['MA_30'][-90:], label='30-day MA')
plt.axhline(last_ma30, color='r', linestyle='--', label=f'Forecast: ${last_ma30:.0f}')
plt.legend()
plt.title('Sales Forecast using Moving Average')
plt.show()
```
Trend Extrapolation
```python
from scipy import stats

x = np.arange(len(ts))
y = ts['sales'].values
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

future_x = np.arange(len(ts), len(ts) + 30)
forecast = intercept + slope * future_x

print(f"Trend: ${slope:.2f} per day")
print(f"30-day Forecast Range: ${forecast[0]:.2f} to ${forecast[-1]:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index, ts['sales'], alpha=0.5, label='Historical')
future_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=30)
plt.plot(future_dates, forecast, 'r--', linewidth=2, label='Forecast')
plt.legend()
plt.title('Sales Forecast using Trend Extrapolation')
plt.show()
```
Seasonal Forecast
```python
def seasonal_forecast(ts, periods=30):
    """Simple seasonal forecast: linear trend scaled by monthly averages"""
    ts['month'] = ts.index.month
    monthly_avg = ts.groupby('month')['sales'].mean()

    # Fit a trend line to the most recent year
    recent = ts.iloc[-365:]
    x = np.arange(len(recent))
    y = recent['sales'].values
    slope, intercept, _, _, _ = stats.linregress(x, y)

    forecast_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=periods)
    forecasts = []
    for i, date in enumerate(forecast_dates):
        base = intercept + slope * (len(recent) + i)
        seasonal_factor = monthly_avg[date.month] / ts['sales'].mean()
        forecasts.append({'date': date, 'forecast': base * seasonal_factor})

    return pd.DataFrame(forecasts)

forecast_df = seasonal_forecast(ts, 30)
print("Seasonal Forecast (first 10 days):")
print(forecast_df.head(10))
```
Checkpoint
Simple forecasting — Moving Average for short-term, Trend Extrapolation for long-term, Seasonal Adjustment for cyclical patterns. When will a moving-average forecast be badly off?
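The checkpoint's answer can be shown numerically: on a steadily trending series, a flat moving-average forecast lags the true level, because the window averages over older, lower values. A sketch on a noise-free synthetic trend (the slope of 10 units/day is an illustrative assumption):

```python
import numpy as np
import pandas as pd

# Series rising exactly 10 units per day, no noise
sales = pd.Series(np.arange(100, dtype=float) * 10)

window = 30
ma_forecast = sales.rolling(window).mean().iloc[-1]   # flat MA "forecast"
actual_next = sales.iloc[-1] + 10                     # true next value

lag = actual_next - ma_forecast
print(f"MA forecast: {ma_forecast:.0f}, actual next: {actual_next:.0f}, gap: {lag:.0f}")
# Gap = slope * (window - 1) / 2 + slope = 10 * 29/2 + 10 = 155
```

So the stronger the trend and the longer the window, the bigger the systematic underestimate — exactly the case where moving averages fail.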
📏 Model Evaluation
Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Regression CV — note: `features` was redefined for the churn data above,
# so spell out the regression feature list again
reg_features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[reg_features]
y = df['sales']
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Regression Cross-Validation (R²):")
print(f"  Scores: {scores.round(4)}")
print(f"  Mean: {scores.mean():.4f} ± {scores.std():.4f}")

# Classification CV
clf = LogisticRegression(random_state=42)
X_clf = customers[features]
y_clf = customers['churned']
scores = cross_val_score(clf, X_clf, y_clf, cv=5, scoring='accuracy')
print(f"\nClassification Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```
Metrics Explained
```python
def explain_metrics(y_true, y_pred, y_prob=None):
    """Explain classification metrics in business terms"""
    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    print("=" * 50)
    print("CLASSIFICATION METRICS EXPLAINED")
    print("=" * 50)

    accuracy = accuracy_score(y_true, y_pred)
    print(f"\n📊 Accuracy: {accuracy:.2%}")
    print(f"   → {accuracy:.0%} of all predictions are correct")

    precision = precision_score(y_true, y_pred)
    print(f"\n🎯 Precision: {precision:.2%}")
    print(f"   → When we predict Churn, we're right {precision:.0%} of the time")

    recall = recall_score(y_true, y_pred)
    print(f"\n🔍 Recall: {recall:.2%}")
    print(f"   → We catch {recall:.0%} of actual Churns")

    f1 = f1_score(y_true, y_pred)
    print(f"\n⚖️ F1 Score: {f1:.2%}")
    print(f"   → Balanced measure of precision and recall")

    if y_prob is not None:
        auc = roc_auc_score(y_true, y_prob)
        print(f"\n📈 AUC-ROC: {auc:.2%}")
        print(f"   → Model's ability to distinguish classes (0.5=random, 1.0=perfect)")

explain_metrics(y_test, y_pred, y_prob)
```
- E-commerce churn: focus on Recall (don't miss churners)
- Fraud detection: focus on Precision (don't block legitimate customers by mistake)
- General: F1 Score balances both
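Shifting the emphasis between recall and precision doesn't require a new model — moving the decision threshold away from the default 0.5 changes the balance. A standalone sketch on synthetic imbalanced data (none of the lesson's variables are reused):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~20% positive class, like a churn problem
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.8, 0.2],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lower threshold → more predicted positives → higher recall, lower precision
for threshold in [0.3, 0.5, 0.7]:
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

For a churn campaign you might pick the 0.3 threshold (catch more churners, accept some wasted offers); for fraud blocking, the 0.7 one.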
Checkpoint
Cross-validation gives a reliable performance estimate — a single train/test split can get "lucky". The CV mean ± std tells you whether the model is stable. Does AUC = 0.72 mean the model is good, or does it need improvement?
📋 Summary
What you've learned
| Topic | Key Points |
|---|---|
| Regression | Predict continuous values, coefficients, R², RMSE |
| Classification | Predict categories, probability, precision/recall |
| Decision Trees | Visual rules, feature importance, interpretable |
| Forecasting | Moving average, trend extrapolation, seasonal adjustment |
| Evaluation | Cross-validation, confusion matrix, AUC-ROC |
Self-check questions
- How do Regression and Classification differ?
- Why is cross-validation better than a single train/test split?
- What are the advantages of a Decision Tree?
- What does AUC-ROC measure?
You now have a solid grasp of Predictive Analytics fundamentals — enough to collaborate with Data Scientists, interpret model results, and build simple predictive models for business use cases.
Next lesson: Data Storytelling & Reporting
