
Predictive Analytics Basics

An introduction to forecasting, regression, and classification for Data Analysts


Predictive Analytics and Forecasting

0

🎯 Lesson Objectives

After this lesson, you will be able to:
  • Build Simple and Multiple Linear Regression models
  • Implement Binary Classification for churn prediction
  • Use Decision Trees and interpret feature importance
  • Perform simple forecasting with moving averages and trend extrapolation
  • Evaluate model performance with cross-validation and proper metrics
  • Explain results to non-technical stakeholders
Lesson info

⏱️ Duration: 2.5 hours | 📊 Level: Advanced | 🛠️ Tools: Python, scikit-learn

1

📖 Key Terminology

Term | Vietnamese | Description
Linear Regression | Hồi quy tuyến tính | Predicts a continuous value from a linear relationship
Multiple Regression | Hồi quy đa biến | Regression with multiple independent variables (features)
Classification | Phân loại | Predicts categories (churn yes/no, fraud/legit)
Logistic Regression | Hồi quy logistic | Classification model that returns a probability
Decision Tree | Cây quyết định | Model that splits data by rules; easy to interpret
Feature Importance | Độ quan trọng feature | Measures each variable's influence on the prediction
R² Score | Hệ số xác định | Fraction of variance explained by the model (0-1)
RMSE | Sai số bình phương trung bình | Root Mean Squared Error; measures prediction error
Cross-Validation | Kiểm chứng chéo | Evaluates the model across multiple data splits
Confusion Matrix | Ma trận nhầm lẫn | Table comparing predicted vs actual classifications

Checkpoint

Predictive analytics has three main types: Regression (predict a number), Classification (predict a category), and Time Series Forecasting (predict the future). Can you name a business problem suited to regression, and one suited to classification?
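To make the regression/classification distinction concrete, here is a minimal sketch on synthetic data (all numbers are illustrative): a regression model returns a number, while a classification model returns a label plus a probability.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))

# Regression: the target is a continuous number (e.g., revenue in $)
y_amount = 2.0 * X.ravel() + rng.normal(0, 1, 100)
reg = LinearRegression().fit(X, y_amount)
print("Regression predicts a number:", reg.predict([[5.0]])[0])

# Classification: the target is a category (e.g., churned yes/no)
y_label = (X.ravel() > 5).astype(int)
clf = LogisticRegression().fit(X, y_label)
print("Classification predicts a label:", clf.predict([[5.0]])[0])
print("...and a probability:", clf.predict_proba([[5.0]])[0, 1])
```

The same input (a value of 5.0) gives a dollar estimate from the regression and a yes/no decision with an attached probability from the classifier.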

2

📈 Linear Regression


Simple Linear Regression

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Sample data: Marketing spend vs Sales
np.random.seed(42)
n = 100
marketing_spend = np.random.uniform(1000, 10000, n)
noise = np.random.normal(0, 500, n)
sales = 500 + 3.5 * marketing_spend + noise

df = pd.DataFrame({'marketing_spend': marketing_spend, 'sales': sales})

# Prepare & train
X = df[['marketing_spend']]
y = df['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept: ${model.intercept_:.2f}")
print(f"Coefficient: ${model.coef_[0]:.2f} per $1 marketing spend")

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"\nRMSE: ${rmse:.2f}")
print(f"R² Score: {r2:.4f} → Model explains {r2*100:.1f}% of variance")

Multiple Regression

Python
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'marketing_spend': np.random.uniform(1000, 10000, n),
    'store_traffic': np.random.uniform(500, 5000, n),
    'avg_price': np.random.uniform(20, 100, n),
    'num_promotions': np.random.randint(0, 10, n),
    'is_weekend': np.random.choice([0, 1], n)
})

df['sales'] = (500 + 2.5 * df['marketing_spend'] + 1.2 * df['store_traffic']
               - 3.0 * df['avg_price'] + 200 * df['num_promotions']
               + 1500 * df['is_weekend'] + np.random.normal(0, 500, n))

features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[features]
y = df['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_):
    direction = "↑" if coef > 0 else "↓"
    print(f"  {feature}: {coef:.2f} {direction}")

y_pred = model.predict(X_test)
print(f"\nR² Score: {r2_score(y_test, y_pred):.4f}")

Feature Importance (Standardized)

Python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y_train)

importance = pd.DataFrame({
    'feature': features,
    'importance': np.abs(model_scaled.coef_)
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Absolute Standardized Coefficient')
plt.title('Feature Importance')
plt.show()
Standardized Coefficients

Raw coefficients cannot be compared directly because features are on different scales ($1 of marketing ≠ 1 promotion). Standardize before comparing importance.
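A quick sketch of why this matters (synthetic data, illustrative numbers): rescaling a feature, say from dollars to thousands of dollars, changes its raw coefficient by the same factor, while the standardized coefficient stays the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
spend = rng.uniform(1000, 10000, 200)
y = 500 + 3.5 * spend + rng.normal(0, 500, 200)

X_dollars = spend.reshape(-1, 1)              # feature measured in $
X_thousands = (spend / 1000).reshape(-1, 1)   # same feature, in $1000s

c_dollars = LinearRegression().fit(X_dollars, y).coef_[0]
c_thousands = LinearRegression().fit(X_thousands, y).coef_[0]
print(c_thousands / c_dollars)  # 1000.0: raw coefficients depend on units

s_dollars = LinearRegression().fit(
    StandardScaler().fit_transform(X_dollars), y).coef_[0]
s_thousands = LinearRegression().fit(
    StandardScaler().fit_transform(X_thousands), y).coef_[0]
print(s_dollars, s_thousands)   # identical: standardized coefficients are unit-free
```

So a ranking built on raw coefficients silently ranks units, not influence; the standardized version is the one safe to compare.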

Checkpoint

Linear Regression predicts continuous values: R² tells you what percentage of variance the model explains, and RMSE measures prediction error in the original units. What does R² = 0.85 mean in a business context?
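To unpack what R² actually computes, here is the definition worked by hand on a toy example (hypothetical numbers): R² = 1 − SS_res/SS_tot, the share of total variation around the mean that the model accounts for.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 310.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation: 500
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation: 25000
r2_manual = 1 - ss_res / ss_tot

print(r2_manual)                 # 0.98
print(r2_score(y_true, y_pred))  # same value
```

Read 0.98 as "the model accounts for 98% of the variation in the target"; the remaining 2% is noise or factors the model does not capture.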

3

🏷️ Classification (Churn Prediction)


Binary Classification

Python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_spend': np.random.uniform(20, 200, n),
    'support_tickets': np.random.poisson(2, n),
    'last_login_days': np.random.exponential(30, n),
    'contract_type': np.random.choice(['monthly', 'annual'], n)
})

churn_prob = (0.1 - 0.01 * customers['tenure_months']
              - 0.002 * customers['monthly_spend']
              + 0.1 * customers['support_tickets']
              + 0.01 * customers['last_login_days']
              + 0.2 * (customers['contract_type'] == 'monthly'))
churn_prob = 1 / (1 + np.exp(-churn_prob))
customers['churned'] = (np.random.random(n) < churn_prob).astype(int)

print(f"Churn Rate: {customers['churned'].mean()*100:.1f}%")

Train & Evaluate

Python
customers['is_monthly'] = (customers['contract_type'] == 'monthly').astype(int)
features = ['tenure_months', 'monthly_spend', 'support_tickets', 'last_login_days', 'is_monthly']

X = customers[features]
y = customers['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix & Risk Scoring

Python
from sklearn.metrics import ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Stay', 'Churn']).plot(ax=axes[0])
axes[0].set_title('Confusion Matrix')

axes[1].hist(y_prob[y_test == 0], bins=20, alpha=0.5, label='Stayed')
axes[1].hist(y_prob[y_test == 1], bins=20, alpha=0.5, label='Churned')
axes[1].set_xlabel('Predicted Churn Probability')
axes[1].set_title('Probability Distribution')
axes[1].legend()
plt.tight_layout()
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (correctly predicted Stay): {tn}")
print(f"False Positives (wrongly predicted Churn): {fp}")
print(f"False Negatives (missed Churns): {fn}")
print(f"True Positives (correctly predicted Churn): {tp}")

# Risk scoring
test_customers = X_test.copy()
test_customers['churn_probability'] = y_prob
test_customers['risk_score'] = pd.cut(y_prob, bins=[0, 0.3, 0.6, 1.0],
                                      labels=['Low', 'Medium', 'High'])
print("\nChurn Rate by Risk Segment:")
test_customers['actual_churn'] = y_test.values
print(test_customers.groupby('risk_score')['actual_churn'].agg(['count', 'sum', 'mean']).round(3))
Precision vs Recall Trade-off
  • High Precision: few false positives (no wasted retention budget)
  • High Recall: few false negatives (no missed churners)
  • Choose based on business cost: how much does missing one churner cost vs spending a retention offer on one non-churner?
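One way to turn that trade-off into a decision is to put a dollar cost on each error type and sweep the probability threshold. A sketch on synthetic data; the $500/$50 costs are hypothetical placeholders for your own numbers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical costs: a missed churner loses $500 of revenue,
# a retention offer wasted on a loyal customer costs $50.
COST_FN, COST_FP = 500, 50

X, y = make_classification(n_samples=1000, n_features=5, weights=[0.7],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
probs = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Sweep the decision threshold and compare total error cost
for threshold in [0.3, 0.5, 0.7]:
    preds = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_te, preds).ravel()
    print(f"threshold={threshold}: FN={fn}, FP={fp}, "
          f"cost=${fn * COST_FN + fp * COST_FP}")
```

Lowering the threshold trades false negatives for false positives; the cheapest threshold depends entirely on the cost ratio, which is a business input, not a modeling one.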

Checkpoint

Classification returns a probability → convert it into risk scores → prioritize actions for High Risk customers first. False Negative (missed churner) vs False Positive (mislabeled churner): which is more expensive for the business?

4

🌳 Decision Trees


Train & Visualize

Python
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=features, class_names=['Stay', 'Churn'],
          filled=True, rounded=True, fontsize=10)
plt.title('Decision Tree for Churn Prediction')
plt.tight_layout()
plt.show()

y_pred_tree = tree.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree)*100:.1f}%")

Feature Importance

Python
importance = pd.DataFrame({
    'feature': features,
    'importance': tree.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Decision Tree):")
print(importance)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance from Decision Tree')
plt.gca().invert_yaxis()
plt.show()
Decision Tree = Explainable AI

Decision Trees are easy to explain to stakeholders: "If tenure < 12 months AND support_tickets > 3 → 78% chance of churn." This is a major advantage over black-box models.
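Rules like that can be printed directly with `sklearn.tree.export_text`, which renders the fitted tree as nested if/else text. A sketch on synthetic stand-in data; the feature names are illustrative, not the lesson's actual churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the churn data; names are illustrative
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
names = ['tenure_months', 'monthly_spend', 'support_tickets', 'last_login_days']

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
print(export_text(tree, feature_names=names))  # readable if/else rules
```

The text output can be pasted straight into a slide or email, which is often more useful to a business audience than the rendered plot.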

Checkpoint

Decision Trees provide interpretable rules plus feature importance, which is perfect for an analyst who needs to explain a model to the business team. Why is max_depth=3 often better than max_depth=20 for business presentations?
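Besides readability, there is a statistical reason to keep trees shallow: deep trees memorize noise. A sketch on noisy synthetic data (`flip_y=0.2` injects 20% label noise) comparing train vs test accuracy at two depths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: deep trees can memorize the noise
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for depth in [3, 20]:
    t = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    print(f"max_depth={depth}: train={t.score(X_tr, y_tr):.2f}, "
          f"test={t.score(X_te, y_te):.2f}")
```

The deep tree scores near-perfectly on training data but no better (typically worse) on held-out data: the extra depth buys memorization, not generalization, and it also makes the rules unreadable.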

5

📅 Simple Forecasting

TB5 min

Moving Average

Python
dates = pd.date_range('2022-01-01', periods=365*2, freq='D')
np.random.seed(42)

trend = np.linspace(1000, 1500, len(dates))
seasonality = 200 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)
noise = np.random.normal(0, 50, len(dates))
sales = trend + seasonality + noise

ts = pd.DataFrame({'date': dates, 'sales': sales}).set_index('date')

ts['MA_7'] = ts['sales'].rolling(7).mean()
ts['MA_30'] = ts['sales'].rolling(30).mean()

last_ma30 = ts['MA_30'].iloc[-1]
print(f"30-day MA Forecast: ${last_ma30:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index[-90:], ts['sales'][-90:], alpha=0.5, label='Actual')
plt.plot(ts.index[-90:], ts['MA_30'][-90:], label='30-day MA')
plt.axhline(last_ma30, color='r', linestyle='--', label=f'Forecast: ${last_ma30:.0f}')
plt.legend()
plt.title('Sales Forecast using Moving Average')
plt.show()
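One caveat worth demonstrating: a trailing moving average systematically lags a trending series, because the 30-day MA is centered about 15 days in the past. A minimal sketch on a noise-free uptrend (illustrative numbers):

```python
import numpy as np
import pandas as pd

# On a steady $10/day uptrend, the 30-day MA sits ~14.5 days behind,
# so it underestimates the current level by about 10 * 14.5 = 145.
t = np.arange(120)
ts = pd.Series(1000 + 10.0 * t)   # steady uptrend, no noise
ma30 = ts.rolling(30).mean()

print(f"Latest actual value: {ts.iloc[-1]:.0f}")   # 2190
print(f"Latest 30-day MA:    {ma30.iloc[-1]:.0f}")  # 2045, biased low by 145
```

This is why a flat MA forecast is only reasonable for short horizons on roughly stationary series; on trending data it is biased from day one.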

Trend Extrapolation

Python
from scipy import stats

x = np.arange(len(ts))
y = ts['sales'].values
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

future_x = np.arange(len(ts), len(ts) + 30)
forecast = intercept + slope * future_x

print(f"Trend: ${slope:.2f} per day")
print(f"30-day Forecast Range: ${forecast[0]:.2f} to ${forecast[-1]:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index, ts['sales'], alpha=0.5, label='Historical')
future_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=30)
plt.plot(future_dates, forecast, 'r--', linewidth=2, label='Forecast')
plt.legend()
plt.title('Sales Forecast using Trend Extrapolation')
plt.show()

Seasonal Forecast

Python
def seasonal_forecast(ts, periods=30):
    """Simple seasonal forecast: linear trend scaled by monthly factors."""
    ts['month'] = ts.index.month
    monthly_avg = ts.groupby('month')['sales'].mean()

    recent = ts.iloc[-365:]
    x = np.arange(len(recent))
    y = recent['sales'].values
    slope, intercept, _, _, _ = stats.linregress(x, y)

    forecast_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=periods)
    forecasts = []
    for i, date in enumerate(forecast_dates):
        base = intercept + slope * (len(recent) + i)
        seasonal_factor = monthly_avg[date.month] / ts['sales'].mean()
        forecasts.append({'date': date, 'forecast': base * seasonal_factor})

    return pd.DataFrame(forecasts)

forecast_df = seasonal_forecast(ts, 30)
print("Seasonal Forecast (first 10 days):")
print(forecast_df.head(10))

Checkpoint

Simple forecasting: Moving Average for the short term, Trend Extrapolation for the long term, Seasonal Adjustment for cyclical patterns. When will a moving-average forecast be badly biased?

6

📏 Model Evaluation


Cross-Validation

Python
from sklearn.model_selection import cross_val_score

# Regression CV — use the regression features with the sales DataFrame `df`.
# (At this point `features` holds the churn feature list, so `df[features]`
# would raise a KeyError; name the two lists explicitly.)
reg_features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[reg_features]
y = df['sales']
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Regression Cross-Validation (R²):")
print(f"  Scores: {scores.round(4)}")
print(f"  Mean: {scores.mean():.4f} ± {scores.std():.4f}")

# Classification CV — churn features with the customers DataFrame
clf_features = ['tenure_months', 'monthly_spend', 'support_tickets', 'last_login_days', 'is_monthly']
clf = LogisticRegression(random_state=42)
X_clf = customers[clf_features]
y_clf = customers['churned']
scores = cross_val_score(clf, X_clf, y_clf, cv=5, scoring='accuracy')
print(f"\nClassification Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

Metrics Explained

Python
def explain_metrics(y_true, y_pred, y_prob=None):
    """Explain classification metrics in business terms."""
    from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                                 f1_score, roc_auc_score)

    print("=" * 50)
    print("CLASSIFICATION METRICS EXPLAINED")
    print("=" * 50)

    accuracy = accuracy_score(y_true, y_pred)
    print(f"\n📊 Accuracy: {accuracy:.2%}")
    print(f"  → {accuracy:.0%} of all predictions are correct")

    precision = precision_score(y_true, y_pred)
    print(f"\n🎯 Precision: {precision:.2%}")
    print(f"  → When we predict Churn, we're right {precision:.0%} of the time")

    recall = recall_score(y_true, y_pred)
    print(f"\n🔍 Recall: {recall:.2%}")
    print(f"  → We catch {recall:.0%} of actual Churns")

    f1 = f1_score(y_true, y_pred)
    print(f"\n⚖️ F1 Score: {f1:.2%}")
    print(f"  → Balanced measure of precision and recall")

    if y_prob is not None:
        auc = roc_auc_score(y_true, y_prob)
        print(f"\n📈 AUC-ROC: {auc:.2%}")
        print(f"  → Model's ability to distinguish classes (0.5=random, 1.0=perfect)")

explain_metrics(y_test, y_pred, y_prob)
Choosing a metric by business context
  • E-commerce churn: focus on Recall (don't miss churners)
  • Fraud detection: focus on Precision (don't block legitimate customers by mistake)
  • General purpose: F1 Score balances both
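When precision and recall pull in different directions, scikit-learn's `fbeta_score` lets you encode the preference explicitly: beta > 1 weights recall more (the churn case), beta < 1 weights precision more (the fraud case). A small sketch with toy labels:

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(precision_score(y_true, y_pred))       # 2/3: right on 2 of 3 predicted positives
print(recall_score(y_true, y_pred))          # 0.5: caught 2 of 4 actual positives

# F-beta generalizes F1: beta=2 favors recall, beta=0.5 favors precision
print(fbeta_score(y_true, y_pred, beta=2))   # ≈ 0.526
print(fbeta_score(y_true, y_pred, beta=0.5)) # 0.625
```

The same predictions score differently under the two betas, which is exactly the point: the metric choice encodes which error you care about more.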

Checkpoint

Cross-validation gives a reliable performance estimate; a single train/test split can get "lucky". The CV mean ± std tells you whether the model is stable. Does AUC = 0.72 mean the model is good, or does it need improvement?

7

📋 Summary


What you learned

Topic | Key content
Regression | Predict continuous values, coefficients, R², RMSE
Classification | Predict categories, probabilities, precision/recall
Decision Trees | Visual rules, feature importance, interpretability
Forecasting | Moving average, trend extrapolation, seasonal adjustment
Evaluation | Cross-validation, confusion matrix, AUC-ROC

Self-check questions

  1. How do Regression and Classification differ?
  2. Why is cross-validation better than a single train/test split?
  3. What are the strengths of a Decision Tree?
  4. What does AUC-ROC measure?
Complete!

You now have a solid grasp of Predictive Analytics fundamentals: enough to collaborate with Data Scientists, interpret model results, and build simple predictive models for business use cases.

Next lesson: Data Storytelling & Reporting