Predictive Analytics Basics
🎯 Lesson Objectives
- Build Simple and Multiple Linear Regression models
- Implement Binary Classification for churn prediction
- Use Decision Trees and interpret feature importance
- Perform simple forecasting with moving averages and trend extrapolation
- Evaluate model performance with cross-validation and proper metrics
- Explain results to non-technical stakeholders
⏱️ Duration: 2.5 hours | 📊 Level: Advanced | 🛠️ Tools: Python, scikit-learn
📖 Key Terms
| Term | Vietnamese | Description |
|---|---|---|
| Linear Regression | Hồi quy tuyến tính | Predicts a continuous value from a linear relationship |
| Multiple Regression | Hồi quy đa biến | Regression with multiple independent variables (features) |
| Classification | Phân loại | Predicts categories (churn yes/no, fraud/legit) |
| Logistic Regression | Hồi quy logistic | Classification model that returns a probability |
| Decision Tree | Cây quyết định | Model that splits data by rules; easy to interpret |
| Feature Importance | Độ quan trọng feature | Measures each variable's influence on the prediction |
| R² Score | Hệ số xác định | Fraction of variance explained by the model (0-1) |
| RMSE | Căn bậc hai sai số bình phương trung bình | Root Mean Squared Error — measures prediction error |
| Cross-Validation | Kiểm chứng chéo | Evaluates a model across multiple data splits |
| Confusion Matrix | Ma trận nhầm lẫn | Table comparing predicted vs actual classifications |
Checkpoint
Predictive analytics has three main types: Regression (predict a number), Classification (predict a category), and Time Series Forecasting (predict the future). Can you give an example of a business problem suited to regression vs classification?
📈 Linear Regression
Simple Linear Regression
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Sample data: Marketing spend vs Sales
np.random.seed(42)
n = 100
marketing_spend = np.random.uniform(1000, 10000, n)
noise = np.random.normal(0, 500, n)
sales = 500 + 3.5 * marketing_spend + noise

df = pd.DataFrame({'marketing_spend': marketing_spend, 'sales': sales})

# Prepare & train
X = df[['marketing_spend']]
y = df['sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(f"Intercept: ${model.intercept_:.2f}")
print(f"Coefficient: ${model.coef_[0]:.2f} per $1 marketing spend")

# Evaluate
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"\nRMSE: ${rmse:.2f}")
print(f"R² Score: {r2:.4f} → Model explains {r2*100:.1f}% of variance")
```
Multiple Regression
```python
np.random.seed(42)
n = 200

df = pd.DataFrame({
    'marketing_spend': np.random.uniform(1000, 10000, n),
    'store_traffic': np.random.uniform(500, 5000, n),
    'avg_price': np.random.uniform(20, 100, n),
    'num_promotions': np.random.randint(0, 10, n),
    'is_weekend': np.random.choice([0, 1], n)
})

df['sales'] = (500 + 2.5 * df['marketing_spend'] + 1.2 * df['store_traffic']
               - 3.0 * df['avg_price'] + 200 * df['num_promotions']
               + 1500 * df['is_weekend'] + np.random.normal(0, 500, n))

features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[features]
y = df['sales']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

print("Feature Coefficients:")
for feature, coef in zip(features, model.coef_):
    direction = "↑" if coef > 0 else "↓"
    print(f"  {feature}: {coef:.2f} {direction}")

y_pred = model.predict(X_test)
print(f"\nR² Score: {r2_score(y_test, y_pred):.4f}")
```
Feature Importance (Standardized)
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y_train)

importance = pd.DataFrame({
    'feature': features,
    'importance': np.abs(model_scaled.coef_)
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Absolute Standardized Coefficient')
plt.title('Feature Importance')
plt.show()
```
Raw coefficients are not comparable because features are on different scales ($1 of marketing ≠ 1 promotion). Standardize before comparing importance.
Checkpoint
Linear Regression predicts continuous values — R² tells you what share of the variance the model explains, while RMSE measures prediction error in the original units. What does R² = 0.85 mean in a business context?
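To connect R² back to its definition, a minimal sketch (using tiny illustrative arrays, not the lesson's data) computes it by hand as 1 − SS_res/SS_tot and checks the result against scikit-learn:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.6])

# R² = 1 - (sum of squared residuals / total sum of squares around the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

assert np.isclose(r2_manual, r2_score(y_true, y_pred))
print(f"R²: {r2_manual:.4f}")  # → 0.9850
```

In business terms, R² = 0.985 here means the predictions account for 98.5% of the variation in the outcome; the remaining 1.5% is unexplained noise.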
🏷️ Classification (Churn Prediction)
Binary Classification
```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

np.random.seed(42)
n = 1000

customers = pd.DataFrame({
    'tenure_months': np.random.randint(1, 72, n),
    'monthly_spend': np.random.uniform(20, 200, n),
    'support_tickets': np.random.poisson(2, n),
    'last_login_days': np.random.exponential(30, n),
    'contract_type': np.random.choice(['monthly', 'annual'], n)
})

churn_prob = (0.1 - 0.01 * customers['tenure_months']
              - 0.002 * customers['monthly_spend']
              + 0.1 * customers['support_tickets']
              + 0.01 * customers['last_login_days']
              + 0.2 * (customers['contract_type'] == 'monthly'))
churn_prob = 1 / (1 + np.exp(-churn_prob))
customers['churned'] = (np.random.random(n) < churn_prob).astype(int)

print(f"Churn Rate: {customers['churned'].mean()*100:.1f}%")
```
Train & Evaluate
```python
customers['is_monthly'] = (customers['contract_type'] == 'monthly').astype(int)
features = ['tenure_months', 'monthly_spend', 'support_tickets', 'last_login_days', 'is_monthly']

X = customers[features]
y = customers['churned']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred))
```
Confusion Matrix & Risk Scoring
```python
from sklearn.metrics import ConfusionMatrixDisplay

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=['Stay', 'Churn']).plot(ax=axes[0])
axes[0].set_title('Confusion Matrix')

axes[1].hist(y_prob[y_test == 0], bins=20, alpha=0.5, label='Stayed')
axes[1].hist(y_prob[y_test == 1], bins=20, alpha=0.5, label='Churned')
axes[1].set_xlabel('Predicted Churn Probability')
axes[1].set_title('Probability Distribution')
axes[1].legend()
plt.tight_layout()
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"True Negatives (correctly predicted Stay): {tn}")
print(f"False Positives (wrongly predicted Churn): {fp}")
print(f"False Negatives (missed Churns): {fn}")
print(f"True Positives (correctly predicted Churn): {tp}")

# Risk scoring
test_customers = X_test.copy()
test_customers['churn_probability'] = y_prob
test_customers['risk_score'] = pd.cut(y_prob, bins=[0, 0.3, 0.6, 1.0],
                                      labels=['Low', 'Medium', 'High'])
print("\nChurn Rate by Risk Segment:")
test_customers['actual_churn'] = y_test.values
print(test_customers.groupby('risk_score')['actual_churn'].agg(['count', 'sum', 'mean']).round(3))
```
- High Precision: few false positives (retention budget isn't wasted)
- High Recall: few false negatives (churners aren't missed)
- Choose based on business cost: what does missing one churner cost vs spending retention budget on one non-churner?
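One way to make the precision-vs-recall trade-off concrete is to price the confusion matrix. A minimal sketch with assumed costs — the $500 lost lifetime value per missed churner and $50 per wasted retention offer are illustrative numbers, as are the two models' error counts:

```python
# Assumed business costs — illustrative, not from real data
COST_FN = 500  # revenue lost when a churner is missed (false negative)
COST_FP = 50   # retention offer wasted on a loyal customer (false positive)

def total_cost(fp, fn, cost_fp=COST_FP, cost_fn=COST_FN):
    """Total cost of a model's mistakes on a test set."""
    return fp * cost_fp + fn * cost_fn

# Hypothetical confusion-matrix counts for two models
model_a = total_cost(fp=40, fn=10)   # recall-focused: more false alarms
model_b = total_cost(fp=10, fn=35)   # precision-focused: misses more churners
print(f"Model A (recall-focused):    ${model_a:,}")   # → $7,000
print(f"Model B (precision-focused): ${model_b:,}")   # → $18,000
```

Under these assumed costs the recall-focused model is far cheaper; flip the cost ratio (e.g. in fraud blocking, where a false positive angers a customer) and the conclusion flips too.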
Checkpoint
Classification returns a probability → convert it into risk scores → prioritize actions for High Risk customers first. False Negative (missed churner) vs False Positive (mislabeled churner) — which is more costly for the business?
🌳 Decision Trees
Train & Visualize
```python
from sklearn.tree import DecisionTreeClassifier, plot_tree

tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=features, class_names=['Stay', 'Churn'],
          filled=True, rounded=True, fontsize=10)
plt.title('Decision Tree for Churn Prediction')
plt.tight_layout()
plt.show()

y_pred_tree = tree.predict(X_test)
print(f"Decision Tree Accuracy: {accuracy_score(y_test, y_pred_tree)*100:.1f}%")
```
Feature Importance
```python
importance = pd.DataFrame({
    'feature': features,
    'importance': tree.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature Importance (Decision Tree):")
print(importance)

plt.figure(figsize=(10, 5))
plt.barh(importance['feature'], importance['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance from Decision Tree')
plt.gca().invert_yaxis()
plt.show()
```
Decision Trees are easy to explain to stakeholders: "If tenure < 12 months AND support_tickets > 3 → 78% chance of churn." That is a major advantage over black-box models.
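Those if-then rules can also be extracted as plain text rather than a plot, using scikit-learn's `export_text`. A standalone sketch — the tiny synthetic fit and its feature names are illustrative, not the lesson's churn data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Tiny synthetic stand-in: tenure in months, number of support tickets
rng = np.random.default_rng(42)
X_demo = np.column_stack([rng.integers(1, 72, 200), rng.integers(0, 10, 200)])
y_demo = ((X_demo[:, 0] < 12) & (X_demo[:, 1] > 3)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(X_demo, y_demo)

# Text rules — easy to paste into a slide or an email to the business team
rules = export_text(tree, feature_names=['tenure_months', 'support_tickets'])
print(rules)
```

Each line of the output is a threshold split (e.g. `tenure_months <= 11.50`), indented to show the path from root to leaf.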
Checkpoint
Decision Trees provide interpretable rules plus feature importance — perfect for an analyst who needs to explain a model to the business team. Why is max_depth=3 often better than max_depth=20 for business presentations?
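The depth trade-off can be checked empirically: deeper trees memorize training data but their held-out performance plateaus or drops. A sketch on synthetic data (generated with `make_classification`, so it runs standalone; the data is illustrative, not the lesson's churn set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=42)

# Cross-validate trees of increasing depth
for depth in [2, 3, 5, 10, 20]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(tree, X, y, cv=5, scoring='accuracy')
    print(f"max_depth={depth:>2}: CV accuracy {scores.mean():.3f} ± {scores.std():.3f}")
```

A shallow tree is typically within a point or two of the deep one on held-out folds — while still fitting on a single slide.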
📅 Simple Forecasting
Moving Average
```python
dates = pd.date_range('2022-01-01', periods=365*2, freq='D')
np.random.seed(42)

trend = np.linspace(1000, 1500, len(dates))
seasonality = 200 * np.sin(2 * np.pi * np.arange(len(dates)) / 365)
noise = np.random.normal(0, 50, len(dates))
sales = trend + seasonality + noise

ts = pd.DataFrame({'date': dates, 'sales': sales}).set_index('date')

ts['MA_7'] = ts['sales'].rolling(7).mean()
ts['MA_30'] = ts['sales'].rolling(30).mean()

last_ma30 = ts['MA_30'].iloc[-1]
print(f"30-day MA Forecast: ${last_ma30:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index[-90:], ts['sales'][-90:], alpha=0.5, label='Actual')
plt.plot(ts.index[-90:], ts['MA_30'][-90:], label='30-day MA')
plt.axhline(last_ma30, color='r', linestyle='--', label=f'Forecast: ${last_ma30:.0f}')
plt.legend()
plt.title('Sales Forecast using Moving Average')
plt.show()
```
Trend Extrapolation
```python
from scipy import stats

x = np.arange(len(ts))
y = ts['sales'].values
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

future_x = np.arange(len(ts), len(ts) + 30)
forecast = intercept + slope * future_x

print(f"Trend: ${slope:.2f} per day")
print(f"30-day Forecast Range: ${forecast[0]:.2f} to ${forecast[-1]:.2f}")

plt.figure(figsize=(14, 5))
plt.plot(ts.index, ts['sales'], alpha=0.5, label='Historical')
future_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=30)
plt.plot(future_dates, forecast, 'r--', linewidth=2, label='Forecast')
plt.legend()
plt.title('Sales Forecast using Trend Extrapolation')
plt.show()
```
Seasonal Forecast
```python
def seasonal_forecast(ts, periods=30):
    """Simple seasonal forecast: linear trend scaled by monthly averages"""
    ts['month'] = ts.index.month
    monthly_avg = ts.groupby('month')['sales'].mean()

    # Fit a trend line to the most recent year
    recent = ts.iloc[-365:]
    x = np.arange(len(recent))
    y = recent['sales'].values
    slope, intercept, _, _, _ = stats.linregress(x, y)

    forecast_dates = pd.date_range(ts.index[-1] + pd.Timedelta(days=1), periods=periods)
    forecasts = []
    for i, date in enumerate(forecast_dates):
        base = intercept + slope * (len(recent) + i)
        seasonal_factor = monthly_avg[date.month] / ts['sales'].mean()
        forecasts.append({'date': date, 'forecast': base * seasonal_factor})

    return pd.DataFrame(forecasts)

forecast_df = seasonal_forecast(ts, 30)
print("Seasonal Forecast (first 10 days):")
print(forecast_df.head(10))
```
Checkpoint
Simple forecasting — Moving Average for short-term, Trend Extrapolation for long-term, Seasonal Adjustment for cyclical patterns. When will a moving-average forecast be badly off?
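The checkpoint's answer can be shown numerically: on a steadily trending series, a flat moving-average forecast lags the true level, because the window averages over older, lower values. A sketch on a noise-free synthetic trend (the slope of 10 units/day is an illustrative assumption):

```python
import numpy as np
import pandas as pd

# Series rising exactly 10 units per day, no noise
sales = pd.Series(np.arange(100, dtype=float) * 10)

window = 30
ma_forecast = sales.rolling(window).mean().iloc[-1]   # flat MA "forecast"
actual_next = sales.iloc[-1] + 10                     # true next value

lag = actual_next - ma_forecast
print(f"MA forecast: {ma_forecast:.0f}, actual next: {actual_next:.0f}, gap: {lag:.0f}")
# Gap = slope * (window - 1) / 2 + slope = 10 * 29/2 + 10 = 155
```

So the stronger the trend and the longer the window, the bigger the systematic underestimate — exactly the case where moving averages fail.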
📏 Model Evaluation
Cross-Validation
```python
from sklearn.model_selection import cross_val_score

# Regression CV — note: `features` was redefined for the churn data above,
# so spell out the regression feature list again
reg_features = ['marketing_spend', 'store_traffic', 'avg_price', 'num_promotions', 'is_weekend']
X = df[reg_features]
y = df['sales']
model = LinearRegression()
scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print("Regression Cross-Validation (R²):")
print(f"  Scores: {scores.round(4)}")
print(f"  Mean: {scores.mean():.4f} ± {scores.std():.4f}")

# Classification CV
clf = LogisticRegression(random_state=42)
X_clf = customers[features]
y_clf = customers['churned']
scores = cross_val_score(clf, X_clf, y_clf, cv=5, scoring='accuracy')
print(f"\nClassification Accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```
Metrics Explained
```python
def explain_metrics(y_true, y_pred, y_prob=None):
    """Explain classification metrics in business terms"""
    from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

    print("=" * 50)
    print("CLASSIFICATION METRICS EXPLAINED")
    print("=" * 50)

    accuracy = accuracy_score(y_true, y_pred)
    print(f"\n📊 Accuracy: {accuracy:.2%}")
    print(f"   → {accuracy:.0%} of all predictions are correct")

    precision = precision_score(y_true, y_pred)
    print(f"\n🎯 Precision: {precision:.2%}")
    print(f"   → When we predict Churn, we're right {precision:.0%} of the time")

    recall = recall_score(y_true, y_pred)
    print(f"\n🔍 Recall: {recall:.2%}")
    print(f"   → We catch {recall:.0%} of actual Churns")

    f1 = f1_score(y_true, y_pred)
    print(f"\n⚖️ F1 Score: {f1:.2%}")
    print(f"   → Balanced measure of precision and recall")

    if y_prob is not None:
        auc = roc_auc_score(y_true, y_prob)
        print(f"\n📈 AUC-ROC: {auc:.2%}")
        print(f"   → Model's ability to distinguish classes (0.5=random, 1.0=perfect)")

explain_metrics(y_test, y_pred, y_prob)
```
- E-commerce churn: focus on Recall (don't miss churners)
- Fraud detection: focus on Precision (don't block legitimate customers by mistake)
- General: F1 Score balances both
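Shifting the emphasis between recall and precision doesn't require a new model — moving the decision threshold away from the default 0.5 changes the balance. A standalone sketch on synthetic imbalanced data (none of the lesson's variables are reused):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~20% positive class, like a churn problem
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.8, 0.2],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

clf = LogisticRegression(random_state=42).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lower threshold → more predicted positives → higher recall, lower precision
for threshold in [0.3, 0.5, 0.7]:
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_te, pred, zero_division=0)
    r = recall_score(y_te, pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

For a churn campaign you might pick the 0.3 threshold (catch more churners, accept some wasted offers); for fraud blocking, the 0.7 one.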
Checkpoint
Cross-validation gives a reliable performance estimate — a single train/test split can get "lucky". The CV mean ± std tells you whether the model is stable. Does AUC = 0.72 mean the model is good, or does it need improvement?
📋 Summary
What you've learned
| Topic | Key Points |
|---|---|
| Regression | Predict continuous values, coefficients, R², RMSE |
| Classification | Predict categories, probability, precision/recall |
| Decision Trees | Visual rules, feature importance, interpretable |
| Forecasting | Moving average, trend extrapolation, seasonal adjustment |
| Evaluation | Cross-validation, confusion matrix, AUC-ROC |
Self-check questions
- How do Regression and Classification differ?
- Why is cross-validation better than a single train/test split?
- What are the advantages of a Decision Tree?
- What does AUC-ROC measure?
You now have a solid grasp of Predictive Analytics fundamentals — enough to collaborate with Data Scientists, interpret model results, and build simple predictive models for business use cases.
Next lesson: Data Storytelling & Reporting
