🎯 Lesson Objectives
- Build a Retention Matrix and visualize it as a heatmap
- Analyze Revenue Cohorts, ARPU, and Customer Lifetime Value (LTV)
- Create Behavioral Cohorts with RFM segmentation
- Write production-ready Cohort Analysis SQL queries
| Item | Details |
|---|---|
| ⏱️ Duration | 2.5 hours |
| 📖 Main topics | Retention, Revenue Cohorts, RFM, LTV |
| 💡 Prerequisites | Pandas, basic Window Functions |
| 🎯 Output | A complete cohort analysis report |
📖 Key Terminology
| Term | Description |
|---|---|
| Cohort | A group of users with a shared trait (usually acquisition time) |
| Retention | % of users who return N periods after acquisition |
| Retention Matrix | Cohort × period table showing retention rates |
| ARPU | Average Revenue Per User — average revenue per user |
| LTV | Customer Lifetime Value — total value a customer generates |
| RFM | Customer segmentation by Recency, Frequency, and Monetary value |
| Churn Rate | % of customers no longer active after N days |
| Reactivation | A customer returning after a dormant period |
| Acquisition Cohort | Cohort based on the first signup/purchase month or week |
| Behavioral Cohort | Cohort based on usage patterns (power users, casual users) |
Checkpoint
Cohort Analysis groups users by acquisition time and tracks their retention over time. RFM segments customers by Recency, Frequency, and Monetary value. LTV captures a customer's long-term value. Cohort Analysis is the #1 tool for understanding retention — every PM and Data Analyst needs it!
🗃️ Data Preparation & Cohort Assignment
Cohort Analysis groups users by a shared trait (usually acquisition time) and tracks their behavior over time. It is a powerful tool for understanding retention, engagement, and customer lifetime value.
Cohort Types
| Cohort Type | Description |
|---|---|
| 📅 Time-based | Acquisition month, first purchase |
| 📊 Behavior-based | Power users vs. casual, high/low spend |
| 🎯 Segment-based | Product category, region, tier |
Create Sample Data
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n_customers = 5000
n_transactions = 50000

# Customer signup dates (spread over 12 months)
customers = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'signup_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
        np.random.randint(0, 365, n_customers), unit='D'
    )
})

# Generate transactions
transactions = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.randint(1, n_customers + 1, n_transactions),
    'transaction_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
        np.random.randint(0, 400, n_transactions), unit='D'
    ),
    'amount': np.random.lognormal(4, 1, n_transactions)
})

# Merge and filter: only keep transactions after signup
transactions = transactions.merge(customers, on='customer_id')
transactions = transactions[
    transactions['transaction_date'] >= transactions['signup_date']
]

print(f"Customers: {len(customers)}")
print(f"Transactions: {len(transactions)}")
```
Calculate Cohort Fields
```python
def assign_cohorts(df, signup_col, activity_col, freq='M'):
    """Assign cohort and period number to each row."""
    df = df.copy()
    df['cohort'] = df[signup_col].dt.to_period(freq)
    df['activity_period'] = df[activity_col].dt.to_period(freq)
    # Subtracting Period columns yields offsets; .n is the number of
    # periods between them (Series.view('int64') is deprecated in pandas 2.x)
    df['period_number'] = (
        df['activity_period'] - df['cohort']
    ).apply(lambda offset: offset.n)
    return df

transactions = assign_cohorts(
    transactions, 'signup_date', 'transaction_date', freq='M'
)
print(transactions[['customer_id', 'cohort', 'activity_period',
                    'period_number']].head(10))
```
Checkpoint
Cohort = signup month. Period number = months elapsed since the cohort month. Always filter transactions >= signup_date. Period 0 is the first month, Period 1 the second, and so on — this is how you measure a customer's "age"!
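The period arithmetic can be verified on toy data: subtracting two monthly `pd.Period` values yields a month offset whose `.n` attribute is the gap in months. A minimal sketch, independent of the dataset above:

```python
import pandas as pd

# Two timestamps three calendar months apart
signup = pd.Timestamp('2024-01-15').to_period('M')    # Period('2024-01', 'M')
activity = pd.Timestamp('2024-04-02').to_period('M')  # Period('2024-04', 'M')

offset = activity - signup  # a MonthEnd offset; .n counts the months
print(offset.n)             # 3 -> this row would get period_number = 3
```

Note that the gap is counted in calendar months, not 30-day windows — a purchase on Jan 31 followed by one on Feb 1 is already period 1.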
📊 Retention Analysis
Build Retention Matrix
```python
def build_retention_matrix(df, value_col='customer_id', agg_func='nunique'):
    """Build a cohort retention matrix."""
    cohort_data = df.groupby(['cohort', 'period_number']).agg({
        value_col: agg_func
    }).reset_index()

    cohort_matrix = cohort_data.pivot(
        index='cohort',
        columns='period_number',
        values=value_col
    )
    return cohort_matrix

# Active users matrix
retention_counts = build_retention_matrix(transactions)
print("Retention Counts:")
print(retention_counts.iloc[:6, :7])

# Calculate retention rate
cohort_sizes = retention_counts.iloc[:, 0]
retention_rate = retention_counts.divide(cohort_sizes, axis=0) * 100

print("\nRetention Rate (%):")
print(retention_rate.iloc[:6, :7].round(1))
```
Visualize Retention Heatmap
```python
def plot_retention_heatmap(retention_matrix, title='Retention Rate (%)'):
    """Plot retention heatmap."""
    plt.figure(figsize=(14, 10))

    annot = retention_matrix.round(1).astype(str) + '%'

    sns.heatmap(
        retention_matrix,
        annot=annot,
        fmt='',
        cmap='YlGnBu',
        vmin=0, vmax=100,
        linewidths=0.5,
        cbar_kws={'label': 'Retention Rate (%)'}
    )

    plt.title(title, fontsize=14)
    plt.xlabel('Months Since First Purchase')
    plt.ylabel('Cohort (Month)')
    plt.tight_layout()
    plt.show()

plot_retention_heatmap(retention_rate.iloc[:12, :12])
```
Retention Curves
```python
def plot_retention_curves(retention_matrix, top_n=6):
    """Plot retention curves for multiple cohorts."""
    plt.figure(figsize=(12, 6))

    for cohort in retention_matrix.index[:top_n]:
        values = retention_matrix.loc[cohort].dropna()
        plt.plot(values.index, values.values, marker='o', label=str(cohort))

    plt.xlabel('Months Since Acquisition')
    plt.ylabel('Retention Rate (%)')
    plt.title('Retention Curves by Cohort')
    plt.legend(title='Cohort')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 105)
    plt.show()

plot_retention_curves(retention_rate)
```
Checkpoint
Retention matrix = cohort × period. The heatmap gives a great overview; retention curves compare cohorts over time. Column 0 is always 100%. A sharp drop from Month 0 to Month 1 is normal. "Good" retention depends on the industry — roughly 90% M1 for SaaS, around 30% M1 for e-commerce!
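The per-cohort curves can also be collapsed into a single "blended" retention curve by averaging each period across cohorts, weighted by cohort size so large cohorts count proportionally more. A minimal sketch with illustrative toy numbers (not the dataset above); with the real `retention_rate` matrix you would use `retention_counts.iloc[:, 0]` as the weights:

```python
import pandas as pd

# Toy retention counts: rows = cohorts, columns = period numbers
counts = pd.DataFrame(
    {0: [100, 200], 1: [40, 90], 2: [20, 50]},
    index=['2024-01', '2024-02']
)
rates = counts.divide(counts[0], axis=0) * 100  # retention rate (%)

# Size-weighted average per period == total active / total cohort size
sizes = counts[0]
blended = rates.mul(sizes, axis=0).sum() / sizes.sum()
print(blended.round(1))  # period 1: (40 + 90) / 300 * 100 = 43.3
```

The weighted form equals total active users divided by total cohort size, which is usually what stakeholders mean by "our M1 retention".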
💰 Revenue Cohorts & LTV
Revenue per Cohort
```python
# Revenue matrix
revenue_matrix = build_retention_matrix(
    transactions, value_col='amount', agg_func='sum'
)
print("Total Revenue by Cohort and Period:")
print(revenue_matrix.iloc[:6, :7].round(0))

# Cumulative revenue
cumulative_revenue = revenue_matrix.cumsum(axis=1)
print("\nCumulative Revenue:")
print(cumulative_revenue.iloc[:6, :7].round(0))
```
Average Revenue per User (ARPU)
```python
cohort_users = transactions.groupby('cohort')['customer_id'].nunique()
arpu_matrix = revenue_matrix.divide(cohort_users, axis=0)

print("ARPU by Cohort and Period:")
print(arpu_matrix.iloc[:6, :7].round(2))

# Cumulative ARPU
cum_arpu = arpu_matrix.cumsum(axis=1)

plt.figure(figsize=(12, 6))
for cohort in cum_arpu.index[:6]:
    values = cum_arpu.loc[cohort].dropna()
    plt.plot(values.index, values.values, marker='o', label=str(cohort))

plt.xlabel('Months Since Acquisition')
plt.ylabel('Cumulative ARPU ($)')
plt.title('Cumulative ARPU by Cohort')
plt.legend(title='Cohort')
plt.grid(True, alpha=0.3)
plt.show()
```
Customer Lifetime Value (LTV)
```python
def calculate_ltv_by_cohort(df, periods=12):
    """Calculate LTV for each cohort."""
    results = []
    for cohort in df['cohort'].unique():
        cohort_data = df[df['cohort'] == cohort]
        cohort_users = cohort_data['customer_id'].nunique()

        period_revenue = cohort_data[
            cohort_data['period_number'] < periods
        ].groupby('period_number')['amount'].sum()

        total_revenue = period_revenue.sum()
        ltv = total_revenue / cohort_users

        results.append({
            'cohort': cohort,
            'users': cohort_users,
            'total_revenue': total_revenue,
            'ltv': ltv
        })
    # Sort so the bar chart below is in chronological order
    return pd.DataFrame(results).sort_values('cohort')

ltv_df = calculate_ltv_by_cohort(transactions, periods=12)
print("LTV by Cohort:")
print(ltv_df.head(10))

# Plot LTV trend
plt.figure(figsize=(12, 5))
plt.bar(ltv_df['cohort'].astype(str), ltv_df['ltv'])
plt.xlabel('Cohort')
plt.ylabel('12-Month LTV ($)')
plt.title('Customer Lifetime Value by Acquisition Cohort')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
Checkpoint
The revenue matrix shows how much revenue each cohort generates per period. ARPU = Revenue / Users. Cumulative ARPU converges toward LTV. LTV per cohort reveals acquisition quality over time. If LTV is trending down across newer cohorts, that is a signal to improve the product or acquisition quality!
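For cohorts too young to have 12 months of history, a common back-of-envelope projection (an addition of this sketch, not computed in the lesson) assumes a constant per-period churn rate c, giving an expected lifetime of 1/c periods:

```python
def project_ltv(arpu_per_month, monthly_churn_rate):
    """Project LTV assuming geometric (constant-churn) customer lifetimes.

    Expected lifetime = 1 / churn, so LTV ≈ ARPU per month × (1 / churn).
    """
    return arpu_per_month / monthly_churn_rate

# $25 ARPU per month at 5% monthly churn -> expected lifetime 20 months
print(project_ltv(25.0, 0.05))  # 500.0
```

This is only a rough heuristic: real churn is rarely constant (it usually falls as cohorts age), so treat the result as an upper-level estimate and prefer the empirical cumulative-ARPU curve when enough history exists.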
👥 Behavioral Cohorts & RFM
Purchase Frequency Segments
```python
customer_freq = transactions.groupby('customer_id').agg({
    'transaction_id': 'count',
    'amount': 'sum',
    'signup_date': 'first'
}).rename(columns={'transaction_id': 'frequency', 'amount': 'total_spent'})

# Create frequency segments
customer_freq['freq_segment'] = pd.cut(
    customer_freq['frequency'],
    bins=[0, 1, 3, 10, float('inf')],
    labels=['One-time', 'Occasional', 'Regular', 'Power User']
)

print("Customer Segments:")
print(customer_freq['freq_segment'].value_counts())
```
RFM Analysis
```python
reference_date = transactions['transaction_date'].max() + pd.Timedelta(days=1)

rfm = transactions.groupby('customer_id').agg({
    'transaction_date': lambda x: (reference_date - x.max()).days,  # Recency
    'transaction_id': 'count',                                      # Frequency
    'amount': 'sum'                                                 # Monetary
}).rename(columns={
    'transaction_date': 'recency',
    'transaction_id': 'frequency',
    'amount': 'monetary'
})

# Create RFM scores (1-5)
rfm['R_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Simplified segments
def rfm_segment(row):
    if row['R_score'] >= 4 and row['F_score'] >= 4:
        return 'Champions'
    elif row['R_score'] >= 4 and row['F_score'] < 3:
        return 'New Customers'
    elif row['R_score'] < 3 and row['F_score'] >= 4:
        return 'At Risk'
    elif row['R_score'] < 2:
        return 'Lost'
    else:
        return 'Regular'

rfm['segment'] = rfm.apply(rfm_segment, axis=1)
print("RFM Segments:")
print(rfm['segment'].value_counts())
```
Channel Cohorts
```python
channels = ['Organic', 'Paid Search', 'Social', 'Email', 'Referral']
customers['channel'] = np.random.choice(
    channels, len(customers), p=[0.3, 0.25, 0.2, 0.15, 0.1]
)

transactions = transactions.merge(
    customers[['customer_id', 'channel']], on='customer_id', how='left'
)

# Channel performance
channel_analysis = transactions.groupby('channel').agg({
    'customer_id': 'nunique',
    'transaction_id': 'count',
    'amount': ['sum', 'mean']
}).round(2)
channel_analysis.columns = ['customers', 'transactions', 'total_revenue', 'avg_order']
print("Channel Performance:")
print(channel_analysis)
```
Checkpoint
RFM scores segment customers: Champions (R↑ F↑), At Risk (R↓ F↑), New (R↑ F↓), Lost (R↓ F↓). Channel cohorts compare acquisition sources; purchase frequency provides behavioral insights. Champions are only ~20% of users but often bring in ~80% of revenue — focus on retaining them!
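The 80/20 claim is easy to sanity-check: compute each segment's share of total monetary value. A minimal sketch with toy numbers (with the `rfm` frame above you would group its `monetary` column by `segment` instead):

```python
import pandas as pd

# Toy per-customer spend with precomputed segments
spend = pd.DataFrame({
    'segment': ['Champions', 'Champions', 'Regular', 'Lost', 'Regular'],
    'monetary': [900.0, 700.0, 150.0, 50.0, 200.0],
})

# Revenue share per segment
share = spend.groupby('segment')['monetary'].sum() / spend['monetary'].sum()
print(share.round(3))  # Champions: 1600 / 2000 = 0.8
```

If Champions' share is far below expectations, the segment thresholds (or the 80/20 heuristic itself) may not fit your business.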
💾 Cohort Analysis SQL
Basic Retention Query
```sql
WITH customer_cohorts AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),
transactions_with_cohort AS (
    SELECT
        o.customer_id, c.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
        EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period_number
    FROM orders o
    JOIN customer_cohorts c ON o.customer_id = c.customer_id
),
cohort_data AS (
    SELECT cohort_month, period_number,
           COUNT(DISTINCT customer_id) AS active_users
    FROM transactions_with_cohort
    GROUP BY cohort_month, period_number
),
cohort_sizes AS (
    SELECT cohort_month, active_users AS cohort_size
    FROM cohort_data WHERE period_number = 0
)
SELECT
    cd.cohort_month, cd.period_number,
    cd.active_users, cs.cohort_size,
    ROUND(cd.active_users * 100.0 / cs.cohort_size, 2) AS retention_rate
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.period_number;
```
Revenue Cohort Query
```sql
WITH customer_cohorts AS (
    SELECT customer_id,
           DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders GROUP BY customer_id
),
cohort_revenue AS (
    SELECT
        c.cohort_month,
        EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
        EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period,
        COUNT(DISTINCT o.customer_id) AS customers,
        SUM(o.amount) AS revenue
    FROM orders o
    JOIN customer_cohorts c ON o.customer_id = c.customer_id
    GROUP BY c.cohort_month, period
)
SELECT cohort_month, period, customers, revenue,
       ROUND(revenue / customers, 2) AS arpu,
       SUM(revenue) OVER (PARTITION BY cohort_month ORDER BY period) AS cumulative_revenue
FROM cohort_revenue
WHERE period <= 12
ORDER BY cohort_month, period;
```
Pivot Retention Matrix
```sql
SELECT
    cohort_month,
    MAX(CASE WHEN period_number = 0 THEN retention_rate END) AS m0,
    MAX(CASE WHEN period_number = 1 THEN retention_rate END) AS m1,
    MAX(CASE WHEN period_number = 2 THEN retention_rate END) AS m2,
    MAX(CASE WHEN period_number = 3 THEN retention_rate END) AS m3,
    MAX(CASE WHEN period_number = 4 THEN retention_rate END) AS m4,
    MAX(CASE WHEN period_number = 5 THEN retention_rate END) AS m5,
    MAX(CASE WHEN period_number = 6 THEN retention_rate END) AS m6
FROM retention_data
GROUP BY cohort_month
ORDER BY cohort_month;
```
Checkpoint
SQL cohort analysis chains CTEs: customer_cohorts → transactions_with_cohort → cohort_data → retention rates. Pivot to matrix format with CASE WHEN. SQL-based cohort analysis runs directly on the database — efficient for big data, with no need to pull everything into Python!
📊 Advanced Metrics — Churn & Reactivation
Churn Analysis
```python
def calculate_churn(df, churn_days=90):
    """Calculate churn rate by cohort."""
    reference_date = df['transaction_date'].max()

    last_activity = df.groupby('customer_id').agg({
        'transaction_date': 'max',
        'cohort': 'first'
    })

    last_activity['days_inactive'] = (
        reference_date - last_activity['transaction_date']
    ).dt.days
    last_activity['is_churned'] = last_activity['days_inactive'] > churn_days

    churn_by_cohort = last_activity.groupby('cohort').agg({
        'is_churned': ['sum', 'count', 'mean']
    })
    churn_by_cohort.columns = ['churned', 'total', 'churn_rate']
    churn_by_cohort['churn_rate'] = (churn_by_cohort['churn_rate'] * 100).round(2)
    return churn_by_cohort

churn_df = calculate_churn(transactions, churn_days=60)
print("Churn Rate by Cohort:")
print(churn_df)
```
Reactivation Analysis
```python
def analyze_reactivation(df, dormant_days=60):
    """Analyze customer reactivation."""
    df_sorted = df.sort_values(['customer_id', 'transaction_date']).copy()
    df_sorted['prev_transaction'] = (
        df_sorted.groupby('customer_id')['transaction_date'].shift()
    )
    df_sorted['days_gap'] = (
        df_sorted['transaction_date'] - df_sorted['prev_transaction']
    ).dt.days
    df_sorted['is_reactivation'] = df_sorted['days_gap'] > dormant_days
    # Materialize the grouping key as a column so the filtered frame
    # groups cleanly without relying on index alignment
    df_sorted['activity_month'] = df_sorted['transaction_date'].dt.to_period('M')

    reactivations = df_sorted[df_sorted['is_reactivation']].groupby(
        'activity_month'
    ).agg({
        'customer_id': 'nunique',
        'amount': 'sum'
    }).rename(columns={
        'customer_id': 'reactivated_customers',
        'amount': 'reactivation_revenue'
    })
    return reactivations

reactivation_df = analyze_reactivation(transactions)
print("Reactivation Analysis:")
print(reactivation_df.tail(10))
```
Cohort Comparison (Statistical)
```python
def compare_cohorts(df, cohort1, cohort2):
    """Statistical comparison of per-customer revenue between two cohorts."""
    from scipy import stats

    # Convert to pd.Period so the comparison matches the cohort dtype
    c1 = pd.Period(cohort1, freq='M')
    c2 = pd.Period(cohort2, freq='M')
    c1_vals = df[df['cohort'] == c1].groupby('customer_id')['amount'].sum()
    c2_vals = df[df['cohort'] == c2].groupby('customer_id')['amount'].sum()

    t_stat, p_value = stats.ttest_ind(c1_vals, c2_vals)

    print(f"\nCohort Comparison: {cohort1} vs {cohort2}")
    print(f"Mean Revenue — {cohort1}: ${c1_vals.mean():.2f}")
    print(f"Mean Revenue — {cohort2}: ${c2_vals.mean():.2f}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("✓ Statistically significant difference")
    else:
        print("✗ No significant difference")

compare_cohorts(transactions, '2024-01', '2024-06')
```
Checkpoint
Churn = inactive for more than N days. Reactivation = a gap longer than dormant_days between transactions. A statistical comparison (t-test) tells you whether cohort differences are significant. The churn definition depends on the business: ~30 days for SaaS, ~90 days for e-commerce. Choose a churn_days that fits your business model!
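One caveat on the t-test: per-customer revenue is heavily right-skewed (lognormal in the data above), which weakens the t-test's normality assumptions for small cohorts. A distribution-free complement is a bootstrap confidence interval on the difference in means. This sketch is an addition to the lesson, using synthetic stand-ins for per-customer revenue:

```python
import numpy as np

rng = np.random.default_rng(0)
cohort_a = rng.lognormal(4.0, 1.0, 400)  # synthetic per-customer revenue
cohort_b = rng.lognormal(4.2, 1.0, 400)

def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    r = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each cohort with replacement and record the mean gap
        diffs[i] = r.choice(b, b.size).mean() - r.choice(a, a.size).mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_diff_ci(cohort_a, cohort_b)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
# A CI that excludes 0 indicates a significant difference
```

Unlike the t-test, the bootstrap makes no assumption about the revenue distribution's shape, at the cost of a little compute.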
📋 Summary
What You Learned
| Topic | Key Content |
|---|---|
| Cohort Types | Time-based, Behavioral, Segment-based |
| Data Preparation | assign_cohorts(), period_number calculation |
| Retention | Retention matrix, heatmap, retention curves |
| Revenue Cohorts | Revenue matrix, ARPU, cumulative ARPU |
| LTV | Customer Lifetime Value by cohort |
| Behavioral | Purchase frequency, RFM segmentation |
| Channel Cohorts | Acquisition channel comparison |
| SQL Queries | CTE-based retention, revenue, pivot queries |
| Advanced Metrics | Churn rate, reactivation, statistical comparison |
Key Takeaways
- ✅ The retention matrix is the core output — visualize it as a heatmap
- ✅ RFM segmentation yields actionable customer segments
- ✅ LTV by cohort evaluates acquisition quality over time
- ✅ SQL-based cohort analysis scales well for big data
- ✅ The churn definition must fit the business model
- ✅ Statistical tests enable rigorous cohort comparison
Self-Check Questions
- How does a time-based cohort differ from a behavioral cohort?
- How is a retention matrix built?
- How does RFM segmentation divide customers?
- What does LTV by cohort evaluate?
Next lesson: Funnel Analysis →
🎉 Great work! You have mastered Cohort Analysis!
Remember: Retention matrix + RFM + LTV is the essential trio for Customer Analytics. Businesses always need these insights!
