MinAI - Về trang chủ
Hướng dẫn
8/132.5 giờ
Đang tải...

Cohort Analysis

Phân tích hành vi khách hàng theo thời gian và nhóm

0

🎯 Mục tiêu bài học

TB5 min
Sau bài học này, bạn sẽ:
  • Xây dựng Retention Matrix và visualize heatmap
  • Phân tích Revenue Cohorts, ARPU, và Customer Lifetime Value (LTV)
  • Tạo Behavioral Cohorts với RFM segmentation
  • Viết Cohort Analysis SQL queries cho production
📋 Thông tin bài học
Thông tinChi tiết
⏱️ Thời lượng2.5 giờ
📖 Chủ đề chínhRetention, Revenue Cohorts, RFM, LTV
💡 Kiến thức cần cóPandas, Window Functions cơ bản
🎯 OutputCohort analysis report hoàn chỉnh
1

📖 Thuật ngữ quan trọng

TB5 min
Thuật ngữTiếng ViệtMô tả
CohortNhóm đồng nhấtNhóm users có đặc điểm chung (thường là thời điểm acquisition)
RetentionTỷ lệ giữ chân% users quay lại sau N periods kể từ acquisition
Retention MatrixMa trận retentionBảng cohort × period hiển thị retention rates
ARPUDoanh thu/người dùngAverage Revenue Per User — trung bình revenue mỗi user
LTVGiá trị trọn đờiCustomer Lifetime Value — tổng giá trị khách hàng mang lại
RFMRecency-Frequency-MonetaryPhân khúc KH dựa trên gần đây, tần suất, giá trị
Churn RateTỷ lệ rời bỏ% khách hàng không còn active sau N ngày
ReactivationTái kích hoạtKhách hàng quay lại sau thời gian dormant
Acquisition CohortNhóm theo thời gianCohort dựa trên tháng/tuần đầu tiên signup/purchase
Behavioral CohortNhóm theo hành viCohort dựa trên patterns sử dụng (power users, casual)

Checkpoint

Cohort Analysis nhóm users theo acquisition time, theo dõi retention qua thời gian. RFM segments dựa trên Recency, Frequency, Monetary. LTV cho biết giá trị dài hạn của khách hàng. Cohort Analysis là công cụ #1 để hiểu retention — mọi PM và Data Analyst đều cần biết!

2

🗃️ Data Preparation & Cohort Assignment

TB5 min
Cohort Analysis là gì?

Cohort Analysis nhóm users theo đặc điểm chung (thường là thời điểm acquisition) và theo dõi hành vi của họ qua thời gian. Đây là công cụ powerful để hiểu retention, engagement, và customer lifetime value.

Cohort Types

Types of Cohorts
Loại CohortMô tả
📅 Time-basedAcquisition month, first purchase
📊 Behavior-basedPower users vs casual, high/low $
🎯 Segment-basedProduct category, region, tier

Create Sample Data

Python
1import pandas as pd
2import numpy as np
3import matplotlib.pyplot as plt
4import seaborn as sns
5
6np.random.seed(42)
7n_customers = 5000
8n_transactions = 50000
9
10# Customer signup dates (spread over 12 months)
11customers = pd.DataFrame({
12 'customer_id': range(1, n_customers + 1),
13 'signup_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
14 np.random.randint(0, 365, n_customers), unit='D'
15 )
16})
17
18# Generate transactions
19transactions = pd.DataFrame({
20 'transaction_id': range(1, n_transactions + 1),
21 'customer_id': np.random.randint(1, n_customers + 1, n_transactions),
22 'transaction_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
23 np.random.randint(0, 400, n_transactions), unit='D'
24 ),
25 'amount': np.random.lognormal(4, 1, n_transactions)
26})
27
28# Merge and filter: only keep transactions after signup
29transactions = transactions.merge(customers, on='customer_id')
30transactions = transactions[
31 transactions['transaction_date'] >= transactions['signup_date']
32]
33
34print(f"Customers: {len(customers)}")
35print(f"Transactions: {len(transactions)}")

Calculate Cohort Fields

Python
1def assign_cohorts(df, signup_col, activity_col, freq='M'):
2 """Assign cohort and period to each row"""
3 df = df.copy()
4 df['cohort'] = df[signup_col].dt.to_period(freq)
5 df['activity_period'] = df[activity_col].dt.to_period(freq)
6 df['period_number'] = (
7 df['activity_period'].view('int64') -
8 df['cohort'].view('int64')
9 )
10 return df
11
12transactions = assign_cohorts(
13 transactions, 'signup_date', 'transaction_date', freq='M'
14)
15print(transactions[['customer_id', 'cohort', 'activity_period', 'period_number']].head(10))

Checkpoint

Cohort = tháng signup. Period number = số tháng kể từ cohort month. Luôn filter transactions >= signup_date. Period 0 = tháng đầu tiên, Period 1 = tháng thứ hai... Đây là cách đo "tuổi" của customer!

3

📊 Retention Analysis

TB5 min

Build Retention Matrix

Python
1def build_retention_matrix(df, value_col='customer_id', agg_func='nunique'):
2 """Build cohort retention matrix"""
3 cohort_data = df.groupby(['cohort', 'period_number']).agg({
4 value_col: agg_func
5 }).reset_index()
6
7 cohort_matrix = cohort_data.pivot(
8 index='cohort',
9 columns='period_number',
10 values=value_col
11 )
12 return cohort_matrix
13
14# Active users matrix
15retention_counts = build_retention_matrix(transactions)
16print("Retention Counts:")
17print(retention_counts.iloc[:6, :7])
18
19# Calculate retention rate
20cohort_sizes = retention_counts.iloc[:, 0]
21retention_rate = retention_counts.divide(cohort_sizes, axis=0) * 100
22
23print("\nRetention Rate (%):")
24print(retention_rate.iloc[:6, :7].round(1))

Visualize Retention Heatmap

Python
1def plot_retention_heatmap(retention_matrix, title='Retention Rate (%)'):
2 """Plot retention heatmap"""
3 plt.figure(figsize=(14, 10))
4
5 annot = retention_matrix.round(1).astype(str) + '%'
6
7 sns.heatmap(
8 retention_matrix,
9 annot=annot,
10 fmt='',
11 cmap='YlGnBu',
12 vmin=0, vmax=100,
13 linewidths=0.5,
14 cbar_kws={'label': 'Retention Rate (%)'}
15 )
16
17 plt.title(title, fontsize=14)
18 plt.xlabel('Months Since First Purchase')
19 plt.ylabel('Cohort (Month)')
20 plt.tight_layout()
21 plt.show()
22
23plot_retention_heatmap(retention_rate.iloc[:12, :12])

Retention Curves

Python
1def plot_retention_curves(retention_matrix, top_n=6):
2 """Plot retention curves for multiple cohorts"""
3 plt.figure(figsize=(12, 6))
4
5 for cohort in retention_matrix.index[:top_n]:
6 values = retention_matrix.loc[cohort].dropna()
7 plt.plot(values.index, values.values, marker='o', label=str(cohort))
8
9 plt.xlabel('Months Since Acquisition')
10 plt.ylabel('Retention Rate (%)')
11 plt.title('Retention Curves by Cohort')
12 plt.legend(title='Cohort')
13 plt.grid(True, alpha=0.3)
14 plt.ylim(0, 105)
15 plt.show()
16
17plot_retention_curves(retention_rate)

Checkpoint

Retention matrix = cohort × period. Heatmap visualization cho overview tuyệt vời. Retention curves so sánh cohorts theo thời gian. Column 0 luôn = 100%. Retention drop mạnh từ Month 0 → Month 1 là bình thường. "Good" retention tùy industry — SaaS ~90% M1, e-commerce ~30% M1!

4

💰 Revenue Cohorts & LTV

TB5 min

Revenue per Cohort

Python
1# Revenue matrix
2revenue_matrix = build_retention_matrix(
3 transactions, value_col='amount', agg_func='sum'
4)
5print("Total Revenue by Cohort and Period:")
6print(revenue_matrix.iloc[:6, :7].round(0))
7
8# Cumulative revenue
9cumulative_revenue = revenue_matrix.cumsum(axis=1)
10print("\nCumulative Revenue:")
11print(cumulative_revenue.iloc[:6, :7].round(0))

Average Revenue per User (ARPU)

Python
1cohort_users = transactions.groupby('cohort')['customer_id'].nunique()
2arpu_matrix = revenue_matrix.divide(cohort_users, axis=0)
3
4print("ARPU by Cohort and Period:")
5print(arpu_matrix.iloc[:6, :7].round(2))
6
7# Cumulative ARPU
8cum_arpu = arpu_matrix.cumsum(axis=1)
9
10plt.figure(figsize=(12, 6))
11for cohort in cum_arpu.index[:6]:
12 values = cum_arpu.loc[cohort].dropna()
13 plt.plot(values.index, values.values, marker='o', label=str(cohort))
14
15plt.xlabel('Months Since Acquisition')
16plt.ylabel('Cumulative ARPU ($)')
17plt.title('Cumulative ARPU by Cohort')
18plt.legend(title='Cohort')
19plt.grid(True, alpha=0.3)
20plt.show()

Customer Lifetime Value (LTV)

Python
1def calculate_ltv_by_cohort(df, periods=12):
2 """Calculate LTV for each cohort"""
3 results = []
4 for cohort in df['cohort'].unique():
5 cohort_data = df[df['cohort'] == cohort]
6 cohort_users = cohort_data['customer_id'].nunique()
7
8 period_revenue = cohort_data[
9 cohort_data['period_number'] < periods
10 ].groupby('period_number')['amount'].sum()
11
12 total_revenue = period_revenue.sum()
13 ltv = total_revenue / cohort_users
14
15 results.append({
16 'cohort': cohort,
17 'users': cohort_users,
18 'total_revenue': total_revenue,
19 'ltv': ltv
20 })
21 return pd.DataFrame(results)
22
23ltv_df = calculate_ltv_by_cohort(transactions, periods=12)
24print("LTV by Cohort:")
25print(ltv_df.sort_values('cohort').head(10))
26
27# Plot LTV trend
28plt.figure(figsize=(12, 5))
29plt.bar(ltv_df['cohort'].astype(str), ltv_df['ltv'])
30plt.xlabel('Cohort')
31plt.ylabel('12-Month LTV ($)')
32plt.title('Customer Lifetime Value by Acquisition Cohort')
33plt.xticks(rotation=45)
34plt.tight_layout()
35plt.show()

Checkpoint

Revenue matrix cho biết mỗi cohort tạo bao nhiêu revenue theo period. ARPU = Revenue / Users. Cumulative ARPU tiến tới LTV. LTV per cohort cho biết chất lượng acquisition theo thời gian. Nếu LTV trending down qua các cohorts mới — đó là dấu hiệu cần cải thiện product hoặc acquisition quality!

5

👥 Behavioral Cohorts & RFM

TB5 min

Purchase Frequency Segments

Python
1customer_freq = transactions.groupby('customer_id').agg({
2 'transaction_id': 'count',
3 'amount': 'sum',
4 'signup_date': 'first'
5}).rename(columns={'transaction_id': 'frequency', 'amount': 'total_spent'})
6
7# Create frequency segments
8customer_freq['freq_segment'] = pd.cut(
9 customer_freq['frequency'],
10 bins=[0, 1, 3, 10, float('inf')],
11 labels=['One-time', 'Occasional', 'Regular', 'Power User']
12)
13
14print("Customer Segments:")
15print(customer_freq['freq_segment'].value_counts())

RFM Analysis

Python
1reference_date = transactions['transaction_date'].max() + pd.Timedelta(days=1)
2
3rfm = transactions.groupby('customer_id').agg({
4 'transaction_date': lambda x: (reference_date - x.max()).days, # Recency
5 'transaction_id': 'count', # Frequency
6 'amount': 'sum' # Monetary
7}).rename(columns={
8 'transaction_date': 'recency',
9 'transaction_id': 'frequency',
10 'amount': 'monetary'
11})
12
13# Create RFM scores (1-5)
14rfm['R_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
15rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
16rfm['M_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
17
18# Simplified segments
19def rfm_segment(row):
20 if row['R_score'] >= 4 and row['F_score'] >= 4:
21 return 'Champions'
22 elif row['R_score'] >= 4 and row['F_score'] < 3:
23 return 'New Customers'
24 elif row['R_score'] < 3 and row['F_score'] >= 4:
25 return 'At Risk'
26 elif row['R_score'] < 2:
27 return 'Lost'
28 else:
29 return 'Regular'
30
31rfm['segment'] = rfm.apply(rfm_segment, axis=1)
32print("RFM Segments:")
33print(rfm['segment'].value_counts())

Channel Cohorts

Python
1channels = ['Organic', 'Paid Search', 'Social', 'Email', 'Referral']
2customers['channel'] = np.random.choice(
3 channels, len(customers), p=[0.3, 0.25, 0.2, 0.15, 0.1]
4)
5
6transactions = transactions.merge(
7 customers[['customer_id', 'channel']], on='customer_id', how='left'
8)
9
10# Channel performance
11channel_analysis = transactions.groupby('channel').agg({
12 'customer_id': 'nunique',
13 'transaction_id': 'count',
14 'amount': ['sum', 'mean']
15}).round(2)
16channel_analysis.columns = ['customers', 'transactions', 'total_revenue', 'avg_order']
17print("Channel Performance:")
18print(channel_analysis)

Checkpoint

RFM scores phân khúc KH: Champions (R↑F↑), At Risk (R↓F↑), New (R↑F↓), Lost (R↓F↓). Channel cohorts so sánh acquisition sources. Purchase frequency cho behavioral insights. Champions chỉ ~20% users nhưng thường mang lại 80% revenue — focus retain họ!

6

💾 Cohort Analysis SQL

TB5 min

Basic Retention Query

SQL
1WITH customer_cohorts AS (
2 SELECT
3 customer_id,
4 DATE_TRUNC('month', MIN(order_date)) AS cohort_month
5 FROM orders
6 GROUP BY customer_id
7),
8transactions_with_cohort AS (
9 SELECT
10 o.customer_id, c.cohort_month,
11 DATE_TRUNC('month', o.order_date) AS activity_month,
12 EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
13 EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period_number
14 FROM orders o
15 JOIN customer_cohorts c ON o.customer_id = c.customer_id
16),
17cohort_data AS (
18 SELECT cohort_month, period_number,
19 COUNT(DISTINCT customer_id) AS active_users
20 FROM transactions_with_cohort
21 GROUP BY cohort_month, period_number
22),
23cohort_sizes AS (
24 SELECT cohort_month, active_users AS cohort_size
25 FROM cohort_data WHERE period_number = 0
26)
27SELECT
28 cd.cohort_month, cd.period_number,
29 cd.active_users, cs.cohort_size,
30 ROUND(cd.active_users * 100.0 / cs.cohort_size, 2) AS retention_rate
31FROM cohort_data cd
32JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
33ORDER BY cd.cohort_month, cd.period_number;

Revenue Cohort Query

SQL
1WITH customer_cohorts AS (
2 SELECT customer_id,
3 DATE_TRUNC('month', MIN(order_date)) AS cohort_month
4 FROM orders GROUP BY customer_id
5),
6cohort_revenue AS (
7 SELECT
8 c.cohort_month,
9 EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
10 EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period,
11 COUNT(DISTINCT o.customer_id) AS customers,
12 SUM(o.amount) AS revenue
13 FROM orders o
14 JOIN customer_cohorts c ON o.customer_id = c.customer_id
15 GROUP BY c.cohort_month, period
16)
17SELECT cohort_month, period, customers, revenue,
18 ROUND(revenue / customers, 2) AS arpu,
19 SUM(revenue) OVER (PARTITION BY cohort_month ORDER BY period) AS cumulative_revenue
20FROM cohort_revenue
21WHERE period <= 12
22ORDER BY cohort_month, period;

Pivot Retention Matrix

SQL
1SELECT
2 cohort_month,
3 MAX(CASE WHEN period_number = 0 THEN retention_rate END) AS m0,
4 MAX(CASE WHEN period_number = 1 THEN retention_rate END) AS m1,
5 MAX(CASE WHEN period_number = 2 THEN retention_rate END) AS m2,
6 MAX(CASE WHEN period_number = 3 THEN retention_rate END) AS m3,
7 MAX(CASE WHEN period_number = 4 THEN retention_rate END) AS m4,
8 MAX(CASE WHEN period_number = 5 THEN retention_rate END) AS m5,
9 MAX(CASE WHEN period_number = 6 THEN retention_rate END) AS m6
10FROM retention_data
11GROUP BY cohort_month
12ORDER BY cohort_month;

Checkpoint

SQL cohort analysis dùng CTEs chain: customer_cohorts → transactions_with_cohort → cohort_data → retention rates. PIVOT dùng CASE WHEN cho matrix format. SQL-based cohort analysis chạy trực tiếp trên database — hiệu quả cho big data mà không cần pull về Python!

7

📊 Advanced Metrics — Churn & Reactivation

TB5 min

Churn Analysis

Python
1def calculate_churn(df, churn_days=90):
2 """Calculate churn rate by cohort"""
3 reference_date = df['transaction_date'].max()
4
5 last_activity = df.groupby('customer_id').agg({
6 'transaction_date': 'max',
7 'cohort': 'first'
8 })
9
10 last_activity['days_inactive'] = (
11 reference_date - last_activity['transaction_date']
12 ).dt.days
13 last_activity['is_churned'] = last_activity['days_inactive'] > churn_days
14
15 churn_by_cohort = last_activity.groupby('cohort').agg({
16 'is_churned': ['sum', 'count', 'mean']
17 })
18 churn_by_cohort.columns = ['churned', 'total', 'churn_rate']
19 churn_by_cohort['churn_rate'] = (churn_by_cohort['churn_rate'] * 100).round(2)
20 return churn_by_cohort
21
22churn_df = calculate_churn(transactions, churn_days=60)
23print("Churn Rate by Cohort:")
24print(churn_df)

Reactivation Analysis

Python
1def analyze_reactivation(df, dormant_days=60):
2 """Analyze customer reactivation"""
3 df_sorted = df.sort_values(['customer_id', 'transaction_date'])
4 df_sorted['prev_transaction'] = df_sorted.groupby('customer_id')['transaction_date'].shift()
5 df_sorted['days_gap'] = (
6 df_sorted['transaction_date'] - df_sorted['prev_transaction']
7 ).dt.days
8 df_sorted['is_reactivation'] = df_sorted['days_gap'] > dormant_days
9
10 reactivations = df_sorted[df_sorted['is_reactivation']].groupby(
11 df_sorted['transaction_date'].dt.to_period('M')
12 ).agg({
13 'customer_id': 'nunique',
14 'amount': 'sum'
15 }).rename(columns={
16 'customer_id': 'reactivated_customers',
17 'amount': 'reactivation_revenue'
18 })
19 return reactivations
20
21reactivation_df = analyze_reactivation(transactions)
22print("Reactivation Analysis:")
23print(reactivation_df.tail(10))

Cohort Comparison (Statistical)

Python
1def compare_cohorts(df, cohort1, cohort2, metric='revenue'):
2 """Statistical comparison between two cohorts"""
3 from scipy import stats
4
5 c1_data = df[df['cohort'] == cohort1]
6 c2_data = df[df['cohort'] == cohort2]
7
8 c1_vals = c1_data.groupby('customer_id')['amount'].sum()
9 c2_vals = c2_data.groupby('customer_id')['amount'].sum()
10
11 t_stat, p_value = stats.ttest_ind(c1_vals, c2_vals)
12
13 print(f"\nCohort Comparison: {cohort1} vs {cohort2}")
14 print(f"Mean Revenue — {cohort1}: ${c1_vals.mean():.2f}")
15 print(f"Mean Revenue — {cohort2}: ${c2_vals.mean():.2f}")
16 print(f"T-statistic: {t_stat:.4f}")
17 print(f"P-value: {p_value:.4f}")
18
19 if p_value < 0.05:
20 print("✓ Statistically significant difference")
21 else:
22 print("✗ No significant difference")
23
24compare_cohorts(transactions, '2024-01', '2024-06', metric='revenue')

Checkpoint

Churn = inactive > N days. Reactivation = gap > dormant_days giữa transactions. Statistical comparison (t-test) cho biết cohort differences có significant không. Churn definition tùy business: SaaS = 30 days, e-commerce = 90 days. Bạn define churn_days phù hợp với business model!

8

📋 Tổng kết

TB5 min

Kiến thức đã học

Chủ đềNội dung chính
Cohort TypesTime-based, Behavioral, Segment-based
Data Preparationassign_cohorts(), period_number calculation
RetentionRetention matrix, heatmap, retention curves
Revenue CohortsRevenue matrix, ARPU, cumulative ARPU
LTVCustomer Lifetime Value by cohort
BehavioralPurchase frequency, RFM segmentation
Channel CohortsAcquisition channel comparison
SQL QueriesCTE-based retention, revenue, pivot queries
Advanced MetricsChurn rate, reactivation, statistical comparison

Key Takeaways

  1. Retention matrix là output chính — visualize bằng heatmap
  2. RFM segmentation cho actionable customer segments
  3. LTV by cohort đánh giá chất lượng acquisition theo thời gian
  4. SQL-based cohort analysis scales tốt cho big data
  5. Churn definition phải phù hợp với business model
  6. Statistical tests cho scientific cohort comparison

Câu hỏi tự kiểm tra

  1. Time-based cohort khác Behavioral cohort thế nào?
  2. Retention matrix được tạo bằng cách nào?
  3. RFM segmentation chia khách hàng thế nào?
  4. LTV by cohort đánh giá điều gì?

Bài tiếp theo: Funnel Analysis →

🎉 Tuyệt vời! Bạn đã thành thạo Cohort Analysis!

Nhớ: Retention matrix + RFM + LTV = bộ 3 không thể thiếu cho Customer Analytics. Business luôn cần những insights này!