🎯 Lesson Objectives
- Build a Retention Matrix and visualize it as a heatmap
- Analyze Revenue Cohorts, ARPU, and Customer Lifetime Value (LTV)
- Create Behavioral Cohorts with RFM segmentation
- Write production-ready Cohort Analysis SQL queries
| Item | Details |
|---|---|
| ⏱️ Duration | 2.5 hours |
| 📖 Main topics | Retention, Revenue Cohorts, RFM, LTV |
| 💡 Prerequisites | Pandas, basic Window Functions |
| 🎯 Output | A complete cohort analysis report |
📖 Key Terminology
| Term | Description |
|---|---|
| Cohort | A group of users with a shared trait (usually acquisition time) |
| Retention | % of users who return N periods after acquisition |
| Retention Matrix | Cohort × period table showing retention rates |
| ARPU | Average Revenue Per User — average revenue per user |
| LTV | Customer Lifetime Value — total value a customer generates |
| RFM | Customer segmentation by Recency, Frequency, and Monetary value |
| Churn Rate | % of customers no longer active after N days |
| Reactivation | A customer returning after a dormant period |
| Acquisition Cohort | Cohort based on the first signup/purchase month or week |
| Behavioral Cohort | Cohort based on usage patterns (power users, casual users) |
Checkpoint
Cohort Analysis groups users by acquisition time and tracks their retention over time. RFM segments customers by Recency, Frequency, and Monetary value. LTV captures a customer's long-term value. Cohort Analysis is the #1 tool for understanding retention — every PM and Data Analyst needs it!
🗃️ Data Preparation & Cohort Assignment
Cohort Analysis groups users by a shared trait (usually acquisition time) and tracks their behavior over time. It is a powerful tool for understanding retention, engagement, and customer lifetime value.
Cohort Types
| Cohort Type | Description |
|---|---|
| 📅 Time-based | Acquisition month, first purchase |
| 📊 Behavior-based | Power users vs. casual, high/low spend |
| 🎯 Segment-based | Product category, region, tier |
Create Sample Data
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
n_customers = 5000
n_transactions = 50000

# Customer signup dates (spread over 12 months)
customers = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'signup_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
        np.random.randint(0, 365, n_customers), unit='D'
    )
})

# Generate transactions
transactions = pd.DataFrame({
    'transaction_id': range(1, n_transactions + 1),
    'customer_id': np.random.randint(1, n_customers + 1, n_transactions),
    'transaction_date': pd.to_datetime('2024-01-01') + pd.to_timedelta(
        np.random.randint(0, 400, n_transactions), unit='D'
    ),
    'amount': np.random.lognormal(4, 1, n_transactions)
})

# Merge and filter: only keep transactions after signup
transactions = transactions.merge(customers, on='customer_id')
transactions = transactions[
    transactions['transaction_date'] >= transactions['signup_date']
]

print(f"Customers: {len(customers)}")
print(f"Transactions: {len(transactions)}")
```
Calculate Cohort Fields
```python
def assign_cohorts(df, signup_col, activity_col, freq='M'):
    """Assign cohort and period number to each row."""
    df = df.copy()
    df['cohort'] = df[signup_col].dt.to_period(freq)
    df['activity_period'] = df[activity_col].dt.to_period(freq)
    # Subtracting Period columns yields offsets; .n is the number of
    # periods between them (Series.view('int64') is deprecated in pandas 2.x)
    df['period_number'] = (
        df['activity_period'] - df['cohort']
    ).apply(lambda offset: offset.n)
    return df

transactions = assign_cohorts(
    transactions, 'signup_date', 'transaction_date', freq='M'
)
print(transactions[['customer_id', 'cohort', 'activity_period',
                    'period_number']].head(10))
```
Checkpoint
Cohort = signup month. Period number = months elapsed since the cohort month. Always filter transactions >= signup_date. Period 0 is the first month, Period 1 the second, and so on — this is how you measure a customer's "age"!
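The period arithmetic can be verified on toy data: subtracting two monthly `pd.Period` values yields a month offset whose `.n` attribute is the gap in months. A minimal sketch, independent of the dataset above:

```python
import pandas as pd

# Two timestamps three calendar months apart
signup = pd.Timestamp('2024-01-15').to_period('M')    # Period('2024-01', 'M')
activity = pd.Timestamp('2024-04-02').to_period('M')  # Period('2024-04', 'M')

offset = activity - signup  # a MonthEnd offset; .n counts the months
print(offset.n)             # 3 -> this row would get period_number = 3
```

Note that the gap is counted in calendar months, not 30-day windows — a purchase on Jan 31 followed by one on Feb 1 is already period 1.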
📊 Retention Analysis
Build Retention Matrix
```python
def build_retention_matrix(df, value_col='customer_id', agg_func='nunique'):
    """Build a cohort retention matrix."""
    cohort_data = df.groupby(['cohort', 'period_number']).agg({
        value_col: agg_func
    }).reset_index()

    cohort_matrix = cohort_data.pivot(
        index='cohort',
        columns='period_number',
        values=value_col
    )
    return cohort_matrix

# Active users matrix
retention_counts = build_retention_matrix(transactions)
print("Retention Counts:")
print(retention_counts.iloc[:6, :7])

# Calculate retention rate
cohort_sizes = retention_counts.iloc[:, 0]
retention_rate = retention_counts.divide(cohort_sizes, axis=0) * 100

print("\nRetention Rate (%):")
print(retention_rate.iloc[:6, :7].round(1))
```
Visualize Retention Heatmap
```python
def plot_retention_heatmap(retention_matrix, title='Retention Rate (%)'):
    """Plot retention heatmap."""
    plt.figure(figsize=(14, 10))

    annot = retention_matrix.round(1).astype(str) + '%'

    sns.heatmap(
        retention_matrix,
        annot=annot,
        fmt='',
        cmap='YlGnBu',
        vmin=0, vmax=100,
        linewidths=0.5,
        cbar_kws={'label': 'Retention Rate (%)'}
    )

    plt.title(title, fontsize=14)
    plt.xlabel('Months Since First Purchase')
    plt.ylabel('Cohort (Month)')
    plt.tight_layout()
    plt.show()

plot_retention_heatmap(retention_rate.iloc[:12, :12])
```
Retention Curves
```python
def plot_retention_curves(retention_matrix, top_n=6):
    """Plot retention curves for multiple cohorts."""
    plt.figure(figsize=(12, 6))

    for cohort in retention_matrix.index[:top_n]:
        values = retention_matrix.loc[cohort].dropna()
        plt.plot(values.index, values.values, marker='o', label=str(cohort))

    plt.xlabel('Months Since Acquisition')
    plt.ylabel('Retention Rate (%)')
    plt.title('Retention Curves by Cohort')
    plt.legend(title='Cohort')
    plt.grid(True, alpha=0.3)
    plt.ylim(0, 105)
    plt.show()

plot_retention_curves(retention_rate)
```
Checkpoint
Retention matrix = cohort × period. The heatmap gives a great overview; retention curves compare cohorts over time. Column 0 is always 100%. A sharp drop from Month 0 to Month 1 is normal. "Good" retention depends on the industry — roughly 90% M1 for SaaS, around 30% M1 for e-commerce!
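The per-cohort curves can also be collapsed into a single "blended" retention curve by averaging each period across cohorts, weighted by cohort size so large cohorts count proportionally more. A minimal sketch with illustrative toy numbers (not the dataset above); with the real `retention_rate` matrix you would use `retention_counts.iloc[:, 0]` as the weights:

```python
import pandas as pd

# Toy retention counts: rows = cohorts, columns = period numbers
counts = pd.DataFrame(
    {0: [100, 200], 1: [40, 90], 2: [20, 50]},
    index=['2024-01', '2024-02']
)
rates = counts.divide(counts[0], axis=0) * 100  # retention rate (%)

# Size-weighted average per period == total active / total cohort size
sizes = counts[0]
blended = rates.mul(sizes, axis=0).sum() / sizes.sum()
print(blended.round(1))  # period 1: (40 + 90) / 300 * 100 = 43.3
```

The weighted form equals total active users divided by total cohort size, which is usually what stakeholders mean by "our M1 retention".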
💰 Revenue Cohorts & LTV
Revenue per Cohort
```python
# Revenue matrix
revenue_matrix = build_retention_matrix(
    transactions, value_col='amount', agg_func='sum'
)
print("Total Revenue by Cohort and Period:")
print(revenue_matrix.iloc[:6, :7].round(0))

# Cumulative revenue
cumulative_revenue = revenue_matrix.cumsum(axis=1)
print("\nCumulative Revenue:")
print(cumulative_revenue.iloc[:6, :7].round(0))
```
Average Revenue per User (ARPU)
```python
cohort_users = transactions.groupby('cohort')['customer_id'].nunique()
arpu_matrix = revenue_matrix.divide(cohort_users, axis=0)

print("ARPU by Cohort and Period:")
print(arpu_matrix.iloc[:6, :7].round(2))

# Cumulative ARPU
cum_arpu = arpu_matrix.cumsum(axis=1)

plt.figure(figsize=(12, 6))
for cohort in cum_arpu.index[:6]:
    values = cum_arpu.loc[cohort].dropna()
    plt.plot(values.index, values.values, marker='o', label=str(cohort))

plt.xlabel('Months Since Acquisition')
plt.ylabel('Cumulative ARPU ($)')
plt.title('Cumulative ARPU by Cohort')
plt.legend(title='Cohort')
plt.grid(True, alpha=0.3)
plt.show()
```
Customer Lifetime Value (LTV)
```python
def calculate_ltv_by_cohort(df, periods=12):
    """Calculate LTV for each cohort."""
    results = []
    for cohort in df['cohort'].unique():
        cohort_data = df[df['cohort'] == cohort]
        cohort_users = cohort_data['customer_id'].nunique()

        period_revenue = cohort_data[
            cohort_data['period_number'] < periods
        ].groupby('period_number')['amount'].sum()

        total_revenue = period_revenue.sum()
        ltv = total_revenue / cohort_users

        results.append({
            'cohort': cohort,
            'users': cohort_users,
            'total_revenue': total_revenue,
            'ltv': ltv
        })
    # Sort so the bar chart below is in chronological order
    return pd.DataFrame(results).sort_values('cohort')

ltv_df = calculate_ltv_by_cohort(transactions, periods=12)
print("LTV by Cohort:")
print(ltv_df.head(10))

# Plot LTV trend
plt.figure(figsize=(12, 5))
plt.bar(ltv_df['cohort'].astype(str), ltv_df['ltv'])
plt.xlabel('Cohort')
plt.ylabel('12-Month LTV ($)')
plt.title('Customer Lifetime Value by Acquisition Cohort')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```
Checkpoint
The revenue matrix shows how much revenue each cohort generates per period. ARPU = Revenue / Users. Cumulative ARPU converges toward LTV. LTV per cohort reveals acquisition quality over time. If LTV is trending down across newer cohorts, that is a signal to improve the product or acquisition quality!
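For cohorts too young to have 12 months of history, a common back-of-envelope projection (an addition of this sketch, not computed in the lesson) assumes a constant per-period churn rate c, giving an expected lifetime of 1/c periods:

```python
def project_ltv(arpu_per_month, monthly_churn_rate):
    """Project LTV assuming geometric (constant-churn) customer lifetimes.

    Expected lifetime = 1 / churn, so LTV ≈ ARPU per month × (1 / churn).
    """
    return arpu_per_month / monthly_churn_rate

# $25 ARPU per month at 5% monthly churn -> expected lifetime 20 months
print(project_ltv(25.0, 0.05))  # 500.0
```

This is only a rough heuristic: real churn is rarely constant (it usually falls as cohorts age), so treat the result as an upper-level estimate and prefer the empirical cumulative-ARPU curve when enough history exists.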
👥 Behavioral Cohorts & RFM
Purchase Frequency Segments
```python
customer_freq = transactions.groupby('customer_id').agg({
    'transaction_id': 'count',
    'amount': 'sum',
    'signup_date': 'first'
}).rename(columns={'transaction_id': 'frequency', 'amount': 'total_spent'})

# Create frequency segments
customer_freq['freq_segment'] = pd.cut(
    customer_freq['frequency'],
    bins=[0, 1, 3, 10, float('inf')],
    labels=['One-time', 'Occasional', 'Regular', 'Power User']
)

print("Customer Segments:")
print(customer_freq['freq_segment'].value_counts())
```
RFM Analysis
```python
reference_date = transactions['transaction_date'].max() + pd.Timedelta(days=1)

rfm = transactions.groupby('customer_id').agg({
    'transaction_date': lambda x: (reference_date - x.max()).days,  # Recency
    'transaction_id': 'count',                                      # Frequency
    'amount': 'sum'                                                 # Monetary
}).rename(columns={
    'transaction_date': 'recency',
    'transaction_id': 'frequency',
    'amount': 'monetary'
})

# Create RFM scores (1-5)
rfm['R_score'] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])
rfm['M_score'] = pd.qcut(rfm['monetary'].rank(method='first'), 5, labels=[1, 2, 3, 4, 5])

# Simplified segments
def rfm_segment(row):
    if row['R_score'] >= 4 and row['F_score'] >= 4:
        return 'Champions'
    elif row['R_score'] >= 4 and row['F_score'] < 3:
        return 'New Customers'
    elif row['R_score'] < 3 and row['F_score'] >= 4:
        return 'At Risk'
    elif row['R_score'] < 2:
        return 'Lost'
    else:
        return 'Regular'

rfm['segment'] = rfm.apply(rfm_segment, axis=1)
print("RFM Segments:")
print(rfm['segment'].value_counts())
```
Channel Cohorts
```python
channels = ['Organic', 'Paid Search', 'Social', 'Email', 'Referral']
customers['channel'] = np.random.choice(
    channels, len(customers), p=[0.3, 0.25, 0.2, 0.15, 0.1]
)

transactions = transactions.merge(
    customers[['customer_id', 'channel']], on='customer_id', how='left'
)

# Channel performance
channel_analysis = transactions.groupby('channel').agg({
    'customer_id': 'nunique',
    'transaction_id': 'count',
    'amount': ['sum', 'mean']
}).round(2)
channel_analysis.columns = ['customers', 'transactions', 'total_revenue', 'avg_order']
print("Channel Performance:")
print(channel_analysis)
```
Checkpoint
RFM scores segment customers: Champions (R↑ F↑), At Risk (R↓ F↑), New (R↑ F↓), Lost (R↓ F↓). Channel cohorts compare acquisition sources; purchase frequency provides behavioral insights. Champions are only ~20% of users but often bring in ~80% of revenue — focus on retaining them!
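The 80/20 claim is easy to sanity-check: compute each segment's share of total monetary value. A minimal sketch with toy numbers (with the `rfm` frame above you would group its `monetary` column by `segment` instead):

```python
import pandas as pd

# Toy per-customer spend with precomputed segments
spend = pd.DataFrame({
    'segment': ['Champions', 'Champions', 'Regular', 'Lost', 'Regular'],
    'monetary': [900.0, 700.0, 150.0, 50.0, 200.0],
})

# Revenue share per segment
share = spend.groupby('segment')['monetary'].sum() / spend['monetary'].sum()
print(share.round(3))  # Champions: 1600 / 2000 = 0.8
```

If Champions' share is far below expectations, the segment thresholds (or the 80/20 heuristic itself) may not fit your business.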
💾 Cohort Analysis SQL
Basic Retention Query
```sql
WITH customer_cohorts AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),
transactions_with_cohort AS (
    SELECT
        o.customer_id, c.cohort_month,
        DATE_TRUNC('month', o.order_date) AS activity_month,
        EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
        EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period_number
    FROM orders o
    JOIN customer_cohorts c ON o.customer_id = c.customer_id
),
cohort_data AS (
    SELECT cohort_month, period_number,
           COUNT(DISTINCT customer_id) AS active_users
    FROM transactions_with_cohort
    GROUP BY cohort_month, period_number
),
cohort_sizes AS (
    SELECT cohort_month, active_users AS cohort_size
    FROM cohort_data WHERE period_number = 0
)
SELECT
    cd.cohort_month, cd.period_number,
    cd.active_users, cs.cohort_size,
    ROUND(cd.active_users * 100.0 / cs.cohort_size, 2) AS retention_rate
FROM cohort_data cd
JOIN cohort_sizes cs ON cd.cohort_month = cs.cohort_month
ORDER BY cd.cohort_month, cd.period_number;
```
Revenue Cohort Query
```sql
WITH customer_cohorts AS (
    SELECT customer_id,
           DATE_TRUNC('month', MIN(order_date)) AS cohort_month
    FROM orders GROUP BY customer_id
),
cohort_revenue AS (
    SELECT
        c.cohort_month,
        EXTRACT(YEAR FROM o.order_date) * 12 + EXTRACT(MONTH FROM o.order_date) -
        EXTRACT(YEAR FROM c.cohort_month) * 12 - EXTRACT(MONTH FROM c.cohort_month) AS period,
        COUNT(DISTINCT o.customer_id) AS customers,
        SUM(o.amount) AS revenue
    FROM orders o
    JOIN customer_cohorts c ON o.customer_id = c.customer_id
    GROUP BY c.cohort_month, period
)
SELECT cohort_month, period, customers, revenue,
       ROUND(revenue / customers, 2) AS arpu,
       SUM(revenue) OVER (PARTITION BY cohort_month ORDER BY period) AS cumulative_revenue
FROM cohort_revenue
WHERE period <= 12
ORDER BY cohort_month, period;
```
Pivot Retention Matrix
```sql
SELECT
    cohort_month,
    MAX(CASE WHEN period_number = 0 THEN retention_rate END) AS m0,
    MAX(CASE WHEN period_number = 1 THEN retention_rate END) AS m1,
    MAX(CASE WHEN period_number = 2 THEN retention_rate END) AS m2,
    MAX(CASE WHEN period_number = 3 THEN retention_rate END) AS m3,
    MAX(CASE WHEN period_number = 4 THEN retention_rate END) AS m4,
    MAX(CASE WHEN period_number = 5 THEN retention_rate END) AS m5,
    MAX(CASE WHEN period_number = 6 THEN retention_rate END) AS m6
FROM retention_data
GROUP BY cohort_month
ORDER BY cohort_month;
```
Checkpoint
SQL cohort analysis chains CTEs: customer_cohorts → transactions_with_cohort → cohort_data → retention rates. Pivot to matrix format with CASE WHEN. SQL-based cohort analysis runs directly on the database — efficient for big data, with no need to pull everything into Python!
📊 Advanced Metrics — Churn & Reactivation
Churn Analysis
```python
def calculate_churn(df, churn_days=90):
    """Calculate churn rate by cohort."""
    reference_date = df['transaction_date'].max()

    last_activity = df.groupby('customer_id').agg({
        'transaction_date': 'max',
        'cohort': 'first'
    })

    last_activity['days_inactive'] = (
        reference_date - last_activity['transaction_date']
    ).dt.days
    last_activity['is_churned'] = last_activity['days_inactive'] > churn_days

    churn_by_cohort = last_activity.groupby('cohort').agg({
        'is_churned': ['sum', 'count', 'mean']
    })
    churn_by_cohort.columns = ['churned', 'total', 'churn_rate']
    churn_by_cohort['churn_rate'] = (churn_by_cohort['churn_rate'] * 100).round(2)
    return churn_by_cohort

churn_df = calculate_churn(transactions, churn_days=60)
print("Churn Rate by Cohort:")
print(churn_df)
```
Reactivation Analysis
```python
def analyze_reactivation(df, dormant_days=60):
    """Analyze customer reactivation."""
    df_sorted = df.sort_values(['customer_id', 'transaction_date']).copy()
    df_sorted['prev_transaction'] = (
        df_sorted.groupby('customer_id')['transaction_date'].shift()
    )
    df_sorted['days_gap'] = (
        df_sorted['transaction_date'] - df_sorted['prev_transaction']
    ).dt.days
    df_sorted['is_reactivation'] = df_sorted['days_gap'] > dormant_days
    # Materialize the grouping key as a column so the filtered frame
    # groups cleanly without relying on index alignment
    df_sorted['activity_month'] = df_sorted['transaction_date'].dt.to_period('M')

    reactivations = df_sorted[df_sorted['is_reactivation']].groupby(
        'activity_month'
    ).agg({
        'customer_id': 'nunique',
        'amount': 'sum'
    }).rename(columns={
        'customer_id': 'reactivated_customers',
        'amount': 'reactivation_revenue'
    })
    return reactivations

reactivation_df = analyze_reactivation(transactions)
print("Reactivation Analysis:")
print(reactivation_df.tail(10))
```
Cohort Comparison (Statistical)
```python
def compare_cohorts(df, cohort1, cohort2):
    """Statistical comparison of per-customer revenue between two cohorts."""
    from scipy import stats

    # Convert to pd.Period so the comparison matches the cohort dtype
    c1 = pd.Period(cohort1, freq='M')
    c2 = pd.Period(cohort2, freq='M')
    c1_vals = df[df['cohort'] == c1].groupby('customer_id')['amount'].sum()
    c2_vals = df[df['cohort'] == c2].groupby('customer_id')['amount'].sum()

    t_stat, p_value = stats.ttest_ind(c1_vals, c2_vals)

    print(f"\nCohort Comparison: {cohort1} vs {cohort2}")
    print(f"Mean Revenue — {cohort1}: ${c1_vals.mean():.2f}")
    print(f"Mean Revenue — {cohort2}: ${c2_vals.mean():.2f}")
    print(f"T-statistic: {t_stat:.4f}")
    print(f"P-value: {p_value:.4f}")

    if p_value < 0.05:
        print("✓ Statistically significant difference")
    else:
        print("✗ No significant difference")

compare_cohorts(transactions, '2024-01', '2024-06')
```
Checkpoint
Churn = inactive for more than N days. Reactivation = a gap longer than dormant_days between transactions. A statistical comparison (t-test) tells you whether cohort differences are significant. The churn definition depends on the business: ~30 days for SaaS, ~90 days for e-commerce. Choose a churn_days that fits your business model!
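One caveat on the t-test: per-customer revenue is heavily right-skewed (lognormal in the data above), which weakens the t-test's normality assumptions for small cohorts. A distribution-free complement is a bootstrap confidence interval on the difference in means. This sketch is an addition to the lesson, using synthetic stand-ins for per-customer revenue:

```python
import numpy as np

rng = np.random.default_rng(0)
cohort_a = rng.lognormal(4.0, 1.0, 400)  # synthetic per-customer revenue
cohort_b = rng.lognormal(4.2, 1.0, 400)

def bootstrap_diff_ci(a, b, n_boot=5000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for mean(b) - mean(a)."""
    r = np.random.default_rng(seed)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each cohort with replacement and record the mean gap
        diffs[i] = r.choice(b, b.size).mean() - r.choice(a, a.size).mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return lo, hi

lo, hi = bootstrap_diff_ci(cohort_a, cohort_b)
print(f"95% CI for mean difference: [{lo:.2f}, {hi:.2f}]")
# A CI that excludes 0 indicates a significant difference
```

Unlike the t-test, the bootstrap makes no assumption about the revenue distribution's shape, at the cost of a little compute.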
📋 Summary
What You Learned
| Topic | Key Content |
|---|---|
| Cohort Types | Time-based, Behavioral, Segment-based |
| Data Preparation | assign_cohorts(), period_number calculation |
| Retention | Retention matrix, heatmap, retention curves |
| Revenue Cohorts | Revenue matrix, ARPU, cumulative ARPU |
| LTV | Customer Lifetime Value by cohort |
| Behavioral | Purchase frequency, RFM segmentation |
| Channel Cohorts | Acquisition channel comparison |
| SQL Queries | CTE-based retention, revenue, pivot queries |
| Advanced Metrics | Churn rate, reactivation, statistical comparison |
Key Takeaways
- ✅ The retention matrix is the core output — visualize it as a heatmap
- ✅ RFM segmentation yields actionable customer segments
- ✅ LTV by cohort evaluates acquisition quality over time
- ✅ SQL-based cohort analysis scales well for big data
- ✅ The churn definition must fit the business model
- ✅ Statistical tests enable rigorous cohort comparison
Self-Check Questions
- How does a time-based cohort differ from a behavioral cohort?
- How is a retention matrix built?
- How does RFM segmentation divide customers?
- What does LTV by cohort evaluate?
Next lesson: Funnel Analysis →
🎉 Great work! You have mastered Cohort Analysis!
Remember: Retention matrix + RFM + LTV is the essential trio for Customer Analytics. Businesses always need these insights!
