Pandas Practice

🎯 Lesson Objectives

After this practice session, you will be able to:

✅ Read, explore, and filter data from CSV with confidence

✅ Use Groupby + Aggregation to analyze business data

✅ Merge multiple tables and compute on the result

✅ Build a professional Pivot Table report

Time: 2 hours | Difficulty: Beginner → Hard | Prerequisite: complete Lesson 6 (Pandas)

🟢 Part 1: Creating and Exploring a DataFrame (Easy)

Theory review (Lesson 6):

A DataFrame is a 2D table; a Series is a single column
Create: pd.DataFrame(dict), pd.read_csv()
Explore: shape, head(), info(), describe()
Handle nulls: isnull(), dropna(), fillna()
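
The exercises below generate their data in memory, so the CSV and null-handling calls from the review are not exercised. Here is a minimal sketch of them, assuming a small hypothetical CSV (the csv_text content and the names sample, filled, and cleaned are made up for illustration). It reads the text with pd.read_csv() and then applies isnull(), fillna(), and dropna():

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a file on disk.
csv_text = """order_id,city,quantity,unit_price
1,Hà Nội,2,100
2,,1,500
3,Đà Nẵng,,200
4,TP.HCM,3,
"""

# read_csv accepts a path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.isnull().sum())  # missing values per column

# fillna: replace missing values, column by column
filled = sample.fillna({"city": "Unknown", "quantity": 0})

# dropna: drop rows that still contain a missing value (unit_price here)
cleaned = filled.dropna()
print(cleaned)
```

Filling before dropping keeps rows whose gaps have a sensible default, and only discards rows where no default makes sense.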

Exercise 1.1: Create a sales dataset

Python

import pandas as pd
import numpy as np

np.random.seed(42)  # fixed seed so the random data is reproducible

n = 200
df = pd.DataFrame({
    "order_id": range(1001, 1001 + n),
    "date": pd.date_range("2024-01-01", periods=n, freq="D"),
    "city": np.random.choice(["Hà Nội", "TP.HCM", "Đà Nẵng", "Cần Thơ"], n),
    "category": np.random.choice(["Electronics", "Clothing", "Food", "Books"], n),
    "quantity": np.random.randint(1, 10, n),
    "unit_price": np.random.choice([50, 100, 200, 500, 1000], n),
})
df["revenue"] = df["quantity"] * df["unit_price"]

print(df.shape)
print(df.head())
df.info()  # info() prints its own summary and returns None
print(df.describe())

Exercise 1.2: Quick exploration

Task: Answer the following questions about the dataset

Python

# (a) How many orders does each city have?
print(df['city'].value_counts())

# (b) Which category has the highest revenue?
print(df.groupby('category')['revenue'].sum().sort_values(ascending=False))

# (c) Which order is the largest (by revenue)?
print(df.loc[df['revenue'].idxmax()])

# (d) What percentage of orders have revenue > 1000?
pct = (df['revenue'] > 1000).mean() * 100
print(f"Orders > 1000: {pct:.1f}%")

Checkpoint

Are you comfortable with the workflow shape → head → info → describe → value_counts?

🟡 Part 2: Selection and Filtering (Medium)

Theory review (Lesson 6):

Select columns: df['col'], df[['col1', 'col2']]
Select rows: df.loc[index], df.iloc[position]
Boolean indexing: df[df['col'] > 10]
Combine conditions: & (AND), | (OR), ~ (NOT); remember to wrap each condition in ()
Helpers: isin(), between(), query(), str.contains()
Create new columns: df['new'] = df['a'] + df['b'], pd.cut(), dt.month
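
The parentheses rule and two helpers that the exercises below do not use (~ and str.contains()) can be sketched on a tiny table. The demo frame and variable names here are hypothetical, chosen only for illustration:

```python
import pandas as pd

demo = pd.DataFrame({
    "city": ["Hà Nội", "TP.HCM", "Đà Nẵng", "Cần Thơ"],
    "category": ["Electronics", "Clothing", "Food", "Books"],
    "revenue": [1500, 300, 800, 2500],
})

# Each condition gets its own parentheses: & binds tighter than > and ==
high_non_food = demo[(demo["revenue"] > 500) & ~(demo["category"] == "Food")]
print(high_non_food)

# str.contains: substring match on a text column
has_oo = demo[demo["category"].str.contains("oo")]
print(has_oo)
```

Without the parentheses, Python would evaluate 500 & demo["category"] first, which typically raises an error instead of filtering.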

Exercise 2.1: Filter orders

Task: Filter by several different conditions

Python

# (a) Orders in Hà Nội, category Electronics
hn_elec = df[(df['city'] == 'Hà Nội') & (df['category'] == 'Electronics')]
print(f"HN Electronics: {len(hn_elec)} orders")

# (b) Orders with revenue from 500 to 2000 (between() includes both ends)
mid_range = df[df['revenue'].between(500, 2000)]
print(f"Revenue 500-2000: {len(mid_range)} orders")

# (c) Orders in Hà Nội or TP.HCM with quantity >= 5
big_orders = df[
    (df['city'].isin(['Hà Nội', 'TP.HCM'])) &
    (df['quantity'] >= 5)
]
print(f"Big orders HN/HCM: {len(big_orders)} orders")

# (d) Using query()
result = df.query('city == "Đà Nẵng" and revenue > 1000')
print(f"DN revenue > 1000: {len(result)} orders")

Exercise 2.2: Create new columns

Task: Add analysis columns

Python

# (a) Month and weekday columns
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.day_name()

# (b) Revenue tier; bins are right-inclusive: (0, 200], (200, 1000], ...
df['tier'] = pd.cut(df['revenue'],
    bins=[0, 200, 1000, 5000, float('inf')],
    labels=['Small', 'Medium', 'Large', 'Premium'])

# (c) Running total (cumulative), computed on a date-sorted copy
df_sorted = df.sort_values('date')
df_sorted['cumulative_revenue'] = df_sorted['revenue'].cumsum()

print(df[['order_id', 'revenue', 'month', 'tier']].head(10))
print(df_sorted[['date', 'revenue', 'cumulative_revenue']].tail())

Checkpoint

Boolean indexing in Pandas works like it does in NumPy: use & instead of and, and | instead of or. Did you remember to wrap each condition in parentheses ()?

🟡 Part 3: Groupby Analysis (Medium)

Theory review (Lesson 6):

Groupby: the Split-Apply-Combine paradigm
.groupby('col').agg(): aggregate in several ways (sum, mean, count, max...)
.groupby('col').transform(): keeps the original shape and broadcasts the group result
Groupby on 2+ columns: groupby(['col1', 'col2'])
reset_index(): turns a MultiIndex back into regular columns
Time series: pd.date_range(), dt.month, dt.day_name(), pct_change()

Exercise 3.1: Analysis by city

Task: Summarize revenue for each city

Python

city_stats = df.groupby('city').agg(
    total_revenue = ('revenue', 'sum'),
    avg_revenue = ('revenue', 'mean'),
    total_orders = ('order_id', 'count'),
    avg_quantity = ('quantity', 'mean')
).round(0)

# Sort by total revenue
city_stats = city_stats.sort_values('total_revenue', ascending=False)
print(city_stats)

# Add a revenue share (%) column
city_stats['revenue_pct'] = (
    city_stats['total_revenue'] / city_stats['total_revenue'].sum() * 100
).round(1)
print(city_stats)

Exercise 3.2: Cross-analysis (City × Category)

Task: Cross-analyze city and category

Python

# Group by 2 columns
cross = df.groupby(['city', 'category']).agg(
    revenue = ('revenue', 'sum'),
    orders = ('order_id', 'count')
).reset_index()

# Top category in each city
top_per_city = cross.loc[
    cross.groupby('city')['revenue'].idxmax()
]
print("Top category per city:")
print(top_per_city[['city', 'category', 'revenue']])

# Transform: each category's share (%) within its city
cross['city_pct'] = cross.groupby('city')['revenue'].transform(
    lambda x: x / x.sum() * 100
).round(1)
print(cross)

Exercise 3.3: Time Series Analysis

Task: Analyze the revenue trend over time

Python

# Revenue by month (uses the 'month' column created in Exercise 2.2)
monthly = df.groupby('month').agg(
    revenue = ('revenue', 'sum'),
    orders = ('order_id', 'count'),
    avg_order_value = ('revenue', 'mean')
).round(0)
print(monthly)

# Month-over-month growth
# Note: the 200 daily rows end on 2024-07-18, so July is a partial month
monthly['growth_pct'] = monthly['revenue'].pct_change() * 100
print(monthly[['revenue', 'growth_pct']].round(1))

# Average revenue by day of the week
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']
weekday_rev = df.groupby('weekday')['revenue'].mean()
weekday_rev = weekday_rev.reindex(weekday_order)
print(weekday_rev.round(0))

Checkpoint

Can you tell .agg() and .transform() apart? .agg() returns one collapsed result per group, while .transform() keeps the original shape.
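
The shape difference is easiest to see side by side. This is a minimal sketch on a hypothetical five-row table (the names demo, totals, and per_row are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({
    "city": ["Hà Nội", "Hà Nội", "TP.HCM", "TP.HCM", "TP.HCM"],
    "revenue": [100, 300, 200, 200, 600],
})

# agg: one row per group (2 cities -> 2 values)
totals = demo.groupby("city")["revenue"].agg("sum")
print(totals)

# transform: one value per original row (5 rows -> 5 values)
per_row = demo.groupby("city")["revenue"].transform("sum")
print(per_row)

# Because per_row aligns with demo, it can be used directly in row-wise math
demo["share_pct"] = (demo["revenue"] / per_row * 100).round(1)
print(demo)
```

A practical rule of thumb: reach for agg() when building a summary table, and for transform() when a group statistic is needed as a new column on the original rows.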

🔴 Part 4: Merge and Pivot (Hard)

Theory review (Lesson 6):

Merge: joins tables like a SQL JOIN
- pd.merge(df1, df2, on='key', how='left/right/inner/outer')
- how='left': keeps every row of df1; how='inner': keeps only rows whose key matches in both tables
Pivot Table: reshapes data from long to wide (like an Excel Pivot)
- pivot_table(values, index, columns, aggfunc)
- margins=True: adds a total row and column
RFM: Recency (how recently the customer bought), Frequency (how often), Monetary (how much they spent)
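
To make the how= choice concrete, here is a small sketch with one order that points to a product missing from the product table. The tables orders_demo and products_demo are hypothetical; indicator=True is a standard pd.merge option that adds a _merge column showing where each row came from:

```python
import pandas as pd

orders_demo = pd.DataFrame({"order_id": [1, 2, 3], "product_id": [1, 2, 9]})
products_demo = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Laptop", "Phone", "Tablet"],
})

# inner: the order with product_id 9 disappears silently
inner = pd.merge(orders_demo, products_demo, on="product_id", how="inner")
print(len(inner))

# left: every order is kept; the unmatched one gets NaN for product_name
left = pd.merge(orders_demo, products_demo, on="product_id", how="left",
                indicator=True)
print(left)

# Rows found only in the left table are the ones an inner join would drop
unmatched = left[left["_merge"] == "left_only"]
print(unmatched)
```

Comparing row counts before and after a merge, or inspecting _merge, is a quick way to notice rows that an inner join would discard.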

Exercise 4.1: Multi-table Analysis

Task: Merge 3 tables: orders + products + customers

Python

# Seed first so every random table in this exercise is reproducible
np.random.seed(42)

# Products table
products = pd.DataFrame({
    "product_id": [1, 2, 3, 4, 5],
    "product_name": ["Laptop", "Phone", "Tablet", "Headphones", "Keyboard"],
    "cost": [800, 500, 300, 50, 30]
})

# Customers table
customers = pd.DataFrame({
    "customer_id": range(1, 11),
    "name": [f"Customer_{i}" for i in range(1, 11)],
    "segment": np.random.choice(["Gold", "Silver", "Bronze"], 10)
})

# Orders table
orders = pd.DataFrame({
    "order_id": range(1, 51),
    "customer_id": np.random.randint(1, 11, 50),
    "product_id": np.random.randint(1, 6, 50),
    "quantity": np.random.randint(1, 5, 50),
    "unit_price": np.random.choice([100, 200, 500, 1000, 1500], 50)
})

# Merge the 3 tables (the default how='inner' keeps only matching keys)
full = (orders
    .merge(products, on='product_id')
    .merge(customers, on='customer_id'))

# Compute profit (it can be negative, because prices are random)
full['revenue'] = full['quantity'] * full['unit_price']
full['profit'] = full['revenue'] - full['quantity'] * full['cost']

# Analysis: top customers by profit
top_customers = (full.groupby(['customer_id', 'name', 'segment'])
    .agg(total_orders=('order_id', 'count'),
         total_revenue=('revenue', 'sum'),
         total_profit=('profit', 'sum'))
    .sort_values('total_profit', ascending=False)
    .reset_index())
print(top_customers.head())

Exercise 4.2: Pivot Table Report

Task: Build a summary Pivot report

Python

# Pivot: revenue by segment × product
pivot = full.pivot_table(
    values='revenue',
    index='segment',
    columns='product_name',
    aggfunc='sum',
    fill_value=0,
    margins=True,           # Add a total row and column
    margins_name='TOTAL'
)
print(pivot)

# Pivot: profit margin by segment
margin_pivot = full.pivot_table(
    values=['revenue', 'profit'],
    index='segment',
    aggfunc='sum'
)
margin_pivot['margin_%'] = (margin_pivot['profit'] / margin_pivot['revenue'] * 100).round(1)
print(margin_pivot)
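
To see exactly what margins=True adds, it helps to run the same call on a table small enough to check by hand. The sales frame below is hypothetical and unrelated to the exercise data:

```python
import pandas as pd

sales = pd.DataFrame({
    "segment": ["Gold", "Gold", "Silver", "Silver"],
    "product_name": ["Laptop", "Phone", "Laptop", "Laptop"],
    "revenue": [1000, 500, 800, 200],
})

report = sales.pivot_table(
    values="revenue",
    index="segment",
    columns="product_name",
    aggfunc="sum",
    fill_value=0,          # Silver sold no Phone, so that cell becomes 0
    margins=True,          # extra 'TOTAL' row and column
    margins_name="TOTAL",
)
print(report)

# The bottom-right cell is the grand total of all revenue
print(report.loc["TOTAL", "TOTAL"])
```

The TOTAL column holds each segment's revenue across products, the TOTAL row holds each product's revenue across segments, and their intersection is the grand total.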

Exercise 4.3: RFM Analysis (Real-world)

Task: Segment customers with RFM (Recency - Frequency - Monetary)

Python

# Create purchase data
np.random.seed(42)
n = 500
purchases = pd.DataFrame({
    "customer_id": np.random.randint(1, 51, n),
    # "12h" = every 12 hours; the uppercase "12H" alias is deprecated since pandas 2.2
    "date": pd.date_range("2024-01-01", periods=n, freq="12h"),
    "amount": np.random.exponential(200, n).round(0)
})

# RFM: the reference date is one day after the last purchase
today = purchases['date'].max() + pd.Timedelta(days=1)

rfm = purchases.groupby('customer_id').agg(
    recency = ('date', lambda x: (today - x.max()).days),
    frequency = ('customer_id', 'count'),
    monetary = ('amount', 'sum')
).round(0)

# Scoring: split each metric into 4 groups (1-4)
# Recency labels are reversed: a smaller recency (more recent) scores higher
rfm['R_score'] = pd.qcut(rfm['recency'], 4, labels=[4, 3, 2, 1])
# rank(method='first') breaks ties so qcut does not get duplicate bin edges
rfm['F_score'] = pd.qcut(rfm['frequency'].rank(method='first'), 4, labels=[1, 2, 3, 4])
rfm['M_score'] = pd.qcut(rfm['monetary'], 4, labels=[1, 2, 3, 4])

# Total RFM score (3 to 12)
rfm['RFM_score'] = (rfm['R_score'].astype(int) +
                     rfm['F_score'].astype(int) +
                     rfm['M_score'].astype(int))

# Segmentation
def segment(score):
    if score >= 10: return "Champion"
    elif score >= 8: return "Loyal"
    elif score >= 6: return "Potential"
    else: return "At Risk"

rfm['segment'] = rfm['RFM_score'].apply(segment)
print(rfm['segment'].value_counts())
print(rfm.groupby('segment')[['recency', 'frequency', 'monetary']].mean().round(0))

RFM Analysis is one of the most widely used customer segmentation techniques in marketing analytics. You will meet it again in the Challenge lessons!
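
The RFM scoring relies on pd.qcut(), whereas the revenue tiers in Part 2 used pd.cut(). The difference shows up clearly on skewed data. This is a minimal sketch with hypothetical values (one large outlier among small numbers):

```python
import pandas as pd

values = pd.Series([10, 20, 30, 40, 50, 60, 70, 1000])

# cut: bin edges are fixed values, so group sizes can be very uneven
by_width = pd.cut(values, bins=[0, 100, 500, 1000],
                  labels=["low", "mid", "high"])
print(by_width.value_counts().sort_index())

# qcut: bin edges are quantiles, so each group gets about the same count
by_count = pd.qcut(values, 4, labels=[1, 2, 3, 4])
print(by_count.value_counts().sort_index())
```

Here cut() puts seven values in "low" and one in "high", while qcut() puts two values in each quartile. That equal-sized grouping is why quantile bins are a natural fit for relative scores such as R, F, and M, and fixed bins suit business thresholds that have a meaning of their own.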

Checkpoint

Have you finished the RFM exercise? It is a very important analysis technique in practice!

📝 Summary

Self-check questions

Which metrics make up RFM Analysis? What does each one measure about a customer?
When merging several tables, how do you decide between how='inner' and how='left'?
How does pd.qcut() differ from pd.cut()? When should you use each?
What does a pivot table with margins=True add to the report?

✅ Completion checklist

🟢 Part 1 (Easy): create the dataset and explore it. Done?
🟡 Part 2 (Medium): selection, filtering, new columns. Done?
🟡 Part 3 (Medium): groupby by city, cross-analysis, time series. Done?
🔴 Part 4 (Hard): multi-table merge, pivot report, RFM. Done?

Pandas Cheat Sheet

Python

# Read & Explore
df = pd.read_csv("file.csv")
df.shape, df.info(), df.describe()

# Filter
df[df['col'] > 10]
df.query('col > 10')

# Group & Aggregate
df.groupby('key').agg(total=('val', 'sum'))

# Merge
pd.merge(df1, df2, on='key', how='left')

# Pivot
df.pivot_table(values='val', index='row', columns='col', aggfunc='sum')

Next lesson: Visualization, where you will visualize data with Matplotlib, Seaborn, and Plotly! 📈
