
Data Manipulation

Merge, Pivot, GroupBy và Transform trong Pandas


🎯 Lesson Objectives


After this lesson, you will be able to:

✅ Merge/Join DataFrames (inner, left, right, outer)

✅ Concatenate datasets vertically + horizontally

✅ GroupBy: split-apply-combine pattern

✅ Pivot Tables + Melt for data reshaping

✅ Window Functions: rolling, expanding, EWMA

Duration: 2 hours | Difficulty: Intermediate | Tool: Python + Pandas


📖 Key Terminology

| Term | Vietnamese | Description |
|---|---|---|
| Merge | Nối bảng | Join two DataFrames on key columns |
| Inner Join | Nối trong | Keep only records that match in both tables |
| Left Join | Nối trái | Keep all rows from the left table |
| Outer Join | Nối ngoài | Keep all rows from both tables |
| Concat | Ghép nối | Stack DataFrames vertically/horizontally |
| GroupBy | Nhóm theo | Split-apply-combine pattern |
| Pivot Table | Bảng tổng hợp | Reshape + aggregate data |
| Melt | Chuyển dạng | Wide → long format |
| Transform | Biến đổi | Group calculation with same-shape output |
| Rolling | Cửa sổ trượt | Moving-window calculations |

Checkpoint

Merge = JOIN (inner/left/right/outer). GroupBy = split-apply-combine. Pivot = reshape + aggregate. Melt = wide→long. Just like SQL JOINs + GROUP BY!


🔗 1. Merging & Joining


1.1 Merge Types

| Join Type | Rows Kept | Description |
|---|---|---|
| INNER JOIN | Matches only | Keep only records that match in both tables |
| LEFT JOIN | All from the left | Keep everything from the left table + matches |
| RIGHT JOIN | All from the right | Keep everything from the right table + matches |
| OUTER JOIN | All from both | Keep everything from both tables |

1.2 pd.merge()

Python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 103, 104],
    'amount': [500, 300, 700, 200]
})

customers = pd.DataFrame({
    'customer_id': [101, 102, 105],
    'name': ['Alice', 'Bob', 'Eve'],
    'city': ['Hanoi', 'HCMC', 'Danang']
})

# Join types
inner = pd.merge(orders, customers, on='customer_id')  # inner is the default
left = pd.merge(orders, customers, on='customer_id', how='left')
right = pd.merge(orders, customers, on='customer_id', how='right')
outer = pd.merge(orders, customers, on='customer_id', how='outer')

1.3 Advanced Merge

Python
# Different column names on each side
merged = pd.merge(df1, df2, left_on='id', right_on='key')

# Merge on multiple columns
merged = pd.merge(orders, targets, on=['year', 'quarter'])

# Handle duplicate column names
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

# Indicator column
merged = pd.merge(orders, customers, on='customer_id',
                  how='outer', indicator=True)
# _merge: left_only, right_only, both

# Validate the merge cardinality
merged = pd.merge(df1, df2, on='id', validate='one_to_one')

Checkpoint

pd.merge() = SQL JOIN. 4 types: inner (matches), left (all left), right (all right), outer (all). indicator=True shows match status!
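As a quick sanity check on the four join types, the orders/customers example above yields a different row count per join — a small sketch using the same data:

```python
import pandas as pd

orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer_id': [101, 102, 103, 104],
    'amount': [500, 300, 700, 200]
})
customers = pd.DataFrame({
    'customer_id': [101, 102, 105],
    'name': ['Alice', 'Bob', 'Eve'],
    'city': ['Hanoi', 'HCMC', 'Danang']
})

# Only 101 and 102 exist in both tables
inner = pd.merge(orders, customers, on='customer_id')
left = pd.merge(orders, customers, on='customer_id', how='left')
right = pd.merge(orders, customers, on='customer_id', how='right')
outer = pd.merge(orders, customers, on='customer_id', how='outer',
                 indicator=True)

print(len(inner), len(left), len(right), len(outer))  # 2 4 3 5
print(outer['_merge'].value_counts())  # both=2, left_only=2, right_only=1
```

The `indicator=True` column makes it easy to audit which rows came from which side after an outer join.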


📦 2. Concatenation

Python
# Vertical (stack rows)
combined = pd.concat([df1, df2], ignore_index=True)

# Horizontal (side by side)
combined_h = pd.concat([df1, df2], axis=1)

# With keys (hierarchical index)
q1 = pd.DataFrame({'sales': [100, 200]}, index=['Jan', 'Feb'])
q2 = pd.DataFrame({'sales': [300, 400]}, index=['Mar', 'Apr'])
combined = pd.concat([q1, q2], keys=['Q1', 'Q2'])

Merge = join on keys (like SQL JOIN). Concat = stack datasets (like UNION). Use ignore_index=True to reset the index after concat!

Checkpoint

pd.concat() = stack DataFrames. axis=0 (vertical), axis=1 (horizontal). keys tạo hierarchical index. Missing columns → NaN!
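The "missing columns → NaN" behavior is easy to see on a small sketch (jan/feb are illustrative frames; feb has an extra returns column that jan lacks):

```python
import pandas as pd

jan = pd.DataFrame({'product': ['A', 'B'], 'sales': [100, 150]})
feb = pd.DataFrame({'product': ['A', 'B'], 'sales': [120, 180],
                    'returns': [5, 3]})

# Vertical stack: jan has no 'returns' column, so those rows get NaN
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)
print(combined['returns'].isna().sum())  # 2 — the two jan rows
```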


📊 3. GroupBy


3.1 Basic GroupBy

Python
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=10),
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['North', 'North', 'South', 'South', 'North',
               'South', 'North', 'North', 'South', 'South'],
    'amount': [100, 150, 200, 180, 120, 160, 140, 170, 190, 130]
})

# Group by a single column
print(sales.groupby('product')['amount'].sum())

# Group by multiple columns
print(sales.groupby(['product', 'region'])['amount'].sum())
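Worth noting: grouping by multiple columns returns a Series with a MultiIndex, and reset_index() flattens it back into ordinary columns — a sketch on the same sales data (dates omitted for brevity):

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'region': ['North', 'North', 'South', 'South', 'North',
               'South', 'North', 'North', 'South', 'South'],
    'amount': [100, 150, 200, 180, 120, 160, 140, 170, 190, 130]
})

grouped = sales.groupby(['product', 'region'])['amount'].sum()
print(grouped.index.nlevels)  # 2 — a MultiIndex of (product, region)

# reset_index() turns the MultiIndex back into ordinary columns
flat = grouped.reset_index()
print(flat)
```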

3.2 Aggregation Functions

Python
# Multiple aggregations
agg_result = sales.groupby('product')['amount'].agg(['sum', 'mean', 'min', 'max'])

# Named aggregations (clean output)
agg_result = sales.groupby('product').agg(
    total_sales=('amount', 'sum'),
    avg_sale=('amount', 'mean'),
    transaction_count=('date', 'count')
)

# Custom function
def range_func(x):
    return x.max() - x.min()

agg_result = sales.groupby('product')['amount'].agg(range_func)
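Named aggregation labels the output columns exactly as the keywords you pass, which keeps downstream code readable — a minimal sketch on smaller illustrative data:

```python
import pandas as pd

sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'B'],
    'amount': [100, 150, 200, 180]
})

# The keyword names become the result's column names
agg_result = sales.groupby('product').agg(
    total_sales=('amount', 'sum'),
    avg_sale=('amount', 'mean'),
)
print(list(agg_result.columns))  # ['total_sales', 'avg_sale']
print(agg_result.loc['A', 'total_sales'])  # 300
```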

3.3 Transform & Filter

Python
# Transform: same-shape output
sales['group_mean'] = sales.groupby('product')['amount'].transform('mean')
sales['pct_of_group'] = sales.groupby('product')['amount'].transform(
    lambda x: x / x.sum() * 100
)
sales['rank_in_group'] = sales.groupby('product')['amount'].transform('rank')

# Filter groups
filtered = sales.groupby('product').filter(lambda x: x['amount'].sum() > 500)

GroupBy power: agg() for summaries, transform() for same-shape output (add back to the original), filter() to keep or remove entire groups!

Checkpoint

GroupBy = split-apply-combine. agg() = summary. transform() = same shape (add group stats per row). filter() = keep/drop groups. Just like SQL GROUP BY!
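The agg-vs-transform distinction in the checkpoint fits in one tiny example (df, grp, val are illustrative names): agg() collapses each group to one row, while transform() broadcasts the group result back to every input row:

```python
import pandas as pd

df = pd.DataFrame({'grp': ['x', 'x', 'y'], 'val': [10, 20, 30]})

# agg: one row per group
summary = df.groupby('grp')['val'].agg('mean')
# transform: the group mean repeated for every original row
per_row = df.groupby('grp')['val'].transform('mean')

print(len(summary), len(per_row))  # 2 3
print(per_row.tolist())  # [15.0, 15.0, 30.0]
```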


🔄 4. Pivot Tables & Reshaping


4.1 pivot_table()

Python
# Basic pivot
pivot = sales.pivot_table(
    values='amount',
    index='product',
    columns='region',
    aggfunc='sum'
)

# Multiple aggregations + totals
pivot = sales.pivot_table(
    values='amount',
    index='product',
    columns='region',
    aggfunc=['sum', 'mean', 'count'],
    margins=True,
    margins_name='Total'
)

4.2 pivot() - No aggregation

Python
df = pd.DataFrame({
    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
    'product': ['A', 'B', 'A', 'B'],
    'sales': [100, 150, 120, 180]
})
pivoted = df.pivot(index='date', columns='product', values='sales')
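A practical difference worth knowing: pivot() raises a ValueError when an (index, columns) pair appears more than once, while pivot_table() aggregates the duplicates (mean by default). A sketch:

```python
import pandas as pd

# Duplicate (date, product) pair: pivot() cannot decide which value to keep
dup = pd.DataFrame({
    'date': ['2024-01', '2024-01'],
    'product': ['A', 'A'],
    'sales': [100, 50]
})

raised = False
try:
    dup.pivot(index='date', columns='product', values='sales')
except ValueError:
    raised = True
print('pivot() raised ValueError:', raised)

# pivot_table() aggregates the duplicates instead (mean by default)
pt = dup.pivot_table(index='date', columns='product', values='sales')
print(pt.loc['2024-01', 'A'])  # 75.0
```

Rule of thumb: use pivot() only when each (index, columns) pair is guaranteed unique; otherwise use pivot_table().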

4.3 melt() - Wide to Long

Python
wide_df = pd.DataFrame({
    'product': ['A', 'B'],
    'Q1': [100, 150], 'Q2': [120, 180], 'Q3': [140, 200]
})

long_df = pd.melt(wide_df, id_vars=['product'],
                  value_vars=['Q1', 'Q2', 'Q3'],
                  var_name='quarter', value_name='sales')
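The melt above produces one row per (product, quarter) pair, and pivot() undoes it — a quick round-trip check:

```python
import pandas as pd

wide_df = pd.DataFrame({
    'product': ['A', 'B'],
    'Q1': [100, 150], 'Q2': [120, 180], 'Q3': [140, 200]
})

long_df = pd.melt(wide_df, id_vars=['product'],
                  value_vars=['Q1', 'Q2', 'Q3'],
                  var_name='quarter', value_name='sales')
print(long_df.shape)  # (6, 3) — 2 products × 3 quarters

# pivot() reverses the melt back to wide format
back = long_df.pivot(index='product', columns='quarter', values='sales')
print(back.loc['A', 'Q2'])  # 120
```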

4.4 Stack & Unstack

Python
1# Multi-index → columns
2df_grouped = sales.groupby(['product', 'region'])['amount'].sum()
3unstacked = df_grouped.unstack() # region → columns
4stacked = unstacked.stack() # Back to multi-index
5
6# Cross tabulation
7ct = pd.crosstab(sales['product'], sales['region'],
8 values=sales['amount'], aggfunc='sum',
9 margins=True)

Checkpoint

pivot_table() = reshape + aggregate (margins=True for totals). melt() = wide→long. unstack() = index→columns. crosstab() = frequency/value table!


📈 5. Window Functions


6.1 Rolling Windows

Python
import numpy as np

ts = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30),
    'sales': np.random.randint(100, 200, 30)
}).set_index('date')

# Moving averages
ts['MA_7'] = ts['sales'].rolling(window=7).mean()
ts['MA_14'] = ts['sales'].rolling(window=14).mean()

# Rolling statistics
ts['rolling_std'] = ts['sales'].rolling(7).std()
ts['rolling_min'] = ts['sales'].rolling(7).min()
ts['rolling_max'] = ts['sales'].rolling(7).max()
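By default the first window-1 rows of a rolling result are NaN because the window is not yet full; min_periods lets you compute partial windows instead. A sketch on a tiny Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Default: the first window-1 values are NaN (not enough observations yet)
ma = s.rolling(window=3).mean()
print(ma.isna().sum())  # 2

# min_periods=1 averages whatever is available from the first value onward
ma_partial = s.rolling(window=3, min_periods=1).mean()
print(ma_partial.tolist())  # [10.0, 15.0, 20.0, 30.0, 40.0]
```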

5.2 Expanding & EWMA

Python
# Cumulative (expanding window)
ts['cumsum'] = ts['sales'].expanding().sum()
ts['cummax'] = ts['sales'].expanding().max()
ts['cummean'] = ts['sales'].expanding().mean()

# EWMA - more weight to recent values
ts['ewma'] = ts['sales'].ewm(span=7).mean()
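Note that expanding().sum() gives the same values as cumsum() (apart from returning floats) — a quick check on a tiny Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# expanding().sum() is the running total, same values as cumsum()
exp_sum = s.expanding().sum()
print(exp_sum.tolist())  # [1.0, 3.0, 6.0, 10.0]
print((exp_sum == s.cumsum()).all())  # True
```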

Checkpoint

Rolling = moving window (MA, std). Expanding = cumulative (YTD). EWMA = exponentially weighted (recent bias). Essential for time series analysis!


📋 Summary


What You've Learned

| Operation | Method | Use Case |
|---|---|---|
| Merge | pd.merge() | Join tables on keys |
| Concat | pd.concat() | Stack DataFrames |
| GroupBy | df.groupby() | Aggregate by groups |
| Transform | transform() | Group calculations, same shape |
| Pivot | pivot_table() | Reshape for analysis |
| Melt | pd.melt() | Wide to long format |
| Rolling | rolling() | Moving-window calculations |

Self-Check Questions

  1. How do merge() and concat() differ?
  2. How does transform() differ from agg()?
  3. When would you use pivot_table() vs melt()?
  4. What does rolling() compute?

Next lesson: Pandas Advanced — MultiIndex, Method Chaining, Optimization →

🎉 Excellent! You've mastered data manipulation with Pandas!

Remember: Merge = JOIN tables, Concat = UNION tables, GroupBy = SQL GROUP BY. Master these three operations and you can handle most data problems!