Data Manipulation

🎯 Mục tiêu bài học

TB5 min

Sau bài học này, bạn sẽ:

✅ Merge/Join DataFrames (inner, left, right, outer)

✅ Concatenate datasets vertically + horizontally

✅ GroupBy: split-apply-combine pattern

✅ Pivot Tables + Melt cho data reshaping

✅ Window Functions: rolling, expanding, EWMA

Thời gian: 2 giờ | Độ khó: Intermediate | Tool: Python + Pandas

Task 0

📖 Bảng Thuật Ngữ Quan Trọng

TB5 min

Thuật ngữ	Tiếng Việt	Mô tả
Merge	Nối bảng	Join 2 DataFrames on key columns
Inner Join	Nối trong	Chỉ giữ records match cả 2 bảng
Left Join	Nối trái	Giữ tất cả từ bảng trái
Outer Join	Nối ngoài	Giữ tất cả từ cả 2 bảng
Concat	Ghép nối	Stack DataFrames vertically/horizontally
GroupBy	Nhóm theo	Split-apply-combine pattern
Pivot Table	Bảng tổng hợp	Reshape + aggregate data
Melt	Chuyển dạng	Wide → Long format
Transform	Biến đổi	Group calc → same-shape output
Rolling	Cửa sổ trượt	Moving window calculations

Checkpoint

Merge = JOIN (inner/left/right/outer). GroupBy = split-apply-combine. Pivot = reshape + aggregate. Melt = wide→long. Giống SQL JOINs + GROUP BY!

Task 1

🔗 1. Merging & Joining

TB5 min

1.1 Merge Types

Merge Types

Join Type	Giữ dữ liệu	Mô tả
INNER JOIN	Chỉ matches	Chỉ giữ records match cả 2 bảng
LEFT JOIN	Tất cả bên trái	Giữ tất cả từ bảng trái + matches
RIGHT JOIN	Tất cả bên phải	Giữ tất cả từ bảng phải + matches
OUTER JOIN	Tất cả cả 2	Giữ tất cả từ cả 2 bảng

1.2 pd.merge()

Python

1import pandas as pd
2
3orders = pd.DataFrame({
4    'order_id': [1, 2, 3, 4],
5    'customer_id': [101, 102, 103, 104],
6    'amount': [500, 300, 700, 200]
7})
8
9customers = pd.DataFrame({
10    'customer_id': [101, 102, 105],
11    'name': ['Alice', 'Bob', 'Eve'],
12    'city': ['Hanoi', 'HCMC', 'Danang']
13})
14
15# Join types
16inner = pd.merge(orders, customers, on='customer_id')          # default
17left  = pd.merge(orders, customers, on='customer_id', how='left')
18right = pd.merge(orders, customers, on='customer_id', how='right')
19outer = pd.merge(orders, customers, on='customer_id', how='outer')

1.3 Advanced Merge

Python

1# Different column names
2merged = pd.merge(df1, df2, left_on='id', right_on='key')
3
4# Multiple columns
5merged = pd.merge(orders, targets, on=['year', 'quarter'])
6
7# Handle duplicates
8merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
9
10# Indicator column
11merged = pd.merge(orders, customers, on='customer_id',
12                  how='outer', indicator=True)
13# _merge: left_only, right_only, both
14
15# Validate
16merged = pd.merge(df1, df2, on='id', validate='one_to_one')

Checkpoint

pd.merge() = SQL JOIN. 4 types: inner (matches), left (all left), right (all right), outer (all). indicator=True shows match status!

Task 2

📦 2. Concatenation

TB5 min

Python

1# Vertical (stack rows)
2combined = pd.concat([df1, df2], ignore_index=True)
3
4# Horizontal
5combined_h = pd.concat([df1, df2], axis=1)
6
7# With keys (hierarchical index)
8q1 = pd.DataFrame({'sales': [100, 200]}, index=['Jan', 'Feb'])
9q2 = pd.DataFrame({'sales': [300, 400]}, index=['Mar', 'Apr'])
10combined = pd.concat([q1, q2], keys=['Q1', 'Q2'])

Merge = join on keys (like SQL JOIN). Concat = stack datasets (like UNION). Dùng ignore_index=True để reset index sau concat!

Checkpoint

pd.concat() = stack DataFrames. axis=0 (vertical), axis=1 (horizontal). keys tạo hierarchical index. Missing columns → NaN!

Task 3

📊 3. GroupBy

TB5 min

3.1 Basic GroupBy

Python

1sales = pd.DataFrame({
2    'date': pd.date_range('2024-01-01', periods=10),
3    'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
4    'region': ['North', 'North', 'South', 'South', 'North',
5               'South', 'North', 'North', 'South', 'South'],
6    'amount': [100, 150, 200, 180, 120, 160, 140, 170, 190, 130]
7})
8
9# Group by single column
10print(sales.groupby('product')['amount'].sum())
11
12# Group by multiple columns
13print(sales.groupby(['product', 'region'])['amount'].sum())

3.2 Aggregation Functions

Python

1# Multiple aggregations
2agg_result = sales.groupby('product')['amount'].agg(['sum', 'mean', 'min', 'max'])
3
4# Named aggregations (clean output)
5agg_result = sales.groupby('product').agg(
6    total_sales=('amount', 'sum'),
7    avg_sale=('amount', 'mean'),
8    transaction_count=('date', 'count')
9)
10
11# Custom function
12def range_func(x):
13    return x.max() - x.min()
14agg_result = sales.groupby('product')['amount'].agg(range_func)

3.3 Transform & Filter

Python

1# Transform: same-shape output
2sales['group_mean'] = sales.groupby('product')['amount'].transform('mean')
3sales['pct_of_group'] = sales.groupby('product')['amount'].transform(
4    lambda x: x / x.sum() * 100
5)
6sales['rank_in_group'] = sales.groupby('product')['amount'].transform('rank')
7
8# Filter groups
9filtered = sales.groupby('product').filter(lambda x: x['amount'].sum() > 500)

GroupBy Power: agg() cho summary, transform() cho same-shape output (add back to original), filter() để keep/remove entire groups!

Checkpoint

GroupBy = split-apply-combine. agg() = summary. transform() = same shape (add group stats per row). filter() = keep/drop groups. Giống SQL GROUP BY!

Task 4

🔄 4. Pivot Tables & Reshaping

TB5 min

5.1 pivot_table()

Python

1# Basic pivot
2pivot = sales.pivot_table(
3    values='amount',
4    index='product',
5    columns='region',
6    aggfunc='sum'
7)
8
9# Multiple aggregations + totals
10pivot = sales.pivot_table(
11    values='amount',
12    index='product',
13    columns='region',
14    aggfunc=['sum', 'mean', 'count'],
15    margins=True,
16    margins_name='Total'
17)

5.2 pivot() - No aggregation

Python

1df = pd.DataFrame({
2    'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
3    'product': ['A', 'B', 'A', 'B'],
4    'sales': [100, 150, 120, 180]
5})
6pivoted = df.pivot(index='date', columns='product', values='sales')

5.3 melt() - Wide to Long

Python

1wide_df = pd.DataFrame({
2    'product': ['A', 'B'],
3    'Q1': [100, 150], 'Q2': [120, 180], 'Q3': [140, 200]
4})
5
6long_df = pd.melt(wide_df, id_vars=['product'],
7                  value_vars=['Q1', 'Q2', 'Q3'],
8                  var_name='quarter', value_name='sales')

5.4 Stack & Unstack

Python

1# Multi-index → columns
2df_grouped = sales.groupby(['product', 'region'])['amount'].sum()
3unstacked = df_grouped.unstack()  # region → columns
4stacked = unstacked.stack()        # Back to multi-index
5
6# Cross tabulation
7ct = pd.crosstab(sales['product'], sales['region'],
8                 values=sales['amount'], aggfunc='sum',
9                 margins=True)

Checkpoint

pivot_table() = reshape + aggregate (margins=True cho totals). melt() = wide→long. unstack() = index→columns. crosstab() = frequency/value table!

Task 5

📈 5. Window Functions

TB5 min

6.1 Rolling Windows

Python

1ts = pd.DataFrame({
2    'date': pd.date_range('2024-01-01', periods=30),
3    'sales': np.random.randint(100, 200, 30)
4}).set_index('date')
5
6# Moving averages
7ts['MA_7'] = ts['sales'].rolling(window=7).mean()
8ts['MA_14'] = ts['sales'].rolling(window=14).mean()
9
10# Rolling statistics
11ts['rolling_std'] = ts['sales'].rolling(7).std()
12ts['rolling_min'] = ts['sales'].rolling(7).min()
13ts['rolling_max'] = ts['sales'].rolling(7).max()

6.2 Expanding & EWMA

Python

1# Cumulative
2ts['cumsum'] = ts['sales'].expanding().sum()
3ts['cummax'] = ts['sales'].expanding().max()
4ts['cummean'] = ts['sales'].expanding().mean()
5
6# EWMA - more weight to recent values
7ts['ewma'] = ts['sales'].ewm(span=7).mean()

Checkpoint

Rolling = moving window (MA, std). Expanding = cumulative (YTD). EWMA = exponential weighted (recent bias). Essential cho time series analysis!

Task 6

📋 Tổng kết

TB5 min

Kiến thức đã học

Operation	Method	Use Case
Merge	`pd.merge()`	Join tables on keys
Concat	`pd.concat()`	Stack DataFrames
GroupBy	`df.groupby()`	Aggregate by groups
Transform	`transform()`	Group calculations same shape
Pivot	`pivot_table()`	Reshape for analysis
Melt	`pd.melt()`	Wide to long format
Rolling	`rolling()`	Moving window calculations

Câu hỏi tự kiểm tra

merge() và concat() khác nhau thế nào?
transform() khác agg() ở điểm nào?
pivot_table() và melt() dùng trong trường hợp nào?
rolling() dùng để tính gì?

Bài tiếp theo: Pandas Advanced — MultiIndex, Method Chaining, Optimization →

🎉 Tuyệt vời! Bạn đã thành thạo data manipulation với Pandas!

Nhớ: Merge = JOIN tables, Concat = UNION tables, GroupBy = SQL GROUP BY. Nắm 3 operations này là giải quyết được hầu hết bài toán!

Task 7

🎯 Mục tiêu bài học

📖 Bảng Thuật Ngữ Quan Trọng

Checkpoint

🔗 1. Merging & Joining

1.1 Merge Types

1.2 pd.merge()

1.3 Advanced Merge

Checkpoint

📦 2. Concatenation

Checkpoint

📊 3. GroupBy

3.1 Basic GroupBy

3.2 Aggregation Functions

3.3 Transform & Filter

Checkpoint

🔄 4. Pivot Tables & Reshaping

5.1 pivot_table()

5.2 pivot() - No aggregation

5.3 melt() - Wide to Long

5.4 Stack & Unstack

Checkpoint

📈 5. Window Functions

6.1 Rolling Windows

6.2 Expanding & EWMA

Checkpoint

📋 Tổng kết

Kiến thức đã học

Câu hỏi tự kiểm tra

Khóa học

Mentor & Hỗ trợ

Blog

Giới thiệu