Lý thuyết
Bài 3/17

Data Manipulation

Merge, Pivot, GroupBy và Transform trong Pandas

Data Manipulation

Data Manipulation and Processing

1. Introduction

Data Manipulation Skills

Kỹ năng manipulation data là core competency của Data Analyst. Bài này cover các techniques quan trọng nhất: merging, pivoting, grouping, và transforming data.


2. Merging & Joining

2.1 Merge Types

Text
1┌─────────────────────────────────────────────────────────┐
2│ Join Types │
3├─────────────────────────────────────────────────────────┤
4│ │
5│ INNER JOIN LEFT JOIN │
6│ ┌───┬───┐ ┌───┬───┐ │
7│ │ │███│ │███│███│ │
8│ │ │███│ │███│███│ │
9│ └───┴───┘ └───┴───┘ │
10│ Only matches All from left │
11│ │
12│ RIGHT JOIN OUTER JOIN │
13│ ┌───┬───┐ ┌───┬───┐ │
14│ │███│███│ │███│███│ │
15│ │███│███│ │███│███│ │
16│ └───┴───┘ └───┴───┘ │
17│ All from right All from both │
18│ │
19└─────────────────────────────────────────────────────────┘

2.2 pd.merge()

Python
1import pandas as pd
2
3# Sample data
4orders = pd.DataFrame({
5 'order_id': [1, 2, 3, 4],
6 'customer_id': [101, 102, 103, 104],
7 'amount': [500, 300, 700, 200]
8})
9
10customers = pd.DataFrame({
11 'customer_id': [101, 102, 105],
12 'name': ['Alice', 'Bob', 'Eve'],
13 'city': ['Hanoi', 'HCMC', 'Danang']
14})
15
16# Inner join (default)
17inner = pd.merge(orders, customers, on='customer_id')
18print("Inner Join:")
19print(inner)
20
21# Left join
22left = pd.merge(orders, customers, on='customer_id', how='left')
23print("\nLeft Join:")
24print(left)
25
26# Right join
27right = pd.merge(orders, customers, on='customer_id', how='right')
28print("\nRight Join:")
29print(right)
30
31# Outer join
32outer = pd.merge(orders, customers, on='customer_id', how='outer')
33print("\nOuter Join:")
34print(outer)

2.3 Advanced Merge Options

Python
1# Merge on different column names
2df1 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
3df2 = pd.DataFrame({'key': [1, 2, 4], 'score': [10, 20, 40]})
4
5merged = pd.merge(df1, df2, left_on='id', right_on='key')
6
7# Merge on multiple columns
8orders = pd.DataFrame({
9 'year': [2023, 2023, 2024],
10 'quarter': [1, 2, 1],
11 'sales': [100, 200, 150]
12})
13
14targets = pd.DataFrame({
15 'year': [2023, 2023, 2024],
16 'quarter': [1, 2, 1],
17 'target': [120, 180, 140]
18})
19
20merged = pd.merge(orders, targets, on=['year', 'quarter'])
21
22# Handle duplicate column names
23merged = pd.merge(df1, df2, left_on='id', right_on='key',
24 suffixes=('_left', '_right'))
25
26# Validate merge
27# one_to_one, one_to_many, many_to_one, many_to_many
28merged = pd.merge(df1, df2, left_on='id', right_on='key',
29 validate='one_to_one')
30
31# Indicator column
32merged = pd.merge(orders, customers, on='customer_id',
33 how='outer', indicator=True)
34# _merge column shows: left_only, right_only, both

2.4 Join on Index

Python
1# Set index and use join()
2df1 = df1.set_index('id')
3df2 = df2.set_index('key')
4
5# Default is left join
6result = df1.join(df2)
7
8# Or use merge with index
9result = pd.merge(df1, df2, left_index=True, right_index=True)

3. Concatenation

3.1 pd.concat()

Python
1# Vertical concatenation (stack rows)
2df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
4
5combined = pd.concat([df1, df2], ignore_index=True)
6print(combined)
7# A B
8# 0 1 3
9# 1 2 4
10# 2 5 7
11# 3 6 8
12
13# Horizontal concatenation
14combined_h = pd.concat([df1, df2], axis=1)
15print(combined_h)
16# A B A B
17# 0 1 3 5 7
18# 1 2 4 6 8
19
20# With different columns
21df3 = pd.DataFrame({'A': [1, 2], 'C': [9, 10]})
22combined = pd.concat([df1, df3], ignore_index=True)
23# Missing columns filled with NaN

3.2 Concat với keys

Python
1# Add hierarchical index
2q1 = pd.DataFrame({'sales': [100, 200]}, index=['Jan', 'Feb'])
3q2 = pd.DataFrame({'sales': [300, 400]}, index=['Mar', 'Apr'])
4
5combined = pd.concat([q1, q2], keys=['Q1', 'Q2'])
6print(combined)
7# sales
8# Q1 Jan 100
9# Feb 200
10# Q2 Mar 300
11# Apr 400

4. GroupBy

GroupBy Power

GroupBy là một trong những công cụ mạnh nhất của Pandas, cho phép split-apply-combine data theo nhóm.

4.1 Basic GroupBy

Python
1# Sample data
2sales = pd.DataFrame({
3 'date': pd.date_range('2024-01-01', periods=10),
4 'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
5 'region': ['North', 'North', 'South', 'South', 'North',
6 'South', 'North', 'North', 'South', 'South'],
7 'amount': [100, 150, 200, 180, 120, 160, 140, 170, 190, 130]
8})
9
10# Group by single column
11grouped = sales.groupby('product')
12print(grouped.sum())
13
14# Group by multiple columns
15grouped = sales.groupby(['product', 'region'])
16print(grouped.sum())
17
18# Access groups
19print(grouped.groups)
20print(grouped.get_group(('A', 'North')))

4.2 Aggregation Functions

Python
1# Single aggregation
2print(sales.groupby('product')['amount'].sum())
3print(sales.groupby('product')['amount'].mean())
4print(sales.groupby('product')['amount'].count())
5
6# Multiple aggregations
7agg_result = sales.groupby('product')['amount'].agg(['sum', 'mean', 'min', 'max'])
8print(agg_result)
9
10# Different aggregations per column
11agg_result = sales.groupby('product').agg({
12 'amount': ['sum', 'mean'],
13 'date': 'count'
14})
15print(agg_result)
16
17# Named aggregations (cleaner output)
18agg_result = sales.groupby('product').agg(
19 total_sales=('amount', 'sum'),
20 avg_sale=('amount', 'mean'),
21 transaction_count=('date', 'count')
22)
23print(agg_result)
24
25# Custom aggregation function
26def range_func(x):
27 return x.max() - x.min()
28
29agg_result = sales.groupby('product')['amount'].agg(range_func)

4.3 Transform

Python
1# Transform returns same-shape output
2# Useful for adding group-level calculations back to original data
3
4# Add group mean to each row
5sales['group_mean'] = sales.groupby('product')['amount'].transform('mean')
6
7# Normalize within group
8sales['normalized'] = sales.groupby('product')['amount'].transform(
9 lambda x: (x - x.mean()) / x.std()
10)
11
12# Rank within group
13sales['rank_in_group'] = sales.groupby('product')['amount'].transform('rank')
14
15# Percentage of group total
16sales['pct_of_group'] = sales.groupby('product')['amount'].transform(
17 lambda x: x / x.sum() * 100
18)

4.4 Filter Groups

Python
1# Keep only groups with sum > 500
2filtered = sales.groupby('product').filter(lambda x: x['amount'].sum() > 500)
3
4# Keep groups with more than 3 transactions
5filtered = sales.groupby('product').filter(lambda x: len(x) > 3)

4.5 Apply

Python
1# Apply custom function to each group
2def top_n(group, n=2, column='amount'):
3 return group.nlargest(n, column)
4
5result = sales.groupby('product').apply(top_n, n=2)
6
7# Reset index after apply
8result = sales.groupby('product').apply(top_n).reset_index(drop=True)

5. Pivot Tables

5.1 pivot_table()

Python
1# Basic pivot table
2pivot = sales.pivot_table(
3 values='amount',
4 index='product',
5 columns='region',
6 aggfunc='sum'
7)
8print(pivot)
9# region North South
10# product
11# A 360 390
12# B 470 290
13
14# Multiple aggregations
15pivot = sales.pivot_table(
16 values='amount',
17 index='product',
18 columns='region',
19 aggfunc=['sum', 'mean', 'count']
20)
21
22# Multiple values
23pivot = sales.pivot_table(
24 values=['amount'], # Can add more columns
25 index='product',
26 columns='region',
27 aggfunc='sum',
28 margins=True, # Add totals
29 margins_name='Total'
30)

5.2 pivot() - Reshaping without aggregation

Python
1# When there's only one value per cell
2df = pd.DataFrame({
3 'date': ['2024-01', '2024-01', '2024-02', '2024-02'],
4 'product': ['A', 'B', 'A', 'B'],
5 'sales': [100, 150, 120, 180]
6})
7
8pivoted = df.pivot(index='date', columns='product', values='sales')
9print(pivoted)
10# product A B
11# date
12# 2024-01 100 150
13# 2024-02 120 180

5.3 melt() - Unpivot

Python
1# Wide to long format
2wide_df = pd.DataFrame({
3 'product': ['A', 'B'],
4 'Q1': [100, 150],
5 'Q2': [120, 180],
6 'Q3': [140, 200]
7})
8
9long_df = pd.melt(
10 wide_df,
11 id_vars=['product'],
12 value_vars=['Q1', 'Q2', 'Q3'],
13 var_name='quarter',
14 value_name='sales'
15)
16print(long_df)
17# product quarter sales
18# 0 A Q1 100
19# 1 B Q1 150
20# 2 A Q2 120
21# 3 B Q2 180
22# 4 A Q3 140
23# 5 B Q3 200

6. Stack & Unstack

Python
1# Multi-index DataFrame
2df = sales.groupby(['product', 'region'])['amount'].sum()
3print(df)
4# product region
5# A North 360
6# South 390
7# B North 470
8# South 290
9
10# Unstack - pivot inner level to columns
11unstacked = df.unstack()
12print(unstacked)
13# region North South
14# product
15# A 360 390
16# B 470 290
17
18# Stack - opposite of unstack
19stacked = unstacked.stack()
20print(stacked)
21# Back to original format
22
23# Unstack specific level
24df_multi = sales.groupby(['product', 'region', 'date'])['amount'].sum()
25unstacked_date = df_multi.unstack(level='date')

7. Window Functions

7.1 Rolling Windows

Python
1# Time series data
2ts = pd.DataFrame({
3 'date': pd.date_range('2024-01-01', periods=30),
4 'sales': np.random.randint(100, 200, 30)
5})
6ts = ts.set_index('date')
7
8# Rolling mean (moving average)
9ts['MA_7'] = ts['sales'].rolling(window=7).mean()
10ts['MA_14'] = ts['sales'].rolling(window=14).mean()
11
12# Rolling statistics
13ts['rolling_std'] = ts['sales'].rolling(7).std()
14ts['rolling_min'] = ts['sales'].rolling(7).min()
15ts['rolling_max'] = ts['sales'].rolling(7).max()
16
17# Rolling sum
18ts['rolling_sum'] = ts['sales'].rolling(7).sum()
19
20# Custom rolling function
21ts['rolling_range'] = ts['sales'].rolling(7).apply(lambda x: x.max() - x.min())

7.2 Expanding Windows

Python
1# Cumulative calculations
2ts['cumsum'] = ts['sales'].expanding().sum()
3ts['cummax'] = ts['sales'].expanding().max()
4ts['cummin'] = ts['sales'].expanding().min()
5ts['cummean'] = ts['sales'].expanding().mean()
6
7# Year-to-date totals
8ts['ytd_sales'] = ts['sales'].expanding().sum()

7.3 Exponential Weighted Moving Average

Python
1# EWMA - gives more weight to recent values
2ts['ewma'] = ts['sales'].ewm(span=7).mean()
3ts['ewma_20'] = ts['sales'].ewm(span=20).mean()

8. Advanced Techniques

8.1 Method Chaining

Python
1# Chain multiple operations
2result = (
3 sales
4 .query('amount > 100')
5 .assign(
6 amount_k=lambda x: x['amount'] / 1000,
7 month=lambda x: x['date'].dt.month
8 )
9 .groupby(['product', 'month'])
10 .agg(
11 total=('amount', 'sum'),
12 avg=('amount', 'mean')
13 )
14 .reset_index()
15 .sort_values('total', ascending=False)
16)

8.2 pipe()

Python
1# Custom functions in chain
2def add_percent_column(df, col):
3 df[f'{col}_pct'] = df[col] / df[col].sum() * 100
4 return df
5
6def filter_top_n(df, col, n):
7 return df.nlargest(n, col)
8
9result = (
10 sales
11 .groupby('product')['amount']
12 .sum()
13 .reset_index()
14 .pipe(add_percent_column, 'amount')
15 .pipe(filter_top_n, 'amount', 5)
16)

8.3 Cross Tabulation

Python
1# Cross tabulation
2ct = pd.crosstab(
3 sales['product'],
4 sales['region'],
5 values=sales['amount'],
6 aggfunc='sum'
7)
8print(ct)
9
10# With margins
11ct = pd.crosstab(
12 sales['product'],
13 sales['region'],
14 margins=True,
15 margins_name='Total'
16)
17
18# Normalize
19ct_normalized = pd.crosstab(
20 sales['product'],
21 sales['region'],
22 normalize='all' # or 'index', 'columns'
23)

9. Thực hành

Practice Exercise

Exercise: Sales Analysis

Python
1import pandas as pd
2import numpy as np
3
4# Generate sample data
5np.random.seed(42)
6n_orders = 500
7
8orders = pd.DataFrame({
9 'order_id': range(1, n_orders + 1),
10 'date': pd.date_range('2024-01-01', periods=n_orders, freq='D'),
11 'customer_id': np.random.randint(1, 100, n_orders),
12 'product_id': np.random.randint(1, 20, n_orders),
13 'quantity': np.random.randint(1, 10, n_orders),
14 'unit_price': np.random.uniform(10, 100, n_orders).round(2)
15})
16orders['total_amount'] = orders['quantity'] * orders['unit_price']
17
18products = pd.DataFrame({
19 'product_id': range(1, 21),
20 'product_name': [f'Product_{i}' for i in range(1, 21)],
21 'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], 20)
22})
23
24customers = pd.DataFrame({
25 'customer_id': range(1, 101),
26 'customer_name': [f'Customer_{i}' for i in range(1, 101)],
27 'region': np.random.choice(['North', 'South', 'East', 'West'], 100)
28})
29
30# Tasks:
31# 1. Merge orders với products và customers
32# 2. Tính monthly sales by category
33# 3. Tìm top 5 customers by total spending
34# 4. Tạo pivot table: category vs region, values = total_amount
35# 5. Tính 7-day rolling average sales
36
37# YOUR CODE HERE
💡 Xem đáp án
Python
1# 1. Merge all tables
2df = (
3 orders
4 .merge(products, on='product_id')
5 .merge(customers, on='customer_id')
6)
7print(f"Merged shape: {df.shape}")
8print(df.head())
9
10# 2. Monthly sales by category
11df['month'] = df['date'].dt.to_period('M')
12monthly_category = (
13 df
14 .groupby(['month', 'category'])['total_amount']
15 .sum()
16 .unstack()
17 .round(2)
18)
19print("\nMonthly Sales by Category:")
20print(monthly_category)
21
22# 3. Top 5 customers
23top_customers = (
24 df
25 .groupby(['customer_id', 'customer_name'])['total_amount']
26 .sum()
27 .reset_index()
28 .nlargest(5, 'total_amount')
29)
30print("\nTop 5 Customers:")
31print(top_customers)
32
33# 4. Pivot table
34pivot = pd.pivot_table(
35 df,
36 values='total_amount',
37 index='category',
38 columns='region',
39 aggfunc='sum',
40 margins=True,
41 margins_name='Total'
42).round(2)
43print("\nPivot Table - Category vs Region:")
44print(pivot)
45
46# 5. 7-day rolling average
47daily_sales = df.groupby('date')['total_amount'].sum()
48rolling_avg = daily_sales.rolling(7).mean()
49print("\n7-Day Rolling Average:")
50print(rolling_avg.tail(10))
51
52# Visualization
53import matplotlib.pyplot as plt
54
55fig, axes = plt.subplots(2, 2, figsize=(14, 10))
56
57# Monthly trend
58monthly_category.plot(ax=axes[0, 0], marker='o')
59axes[0, 0].set_title('Monthly Sales by Category')
60axes[0, 0].set_xlabel('Month')
61axes[0, 0].set_ylabel('Sales')
62
63# Top customers
64axes[0, 1].barh(top_customers['customer_name'], top_customers['total_amount'])
65axes[0, 1].set_title('Top 5 Customers by Spending')
66axes[0, 1].set_xlabel('Total Amount')
67
68# Region distribution
69region_sales = df.groupby('region')['total_amount'].sum()
70axes[1, 0].pie(region_sales, labels=region_sales.index, autopct='%1.1f%%')
71axes[1, 0].set_title('Sales by Region')
72
73# Rolling average
74axes[1, 1].plot(daily_sales.index, daily_sales.values, alpha=0.5, label='Daily')
75axes[1, 1].plot(rolling_avg.index, rolling_avg.values, label='7-day MA')
76axes[1, 1].set_title('Daily Sales with Rolling Average')
77axes[1, 1].legend()
78axes[1, 1].tick_params(axis='x', rotation=45)
79
80plt.tight_layout()
81plt.show()

10. Tổng kết

OperationMethodUse Case
Mergepd.merge()Join tables on keys
Concatpd.concat()Stack DataFrames
GroupBydf.groupby()Aggregate by groups
Transformtransform()Group calculations same shape
Pivotpivot_table()Reshape for analysis
Meltpd.melt()Wide to long format
Rollingrolling()Moving window calculations

Bài tiếp theo: Pandas Advanced - MultiIndex, Method Chaining, Optimization