Lý thuyết
Bài 3/15

Pandas Nâng cao

MultiIndex, GroupBy nâng cao, Merge strategies và Performance optimization

Pandas Nâng cao

Python Pandas Data Analysis

1. MultiIndex (Hierarchical Index)

1.1 Tạo MultiIndex DataFrame

Python
1import pandas as pd
2import numpy as np
3
4# Method 1: From tuples
5arrays = [
6 ['2023', '2023', '2024', '2024'],
7 ['Q1', 'Q2', 'Q1', 'Q2']
8]
9index = pd.MultiIndex.from_arrays(arrays, names=['Year', 'Quarter'])
10df = pd.DataFrame({
11 'Revenue': [100, 150, 120, 180],
12 'Profit': [20, 35, 28, 45]
13}, index=index)
14
15print(df)
16# Revenue Profit
17# Year Quarter
18# 2023 Q1 100 20
19# Q2 150 35
20# 2024 Q1 120 28
21# Q2 180 45
Python
1# Method 2: set_index với multiple columns
2sales_data = pd.DataFrame({
3 'Region': ['North', 'North', 'South', 'South'],
4 'Product': ['A', 'B', 'A', 'B'],
5 'Sales': [100, 200, 150, 250],
6 'Quantity': [10, 20, 15, 25]
7})
8
9df_multi = sales_data.set_index(['Region', 'Product'])
10print(df_multi)
11# Sales Quantity
12# Region Product
13# North A 100 10
14# B 200 20
15# South A 150 15
16# B 250 25

1.2 Truy cập MultiIndex

Python
1# Lấy tất cả data của một region
2df_multi.loc['North']
3# Sales Quantity
4# Product
5# A 100 10
6# B 200 20
7
8# Lấy specific row
9df_multi.loc[('North', 'A')]
10# Sales 100
11# Quantity 10
12
13# Cross-section với xs()
14df_multi.xs('A', level='Product')
15# Sales Quantity
16# Region
17# North 100 10
18# South 150 15

1.3 Stack và Unstack

Python
1# Pivot data giữa row và column index
2df_wide = df_multi.unstack() # Move Product to columns
3print(df_wide)
4# Sales Quantity
5# Product A B A B
6# Region
7# North 100 200 10 20
8# South 150 250 15 25
9
10# Stack back
11df_long = df_wide.stack()

2. GroupBy Nâng cao

2.1 Multiple Aggregations

Python
1# Sample data
2df = pd.DataFrame({
3 'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
4 'Product': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
5 'Sales': [100, 200, 150, 250, 120, 180],
6 'Quantity': [10, 20, 15, 25, 12, 18]
7})
8
9# Multiple agg functions
10result = df.groupby('Category').agg({
11 'Sales': ['sum', 'mean', 'max'],
12 'Quantity': ['sum', 'count']
13})
14print(result)
15# Sales Quantity
16# sum mean max sum count
17# Category
18# A 420 140.0 200 42 3
19# B 580 193.3 250 58 3

2.2 Custom Aggregation

Python
1# Named aggregations (pandas 0.25+)
2result = df.groupby('Category').agg(
3 total_sales=('Sales', 'sum'),
4 avg_sales=('Sales', 'mean'),
5 top_sale=('Sales', 'max'),
6 num_transactions=('Sales', 'count'),
7 sales_range=('Sales', lambda x: x.max() - x.min())
8)

2.3 Transform vs Apply

Python
1# Transform: Returns same shape as input
2df['Sales_pct'] = df.groupby('Category')['Sales'].transform(
3 lambda x: x / x.sum() * 100
4)
5
6# Apply: Returns aggregated result
7def top_n_sales(group, n=2):
8 return group.nlargest(n, 'Sales')
9
10df.groupby('Category').apply(top_n_sales, n=2)

2.4 Filter Groups

Python
1# Chỉ giữ categories có tổng sales > 500
2df_filtered = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 500)

3. Merge Strategies

3.1 Các loại Join

Python
1# Sample DataFrames
2customers = pd.DataFrame({
3 'customer_id': [1, 2, 3, 4],
4 'name': ['Alice', 'Bob', 'Charlie', 'David']
5})
6
7orders = pd.DataFrame({
8 'order_id': [101, 102, 103, 104],
9 'customer_id': [1, 2, 2, 5],
10 'amount': [100, 200, 150, 300]
11})
12
13# Inner join (default)
14inner = pd.merge(customers, orders, on='customer_id', how='inner')
15# Chỉ giữ matching rows
16
17# Left join
18left = pd.merge(customers, orders, on='customer_id', how='left')
19# Giữ tất cả customers, NaN cho orders không match
20
21# Right join
22right = pd.merge(customers, orders, on='customer_id', how='right')
23# Giữ tất cả orders, NaN cho customers không match
24
25# Outer join
26outer = pd.merge(customers, orders, on='customer_id', how='outer')
27# Giữ tất cả rows từ cả hai bảng

3.2 Merge với Different Column Names

Python
1orders_v2 = pd.DataFrame({
2 'order_id': [101, 102],
3 'cust_id': [1, 2], # Different column name
4 'amount': [100, 200]
5})
6
7merged = pd.merge(
8 customers,
9 orders_v2,
10 left_on='customer_id',
11 right_on='cust_id'
12)

3.3 Merge Indicators

Python
1# Indicator column cho biết row đến từ đâu
2merged = pd.merge(
3 customers, orders,
4 on='customer_id',
5 how='outer',
6 indicator=True
7)
8# _merge: 'left_only', 'right_only', 'both'
9
10# Tìm customers chưa có orders
11no_orders = merged[merged['_merge'] == 'left_only']

3.4 Concatenation

Python
1# Vertical concat (stack rows)
2df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
4combined = pd.concat([df1, df2], ignore_index=True)
5
6# Horizontal concat (add columns)
7combined_h = pd.concat([df1, df2], axis=1)

4. Performance Optimization

4.1 Chọn đúng Data Types

Python
1# Check memory usage
2df.info(memory_usage='deep')
3
4# Downcast numeric types
5df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
6df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
7
8# Use category for low-cardinality strings
9df['category_col'] = df['category_col'].astype('category')

4.2 Efficient Operations

Python
1# ❌ Slow: iterrows
2for idx, row in df.iterrows():
3 df.loc[idx, 'new_col'] = row['col1'] + row['col2']
4
5# ✅ Fast: vectorized
6df['new_col'] = df['col1'] + df['col2']
7
8# ✅ Fast: apply với axis=1 (nếu cần complex logic)
9df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
10
11# ✅ Fastest: numpy
12import numpy as np
13df['new_col'] = np.where(df['col1'] > 0, df['col1'], df['col2'])

4.3 Chunked Reading

Python
1# Đọc file lớn theo chunks
2chunks = []
3for chunk in pd.read_csv('large_file.csv', chunksize=100000):
4 # Process each chunk
5 processed = chunk[chunk['value'] > 0] # Filter
6 chunks.append(processed)
7
8df_final = pd.concat(chunks, ignore_index=True)

4.4 Query Method

Python
1# query() có thể nhanh hơn boolean indexing
2# với large DataFrames
3
4# Boolean indexing
5df[(df['A'] > 5) & (df['B'] < 10)]
6
7# query method
8df.query('A > 5 and B < 10')
9
10# query với variables
11threshold = 5
12df.query('A > @threshold')

5. Practical Examples

5.1 Sales Analysis Pipeline

Python
1import pandas as pd
2
3# Load data
4sales = pd.read_csv('sales.csv')
5products = pd.read_csv('products.csv')
6customers = pd.read_csv('customers.csv')
7
8# Merge all tables
9df = (
10 sales
11 .merge(products, on='product_id')
12 .merge(customers, on='customer_id')
13)
14
15# Create analysis
16analysis = (
17 df
18 .assign(
19 revenue=lambda x: x['quantity'] * x['unit_price'],
20 profit=lambda x: x['revenue'] * x['margin']
21 )
22 .groupby(['region', 'category'])
23 .agg(
24 total_revenue=('revenue', 'sum'),
25 total_profit=('profit', 'sum'),
26 order_count=('order_id', 'nunique'),
27 avg_order_value=('revenue', 'mean')
28 )
29 .sort_values('total_revenue', ascending=False)
30)

5.2 Time Series Resampling

Python
1# Set datetime index
2df['date'] = pd.to_datetime(df['date'])
3df = df.set_index('date')
4
5# Resample to monthly
6monthly = df.resample('M').agg({
7 'sales': 'sum',
8 'quantity': 'sum',
9 'transactions': 'count'
10})
11
12# Rolling calculations
13df['rolling_avg_7d'] = df['sales'].rolling(window=7).mean()
14df['rolling_sum_30d'] = df['sales'].rolling(window=30).sum()

6. Thực hành

Exercises

Exercise 1: MultiIndex Operations

Python
1# Tạo DataFrame với MultiIndex: Region -> Year -> Quarter
2# Tính tổng sales theo Region, và % contribution của mỗi Quarter
3
4# YOUR CODE HERE
💡 Xem đáp án
Python
1data = pd.DataFrame({
2 'Region': ['North']*4 + ['South']*4,
3 'Year': [2023, 2023, 2024, 2024] * 2,
4 'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'] * 2,
5 'Sales': [100, 150, 120, 180, 80, 120, 100, 140]
6})
7
8df = data.set_index(['Region', 'Year', 'Quarter'])
9
10# Tổng theo Region
11region_totals = df.groupby('Region')['Sales'].sum()
12
13# % contribution per Quarter
14df['Pct'] = df.groupby(level='Region')['Sales'].transform(
15 lambda x: x / x.sum() * 100
16)
17print(df)

Exercise 2: Complex Aggregation

Python
1# Với DataFrame orders, tính cho mỗi customer:
2# - Tổng orders, Tổng amount
3# - Average order value
4# - Days since first order
5# - Days since last order
6
7# YOUR CODE HERE
💡 Xem đáp án
Python
1import pandas as pd
2from datetime import datetime
3
4orders = pd.DataFrame({
5 'customer_id': [1, 1, 2, 2, 1, 3],
6 'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-05',
7 '2024-02-01', '2024-02-10', '2024-01-20']),
8 'amount': [100, 200, 150, 180, 120, 90]
9})
10
11today = datetime.now()
12
13customer_summary = orders.groupby('customer_id').agg(
14 total_orders=('amount', 'count'),
15 total_amount=('amount', 'sum'),
16 avg_order_value=('amount', 'mean'),
17 first_order=('order_date', 'min'),
18 last_order=('order_date', 'max')
19)
20
21customer_summary['days_since_first'] = (today - customer_summary['first_order']).dt.days
22customer_summary['days_since_last'] = (today - customer_summary['last_order']).dt.days
23
24print(customer_summary)

7. Tổng kết

ConceptUse CaseKey Methods
MultiIndexHierarchical dataset_index(), xs(), stack(), unstack()
GroupByAggregationsagg(), transform(), apply(), filter()
MergeCombine tablesmerge(), concat(), join()
PerformanceLarge dataastype(), vectorization, query(), chunking

Best Practices:

  • Luôn check df.info() và optimize dtypes
  • Prefer vectorized operations over loops
  • Use query() cho complex filters
  • Chunk large files khi đọc

Bài tiếp theo: Data Quality & Validation