Theory
Lesson 4/17

Pandas Advanced

Multi-index, Method Chaining, Performance Optimization

Advanced Python Data Analysis

1. MultiIndex (Hierarchical Index)

MultiIndex Power

MultiIndex lets you work with multi-dimensional data by creating hierarchical rows/columns. It is especially useful for time series, panel data, and complex aggregations.

1.1 Creating MultiIndex

Python
import pandas as pd
import numpy as np

# From tuples
index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1'), ('2024', 'Q2'), ('2024', 'Q3'), ('2024', 'Q4'),
    ('2025', 'Q1'), ('2025', 'Q2')
], names=['Year', 'Quarter'])

df = pd.DataFrame({
    'Sales': [100, 120, 150, 180, 130, 160],
    'Profit': [20, 25, 30, 35, 28, 32]
}, index=index)
print(df)

# From product (all combinations)
years = ['2024', '2025']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
regions = ['North', 'South']

index = pd.MultiIndex.from_product(
    [years, quarters, regions],
    names=['Year', 'Quarter', 'Region']
)
df = pd.DataFrame({
    'Sales': np.random.randint(100, 200, len(index))
}, index=index)

# From DataFrame columns
df = pd.DataFrame({
    'year': [2024, 2024, 2024, 2024, 2025, 2025],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1'],
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'sales': [100, 150, 120, 180, 130, 160]
})

df_multi = df.set_index(['year', 'quarter', 'region'])
print(df_multi)

1.2 Accessing MultiIndex Data

Python
# Sample MultiIndex DataFrame (sort_index() so label slicing works below)
df = pd.DataFrame({
    'year': [2024, 2024, 2024, 2024, 2025, 2025, 2025, 2025],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
    'sales': [100, 120, 150, 180, 130, 160, 140, 170]
}).set_index(['year', 'quarter', 'region']).sort_index()

# Select by outermost level
print(df.loc[2024])

# Select by multiple levels
print(df.loc[(2024, 'Q1')])

# Select specific cell
print(df.loc[(2024, 'Q1', 'North'), 'sales'])

# Cross-section with xs()
print(df.xs('Q1', level='quarter'))
print(df.xs('North', level='region'))
print(df.xs((2024, 'Q1'), level=['year', 'quarter']))

# Slice with slice() (requires a lexsorted index, hence sort_index() above)
print(df.loc[(slice(None), 'Q1'), :])  # All years, Q1 only

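For more readable slicing, pd.IndexSlice gives a cleaner spelling of the same selections as nested slice() calls. A small sketch on the same (sorted) DataFrame:

Python
idx = pd.IndexSlice

# All years, Q1 only (same result as the slice(None) version above)
print(df.loc[idx[:, 'Q1'], :])

# Ranges work too: years 2024-2025, all quarters, North region only
print(df.loc[idx[2024:2025, :, 'North'], :])
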
1.3 Swapping and Sorting Levels

Python
# Swap levels
df_swapped = df.swaplevel('year', 'quarter')
print(df_swapped)

# Reorder levels
df_reordered = df.reorder_levels(['region', 'year', 'quarter'])

# Sort index
df_sorted = df.sort_index()
df_sorted = df.sort_index(level='region')
df_sorted = df.sort_index(level=['year', 'quarter'], ascending=[True, False])

1.4 Aggregating with MultiIndex

Python
# Sum across a level (df.sum(level=...) was removed in pandas 2.0;
# groupby(level=...) is the modern equivalent)
print(df.groupby(level='year').sum())              # Sum per year
print(df.groupby(level=['year', 'region']).sum())  # Sum per year-region

# Unstack to pivot
unstacked = df.unstack(level='region')
print(unstacked)

# Stack back
stacked = unstacked.stack()

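The groupby + unstack combination above is equivalent to a pivot table; when the pivoted result is the goal rather than the intermediate steps, the one-step form can read better. A small sketch on the same data:

Python
# Equivalent one-step pivot: year rows, region columns, summed sales
pivoted = (
    df.reset_index()
      .pivot_table(values='sales', index='year', columns='region', aggfunc='sum')
)
print(pivoted)
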
2. Column MultiIndex

2.1 Creating Column MultiIndex

Python
# From tuples
columns = pd.MultiIndex.from_tuples([
    ('Sales', 'Q1'), ('Sales', 'Q2'), ('Sales', 'Q3'), ('Sales', 'Q4'),
    ('Profit', 'Q1'), ('Profit', 'Q2'), ('Profit', 'Q3'), ('Profit', 'Q4')
], names=['Metric', 'Quarter'])

df = pd.DataFrame(
    np.random.randint(100, 500, (3, 8)),
    index=['Product A', 'Product B', 'Product C'],
    columns=columns
)
print(df)

# Access columns
print(df['Sales'])                 # All Sales columns
print(df['Sales', 'Q1'])           # Sales Q1 only
print(df.loc[:, ('Sales', 'Q1')])  # Same with loc

2.2 Stacking Column MultiIndex

Python
# Stack columns to rows
stacked = df.stack(level='Quarter')
print(stacked)

# Stack all levels
fully_stacked = df.stack(level=['Metric', 'Quarter'])
print(fully_stacked)

# Unstack to recreate
unstacked = stacked.unstack()

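A common follow-up is flattening a column MultiIndex back to plain strings, for example before exporting to CSV. A minimal sketch:

Python
# Join the two levels into single strings: ('Sales', 'Q1') -> 'Sales_Q1'
flat = df.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat.columns.tolist())
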
3. Method Chaining

Clean Code with Chaining

Method chaining produces code that is more readable, maintainable, and easier to debug. Each step is one clearly visible transformation.

3.1 Chaining Pattern

Python
# Without chaining (hard to read)
df = pd.read_csv('data.csv')
df = df[df['amount'] > 0]
df['year'] = pd.to_datetime(df['date']).dt.year
df = df.groupby('year')['amount'].sum().reset_index()
df = df.sort_values('amount', ascending=False)

# With chaining (clean and readable)
result = (
    pd.read_csv('data.csv')
    .query('amount > 0')
    .assign(year=lambda x: pd.to_datetime(x['date']).dt.year)
    .groupby('year')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)

3.2 Key Methods for Chaining

Python
# assign() - Add columns (later kwargs may reference earlier ones)
df = df.assign(
    profit=lambda x: x['revenue'] - x['cost'],
    margin=lambda x: x['profit'] / x['revenue']
)

# query() - Filter rows (SQL-like)
df = df.query('amount > 100 and category == "A"')

# pipe() - Apply custom function
def custom_transform(df, multiplier):
    return df.assign(adjusted=df['value'] * multiplier)

result = df.pipe(custom_transform, multiplier=1.1)

# loc[] with chaining
result = (
    df
    .loc[lambda x: x['status'] == 'active']
    .sort_values('score', ascending=False)
    .head(10)
)

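pipe() is also useful for inspecting a chain mid-stream without breaking it apart. A minimal sketch with a hypothetical peek() helper:

Python
def peek(df, label=''):
    """Hypothetical debug helper: print the shape, then pass the frame through."""
    print(f'{label}: {df.shape}')
    return df

result = (
    df
    .loc[lambda x: x['status'] == 'active']
    .pipe(peek, label='after filter')
    .sort_values('score', ascending=False)
    .head(10)
)
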
3.3 Complete Chaining Example

Python
# Sales analysis pipeline
sales_summary = (
    pd.read_csv('sales.csv')

    # Clean
    .dropna(subset=['amount', 'date'])
    .query('amount > 0')

    # Transform
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        year=lambda x: x['date'].dt.year,
        month=lambda x: x['date'].dt.month,
        quarter=lambda x: x['date'].dt.quarter,
        amount_k=lambda x: x['amount'] / 1000
    )

    # Aggregate
    .groupby(['year', 'quarter', 'product'])
    .agg(
        total_sales=('amount', 'sum'),
        avg_sale=('amount', 'mean'),
        transaction_count=('amount', 'count')
    )
    .reset_index()

    # Calculate metrics
    .assign(
        sales_per_transaction=lambda x: x['total_sales'] / x['transaction_count']
    )

    # Sort and format
    .sort_values(['year', 'quarter', 'total_sales'], ascending=[True, True, False])
    .round(2)
)

4. Performance Optimization

4.1 Data Types Optimization

Python
# Check memory usage
print(df.info(memory_usage='deep'))
print(df.memory_usage(deep=True))

# Downcast numeric types
def optimize_dtypes(df):
    """Optimize DataFrame memory usage"""

    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    for col in df.select_dtypes(include=['object']).columns:
        num_unique = df[col].nunique()
        num_total = len(df[col])
        if num_unique / num_total < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')

    return df

df_optimized = optimize_dtypes(df.copy())
print(f"Memory reduction: {1 - df_optimized.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum():.1%}")

4.2 Efficient Operations

Python
# Use vectorized operations instead of loops
# BAD
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# GOOD
df['new_col'] = df['col1'] * 2

# BAD - iterrows
for index, row in df.iterrows():
    result = row['a'] + row['b']

# GOOD - vectorized
df['result'] = df['a'] + df['b']

# Use apply only when necessary
# BAD
df['result'] = df.apply(lambda row: row['a'] * row['b'], axis=1)

# GOOD
df['result'] = df['a'] * df['b']

# When apply is unavoidable, raw=True is faster: the function receives
# a NumPy array per row (positional access only, no labels)
df['result'] = df.apply(lambda row: complex_function(row), axis=1, raw=True)

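Before optimizing, measure. A minimal timing sketch with time.perf_counter on toy data (absolute numbers vary by machine, but the gap is typically orders of magnitude):

Python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1_000_000),
                   'b': np.random.rand(1_000_000)})

start = time.perf_counter()
slow = df.apply(lambda row: row['a'] * row['b'], axis=1)
print(f"apply:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
fast = df['a'] * df['b']
print(f"vectorized: {time.perf_counter() - start:.3f}s")
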
4.3 Efficient Reading

Python
# Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])

# Read in chunks
chunk_size = 100000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk.query('amount > 0')
    chunks.append(processed)

df = pd.concat(chunks, ignore_index=True)

# Specify dtypes upfront
dtypes = {
    'id': 'int32',
    'amount': 'float32',
    'category': 'category'
}
df = pd.read_csv('data.csv', dtype=dtypes)

# Parse dates efficiently (date_parser was removed in pandas 2.0;
# pass an explicit date_format instead)
df = pd.read_csv('data.csv', parse_dates=['date'], date_format='%Y-%m-%d')

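If the same file is read repeatedly, converting it once to a columnar format usually beats re-parsing CSV. A sketch assuming the pyarrow package is installed:

Python
# One-time conversion: CSV -> Parquet (keeps dtypes, compresses well)
df = pd.read_csv('large_file.csv')
df.to_parquet('large_file.parquet')

# Later reads are faster and can prune columns at the file level
df = pd.read_parquet('large_file.parquet', columns=['col1', 'col2'])
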
4.4 NumPy Integration

Python
import numpy as np

# Use NumPy for heavy calculations
# Pandas Series to NumPy
arr = df['column'].to_numpy()

# NumPy operations are faster
result = np.where(arr > 100, arr * 2, arr)

# Apply back to DataFrame
df['new_column'] = result

# NumPy for conditional operations
df['category'] = np.select(
    condlist=[
        df['score'] < 50,
        df['score'] < 80,
        df['score'] >= 80
    ],
    choicelist=['Low', 'Medium', 'High'],
    default='Unknown'
)

4.5 Caching and eval()

Python
# eval() for complex expressions (use DataFrame.eval so column names
# resolve; pd.eval with target= does not see the frame's columns)
df = df.eval('''
    profit = revenue - cost
    margin = profit / revenue
    adjusted = margin * 1.1
''')

# query() uses eval internally
df.query('revenue > 1000 and margin > 0.2')

# For large DataFrames, enable numexpr
pd.set_option('compute.use_numexpr', True)

5. Working with Dates

5.1 Date Parsing

Python
# Parse dates
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Invalid -> NaT

# From components
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# DatetimeIndex
df = df.set_index('date')

5.2 Date Components

Python
# Extract components (dt accessor)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.dayofweek  # 0=Monday
df['weekday_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Period
df['period'] = df['date'].dt.to_period('M')    # Monthly period
df['period_q'] = df['date'].dt.to_period('Q')  # Quarterly

5.3 Date Arithmetic

Python
# Add/subtract days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
df['prev_month'] = df['date'] - pd.DateOffset(months=1)

# Difference between dates
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days

# Business days (via NumPy)
df['business_days'] = np.busday_count(
    df['start_date'].values.astype('datetime64[D]'),
    df['end_date'].values.astype('datetime64[D]')
)

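Pandas also ships calendar-aware offsets, which shift dates while respecting business rules. A small sketch:

Python
# Shift forward 5 business days (weekends skipped)
df['due_date'] = df['date'] + pd.offsets.BDay(5)

# Roll forward to the end of the current month
df['month_end'] = df['date'] + pd.offsets.MonthEnd(0)
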
5.4 Resampling

Python
# Set date index
ts = df.set_index('date')

# Downsample to monthly
monthly = ts.resample('M').sum()
monthly = ts.resample('M').agg({'sales': 'sum', 'quantity': 'mean'})

# Upsample with fill
daily = ts.resample('D').ffill()  # Forward fill
daily = ts.resample('D').interpolate()

# Common frequencies
# D=daily, W=weekly, M=month end, MS=month start
# Q=quarter end, Y=year end, H=hourly
# (pandas 2.2+ prefers the aliases 'ME', 'QE', 'YE', and lowercase 'h')

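With a DatetimeIndex, rolling() also accepts time-based window strings, which handle gaps and irregular sampling correctly. A small sketch on the same ts frame:

Python
# Window = all rows within the trailing 7 days (needs a sorted DatetimeIndex)
weekly_avg = ts['sales'].rolling('7D').mean()
print(weekly_avg.tail())
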
6. String Methods

6.1 String Accessor (str)

Python
# Sample data
df = pd.DataFrame({
    'name': [' John Smith ', 'jane doe', 'BOB JONES', 'Alice Brown'],
    'email': ['john@email.com', 'jane@test.com', 'bob@work.org', 'alice@email.com']
})

# Basic string operations
df['name_clean'] = df['name'].str.strip()
df['name_lower'] = df['name'].str.lower()
df['name_upper'] = df['name'].str.upper()
df['name_title'] = df['name'].str.strip().str.title()

# Length
df['name_length'] = df['name'].str.len()

# Contains
df['has_john'] = df['name'].str.contains('john', case=False)

# Extract
df['first_name'] = df['name'].str.split().str[0]
df['domain'] = df['email'].str.split('@').str[1]

# Replace
df['email_masked'] = df['email'].str.replace(r'@.*', '@***.com', regex=True)

# Regex extract
df['username'] = df['email'].str.extract(r'(\w+)@')

6.2 Pattern Matching

Python
# Check patterns
df['valid_email'] = df['email'].str.match(r'^[\w.-]+@[\w.-]+\.\w+$')

# Find all matches (assumes a 'text' column)
df['numbers'] = df['text'].str.findall(r'\d+')

# Count occurrences
df['word_count'] = df['text'].str.count(r'\w+')

# Split and expand (bare split() splits on whitespace and ignores the
# leading/trailing spaces in 'name'; split(' ') would add empty columns)
names_split = df['name'].str.split(expand=True)
names_split.columns = ['first', 'last']
df = pd.concat([df, names_split], axis=1)

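str.extract() also supports named groups, which become column names automatically. A small sketch on the email column from above:

Python
# Named groups -> columns 'user' and 'domain'
parts = df['email'].str.extract(r'(?P<user>\w+)@(?P<domain>[\w.]+)')
print(parts)
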
7. Practice

Advanced Practice

Exercise: Sales Data Pipeline

Python
import pandas as pd
import numpy as np

# Create complex dataset
np.random.seed(42)
n = 1000

data = pd.DataFrame({
    'order_id': range(1, n+1),
    'date': pd.date_range('2023-01-01', periods=n, freq='6H'),
    'customer': [f'Customer_{i}' for i in np.random.randint(1, 50, n)],
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'quantity': np.random.randint(1, 10, n),
    'unit_price': np.random.uniform(100, 1000, n).round(2)
})

# Tasks:
# 1. Build a complete analysis pipeline with method chaining
# 2. Compute monthly sales by product and region (MultiIndex)
# 3. Find the top customers by year-quarter
# 4. Optimize memory usage
# 5. Add a rolling 7-day average and YoY growth

# YOUR CODE HERE
💡 View solution
Python
# 1. Complete analysis pipeline
analysis_result = (
    data
    # Calculate total
    .assign(total_amount=lambda x: x['quantity'] * x['unit_price'])

    # Extract date components
    .assign(
        year=lambda x: x['date'].dt.year,
        month=lambda x: x['date'].dt.month,
        quarter=lambda x: x['date'].dt.quarter,
        week=lambda x: x['date'].dt.isocalendar().week.astype(int)
    )

    # Filter valid data
    .query('total_amount > 0')

    # Derived metric
    .assign(
        amount_k=lambda x: x['total_amount'] / 1000
    )
)

print("Pipeline result shape:", analysis_result.shape)

# 2. Monthly sales MultiIndex
monthly_sales = (
    analysis_result
    .groupby(['year', 'month', 'product', 'region'])
    .agg(
        total_sales=('total_amount', 'sum'),
        transactions=('order_id', 'count'),
        avg_order=('total_amount', 'mean')
    )
    .round(2)
)

print("\nMonthly Sales MultiIndex:")
print(monthly_sales.head(10))

# Access specific data
print("\n2023 data:")
print(monthly_sales.xs(2023, level='year').head())

# 3. Top customers by year-quarter
top_customers = (
    analysis_result
    .groupby(['year', 'quarter', 'customer'])['total_amount']
    .sum()
    .reset_index()
    .sort_values(['year', 'quarter', 'total_amount'], ascending=[True, True, False])
    .groupby(['year', 'quarter'])
    .head(3)  # Top 3 per quarter
)

print("\nTop Customers by Quarter:")
print(top_customers)

# 4. Memory optimization
def optimize_dataframe(df):
    """Optimize memory usage"""
    before = df.memory_usage(deep=True).sum() / 1024**2

    # Optimize integers
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    # Optimize floats
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    # Convert low-cardinality strings to category
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    after = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {before:.2f}MB -> {after:.2f}MB ({(1-after/before)*100:.1f}% reduction)")

    return df

data_optimized = optimize_dataframe(analysis_result.copy())

# 5. Rolling average and YoY growth
daily_sales = (
    analysis_result
    .set_index('date')
    .resample('D')['total_amount']
    .sum()
    .to_frame('daily_sales')
)

daily_sales['rolling_7d'] = daily_sales['daily_sales'].rolling(7).mean()
daily_sales['rolling_30d'] = daily_sales['daily_sales'].rolling(30).mean()

# YoY growth (all NaN here: 1000 orders at 6-hour intervals span only ~250 days)
daily_sales['yoy_growth'] = daily_sales['daily_sales'].pct_change(365) * 100

print("\nDaily Sales with Rolling Averages:")
print(daily_sales.tail(10).round(2))

# Visualization
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Monthly trend by product
monthly_product = (
    analysis_result
    .groupby([analysis_result['date'].dt.to_period('M'), 'product'])['total_amount']
    .sum()
    .unstack()
)
monthly_product.plot(ax=axes[0, 0], marker='o')
axes[0, 0].set_title('Monthly Sales by Product')
axes[0, 0].set_xlabel('Month')

# Region distribution
region_sales = analysis_result.groupby('region')['total_amount'].sum()
axes[0, 1].pie(region_sales, labels=region_sales.index, autopct='%1.1f%%')
axes[0, 1].set_title('Sales by Region')

# Daily with rolling average
axes[1, 0].plot(daily_sales.index, daily_sales['daily_sales'], alpha=0.3, label='Daily')
axes[1, 0].plot(daily_sales.index, daily_sales['rolling_7d'], label='7-day MA')
axes[1, 0].plot(daily_sales.index, daily_sales['rolling_30d'], label='30-day MA')
axes[1, 0].set_title('Daily Sales with Moving Averages')
axes[1, 0].legend()
axes[1, 0].tick_params(axis='x', rotation=45)

# Heatmap of region x product
heatmap_data = analysis_result.pivot_table(
    values='total_amount',
    index='region',
    columns='product',
    aggfunc='sum'
)
im = axes[1, 1].imshow(heatmap_data.values, cmap='YlOrRd')
axes[1, 1].set_xticks(range(len(heatmap_data.columns)))
axes[1, 1].set_yticks(range(len(heatmap_data.index)))
axes[1, 1].set_xticklabels(heatmap_data.columns)
axes[1, 1].set_yticklabels(heatmap_data.index)
axes[1, 1].set_title('Sales Heatmap: Region x Product')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

8. Summary

Topic          Key Points
MultiIndex     Hierarchical indexing, xs(), level operations
Chaining       assign(), query(), pipe() for clean code
Optimization   dtypes, vectorization, chunking
Dates          dt accessor, resample(), date arithmetic
Strings        str accessor, regex, pattern matching

Best Practices:

  1. ✅ Use method chaining for readability
  2. ✅ Optimize dtypes for large datasets
  3. ✅ Avoid loops - use vectorized operations
  4. ✅ Leverage MultiIndex for complex analyses
  5. ✅ Profile memory usage regularly

Next lesson: Window Functions Deep Dive - Advanced SQL