
Pandas Advanced

Multi-index, Method Chaining, Performance Optimization


🎯 Lesson Objectives

After this lesson, you will be able to:

✅ Use MultiIndex for hierarchical data

✅ Write clean, readable code with method chaining

✅ Apply performance optimization techniques

✅ Manipulate dates and times fluently

✅ Use string methods and regex patterns

Duration: 2 hours | Difficulty: Intermediate-Advanced | Tools: Python + Pandas + NumPy


📖 Key Terminology

Term | Vietnamese | Description
MultiIndex | Chỉ mục đa cấp | Hierarchical rows/columns
Method Chaining | Chuỗi phương thức | Pipe operations sequentially
Vectorization | Vector hóa | Operate on entire arrays, no loops
Downcast | Giảm kiểu | Reduce dtype size (int64 → int32)
Category dtype | Kiểu phân loại | Memory-efficient for low-cardinality strings
dt accessor | Truy cập ngày | Date/time component extraction
str accessor | Truy cập chuỗi | String operations on a Series
assign() | Gán | Add columns (chaining-friendly)
query() | Truy vấn | SQL-like filter syntax
pipe() | Ống dẫn | Apply a function in a chain

Checkpoint

MultiIndex = multiple index levels. Method Chaining = readable pipeline. Vectorization = no loops, NumPy fast. assign/query/pipe = chaining trinity!
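The "chaining trinity" from the glossary can be sketched on a tiny made-up DataFrame (the column names here are illustrative, not from the lesson's datasets):

```python
import pandas as pd

# Hypothetical data just to demo assign/query/pipe
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, -5, 30]})

result = (
    df
    .query("value > 0")                        # SQL-like filter
    .assign(double=lambda x: x["value"] * 2)   # add a column, chaining-friendly
    .pipe(lambda d: d.reset_index(drop=True))  # drop-in custom step
)
print(result["double"].tolist())  # [20, 60]
```

Each method returns a new DataFrame, which is what makes the chain possible.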


📐 1. MultiIndex


1.1 Creating MultiIndex

Python
import pandas as pd
import numpy as np

# From tuples
index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1'), ('2024', 'Q2'), ('2024', 'Q3'), ('2024', 'Q4'),
    ('2025', 'Q1'), ('2025', 'Q2')
], names=['Year', 'Quarter'])

df = pd.DataFrame({
    'Sales': [100, 120, 150, 180, 130, 160],
    'Profit': [20, 25, 30, 35, 28, 32]
}, index=index)

# From product (all combinations)
index = pd.MultiIndex.from_product(
    [['2024', '2025'], ['Q1', 'Q2', 'Q3', 'Q4'], ['North', 'South']],
    names=['Year', 'Quarter', 'Region']
)

# From DataFrame columns (assuming df_flat has these columns)
df_multi = df_flat.set_index(['Year', 'Quarter', 'Region'])

1.2 Accessing MultiIndex Data

Python
# Select by level (2-level df from 1.1)
print(df.loc['2024'])                   # Outermost level
print(df.loc[('2024', 'Q1')])           # Multiple levels
print(df.loc[('2024', 'Q1'), 'Sales'])  # Specific cell

# Cross-section with xs()
print(df.xs('Q1', level='Quarter'))
# With the 3-level (Year, Quarter, Region) index:
# print(df.xs('North', level='Region'))

# Swap and sort levels
df_swapped = df.swaplevel('Year', 'Quarter')
df_sorted = df.sort_index(level=['Year', 'Quarter'])

1.3 MultiIndex Operations

Python
# Aggregate across levels
print(df.groupby(level='Year').sum())
# With the 3-level index:
# print(df.groupby(level=['Year', 'Region']).sum())

# Unstack to pivot (level → columns) and back
unstacked = df.unstack(level='Quarter')
stacked = unstacked.stack()

# Column MultiIndex: after unstack, columns are (metric, quarter) pairs
print(unstacked['Sales'])        # All Sales columns
print(unstacked['Sales', 'Q1'])  # Specific column

Checkpoint

MultiIndex = hierarchical data. xs() = cross-section by level. unstack() = level → columns. stack() = columns → level. Great for time series + panel data!
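A minimal self-contained sketch of xs() and unstack(), using a smaller made-up index than the one in 1.1 so the expected values are easy to follow:

```python
import pandas as pd

# Small 2-level index (Year, Quarter) with made-up sales numbers
index = pd.MultiIndex.from_product(
    [["2024", "2025"], ["Q1", "Q2"]], names=["Year", "Quarter"]
)
df = pd.DataFrame({"Sales": [100, 120, 130, 160]}, index=index)

# xs(): cross-section; the Quarter level is dropped from the result
q1 = df.xs("Q1", level="Quarter")
print(q1["Sales"].tolist())  # [100, 130]

# unstack(): the Quarter level becomes columns
wide = df["Sales"].unstack(level="Quarter")
print(wide.loc["2024", "Q2"])  # 120
```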


⛓️ 2. Method Chaining


2.1 Chaining Pattern

Python
# Without chaining (hard to read)
df = pd.read_csv('data.csv')
df = df[df['amount'] > 0]
df['year'] = pd.to_datetime(df['date']).dt.year
df = df.groupby('year')['amount'].sum().reset_index()
df = df.sort_values('amount', ascending=False)

# With chaining (clean!)
result = (
    pd.read_csv('data.csv')
    .query('amount > 0')
    .assign(year=lambda x: pd.to_datetime(x['date']).dt.year)
    .groupby('year')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)

Method chaining = readable, maintainable code. Key methods: assign() (add columns), query() (filter), pipe() (custom functions). Each step = one transformation!

2.2 Key Methods

Python
# assign() - Add columns (later lambdas can see earlier assigned columns)
df = df.assign(
    profit=lambda x: x['revenue'] - x['cost'],
    margin=lambda x: x['profit'] / x['revenue']
)

# query() - Filter (SQL-like)
df = df.query('amount > 100 and category == "A"')

# pipe() - Custom function
def custom_transform(df, multiplier):
    return df.assign(adjusted=df['value'] * multiplier)

result = df.pipe(custom_transform, multiplier=1.1)

2.3 Complete Pipeline

Python
sales_summary = (
    pd.read_csv('sales.csv')
    .dropna(subset=['amount', 'date'])
    .query('amount > 0')
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        year=lambda x: x['date'].dt.year,
        quarter=lambda x: x['date'].dt.quarter
    )
    .groupby(['year', 'quarter', 'product'])
    .agg(
        total_sales=('amount', 'sum'),
        avg_sale=('amount', 'mean'),
        count=('amount', 'count')
    )
    .reset_index()
    .sort_values(['year', 'quarter', 'total_sales'], ascending=[True, True, False])
    .round(2)
)

Checkpoint

Chaining trinity: assign() (add cols), query() (filter), pipe() (custom fn). Wrap in () for multi-line chains. Each step = one clear transformation!
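The pipeline above reads from sales.csv; the same shape can be checked end-to-end on hypothetical in-memory data (these values are made up for the demo):

```python
import pandas as pd

# Stand-in for sales.csv: a refund row (-20.0) that query() should drop
raw = pd.DataFrame({
    "year": [2024, 2024, 2025, 2025],
    "amount": [100.0, -20.0, 50.0, 70.0],
})

summary = (
    raw
    .query("amount > 0")                                 # drop refunds
    .groupby("year")
    .agg(total=("amount", "sum"), n=("amount", "count")) # named aggregation
    .reset_index()
    .sort_values("total", ascending=False)
)
print(summary["total"].tolist())  # [120.0, 100.0]
```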


⚡ 3. Performance Optimization


3.1 Data Types Optimization

Python
def optimize_dtypes(df):
    """Reduce memory usage by downcasting and categorizing."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:  # low cardinality
            df[col] = df[col].astype('category')

    return df

df_optimized = optimize_dtypes(df.copy())

3.2 Vectorized Operations

Python
# BAD: loops
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# GOOD: vectorized
df['new_col'] = df['col1'] * 2

# BAD: apply
df['result'] = df.apply(lambda row: row['a'] * row['b'], axis=1)

# GOOD: vectorized
df['result'] = df['a'] * df['b']

# NumPy for conditionals (default catches NaN scores)
df['category'] = np.select(
    [df['score'] < 50, df['score'] < 80, df['score'] >= 80],
    ['Low', 'Medium', 'High'],
    default='Unknown'
)

3.3 Efficient Reading

Python
# Only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])

# Chunks for large files
chunks = []
for chunk in pd.read_csv('large.csv', chunksize=100_000):
    chunks.append(chunk.query('amount > 0'))
df = pd.concat(chunks, ignore_index=True)

# Specify dtypes upfront
df = pd.read_csv('data.csv', dtype={'id': 'int32', 'amount': 'float32',
                                    'category': 'category'})

Performance rules: 1) Never loop DataFrames. 2) Use vectorized ops. 3) Downcast dtypes. 4) Read only needed columns. 5) Process in chunks if file is large!

Checkpoint

Vectorization = no loops! Downcast = smaller dtypes. Category for low-cardinality strings. Chunks for large files. Can cut memory by 50-90%!
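Both memory claims are easy to verify on made-up data (the city names below are purely illustrative):

```python
import pandas as pd

# 100k rows of a low-cardinality string column
df = pd.DataFrame({"city": ["Hanoi", "Hue", "Danang", "Hue"] * 25_000})

before = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)
print(after < before)  # True: each unique string is stored only once

# Downcast picks the smallest integer dtype that fits the values
small = pd.to_numeric(pd.Series([1, 2, 3], dtype="int64"), downcast="integer")
print(small.dtype)  # int8
```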


📅 4. Date/Time Operations

TB5 min

4.1 Parsing & Components

Python
# Parse dates
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Invalid → NaT

# Extract components (dt accessor)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.dayofweek  # 0 = Monday
df['weekday_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['period'] = df['date'].dt.to_period('M')

4.2 Date Arithmetic

Python
# Add/subtract
df['next_week'] = df['date'] + pd.Timedelta(days=7)
df['prev_month'] = df['date'] - pd.DateOffset(months=1)
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days

4.3 Resampling

Python
ts = df.set_index('date')

# Downsample
monthly = ts.resample('M').sum()  # 'M' = month end ('ME' in pandas >= 2.2)
monthly = ts.resample('M').agg({'sales': 'sum', 'quantity': 'mean'})

# Upsample
daily = ts.resample('D').ffill()
daily = ts.resample('D').interpolate()

# Frequencies: D=daily, W=weekly, M=month end, MS=month start, Q=quarter, Y=year

Checkpoint

pd.to_datetime() parse. dt accessor extract (year, month, quarter). Timedelta/DateOffset arithmetic. resample() = time-based groupby!
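A runnable sketch tying parsing, component access, and resample() together on four made-up daily rows:

```python
import pandas as pd

# Hypothetical daily sales spanning two months
ts = pd.DataFrame(
    {"sales": [10, 20, 30, 40]},
    index=pd.to_datetime(["2024-01-05", "2024-01-20",
                          "2024-02-03", "2024-02-10"]),
)

# Downsample into month-start buckets ('MS')
monthly = ts.resample("MS").sum()
print(monthly["sales"].tolist())  # [30, 70]

# Component access works directly on a DatetimeIndex too
print(ts.index.quarter.tolist())  # [1, 1, 1, 1]
```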


🔤 5. String Methods


5.1 String Accessor (str)

Python
df = pd.DataFrame({
    'name': [' John Smith ', 'jane doe', 'BOB JONES'],
    'email': ['john@email.com', 'jane@test.com', 'bob@work.org']
})

# Clean
df['name_clean'] = df['name'].str.strip().str.title()

# Search
df['has_john'] = df['name'].str.contains('john', case=False)
df['starts_j'] = df['name'].str.startswith('J')

# Extract
df['first_name'] = df['name'].str.split().str[0]
df['domain'] = df['email'].str.split('@').str[1]

# Replace
df['masked'] = df['email'].str.replace(r'@.*', '@***.com', regex=True)

5.2 Regex

Python
# Extract with regex (expand=False returns a Series, not a DataFrame)
df['username'] = df['email'].str.extract(r'(\w+)@', expand=False)

# Validate pattern
df['valid_email'] = df['email'].str.match(r'^[\w.-]+@[\w.-]+\.\w+$')

# Count matches (assuming df has a 'text' column)
df['word_count'] = df['text'].str.count(r'\w+')

# Find all
df['numbers'] = df['text'].str.findall(r'\d+')

Checkpoint

str accessor = string operations on a Series. str.contains() to search, str.extract() for regex capture, str.split().str[0] for the first word. Regex for complex patterns!
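A quick self-contained check of the two extraction idioms, on the same style of made-up emails used above:

```python
import pandas as pd

emails = pd.Series(["john@email.com", "jane@test.com"])

# extract(): with the default expand=True, the capture group
# comes back as a DataFrame; [0] selects the first group's column
users = emails.str.extract(r"(\w+)@")[0]
print(users.tolist())  # ['john', 'jane']

# split + str[i]: positional access into each row's split list
domains = emails.str.split("@").str[1]
print(domains.tolist())  # ['email.com', 'test.com']
```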


📋 Summary


What You've Learned

Topic | Key Points
MultiIndex | Hierarchical indexing, xs(), level operations
Chaining | assign(), query(), pipe() for clean code
Optimization | dtypes, vectorization, chunking
Dates | dt accessor, resample(), date arithmetic
Strings | str accessor, regex, pattern matching

Best Practices

  1. ✅ Use method chaining for readability
  2. ✅ Optimize dtypes for large datasets
  3. ✅ Avoid loops — use vectorized operations
  4. ✅ Leverage MultiIndex for complex analyses
  5. ✅ Profile memory usage regularly

Self-check Questions

  1. What advantages does a MultiIndex have over a single index?
  2. How does method chaining improve your code?
  3. How can you optimize memory for a large DataFrame?
  4. How does the dt accessor differ from apply() for datetime processing?

Next lesson: Advanced SQL — Window Functions, CTEs, Query Optimization →

🎉 Excellent! You've mastered advanced Pandas!

Remember: method chaining + vectorization + the right dtypes = code that is both clean and fast. From now on, write Pandas code like a pro!