Theory
Lesson 4/17

Pandas Advanced

Multi-index, Method Chaining, Performance Optimization

Advanced Python Data Analysis

1. MultiIndex (Hierarchical Index)

MultiIndex Power

MultiIndex lets you work with multi-dimensional data by creating hierarchical rows/columns. It is especially useful for time series, panel data, and complex aggregations.

1.1 Creating MultiIndex

Python
import pandas as pd
import numpy as np

# From tuples
index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1'), ('2024', 'Q2'), ('2024', 'Q3'), ('2024', 'Q4'),
    ('2025', 'Q1'), ('2025', 'Q2')
], names=['Year', 'Quarter'])

df = pd.DataFrame({
    'Sales': [100, 120, 150, 180, 130, 160],
    'Profit': [20, 25, 30, 35, 28, 32]
}, index=index)
print(df)

# From product (all combinations)
years = ['2024', '2025']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
regions = ['North', 'South']

index = pd.MultiIndex.from_product(
    [years, quarters, regions],
    names=['Year', 'Quarter', 'Region']
)
df = pd.DataFrame({
    'Sales': np.random.randint(100, 200, len(index))
}, index=index)

# From DataFrame columns
df = pd.DataFrame({
    'year': [2024, 2024, 2024, 2024, 2025, 2025],
    'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1'],
    'region': ['North', 'South', 'North', 'South', 'North', 'South'],
    'sales': [100, 150, 120, 180, 130, 160]
})

df_multi = df.set_index(['year', 'quarter', 'region'])
print(df_multi)

1.2 Accessing MultiIndex Data

Python
# Sample MultiIndex DataFrame (sort_index() so label slicing works below)
df = pd.DataFrame({
    'year': [2024, 2024, 2024, 2024, 2025, 2025, 2025, 2025],
    'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],
    'sales': [100, 120, 150, 180, 130, 160, 140, 170]
}).set_index(['year', 'quarter', 'region']).sort_index()

# Select by outermost level
print(df.loc[2024])

# Select by multiple levels
print(df.loc[(2024, 'Q1')])

# Select specific cell
print(df.loc[(2024, 'Q1', 'North'), 'sales'])

# Cross-section with xs()
print(df.xs('Q1', level='quarter'))
print(df.xs('North', level='region'))
print(df.xs((2024, 'Q1'), level=['year', 'quarter']))

# Slice with slice() (requires a lexsorted index, hence sort_index() above)
print(df.loc[(slice(None), 'Q1'), :])  # All years, Q1 only

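For more readable slicing, pd.IndexSlice gives a cleaner spelling of the same selections as nested slice() calls. A small sketch on the same (sorted) DataFrame:

Python
idx = pd.IndexSlice

# All years, Q1 only (same result as the slice(None) version above)
print(df.loc[idx[:, 'Q1'], :])

# Ranges work too: years 2024-2025, all quarters, North region only
print(df.loc[idx[2024:2025, :, 'North'], :])
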
1.3 Swapping and Sorting Levels

Python
# Swap levels
df_swapped = df.swaplevel('year', 'quarter')
print(df_swapped)

# Reorder levels
df_reordered = df.reorder_levels(['region', 'year', 'quarter'])

# Sort index
df_sorted = df.sort_index()
df_sorted = df.sort_index(level='region')
df_sorted = df.sort_index(level=['year', 'quarter'], ascending=[True, False])

1.4 Aggregating with MultiIndex

Python
# Sum across a level (df.sum(level=...) was removed in pandas 2.0;
# groupby(level=...) is the modern equivalent)
print(df.groupby(level='year').sum())              # Sum per year
print(df.groupby(level=['year', 'region']).sum())  # Sum per year-region

# Unstack to pivot
unstacked = df.unstack(level='region')
print(unstacked)

# Stack back
stacked = unstacked.stack()

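The groupby + unstack combination above is equivalent to a pivot table; when the pivoted result is the goal rather than the intermediate steps, the one-step form can read better. A small sketch on the same data:

Python
# Equivalent one-step pivot: year rows, region columns, summed sales
pivoted = (
    df.reset_index()
      .pivot_table(values='sales', index='year', columns='region', aggfunc='sum')
)
print(pivoted)
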
2. Column MultiIndex

2.1 Creating Column MultiIndex

Python
# From tuples
columns = pd.MultiIndex.from_tuples([
    ('Sales', 'Q1'), ('Sales', 'Q2'), ('Sales', 'Q3'), ('Sales', 'Q4'),
    ('Profit', 'Q1'), ('Profit', 'Q2'), ('Profit', 'Q3'), ('Profit', 'Q4')
], names=['Metric', 'Quarter'])

df = pd.DataFrame(
    np.random.randint(100, 500, (3, 8)),
    index=['Product A', 'Product B', 'Product C'],
    columns=columns
)
print(df)

# Access columns
print(df['Sales'])                 # All Sales columns
print(df['Sales', 'Q1'])           # Sales Q1 only
print(df.loc[:, ('Sales', 'Q1')])  # Same with loc

2.2 Stacking Column MultiIndex

Python
# Stack columns to rows
stacked = df.stack(level='Quarter')
print(stacked)

# Stack all levels
fully_stacked = df.stack(level=['Metric', 'Quarter'])
print(fully_stacked)

# Unstack to recreate
unstacked = stacked.unstack()

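A common follow-up is flattening a column MultiIndex back to plain strings, for example before exporting to CSV. A minimal sketch:

Python
# Join the two levels into single strings: ('Sales', 'Q1') -> 'Sales_Q1'
flat = df.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat.columns.tolist())
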
3. Method Chaining

Clean Code with Chaining

Method chaining produces code that is more readable, maintainable, and easier to debug. Each step is one clearly visible transformation.

3.1 Chaining Pattern

Python
# Without chaining (hard to read)
df = pd.read_csv('data.csv')
df = df[df['amount'] > 0]
df['year'] = pd.to_datetime(df['date']).dt.year
df = df.groupby('year')['amount'].sum().reset_index()
df = df.sort_values('amount', ascending=False)

# With chaining (clean and readable)
result = (
    pd.read_csv('data.csv')
    .query('amount > 0')
    .assign(year=lambda x: pd.to_datetime(x['date']).dt.year)
    .groupby('year')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)

3.2 Key Methods for Chaining

Python
# assign() - Add columns (later kwargs may reference earlier ones)
df = df.assign(
    profit=lambda x: x['revenue'] - x['cost'],
    margin=lambda x: x['profit'] / x['revenue']
)

# query() - Filter rows (SQL-like)
df = df.query('amount > 100 and category == "A"')

# pipe() - Apply custom function
def custom_transform(df, multiplier):
    return df.assign(adjusted=df['value'] * multiplier)

result = df.pipe(custom_transform, multiplier=1.1)

# loc[] with chaining
result = (
    df
    .loc[lambda x: x['status'] == 'active']
    .sort_values('score', ascending=False)
    .head(10)
)

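pipe() is also useful for inspecting a chain mid-stream without breaking it apart. A minimal sketch with a hypothetical peek() helper:

Python
def peek(df, label=''):
    """Hypothetical debug helper: print the shape, then pass the frame through."""
    print(f'{label}: {df.shape}')
    return df

result = (
    df
    .loc[lambda x: x['status'] == 'active']
    .pipe(peek, label='after filter')
    .sort_values('score', ascending=False)
    .head(10)
)
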
3.3 Complete Chaining Example

Python
# Sales analysis pipeline
sales_summary = (
    pd.read_csv('sales.csv')

    # Clean
    .dropna(subset=['amount', 'date'])
    .query('amount > 0')

    # Transform
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        year=lambda x: x['date'].dt.year,
        month=lambda x: x['date'].dt.month,
        quarter=lambda x: x['date'].dt.quarter,
        amount_k=lambda x: x['amount'] / 1000
    )

    # Aggregate
    .groupby(['year', 'quarter', 'product'])
    .agg(
        total_sales=('amount', 'sum'),
        avg_sale=('amount', 'mean'),
        transaction_count=('amount', 'count')
    )
    .reset_index()

    # Calculate metrics
    .assign(
        sales_per_transaction=lambda x: x['total_sales'] / x['transaction_count']
    )

    # Sort and format
    .sort_values(['year', 'quarter', 'total_sales'], ascending=[True, True, False])
    .round(2)
)

4. Performance Optimization

4.1 Data Types Optimization

Python
# Check memory usage
print(df.info(memory_usage='deep'))
print(df.memory_usage(deep=True))

# Downcast numeric types
def optimize_dtypes(df):
    """Optimize DataFrame memory usage"""

    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    for col in df.select_dtypes(include=['object']).columns:
        num_unique = df[col].nunique()
        num_total = len(df[col])
        if num_unique / num_total < 0.5:  # Less than 50% unique
            df[col] = df[col].astype('category')

    return df

df_optimized = optimize_dtypes(df.copy())
print(f"Memory reduction: {1 - df_optimized.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum():.1%}")

4.2 Efficient Operations

Python
# Use vectorized operations instead of loops
# BAD
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# GOOD
df['new_col'] = df['col1'] * 2

# BAD - iterrows
for index, row in df.iterrows():
    result = row['a'] + row['b']

# GOOD - vectorized
df['result'] = df['a'] + df['b']

# Use apply only when necessary
# BAD
df['result'] = df.apply(lambda row: row['a'] * row['b'], axis=1)

# GOOD
df['result'] = df['a'] * df['b']

# When apply is unavoidable, raw=True is faster: the function receives
# a NumPy array per row (positional access only, no labels)
df['result'] = df.apply(lambda row: complex_function(row), axis=1, raw=True)

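Before optimizing, measure. A minimal timing sketch with time.perf_counter on toy data (absolute numbers vary by machine, but the gap is typically orders of magnitude):

Python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(1_000_000),
                   'b': np.random.rand(1_000_000)})

start = time.perf_counter()
slow = df.apply(lambda row: row['a'] * row['b'], axis=1)
print(f"apply:      {time.perf_counter() - start:.3f}s")

start = time.perf_counter()
fast = df['a'] * df['b']
print(f"vectorized: {time.perf_counter() - start:.3f}s")
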
4.3 Efficient Reading

Python
# Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])

# Read in chunks
chunk_size = 100000
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Process each chunk
    processed = chunk.query('amount > 0')
    chunks.append(processed)

df = pd.concat(chunks, ignore_index=True)

# Specify dtypes upfront
dtypes = {
    'id': 'int32',
    'amount': 'float32',
    'category': 'category'
}
df = pd.read_csv('data.csv', dtype=dtypes)

# Parse dates efficiently (date_parser was removed in pandas 2.0;
# pass an explicit date_format instead)
df = pd.read_csv('data.csv', parse_dates=['date'], date_format='%Y-%m-%d')

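If the same file is read repeatedly, converting it once to a columnar format usually beats re-parsing CSV. A sketch assuming the pyarrow package is installed:

Python
# One-time conversion: CSV -> Parquet (keeps dtypes, compresses well)
df = pd.read_csv('large_file.csv')
df.to_parquet('large_file.parquet')

# Later reads are faster and can prune columns at the file level
df = pd.read_parquet('large_file.parquet', columns=['col1', 'col2'])
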
4.4 NumPy Integration

Python
import numpy as np

# Use NumPy for heavy calculations
# Pandas Series to NumPy
arr = df['column'].to_numpy()

# NumPy operations are faster
result = np.where(arr > 100, arr * 2, arr)

# Apply back to DataFrame
df['new_column'] = result

# NumPy for conditional operations
df['category'] = np.select(
    condlist=[
        df['score'] < 50,
        df['score'] < 80,
        df['score'] >= 80
    ],
    choicelist=['Low', 'Medium', 'High'],
    default='Unknown'
)

4.5 Caching and eval()

Python
# eval() for complex expressions (use DataFrame.eval so column names
# resolve; pd.eval with target= does not see the frame's columns)
df = df.eval('''
    profit = revenue - cost
    margin = profit / revenue
    adjusted = margin * 1.1
''')

# query() uses eval internally
df.query('revenue > 1000 and margin > 0.2')

# For large DataFrames, enable numexpr
pd.set_option('compute.use_numexpr', True)

5. Working with Dates

5.1 Date Parsing

Python
# Parse dates
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # Invalid -> NaT

# From components
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# DatetimeIndex
df = df.set_index('date')

5.2 Date Components

Python
# Extract components (dt accessor)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.dayofweek  # 0=Monday
df['weekday_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['week'] = df['date'].dt.isocalendar().week
df['is_weekend'] = df['date'].dt.dayofweek >= 5

# Period
df['period'] = df['date'].dt.to_period('M')    # Monthly period
df['period_q'] = df['date'].dt.to_period('Q')  # Quarterly

5.3 Date Arithmetic

Python
# Add/subtract days
df['next_week'] = df['date'] + pd.Timedelta(days=7)
df['prev_month'] = df['date'] - pd.DateOffset(months=1)

# Difference between dates
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days

# Business days (via NumPy)
df['business_days'] = np.busday_count(
    df['start_date'].values.astype('datetime64[D]'),
    df['end_date'].values.astype('datetime64[D]')
)

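Pandas also ships calendar-aware offsets, which shift dates while respecting business rules. A small sketch:

Python
# Shift forward 5 business days (weekends skipped)
df['due_date'] = df['date'] + pd.offsets.BDay(5)

# Roll forward to the end of the current month
df['month_end'] = df['date'] + pd.offsets.MonthEnd(0)
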
5.4 Resampling

Python
# Set date index
ts = df.set_index('date')

# Downsample to monthly
monthly = ts.resample('M').sum()
monthly = ts.resample('M').agg({'sales': 'sum', 'quantity': 'mean'})

# Upsample with fill
daily = ts.resample('D').ffill()  # Forward fill
daily = ts.resample('D').interpolate()

# Common frequencies
# D=daily, W=weekly, M=month end, MS=month start
# Q=quarter end, Y=year end, H=hourly
# (pandas 2.2+ prefers the aliases 'ME', 'QE', 'YE', and lowercase 'h')

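With a DatetimeIndex, rolling() also accepts time-based window strings, which handle gaps and irregular sampling correctly. A small sketch on the same ts frame:

Python
# Window = all rows within the trailing 7 days (needs a sorted DatetimeIndex)
weekly_avg = ts['sales'].rolling('7D').mean()
print(weekly_avg.tail())
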
6. String Methods

6.1 String Accessor (str)

Python
# Sample data
df = pd.DataFrame({
    'name': [' John Smith ', 'jane doe', 'BOB JONES', 'Alice Brown'],
    'email': ['john@email.com', 'jane@test.com', 'bob@work.org', 'alice@email.com']
})

# Basic string operations
df['name_clean'] = df['name'].str.strip()
df['name_lower'] = df['name'].str.lower()
df['name_upper'] = df['name'].str.upper()
df['name_title'] = df['name'].str.strip().str.title()

# Length
df['name_length'] = df['name'].str.len()

# Contains
df['has_john'] = df['name'].str.contains('john', case=False)

# Extract
df['first_name'] = df['name'].str.split().str[0]
df['domain'] = df['email'].str.split('@').str[1]

# Replace
df['email_masked'] = df['email'].str.replace(r'@.*', '@***.com', regex=True)

# Regex extract
df['username'] = df['email'].str.extract(r'(\w+)@')

6.2 Pattern Matching

Python
# Check patterns
df['valid_email'] = df['email'].str.match(r'^[\w.-]+@[\w.-]+\.\w+$')

# Find all matches (assumes a 'text' column)
df['numbers'] = df['text'].str.findall(r'\d+')

# Count occurrences
df['word_count'] = df['text'].str.count(r'\w+')

# Split and expand (bare split() splits on whitespace and ignores the
# leading/trailing spaces in 'name'; split(' ') would add empty columns)
names_split = df['name'].str.split(expand=True)
names_split.columns = ['first', 'last']
df = pd.concat([df, names_split], axis=1)

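str.extract() also supports named groups, which become column names automatically. A small sketch on the email column from above:

Python
# Named groups -> columns 'user' and 'domain'
parts = df['email'].str.extract(r'(?P<user>\w+)@(?P<domain>[\w.]+)')
print(parts)
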
7. Practice

Advanced Practice

Exercise: Sales Data Pipeline

Python
import pandas as pd
import numpy as np

# Create complex dataset
np.random.seed(42)
n = 1000

data = pd.DataFrame({
    'order_id': range(1, n+1),
    'date': pd.date_range('2023-01-01', periods=n, freq='6H'),
    'customer': [f'Customer_{i}' for i in np.random.randint(1, 50, n)],
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], n),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'quantity': np.random.randint(1, 10, n),
    'unit_price': np.random.uniform(100, 1000, n).round(2)
})

# Tasks:
# 1. Build a complete analysis pipeline with method chaining
# 2. Compute monthly sales by product and region (MultiIndex)
# 3. Find the top customers by year-quarter
# 4. Optimize memory usage
# 5. Add a rolling 7-day average and YoY growth

# YOUR CODE HERE
💡 View solution
Python
# 1. Complete analysis pipeline
analysis_result = (
    data
    # Calculate total
    .assign(total_amount=lambda x: x['quantity'] * x['unit_price'])

    # Extract date components
    .assign(
        year=lambda x: x['date'].dt.year,
        month=lambda x: x['date'].dt.month,
        quarter=lambda x: x['date'].dt.quarter,
        week=lambda x: x['date'].dt.isocalendar().week.astype(int)
    )

    # Filter valid data
    .query('total_amount > 0')

    # Derived metric
    .assign(
        amount_k=lambda x: x['total_amount'] / 1000
    )
)

print("Pipeline result shape:", analysis_result.shape)

# 2. Monthly sales MultiIndex
monthly_sales = (
    analysis_result
    .groupby(['year', 'month', 'product', 'region'])
    .agg(
        total_sales=('total_amount', 'sum'),
        transactions=('order_id', 'count'),
        avg_order=('total_amount', 'mean')
    )
    .round(2)
)

print("\nMonthly Sales MultiIndex:")
print(monthly_sales.head(10))

# Access specific data
print("\n2023 data:")
print(monthly_sales.xs(2023, level='year').head())

# 3. Top customers by year-quarter
top_customers = (
    analysis_result
    .groupby(['year', 'quarter', 'customer'])['total_amount']
    .sum()
    .reset_index()
    .sort_values(['year', 'quarter', 'total_amount'], ascending=[True, True, False])
    .groupby(['year', 'quarter'])
    .head(3)  # Top 3 per quarter
)

print("\nTop Customers by Quarter:")
print(top_customers)

# 4. Memory optimization
def optimize_dataframe(df):
    """Optimize memory usage"""
    before = df.memory_usage(deep=True).sum() / 1024**2

    # Optimize integers
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    # Optimize floats
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    # Convert low-cardinality strings to category
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    after = df.memory_usage(deep=True).sum() / 1024**2
    print(f"Memory: {before:.2f}MB -> {after:.2f}MB ({(1-after/before)*100:.1f}% reduction)")

    return df

data_optimized = optimize_dataframe(analysis_result.copy())

# 5. Rolling average and YoY growth
daily_sales = (
    analysis_result
    .set_index('date')
    .resample('D')['total_amount']
    .sum()
    .to_frame('daily_sales')
)

daily_sales['rolling_7d'] = daily_sales['daily_sales'].rolling(7).mean()
daily_sales['rolling_30d'] = daily_sales['daily_sales'].rolling(30).mean()

# YoY growth (all NaN here: 1000 orders at 6-hour intervals span only ~250 days)
daily_sales['yoy_growth'] = daily_sales['daily_sales'].pct_change(365) * 100

print("\nDaily Sales with Rolling Averages:")
print(daily_sales.tail(10).round(2))

# Visualization
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Monthly trend by product
monthly_product = (
    analysis_result
    .groupby([analysis_result['date'].dt.to_period('M'), 'product'])['total_amount']
    .sum()
    .unstack()
)
monthly_product.plot(ax=axes[0, 0], marker='o')
axes[0, 0].set_title('Monthly Sales by Product')
axes[0, 0].set_xlabel('Month')

# Region distribution
region_sales = analysis_result.groupby('region')['total_amount'].sum()
axes[0, 1].pie(region_sales, labels=region_sales.index, autopct='%1.1f%%')
axes[0, 1].set_title('Sales by Region')

# Daily with rolling average
axes[1, 0].plot(daily_sales.index, daily_sales['daily_sales'], alpha=0.3, label='Daily')
axes[1, 0].plot(daily_sales.index, daily_sales['rolling_7d'], label='7-day MA')
axes[1, 0].plot(daily_sales.index, daily_sales['rolling_30d'], label='30-day MA')
axes[1, 0].set_title('Daily Sales with Moving Averages')
axes[1, 0].legend()
axes[1, 0].tick_params(axis='x', rotation=45)

# Heatmap of region x product
heatmap_data = analysis_result.pivot_table(
    values='total_amount',
    index='region',
    columns='product',
    aggfunc='sum'
)
im = axes[1, 1].imshow(heatmap_data.values, cmap='YlOrRd')
axes[1, 1].set_xticks(range(len(heatmap_data.columns)))
axes[1, 1].set_yticks(range(len(heatmap_data.index)))
axes[1, 1].set_xticklabels(heatmap_data.columns)
axes[1, 1].set_yticklabels(heatmap_data.index)
axes[1, 1].set_title('Sales Heatmap: Region x Product')
plt.colorbar(im, ax=axes[1, 1])

plt.tight_layout()
plt.show()

8. Summary

Topic          Key Points
MultiIndex     Hierarchical indexing, xs(), level operations
Chaining       assign(), query(), pipe() for clean code
Optimization   dtypes, vectorization, chunking
Dates          dt accessor, resample(), date arithmetic
Strings        str accessor, regex, pattern matching

Best Practices:

  1. ✅ Use method chaining for readability
  2. ✅ Optimize dtypes for large datasets
  3. ✅ Avoid loops - use vectorized operations
  4. ✅ Leverage MultiIndex for complex analyses
  5. ✅ Profile memory usage regularly

Next lesson: Window Functions Deep Dive - Advanced SQL