🎯 Lesson Objectives
After this lesson, you will be able to:
✅ Use MultiIndex for hierarchical data
✅ Apply method chaining for clean, readable code
✅ Apply performance optimization techniques
✅ Manipulate dates and times confidently
✅ Use string methods and regex patterns
Duration: 2 hours | Difficulty: Intermediate-Advanced | Tools: Python + Pandas + NumPy
📖 Key Terminology
| Term | Vietnamese | Description |
|---|---|---|
| MultiIndex | Chỉ mục đa cấp | Hierarchical rows/columns |
| Method Chaining | Chuỗi phương thức | Pipe operations sequentially |
| Vectorization | Vector hóa | Operate on entire arrays, no loops |
| Downcast | Giảm kiểu | Reduce dtype size (int64 → int32) |
| Category dtype | Kiểu phân loại | Memory-efficient for low-cardinality |
| dt accessor | Truy cập ngày | Date/time component extraction |
| str accessor | Truy cập chuỗi | String operations on Series |
| assign() | Gán | Add columns (chaining-friendly) |
| query() | Truy vấn | SQL-like filter syntax |
| pipe() | Ống dẫn | Apply function in chain |
Checkpoint
MultiIndex = multiple index levels. Method Chaining = readable pipeline. Vectorization = no loops, NumPy fast. assign/query/pipe = chaining trinity!
📐 1. MultiIndex
1.1 Creating MultiIndex
```python
import pandas as pd
import numpy as np

# From tuples
index = pd.MultiIndex.from_tuples([
    ('2024', 'Q1'), ('2024', 'Q2'), ('2024', 'Q3'), ('2024', 'Q4'),
    ('2025', 'Q1'), ('2025', 'Q2')
], names=['Year', 'Quarter'])

df = pd.DataFrame({
    'Sales': [100, 120, 150, 180, 130, 160],
    'Profit': [20, 25, 30, 35, 28, 32]
}, index=index)

# From product (all combinations)
index = pd.MultiIndex.from_product(
    [['2024', '2025'], ['Q1', 'Q2', 'Q3', 'Q4'], ['North', 'South']],
    names=['Year', 'Quarter', 'Region']
)

# From DataFrame columns (assumes a flat frame with these columns)
df_multi = df.set_index(['Year', 'Quarter', 'Region'])
```
1.2 Accessing MultiIndex Data
```python
# Select by level (assumes the 3-level Year/Quarter/Region frame above)
print(df.loc['2024'])                            # outermost level
print(df.loc[('2024', 'Q1')])                    # multiple levels
print(df.loc[('2024', 'Q1', 'North'), 'Sales'])  # specific cell

# Cross-section with xs()
print(df.xs('Q1', level='Quarter'))
print(df.xs('North', level='Region'))

# Swap and sort levels
df_swapped = df.swaplevel('Year', 'Quarter')
df_sorted = df.sort_index(level=['Year', 'Quarter'])
```
1.3 MultiIndex Operations
```python
# Aggregate across levels
print(df.groupby(level='Year').sum())
print(df.groupby(level=['Year', 'Region']).sum())

# Unstack to pivot
unstacked = df.unstack(level='Region')
stacked = unstacked.stack()

# Column MultiIndex
print(df['Sales'])        # all Sales columns
print(df['Sales', 'Q1'])  # specific column
```
Checkpoint
MultiIndex = hierarchical data. xs() = cross-section by level. unstack() = level→columns. stack() = columns→level. Great for time series and panel data!
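To make the checkpoint concrete, here is a minimal, self-contained sketch of xs() and the unstack()/stack() round trip, using a small two-level frame with made-up numbers:

```python
import pandas as pd

# Small two-level frame (sample values for illustration)
index = pd.MultiIndex.from_tuples(
    [('2024', 'Q1'), ('2024', 'Q2'), ('2025', 'Q1'), ('2025', 'Q2')],
    names=['Year', 'Quarter']
)
df = pd.DataFrame({'Sales': [100, 120, 130, 160]}, index=index)

# xs(): cross-section of every Q1 across years
q1 = df.xs('Q1', level='Quarter')
print(q1)                # rows indexed by Year only

# unstack(): inner level becomes columns; stack() reverses it
wide = df.unstack(level='Quarter')
print(wide.shape)        # (2, 2) -- one row per Year, one column per Quarter
print(wide.stack().sort_index())
```

Note that unstack() moves the named level into the columns, so the result keeps one row per remaining index level.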
⛓️ 2. Method Chaining
2.1 Chaining Pattern
```python
# Without chaining (hard to read)
df = pd.read_csv('data.csv')
df = df[df['amount'] > 0]
df['year'] = pd.to_datetime(df['date']).dt.year
df = df.groupby('year')['amount'].sum().reset_index()
df = df.sort_values('amount', ascending=False)

# With chaining (clean!)
result = (
    pd.read_csv('data.csv')
    .query('amount > 0')
    .assign(year=lambda x: pd.to_datetime(x['date']).dt.year)
    .groupby('year')['amount']
    .sum()
    .reset_index()
    .sort_values('amount', ascending=False)
)
```
Method chaining = readable, maintainable code. Key methods: assign() (add columns), query() (filter), pipe() (custom functions). Each step = one transformation!
2.2 Key Methods
```python
# assign() - add columns
df = df.assign(
    profit=lambda x: x['revenue'] - x['cost'],
    margin=lambda x: x['profit'] / x['revenue']
)

# query() - filter (SQL-like)
df = df.query('amount > 100 and category == "A"')

# pipe() - custom function
def custom_transform(df, multiplier):
    return df.assign(adjusted=df['value'] * multiplier)

result = df.pipe(custom_transform, multiplier=1.1)
```
2.3 Complete Pipeline
```python
sales_summary = (
    pd.read_csv('sales.csv')
    .dropna(subset=['amount', 'date'])
    .query('amount > 0')
    .assign(
        date=lambda x: pd.to_datetime(x['date']),
        year=lambda x: x['date'].dt.year,
        quarter=lambda x: x['date'].dt.quarter
    )
    .groupby(['year', 'quarter', 'product'])
    .agg(
        total_sales=('amount', 'sum'),
        avg_sale=('amount', 'mean'),
        count=('amount', 'count')
    )
    .reset_index()
    .sort_values(['year', 'quarter', 'total_sales'], ascending=[True, True, False])
    .round(2)
)
```
Checkpoint
Chaining trinity: assign() (add columns), query() (filter), pipe() (custom functions). Wrap the chain in () for multi-line layout. Each step = one clear transformation!
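The pipelines above read from CSV files on disk; the same pattern can be sketched end to end against an in-memory frame. The data and the add_quarter helper below are made up for illustration:

```python
import pandas as pd

# Toy sales data standing in for the CSV (made-up numbers)
df = pd.DataFrame({
    'date': ['2024-01-05', '2024-02-10', '2024-07-01', '2024-08-15'],
    'amount': [100, -5, 250, 120],
})

def add_quarter(frame):
    """Custom step used via pipe(): derive the calendar quarter."""
    return frame.assign(quarter=frame['date'].dt.quarter)

result = (
    df
    .query('amount > 0')                                # drop the refund row
    .assign(date=lambda x: pd.to_datetime(x['date']))
    .pipe(add_quarter)                                  # custom function in the chain
    .groupby('quarter', as_index=False)['amount'].sum()
    .sort_values('amount', ascending=False)
)
print(result)   # Q3 total 370, Q1 total 100
```

Each line performs one transformation, so a step can be commented out or reordered without rewriting the surrounding code.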
⚡ 3. Performance Optimization
3.1 Data Types Optimization
```python
def optimize_dtypes(df):
    """Reduce memory usage by downcasting and categorizing."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')

    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')

    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < 0.5:
            df[col] = df[col].astype('category')

    return df

df_optimized = optimize_dtypes(df.copy())
```
3.2 Vectorized Operations
```python
# BAD: loops
for i in range(len(df)):
    df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 2

# GOOD: vectorized
df['new_col'] = df['col1'] * 2

# BAD: apply
df['result'] = df.apply(lambda row: row['a'] * row['b'], axis=1)

# GOOD: vectorized
df['result'] = df['a'] * df['b']

# NumPy for conditionals
df['category'] = np.select(
    [df['score'] < 50, df['score'] < 80, df['score'] >= 80],
    ['Low', 'Medium', 'High'],
    default='Unknown'
)
```
3.3 Efficient Reading
```python
# Only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2'])

# Chunks for large files
chunks = []
for chunk in pd.read_csv('large.csv', chunksize=100000):
    chunks.append(chunk.query('amount > 0'))
df = pd.concat(chunks, ignore_index=True)

# Specify dtypes upfront
df = pd.read_csv('data.csv', dtype={'id': 'int32', 'amount': 'float32',
                                    'category': 'category'})
```
Performance rules: 1) Never loop over DataFrames. 2) Use vectorized ops. 3) Downcast dtypes. 4) Read only needed columns. 5) Process in chunks if the file is large!
Checkpoint
Vectorization = no loops! Downcast = smaller dtypes. Category dtype for low-cardinality strings. Chunks for large files. These steps can cut memory usage by 50-90%!
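A quick way to see the savings is to compare memory_usage(deep=True) before and after the three conversions. The frame below is synthetic, so the exact ratio on real data will differ:

```python
import numpy as np
import pandas as pd

n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'id': np.arange(n),                        # int64, but values fit in int32
    'amount': rng.random(n),                   # float64 by default
    'region': rng.choice(['North', 'South'], n),  # low-cardinality strings
})

before = df.memory_usage(deep=True).sum()

df['id'] = pd.to_numeric(df['id'], downcast='integer')        # int64 -> int32
df['amount'] = pd.to_numeric(df['amount'], downcast='float')  # float64 -> float32
df['region'] = df['region'].astype('category')                # strings -> small codes

after = df.memory_usage(deep=True).sum()
print(f'{before / after:.1f}x smaller')   # typically a several-fold reduction here
```

The category column stores one integer code per row plus a tiny lookup table, which is where most of the savings come from.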
📅 4. Date/Time Operations
4.1 Parsing & Components
```python
# Parse dates
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # invalid -> NaT

# Extract components (dt accessor)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['weekday'] = df['date'].dt.dayofweek        # 0 = Monday
df['weekday_name'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['date'].dt.dayofweek >= 5
df['period'] = df['date'].dt.to_period('M')
```
4.2 Date Arithmetic
```python
# Add/subtract
df['next_week'] = df['date'] + pd.Timedelta(days=7)
df['prev_month'] = df['date'] - pd.DateOffset(months=1)
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
```
4.3 Resampling
```python
ts = df.set_index('date')

# Downsample
monthly = ts.resample('M').sum()
monthly = ts.resample('M').agg({'sales': 'sum', 'quantity': 'mean'})

# Upsample
daily = ts.resample('D').ffill()
daily = ts.resample('D').interpolate()

# Frequencies: D=daily, W=weekly, M=month end ('ME' in pandas >= 2.2),
# MS=month start, Q=quarter, Y=yearly
```
Checkpoint
pd.to_datetime() to parse. dt accessor to extract components (year, month, quarter). Timedelta/DateOffset for date arithmetic. resample() = time-based groupby!
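As a runnable sketch of downsampling, the example below builds a synthetic daily series and collapses it to monthly totals. 'MS' (month start) is used as the frequency alias because it works unchanged across pandas versions:

```python
import numpy as np
import pandas as pd

# One row per day over two months, value 1 each day (synthetic)
idx = pd.date_range('2024-01-01', '2024-02-29', freq='D')
ts = pd.DataFrame({'sales': np.ones(len(idx))}, index=idx)

# Downsample: one row per month, summing the daily values
monthly = ts.resample('MS').sum()
print(monthly)   # Jan 2024 -> 31.0, Feb 2024 -> 29.0 (leap year)
```

Because each day contributes 1, the monthly sums simply count the days, which makes it easy to verify that the bins are correct.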
🔤 5. String Methods
5.1 String Accessor (str)
```python
df = pd.DataFrame({
    'name': [' John Smith ', 'jane doe', 'BOB JONES'],
    'email': ['john@email.com', 'jane@test.com', 'bob@work.org']
})

# Clean
df['name_clean'] = df['name'].str.strip().str.title()

# Search
df['has_john'] = df['name'].str.contains('john', case=False)
df['starts_j'] = df['name'].str.startswith('J')

# Extract
df['first_name'] = df['name'].str.split().str[0]
df['domain'] = df['email'].str.split('@').str[1]

# Replace
df['masked'] = df['email'].str.replace(r'@.*', '@***.com', regex=True)
```
5.2 Regex
```python
# Extract with regex
df['username'] = df['email'].str.extract(r'(\w+)@')

# Validate pattern
df['valid_email'] = df['email'].str.match(r'^[\w.-]+@[\w.-]+\.\w+$')

# Count matches (assumes a 'text' column)
df['word_count'] = df['text'].str.count(r'\w+')

# Find all
df['numbers'] = df['text'].str.findall(r'\d+')
```
Checkpoint
str accessor = string operations on Series. str.contains() to search, str.extract() for regex capture, str.split().str[0] for the first word. Regex for complex patterns!
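A short, self-contained sketch combining the str accessor with regex on a tiny made-up frame. Note that str.extract() yields NaN where the pattern does not match, and expand=False keeps the single capture group as a Series:

```python
import pandas as pd

df = pd.DataFrame({'email': [' John@Email.com ', 'jane.doe@test.org', 'not-an-email']})

# Chain str methods: trim, lowercase, then capture the username before '@'
cleaned = df['email'].str.strip().str.lower()
df['username'] = cleaned.str.extract(r'^([\w.]+)@', expand=False)  # NaN on no match
df['valid'] = cleaned.str.match(r'^[\w.]+@[\w.]+\.\w+$')
print(df)
```

Cleaning before matching (strip + lower) keeps the regex itself simple, since it no longer has to account for whitespace or case.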
📋 Summary
What You Learned
| Topic | Key Points |
|---|---|
| MultiIndex | Hierarchical indexing, xs(), level operations |
| Chaining | assign(), query(), pipe() for clean code |
| Optimization | dtypes, vectorization, chunking |
| Dates | dt accessor, resample(), date arithmetic |
| Strings | str accessor, regex, pattern matching |
Best Practices
- ✅ Use method chaining for readability
- ✅ Optimize dtypes for large datasets
- ✅ Avoid loops — use vectorized operations
- ✅ Leverage MultiIndex for complex analyses
- ✅ Profile memory usage regularly
Self-Check Questions
- What advantages does MultiIndex have over a single index?
- How does method chaining make code better?
- How do you optimize memory for a large DataFrame?
- How does the dt accessor differ from apply() for datetime processing?
Next lesson: Advanced SQL — Window Functions, CTEs, Query Optimization →
🎉 Excellent! You have mastered advanced Pandas!
Remember: method chaining + vectorization + the right dtypes = code that is both clean and fast. From now on, write Pandas code like a pro!
