Pandas Advanced
1. MultiIndex (Hierarchical Index)
MultiIndex Power
MultiIndex cho phép làm việc với data có nhiều dimensions, tạo hierarchical rows/columns - rất hữu ích cho time series, panel data, và complex aggregations.
1.1 Creating MultiIndex
Python
1import pandas as pd2import numpy as np34# From tuples5index = pd.MultiIndex.from_tuples([6 ('2024', 'Q1'), ('2024', 'Q2'), ('2024', 'Q3'), ('2024', 'Q4'),7 ('2025', 'Q1'), ('2025', 'Q2')8], names=['Year', 'Quarter'])910df = pd.DataFrame({11 'Sales': [100, 120, 150, 180, 130, 160],12 'Profit': [20, 25, 30, 35, 28, 32]13}, index=index)14print(df)1516# From product (all combinations)17years = ['2024', '2025']18quarters = ['Q1', 'Q2', 'Q3', 'Q4']19regions = ['North', 'South']2021index = pd.MultiIndex.from_product(22 [years, quarters, regions],23 names=['Year', 'Quarter', 'Region']24)25df = pd.DataFrame({26 'Sales': np.random.randint(100, 200, len(index))27}, index=index)2829# From DataFrame columns30df = pd.DataFrame({31 'year': [2024, 2024, 2024, 2024, 2025, 2025],32 'quarter': ['Q1', 'Q1', 'Q2', 'Q2', 'Q1', 'Q1'],33 'region': ['North', 'South', 'North', 'South', 'North', 'South'],34 'sales': [100, 150, 120, 180, 130, 160]35})3637df_multi = df.set_index(['year', 'quarter', 'region'])38print(df_multi)1.2 Accessing MultiIndex Data
Python
1# Sample MultiIndex DataFrame2df = pd.DataFrame({3 'year': [2024, 2024, 2024, 2024, 2025, 2025, 2025, 2025],4 'quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],5 'region': ['North', 'North', 'South', 'South', 'North', 'North', 'South', 'South'],6 'sales': [100, 120, 150, 180, 130, 160, 140, 170]7}).set_index(['year', 'quarter', 'region'])89# Select by outermost level10print(df.loc[2024])1112# Select by multiple levels13print(df.loc[(2024, 'Q1')])1415# Select specific cell16print(df.loc[(2024, 'Q1', 'North'), 'sales'])1718# Cross-section with xs()19print(df.xs('Q1', level='quarter'))20print(df.xs('North', level='region'))21print(df.xs((2024, 'Q1'), level=['year', 'quarter']))2223# Slice with slice()24print(df.loc[(slice(None), 'Q1'), :]) # All years, Q1 only1.3 Swapping and Sorting Levels
Python
1# Swap levels2df_swapped = df.swaplevel('year', 'quarter')3print(df_swapped)45# Reorder levels6df_reordered = df.reorder_levels(['region', 'year', 'quarter'])78# Sort index9df_sorted = df.sort_index()10df_sorted = df.sort_index(level='region')11df_sorted = df.sort_index(level=['year', 'quarter'], ascending=[True, False])1.4 Aggregating with MultiIndex
Python
1# Sum across a level2print(df.sum(level='year')) # Sum per year3print(df.sum(level=['year', 'region'])) # Sum per year-region45# Alternative with groupby6print(df.groupby(level='year').sum())7print(df.groupby(level=['year', 'region']).sum())89# Unstack to pivot10unstacked = df.unstack(level='region')11print(unstacked)1213# Stack back14stacked = unstacked.stack()2. Column MultiIndex
2.1 Creating Column MultiIndex
Python
1# From tuples2columns = pd.MultiIndex.from_tuples([3 ('Sales', 'Q1'), ('Sales', 'Q2'), ('Sales', 'Q3'), ('Sales', 'Q4'),4 ('Profit', 'Q1'), ('Profit', 'Q2'), ('Profit', 'Q3'), ('Profit', 'Q4')5], names=['Metric', 'Quarter'])67df = pd.DataFrame(8 np.random.randint(100, 500, (3, 8)),9 index=['Product A', 'Product B', 'Product C'],10 columns=columns11)12print(df)1314# Access columns15print(df['Sales']) # All Sales columns16print(df['Sales', 'Q1']) # Sales Q1 only17print(df.loc[:, ('Sales', 'Q1')]) # Same with loc2.2 Stacking Column MultiIndex
Python
1# Stack columns to rows2stacked = df.stack(level='Quarter')3print(stacked)45# Stack all levels6fully_stacked = df.stack(level=['Metric', 'Quarter'])7print(fully_stacked)89# Unstack to recreate10unstacked = stacked.unstack()3. Method Chaining
Clean Code with Chaining
Method chaining tạo code readable, maintainable, và dễ debug hơn. Mỗi step là một transformation rõ ràng.
3.1 Chaining Pattern
Python
1# Without chaining (hard to read)2df = pd.read_csv('data.csv')3df = df[df['amount'] > 0]4df['year'] = pd.to_datetime(df['date']).dt.year5df = df.groupby('year')['amount'].sum().reset_index()6df = df.sort_values('amount', ascending=False)78# With chaining (clean and readable)9result = (10 pd.read_csv('data.csv')11 .query('amount > 0')12 .assign(year=lambda x: pd.to_datetime(x['date']).dt.year)13 .groupby('year')['amount']14 .sum()15 .reset_index()16 .sort_values('amount', ascending=False)17)3.2 Key Methods for Chaining
Python
1# assign() - Add columns2df = df.assign(3 profit=lambda x: x['revenue'] - x['cost'],4 margin=lambda x: x['profit'] / x['revenue']5)67# query() - Filter rows (SQL-like)8df = df.query('amount > 100 and category == "A"')910# pipe() - Apply custom function11def custom_transform(df, multiplier):12 return df.assign(adjusted=df['value'] * multiplier)1314result = df.pipe(custom_transform, multiplier=1.1)1516# loc[] with chaining17result = (18 df19 .loc[lambda x: x['status'] == 'active']20 .sort_values('score', ascending=False)21 .head(10)22)3.3 Complete Chaining Example
Python
1# Sales analysis pipeline2sales_summary = (3 pd.read_csv('sales.csv')4 5 # Clean6 .dropna(subset=['amount', 'date'])7 .query('amount > 0')8 9 # Transform10 .assign(11 date=lambda x: pd.to_datetime(x['date']),12 year=lambda x: x['date'].dt.year,13 month=lambda x: x['date'].dt.month,14 quarter=lambda x: x['date'].dt.quarter,15 amount_k=lambda x: x['amount'] / 100016 )17 18 # Aggregate19 .groupby(['year', 'quarter', 'product'])20 .agg(21 total_sales=('amount', 'sum'),22 avg_sale=('amount', 'mean'),23 transaction_count=('amount', 'count')24 )25 .reset_index()26 27 # Calculate metrics28 .assign(29 sales_per_transaction=lambda x: x['total_sales'] / x['transaction_count']30 )31 32 # Sort and format33 .sort_values(['year', 'quarter', 'total_sales'], ascending=[True, True, False])34 .round(2)35)4. Performance Optimization
4.1 Data Types Optimization
Python
1# Check memory usage2print(df.info(memory_usage='deep'))3print(df.memory_usage(deep=True))45# Downcast numeric types6def optimize_dtypes(df):7 """Optimize DataFrame memory usage"""8 9 for col in df.select_dtypes(include=['int64']).columns:10 df[col] = pd.to_numeric(df[col], downcast='integer')11 12 for col in df.select_dtypes(include=['float64']).columns:13 df[col] = pd.to_numeric(df[col], downcast='float')14 15 for col in df.select_dtypes(include=['object']).columns:16 num_unique = df[col].nunique()17 num_total = len(df[col])18 if num_unique / num_total < 0.5: # Less than 50% unique19 df[col] = df[col].astype('category')20 21 return df2223df_optimized = optimize_dtypes(df.copy())24print(f"Memory reduction: {1 - df_optimized.memory_usage(deep=True).sum() / df.memory_usage(deep=True).sum():.1%}")4.2 Efficient Operations
Python
1# Use vectorized operations instead of loops2# BAD3for i in range(len(df)):4 df.loc[i, 'new_col'] = df.loc[i, 'col1'] * 256# GOOD7df['new_col'] = df['col1'] * 289# BAD - iterrows10for index, row in df.iterrows():11 result = row['a'] + row['b']1213# GOOD - vectorized14df['result'] = df['a'] + df['b']1516# Use apply only when necessary17# BAD18df['result'] = df.apply(lambda row: row['a'] * row['b'], axis=1)1920# GOOD21df['result'] = df['a'] * df['b']2223# When apply is needed, use raw=True for speed24df['result'] = df.apply(lambda row: complex_function(row), axis=1, raw=True)4.3 Efficient Reading
Python
1# Read only needed columns2df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])34# Read in chunks5chunk_size = 1000006chunks = []7for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):8 # Process each chunk9 processed = chunk.query('amount > 0')10 chunks.append(processed)1112df = pd.concat(chunks, ignore_index=True)1314# Specify dtypes upfront15dtypes = {16 'id': 'int32',17 'amount': 'float32',18 'category': 'category'19}20df = pd.read_csv('data.csv', dtype=dtypes)2122# Parse dates efficiently23df = pd.read_csv('data.csv', parse_dates=['date'], 24 date_parser=lambda x: pd.to_datetime(x, format='%Y-%m-%d'))4.4 NumPy Integration
Python
1import numpy as np23# Use NumPy for heavy calculations4# Pandas Series to NumPy5arr = df['column'].to_numpy()67# NumPy operations are faster8result = np.where(arr > 100, arr * 2, arr)910# Apply back to DataFrame11df['new_column'] = result1213# NumPy for conditional operations14df['category'] = np.select(15 condlist=[16 df['score'] < 50,17 df['score'] < 80,18 df['score'] >= 8019 ],20 choicelist=['Low', 'Medium', 'High'],21 default='Unknown'22)4.5 Caching and eval()
Python
1# eval() for complex expressions2df = pd.eval('''3 profit = revenue - cost4 margin = profit / revenue5 adjusted = margin * 1.16''', target=df)78# query() uses eval internally9df.query('revenue > 1000 and margin > 0.2')1011# For large DataFrames, enable numexpr12pd.set_option('compute.use_numexpr', True)5. Working with Dates
5.1 Date Parsing
Python
1# Parse dates2df['date'] = pd.to_datetime(df['date'])3df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')4df['date'] = pd.to_datetime(df['date'], errors='coerce') # Invalid -> NaT56# From components7df['date'] = pd.to_datetime(df[['year', 'month', 'day']])89# DatetimeIndex10df = df.set_index('date')5.2 Date Components
Python
1# Extract components (dt accessor)2df['year'] = df['date'].dt.year3df['month'] = df['date'].dt.month4df['day'] = df['date'].dt.day5df['weekday'] = df['date'].dt.dayofweek # 0=Monday6df['weekday_name'] = df['date'].dt.day_name()7df['quarter'] = df['date'].dt.quarter8df['week'] = df['date'].dt.isocalendar().week9df['is_weekend'] = df['date'].dt.dayofweek >= 51011# Period12df['period'] = df['date'].dt.to_period('M') # Monthly period13df['period_q'] = df['date'].dt.to_period('Q') # Quarterly5.3 Date Arithmetic
Python
1from datetime import timedelta23# Add/subtract days4df['next_week'] = df['date'] + pd.Timedelta(days=7)5df['prev_month'] = df['date'] - pd.DateOffset(months=1)67# Difference between dates8df['days_since_start'] = (df['date'] - df['date'].min()).dt.days910# Business days11df['business_days'] = np.busday_count(12 df['start_date'].values.astype('datetime64[D]'),13 df['end_date'].values.astype('datetime64[D]')14)5.4 Resampling
Python
1# Set date index2ts = df.set_index('date')34# Downsample to monthly5monthly = ts.resample('M').sum()6monthly = ts.resample('M').agg({'sales': 'sum', 'quantity': 'mean'})78# Upsample with fill9daily = ts.resample('D').ffill() # Forward fill10daily = ts.resample('D').interpolate()1112# Common frequencies13# D=daily, W=weekly, M=month end, MS=month start14# Q=quarter end, Y=year end, H=hourly6. String Methods
6.1 String Accessor (str)
Python
1# Sample data2df = pd.DataFrame({3 'name': [' John Smith ', 'jane doe', 'BOB JONES', 'Alice Brown'],4 'email': ['john@email.com', 'jane@test.com', 'bob@work.org', 'alice@email.com']5})67# Basic string operations8df['name_clean'] = df['name'].str.strip()9df['name_lower'] = df['name'].str.lower()10df['name_upper'] = df['name'].str.upper()11df['name_title'] = df['name'].str.strip().str.title()1213# Length14df['name_length'] = df['name'].str.len()1516# Contains17df['has_john'] = df['name'].str.contains('john', case=False)1819# Extract20df['first_name'] = df['name'].str.split().str[0]21df['domain'] = df['email'].str.split('@').str[1]2223# Replace24df['email_masked'] = df['email'].str.replace(r'@.*', '@***.com', regex=True)2526# Regex extract27df['username'] = df['email'].str.extract(r'(\w+)@')6.2 Pattern Matching
Python
1# Check patterns2df['valid_email'] = df['email'].str.match(r'^[\w.-]+@[\w.-]+\.\w+$')34# Find all matches5df['numbers'] = df['text'].str.findall(r'\d+')67# Count occurrences8df['word_count'] = df['text'].str.count(r'\w+')910# Split and expand11names_split = df['name'].str.split(' ', expand=True)12names_split.columns = ['first', 'last']13df = pd.concat([df, names_split], axis=1)7. Thực hành
Advanced Practice
Exercise: Sales Data Pipeline
Python
1import pandas as pd2import numpy as np34# Create complex dataset5np.random.seed(42)6n = 100078data = pd.DataFrame({9 'order_id': range(1, n+1),10 'date': pd.date_range('2023-01-01', periods=n, freq='6H'),11 'customer': [f'Customer_{i}' for i in np.random.randint(1, 50, n)],12 'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Watch'], n),13 'region': np.random.choice(['North', 'South', 'East', 'West'], n),14 'quantity': np.random.randint(1, 10, n),15 'unit_price': np.random.uniform(100, 1000, n).round(2)16})1718# Tasks:19# 1. Tạo complete analysis pipeline với method chaining20# 2. Tính monthly sales by product and region (MultiIndex)21# 3. Tìm top customer by year-quarter22# 4. Optimize memory usage23# 5. Add rolling 7-day average và YoY growth2425# YOUR CODE HERE💡 Xem đáp án
Python
1# 1. Complete analysis pipeline2analysis_result = (3 data4 # Calculate total5 .assign(total_amount=lambda x: x['quantity'] * x['unit_price'])6 7 # Extract date components8 .assign(9 year=lambda x: x['date'].dt.year,10 month=lambda x: x['date'].dt.month,11 quarter=lambda x: x['date'].dt.quarter,12 week=lambda x: x['date'].dt.isocalendar().week.astype(int)13 )14 15 # Filter valid data16 .query('total_amount > 0')17 18 # Calculate totals19 .assign(20 amount_k=lambda x: x['total_amount'] / 100021 )22)2324print("Pipeline result shape:", analysis_result.shape)2526# 2. Monthly sales MultiIndex27monthly_sales = (28 analysis_result29 .groupby(['year', 'month', 'product', 'region'])30 .agg(31 total_sales=('total_amount', 'sum'),32 transactions=('order_id', 'count'),33 avg_order=('total_amount', 'mean')34 )35 .round(2)36)3738print("\nMonthly Sales MultiIndex:")39print(monthly_sales.head(10))4041# Access specific data42print("\n2023 Q1 data:")43print(monthly_sales.xs(2023, level='year').head())4445# 3. Top customer by year-quarter46top_customers = (47 analysis_result48 .groupby(['year', 'quarter', 'customer'])['total_amount']49 .sum()50 .reset_index()51 .sort_values(['year', 'quarter', 'total_amount'], ascending=[True, True, False])52 .groupby(['year', 'quarter'])53 .head(3) # Top 3 per quarter54)5556print("\nTop Customers by Quarter:")57print(top_customers)5859# 4. Memory optimization60def optimize_dataframe(df):61 """Optimize memory usage"""62 before = df.memory_usage(deep=True).sum() / 1024**263 64 # Optimize integers65 for col in df.select_dtypes(include=['int64']).columns:66 df[col] = pd.to_numeric(df[col], downcast='integer')67 68 # Optimize floats69 for col in df.select_dtypes(include=['float64']).columns:70 df[col] = pd.to_numeric(df[col], downcast='float')71 72 # Convert low-cardinality strings to category73 for col in df.select_dtypes(include=['object']).columns:74 if df[col].nunique() / len(df) < 0.5:75 df[col] = df[col].astype('category')76 77 after = df.memory_usage(deep=True).sum() / 1024**278 print(f"Memory: {before:.2f}MB -> {after:.2f}MB ({(1-after/before)*100:.1f}% reduction)")79 80 return df8182data_optimized = optimize_dataframe(analysis_result.copy())8384# 5. Rolling average and YoY growth85daily_sales = (86 analysis_result87 .set_index('date')88 .resample('D')['total_amount']89 .sum()90 .to_frame('daily_sales')91)9293daily_sales['rolling_7d'] = daily_sales['daily_sales'].rolling(7).mean()94daily_sales['rolling_30d'] = daily_sales['daily_sales'].rolling(30).mean()9596# YoY growth (if we had multiple years)97daily_sales['yoy_growth'] = daily_sales['daily_sales'].pct_change(365) * 1009899print("\nDaily Sales with Rolling Averages:")100print(daily_sales.tail(10).round(2))101102# Visualization103import matplotlib.pyplot as plt104105fig, axes = plt.subplots(2, 2, figsize=(14, 10))106107# Monthly trend by product108monthly_product = (109 analysis_result110 .groupby([analysis_result['date'].dt.to_period('M'), 'product'])['total_amount']111 .sum()112 .unstack()113)114monthly_product.plot(ax=axes[0, 0], marker='o')115axes[0, 0].set_title('Monthly Sales by Product')116axes[0, 0].set_xlabel('Month')117118# Region distribution119region_sales = analysis_result.groupby('region')['total_amount'].sum()120axes[0, 1].pie(region_sales, labels=region_sales.index, autopct='%1.1f%%')121axes[0, 1].set_title('Sales by Region')122123# Daily with rolling average124axes[1, 0].plot(daily_sales.index, daily_sales['daily_sales'], alpha=0.3, label='Daily')125axes[1, 0].plot(daily_sales.index, daily_sales['rolling_7d'], label='7-day MA')126axes[1, 0].plot(daily_sales.index, daily_sales['rolling_30d'], label='30-day MA')127axes[1, 0].set_title('Daily Sales with Moving Averages')128axes[1, 0].legend()129axes[1, 0].tick_params(axis='x', rotation=45)130131# Heatmap of region x product132heatmap_data = analysis_result.pivot_table(133 values='total_amount',134 index='region',135 columns='product',136 aggfunc='sum'137)138im = axes[1, 1].imshow(heatmap_data.values, cmap='YlOrRd')139axes[1, 1].set_xticks(range(len(heatmap_data.columns)))140axes[1, 1].set_yticks(range(len(heatmap_data.index)))141axes[1, 1].set_xticklabels(heatmap_data.columns)142axes[1, 1].set_yticklabels(heatmap_data.index)143axes[1, 1].set_title('Sales Heatmap: Region x Product')144plt.colorbar(im, ax=axes[1, 1])145146plt.tight_layout()147plt.show()8. Tổng kết
| Topic | Key Points |
|---|---|
| MultiIndex | Hierarchical indexing, xs(), level operations |
| Chaining | assign(), query(), pipe() for clean code |
| Optimization | dtypes, vectorization, chunking |
| Dates | dt accessor, resample(), date arithmetic |
| Strings | str accessor, regex, pattern matching |
Best Practices:
- ✅ Use method chaining for readability
- ✅ Optimize dtypes for large datasets
- ✅ Avoid loops - use vectorized operations
- ✅ Leverage MultiIndex for complex analyses
- ✅ Profile memory usage regularly
Bài tiếp theo: Window Functions Deep Dive - Advanced SQL
