Advanced Pandas
1. MultiIndex (Hierarchical Index)
1.1 Creating a MultiIndex DataFrame
```python
import pandas as pd
import numpy as np

# Method 1: from arrays
arrays = [
    ['2023', '2023', '2024', '2024'],
    ['Q1', 'Q2', 'Q1', 'Q2']
]
index = pd.MultiIndex.from_arrays(arrays, names=['Year', 'Quarter'])
df = pd.DataFrame({
    'Revenue': [100, 150, 120, 180],
    'Profit': [20, 35, 28, 45]
}, index=index)

print(df)
#               Revenue  Profit
# Year Quarter
# 2023 Q1           100      20
#      Q2           150      35
# 2024 Q1           120      28
#      Q2           180      45
```
```python
# Method 2: set_index with multiple columns
sales_data = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250],
    'Quantity': [10, 20, 15, 25]
})

df_multi = sales_data.set_index(['Region', 'Product'])
print(df_multi)
#                 Sales  Quantity
# Region Product
# North  A          100        10
#        B          200        20
# South  A          150        15
#        B          250        25
```
1.2 Accessing a MultiIndex
```python
# Get all data for one region
df_multi.loc['North']
#          Sales  Quantity
# Product
# A          100        10
# B          200        20

# Get a specific row
df_multi.loc[('North', 'A')]
# Sales       100
# Quantity     10

# Cross-section with xs()
df_multi.xs('A', level='Product')
#         Sales  Quantity
# Region
# North     100        10
# South     150        15
```
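Beyond xs(), label-based slicing with pd.IndexSlice also works on a MultiIndex, provided the index is sorted. A minimal sketch, continuing from the df_multi defined above:

```python
# Sketch: partial slicing with pd.IndexSlice (requires a sorted index)
idx = pd.IndexSlice
df_sorted = df_multi.sort_index()

# All regions, product 'A' only
print(df_sorted.loc[idx[:, 'A'], :])

# A range of outer-level labels
print(df_sorted.loc[idx['North':'South', :], :])
```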
1.3 Stack and Unstack
```python
# Pivot data between the row and column index
df_wide = df_multi.unstack()  # Move Product to columns
print(df_wide)
#         Sales       Quantity
# Product     A    B         A   B
# Region
# North     100  200        10  20
# South     150  250        15  25

# Stack back
df_long = df_wide.stack()
```
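A quick way to check that the two operations are inverses in this example (no Region/Product combination is missing, so unstack() introduces no NaN and the dtypes survive the round trip):

```python
# Sketch: stack() after unstack() should reproduce the original frame here
print(df_wide.stack().equals(df_multi))  # expected: True for this data
```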
2. Advanced GroupBy
2.1 Multiple Aggregations
```python
# Sample data
df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Product': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Sales': [100, 200, 150, 250, 120, 180],
    'Quantity': [10, 20, 15, 25, 12, 18]
})

# Multiple agg functions
result = df.groupby('Category').agg({
    'Sales': ['sum', 'mean', 'max'],
    'Quantity': ['sum', 'count']
})
print(result)
#          Sales              Quantity
#            sum   mean  max       sum count
# Category
# A          420  140.0  200        42     3
# B          580  193.3  250        58     3
```
2.2 Custom Aggregation
```python
# Named aggregations (pandas 0.25+)
result = df.groupby('Category').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
    top_sale=('Sales', 'max'),
    num_transactions=('Sales', 'count'),
    sales_range=('Sales', lambda x: x.max() - x.min())
)
```
2.3 Transform vs Apply
```python
# Transform: returns the same shape as the input
df['Sales_pct'] = df.groupby('Category')['Sales'].transform(
    lambda x: x / x.sum() * 100
)

# Apply: returns an aggregated result
def top_n_sales(group, n=2):
    return group.nlargest(n, 'Sales')

df.groupby('Category').apply(top_n_sales, n=2)
```
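To make the shape difference concrete, here is a small check, continuing from the df and top_n_sales defined above:

```python
# transform keeps one value per original row, so it aligns with df
pct = df.groupby('Category')['Sales'].transform('sum')
print(len(pct) == len(df))  # True: same length as the input

# apply with a function that returns rows yields a grouped, hierarchical index
top2 = df.groupby('Category').apply(top_n_sales, n=2)
print(top2.index)  # MultiIndex of (Category, original row label)
```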
2.4 Filter Groups
```python
# Keep only categories with total sales > 500
df_filtered = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 500)
```
3. Merge Strategies
3.1 Join Types
```python
# Sample DataFrames
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'David']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 5],
    'amount': [100, 200, 150, 300]
})

# Inner join (default)
inner = pd.merge(customers, orders, on='customer_id', how='inner')
# Keep only matching rows

# Left join
left = pd.merge(customers, orders, on='customer_id', how='left')
# Keep all customers, NaN for orders that do not match

# Right join
right = pd.merge(customers, orders, on='customer_id', how='right')
# Keep all orders, NaN for customers that do not match

# Outer join
outer = pd.merge(customers, orders, on='customer_id', how='outer')
# Keep all rows from both tables
```
3.2 Merge with Different Column Names
```python
orders_v2 = pd.DataFrame({
    'order_id': [101, 102],
    'cust_id': [1, 2],  # Different column name
    'amount': [100, 200]
})

merged = pd.merge(
    customers,
    orders_v2,
    left_on='customer_id',
    right_on='cust_id'
)
```
3.3 Merge Indicators
```python
# The indicator column shows which table each row came from
merged = pd.merge(
    customers, orders,
    on='customer_id',
    how='outer',
    indicator=True
)
# _merge: 'left_only', 'right_only', 'both'

# Find customers with no orders
no_orders = merged[merged['_merge'] == 'left_only']
```
3.4 Concatenation
```python
# Vertical concat (stack rows)
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
combined = pd.concat([df1, df2], ignore_index=True)

# Horizontal concat (add columns)
combined_h = pd.concat([df1, df2], axis=1)
```
4. Performance Optimization
4.1 Choosing the Right Data Types
```python
# Check memory usage
df.info(memory_usage='deep')

# Downcast numeric types
df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')

# Use category for low-cardinality strings
df['category_col'] = df['category_col'].astype('category')
```
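To see what the category conversion actually saves, here is a small self-contained sketch comparing the memory footprint of a repetitive string column before and after astype('category') (synthetic data; exact numbers will vary):

```python
import numpy as np
import pandas as pd

# Synthetic low-cardinality string column: 1M rows, only 3 distinct values
s_obj = pd.Series(np.random.choice(['North', 'South', 'East'], size=1_000_000))
s_cat = s_obj.astype('category')

print(s_obj.memory_usage(deep=True))  # tens of MB of Python string objects
print(s_cat.memory_usage(deep=True))  # ~1 byte per row plus the category labels
```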
4.2 Efficient Operations
```python
# ❌ Slow: iterrows
for idx, row in df.iterrows():
    df.loc[idx, 'new_col'] = row['col1'] + row['col2']

# ✅ Fast: vectorized
df['new_col'] = df['col1'] + df['col2']

# ⚠️ OK: apply with axis=1 (for complex row-wise logic; still much slower than vectorized code)
df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)

# ✅ Fastest: numpy
import numpy as np
df['new_col'] = np.where(df['col1'] > 0, df['col1'], df['col2'])
```
4.3 Chunked Reading
```python
# Read a large file in chunks
chunks = []
for chunk in pd.read_csv('large_file.csv', chunksize=100000):
    # Process each chunk
    processed = chunk[chunk['value'] > 0]  # Filter
    chunks.append(processed)

df_final = pd.concat(chunks, ignore_index=True)
```
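When the goal is an aggregate rather than a filtered copy, you can also avoid holding every chunk by reducing each chunk first and combining the partial results. A sketch under the same assumption of a large_file.csv with category and value columns (hypothetical names):

```python
import pandas as pd

partials = []
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    # Reduce each chunk to a small partial result
    partials.append(chunk.groupby('category')['value'].sum())

# Combine the per-chunk sums into the final totals
totals = pd.concat(partials).groupby(level=0).sum()
print(totals)
```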
4.4 Query Method
```python
# query() can be faster than boolean indexing
# on large DataFrames

# Boolean indexing
df[(df['A'] > 5) & (df['B'] < 10)]

# query method
df.query('A > 5 and B < 10')

# query with variables
threshold = 5
df.query('A > @threshold')
```
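The speed claim is easy to verify on your own data; here is a rough, self-contained timing sketch (results depend on DataFrame size and on whether numexpr is installed, which query() uses when available):

```python
import timeit

import numpy as np
import pandas as pd

df_big = pd.DataFrame(np.random.randn(1_000_000, 2), columns=['A', 'B'])

t_bool = timeit.timeit(lambda: df_big[(df_big['A'] > 0.5) & (df_big['B'] < 0.5)], number=10)
t_query = timeit.timeit(lambda: df_big.query('A > 0.5 and B < 0.5'), number=10)

print(f'boolean indexing: {t_bool:.3f}s, query(): {t_query:.3f}s')
```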
5. Practical Examples
5.1 Sales Analysis Pipeline
```python
import pandas as pd

# Load data
sales = pd.read_csv('sales.csv')
products = pd.read_csv('products.csv')
customers = pd.read_csv('customers.csv')

# Merge all tables
df = (
    sales
    .merge(products, on='product_id')
    .merge(customers, on='customer_id')
)

# Create analysis
analysis = (
    df
    .assign(
        revenue=lambda x: x['quantity'] * x['unit_price'],
        profit=lambda x: x['revenue'] * x['margin']
    )
    .groupby(['region', 'category'])
    .agg(
        total_revenue=('revenue', 'sum'),
        total_profit=('profit', 'sum'),
        order_count=('order_id', 'nunique'),
        avg_order_value=('revenue', 'mean')
    )
    .sort_values('total_revenue', ascending=False)
)
```
5.2 Time Series Resampling
```python
# Set datetime index
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')

# Resample to monthly
monthly = df.resample('M').agg({
    'sales': 'sum',
    'quantity': 'sum',
    'transactions': 'count'
})

# Rolling calculations
df['rolling_avg_7d'] = df['sales'].rolling(window=7).mean()
df['rolling_sum_30d'] = df['sales'].rolling(window=30).sum()
```
6. Practice
Exercises
Exercise 1: MultiIndex Operations
```python
# Create a DataFrame with a MultiIndex: Region -> Year -> Quarter
# Compute total sales per Region, and each Quarter's % contribution

# YOUR CODE HERE
```
💡 See solution
```python
data = pd.DataFrame({
    'Region': ['North']*4 + ['South']*4,
    'Year': [2023, 2023, 2024, 2024] * 2,
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'] * 2,
    'Sales': [100, 150, 120, 180, 80, 120, 100, 140]
})

df = data.set_index(['Region', 'Year', 'Quarter'])

# Total per Region
region_totals = df.groupby('Region')['Sales'].sum()

# % contribution per Quarter
df['Pct'] = df.groupby(level='Region')['Sales'].transform(
    lambda x: x / x.sum() * 100
)
print(df)
```
Exercise 2: Complex Aggregation
```python
# Given the orders DataFrame, compute for each customer:
# - Total orders, total amount
# - Average order value
# - Days since first order
# - Days since last order

# YOUR CODE HERE
```
💡 See solution
```python
import pandas as pd
from datetime import datetime

orders = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 1, 3],
    'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-05',
                                  '2024-02-01', '2024-02-10', '2024-01-20']),
    'amount': [100, 200, 150, 180, 120, 90]
})

today = datetime.now()

customer_summary = orders.groupby('customer_id').agg(
    total_orders=('amount', 'count'),
    total_amount=('amount', 'sum'),
    avg_order_value=('amount', 'mean'),
    first_order=('order_date', 'min'),
    last_order=('order_date', 'max')
)

customer_summary['days_since_first'] = (today - customer_summary['first_order']).dt.days
customer_summary['days_since_last'] = (today - customer_summary['last_order']).dt.days

print(customer_summary)
```
7. Summary
| Concept | Use Case | Key Methods |
|---|---|---|
| MultiIndex | Hierarchical data | set_index(), xs(), stack(), unstack() |
| GroupBy | Aggregations | agg(), transform(), apply(), filter() |
| Merge | Combine tables | merge(), concat(), join() |
| Performance | Large data | astype(), vectorization, query(), chunking |
Best Practices:
- Always check df.info() and optimize dtypes
- Prefer vectorized operations over loops
- Use query() for complex filters
- Read large files in chunks
Next lesson: Data Quality & Validation
