Pandas Nâng cao

Python Pandas Data Analysis

1. MultiIndex (Hierarchical Index)

1.1 Tạo MultiIndex DataFrame

Python

1import pandas as pd
2import numpy as np
3
4# Method 1: From tuples
5arrays = [
6    ['2023', '2023', '2024', '2024'],
7    ['Q1', 'Q2', 'Q1', 'Q2']
8]
9index = pd.MultiIndex.from_arrays(arrays, names=['Year', 'Quarter'])
10df = pd.DataFrame({
11    'Revenue': [100, 150, 120, 180],
12    'Profit': [20, 35, 28, 45]
13}, index=index)
14
15print(df)
16#              Revenue  Profit
17# Year Quarter                
18# 2023 Q1          100      20
19#      Q2          150      35
20# 2024 Q1          120      28
21#      Q2          180      45

Python

1# Method 2: set_index với multiple columns
2sales_data = pd.DataFrame({
3    'Region': ['North', 'North', 'South', 'South'],
4    'Product': ['A', 'B', 'A', 'B'],
5    'Sales': [100, 200, 150, 250],
6    'Quantity': [10, 20, 15, 25]
7})
8
9df_multi = sales_data.set_index(['Region', 'Product'])
10print(df_multi)
11#                 Sales  Quantity
12# Region Product                 
13# North  A          100        10
14#        B          200        20
15# South  A          150        15
16#        B          250        25

1.2 Truy cập MultiIndex

Python

1# Lấy tất cả data của một region
2df_multi.loc['North']
3#          Sales  Quantity
4# Product                 
5# A          100        10
6# B          200        20
7
8# Lấy specific row
9df_multi.loc[('North', 'A')]
10# Sales       100
11# Quantity     10
12
13# Cross-section với xs()
14df_multi.xs('A', level='Product')
15#         Sales  Quantity
16# Region                 
17# North     100        10
18# South     150        15

1.3 Stack và Unstack

Python

1# Pivot data giữa row và column index
2df_wide = df_multi.unstack()  # Move Product to columns
3print(df_wide)
4#         Sales       Quantity      
5# Product     A    B        A     B
6# Region                           
7# North     100  200       10    20
8# South     150  250       15    25
9
10# Stack back
11df_long = df_wide.stack()

2. GroupBy Nâng cao

2.1 Multiple Aggregations

Python

1# Sample data
2df = pd.DataFrame({
3    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
4    'Product': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
5    'Sales': [100, 200, 150, 250, 120, 180],
6    'Quantity': [10, 20, 15, 25, 12, 18]
7})
8
9# Multiple agg functions
10result = df.groupby('Category').agg({
11    'Sales': ['sum', 'mean', 'max'],
12    'Quantity': ['sum', 'count']
13})
14print(result)
15#          Sales             Quantity      
16#            sum   mean  max      sum count
17# Category                                 
18# A          420  140.0  200       42     3
19# B          580  193.3  250       58     3

2.2 Custom Aggregation

Python

1# Named aggregations (pandas 0.25+)
2result = df.groupby('Category').agg(
3    total_sales=('Sales', 'sum'),
4    avg_sales=('Sales', 'mean'),
5    top_sale=('Sales', 'max'),
6    num_transactions=('Sales', 'count'),
7    sales_range=('Sales', lambda x: x.max() - x.min())
8)

2.3 Transform vs Apply

Python

1# Transform: Returns same shape as input
2df['Sales_pct'] = df.groupby('Category')['Sales'].transform(
3    lambda x: x / x.sum() * 100
4)
5
6# Apply: Returns aggregated result
7def top_n_sales(group, n=2):
8    return group.nlargest(n, 'Sales')
9
10df.groupby('Category').apply(top_n_sales, n=2)

2.4 Filter Groups

Python

1# Chỉ giữ categories có tổng sales > 500
2df_filtered = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 500)

3. Merge Strategies

3.1 Các loại Join

Python

1# Sample DataFrames
2customers = pd.DataFrame({
3    'customer_id': [1, 2, 3, 4],
4    'name': ['Alice', 'Bob', 'Charlie', 'David']
5})
6
7orders = pd.DataFrame({
8    'order_id': [101, 102, 103, 104],
9    'customer_id': [1, 2, 2, 5],
10    'amount': [100, 200, 150, 300]
11})
12
13# Inner join (default)
14inner = pd.merge(customers, orders, on='customer_id', how='inner')
15# Chỉ giữ matching rows
16
17# Left join
18left = pd.merge(customers, orders, on='customer_id', how='left')
19# Giữ tất cả customers, NaN cho orders không match
20
21# Right join
22right = pd.merge(customers, orders, on='customer_id', how='right')
23# Giữ tất cả orders, NaN cho customers không match
24
25# Outer join
26outer = pd.merge(customers, orders, on='customer_id', how='outer')
27# Giữ tất cả rows từ cả hai bảng

3.2 Merge với Different Column Names

Python

1orders_v2 = pd.DataFrame({
2    'order_id': [101, 102],
3    'cust_id': [1, 2],  # Different column name
4    'amount': [100, 200]
5})
6
7merged = pd.merge(
8    customers, 
9    orders_v2, 
10    left_on='customer_id', 
11    right_on='cust_id'
12)

3.3 Merge Indicators

Python

1# Indicator column cho biết row đến từ đâu
2merged = pd.merge(
3    customers, orders, 
4    on='customer_id', 
5    how='outer', 
6    indicator=True
7)
8# _merge: 'left_only', 'right_only', 'both'
9
10# Tìm customers chưa có orders
11no_orders = merged[merged['_merge'] == 'left_only']

3.4 Concatenation

Python

1# Vertical concat (stack rows)
2df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
3df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
4combined = pd.concat([df1, df2], ignore_index=True)
5
6# Horizontal concat (add columns)
7combined_h = pd.concat([df1, df2], axis=1)

4. Performance Optimization

4.1 Chọn đúng Data Types

Python

1# Check memory usage
2df.info(memory_usage='deep')
3
4# Downcast numeric types
5df['int_col'] = pd.to_numeric(df['int_col'], downcast='integer')
6df['float_col'] = pd.to_numeric(df['float_col'], downcast='float')
7
8# Use category for low-cardinality strings
9df['category_col'] = df['category_col'].astype('category')

4.2 Efficient Operations

Python

1# ❌ Slow: iterrows
2for idx, row in df.iterrows():
3    df.loc[idx, 'new_col'] = row['col1'] + row['col2']
4
5# ✅ Fast: vectorized
6df['new_col'] = df['col1'] + df['col2']
7
8# ✅ Fast: apply với axis=1 (nếu cần complex logic)
9df['new_col'] = df.apply(lambda row: row['col1'] + row['col2'], axis=1)
10
11# ✅ Fastest: numpy
12import numpy as np
13df['new_col'] = np.where(df['col1'] > 0, df['col1'], df['col2'])

4.3 Chunked Reading

Python

1# Đọc file lớn theo chunks
2chunks = []
3for chunk in pd.read_csv('large_file.csv', chunksize=100000):
4    # Process each chunk
5    processed = chunk[chunk['value'] > 0]  # Filter
6    chunks.append(processed)
7
8df_final = pd.concat(chunks, ignore_index=True)

4.4 Query Method

Python

1# query() có thể nhanh hơn boolean indexing
2# với large DataFrames
3
4# Boolean indexing
5df[(df['A'] > 5) & (df['B'] < 10)]
6
7# query method
8df.query('A > 5 and B < 10')
9
10# query với variables
11threshold = 5
12df.query('A > @threshold')

5. Practical Examples

5.1 Sales Analysis Pipeline

Python

1import pandas as pd
2
3# Load data
4sales = pd.read_csv('sales.csv')
5products = pd.read_csv('products.csv')
6customers = pd.read_csv('customers.csv')
7
8# Merge all tables
9df = (
10    sales
11    .merge(products, on='product_id')
12    .merge(customers, on='customer_id')
13)
14
15# Create analysis
16analysis = (
17    df
18    .assign(
19        revenue=lambda x: x['quantity'] * x['unit_price'],
20        profit=lambda x: x['revenue'] * x['margin']
21    )
22    .groupby(['region', 'category'])
23    .agg(
24        total_revenue=('revenue', 'sum'),
25        total_profit=('profit', 'sum'),
26        order_count=('order_id', 'nunique'),
27        avg_order_value=('revenue', 'mean')
28    )
29    .sort_values('total_revenue', ascending=False)
30)

5.2 Time Series Resampling

Python

1# Set datetime index
2df['date'] = pd.to_datetime(df['date'])
3df = df.set_index('date')
4
5# Resample to monthly
6monthly = df.resample('M').agg({
7    'sales': 'sum',
8    'quantity': 'sum',
9    'transactions': 'count'
10})
11
12# Rolling calculations
13df['rolling_avg_7d'] = df['sales'].rolling(window=7).mean()
14df['rolling_sum_30d'] = df['sales'].rolling(window=30).sum()

6. Thực hành

Exercises

Exercise 1: MultiIndex Operations

Python

1# Tạo DataFrame với MultiIndex: Region -> Year -> Quarter
2# Tính tổng sales theo Region, và % contribution của mỗi Quarter
3
4# YOUR CODE HERE

💡 Xem đáp án

Python

1data = pd.DataFrame({
2    'Region': ['North']*4 + ['South']*4,
3    'Year': [2023, 2023, 2024, 2024] * 2,
4    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2'] * 2,
5    'Sales': [100, 150, 120, 180, 80, 120, 100, 140]
6})
7
8df = data.set_index(['Region', 'Year', 'Quarter'])
9
10# Tổng theo Region
11region_totals = df.groupby('Region')['Sales'].sum()
12
13# % contribution per Quarter
14df['Pct'] = df.groupby(level='Region')['Sales'].transform(
15    lambda x: x / x.sum() * 100
16)
17print(df)

Exercise 2: Complex Aggregation

Python

1# Với DataFrame orders, tính cho mỗi customer:
2# - Tổng orders, Tổng amount
3# - Average order value
4# - Days since first order
5# - Days since last order
6
7# YOUR CODE HERE

💡 Xem đáp án

Python

1import pandas as pd
2from datetime import datetime
3
4orders = pd.DataFrame({
5    'customer_id': [1, 1, 2, 2, 1, 3],
6    'order_date': pd.to_datetime(['2024-01-01', '2024-01-15', '2024-01-05', 
7                                   '2024-02-01', '2024-02-10', '2024-01-20']),
8    'amount': [100, 200, 150, 180, 120, 90]
9})
10
11today = datetime.now()
12
13customer_summary = orders.groupby('customer_id').agg(
14    total_orders=('amount', 'count'),
15    total_amount=('amount', 'sum'),
16    avg_order_value=('amount', 'mean'),
17    first_order=('order_date', 'min'),
18    last_order=('order_date', 'max')
19)
20
21customer_summary['days_since_first'] = (today - customer_summary['first_order']).dt.days
22customer_summary['days_since_last'] = (today - customer_summary['last_order']).dt.days
23
24print(customer_summary)

7. Tổng kết

Concept	Use Case	Key Methods
MultiIndex	Hierarchical data	`set_index()`, `xs()`, `stack()`, `unstack()`
GroupBy	Aggregations	`agg()`, `transform()`, `apply()`, `filter()`
Merge	Combine tables	`merge()`, `concat()`, `join()`
Performance	Large data	`astype()`, vectorization, `query()`, chunking

Best Practices:

Luôn check df.info() và optimize dtypes
Prefer vectorized operations over loops
Use query() cho complex filters
Chunk large files khi đọc

Bài tiếp theo: Data Quality & Validation

Pandas Nâng cao

Pandas Nâng cao

1. MultiIndex (Hierarchical Index)

1.1 Tạo MultiIndex DataFrame

1.2 Truy cập MultiIndex

1.3 Stack và Unstack

2. GroupBy Nâng cao

2.1 Multiple Aggregations

2.2 Custom Aggregation

2.3 Transform vs Apply

2.4 Filter Groups

3. Merge Strategies

3.1 Các loại Join

3.2 Merge với Different Column Names

3.3 Merge Indicators

3.4 Concatenation

4. Performance Optimization

4.1 Chọn đúng Data Types

4.2 Efficient Operations

4.3 Chunked Reading

4.4 Query Method

5. Practical Examples

5.1 Sales Analysis Pipeline

5.2 Time Series Resampling

6. Thực hành

Exercise 1: MultiIndex Operations

Exercise 2: Complex Aggregation

7. Tổng kết

Khóa học

Mentor & Hỗ trợ

Blog

Giới thiệu