Data Manipulation
1. Introduction
Data Manipulation Skills
Kỹ năng manipulation data là core competency của Data Analyst. Bài này cover các techniques quan trọng nhất: merging, pivoting, grouping, và transforming data.
2. Merging & Joining
2.1 Merge Types
Text
1┌─────────────────────────────────────────────────────────┐2│ Join Types │3├─────────────────────────────────────────────────────────┤4│ │5│ INNER JOIN LEFT JOIN │6│ ┌───┬───┐ ┌───┬───┐ │7│ │ │███│ │███│███│ │8│ │ │███│ │███│███│ │9│ └───┴───┘ └───┴───┘ │10│ Only matches All from left │11│ │12│ RIGHT JOIN OUTER JOIN │13│ ┌───┬───┐ ┌───┬───┐ │14│ │███│███│ │███│███│ │15│ │███│███│ │███│███│ │16│ └───┴───┘ └───┴───┘ │17│ All from right All from both │18│ │19└─────────────────────────────────────────────────────────┘2.2 pd.merge()
Python
1import pandas as pd23# Sample data4orders = pd.DataFrame({5 'order_id': [1, 2, 3, 4],6 'customer_id': [101, 102, 103, 104],7 'amount': [500, 300, 700, 200]8})910customers = pd.DataFrame({11 'customer_id': [101, 102, 105],12 'name': ['Alice', 'Bob', 'Eve'],13 'city': ['Hanoi', 'HCMC', 'Danang']14})1516# Inner join (default)17inner = pd.merge(orders, customers, on='customer_id')18print("Inner Join:")19print(inner)2021# Left join22left = pd.merge(orders, customers, on='customer_id', how='left')23print("\nLeft Join:")24print(left)2526# Right join27right = pd.merge(orders, customers, on='customer_id', how='right')28print("\nRight Join:")29print(right)3031# Outer join32outer = pd.merge(orders, customers, on='customer_id', how='outer')33print("\nOuter Join:")34print(outer)2.3 Advanced Merge Options
Python
1# Merge on different column names2df1 = pd.DataFrame({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})3df2 = pd.DataFrame({'key': [1, 2, 4], 'score': [10, 20, 40]})45merged = pd.merge(df1, df2, left_on='id', right_on='key')67# Merge on multiple columns8orders = pd.DataFrame({9 'year': [2023, 2023, 2024],10 'quarter': [1, 2, 1],11 'sales': [100, 200, 150]12})1314targets = pd.DataFrame({15 'year': [2023, 2023, 2024],16 'quarter': [1, 2, 1],17 'target': [120, 180, 140]18})1920merged = pd.merge(orders, targets, on=['year', 'quarter'])2122# Handle duplicate column names23merged = pd.merge(df1, df2, left_on='id', right_on='key', 24 suffixes=('_left', '_right'))2526# Validate merge27# one_to_one, one_to_many, many_to_one, many_to_many28merged = pd.merge(df1, df2, left_on='id', right_on='key', 29 validate='one_to_one')3031# Indicator column32merged = pd.merge(orders, customers, on='customer_id', 33 how='outer', indicator=True)34# _merge column shows: left_only, right_only, both2.4 Join on Index
Python
1# Set index and use join()2df1 = df1.set_index('id')3df2 = df2.set_index('key')45# Default is left join6result = df1.join(df2)78# Or use merge with index9result = pd.merge(df1, df2, left_index=True, right_index=True)3. Concatenation
3.1 pd.concat()
Python
1# Vertical concatenation (stack rows)2df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})3df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})45combined = pd.concat([df1, df2], ignore_index=True)6print(combined)7# A B8# 0 1 39# 1 2 410# 2 5 711# 3 6 81213# Horizontal concatenation14combined_h = pd.concat([df1, df2], axis=1)15print(combined_h)16# A B A B17# 0 1 3 5 718# 1 2 4 6 81920# With different columns21df3 = pd.DataFrame({'A': [1, 2], 'C': [9, 10]})22combined = pd.concat([df1, df3], ignore_index=True)23# Missing columns filled with NaN3.2 Concat với keys
Python
1# Add hierarchical index2q1 = pd.DataFrame({'sales': [100, 200]}, index=['Jan', 'Feb'])3q2 = pd.DataFrame({'sales': [300, 400]}, index=['Mar', 'Apr'])45combined = pd.concat([q1, q2], keys=['Q1', 'Q2'])6print(combined)7# sales8# Q1 Jan 1009# Feb 20010# Q2 Mar 30011# Apr 4004. GroupBy
GroupBy Power
GroupBy là một trong những công cụ mạnh nhất của Pandas, cho phép split-apply-combine data theo nhóm.
4.1 Basic GroupBy
Python
1# Sample data2sales = pd.DataFrame({3 'date': pd.date_range('2024-01-01', periods=10),4 'product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],5 'region': ['North', 'North', 'South', 'South', 'North', 6 'South', 'North', 'North', 'South', 'South'],7 'amount': [100, 150, 200, 180, 120, 160, 140, 170, 190, 130]8})910# Group by single column11grouped = sales.groupby('product')12print(grouped.sum())1314# Group by multiple columns15grouped = sales.groupby(['product', 'region'])16print(grouped.sum())1718# Access groups19print(grouped.groups)20print(grouped.get_group(('A', 'North')))4.2 Aggregation Functions
Python
1# Single aggregation2print(sales.groupby('product')['amount'].sum())3print(sales.groupby('product')['amount'].mean())4print(sales.groupby('product')['amount'].count())56# Multiple aggregations7agg_result = sales.groupby('product')['amount'].agg(['sum', 'mean', 'min', 'max'])8print(agg_result)910# Different aggregations per column11agg_result = sales.groupby('product').agg({12 'amount': ['sum', 'mean'],13 'date': 'count'14})15print(agg_result)1617# Named aggregations (cleaner output)18agg_result = sales.groupby('product').agg(19 total_sales=('amount', 'sum'),20 avg_sale=('amount', 'mean'),21 transaction_count=('date', 'count')22)23print(agg_result)2425# Custom aggregation function26def range_func(x):27 return x.max() - x.min()2829agg_result = sales.groupby('product')['amount'].agg(range_func)4.3 Transform
Python
1# Transform returns same-shape output2# Useful for adding group-level calculations back to original data34# Add group mean to each row5sales['group_mean'] = sales.groupby('product')['amount'].transform('mean')67# Normalize within group8sales['normalized'] = sales.groupby('product')['amount'].transform(9 lambda x: (x - x.mean()) / x.std()10)1112# Rank within group13sales['rank_in_group'] = sales.groupby('product')['amount'].transform('rank')1415# Percentage of group total16sales['pct_of_group'] = sales.groupby('product')['amount'].transform(17 lambda x: x / x.sum() * 10018)4.4 Filter Groups
Python
1# Keep only groups with sum > 5002filtered = sales.groupby('product').filter(lambda x: x['amount'].sum() > 500)34# Keep groups with more than 3 transactions5filtered = sales.groupby('product').filter(lambda x: len(x) > 3)4.5 Apply
Python
1# Apply custom function to each group2def top_n(group, n=2, column='amount'):3 return group.nlargest(n, column)45result = sales.groupby('product').apply(top_n, n=2)67# Reset index after apply8result = sales.groupby('product').apply(top_n).reset_index(drop=True)5. Pivot Tables
5.1 pivot_table()
Python
1# Basic pivot table2pivot = sales.pivot_table(3 values='amount',4 index='product',5 columns='region',6 aggfunc='sum'7)8print(pivot)9# region North South10# product 11# A 360 39012# B 470 2901314# Multiple aggregations15pivot = sales.pivot_table(16 values='amount',17 index='product',18 columns='region',19 aggfunc=['sum', 'mean', 'count']20)2122# Multiple values23pivot = sales.pivot_table(24 values=['amount'], # Can add more columns25 index='product',26 columns='region',27 aggfunc='sum',28 margins=True, # Add totals29 margins_name='Total'30)5.2 pivot() - Reshaping without aggregation
Python
1# When there's only one value per cell2df = pd.DataFrame({3 'date': ['2024-01', '2024-01', '2024-02', '2024-02'],4 'product': ['A', 'B', 'A', 'B'],5 'sales': [100, 150, 120, 180]6})78pivoted = df.pivot(index='date', columns='product', values='sales')9print(pivoted)10# product A B11# date 12# 2024-01 100 15013# 2024-02 120 1805.3 melt() - Unpivot
Python
1# Wide to long format2wide_df = pd.DataFrame({3 'product': ['A', 'B'],4 'Q1': [100, 150],5 'Q2': [120, 180],6 'Q3': [140, 200]7})89long_df = pd.melt(10 wide_df,11 id_vars=['product'],12 value_vars=['Q1', 'Q2', 'Q3'],13 var_name='quarter',14 value_name='sales'15)16print(long_df)17# product quarter sales18# 0 A Q1 10019# 1 B Q1 15020# 2 A Q2 12021# 3 B Q2 18022# 4 A Q3 14023# 5 B Q3 2006. Stack & Unstack
Python
1# Multi-index DataFrame2df = sales.groupby(['product', 'region'])['amount'].sum()3print(df)4# product region5# A North 3606# South 3907# B North 4708# South 290910# Unstack - pivot inner level to columns11unstacked = df.unstack()12print(unstacked)13# region North South14# product 15# A 360 39016# B 470 2901718# Stack - opposite of unstack19stacked = unstacked.stack()20print(stacked)21# Back to original format2223# Unstack specific level24df_multi = sales.groupby(['product', 'region', 'date'])['amount'].sum()25unstacked_date = df_multi.unstack(level='date')7. Window Functions
7.1 Rolling Windows
Python
1# Time series data2ts = pd.DataFrame({3 'date': pd.date_range('2024-01-01', periods=30),4 'sales': np.random.randint(100, 200, 30)5})6ts = ts.set_index('date')78# Rolling mean (moving average)9ts['MA_7'] = ts['sales'].rolling(window=7).mean()10ts['MA_14'] = ts['sales'].rolling(window=14).mean()1112# Rolling statistics13ts['rolling_std'] = ts['sales'].rolling(7).std()14ts['rolling_min'] = ts['sales'].rolling(7).min()15ts['rolling_max'] = ts['sales'].rolling(7).max()1617# Rolling sum18ts['rolling_sum'] = ts['sales'].rolling(7).sum()1920# Custom rolling function21ts['rolling_range'] = ts['sales'].rolling(7).apply(lambda x: x.max() - x.min())7.2 Expanding Windows
Python
1# Cumulative calculations2ts['cumsum'] = ts['sales'].expanding().sum()3ts['cummax'] = ts['sales'].expanding().max()4ts['cummin'] = ts['sales'].expanding().min()5ts['cummean'] = ts['sales'].expanding().mean()67# Year-to-date totals8ts['ytd_sales'] = ts['sales'].expanding().sum()7.3 Exponential Weighted Moving Average
Python
1# EWMA - gives more weight to recent values2ts['ewma'] = ts['sales'].ewm(span=7).mean()3ts['ewma_20'] = ts['sales'].ewm(span=20).mean()8. Advanced Techniques
8.1 Method Chaining
Python
1# Chain multiple operations2result = (3 sales4 .query('amount > 100')5 .assign(6 amount_k=lambda x: x['amount'] / 1000,7 month=lambda x: x['date'].dt.month8 )9 .groupby(['product', 'month'])10 .agg(11 total=('amount', 'sum'),12 avg=('amount', 'mean')13 )14 .reset_index()15 .sort_values('total', ascending=False)16)8.2 pipe()
Python
1# Custom functions in chain2def add_percent_column(df, col):3 df[f'{col}_pct'] = df[col] / df[col].sum() * 1004 return df56def filter_top_n(df, col, n):7 return df.nlargest(n, col)89result = (10 sales11 .groupby('product')['amount']12 .sum()13 .reset_index()14 .pipe(add_percent_column, 'amount')15 .pipe(filter_top_n, 'amount', 5)16)8.3 Cross Tabulation
Python
1# Cross tabulation2ct = pd.crosstab(3 sales['product'],4 sales['region'],5 values=sales['amount'],6 aggfunc='sum'7)8print(ct)910# With margins11ct = pd.crosstab(12 sales['product'],13 sales['region'],14 margins=True,15 margins_name='Total'16)1718# Normalize19ct_normalized = pd.crosstab(20 sales['product'],21 sales['region'],22 normalize='all' # or 'index', 'columns'23)9. Thực hành
Practice Exercise
Exercise: Sales Analysis
Python
1import pandas as pd2import numpy as np34# Generate sample data5np.random.seed(42)6n_orders = 50078orders = pd.DataFrame({9 'order_id': range(1, n_orders + 1),10 'date': pd.date_range('2024-01-01', periods=n_orders, freq='D'),11 'customer_id': np.random.randint(1, 100, n_orders),12 'product_id': np.random.randint(1, 20, n_orders),13 'quantity': np.random.randint(1, 10, n_orders),14 'unit_price': np.random.uniform(10, 100, n_orders).round(2)15})16orders['total_amount'] = orders['quantity'] * orders['unit_price']1718products = pd.DataFrame({19 'product_id': range(1, 21),20 'product_name': [f'Product_{i}' for i in range(1, 21)],21 'category': np.random.choice(['Electronics', 'Clothing', 'Food', 'Home'], 20)22})2324customers = pd.DataFrame({25 'customer_id': range(1, 101),26 'customer_name': [f'Customer_{i}' for i in range(1, 101)],27 'region': np.random.choice(['North', 'South', 'East', 'West'], 100)28})2930# Tasks:31# 1. Merge orders với products và customers32# 2. Tính monthly sales by category33# 3. Tìm top 5 customers by total spending34# 4. Tạo pivot table: category vs region, values = total_amount35# 5. Tính 7-day rolling average sales3637# YOUR CODE HERE💡 Xem đáp án
Python
1# 1. Merge all tables2df = (3 orders4 .merge(products, on='product_id')5 .merge(customers, on='customer_id')6)7print(f"Merged shape: {df.shape}")8print(df.head())910# 2. Monthly sales by category11df['month'] = df['date'].dt.to_period('M')12monthly_category = (13 df14 .groupby(['month', 'category'])['total_amount']15 .sum()16 .unstack()17 .round(2)18)19print("\nMonthly Sales by Category:")20print(monthly_category)2122# 3. Top 5 customers23top_customers = (24 df25 .groupby(['customer_id', 'customer_name'])['total_amount']26 .sum()27 .reset_index()28 .nlargest(5, 'total_amount')29)30print("\nTop 5 Customers:")31print(top_customers)3233# 4. Pivot table34pivot = pd.pivot_table(35 df,36 values='total_amount',37 index='category',38 columns='region',39 aggfunc='sum',40 margins=True,41 margins_name='Total'42).round(2)43print("\nPivot Table - Category vs Region:")44print(pivot)4546# 5. 7-day rolling average47daily_sales = df.groupby('date')['total_amount'].sum()48rolling_avg = daily_sales.rolling(7).mean()49print("\n7-Day Rolling Average:")50print(rolling_avg.tail(10))5152# Visualization53import matplotlib.pyplot as plt5455fig, axes = plt.subplots(2, 2, figsize=(14, 10))5657# Monthly trend58monthly_category.plot(ax=axes[0, 0], marker='o')59axes[0, 0].set_title('Monthly Sales by Category')60axes[0, 0].set_xlabel('Month')61axes[0, 0].set_ylabel('Sales')6263# Top customers64axes[0, 1].barh(top_customers['customer_name'], top_customers['total_amount'])65axes[0, 1].set_title('Top 5 Customers by Spending')66axes[0, 1].set_xlabel('Total Amount')6768# Region distribution69region_sales = df.groupby('region')['total_amount'].sum()70axes[1, 0].pie(region_sales, labels=region_sales.index, autopct='%1.1f%%')71axes[1, 0].set_title('Sales by Region')7273# Rolling average74axes[1, 1].plot(daily_sales.index, daily_sales.values, alpha=0.5, label='Daily')75axes[1, 1].plot(rolling_avg.index, rolling_avg.values, label='7-day MA')76axes[1, 1].set_title('Daily Sales with Rolling Average')77axes[1, 1].legend()78axes[1, 1].tick_params(axis='x', rotation=45)7980plt.tight_layout()81plt.show()10. Tổng kết
| Operation | Method | Use Case |
|---|---|---|
| Merge | pd.merge() | Join tables on keys |
| Concat | pd.concat() | Stack DataFrames |
| GroupBy | df.groupby() | Aggregate by groups |
| Transform | transform() | Group calculations same shape |
| Pivot | pivot_table() | Reshape for analysis |
| Melt | pd.melt() | Wide to long format |
| Rolling | rolling() | Moving window calculations |
Bài tiếp theo: Pandas Advanced - MultiIndex, Method Chaining, Optimization
