Pandas Fundamentals
1. Introduction to Pandas
What is Pandas?
Pandas is the most widely used Python library for data analysis. Its central data structure is the DataFrame, a two-dimensional table similar to an Excel sheet or SQL table, but far more flexible to manipulate programmatically.
1.1 Core Concepts
Text
┌─────────────────────────────────────────────────────────┐
│                       Pandas Core                        │
├─────────────────────────────────────────────────────────┤
│                                                          │
│   ┌────────────┐        ┌─────────────────────┐         │
│   │   Series   │        │      DataFrame      │         │
│   │ (1D array) │        │      (2D table)     │         │
│   │            │        │                     │         │
│   │   Index    │        │  Index  Col1  Col2  │         │
│   │   ─────    │        │  ─────  ────  ────  │         │
│   │   0 → 10   │        │    0     A     1    │         │
│   │   1 → 20   │        │    1     B     2    │         │
│   │   2 → 30   │        │    2     C     3    │         │
│   │            │        │                     │         │
│   └────────────┘        └─────────────────────┘         │
│                                                          │
└─────────────────────────────────────────────────────────┘
1.2 Import Convention
Python
import pandas as pd
import numpy as np

# Verify
print(f"Pandas version: {pd.__version__}")
2. Series
2.1 Creating a Series
Python
# From list
s1 = pd.Series([10, 20, 30, 40])
print(s1)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

# With custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s2)
# a    100
# b    200
# c    300

# From dictionary
d = {'apple': 50, 'banana': 30, 'orange': 45}
s3 = pd.Series(d)
print(s3)
# apple     50
# banana    30
# orange    45
2.2 Series Operations
Python
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Indexing
print(s['a'])        # 10
print(s.iloc[0])     # 10 (by position; s[0] is deprecated with a label index)
print(s['a':'c'])    # a=10, b=20, c=30

# Boolean indexing
print(s[s > 25])     # c=30, d=40, e=50

# Arithmetic
print(s * 2)         # All values doubled
print(s + 100)       # Add 100 to all

# Aggregations
print(s.sum())       # 150
print(s.mean())      # 30
print(s.std())       # 15.81
print(s.max())       # 50
print(s.min())       # 10

# Missing values
s_with_na = pd.Series([1, 2, np.nan, 4])
print(s_with_na.isna())
print(s_with_na.dropna())
print(s_with_na.fillna(0))
3. DataFrame Basics
3.1 Creating a DataFrame
Python
# From dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
#       name  age    city  salary
# 0    Alice   25   Hanoi   50000
# 1      Bob   30    HCMC   60000
# 2  Charlie   35  Danang   70000
# 3    Diana   28   Hanoi   55000

# From list of dictionaries
records = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
    {'name': 'Charlie', 'age': 35}
]
df2 = pd.DataFrame(records)

# From NumPy array
arr = np.random.randn(3, 4)
df3 = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])
3.2 DataFrame Attributes
Python
# Basic info
print(df.shape)     # (4, 4) - rows, columns
print(df.columns)   # Index(['name', 'age', 'city', 'salary'], dtype='object')
print(df.index)     # RangeIndex(start=0, stop=4, step=1)
print(df.dtypes)    # Data types of each column

# Memory usage
print(df.memory_usage(deep=True))

# Quick overview
df.info()           # Prints a concise summary (returns None, so no print() needed)
print(df.describe())  # Statistics for numeric columns
4. Reading & Writing Data
4.1 Reading Data
Python
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', parse_dates=['date_column'])
df = pd.read_csv('data.csv', index_col='id')
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])  # Select columns
df = pd.read_csv('data.csv', nrows=1000)                # First 1000 rows

# Excel
df = pd.read_excel('data.xlsx')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name=0)  # First sheet

# JSON
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')

# SQL
from sqlalchemy import create_engine, text
engine = create_engine('postgresql://user:pass@host:5432/db')
df = pd.read_sql('SELECT * FROM table', engine)
# With a SQLAlchemy engine, use named bind parameters instead of '?'
df = pd.read_sql_query(
    text('SELECT * FROM table WHERE date > :start'),
    engine,
    params={'start': '2024-01-01'}
)
4.2 Writing Data
Python
# CSV
df.to_csv('output.csv', index=False)
df.to_csv('output.csv', encoding='utf-8-sig')  # For Excel compatibility

# Excel
df.to_excel('output.xlsx', index=False, sheet_name='Results')

# Multiple sheets
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

# JSON
df.to_json('output.json', orient='records', indent=2)

# SQL
df.to_sql('table_name', engine, if_exists='replace', index=False)
5. Selecting Data
5.1 Column Selection
Python
# Single column (returns Series)
ages = df['age']
print(type(ages))   # pandas.core.series.Series

# Single column (alternative)
ages = df.age       # Works but not recommended

# Multiple columns (returns DataFrame)
subset = df[['name', 'age']]
print(type(subset))  # pandas.core.frame.DataFrame
5.2 Row Selection
Python
# By index position - iloc
print(df.iloc[0])          # First row (Series)
print(df.iloc[0:3])        # First 3 rows (DataFrame)
print(df.iloc[-1])         # Last row
print(df.iloc[[0, 2, 3]])  # Specific rows

# By label - loc
df_labeled = df.set_index('name')
print(df_labeled.loc['Alice'])         # Row by index label
print(df_labeled.loc['Alice':'Bob'])   # Range by label

# By condition
print(df[df['age'] > 28])         # Age > 28
print(df[df['city'] == 'Hanoi'])  # City is Hanoi
5.3 Combined Selection
Python
# iloc: integer position
print(df.iloc[0, 1])      # Row 0, Column 1
print(df.iloc[0:3, 1:3])  # Rows 0-2, Columns 1-2

# loc: label-based
print(df.loc[0, 'name'])           # Row 0, column 'name'
print(df.loc[0:2, 'name':'city'])  # Rows 0-2, columns name to city

# Mixed conditions
result = df.loc[df['age'] > 28, ['name', 'salary']]
print(result)
iloc vs loc
- iloc: integer-based (by position), like NumPy array indexing; slices exclude the end position
- loc: label-based, using column names and row index labels; slices include the end label
- Both support slicing and boolean indexing (see the sketch below)
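A minimal sketch of the distinction, reusing the sample df from section 3.1; this is an illustration, not part of the original lesson. It shows how the slice endpoint is treated differently and how each accessor takes a boolean mask.
Python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'salary': [50000, 60000, 70000, 55000]
})

# iloc: end position is excluded -> rows 0 and 1
print(df.iloc[0:2])

# loc: end label is included -> rows 0, 1 and 2
print(df.loc[0:2])

# Boolean indexing: loc takes a boolean Series aligned on the index
mask = df['age'] > 28
print(df.loc[mask, ['name', 'age']])

# iloc needs a plain boolean array and integer positions, not labels
print(df.iloc[mask.to_numpy(), [0, 1]])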
6. Filtering Data
6.1 Boolean Conditions
Python
# Single condition
young = df[df['age'] < 30]

# Multiple conditions (AND)
result = df[(df['age'] > 25) & (df['city'] == 'Hanoi')]

# Multiple conditions (OR)
result = df[(df['city'] == 'Hanoi') | (df['city'] == 'HCMC')]

# NOT condition
result = df[~(df['city'] == 'Hanoi')]

# isin()
cities = ['Hanoi', 'HCMC']
result = df[df['city'].isin(cities)]

# String methods
result = df[df['name'].str.startswith('A')]
result = df[df['name'].str.contains('li')]
6.2 Query Method
Python
# query() - SQL-like syntax
result = df.query('age > 25 and city == "Hanoi"')
result = df.query('age > 25 or salary > 60000')

# Using variables
min_age = 25
result = df.query('age > @min_age')

# In operator
result = df.query('city in ["Hanoi", "HCMC"]')
7. Basic Operations
7.1 Adding/Modifying Columns
Python
# Add new column
df['bonus'] = df['salary'] * 0.1

# Conditional column
df['category'] = np.where(df['age'] > 30, 'Senior', 'Junior')

# Multiple conditions
conditions = [
    (df['age'] < 25),
    (df['age'] >= 25) & (df['age'] < 35),
    (df['age'] >= 35)
]
choices = ['Young', 'Middle', 'Senior']
df['age_group'] = np.select(conditions, choices, default='Unknown')

# Using apply
df['name_length'] = df['name'].apply(len)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)

# Using assign (chaining-friendly)
df = df.assign(
    total_comp=lambda x: x['salary'] + x['bonus'],
    tax=lambda x: x['salary'] * 0.2
)
7.2 Removing Data
Python
# Drop columns
df_clean = df.drop(columns=['bonus', 'category'])
df_clean = df.drop(['bonus', 'category'], axis=1)

# Drop rows by index
df_clean = df.drop([0, 1])       # Drop first two rows
df_clean = df.drop(index=[0, 1])

# Drop duplicates
df_clean = df.drop_duplicates()
df_clean = df.drop_duplicates(subset=['name', 'city'])
df_clean = df.drop_duplicates(subset='name', keep='last')
7.3 Renaming
Python
# Rename columns
df_renamed = df.rename(columns={'name': 'full_name', 'age': 'years'})

# Rename with function
df_renamed = df.rename(columns=str.upper)
df_renamed = df.rename(columns=lambda x: x.replace('_', ' ').title())
8. Sorting
Python
# Sort by column
df_sorted = df.sort_values('age')
df_sorted = df.sort_values('age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
df_sorted = df.sort_index(ascending=False)

# Get top/bottom N
top_5 = df.nlargest(5, 'salary')
bottom_5 = df.nsmallest(5, 'age')
9. Missing Values
9.1 Detecting Missing Values
Python
# Create sample data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Check for missing
print(df.isna())      # Boolean DataFrame
print(df.isnull())    # Same as isna()
print(df.notna())     # Opposite

# Count missing values
print(df.isna().sum())         # Per column
print(df.isna().sum().sum())   # Total
print(df.isna().mean() * 100)  # Percentage

# Rows with any missing
print(df[df.isna().any(axis=1)])
9.2 Handling Missing Values
Python
# Drop rows with any missing
df_clean = df.dropna()

# Drop rows where all values are missing
df_clean = df.dropna(how='all')

# Drop based on specific columns
df_clean = df.dropna(subset=['A', 'B'])

# Fill with value
df_filled = df.fillna(0)
df_filled = df.fillna({'A': 0, 'B': 99})

# Fill with statistics
df_filled = df.fillna(df.mean())
df_filled = df.fillna(df.median())

# Forward/backward fill (fillna(method=...) is deprecated)
df_filled = df.ffill()   # Forward fill
df_filled = df.bfill()   # Backward fill

# Interpolation
df_filled = df.interpolate()
df_filled = df.interpolate(method='linear')
10. Descriptive Statistics
Python
# Note: these examples use the employee df from section 3.1,
# not the A/B/C frame created in section 9.

# Overall statistics
print(df.describe())

# Include all columns
print(df.describe(include='all'))

# Individual statistics
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].var())
print(df['salary'].min())
print(df['salary'].max())
print(df['salary'].quantile([0.25, 0.5, 0.75]))

# Value counts
print(df['city'].value_counts())
print(df['city'].value_counts(normalize=True))  # Percentages

# Unique values
print(df['city'].unique())
print(df['city'].nunique())

# Correlations
print(df[['age', 'salary']].corr())

# Custom aggregations
print(df.agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean', 'std']
}))
11. Practice
Practice Exercise
Exercise: Employee Data Analysis
Python
# Create sample employee dataset
np.random.seed(42)
n = 100

employees = pd.DataFrame({
    'employee_id': range(1, n + 1),
    'name': [f'Employee_{i}' for i in range(1, n + 1)],
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Finance', 'Marketing'], n),
    'salary': np.random.randint(40000, 120000, n),
    'age': np.random.randint(22, 60, n),
    'years_experience': np.random.randint(0, 30, n),
    'performance_score': np.random.uniform(1, 5, n).round(1),
    'hire_date': pd.date_range('2015-01-01', periods=n, freq='W')
})

# Add some missing values
employees.loc[np.random.choice(employees.index, 5), 'performance_score'] = np.nan

# Tasks:
# 1. Find employees with salary > 80000 in the IT department
# 2. Find the top 10 employees by performance_score
# 3. Compute the average salary per department
# 4. Fill missing performance_score values with the median
# 5. Create a 'salary_category' column (Low/Medium/High)

# YOUR CODE HERE
💡 Show solution
Python
# 1. IT employees with salary > 80000
it_high_salary = employees[(employees['salary'] > 80000) &
                           (employees['department'] == 'IT')]
print(f"IT employees with salary > 80k: {len(it_high_salary)}")
print(it_high_salary[['name', 'department', 'salary']])

# 2. Top 10 by performance
top_performers = employees.nlargest(10, 'performance_score')[
    ['name', 'department', 'performance_score']
]
print("\nTop 10 performers:")
print(top_performers)

# 3. Average salary by department
avg_salary_dept = employees.groupby('department')['salary'].mean().round(2)
print("\nAverage salary by department:")
print(avg_salary_dept.sort_values(ascending=False))

# 4. Fill missing performance scores
median_score = employees['performance_score'].median()
employees['performance_score'] = employees['performance_score'].fillna(median_score)
print(f"\nMissing values after fill: {employees['performance_score'].isna().sum()}")

# 5. Salary category
def categorize_salary(salary):
    if salary < 60000:
        return 'Low'
    elif salary < 90000:
        return 'Medium'
    else:
        return 'High'

employees['salary_category'] = employees['salary'].apply(categorize_salary)
print("\nSalary distribution:")
print(employees['salary_category'].value_counts())

# Alternative using pd.cut
employees['salary_category2'] = pd.cut(
    employees['salary'],
    bins=[0, 60000, 90000, float('inf')],
    labels=['Low', 'Medium', 'High']
)
12. Summary
| Concept | Key Methods |
|---|---|
| Create | pd.DataFrame(), pd.read_csv(), pd.read_excel() |
| Select | df['col'], df.iloc[], df.loc[], df.query() |
| Filter | Boolean conditions, isin(), string methods |
| Modify | assign(), apply(), np.where() |
| Clean | dropna(), fillna(), drop_duplicates() |
| Sort | sort_values(), nlargest(), nsmallest() |
| Stats | describe(), value_counts(), agg() |
Next lesson: Data Manipulation - Merge, Pivot, GroupBy
