Theory
Lesson 2/17

Pandas Fundamentals

DataFrames, Series, and basic operations in Pandas


1. Introduction to Pandas

What is Pandas?

Pandas is the most widely used Python library for data analysis. Its core structure, the DataFrame, is a 2-dimensional table similar to an Excel sheet or SQL table, but with far more flexible processing capabilities.

1.1 Core Concepts

Text
┌─────────────────────────────────────────────────────────┐
│                       Pandas Core                       │
├─────────────────────────────────────────────────────────┤
│                                                         │
│   ┌────────────┐        ┌─────────────────────┐         │
│   │   Series   │        │      DataFrame      │         │
│   │ (1D array) │        │     (2D table)      │         │
│   │            │        │                     │         │
│   │   Index    │        │  Index  Col1  Col2  │         │
│   │   ──────   │        │  ─────  ────  ────  │         │
│   │   0 → 10   │        │    0     A     1    │         │
│   │   1 → 20   │        │    1     B     2    │         │
│   │   2 → 30   │        │    2     C     3    │         │
│   │            │        │                     │         │
│   └────────────┘        └─────────────────────┘         │
│                                                         │
└─────────────────────────────────────────────────────────┘
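The diagram above maps directly to code. A minimal sketch constructing both structures (values match the diagram):

```python
import pandas as pd

# Series: a 1D labeled array - values plus an index
s = pd.Series([10, 20, 30])
print(s.index.tolist())   # [0, 1, 2]
print(s.tolist())         # [10, 20, 30]

# DataFrame: a 2D table - each column is itself a Series
df = pd.DataFrame({'Col1': ['A', 'B', 'C'], 'Col2': [1, 2, 3]})
print(df.shape)           # (3, 2)
print(type(df['Col1']))   # <class 'pandas.core.series.Series'>
```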

1.2 Import Convention

Python
import pandas as pd
import numpy as np

# Verify
print(f"Pandas version: {pd.__version__}")

2. Series

2.1 Creating a Series

Python
# From list
s1 = pd.Series([10, 20, 30, 40])
print(s1)
# 0    10
# 1    20
# 2    30
# 3    40
# dtype: int64

# With custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s2)
# a    100
# b    200
# c    300

# From dictionary
d = {'apple': 50, 'banana': 30, 'orange': 45}
s3 = pd.Series(d)
print(s3)
# apple     50
# banana    30
# orange    45

2.2 Series Operations

Python
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# Indexing
print(s['a'])       # 10 - by label
print(s.iloc[0])    # 10 - by position (s[0] is deprecated for labeled indexes)
print(s['a':'c'])   # a=10, b=20, c=30 (label slicing is inclusive)

# Boolean indexing
print(s[s > 25])    # c=30, d=40, e=50

# Arithmetic
print(s * 2)        # All values doubled
print(s + 100)      # Add 100 to all

# Aggregations
print(s.sum())      # 150
print(s.mean())     # 30.0
print(s.std())      # 15.81 (sample standard deviation)
print(s.max())      # 50
print(s.min())      # 10

# Missing values
s_with_na = pd.Series([1, 2, np.nan, 4])
print(s_with_na.isna())      # Boolean mask of missing values
print(s_with_na.dropna())    # Remove NaN entries
print(s_with_na.fillna(0))   # Replace NaN with 0

3. DataFrame Basics

3.1 Creating a DataFrame

Python
# From dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
#       name  age    city  salary
# 0    Alice   25   Hanoi   50000
# 1      Bob   30    HCMC   60000
# 2  Charlie   35  Danang   70000
# 3    Diana   28   Hanoi   55000

# From list of dictionaries
records = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30},
    {'name': 'Charlie', 'age': 35}
]
df2 = pd.DataFrame(records)

# From NumPy array
arr = np.random.randn(3, 4)
df3 = pd.DataFrame(arr, columns=['A', 'B', 'C', 'D'])

3.2 DataFrame Attributes

Python
# Basic info
print(df.shape)     # (4, 4) - (rows, columns)
print(df.columns)   # Index(['name', 'age', 'city', 'salary'], dtype='object')
print(df.index)     # RangeIndex(start=0, stop=4, step=1)
print(df.dtypes)    # Data type of each column

# Memory usage
print(df.memory_usage(deep=True))

# Quick overview
df.info()             # Prints column types and non-null counts (returns None)
print(df.describe())  # Statistics for numeric columns

4. Reading & Writing Data

4.1 Reading Data

Python
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', encoding='utf-8')
df = pd.read_csv('data.csv', parse_dates=['date_column'])
df = pd.read_csv('data.csv', index_col='id')
df = pd.read_csv('data.csv', usecols=['col1', 'col2'])  # Only these columns
df = pd.read_csv('data.csv', nrows=1000)                # First 1000 rows

# Excel
df = pd.read_excel('data.xlsx')
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
df = pd.read_excel('data.xlsx', sheet_name=0)  # First sheet

# JSON
df = pd.read_json('data.json')
df = pd.read_json('data.json', orient='records')

# SQL (with SQLAlchemy, use named parameters via text())
from sqlalchemy import create_engine, text
engine = create_engine('postgresql://user:pass@host:5432/db')
df = pd.read_sql('SELECT * FROM table', engine)
df = pd.read_sql_query(
    text('SELECT * FROM table WHERE date > :start'),
    engine,
    params={'start': '2024-01-01'}
)

4.2 Writing Data

Python
# CSV
df.to_csv('output.csv', index=False)
df.to_csv('output.csv', encoding='utf-8-sig')  # For Excel compatibility

# Excel
df.to_excel('output.xlsx', index=False, sheet_name='Results')

# Multiple sheets
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

# JSON
df.to_json('output.json', orient='records', indent=2)

# SQL
df.to_sql('table_name', engine, if_exists='replace', index=False)

5. Selecting Data

5.1 Column Selection

Python
# Single column (returns Series)
ages = df['age']
print(type(ages))  # <class 'pandas.core.series.Series'>

# Single column (attribute access)
ages = df.age  # Works, but fails for names with spaces or that clash with methods

# Multiple columns (returns DataFrame)
subset = df[['name', 'age']]
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>

5.2 Row Selection

Python
# By integer position - iloc
print(df.iloc[0])          # First row (Series)
print(df.iloc[0:3])        # First 3 rows (DataFrame)
print(df.iloc[-1])         # Last row
print(df.iloc[[0, 2, 3]])  # Specific rows

# By label - loc
df_labeled = df.set_index('name')
print(df_labeled.loc['Alice'])        # Row by index label
print(df_labeled.loc['Alice':'Bob'])  # Range by label (inclusive)

# By condition
print(df[df['age'] > 28])         # Age > 28
print(df[df['city'] == 'Hanoi'])  # City is Hanoi

5.3 Combined Selection

Python
# iloc: integer position
print(df.iloc[0, 1])      # Row 0, column 1
print(df.iloc[0:3, 1:3])  # Rows 0-2, columns 1-2

# loc: label-based
print(df.loc[0, 'name'])           # Row 0, column 'name'
print(df.loc[0:2, 'name':'city'])  # Rows 0-2, columns name through city

# Mixed conditions
result = df.loc[df['age'] > 28, ['name', 'salary']]
print(result)
iloc vs loc
  • iloc: integer-based (position), like array indexing
  • loc: label-based, uses column names and row index labels
  • Both support slicing and boolean indexing
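The notes above are easiest to see on a DataFrame with a non-default index, so position and label cannot be confused. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]}, index=['x', 'y', 'z'])

# Same cell, two addressing schemes
print(df.iloc[0]['age'])   # 25 - position 0
print(df.loc['x', 'age'])  # 25 - label 'x'

# Slicing: iloc excludes the stop, loc includes it
print(len(df.iloc[0:2]))    # 2 - positions 0 and 1
print(len(df.loc['x':'z'])) # 3 - labels 'x' through 'z' inclusive
```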

6. Filtering Data

6.1 Boolean Conditions

Python
# Single condition
young = df[df['age'] < 30]

# Multiple conditions (AND)
result = df[(df['age'] > 25) & (df['city'] == 'Hanoi')]

# Multiple conditions (OR)
result = df[(df['city'] == 'Hanoi') | (df['city'] == 'HCMC')]

# NOT condition
result = df[~(df['city'] == 'Hanoi')]

# isin()
cities = ['Hanoi', 'HCMC']
result = df[df['city'].isin(cities)]

# String methods
result = df[df['name'].str.startswith('A')]
result = df[df['name'].str.contains('li')]

6.2 Query Method

Python
# query() - SQL-like syntax
result = df.query('age > 25 and city == "Hanoi"')
result = df.query('age > 25 or salary > 60000')

# Using variables
min_age = 25
result = df.query('age > @min_age')

# In operator
result = df.query('city in ["Hanoi", "HCMC"]')

7. Basic Operations

7.1 Adding/Modifying Columns

Python
# Add new column
df['bonus'] = df['salary'] * 0.1

# Conditional column
df['category'] = np.where(df['age'] > 30, 'Senior', 'Junior')

# Multiple conditions
conditions = [
    (df['age'] < 25),
    (df['age'] >= 25) & (df['age'] < 35),
    (df['age'] >= 35)
]
choices = ['Young', 'Middle', 'Senior']
df['age_group'] = np.select(conditions, choices, default='Unknown')

# Using apply
df['name_length'] = df['name'].apply(len)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)

# Using assign (chaining-friendly)
df = df.assign(
    total_comp=lambda x: x['salary'] + x['bonus'],
    tax=lambda x: x['salary'] * 0.2
)

7.2 Removing Data

Python
# Drop columns
df_clean = df.drop(columns=['bonus', 'category'])
df_clean = df.drop(['bonus', 'category'], axis=1)

# Drop rows by index
df_clean = df.drop([0, 1])  # Drop first two rows
df_clean = df.drop(index=[0, 1])

# Drop duplicates
df_clean = df.drop_duplicates()
df_clean = df.drop_duplicates(subset=['name', 'city'])
df_clean = df.drop_duplicates(subset='name', keep='last')

7.3 Renaming

Python
# Rename columns
df_renamed = df.rename(columns={'name': 'full_name', 'age': 'years'})

# Rename with a function
df_renamed = df.rename(columns=str.upper)
df_renamed = df.rename(columns=lambda x: x.replace('_', ' ').title())

8. Sorting

Python
# Sort by column
df_sorted = df.sort_values('age')
df_sorted = df.sort_values('age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Sort by index
df_sorted = df.sort_index()
df_sorted = df.sort_index(ascending=False)

# Get top/bottom N
top_5 = df.nlargest(5, 'salary')
bottom_5 = df.nsmallest(5, 'age')

9. Missing Values

9.1 Detecting Missing Values

Python
# Create sample data with missing values
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

# Check for missing
print(df.isna())    # Boolean DataFrame
print(df.isnull())  # Alias for isna()
print(df.notna())   # Opposite

# Count missing values
print(df.isna().sum())         # Per column
print(df.isna().sum().sum())   # Total
print(df.isna().mean() * 100)  # Percentage per column

# Rows with any missing value
print(df[df.isna().any(axis=1)])

9.2 Handling Missing Values

Python
# Drop rows with any missing value
df_clean = df.dropna()

# Drop rows where all values are missing
df_clean = df.dropna(how='all')

# Drop based on specific columns
df_clean = df.dropna(subset=['A', 'B'])

# Fill with a value
df_filled = df.fillna(0)
df_filled = df.fillna({'A': 0, 'B': 99})

# Fill with statistics
df_filled = df.fillna(df.mean())
df_filled = df.fillna(df.median())

# Forward/backward fill (fillna(method=...) is deprecated)
df_filled = df.ffill()  # Forward fill
df_filled = df.bfill()  # Backward fill

# Interpolation
df_filled = df.interpolate()
df_filled = df.interpolate(method='linear')

10. Descriptive Statistics

Python
# Overall statistics
print(df.describe())

# Include all columns
print(df.describe(include='all'))

# Individual statistics
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].var())
print(df['salary'].min())
print(df['salary'].max())
print(df['salary'].quantile([0.25, 0.5, 0.75]))

# Value counts
print(df['city'].value_counts())
print(df['city'].value_counts(normalize=True))  # Proportions (0-1)

# Unique values
print(df['city'].unique())
print(df['city'].nunique())

# Correlations
print(df[['age', 'salary']].corr())

# Custom aggregations
print(df.agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean', 'std']
}))

11. Practice

Practice Exercise

Exercise: Employee Data Analysis

Python
# Create sample employee dataset
np.random.seed(42)
n = 100

employees = pd.DataFrame({
    'employee_id': range(1, n + 1),
    'name': [f'Employee_{i}' for i in range(1, n + 1)],
    'department': np.random.choice(['Sales', 'IT', 'HR', 'Finance', 'Marketing'], n),
    'salary': np.random.randint(40000, 120000, n),
    'age': np.random.randint(22, 60, n),
    'years_experience': np.random.randint(0, 30, n),
    'performance_score': np.random.uniform(1, 5, n).round(1),
    'hire_date': pd.date_range('2015-01-01', periods=n, freq='W')
})

# Add some missing values
employees.loc[np.random.choice(employees.index, 5), 'performance_score'] = np.nan

# Tasks:
# 1. Find employees with salary > 80000 in the IT department
# 2. Find the top 10 employees by performance_score
# 3. Compute the average salary per department
# 4. Fill missing performance_score values with the median
# 5. Create a 'salary_category' column (Low/Medium/High)

# YOUR CODE HERE
💡 View solution
Python
# 1. IT employees with salary > 80000
it_high_salary = employees[(employees['salary'] > 80000) &
                           (employees['department'] == 'IT')]
print(f"IT employees with salary > 80k: {len(it_high_salary)}")
print(it_high_salary[['name', 'department', 'salary']])

# 2. Top 10 by performance
top_performers = employees.nlargest(10, 'performance_score')[
    ['name', 'department', 'performance_score']
]
print("\nTop 10 performers:")
print(top_performers)

# 3. Average salary by department
avg_salary_dept = employees.groupby('department')['salary'].mean().round(2)
print("\nAverage salary by department:")
print(avg_salary_dept.sort_values(ascending=False))

# 4. Fill missing performance scores
median_score = employees['performance_score'].median()
employees['performance_score'] = employees['performance_score'].fillna(median_score)
print(f"\nMissing values after fill: {employees['performance_score'].isna().sum()}")

# 5. Salary category
def categorize_salary(salary):
    if salary < 60000:
        return 'Low'
    elif salary < 90000:
        return 'Medium'
    else:
        return 'High'

employees['salary_category'] = employees['salary'].apply(categorize_salary)
print("\nSalary distribution:")
print(employees['salary_category'].value_counts())

# Alternative using pd.cut
employees['salary_category2'] = pd.cut(
    employees['salary'],
    bins=[0, 60000, 90000, float('inf')],
    labels=['Low', 'Medium', 'High']
)

12. Summary

Concept   Key Methods
Create    pd.DataFrame(), pd.read_csv(), pd.read_excel()
Select    df['col'], df.iloc[], df.loc[], df.query()
Filter    Boolean conditions, isin(), string methods
Modify    assign(), apply(), np.where()
Clean     dropna(), fillna(), drop_duplicates()
Sort      sort_values(), nlargest(), nsmallest()
Stats     describe(), value_counts(), agg()
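The rows of this table combine naturally into one chained pipeline. A small sketch on illustrative data (column names here are made up for the example):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, np.nan, 55000],
})

result = (
    df
    .fillna({'salary': df['salary'].median()})  # Clean: median of known salaries
    .assign(bonus=lambda d: d['salary'] * 0.1)  # Modify: derived column
    .query('age >= 28')                         # Filter: SQL-like condition
    .sort_values('salary', ascending=False)     # Sort: highest salary first
    [['name', 'salary', 'bonus']]               # Select: final columns
)
print(result)
```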

Next lesson: Data Manipulation - Merge, Pivot, GroupBy