
Pandas Fundamentals

DataFrames, Series, and basic operations in Pandas


🎯 Lesson Objectives

After this lesson, you will:

✅ Understand Series and DataFrame, the core data structures

✅ Read and write data from CSV, Excel, JSON, and SQL

✅ Select, filter, and sort data fluently

✅ Handle missing values properly

✅ Compute descriptive statistics

Time: 2 hours | Difficulty: Beginner-Intermediate | Tool: Python + Pandas


📖 Key Terminology

Term             | Vietnamese     | Description
Series           | Chuỗi dữ liệu  | 1D labeled array
DataFrame        | Bảng dữ liệu   | 2D labeled table (rows + columns)
Index            | Chỉ mục        | Row labels (default: 0, 1, 2...)
iloc             | -              | Integer-based indexing (position)
loc              | -              | Label-based indexing (name)
NaN              | Not a Number   | Missing value marker
dtype            | Data type      | Data type of a column
Boolean Indexing | Lọc điều kiện  | Filter rows by a True/False mask
query()          | Truy vấn       | SQL-like filter syntax
describe()       | Mô tả          | Auto-generate statistics

Checkpoint

Series = 1D (like a column). DataFrame = 2D table. iloc = position-based. loc = label-based. NaN = missing value.
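The glossary above can be condensed into a few runnable lines; the column names and values here are made up for illustration:

```python
import numpy as np
import pandas as pd

# Series: a 1D labeled array; NaN marks a missing value
s = pd.Series([10.0, 20.0, np.nan], index=['x', 'y', 'z'])

# DataFrame: a 2D labeled table (multiple Series sharing one index)
df = pd.DataFrame({'score': s, 'rank': [1, 2, 3]})

# iloc is position-based, loc is label-based
first_by_position = df.iloc[0]  # row at position 0
first_by_label = df.loc['x']    # row with index label 'x'

# isna() builds a True/False mask over the column
missing = df['score'].isna().sum()  # 1 missing value
```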


📊 1. Series & DataFrame


1.1 Series

Python
import pandas as pd
import numpy as np

# From a list
s1 = pd.Series([10, 20, 30, 40])

# With a custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

# From a dictionary
s3 = pd.Series({'apple': 50, 'banana': 30, 'orange': 45})

# Operations
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s['a'])     # 10
print(s[s > 25])  # c    30, d    40, e    50
print(s.sum())    # 150
print(s.mean())   # 30.0

1.2 DataFrame

Python
# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Attributes
print(df.shape)       # (4, 4)
print(df.columns)     # Index(['name', 'age', 'city', 'salary'], dtype='object')
print(df.dtypes)      # Data types
df.info()             # Overview (prints directly; returns None)
print(df.describe())  # Statistics
Example

┌────────────┐    ┌─────────────────────┐
│   Series   │    │      DataFrame      │
│ (1D array) │    │     (2D table)      │
│   Index    │    │  Index  Col1  Col2  │
│   0 → 10   │    │    0     A     1    │
│   1 → 20   │    │    1     B     2    │
│   2 → 30   │    │    2     C     3    │
└────────────┘    └─────────────────────┘

Checkpoint

Series = 1 column with an index. DataFrame = a table (multiple Series). .shape = (rows, cols). .info() = types + non-null counts. .describe() = statistics!


📁 2. Reading & Writing Data


2.1 Reading Data

Python
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', encoding='utf-8', parse_dates=['date_column'])
df = pd.read_csv('data.csv', usecols=['col1', 'col2'], nrows=1000)

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json', orient='records')

# SQL
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host:5432/db')
df = pd.read_sql('SELECT * FROM table', engine)

2.2 Writing Data

Python
# CSV
df.to_csv('output.csv', index=False)
df.to_csv('output.csv', encoding='utf-8-sig')  # Excel compatibility

# Excel (multiple sheets)
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

# JSON
df.to_json('output.json', orient='records', indent=2)

# SQL
df.to_sql('table_name', engine, if_exists='replace', index=False)

Checkpoint

Read: pd.read_csv(), pd.read_excel(), pd.read_sql(). Write: .to_csv(index=False), .to_excel(). Use usecols + nrows for large files!
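For files too big to load at once, pd.read_csv() also accepts a chunksize parameter that streams the file as an iterator of DataFrames. A minimal sketch, using an in-memory CSV to stand in for a large file on disk:

```python
import io
import pandas as pd

# An in-memory CSV stands in for a large file on disk
csv_data = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# chunksize=... yields DataFrames of up to 4 rows each,
# so you can aggregate as you go instead of loading everything
total = 0
for chunk in pd.read_csv(csv_data, chunksize=4):
    total += chunk['a'].sum()

print(total)  # 45, i.e. sum(range(10))
```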


🔍 3. Selecting & Filtering


3.1 Column Selection

Python
# Single column → Series
ages = df['age']

# Multiple columns → DataFrame
subset = df[['name', 'age']]

3.2 Row Selection: iloc vs loc

Python
# iloc: integer position
print(df.iloc[0])    # First row
print(df.iloc[0:3])  # First 3 rows
print(df.iloc[0, 1]) # Row 0, column 1

# loc: label-based
print(df.loc[0, 'name'])           # Row 0, column 'name'
print(df.loc[0:2, 'name':'city'])  # Rows 0-2, columns name → city

# Mixed: condition + columns
result = df.loc[df['age'] > 28, ['name', 'salary']]

iloc = integer-based (position), like array indexing. loc = label-based, using column names and row index labels. Both support slicing and boolean indexing!
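The difference only becomes visible once the index is no longer the default 0, 1, 2...; a small sketch (the index values here are made up):

```python
import pandas as pd

# With a non-default index, position and label no longer coincide
df = pd.DataFrame({'val': [10, 20, 30]}, index=[100, 200, 300])

by_position = df.iloc[0]  # first row by position → val 10
by_label = df.loc[200]    # row whose label is 200 → val 20

# Slicing also differs: iloc excludes the endpoint, loc includes it
print(len(df.iloc[0:2]))     # 2 rows
print(len(df.loc[100:300]))  # 3 rows: the loc slice is inclusive
```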

3.3 Filtering

Python
# Single condition
young = df[df['age'] < 30]

# Multiple conditions (AND)
result = df[(df['age'] > 25) & (df['city'] == 'Hanoi')]

# Multiple conditions (OR)
result = df[(df['city'] == 'Hanoi') | (df['city'] == 'HCMC')]

# NOT
result = df[~(df['city'] == 'Hanoi')]

# isin()
result = df[df['city'].isin(['Hanoi', 'HCMC'])]

# String methods
result = df[df['name'].str.contains('li')]

# query() - SQL-like
result = df.query('age > 25 and city == "Hanoi"')
min_age = 25
result = df.query('age > @min_age')

Checkpoint

iloc = position, loc = label. Filter: & (AND), | (OR), ~ (NOT). query() = SQL-like syntax. isin() for multiple values!


✏️ 4. Modifying & Sorting


4.1 Adding/Modifying Columns

Python
# Add a column
df['bonus'] = df['salary'] * 0.1

# Conditional column
df['category'] = np.where(df['age'] > 30, 'Senior', 'Junior')

# Multiple conditions
conditions = [
    (df['age'] < 25),
    (df['age'] >= 25) & (df['age'] < 35),
    (df['age'] >= 35)
]
choices = ['Young', 'Middle', 'Senior']
df['age_group'] = np.select(conditions, choices, default='Unknown')

# Apply + lambda
df['name_length'] = df['name'].apply(len)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)

# assign() - chaining-friendly
df = df.assign(
    total_comp=lambda x: x['salary'] + x['bonus'],
    tax=lambda x: x['salary'] * 0.2
)

4.2 Removing & Renaming

Python
# Drop columns
df_clean = df.drop(columns=['bonus', 'category'])

# Drop duplicates
df_clean = df.drop_duplicates(subset=['name', 'city'])

# Rename
df_renamed = df.rename(columns={'name': 'full_name', 'age': 'years'})

4.3 Sorting

Python
df_sorted = df.sort_values('age', ascending=False)
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Top/Bottom N
top_5 = df.nlargest(5, 'salary')
bottom_5 = df.nsmallest(5, 'age')

Checkpoint

np.where() = simple if/else column. np.select() = multiple conditions. assign() = chaining-friendly. nlargest/nsmallest = quick top/bottom!
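Because assign() and the sorting/selection methods each return a new DataFrame, the steps above compose into one pipeline. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'salary': [50000, 60000, 70000],
})

# Each step returns a new DataFrame, so the calls chain naturally
top = (
    df.assign(bonus=lambda x: x['salary'] * 0.1)  # derived column
      .sort_values('salary', ascending=False)     # highest salary first
      .nlargest(2, 'salary')                      # keep only the top 2
)

print(top['name'].tolist())  # ['Charlie', 'Bob']
```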


❓ 5. Missing Values


5.1 Detecting

Python
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

print(df.isna().sum())         # Count per column
print(df.isna().sum().sum())   # Total missing
print(df.isna().mean() * 100)  # Percentage

5.2 Handling

Python
# Drop rows with any missing value
df_clean = df.dropna()
df_clean = df.dropna(subset=['A', 'B'])  # Only consider specific columns

# Fill with a value
df_filled = df.fillna(0)
df_filled = df.fillna({'A': 0, 'B': 99})

# Fill with statistics
df_filled = df.fillna(df.mean())
df_filled = df.fillna(df.median())

# Forward/backward fill (fillna(method=...) is deprecated in pandas 2.x)
df_filled = df.ffill()
df_filled = df.bfill()

# Interpolation
df_filled = df.interpolate(method='linear')

Checkpoint

isna().sum() = count missing. dropna() = remove. fillna() = replace (value, mean, median, ffill). Choose the strategy by context: drop if few are missing, fill if many!
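One way to turn that rule of thumb into code; the 50% threshold and the column names are arbitrary choices for illustration, not a pandas rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mostly_full': [1.0, 2.0, np.nan, 4.0],         # 25% missing
    'mostly_empty': [np.nan, np.nan, np.nan, 8.0],  # 75% missing
})

# Drop columns that are mostly missing, fill the rest with the median
missing_pct = df.isna().mean()
keep = missing_pct[missing_pct < 0.5].index
cleaned = df[keep].fillna(df[keep].median())

print(cleaned['mostly_full'].tolist())  # [1.0, 2.0, 2.0, 4.0]
```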


📈 6. Descriptive Statistics

Python
# Overall
print(df.describe())
print(df.describe(include='all'))

# Individual statistics
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].quantile([0.25, 0.5, 0.75]))

# Value counts
print(df['city'].value_counts())
print(df['city'].value_counts(normalize=True))

# Unique values
print(df['city'].unique())
print(df['city'].nunique())

# Correlations
print(df[['age', 'salary']].corr())

# Custom aggregations
print(df.agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean', 'std']
}))

Checkpoint

describe() = auto statistics. value_counts() = frequency. corr() = correlation matrix. agg() = custom aggregations per column!


📋 Summary


What you have learned

Concept | Key Methods
Create  | pd.DataFrame(), pd.read_csv(), pd.read_excel()
Select  | df['col'], df.iloc[], df.loc[], df.query()
Filter  | Boolean conditions, isin(), string methods
Modify  | assign(), apply(), np.where(), np.select()
Clean   | dropna(), fillna(), drop_duplicates()
Sort    | sort_values(), nlargest(), nsmallest()
Stats   | describe(), value_counts(), agg(), corr()

Self-check questions

  1. How do iloc and loc differ?
  2. What advantages does query() have over Boolean indexing?
  3. dropna() vs fillna(): when should you use each?
  4. Which statistics does describe() report?

Next lesson: Data Manipulation (Merge, Pivot, GroupBy) →

🎉 Excellent! You now have a solid grasp of the Pandas fundamentals!

Remember: Pandas is the #1 tool of the Data Analyst in Python. Mastering select, filter, and clean is the foundation of every analysis!