🎯 Lesson Objectives
After this lesson, you will be able to:
✅ Understand Series and DataFrame — the core data structures
✅ Read/write data from CSV, Excel, JSON, and SQL
✅ Select, filter, and sort data fluently
✅ Handle missing values professionally
✅ Compute descriptive statistics
Time: 2 hours | Difficulty: Beginner-Intermediate | Tool: Python + Pandas
📖 Key Terminology
| Term | Description |
|---|---|
| Series | 1D labeled array |
| DataFrame | 2D labeled table (rows + columns) |
| Index | Row labels (default: 0, 1, 2...) |
| iloc | Integer-based indexing (by position) |
| loc | Label-based indexing (by name) |
| NaN ("Not a Number") | Missing-value marker |
| dtype | Data type of a column |
| Boolean indexing | Filter rows with a True/False mask |
| query() | SQL-like filter syntax |
| describe() | Auto-generated summary statistics |
Checkpoint
Series = 1D (like a column). DataFrame = 2D table. iloc = position-based. loc = label-based. NaN = missing value.
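The checkpoint terms can be seen working together in one minimal sketch (the values below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Series: a 1D labeled array
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

# DataFrame: a 2D labeled table; np.nan marks a missing value
df = pd.DataFrame({'a': [1.0, 2.0, np.nan], 'b': [4.0, 5.0, 6.0]})

# iloc = position-based, loc = label-based — same cell, two ways
print(df.iloc[0, 0])         # 1.0
print(df.loc[0, 'a'])        # 1.0

# NaN is detected with isna()
print(df['a'].isna().sum())  # 1
```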
📊 1. Series & DataFrame
1.1 Series
```python
import pandas as pd
import numpy as np

# From a list
s1 = pd.Series([10, 20, 30, 40])

# With a custom index
s2 = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

# From a dictionary
s3 = pd.Series({'apple': 50, 'banana': 30, 'orange': 45})

# Operations
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])
print(s['a'])     # 10
print(s[s > 25])  # c=30, d=40, e=50
print(s.sum())    # 150
print(s.mean())   # 30.0
```
1.2 DataFrame
```python
# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['Hanoi', 'HCMC', 'Danang', 'Hanoi'],
    'salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)

# Attributes
print(df.shape)       # (4, 4)
print(df.columns)     # Index(['name', 'age', 'city', 'salary'], dtype='object')
print(df.dtypes)      # Data types
df.info()             # Overview (prints directly; returns None)
print(df.describe())  # Statistics
```
```
┌────────────┐   ┌─────────────────────┐
│   Series   │   │      DataFrame      │
│ (1D array) │   │      (2D table)     │
│ Index      │   │ Index  Col1  Col2   │
│ 0 → 10     │   │ 0      A     1      │
│ 1 → 20     │   │ 1      B     2      │
│ 2 → 30     │   │ 2      C     3      │
└────────────┘   └─────────────────────┘
```
Checkpoint
Series = one column with an index. DataFrame = a table (multiple Series). .shape = (rows, cols). .info() = types + non-null counts. .describe() = summary statistics!
📁 2. Reading & Writing Data
2.1 Reading Data
```python
# CSV
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', encoding='utf-8', parse_dates=['date_column'])
df = pd.read_csv('data.csv', usecols=['col1', 'col2'], nrows=1000)

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json', orient='records')

# SQL
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host:5432/db')
df = pd.read_sql('SELECT * FROM table', engine)
```
2.2 Writing Data
```python
# CSV
df.to_csv('output.csv', index=False)
df.to_csv('output.csv', encoding='utf-8-sig')  # Excel compatibility

# Excel (multiple sheets)
with pd.ExcelWriter('output.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)

# JSON
df.to_json('output.json', orient='records', indent=2)

# SQL
df.to_sql('table_name', engine, if_exists='replace', index=False)
```
Checkpoint
Read: pd.read_csv(), pd.read_excel(), pd.read_sql(). Write: .to_csv(index=False), .to_excel(). Use usecols + nrows for large files!
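When a file is too large for memory, usecols and nrows help, and the chunksize parameter goes further by streaming the file in pieces. A small sketch using an in-memory CSV (the data is made up for illustration):

```python
import io
import pandas as pd

# Simulate a "large" CSV in memory (illustrative data)
csv_data = "col1,col2\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))

# Load only the columns/rows you need
df_small = pd.read_csv(io.StringIO(csv_data), usecols=['col1'], nrows=1000)
print(df_small.shape)  # (1000, 1)

# Stream the whole file in chunks and aggregate as you go
total = 0
for chunk in pd.read_csv(io.StringIO(csv_data), chunksize=2500):
    total += chunk['col1'].sum()
print(total)  # 49995000 (= sum of 0..9999)
```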
🔍 3. Selecting & Filtering
3.1 Column Selection
```python
# Single column → Series
ages = df['age']

# Multiple columns → DataFrame
subset = df[['name', 'age']]
```
3.2 Row Selection: iloc vs loc
```python
# iloc: integer position
print(df.iloc[0])     # First row
print(df.iloc[0:3])   # First 3 rows
print(df.iloc[0, 1])  # Row 0, column 1

# loc: label-based
print(df.loc[0, 'name'])           # Row 0, column 'name'
print(df.loc[0:2, 'name':'city'])  # Rows 0-2, columns name→city

# Mixed: condition + columns
result = df.loc[df['age'] > 28, ['name', 'salary']]
```
iloc = integer-based (position) — like array indexing. loc = label-based — uses column names / row index labels. Both support slicing and boolean indexing!
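One classic gotcha: with a non-default index, iloc and loc can point at different rows, and their slicing rules differ. A small sketch (the index labels below are made up):

```python
import pandas as pd

# Custom integer index: labels 10, 20, 30 at positions 0, 1, 2
df = pd.DataFrame({'score': [85, 92, 78]}, index=[10, 20, 30])

# iloc uses positions, loc uses labels — same row here, different meaning
print(df.iloc[0]['score'])  # 85 (position 0)
print(df.loc[10, 'score'])  # 85 (label 10)

# Slicing: iloc excludes the end, loc includes it
print(len(df.iloc[0:2]))    # 2 rows (positions 0 and 1)
print(len(df.loc[10:30]))   # 3 rows (labels 10 through 30, inclusive)
```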
3.3 Filtering
```python
# Single condition
young = df[df['age'] < 30]

# Multiple conditions (AND)
result = df[(df['age'] > 25) & (df['city'] == 'Hanoi')]

# Multiple conditions (OR)
result = df[(df['city'] == 'Hanoi') | (df['city'] == 'HCMC')]

# NOT
result = df[~(df['city'] == 'Hanoi')]

# isin()
result = df[df['city'].isin(['Hanoi', 'HCMC'])]

# String methods
result = df[df['name'].str.contains('li')]

# query() - SQL-like
result = df.query('age > 25 and city == "Hanoi"')
min_age = 25
result = df.query('age > @min_age')
```
Checkpoint
iloc = position, loc = label. Filter: & (AND), | (OR), ~ (NOT). query() = SQL-like syntax. isin() for multiple values!
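A detail worth knowing about the string-method filter: str.contains propagates NaN by default, which breaks boolean indexing when the column contains missing values. The na= and case= parameters handle this (the sample names below are made up):

```python
import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'Bob', None, 'Linh']})

# Without na=, the None row yields NaN and boolean indexing raises an error;
# case=False makes the match case-insensitive
mask = df['name'].str.contains('li', case=False, na=False)
print(df[mask]['name'].tolist())  # ['Alice', 'Linh']
```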
✏️ 4. Modifying & Sorting
4.1 Adding/Modifying Columns
```python
# Add a column
df['bonus'] = df['salary'] * 0.1

# Conditional column
df['category'] = np.where(df['age'] > 30, 'Senior', 'Junior')

# Multiple conditions
conditions = [
    (df['age'] < 25),
    (df['age'] >= 25) & (df['age'] < 35),
    (df['age'] >= 35)
]
choices = ['Young', 'Middle', 'Senior']
df['age_group'] = np.select(conditions, choices, default='Unknown')

# Apply + lambda
df['name_length'] = df['name'].apply(len)
df['salary_k'] = df['salary'].apply(lambda x: x / 1000)

# assign() - chaining-friendly
df = df.assign(
    total_comp=lambda x: x['salary'] + x['bonus'],
    tax=lambda x: x['salary'] * 0.2
)
```
4.2 Removing & Renaming
```python
# Drop columns
df_clean = df.drop(columns=['bonus', 'category'])

# Drop duplicates
df_clean = df.drop_duplicates(subset=['name', 'city'])

# Rename
df_renamed = df.rename(columns={'name': 'full_name', 'age': 'years'})
```
4.3 Sorting
```python
df_sorted = df.sort_values('age', ascending=False)
df_sorted = df.sort_values(['city', 'age'], ascending=[True, False])

# Top/bottom N
top_5 = df.nlargest(5, 'salary')
bottom_5 = df.nsmallest(5, 'age')
```
Checkpoint
np.where() = simple if/else column. np.select() = multiple conditions. assign() = chaining-friendly. nlargest/nsmallest = quick top/bottom!
❓ 5. Missing Values
5.1 Detecting
```python
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
})

print(df.isna().sum())         # Count per column
print(df.isna().sum().sum())   # Total missing
print(df.isna().mean() * 100)  # Percentage
```
5.2 Handling
```python
# Drop rows with any missing value
df_clean = df.dropna()
df_clean = df.dropna(subset=['A', 'B'])  # Only consider specific columns

# Fill with a value
df_filled = df.fillna(0)
df_filled = df.fillna({'A': 0, 'B': 99})

# Fill with statistics
df_filled = df.fillna(df.mean())
df_filled = df.fillna(df.median())

# Forward/backward fill (fillna(method=...) is deprecated since pandas 2.1)
df_filled = df.ffill()
df_filled = df.bfill()

# Interpolation
df_filled = df.interpolate(method='linear')
```
Checkpoint
isna().sum() = count missing. dropna() = remove. fillna() = replace (value, mean, median, ffill). Choose the strategy by context: drop when few values are missing, fill when many are!
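The "drop when few, fill when many" heuristic can be made concrete by measuring the missing percentage per column first. A small sketch (the 50% threshold and the data are arbitrary choices for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],          # 20% missing → fill
    'B': [np.nan, np.nan, np.nan, np.nan, 8.0]  # 80% missing → consider dropping
})

pct_missing = df.isna().mean() * 100
print(pct_missing)  # A: 20.0, B: 80.0

# Fill the lightly-missing column with its median
df['A'] = df['A'].fillna(df['A'].median())

# Drop columns above an (arbitrary) 50% missing threshold
df = df.drop(columns=pct_missing[pct_missing > 50].index)
print(df.columns.tolist())  # ['A']
```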
📈 6. Descriptive Statistics
```python
# Overall
print(df.describe())
print(df.describe(include='all'))

# Individual
print(df['salary'].mean())
print(df['salary'].median())
print(df['salary'].std())
print(df['salary'].quantile([0.25, 0.5, 0.75]))

# Value counts
print(df['city'].value_counts())
print(df['city'].value_counts(normalize=True))

# Unique
print(df['city'].unique())
print(df['city'].nunique())

# Correlations
print(df[['age', 'salary']].corr())

# Custom aggregations
print(df.agg({
    'age': ['mean', 'min', 'max'],
    'salary': ['sum', 'mean', 'std']
}))
```
Checkpoint
describe() = auto statistics. value_counts() = frequency. corr() = correlation matrix. agg() = custom aggregations per column!
📋 Summary
What You've Learned
| Concept | Key Methods |
|---|---|
| Create | pd.DataFrame(), pd.read_csv(), pd.read_excel() |
| Select | df['col'], df.iloc[], df.loc[], df.query() |
| Filter | Boolean conditions, isin(), string methods |
| Modify | assign(), apply(), np.where(), np.select() |
| Clean | dropna(), fillna(), drop_duplicates() |
| Sort | sort_values(), nlargest(), nsmallest() |
| Stats | describe(), value_counts(), agg(), corr() |
Self-check Questions
- How are iloc and loc different?
- What advantages does query() offer over Boolean indexing?
- dropna() vs fillna(): when should you use each?
- Which statistics does describe() report?
Next lesson: Data Manipulation — Merge, Pivot, GroupBy →
🎉 Excellent! You have mastered the Pandas fundamentals!
Remember: Pandas is the #1 tool for Data Analysts in Python. Mastering select, filter, and clean is the foundation for every analysis!
