Data Cleaning: Handling Dirty Data

🎯 Learning Objectives


After this lesson, you will be able to:

✅ Detect and handle Missing Values using several methods

✅ Find and remove Duplicate data

✅ Detect and handle Outliers (Z-score, IQR)

✅ Convert Data Types correctly

✅ Clean String data (normalization, regex)

✅ Build a complete Data Cleaning Pipeline

Duration: 3 hours | Difficulty: Intermediate | Prerequisites: Pandas (Lessons 6-7), Visualization (Lessons 8-9)

Task 0

📖 Key Terms


Term	Vietnamese	Description
Missing Value	Giá trị thiếu	NaN, null: an entry that has no value
Duplicate	Trùng lặp	A row/record that is repeated in the dataset
Outlier	Giá trị ngoại lai	An unusual value that lies far from the rest of the data
IQR	Khoảng tứ phân vị	Interquartile range, Q3 - Q1, used to detect outliers
Z-score	Điểm chuẩn	Number of standard deviations a value lies from the mean
Imputation	Điền giá trị	Replacing missing values with estimated values
Data Type	Kiểu dữ liệu	int, float, string, datetime, category
Regex	Biểu thức chính quy	Regular expression: pattern matching for text processing
Pipeline	Đường ống xử lý	A sequence of processing steps run in order
Data Quality	Chất lượng dữ liệu	How accurate, complete, and consistent the data is

Checkpoint

"Garbage in, garbage out": Data Cleaning is often cited as taking 60-80% of the time in a real-world Data Science project. Are you ready?

Task 1

🔍 Data Overview: The First Step


What is Data Cleaning? It is the process of cleaning data before analysis: handling missing values, duplicate records, and outliers, and standardizing formats. In practice, Data Scientists are commonly said to spend 60-80% of their time on Data Cleaning!

Golden rule: Garbage In = Garbage Out. Dirty data produces wrong results, no matter how good the model is. Always clean your data before doing any analysis.

Getting an overview of the data

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv("data.csv")

# The first five commands (always run these)
print(f"Shape: {df.shape}")
print(df.head())
df.info()  # info() prints directly and returns None, so no print() is needed
print(df.describe())
print(df.dtypes)

Data Quality Report

Python

def data_quality_report(df):
    """Build a per-column data quality report."""
    report = pd.DataFrame({
        'dtype': df.dtypes,
        'non_null': df.count(),
        'null_count': df.isnull().sum(),
        'null_pct': (df.isnull().sum() / len(df) * 100).round(2),
        'nunique': df.nunique(),
    })

    # Add summary statistics for numeric columns
    for col in df.select_dtypes(include=[np.number]).columns:
        report.loc[col, 'min'] = df[col].min()
        report.loc[col, 'max'] = df[col].max()
        report.loc[col, 'mean'] = df[col].mean()

    return report.sort_values('null_pct', ascending=False)

report = data_quality_report(df)
print(report)
# Duplicate rows are a property of the whole table, not of a single column
print(f"Duplicate rows: {df.duplicated().sum()}")

Always run a Data Quality Report before doing anything else. It shows which columns have missing values, which dtypes look wrong, and (through min/max) whether outliers may be present.

Checkpoint

Always remember the first five commands when you receive new data: shape → head → info → describe → dtypes.

Task 2

❓ Missing Values


Detecting Missing Values

Python

# Check for nulls
df.isnull().sum()          # Null count per column
df.isnull().sum().sum()    # Total nulls in the whole DataFrame

# Missing percentage
missing_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

# Visualize the missing pattern (third-party package: pip install missingno)
import missingno as msno
msno.matrix(df)
plt.show()

Handling Missing Values: Numeric

Python

# 1. Drop rows (only when little is missing, < 5%)
df = df.dropna(subset=['important_col'])

# 2. Fill with a statistic (pick ONE; fillna returns a new Series, so assign it back)
df['age'] = df['age'].fillna(df['age'].mean())        # Mean: the common default
# df['age'] = df['age'].fillna(df['age'].median())    # Median: more robust when outliers exist
# df['age'] = df['age'].fillna(df['age'].mode()[0])   # Mode: for numeric columns with few distinct values

# 3. Fill by group
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.median())
)

# 4. Forward/backward fill (time series; pick ONE)
# fillna(method=...) is deprecated in recent pandas, so use ffill()/bfill()
df['value'] = df['value'].ffill()      # Forward fill
# df['value'] = df['value'].bfill()    # Backward fill

# 5. Interpolation (time series)
# df['value'] = df['value'].interpolate(method='linear')

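Handling Missing Values: Categorical

Python

# Pick ONE: an explicit placeholder, or the most frequent value
df['category'] = df['category'].fillna('Unknown')
# df['category'] = df['category'].fillna(df['category'].mode()[0])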

Advanced: KNN Imputation

Python

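from sklearn.impute import KNNImputer

# Fill each missing value from the 5 most similar rows (numeric columns only)
imputer = KNNImputer(n_neighbors=5)
df_numeric = df.select_dtypes(include=[np.number])
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_numeric),
    columns=df_numeric.columns,
    index=df_numeric.index,  # keep the original index so rows stay aligned with df
)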

Rules of thumb for missing values:

< 5% missing: drop the rows, or fill with mean/median
5-30% missing: fill with a group-wise median or KNN
> 30% missing: consider dropping the column, or add an is_missing feature
Never fill missing values before you understand why they are missing!
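
The thresholds above can be sketched as a small helper that labels each column with a suggested strategy. This is only an illustration: the function name `suggest_missing_strategy`, the wording of the suggestions, and the toy DataFrame are assumptions made for this example, and the bands are rules of thumb rather than hard limits.

```python
import numpy as np
import pandas as pd


def suggest_missing_strategy(df: pd.DataFrame) -> pd.DataFrame:
    """Map each column's missing percentage to the rule-of-thumb bands."""
    null_pct = df.isnull().mean() * 100

    def band(pct: float) -> str:
        if pct == 0:
            return "nothing to do"
        if pct < 5:
            return "drop rows or fill with mean/median"
        if pct <= 30:
            return "group-wise median or KNN"
        return "consider dropping the column or adding an is_missing flag"

    return pd.DataFrame({
        "null_pct": null_pct.round(2),
        "suggestion": null_pct.map(band),
    })


# Hypothetical toy data: 100 rows with different amounts of missingness
toy = pd.DataFrame({
    "age": [np.nan] * 2 + [30.0] * 98,         # 2% missing
    "salary": [np.nan] * 20 + [1000.0] * 80,   # 20% missing
    "bonus": [np.nan] * 50 + [100.0] * 50,     # 50% missing
    "city": ["Hanoi"] * 100,                   # 0% missing
})
strategy = suggest_missing_strategy(toy)
print(strategy)
```

The helper only suggests an action. The last rule still applies: find out why the values are missing before acting on any suggestion.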

Checkpoint

When should you use mean vs median to fill missing values? (Hint: what happens when the data has outliers?)
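
A small experiment may help with this question. The salary numbers below are made up for illustration. With one extreme value in the column, the mean is pulled far away from the typical value while the median barely moves.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: five typical values, one extreme value, one missing entry
salary = pd.Series([10, 11, 12, 13, 14, 500, np.nan])

print(f"mean:   {salary.mean():.1f}")    # mean:   93.3
print(f"median: {salary.median():.1f}")  # median: 12.5

# The filled value differs a lot depending on the statistic chosen
filled_mean = salary.fillna(salary.mean())
filled_median = salary.fillna(salary.median())
print(filled_mean.iloc[-1], filled_median.iloc[-1])
```

Filling with the mean here would invent a salary of about 93 in a group where almost everyone earns 10 to 14, which is why the median is usually the safer choice for skewed data.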

Task 3

👯 Duplicate Data


Detection

Python

# Fully duplicated rows
df.duplicated().sum()

# Duplicates on specific columns
df.duplicated(subset=['name', 'email']).sum()

# Inspect the duplicated rows
df[df.duplicated(keep=False)]          # Every copy, including the first occurrence
df[df.duplicated(keep='first')]        # Every copy except the first occurrence

Handling

Python

# Remove duplicates (alternatives; pick the one that fits your data)
df = df.drop_duplicates()                              # Keep the first occurrence
df = df.drop_duplicates(keep='last')                   # Keep the last occurrence
df = df.drop_duplicates(subset=['email'])               # Compare on specific columns
df = df.drop_duplicates(subset=['name'], keep=False)    # Drop every copy, keeping none

print(f"After dedup: {df.shape}")

Duplicates are not always bad! For example, one customer who buys three times produces three rows, and those are NOT duplicates. Always understand the business context before dropping anything.
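
To make the point concrete, here is a minimal sketch with a made-up order log; the column names `order_id`, `customer_id`, and `amount` are assumptions for this example. Choosing the wrong `subset` erases real purchases, while choosing the business key removes only the row that was loaded twice.

```python
import pandas as pd

# Hypothetical order log: customer 1 bought three times, and order 103 was loaded twice
orders = pd.DataFrame({
    "order_id":    [101, 102, 103, 103, 104],
    "customer_id": [1,   1,   1,   1,   2],
    "amount":      [50,  75,  20,  20,  90],
})

# Deduplicating on customer_id wrongly erases legitimate repeat purchases
wrong = orders.drop_duplicates(subset=["customer_id"])

# Deduplicating on the business key (order_id) removes only the double-loaded row
right = orders.drop_duplicates(subset=["order_id"])

print(len(orders), len(wrong), len(right))  # 5 2 4
```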

Task 4

📈 Outliers


Detecting Outliers

Python

# 1. Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.boxplot(data=df, y='salary', ax=axes[0])
axes[0].set_title("Box Plot")
sns.histplot(data=df, x='salary', kde=True, ax=axes[1])
axes[1].set_title("Distribution")
plt.tight_layout()
plt.show()

# 2. IQR method (the most common)
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(f"Outliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")

# 3. Z-score method
from scipy import stats
salary = df['salary'].dropna()
z_scores = np.abs(stats.zscore(salary.to_numpy()))
# Select by index label: after dropna() the mask is shorter than df whenever NaN exists
outliers_z = df.loc[salary.index[z_scores > 3]]

Handling Outliers

Python

# 1. Remove
df_clean = df[(df['salary'] >= lower) & (df['salary'] <= upper)]

# 2. Cap (Winsorization), RECOMMENDED
df['salary_capped'] = df['salary'].clip(lower=lower, upper=upper)

# 3. Log transform (reduces skewness; log1p needs values greater than -1)
df['salary_log'] = np.log1p(df['salary'])

DON'T delete outliers blindly!

An outlier may be a data error (age = -5, salary = 0) → remove or fix it
An outlier may be a real value (a CEO earning 10 billion) → keep or cap it
Always investigate an outlier before deciding
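
One way to act on this distinction is to treat the two kinds of outlier differently in code. The sketch below is an illustration under stated assumptions: the 0-120 valid age range and the toy values are invented for this example. Values that break a domain rule become NaN so they can be imputed later, while genuine extremes stay in the data, flagged and capped.

```python
import numpy as np
import pandas as pd

# Hypothetical employee data: one impossible age and one extreme but real salary
people = pd.DataFrame({
    "age":    [25, 31, -5, 40, 38],
    "salary": [12.0, 15.0, 14.0, 13.0, 10_000.0],
})

# Data errors: values outside the assumed valid range (0-120) become NaN
valid_age = people["age"].between(0, 120)
people["age"] = people["age"].where(valid_age)

# Genuine extremes: keep the row, flag it, and cap the value instead of deleting it
q1, q3 = people["salary"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
people["salary_is_outlier"] = people["salary"] > upper
people["salary_capped"] = people["salary"].clip(upper=upper)

print(people)
```

No rows are lost in either case, and the `salary_is_outlier` flag keeps a record of which values were changed.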

Checkpoint

IQR = Q3 - Q1. An outlier is any value outside [Q1 - 1.5×IQR, Q3 + 1.5×IQR]. Have you computed this for a column yet?
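
If you want to check your own calculation, here is a worked example on nine made-up values, small enough to verify by hand. It uses pandas' default linear-interpolation quantiles.

```python
import pandas as pd

# Hypothetical sample: eight evenly spaced values and one extreme value
values = pd.Series([10, 12, 14, 16, 18, 20, 22, 24, 100])

q1 = values.quantile(0.25)   # 14.0
q3 = values.quantile(0.75)   # 22.0
iqr = q3 - q1                # 8.0
lower = q1 - 1.5 * iqr       # 2.0
upper = q3 + 1.5 * iqr       # 34.0

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())     # [100]
```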

Task 5

🔄 Data Type Conversion


Python

# Numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # Invalid values become NaN

# Datetime (pick ONE)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')   # Strict: raises on a format mismatch
# df['date'] = pd.to_datetime(df['date'], errors='coerce')   # Lenient: invalid values become NaT

# Extract datetime components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0 ... Sunday=6

# Category (saves memory)
df['status'] = df['status'].astype('category')

# Boolean (any value other than 'yes'/'no' becomes NaN)
df['is_active'] = df['is_active'].map({'yes': True, 'no': False})
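
Because `errors='coerce'` turns bad values into NaN silently, it can be worth counting how many values were lost in the conversion. A minimal sketch, using a made-up `raw_price` column:

```python
import pandas as pd

# Hypothetical raw column as it might arrive from a CSV export
raw_price = pd.Series(["19.99", "25", "N/A", "abc"])

price = pd.to_numeric(raw_price, errors="coerce")

# Values that were present in the raw data but could not be parsed
n_invalid = int(price.isna().sum() - raw_price.isna().sum())

print(price.tolist())  # [19.99, 25.0, nan, nan]
print(f"Unparseable values: {n_invalid}")
```

A large count is usually a sign that the column needs string cleaning (currency symbols, thousands separators) before conversion, rather than imputation afterwards.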

Use the category dtype for columns with few unique values (city, status, color...). This can reduce memory usage considerably, often by 50-90% on such columns.

Python

# Before: object, an 8-byte pointer per element plus the Python string it points to
# After:  category, small integer codes plus one copy of each unique value
df['city'] = df['city'].astype('category')
print(df.memory_usage(deep=True))

Task 6

✂️ String Cleaning


Python

# Strip whitespace
df['name'] = df['name'].str.strip()

# Standardize case
df['name'] = df['name'].str.title()      # Title Case
df['email'] = df['email'].str.lower()

# Remove special characters (keep digits only)
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Fix inconsistent values
df['country'] = df['country'].replace({
    'vn': 'Vietnam', 'VN': 'Vietnam',
    'vietnam': 'Vietnam', 'Viet Nam': 'Vietnam'
})

# Extract information with regex
# expand=False returns a Series; the hyphen also matches domains such as my-site.com
df['domain'] = df['email'].str.extract(r'@([\w.-]+)', expand=False)
df['area_code'] = df['phone'].str[:3]

Checkpoint

Suppose the 'name' column holds the values " alice ", "BOB", " Charlie ". How would you normalize them to "Alice", "Bob", "Charlie"?
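
One possible answer, as a short sketch, chains the two string methods shown earlier in this section: strip the surrounding whitespace first, then apply title case.

```python
import pandas as pd

names = pd.Series([" alice ", "BOB", " Charlie "])

# strip() removes leading/trailing whitespace, title() capitalizes each word
cleaned = names.str.strip().str.title()
print(cleaned.tolist())  # ['Alice', 'Bob', 'Charlie']
```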

Task 7

🔧 Complete Cleaning Pipeline


Python

def clean_data_pipeline(df, config=None):
    """Complete data cleaning pipeline (config is currently unused)."""

    df = df.copy()
    print(f"Original: {df.shape}")

    # Step 1: Remove duplicates
    before = len(df)
    df = df.drop_duplicates()
    print(f"Step 1 - Dedup: removed {before - len(df)} rows")

    # Step 2: Handle missing values
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    cat_cols = df.select_dtypes(include=['object']).columns

    for col in numeric_cols:
        null_pct = df[col].isnull().mean()
        if null_pct > 0.3:
            # Keep a flag so the model can still see where data was missing
            df[f'{col}_is_missing'] = df[col].isnull().astype(int)
        df[col] = df[col].fillna(df[col].median())

    for col in cat_cols:
        df[col] = df[col].fillna('Unknown')

    print(f"Step 2 - Missing: {df.isnull().sum().sum()} remaining nulls")

    # Step 3: Handle outliers (IQR capping)
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df[col] = df[col].clip(Q1 - 1.5*IQR, Q3 + 1.5*IQR)

    print("Step 3 - Outliers: capped")

    # Step 4: Clean strings
    for col in cat_cols:
        df[col] = df[col].str.strip().str.lower()

    print("Step 4 - Strings: cleaned")
    print(f"Final: {df.shape}")

    return df

# Usage
df_clean = clean_data_pipeline(df)

The order of cleaning steps matters:

Duplicates first (less data to process afterwards)
Data types (so that later steps work correctly)
Missing values (before outliers, because NaN can affect the IQR calculation)
Outliers (after missing values, because this step needs complete data)
String cleaning (last)
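
The second point is easy to demonstrate. In the sketch below, which uses a made-up `df_demo` with a single `price` column, a numeric column that was read as text is invisible to `select_dtypes`, so any missing-value or outlier step that loops over numeric columns would skip it.

```python
import numpy as np
import pandas as pd

# Hypothetical column that was read as text because of one bad entry
df_demo = pd.DataFrame({"price": ["10", "12", "n/a", "11", "900"]})

# Before type conversion: no numeric columns are found
numeric_before = df_demo.select_dtypes(include=[np.number]).columns.tolist()

# After type conversion: the column is numeric and the bad entry is an explicit NaN
df_demo["price"] = pd.to_numeric(df_demo["price"], errors="coerce")
numeric_after = df_demo.select_dtypes(include=[np.number]).columns.tolist()

print(numeric_before)  # []
print(numeric_after)   # ['price']
```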

Checkpoint

Have you built a cleaning pipeline with at least 4 steps? A pipeline makes your work reusable and keeps results consistent.
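
As an alternative to one large function, the same steps can be written as small functions chained with `DataFrame.pipe`, which makes each step easy to test and reorder. This is a sketch, not the lesson's reference implementation: the helper names (`drop_dupes`, `fix_types`, `fill_missing`, `cap_outliers`, `clean_strings`) and the toy data are assumptions for this example.

```python
import numpy as np
import pandas as pd


def drop_dupes(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates()


def fix_types(df: pd.DataFrame, numeric_cols: list[str]) -> pd.DataFrame:
    df = df.copy()
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    return df


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(df[col].median())
    return df


def cap_outliers(df: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - k * iqr, q3 + k * iqr)
    return df


def clean_strings(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    df = df.copy()
    for col in cols:
        df[col] = df[col].str.strip().str.lower()
    return df


# Hypothetical raw data: one exact duplicate row, one unparseable salary, one extreme salary
raw = pd.DataFrame({
    "name": [" Alice ", "BOB", "BOB", "carol", "Dan", "Eve"],
    "salary": ["10", "12", "12", "n/a", "11", "900"],
})

# Order: dedup -> types -> missing -> outliers -> strings
clean = (
    raw
    .pipe(drop_dupes)
    .pipe(fix_types, numeric_cols=["salary"])
    .pipe(fill_missing)
    .pipe(cap_outliers)
    .pipe(clean_strings, cols=["name"])
)
print(clean)
```

Each helper copies its input, so the original `raw` DataFrame is left unchanged and any step can be run on its own while debugging.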

Task 8

📝 Summary


Key takeaways

Problem	Main methods
Missing Values	`fillna(median)`, `dropna()`, `KNNImputer`
Duplicates	`drop_duplicates(subset=...)`
Outliers	IQR method + `.clip()`, Z-score > 3
Data Types	`to_numeric()`, `to_datetime()`, `astype('category')`
Strings	`str.strip()`, `str.lower()`, `str.replace(regex)`
Pipeline	Function chain: dedup → types → missing → outliers → strings

Quick Reference

Python

# Missing
df.isnull().sum()
df['col'] = df['col'].fillna(df['col'].median())

# Duplicates
df = df.drop_duplicates(subset=['key_col'])

# Outliers (IQR)
Q1, Q3 = df['col'].quantile([0.25, 0.75])
IQR = Q3 - Q1
df['col'] = df['col'].clip(Q1 - 1.5*IQR, Q3 + 1.5*IQR)

# Types
df['col'] = pd.to_numeric(df['col'], errors='coerce')
df['date'] = pd.to_datetime(df['date'])

# Strings
df['col'] = df['col'].str.strip().str.lower()

Next lesson: Feature Engineering, building strong features for Machine Learning! ⚙️

Self-check questions

When you encounter missing values, when should you use fillna() and when should you use dropna()?
How does the IQR method detect outliers? What are the formulas for the upper and lower thresholds?
Why does the order of steps in a cleaning pipeline matter? Which step should come first?
How does pd.to_numeric(errors='coerce') handle invalid values?

🎉 Well done! You have completed the Data Cleaning lesson!

Up next: Feature Engineering, where you learn to build strong features that prepare your data for Machine Learning!
