Data Cleaning & Preprocessing

1. Why Does Data Cleaning Matter?

"Garbage in, garbage out": a model trained on dirty data performs poorly.

Common problems:

Missing values - entries that are absent
Duplicates - repeated records
Outliers - values far outside the expected range
Inconsistent data - the same thing recorded in different ways
Wrong data types - for example, numbers stored as strings

Python

import pandas as pd
import numpy as np

# Sample dirty data: inconsistent casing, a missing name, impossible ages,
# a missing salary, and malformed or empty emails
df = pd.DataFrame({
    'name': ['Alice', 'BOB', 'alice', None, 'Charlie'],
    'age': [25, 300, 30, 35, -5],
    'salary': [50000, 60000, np.nan, 70000, 80000],
    'email': ['alice@gmail.com', 'bob@', 'ALICE@GMAIL.COM', 'charlie@yahoo.com', '']
})

2. Missing Values

2.1 Detecting Missing Values

Python

# Check for nulls
df.isnull()              # Boolean DataFrame
df.isnull().sum()        # Null count per column
df.isnull().sum().sum()  # Total null count

# Percentage of missing values per column
missing_pct = df.isnull().sum() / len(df) * 100
print(missing_pct)

# Visualize missing values (requires the third-party missingno package)
import missingno as msno
msno.matrix(df)
msno.heatmap(df)
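One thing isnull() does not catch is a placeholder such as an empty string, like the last email in the sample data. The sketch below shows one possible way to convert empty or whitespace-only strings to NaN before counting. The `emails` frame is a made-up example, not part of the lesson's data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real None, one empty string, one whitespace-only string.
emails = pd.DataFrame({
    "email": ["alice@gmail.com", None, "", "   ", "bob@yahoo.com"],
})

# isnull() only sees the real None.
missing_before = int(emails["email"].isnull().sum())

# Turn empty or whitespace-only strings into NaN so they are counted too.
normalized = emails.replace(r"^\s*$", np.nan, regex=True)
missing_after = int(normalized["email"].isnull().sum())

print(missing_before, missing_after)
```

Other placeholders ("N/A", "-", "unknown") would need their own rules, which depend on the dataset.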

2.2 Handling Missing Values

Python

# Columns such as 'category', 'value' and 'department' are illustrative;
# they are not in the sample df above. dropna/fillna return a new object,
# so assign the result back to keep it.

# 1. Drop rows with missing values
df.dropna()                    # Drop rows containing any NaN
df.dropna(subset=['name'])     # Drop rows where 'name' is NaN
df.dropna(thresh=3)            # Keep rows with at least 3 non-null values

# 2. Drop columns with too many missing values
min_non_null = int(np.ceil(len(df) * 0.5))  # thresh must be an integer
df.dropna(axis=1, thresh=min_non_null)      # Drop columns with >50% missing

# 3. Fill missing values - numeric
df['age'].fillna(df['age'].mean())     # Mean
df['age'].fillna(df['age'].median())   # Median
df['age'].fillna(df['age'].mode()[0])  # Mode
df['age'].fillna(0)                    # Constant

# 4. Fill missing values - categorical
df['category'].fillna('Unknown')
df['category'].fillna(df['category'].mode()[0])

# 5. Forward/backward fill (time series)
# fillna(method=...) is deprecated in pandas 2.x; use ffill()/bfill()
df['value'].ffill()  # Forward fill
df['value'].bfill()  # Backward fill

# 6. Interpolation
df['value'].interpolate(method='linear')
df['value'].interpolate(method='polynomial', order=2)  # Requires SciPy

# 7. Group-wise fill: use the median of each department
df['salary'] = df.groupby('department')['salary'].transform(
    lambda x: x.fillna(x.median())
)
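To make the difference between a global fill and a group-wise fill concrete, here is a small worked example. The `staff` frame and its department names are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing salary in each department.
staff = pd.DataFrame({
    "department": ["eng", "eng", "eng", "sales", "sales", "sales"],
    "salary": [100.0, np.nan, 120.0, 50.0, 60.0, np.nan],
})

# Global fill: every gap gets the overall median (80.0 here).
global_fill = staff["salary"].fillna(staff["salary"].median())

# Group-wise fill: each gap gets the median of its own department
# (110.0 for eng, 55.0 for sales).
group_fill = staff.groupby("department")["salary"].transform(
    lambda s: s.fillna(s.median())
)

print(pd.DataFrame({"global": global_fill, "by_group": group_fill}))
```

The group-wise version keeps the filled values close to their peers, which usually matters when groups differ a lot in scale.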

2.3 Advanced: KNN Imputation

Python

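from sklearn.impute import KNNImputer

# Each missing value is filled with the mean of that feature
# over the 5 nearest rows
imputer = KNNImputer(n_neighbors=5)
df_numeric = df.select_dtypes(include=[np.number])
df_imputed = pd.DataFrame(
    imputer.fit_transform(df_numeric),
    columns=df_numeric.columns,
    index=df_numeric.index  # Keep the original row labels
)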
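A tiny worked example may help show what KNNImputer actually computes. The array below is made up; with `n_neighbors=2`, the missing entry is replaced by the average of the first column over the two rows that are closest on the observed column.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: the first value of the third row is missing.
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
    [8.0, 8.0],
])

imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)

# On the observed column, the rows with 4.0 and 8.0 are closest to 6.0,
# so the gap is filled with the mean of 3.0 and 8.0.
print(X_imputed[2, 0])  # 5.5
```

Because the method is distance-based, features on very different scales (age versus salary, say) are often scaled first so that one column does not dominate the neighbor search.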

3. Duplicate Data

3.1 Detecting Duplicates

Python

# Check for duplicates
df.duplicated()                        # Boolean Series
df.duplicated().sum()                  # Number of duplicate rows

# Duplicates based on specific columns
df.duplicated(subset=['name', 'email'])

# View the duplicate rows
df[df.duplicated(keep=False)]          # Every row that has a duplicate
df[df.duplicated(keep='first')]        # All but the first occurrence

3.2 Handling Duplicates

Python

# Remove duplicates (each call returns a new DataFrame)
df.drop_duplicates()                            # Keep the first occurrence
df.drop_duplicates(keep='last')                 # Keep the last occurrence
df.drop_duplicates(subset=['email'])            # Compare specific columns only
df.drop_duplicates(subset=['name'], keep=False) # Drop every duplicated row
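duplicated() compares values exactly, so records that differ only in case or surrounding whitespace are not flagged. A minimal sketch of one way around that, assuming a made-up `people` frame: build a normalized comparison key and deduplicate on it, while keeping the original values.

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 are the same person written differently.
people = pd.DataFrame({
    "name": ["Alice", " alice ", "BOB", "Charlie"],
    "email": ["alice@gmail.com", "ALICE@GMAIL.COM", "bob@yahoo.com", "charlie@yahoo.com"],
})

# Exact comparison finds nothing.
raw_dupes = int(people.duplicated(subset=["name", "email"]).sum())

# Normalized key: trimmed and lower-cased copies of the columns.
key = pd.DataFrame({
    "name": people["name"].str.strip().str.lower(),
    "email": people["email"].str.strip().str.lower(),
})
normalized_dupes = int(key.duplicated().sum())

# Keep the first row of each normalized group, with its original spelling.
deduped = people.loc[~key.duplicated()]

print(raw_dupes, normalized_dupes, len(deduped))
```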

4. Outliers

4.1 Detecting Outliers

Python

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize with a boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(data=df[['age', 'salary']])
plt.show()

# Z-score method
from scipy import stats

# Compute z-scores on the non-null values, then map back through the index
# so the selection stays aligned with df
salary = df['salary'].dropna()
z_scores = np.abs(stats.zscore(salary.to_numpy()))
outliers_z = df.loc[salary.index[z_scores > 3]]  # |z| > 3

# IQR method
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers_iqr = df[(df['salary'] < lower_bound) | (df['salary'] > upper_bound)]
print(f"Outliers: {len(outliers_iqr)}")
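Working through both methods by hand on the `age` values from the sample data shows how they can disagree. This is only an illustrative sketch; the thresholds (1.5 × IQR and |z| > 3) are the ones used above.

```python
import numpy as np
import pandas as pd
from scipy import stats

# The 'age' column from the sample data.
age = pd.Series([25, 300, 30, 35, -5], name="age")

# IQR method: Q1 = 25, Q3 = 35, IQR = 10, so the fences are 10 and 50.
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = age[(age < lower) | (age > upper)]

# Z-score method: with n points (and the population standard deviation that
# scipy uses by default) no |z| can exceed sqrt(n - 1), which is 2 for n = 5.
z = np.abs(stats.zscore(age.to_numpy()))
z_outliers = age[z > 3]

print(lower, upper)           # 10.0 50.0
print(iqr_outliers.tolist())  # [300, -5]
print(z_outliers.tolist())    # []
```

On very small samples the |z| > 3 rule cannot fire at all, and a single extreme value also inflates the standard deviation it is measured against. The IQR rule is less affected by both problems.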

4.2 Handling Outliers

Python

# 1. Remove outliers (rows where salary is NaN are dropped as well)
df_clean = df[(df['salary'] >= lower_bound) & (df['salary'] <= upper_bound)]

# 2. Cap outliers at the bounds (winsorization-style clipping)
df['salary_capped'] = df['salary'].clip(lower=lower_bound, upper=upper_bound)

# 3. Log transform (reduces right skew; log1p needs values > -1)
df['salary_log'] = np.log1p(df['salary'])

# 4. Robust scaling
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
# fit_transform returns a 2-D array; flatten it for a single column
df['salary_scaled'] = scaler.fit_transform(df[['salary']]).ravel()
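Some outliers are not merely unusual but impossible, such as the ages 300 and -5 in the sample data. For those, a domain rule can be simpler than a statistical one. The sketch below marks such values as missing so they can be imputed later; the limits `MIN_AGE` and `MAX_AGE` are assumptions chosen for illustration.

```python
import pandas as pd

# The 'age' column from the sample data.
age = pd.Series([25, 300, 30, 35, -5], name="age")

# Assumed plausibility limits; adjust to the domain.
MIN_AGE, MAX_AGE = 0, 120

# between() is inclusive on both ends by default.
is_plausible = age.between(MIN_AGE, MAX_AGE)

# where() keeps plausible values and replaces the rest with NaN.
age_checked = age.where(is_plausible)

print(age_checked.tolist())
```

Marking the value as missing, rather than clipping it to the limit, avoids pretending that a data-entry error of 300 is evidence of a 120-year-old.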

5. Data Type Conversion

Python

# Column names in this block ('price', 'date', 'status', ...) are illustrative

# Check data types
df.dtypes

# Convert to numeric
df['price'] = pd.to_numeric(df['price'], errors='coerce')  # Invalid values become NaN

# Convert to datetime (two alternatives)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')  # Raises on a bad value
df['date'] = pd.to_datetime(df['date'], errors='coerce')    # Bad values become NaT

# Convert to category (memory efficient for repeated values)
df['status'] = df['status'].astype('category')

# Convert to string
df['id'] = df['id'].astype(str)

# Convert to boolean (values missing from the mapping become NaN)
df['is_active'] = df['is_active'].map({'yes': True, 'no': False})
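Since `errors='coerce'` silently turns anything unparseable into NaN, it can be worth checking what was lost. A minimal sketch, using an invented `raw_price` series:

```python
import pandas as pd

# Hypothetical raw column: numbers stored as text, with some bad entries.
raw_price = pd.Series(["10", "12.5", "abc", None, "1,200"])

# Both "abc" and "1,200" fail to parse and become NaN.
price = pd.to_numeric(raw_price, errors="coerce")

# Values that were present but could not be converted.
newly_invalid = price.isna() & raw_price.notna()
print(raw_price[newly_invalid].tolist())  # ['abc', '1,200']

# Removing the thousands separator first rescues "1,200".
price_fixed = pd.to_numeric(
    raw_price.str.replace(",", "", regex=False), errors="coerce"
)
print(price_fixed.tolist())
```

Whether a comma is a thousands separator or a decimal mark depends on the locale of the source data, so that cleanup step is an assumption to confirm per dataset.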

6. String Cleaning

Python

# Strip leading and trailing whitespace
df['name'] = df['name'].str.strip()

# Standardize case (pick one convention)
df['name'] = df['name'].str.lower()
df['name'] = df['name'].str.title()  # Title Case

# Remove special characters: keep digits only
df['phone'] = df['phone'].str.replace(r'[^0-9]', '', regex=True)

# Fix inconsistent values
df['country'] = df['country'].replace({
    'vn': 'Vietnam',
    'VN': 'Vietnam',
    'vietnam': 'Vietnam',
    'Viet Nam': 'Vietnam'
})

# Extract information
# expand=False returns a Series rather than a one-column DataFrame
df['domain'] = df['email'].str.extract(r'@(\w+\.\w+)', expand=False)
df['area_code'] = df['phone'].str[:3]
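The sample data also contains malformed emails ('bob@' and an empty string). One possible way to flag them is a pattern check after normalizing case. The regular expression below is a deliberately loose assumption (something, '@', something, '.', something), not a full email validator.

```python
import pandas as pd

# The 'email' column from the sample data.
email = pd.Series(
    ["alice@gmail.com", "bob@", "ALICE@GMAIL.COM", "charlie@yahoo.com", ""]
)

# Loose, illustrative pattern: local part, '@', domain containing a dot.
EMAIL_PATTERN = r"[^@\s]+@[^@\s]+\.[^@\s]+"

cleaned = email.str.strip().str.lower()

# fullmatch requires the whole string to match; na=False treats missing as invalid.
is_valid = cleaned.str.fullmatch(EMAIL_PATTERN, na=False)

print(is_valid.tolist())  # [True, False, True, True, False]
```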

7. Feature Encoding

7.1 Label Encoding

Python

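from sklearn.preprocessing import LabelEncoder

# Each distinct label gets an integer code, in sorted order
le = LabelEncoder()
df['status_encoded'] = le.fit_transform(df['status'])

# Mapping from label to code (a label's code is its position in classes_)
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)  # e.g. {'active': 0, 'inactive': 1}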
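A short worked example makes the sorted-order behavior visible and shows how to get the original labels back. The `status` values are invented.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, deliberately not in alphabetical order.
status = ["inactive", "active", "active", "pending"]

le = LabelEncoder()
codes = le.fit_transform(status)

# Classes are sorted alphabetically, so 'active' is 0 regardless of input order.
print(le.classes_.tolist())  # ['active', 'inactive', 'pending']
print(codes.tolist())        # [1, 0, 0, 2]

# inverse_transform maps codes back to labels.
print(le.inverse_transform([2, 0]).tolist())  # ['pending', 'active']
```

The codes carry no meaningful order (nothing makes 'pending' greater than 'active'), which is one reason one-hot or ordinal encoding is often preferred for input features.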

7.2 One-Hot Encoding

Python

# pandas get_dummies (drop_first drops one level per column to avoid redundancy)
df_encoded = pd.get_dummies(df, columns=['city', 'status'], drop_first=True)

# scikit-learn OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# sparse_output was named 'sparse' before scikit-learn 1.2
ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['city', 'status']])

7.3 Ordinal Encoding

Python

from sklearn.preprocessing import OrdinalEncoder

# With an explicit order: codes follow the position in this list
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
oe = OrdinalEncoder(categories=[education_order])
# fit_transform returns a 2-D array; flatten it for a single column
df['education_encoded'] = oe.fit_transform(df[['education']]).ravel()
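As a quick check of what the explicit order does, here is the same encoder applied to a small invented `people` frame:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

education_order = ["High School", "Bachelor", "Master", "PhD"]

# Hypothetical column with the levels in arbitrary row order.
people = pd.DataFrame({"education": ["Bachelor", "PhD", "High School", "Master"]})

oe = OrdinalEncoder(categories=[education_order])
codes = oe.fit_transform(people[["education"]]).ravel()

# Each code is the position of the value in education_order.
print(codes.tolist())  # [1.0, 3.0, 0.0, 2.0]
```

Without the `categories` argument the levels would be sorted alphabetically, which would put 'Bachelor' before 'High School' and lose the intended ranking.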

8. Feature Scaling

8.1 StandardScaler (Z-score)

Python

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])

# Each scaled column now has mean 0 and standard deviation 1

8.2 MinMaxScaler

Python

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])

# Each scaled column now lies in the range [0, 1]

8.3 RobustScaler (for data with outliers)

Python

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
# fit_transform returns a 2-D array; flatten it for a single column
df['salary_scaled'] = scaler.fit_transform(df[['salary']]).ravel()

# Robust to outliers: centers on the median and scales by the IQR
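When the data is later split for modeling, a common practice is to fit the scaler on the training portion only and reuse its statistics on the test portion, so that no information from the test set leaks into preprocessing. A minimal sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature split.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [5.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training data.
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses those training statistics; it does not refit.
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)           # [2.]
print(X_test_scaled.ravel())  # first value is 0.0 because 2.0 equals the training mean
```

The same fit-on-train, transform-on-test pattern applies to MinMaxScaler and RobustScaler, and to the imputers and encoders shown earlier.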

9. Complete Pipeline

Python

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

def clean_data(df):
    """End-to-end data cleaning pipeline."""

    df = df.copy()
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    cat_cols = df.select_dtypes(include=['object']).columns

    # 1. Clean strings first so 'Alice' and ' alice' count as the same value
    for col in cat_cols:
        df[col] = df[col].str.strip().str.lower()

    # 2. Remove duplicates
    df = df.drop_duplicates()
    print(f"After removing duplicates: {len(df)} rows")

    # 3. Handle missing values
    # Numeric: fill with the median
    for col in numeric_cols:
        df[col] = df[col].fillna(df[col].median())

    # Categorical: fill with the mode ('unknown' if the column is entirely null)
    for col in cat_cols:
        mode = df[col].mode()
        df[col] = df[col].fillna(mode[0] if not mode.empty else 'unknown')

    # 4. Handle outliers (IQR method): clip to the 1.5 * IQR fences
    for col in numeric_cols:
        Q1 = df[col].quantile(0.25)
        Q3 = df[col].quantile(0.75)
        IQR = Q3 - Q1
        df[col] = df[col].clip(Q1 - 1.5*IQR, Q3 + 1.5*IQR)

    # 5. Encode categoricals (a fresh encoder per column)
    for col in cat_cols:
        df[f'{col}_encoded'] = LabelEncoder().fit_transform(df[col])

    # 6. Scale numerics
    scaler = StandardScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

    return df

# Usage
df_clean = clean_data(df)
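The same imputing, encoding, and scaling steps can also be expressed with scikit-learn's Pipeline and ColumnTransformer, which keeps the fitted statistics so they can be reused on new data. The following is only a sketch of that alternative, not a replacement for the function above: the `data` frame and column lists are invented, and it uses one-hot rather than label encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data with gaps in both numeric and categorical columns.
data = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 35.0],
    "salary": [50000.0, np.nan, 70000.0, 80000.0],
    "city": ["hanoi", "hue", np.nan, "hanoi"],
})
numeric_cols = ["age", "salary"]
categorical_cols = ["city"]

# Numeric columns: median imputation, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

X = preprocess.fit_transform(data)
# The result may be sparse depending on its density; make it dense either way.
X = X.toarray() if hasattr(X, "toarray") else np.asarray(X)

print(X.shape)  # (4, 4): two scaled numeric columns plus two city indicators
```

After fitting, `preprocess.transform(new_data)` applies the stored medians, modes, categories, and scaling parameters without recomputing them.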

10. Data Quality Report

Python

def data_quality_report(df):
    """Build a per-column data quality report."""

    report = pd.DataFrame({
        'Column': df.columns,
        'Data Type': df.dtypes.values,
        'Non-Null Count': df.count().values,
        'Null Count': df.isnull().sum().values,
        'Null %': (df.isnull().sum() / len(df) * 100).round(2).values,
        'Unique Values': df.nunique().values,
        # First three non-null values of each column
        'Sample Values': [df[col].dropna().head(3).tolist() for col in df.columns]
    })

    return report

# Usage
report = data_quality_report(df)
print(report.to_string())

Summary

In this lesson, you learned how to:

✅ Detect and handle missing values
✅ Handle duplicate data
✅ Detect and handle outliers
✅ Convert data types
✅ Clean strings
✅ Encode features (Label, One-Hot, Ordinal)
✅ Scale features (Standard, MinMax, Robust)
✅ Build a data cleaning pipeline

Next lesson: Data Visualization with Seaborn!