Data Processing with Pandas

🎯 Lesson Objectives

After this lesson, you will be able to:

✅ Manipulate data with Pandas DataFrames

✅ Handle missing values and outliers effectively

✅ Perform feature engineering and feature selection

✅ Encode categorical variables correctly

✅ Build a complete data pipeline

✅ Carry out a systematic EDA (Exploratory Data Analysis)

Duration: 4-5 hours | Difficulty: Intermediate

Task 0

📖 Key Terms

Term	Vietnamese	Simple explanation
DataFrame	Khung dữ liệu	Two-dimensional table of data in Pandas (rows x columns)
Missing Values	Giá trị thiếu	Empty entries (NaN, null) that need handling
Outlier	Điểm ngoại lai	An unusual value far outside the typical range
Feature Engineering	Tạo đặc trưng	Creating new variables from the raw data to improve the model
Encoding	Mã hóa	Converting categorical (text) data to numbers so the model can process it
EDA	Phân tích khám phá	Understanding the data through statistics and visualization
Imputation	Thay thế	Filling in missing entries (mean, median, mode)
Feature Scaling	Chuẩn hóa đặc trưng	Bringing features onto a common scale

Checkpoint

Have you read through the glossary? Keep these concepts in mind for the rest of the lesson.

Task 1

🔑 Why does Data Preprocessing matter?

💡 Why preprocess data at all?

Data preprocessing is one of the most important steps in a Machine Learning workflow:

"Garbage In, Garbage Out": poor data → poor model, no matter how good the algorithm is
Time: cleaning and preprocessing are commonly said to take most of a data science project's time (80% is the figure usually quoted)
Impact on accuracy: good feature engineering often helps more than model tuning
Real-world data is always messy: missing values, outliers, inconsistent formats
Model assumptions: most algorithms expect data that is clean, scaled, and correctly encoded

Practical examples:

House-price model trained on data with unhandled missing values → large prediction errors
Unscaled features → Gradient Descent converges slowly or not at all
Wrongly encoded categorical variables → the model learns spurious patterns
Unhandled outliers → the model can be heavily biased

💪 What you will get from this lesson:

Handle the most common real-world data problems
Improve model accuracy through preprocessing alone
Build an automated, reusable pipeline

Task 2

📊 Pandas DataFrame basics

📥 Creating and reading a DataFrame

Python

import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
data = {
    'Name': ['An', 'Binh', 'Chi', 'Dung'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000, 60000, 75000, 80000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
print(df)

# Read from CSV
# df = pd.read_csv('data.csv')

# Read from Excel
# df = pd.read_excel('data.xlsx')

🔍 Basic operations

Python

# Inspect the data
df.info()                  # info() prints directly and returns None
print(df.describe())
print(df.shape)
print(df.columns)
print(df.dtypes)

# Access data
print(df['Age'])           # One column
print(df[['Name', 'Age']]) # Several columns
print(df.iloc[0])          # First row
print(df.loc[0, 'Name'])   # A specific cell

# Filter data
print(df[df['Age'] > 25])
print(df[(df['Age'] > 25) & (df['Department'] == 'IT')])

Task 3

🔍 Handling Missing Values

🔎 Detecting missing values

Python

# Count missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Visualize missing values
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()

📊 Strategies for handling missing values

Strategy	When to use	Code
Drop rows	Few missing values, plenty of data	`df.dropna()`
Drop column	Column is mostly missing (>50%)	`df.drop(columns=['col'])`
Fill with mean	Numerical, roughly normal distribution	`df.fillna(df.mean(numeric_only=True))`
Fill with median	Numerical, with outliers	`df.fillna(df.median(numeric_only=True))`
Fill with mode	Categorical	`df['col'].fillna(df['col'].mode()[0])`
Forward fill	Time series	`df.ffill()`
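
The strategies in the table are usually chosen per column rather than for the whole DataFrame at once. A minimal sketch, assuming a small made-up DataFrame (the column names `temp`, `city`, and `reading` are hypothetical and not part of the lesson data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in every column
toy = pd.DataFrame({
    "temp": [20.0, np.nan, 22.0, 100.0],      # numeric, with an outlier
    "city": ["Hanoi", None, "Hanoi", "Hue"],   # categorical
    "reading": [1.0, np.nan, np.nan, 4.0],     # ordered in time
})

filled = toy.copy()
# Numeric with outliers: fill with the median of the observed values
filled["temp"] = filled["temp"].fillna(filled["temp"].median())
# Categorical: fill with the most frequent value
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])
# Time-ordered: carry the last observed value forward
filled["reading"] = filled["reading"].ffill()

print(filled)
```

Here the median (22) is used instead of the mean (about 47.3) because the single value 100 would pull the mean far away from the typical readings.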

💻 Practice

Python

# Handle missing values
df_clean = df.copy()

# Fill Age with the median (assign back instead of using inplace=True on a column)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Or use SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
# fit_transform returns a 2D array, so flatten it before assigning to one column
df_clean['Age'] = imputer.fit_transform(df_clean[['Age']]).ravel()

print(df_clean)

⚠️ Warning - Data Leakage!

WRONG:

Python

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Fitting the imputer on all the data (X) leaks test data into training!
imputer = SimpleImputer(strategy='mean')
X_all = imputer.fit_transform(X)  # ❌ WRONG
X_train, X_test = train_test_split(X_all)

RIGHT:

Python

# Fit ONLY on train, then transform test
X_train, X_test = train_test_split(X)
imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(X_train)  # ✅ Fit on train
X_test = imputer.transform(X_test)  # ✅ Only transform test

Why: in production you do NOT have the test data available when fitting the imputer. If you fit on the whole dataset, the preprocessing uses statistics from data the model should never have seen, so the measured accuracy is optimistically biased.
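
To make the difference concrete, the sketch below fits an imputer on the training split only and compares the learned mean with the mean of the full data. The array `X` is hypothetical toy data, chosen so that one extreme value makes the two means differ visibly.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical single-feature data with two missing entries
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan], [7.0], [100.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)   # statistics come from train only
X_test_filled = imputer.transform(X_test)         # test reuses the train mean

train_mean = np.nanmean(X_train)
print("Mean learned from train:", imputer.statistics_[0])
print("Mean of the full data  :", np.nanmean(X))
```

Whatever ends up in the test split, its missing entries are filled with a value computed without looking at it, which mirrors what happens when new data arrives in production.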

✅ Key Takeaways - Missing Values

EDA first: always visualize the missing-data pattern before deciding how to handle it
Median > Mean: when there are outliers, use the median rather than the mean
Domain knowledge: consider the business meaning before dropping or filling
MCAR vs MAR: understand the missingness mechanism (Missing Completely At Random vs Missing At Random) to choose the right strategy
Validate: after filling, check whether the distribution has changed much

Rule of thumb:

Missing < 5%: drop rows
Missing 5-50%: impute
Missing > 50%: drop the column (unless it is very important)
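
The rule of thumb can be turned into a small helper that reports a suggestion per column. This is only a sketch: the function name `suggest_missing_strategy` and the demo data are hypothetical, and the thresholds are the ones from the list above, not universal constants.

```python
import numpy as np
import pandas as pd

def suggest_missing_strategy(df: pd.DataFrame) -> pd.DataFrame:
    """Map each column's missing percentage to the rule-of-thumb action."""
    pct = df.isnull().mean() * 100

    def rule(p: float) -> str:
        if p == 0:
            return "nothing to do"
        if p < 5:
            return "drop rows"
        if p <= 50:
            return "impute"
        return "drop column (unless important)"

    return pd.DataFrame({"missing_pct": pct.round(1), "suggestion": pct.map(rule)})

# Hypothetical data: 100 rows with 0%, 2%, 30% and 80% missing
n = 100
demo = pd.DataFrame({
    "full": np.arange(n, dtype=float),
    "few": [np.nan] * 2 + [1.0] * (n - 2),
    "some": [np.nan] * 30 + [1.0] * (n - 30),
    "mostly": [np.nan] * 80 + [1.0] * (n - 80),
})
report = suggest_missing_strategy(demo)
print(report)
```

The output is a starting point for discussion; domain knowledge should still override it when a sparse column is known to be important.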

Task 4

🚨 Handling Outliers

1 Detecting outliers

IQR (Interquartile Range) method:

$IQR = Q3 - Q1$
$\text{Lower bound} = Q1 - 1.5 \times IQR$
$\text{Upper bound} = Q3 + 1.5 \times IQR$

Figure: Boxplot and how outliers are identified

2 Detection in practice

Python

import pandas as pd
import matplotlib.pyplot as plt

def detect_outliers_iqr(data, column):
    """Return the outlier rows and the IQR bounds for one column."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

# Example
data = pd.DataFrame({
    'value': [10, 12, 14, 15, 16, 18, 100]  # 100 is an outlier
})

outliers, lower, upper = detect_outliers_iqr(data, 'value')
print(f"Outliers:\n{outliers}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")

# Visualize
plt.boxplot(data['value'])
plt.title('Boxplot - Outlier Detection')
plt.show()

3 Handling outliers

Method	When to use	Code
Remove	You are sure it is a measurement error	`df = df[(df['col'] >= lower) & (df['col'] <= upper)]`
Cap (Winsorization)	Keep the data points, reduce their impact	`df['col'] = df['col'].clip(lower, upper)`
Transform	Log/sqrt to reduce skewness	`df['col_log'] = np.log1p(df['col'])`
Binning	Group into categories	`pd.cut(df['col'], bins=5)`
Keep as is	The outlier is meaningful for the business	Do nothing
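
The first three methods in the table can be compared side by side on the same seven values used in the detection example. A minimal sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 14, 15, 16, 18, 100], name="value")

q1, q3 = s.quantile(0.25), s.quantile(0.75)    # 13.0 and 17.0
iqr = q3 - q1                                   # 4.0
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # 7.0 and 23.0

removed = s[(s >= lower) & (s <= upper)]   # drops the row containing 100
capped = s.clip(lower, upper)              # 100 becomes 23.0, no row is lost
logged = np.log1p(s)                       # compresses large values, keeps the order

print(len(removed), capped.max(), round(logged.max(), 2))
```

Removal shrinks the dataset to six rows, capping keeps all seven but limits the extreme value to the upper bound, and the log transform keeps every value while shrinking the gap between 18 and 100.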

⚠️ Pitfall - Removing outliers too hastily!

Example: employee salary data

Example

[50k, 55k, 60k, 52k, 58k, 500k]

❌ WRONG: delete 500k because it is an outlier
✅ RIGHT: investigate first - it may be the CEO's salary (a genuine, meaningful outlier)

Consequences of wrongly removing outliers:

Important information about high-value cases is lost
The model cannot learn rare but important cases
The model is biased toward low-value predictions

Checklist before removing:

☐ Is it a measurement or data-entry error?
☐ Does it carry business meaning?
☐ What percentage of the data does it represent?
☐ Does model performance actually improve?

✅ Key Takeaways - Outliers

IQR > Z-score: IQR is more robust for non-normal distributions
Visualize first: use boxplots and scatter plots to understand the outliers
Domain knowledge: not every outlier should be removed
Capping > Removing: keep the data points and only reduce their impact
Transform: log/sqrt reduces skewness while preserving the ordering of values
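
The first takeaway can be checked on the small example from the detection step. In the sketch below, the common |z| > 3 threshold (an assumption, not a rule stated in this lesson) misses the value 100, because the outlier itself inflates the mean and standard deviation; the IQR rule still flags it.

```python
import numpy as np

values = np.array([10, 12, 14, 15, 16, 18, 100], dtype=float)

# Z-score rule: flag |z| > 3
z = (values - values.mean()) / values.std()
z_flags = np.abs(z) > 3

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print("z-score of 100:", round(z[-1], 2))
print("flagged by z-score:", values[z_flags])
print("flagged by IQR    :", values[iqr_flags])
```

The z-score of 100 comes out at about 2.44, below the threshold, which is why quartile-based bounds are often preferred for small or skewed samples.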

Task 5

⚙️ Feature Engineering

1 Encoding Categorical Variables

Label Encoding:

Python

from sklearn.preprocessing import LabelEncoder

# Note: LabelEncoder assigns integers in sorted (alphabetical) order and is
# designed for target labels; it does not know any meaningful category order.
le = LabelEncoder()
df['Department_encoded'] = le.fit_transform(df['Department'])
print(df)
print(f"Classes: {le.classes_}")

One-Hot Encoding:

Python

# Pandas
df_onehot = pd.get_dummies(df, columns=['Department'], prefix='Dept')
print(df_onehot)

# Sklearn (the sparse_output argument needs scikit-learn >= 1.2)
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['Department']])
print(encoded)

2 Which encoding should you use?

Encoding	When to use	Example
Label/Ordinal Encoding	Ordinal data (ordered), with the order specified explicitly	Education level: Low < Medium < High
One-Hot Encoding	Nominal data (unordered)	Color: Red, Blue, Green
Target Encoding	High cardinality	Zip code, City
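
For ordinal data, the integer codes only help if they follow the real order. A minimal sketch, assuming a hypothetical `education` column: `LabelEncoder` sorts the categories alphabetically, whereas scikit-learn's `OrdinalEncoder` accepts an explicit category order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal column
edu = pd.DataFrame({"education": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2 (the order is lost)
alphabetical = LabelEncoder().fit_transform(edu["education"])

# OrdinalEncoder with explicit categories keeps Low < Medium < High
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordered = encoder.fit_transform(edu[["education"]]).ravel()

print(alphabetical)  # [1 0 2 1]
print(ordered)       # [0. 2. 1. 0.]
```

With the explicit order, a larger code always means a higher education level, which is what a linear or distance-based model would assume.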

3 Creating new features

Python

# Create features from a date
df['Date'] = pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['IsWeekend'] = df['DayOfWeek'].isin([5, 6]).astype(int)

# Binning
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 25, 35, 100], labels=['Young', 'Middle', 'Senior'])

# Interaction features
df['Age_Salary'] = df['Age'] * df['Salary']

Task 6

🛠️ A complete Data Pipeline

Python

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Assume the data is available
# df = pd.read_csv('data.csv')

# Define the columns
numerical_cols = ['Age', 'Salary']
categorical_cols = ['Department']

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# Apply
# X_processed = preprocessor.fit_transform(X_train)
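
To see what such a preprocessor produces, the sketch below runs an equivalent pipeline on the four-row toy data from the start of the lesson. One deliberate difference, made only to keep the sketch independent of the scikit-learn version: it passes `sparse_threshold=0` to `ColumnTransformer` to get a dense array instead of setting `sparse_output=False` on the encoder.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "Age": [25, 30, 35, np.nan],
    "Salary": [50000, 60000, 75000, 80000],
    "Department": ["IT", "HR", "IT", "Finance"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(drop="first")),
])
# sparse_threshold=0 asks ColumnTransformer to always return a dense array
pre = ColumnTransformer(
    [("num", num_pipe, ["Age", "Salary"]), ("cat", cat_pipe, ["Department"])],
    sparse_threshold=0,
)

X_out = pre.fit_transform(toy)
print(X_out.shape)   # 2 scaled numeric columns + 2 one-hot columns
print(X_out.round(2))
```

Three departments become two one-hot columns because `drop="first"` removes one redundant column, and the missing age is filled with the median before scaling.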

Pros and cons of each method

1 Missing Value Handling

Method	Pros	Cons
Drop	Simple	Loses data
Mean/Median	Keeps the data	Can introduce bias
Model-based (KNN)	More accurate	Slow

2 Encoding

Method	Pros	Cons
Label	Simple, one column	Creates a false ordering
One-Hot	No false ordering	Many columns (curse of dimensionality)

Practice exercises

Exercise 1: Load the Titanic dataset and handle missing values in Age and Embarked
Exercise 2: Detect and handle outliers in the Fare column
Exercise 3: One-hot encode the Sex and Embarked columns, then train a model
Exercise 4: Create datetime features from a date column
Exercise 5: Implement a complete EDA workflow for a new dataset

Feature Engineering Advanced

1 Workflow overview

Example

Raw Data → Extract → Transform → Select → Model

Goal: create features that correlate strongly with the target, remove noise, and satisfy the model's assumptions

2 DateTime & Cyclic Features

2.1 Basic DateTime Extraction

Python

# Create a datetime column
df['dt'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month
df['day'] = df['dt'].dt.day
df['dayofweek'] = df['dt'].dt.dayofweek  # 0=Monday, 6=Sunday
df['hour'] = df['dt'].dt.hour
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['dt'].dt.is_month_start.astype(int)
df['is_month_end'] = df['dt'].dt.is_month_end.astype(int)

# Time differences
df['days_since'] = (df['dt'] - df['dt'].min()).dt.days

2.2 Cyclic Encoding (Sin/Cos)

Problem: hour 23 and hour 0 are adjacent in time, but numerically far apart (23 vs 0)

Solution: encode the value as sin/cos to preserve this "closeness"

Python

import numpy as np

# Hour cyclic encoding (0-23 hours)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Month cyclic encoding (1-12 months)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Dayofweek cyclic encoding (0-6)
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

Visualization:

Example

Hour 23 ≈ Hour 0
         ↓
      Sin/Cos: Close
         vs
    Linear: 23 - 0 = 23 (Far)
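
The "closeness" claim can be verified numerically by measuring distances between encoded hours. A minimal sketch (the helper names `encode_hour` and `cyclic_distance` are hypothetical):

```python
import numpy as np

def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle."""
    angle = 2 * np.pi * hour / 24
    return np.array([np.sin(angle), np.cos(angle)])

def cyclic_distance(h1, h2):
    """Euclidean distance between two hours in sin/cos space."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))

print(round(cyclic_distance(23, 0), 3))   # neighbours across midnight
print(round(cyclic_distance(0, 1), 3))    # neighbours within the day
print(round(cyclic_distance(0, 12), 3))   # opposite sides of the clock
```

After encoding, 23:00 and 00:00 are exactly as close as 00:00 and 01:00, while midnight and noon are the farthest apart, which matches how time of day actually behaves.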

3 Polynomial & Interaction Features

3.1 Polynomial Features

Create non-linear relationships: $x^2$, $xy$, $y^2$

Python

from sklearn.preprocessing import PolynomialFeatures

# Degree 2: x1, x2 → x1, x2, x1², x1·x2, x2²
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())
# With a DataFrame X whose columns are named 'x1' and 'x2':
# ['x1' 'x2' 'x1^2' 'x1 x2' 'x2^2']

When to use:

Linear models that need to capture non-linear relationships
Example: House price = Area + Area² (diminishing returns)

3.2 Domain-Specific Interactions

Python

import numpy as np

# Ratio features (a zero denominator becomes NaN instead of producing inf)
df['price_per_sqft'] = df['price'] / df['sqft'].replace(0, np.nan)
df['income_to_loan'] = df['income'] / (df['loan_amount'] + 1)

# Sum/Difference
df['total_income'] = df['applicant_income'] + df['spouse_income']
df['income_diff'] = df['applicant_income'] - df['spouse_income']

# Multiplication
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# Binning interactions (astype(str) also handles categorical columns from pd.cut)
df['age_income_group'] = df['age_group'].astype(str) + '_' + df['income_group'].astype(str)

4 Text Features (NLP)

4.1 Basic Stats

Python

import numpy as np

# Word count
df['word_count'] = df['text'].apply(lambda x: len(str(x).split()))

# Character count
df['char_count'] = df['text'].str.len()

# Average word length (approximate: char_count includes spaces;
# a zero word count becomes NaN instead of dividing by zero)
df['avg_word_len'] = df['char_count'] / df['word_count'].replace(0, np.nan)

# Number of uppercase words
df['uppercase_count'] = df['text'].apply(
    lambda x: sum(1 for w in str(x).split() if w.isupper())
)

4.2 TF-IDF (Term Frequency - Inverse Document Frequency)

Measures how important a word is to a document within a collection of documents

Python

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with top 100 features
tfidf = TfidfVectorizer(
    max_features=100,
    stop_words='english',
    ngram_range=(1, 2)  # Unigrams + Bigrams
)

X_text = tfidf.fit_transform(df['text'])

# Get feature names
feature_names = tfidf.get_feature_names_out()
print(f"Created {len(feature_names)} text features")

5 Feature Selection

5.1 Filter Methods (Statistical Tests)

Remove High Correlation:

Python

import numpy as np

# Correlation matrix (numeric columns only)
corr_matrix = df.corr(numeric_only=True).abs()

# Upper triangle
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Find features with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Dropping {len(to_drop)} highly correlated features")

df_filtered = df.drop(columns=to_drop)
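
The effect of the correlation filter is easiest to see on synthetic data where the redundancy is known in advance. In this sketch the columns `a`, `b`, and `c` are hypothetical: `b` is almost a copy of `a`, and `c` is independent noise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
    "c": rng.normal(size=200),                       # independent noise
})

corr_matrix = demo.corr().abs()
# Keep only entries above the diagonal so each pair is checked once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
```

Only `b` is dropped: using the upper triangle means that, for each highly correlated pair, the column that appears later is removed and the earlier one is kept.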

5.2 Wrapper Methods (RFE)

Recursive Feature Elimination - iteratively removes weakest features

Python

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Select top 10 features (X is assumed to be a DataFrame)
selector = RFE(
    estimator=LinearRegression(),
    n_features_to_select=10,
    step=1
)
selector.fit(X, y)

# Get selected features
selected_features = X.columns[selector.support_]
print(f"Selected features: {selected_features}")

# Transform data
X_selected = selector.transform(X)

5.3 Embedded Methods (Model-based)

Random Forest Feature Importance:

Python

from sklearn.ensemble import RandomForestClassifier

# Assumes X is a DataFrame and y holds class labels
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Select top features
top_features = importances.head(15)['feature'].values
X_important = X[top_features]

Lasso Regularization (L1):

Python

from sklearn.linear_model import LassoCV

# Lasso automatically selects features (sets weak coefficients to 0)
# Assumes a continuous target y; scale the features first so the penalty treats them equally
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X, y)

# Non-zero coefficients
selected_mask = lasso.coef_ != 0
selected_features = X.columns[selected_mask]
print(f"Lasso selected {len(selected_features)} features")

Task 7

📊 Exploratory Data Analysis (EDA)

1 EDA Workflow

Example

Load → Clean → Analyze → Visualize → Insights

2 Step-by-Step EDA

2.1 Load & Initial Exploration

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure
plt.style.use('ggplot')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

# Load data
df = pd.read_csv('data.csv')

# Quick inspection
print("=== Shape ===")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n=== First 5 Rows ===")
print(df.head())

print("\n=== Data Types ===")
print(df.dtypes)

print("\n=== Info ===")
df.info()  # info() prints directly and returns None

print("\n=== Summary Statistics ===")
print(df.describe())

2.2 Missing Values Analysis

Python

# Missing count & percentage
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing = missing[missing['count'] > 0].sort_values('count', ascending=False)
print(missing)

# Visualize missing
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Missing-data pattern matrix (requires the third-party missingno package)
import missingno as msno
msno.matrix(df)
plt.show()

2.3 Univariate Analysis

Numerical Features:

Python

# Distribution plots for all numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df[col].hist(bins=30, ax=axes[idx], edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)

plt.tight_layout()
plt.show()

# Boxplots for outliers
fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()

Categorical Features:

Python

categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    plt.figure(figsize=(10, 5))

    # Value counts
    value_counts = df[col].value_counts()

    # Bar plot
    value_counts.plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # Print stats
    print(f"\n=== {col} ===")
    print(f"Unique values: {df[col].nunique()}")
    print(value_counts.head(10))

2.4 Bivariate Analysis

📖 Definitions - Statistical Measures

Term	Symbol	Formula	Meaning	Range	Example
Variance (Phương sai)	σ²	$\frac{1}{n}\sum_i (x_i - \mu)^2$	How spread out the data is around the mean	0 to ∞	High variance = data is spread out
Standard Deviation (Độ lệch chuẩn)	σ	$\sqrt{\frac{1}{n}\sum_i (x_i - \mu)^2}$	Square root of the variance, in the same units as the data	0 to ∞	Age std = 10 means that, for roughly normal data, about 68% of ages lie within mean ± 10
Covariance (Hiệp phương sai)	Cov(X,Y)	$\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})$	Direction of the linear relationship between two variables	-∞ to ∞	Cov > 0: move together, Cov < 0: move in opposite directions
Correlation (Tương quan)	r or ρ	$\frac{Cov(X,Y)}{\sigma_X \sigma_Y}$	Normalized covariance	-1 to 1	r=1: perfect positive, r=-1: perfect negative, r=0: no linear relation
Mean (Trung bình)	μ or x̄	$\frac{1}{n}\sum_i x_i$	Central value of the dataset	Any	Mean salary = $60k
Median (Trung vị)	Q₂	Middle value when sorted	The value in the middle, robust to outliers	Any	Median salary (50th percentile)
Mode (Yếu vị)	-	Most frequent value	The value that occurs most often	Any	Mode color = "Blue"
Quantile (Phân vị)	Q	Value below which a given % of the data falls	Splits the data into parts	Any	Q1 (25%), Q2 (50%), Q3 (75%)
IQR (Khoảng tứ phân vị)	-	$Q_3 - Q_1$	Range of the middle 50% of the data	0 to ∞	IQR = 20 means the middle 50% spans 20 units
Z-score	z	$\frac{x - \mu}{\sigma}$	Number of standard deviations from the mean	-∞ to ∞	z=2 means 2 std above the mean
Skewness (Độ xiên)	γ₁	$\frac{E[(X-\mu)^3]}{\sigma^3}$	Asymmetry of the distribution	-∞ to ∞	>0: right-skewed, <0: left-skewed, 0: symmetric
Kurtosis (Độ nhọn)	γ₂	$\frac{E[(X-\mu)^4]}{\sigma^4} - 3$	Tailedness of the distribution (excess kurtosis)	-2 to ∞	>0: heavy tails (outliers), <0: light tails

💡 Relationships:

Variance = Std² → Std = √Variance
Correlation = normalized covariance → correlation is unitless and comparable across variable pairs
IQR is more robust than Std when outliers are present → use IQR for outlier detection
Z-score standardizes the data → lets you compare values from different distributions
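
The first two relationships can be checked numerically. The sketch below uses synthetic data (the variables `x` and `y` are hypothetical) and the population formulas with 1/n, the same convention as the table.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=50, scale=10, size=500)
y = 3 * x + rng.normal(scale=5, size=500)

var_x = x.var()                                    # population variance (1/n)
std_x, std_y = x.std(), y.std()
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance (1/n)
corr_xy = cov_xy / (std_x * std_y)                 # normalized covariance

print(round(var_x, 2), round(std_x ** 2, 2))
print(round(corr_xy, 4), round(np.corrcoef(x, y)[0, 1], 4))
```

Both printed pairs agree: the variance equals the squared standard deviation, and dividing the covariance by the two standard deviations reproduces the correlation coefficient that NumPy computes directly.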

🎯 When to use each:

Python

# Variance & Std: measure spread (pandas uses the sample formula, ddof=1, by default)
print(f"Age variance: {df['age'].var()}")
print(f"Age std: {df['age'].std()}")

# Correlation: find relationships (numeric columns only)
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix['target'])

# IQR: detect outliers
Q1 = df['salary'].quantile(0.25)
Q3 = df['salary'].quantile(0.75)
IQR = Q3 - Q1

# Z-score: standardize (NaN values are skipped instead of turning every score into NaN)
from scipy import stats
z_scores = stats.zscore(df['age'], nan_policy='omit')

Correlation Heatmap:

Python

# Correlation matrix
plt.figure(figsize=(12, 10))
corr = df[numerical_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# Top correlations with target
if 'target' in df.columns:
    target_corr = corr['target'].abs().sort_values(ascending=False)
    print("\n=== Top 10 Correlations with Target ===")
    print(target_corr.head(10))

Scatter Matrix (Pair Plot):

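Python

# Select top features for pair plot
from sklearn.feature_selection import SelectKBest, f_regression

if 'target' in df.columns:
    # Exclude the target itself from the candidate features and drop rows with NaN
    feature_cols = [c for c in numerical_cols if c != 'target']
    data = df[feature_cols + ['target']].dropna()
    selector = SelectKBest(f_regression, k=min(5, len(feature_cols)))
    selector.fit(data[feature_cols], data['target'])
    top_features = [c for c, keep in zip(feature_cols, selector.get_support()) if keep]

    # Pair plot
    sns.pairplot(data[top_features + ['target']], diag_kind='kde')
    plt.show()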

Categorical vs Target:

Python

for col in categorical_cols:
    # Only if a target column exists and there are not too many categories
    if 'target' in df.columns and df[col].nunique() < 10:
        plt.figure(figsize=(10, 6))

        # Group by and plot
        df.groupby(col)['target'].mean().sort_values().plot(kind='barh')
        plt.title(f'Average Target by {col}')
        plt.xlabel('Average Target')
        plt.tight_layout()
        plt.show()

2.5 Multivariate Analysis

Crosstab for 2 Categorical:

Python

# Example: Gender vs Department
if 'gender' in df.columns and 'department' in df.columns:
    ct = pd.crosstab(df['gender'], df['department'], normalize='index')
    ct.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title('Department Distribution by Gender')
    plt.ylabel('Proportion')
    plt.xticks(rotation=0)
    plt.legend(title='Department')
    plt.show()

3D Scatter:

Python

from mpl_toolkits.mplot3d import Axes3D

if len(numerical_cols) >= 3:
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    scatter = ax.scatter(df[numerical_cols[0]],
                        df[numerical_cols[1]],
                        df[numerical_cols[2]],
                        c=df['target'] if 'target' in df.columns else None,
                        cmap='viridis')

    ax.set_xlabel(numerical_cols[0])
    ax.set_ylabel(numerical_cols[1])
    ax.set_zlabel(numerical_cols[2])
    plt.colorbar(scatter)
    plt.show()

3 EDA Report Generation

Python

# Automated EDA with ydata-profiling (the package formerly named pandas-profiling)
from ydata_profiling import ProfileReport

# Generate comprehensive report
profile = ProfileReport(df, title='EDA Report', explorative=True)
profile.to_file("eda_report.html")

print("✓ EDA Report generated: eda_report.html")

Advanced Feature Engineering Techniques

1 Target Encoding (Mean Encoding)

Replace category with average target value - powerful but risky (overfitting)

Python

import numpy as np
from sklearn.model_selection import KFold

# Naive version: per-category mean of the target.
# Each row's own target leaks into its encoding, so this tends to overfit.
means = df.groupby('city')['target'].mean()
df['city_encoded'] = df['city'].map(means)

# With cross-validation to reduce overfitting
def target_encode_cv(df, col, target, n_folds=5):
    """Out-of-fold target encoding: each row is encoded with means from the other folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    df = df.copy()
    encoded = np.full(len(df), np.nan)
    global_mean = df[target].mean()

    for train_idx, val_idx in kf.split(df):
        # Calculate mean on the training folds
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Apply to the validation fold (positional indexing works for any index)
        encoded[val_idx] = df.iloc[val_idx][col].map(means).to_numpy()

    # Categories unseen in the training folds fall back to the global mean
    df[f'{col}_encoded'] = np.where(np.isnan(encoded), global_mean, encoded)
    return df

df = target_encode_cv(df, 'city', 'target')

2 Frequency Encoding

Replace category with its count/frequency

Python

# Frequency encoding
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)

# Normalized frequency (proportion)
freq_norm = df['color'].value_counts(normalize=True)
df['color_freq_norm'] = df['color'].map(freq_norm)

3 Binning (Discretization)

Convert continuous → categorical buckets

Python

# Fixed width bins
df['age_bin'] = pd.cut(df['age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Old', 'Very Old'])

# Quantile-based bins (equal frequency)
df['salary_quartile'] = pd.qcut(df['salary'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins
bins = [0, 18, 30, 50, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
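
The difference between fixed-width and quantile-based bins shows up clearly on skewed data. A minimal sketch with a hypothetical salary series containing one very large value:

```python
import numpy as np
import pandas as pd

# Hypothetical skewed salaries (in thousands)
salary = pd.Series([20, 22, 25, 27, 30, 32, 35, 200])

equal_width = pd.cut(salary, bins=4)   # 4 intervals of equal width
equal_freq = pd.qcut(salary, q=4)      # 4 intervals with equal counts

# Count the rows per bin, in bin order (empty bins are included)
width_counts = np.bincount(equal_width.cat.codes, minlength=4).tolist()
freq_counts = np.bincount(equal_freq.cat.codes, minlength=4).tolist()

print(width_counts)  # [7, 0, 0, 1]
print(freq_counts)   # [2, 2, 2, 2]
```

With `pd.cut`, the single large salary stretches the range so that almost everything lands in the first bin; `pd.qcut` keeps the bins balanced, at the cost of intervals with very different widths.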

Task 8

🏥 Real-World Example - Healthcare Patient Data

🎯 Practical problem: predicting diabetes

Context: a hospital has records for 768 patients with 8 features each. The goal is to build a model that predicts who is at risk of diabetes.

Dataset: Pima Indians Diabetes (loaded from a public CSV; this is not scikit-learn's built-in `load_diabetes` dataset, which is a different regression dataset)

📊 Step 1: Load and Initial Exploration

Description: load the data, understand its structure, and identify issues that need handling.

Code:

Python

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load Pima Indians Diabetes dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigree', 'Age', 'Outcome']
df = pd.read_csv(url, names=columns)

print("=== Dataset Info ===")
print(f"Shape: {df.shape}")
print(f"\nData types:\n{df.dtypes}")
print(f"\nFirst 5 rows:\n{df.head()}")
print(f"\nStatistical Summary:\n{df.describe()}")
print(f"\nMissing values:\n{df.isnull().sum()}")

Explanation:

768 patients, 8 features + 1 target (Outcome: 0=No diabetes, 1=Diabetes)
All features are numerical
No obvious missing values, but...

🔍 Step 2: Identify Hidden Missing Values

Description: in medical data, Glucose, BloodPressure, and BMI cannot be 0. These zeros are missing values encoded as 0!

Code:

Python

# Check for suspicious zeros
zero_columns = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

print("=== Zero Values (Suspicious Missing) ===")
for col in zero_columns:
    zero_count = (df[col] == 0).sum()
    zero_pct = zero_count / len(df) * 100
    print(f"{col}: {zero_count} zeros ({zero_pct:.1f}%)")

# Replace 0 with NaN
df[zero_columns] = df[zero_columns].replace(0, np.nan)

# Now check real missing
print("\n=== Real Missing Values ===")
print(df.isnull().sum())
print(f"\nMissing percentage:\n{df.isnull().sum() / len(df) * 100}")

# Visualize missing pattern (requires the third-party missingno package)
import missingno as msno
msno.matrix(df)
plt.title('Missing Data Pattern')
plt.show()

Output:

Example

=== Zero Values (Suspicious Missing) ===
Glucose: 5 zeros (0.7%)
BloodPressure: 35 zeros (4.6%)
SkinThickness: 227 zeros (29.6%)  ← Many!
Insulin: 374 zeros (48.7%)  ← VERY MANY!
BMI: 11 zeros (1.4%)

Explanation:

Insulin 48.7% missing → dropping the affected rows is not an option; either impute or drop the column
SkinThickness 29.6% → impute
Glucose, BloodPressure, BMI <5% → impute with the median

🛠️ Step 3: Handle Missing Values

Description: the strategy depends on the % missing and on domain knowledge.

Code:

Python

# Decision:
# - Insulin: drop the column (too many missing values)
# - Others: impute with the median (robust to outliers)

# Drop Insulin
df_clean = df.drop('Insulin', axis=1)

# Impute the remaining zero-coded columns with the median
# (for simplicity the imputer is fitted on all rows here;
#  to avoid leakage, fit it on the training split only)
impute_cols = [c for c in zero_columns if c != 'Insulin']
imputer = SimpleImputer(strategy='median')
df_clean[impute_cols] = imputer.fit_transform(df_clean[impute_cols])

print("=== After Imputation ===")
print(f"Shape: {df_clean.shape}")
print(f"Missing values: {df_clean.isnull().sum().sum()}")

Explanation:

Drop Insulin because: 48.7% missing, and this medical test is not always available
Median imputation because: medical data often contains outliers
Verify: 0 missing values after imputation

📊 Step 4: Exploratory Data Analysis

Code:

Python

# Correlation with target (the target's correlation with itself is dropped)
plt.figure(figsize=(10, 6))
corr = df_clean.corr()['Outcome'].drop('Outcome').sort_values(ascending=False)
corr.plot(kind='barh')
plt.title('Feature Correlation with Diabetes Outcome')
plt.xlabel('Correlation')
plt.tight_layout()
plt.show()

print("=== Top Correlations ===")
print(corr)

# Distribution analysis
fig, axes = plt.subplots(3, 3, figsize=(15, 12))
axes = axes.ravel()

for idx, col in enumerate(df_clean.columns[:-1]):
    # Histogram by outcome
    df_clean[df_clean['Outcome']==0][col].hist(ax=axes[idx], alpha=0.5,
                                                 label='No Diabetes', bins=30)
    df_clean[df_clean['Outcome']==1][col].hist(ax=axes[idx], alpha=0.5,
                                                 label='Diabetes', bins=30)
    axes[idx].set_title(f'{col} Distribution')
    axes[idx].legend()

plt.tight_layout()
plt.show()

Findings:

Glucose has the highest correlation (0.47) → strongest single predictor
BMI (0.29), Age (0.24) → moderate predictors
Pregnancies has a lower correlation → a candidate to drop during feature selection

🔍 Step 5: Outlier Detection

Code:

Python

# IQR method for each feature
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

print("=== Outlier Detection ===")
for col in df_clean.columns[:-1]:
    outliers, lower, upper = detect_outliers_iqr(df_clean, col)
    print(f"{col}: {len(outliers)} outliers ({len(outliers)/len(df_clean)*100:.1f}%)")
    print(f"  Range: [{lower:.1f}, {upper:.1f}]")

# Boxplots
fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.ravel()

for idx, col in enumerate(df_clean.columns[:-1]):
    df_clean.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'{col}')

plt.tight_layout()
plt.show()

Decision: keep the outliers because:

Medical data: outliers are often the important cases (e.g. severe diabetes)
RobustScaler computes its scaling statistics (median, IQR) robustly, so the outliers do not distort the scaling and need not be removed

⚙️ Step 6: Feature Scaling

Code:

Python

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Separate features and target
X = df_clean.drop('Outcome', axis=1)
y = df_clean['Outcome']

# Train-test split BEFORE scaling
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train diabetes rate: {y_train.mean():.2%}")
print(f"Test diabetes rate: {y_test.mean():.2%}")

# Use RobustScaler (because the data contains outliers)
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only, never fit on test

print("\n=== After Scaling ===")
print(f"Train mean: {X_train_scaled.mean(axis=0)}")
print(f"Train median: {np.median(X_train_scaled, axis=0)}")

Explanation:

RobustScaler is used instead of StandardScaler because the data contains outliers
stratify=y in train_test_split keeps the diabetes rate the same in train and test
fit_transform on train, transform only on test
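To see the effect of stratify in isolation, here is a minimal sketch on made-up labels. The 35% positive rate and the placeholder feature X_demo are assumptions for illustration only, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 35 positives, 65 negatives (not the real dataset)
y_demo = np.array([1] * 35 + [0] * 65)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both parts keep the overall class ratio
print(f"Overall positive rate: {y_demo.mean():.2%}")
print(f"Train positive rate:   {y_tr.mean():.2%}")
print(f"Test positive rate:    {y_te.mean():.2%}")
```

Without stratify the test rate can drift away from the overall rate, which matters most on small or imbalanced datasets.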

🎯 Step 7: Train Model và Evaluate

Code:

Python

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Train model
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X_train_scaled, y_train)

# Predictions
y_pred = model.predict(X_test_scaled)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]

# Evaluate
print("=== Model Performance ===")
print(f"Accuracy: {model.score(X_test_scaled, y_test):.2%}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.3f}")
print(f"\nClassification Report:\n{classification_report(y_test, y_pred)}")

# Confusion Matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d',
            cmap='Blues', cbar=False, ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
ax.set_title('Confusion Matrix')
plt.show()

# Feature importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_[0]
}).sort_values('Coefficient', key=abs, ascending=False)

print("\n=== Feature Importance ===")
print(feature_importance)

plt.figure(figsize=(10, 6))
plt.barh(feature_importance['Feature'], feature_importance['Coefficient'])
plt.xlabel('Coefficient')
plt.title('Feature Importance (Logistic Regression)')
plt.tight_layout()
plt.show()

Results:

Example

Accuracy: 77%
ROC-AUC: 0.835

Top Features:
1. Glucose (0.89) ← Strongest predictor
2. BMI (0.45)
3. Age (0.32)

✅ Summary - Preprocessing Impact

🎯 Preprocessing steps applied:

✅ Identified hidden missing values (0 values) → better data quality
✅ Dropped the high-missing column (Insulin, 48%) → less noise
✅ Median imputation for the important features → data retained
✅ Kept outliers (severe cases) → the model can learn edge cases
✅ RobustScaler → handles outliers well
✅ Stratified split → same class balance in train and test
✅ Proper scaling workflow → avoids data leakage

📊 Impact:

Without preprocessing: Accuracy ~65%, many warnings
With preprocessing: Accuracy 77%, ROC-AUC 0.835
Improvement: +12 percentage points of accuracy!

💡 Key Learnings:

Domain knowledge helps identify hidden missing values (0 values)
Dropping outliers is not always right → medical data often needs them
RobustScaler is usually a better choice than StandardScaler when there are outliers
A proper train-test split workflow is IMPORTANT!

Task 9

📝 Quiz - Knowledge Check

TB5 min

Question 1: Missing values make up 15% of a numerical column. Which method is BEST?

Answer: Fill with the mean or the median (depending on whether there are outliers)

Explanation:

15% is not so much that the column has to be dropped
Mean is suitable if the distribution is normal and has no outliers
Median is better if there are outliers (robust)
Mode is only for categorical data
Drop rows only when <5% is missing

Question 2: When should you NOT use mean imputation?

Answer: When there are outliers or the data is not normally distributed

Explanation:

The mean is heavily affected by outliers
Mean imputation reduces the variance of the data → can bias the model
Example: salaries [50k, 55k, 60k, 500k] → Mean ≈ 166k (not representative)
→ Use Median = 57.5k (better)

💡 When mean imputation is fine:

Data is normally distributed
No outliers
Missing <30%
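The salary example can be checked directly. This is a small sketch that reuses the four salaries from the explanation and adds one missing entry (an assumption made here so that there is something to impute).

```python
import numpy as np
import pandas as pd

# Salaries from the example above, plus one missing entry to fill
salaries = pd.Series([50_000, 55_000, 60_000, 500_000, np.nan])

mean_value = salaries.mean()      # pulled up by the 500k outlier
median_value = salaries.median()  # stays near the typical salary

mean_filled = salaries.fillna(mean_value)
median_filled = salaries.fillna(median_value)

print(f"Mean:   {mean_value:,.0f}")
print(f"Median: {median_value:,.0f}")
# Filling with the mean adds a point exactly at the centre, so the spread shrinks
print(f"Std before fill:     {salaries.std():,.0f}")
print(f"Std after mean fill: {mean_filled.std():,.0f}")
```

The last two lines illustrate the "reduces variance" point: every value filled with the mean sits exactly at the centre of the distribution.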

Question 3: What does a Z-score > 3 mean? How do you handle it?

Answer: An outlier (more than 3 standard deviations from the mean). Options: Cap/Remove/Transform

Explanation:

Z-score = (x - mean) / std
|Z| > 3: only about 0.3% of normally distributed data, very rare
|Z| > 2: ~5% of the data

How to handle it:

Cap (Winsorize): limit values to the 1st-99th percentile
Remove: delete if it is certainly an error
Transform: Log/Sqrt to reduce the impact
Keep: if the value is real and important
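The detection rule and the handling options can be sketched as follows. The data is synthetic (200 normal values plus one injected extreme value of 120), chosen only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic data: 200 values around 50, plus one injected extreme value
values = pd.Series(np.append(rng.normal(50, 5, 200), [120.0]))

# Detect: Z-score = (x - mean) / std, flag |Z| > 3
z = (values - values.mean()) / values.std()
outlier_mask = z.abs() > 3
print(f"Flagged as outliers: {outlier_mask.sum()}")

# Cap (winsorize) at the 1st and 99th percentiles
low, high = values.quantile([0.01, 0.99])
capped = values.clip(lower=low, upper=high)

# Transform: log1p compresses large values (requires non-negative data)
log_transformed = np.log1p(values)

# Remove: only if the flagged values are known to be errors
removed = values[~outlier_mask]

print(f"Max before: {values.max():.1f}, after capping: {capped.max():.1f}")
```

Capping and transforming keep every row, while removing shortens the series, which is why removal is reserved for values that are clearly errors.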

Question 4: One-Hot Encoding vs Label Encoding - when to use which?

Answer:

One-Hot: nominal categorical (no order) - Color, City
Label: ordinal categorical (ordered) - Education Level, Size

Explanation:

One-Hot:

Color: [Red, Blue] → [1,0], [0,1]
The model does not learn a false order
Increases the number of features (curse of dimensionality)

Label:

Education: [High School, Bachelor, Master] → [1, 2, 3]
Has a natural order
Does not add dimensions

❌ WRONG: Label encoding Color → Red=1, Blue=2 → the model thinks Red < Blue

Question 5: Scale before or after the train-test split?

Answer: After the train-test split! Fit on train, transform on test.

Explanation:

✅ CORRECT:

Python

X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # transform only!

❌ WRONG:

Python

X_scaled = scaler.fit_transform(X)  # Data leakage!
X_train, X_test = train_test_split(X_scaled)

Reasons:

fit_transform on the whole dataset → test data leaks into training
The scaler has seen the test set's statistics → overly optimistic accuracy
In production there is no test data to fit on

Question 6: StandardScaler vs MinMaxScaler - what is the difference?

Answer:

StandardScaler: Mean=0, Std=1 (Z-score normalization)
MinMaxScaler: scales into the [0, 1] range

When to use:

StandardScaler:

Data with few outliers
Distribution close to normal
Algorithms: Linear Regression, Logistic Regression, SVM, Neural Networks

MinMaxScaler:

A bounded range (0-1) is needed
Algorithms: Neural Networks (activation functions), KNN
⚠️ Sensitive to outliers

RobustScaler:

Data with many outliers
Uses the median and IQR instead of mean/std
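A quick way to compare the three scalers is to run them on the same column. The six values below are made up, with the last one acting as an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up column: five typical values and one outlier
x = np.array([[10.0], [11.0], [12.0], [12.0], [13.0], [100.0]])

scaled = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled[name] = scaler.fit_transform(x).ravel()
    print(f"{name:>14}: {np.round(scaled[name], 2)}")
```

With MinMaxScaler the outlier takes the value 1 and the five typical values are squeezed close to 0. RobustScaler centres on the median and divides by the IQR, so the typical values keep a usable spread and only the outlier lands far away.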

Question 7: Two features have a correlation of 0.95. What should you do?

Answer: Drop one of the two features (multicollinearity)

Explanation:

Correlation > 0.9: severe multicollinearity
The two features carry almost duplicate information
Effects:
- Linear models: coefficients become unstable
- Feature importance is misleading
- Overfitting
- Longer training time

How to handle it:

Drop the less important feature
Use PCA to combine features
Regularization (L1/L2)

💡 Thresholds:

|r| > 0.9: Drop
|r| > 0.7: Consider
|r| < 0.5: OK
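One common way to apply the 0.9 threshold is to scan the upper triangle of the absolute correlation matrix. The sketch below uses synthetic columns (height_cm, height_in, age) that are assumptions made for illustration; in practice, which feature of a pair to drop should also depend on domain knowledge.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 200
height_cm = rng.normal(170, 10, n)

# Synthetic data: height_in is almost a copy of height_cm in other units
df_demo = pd.DataFrame({
    "height_cm": height_cm,
    "height_in": height_cm / 2.54 + rng.normal(0, 0.1, n),
    "age": rng.normal(40, 12, n),
})

# Absolute correlations, keeping only the upper triangle so each pair appears once
corr = df_demo.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair with |r| > 0.9
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df_demo.drop(columns=to_drop)

print("Dropped:", to_drop)
print("Remaining:", list(df_reduced.columns))
```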

Question 8: Feature engineering for datetime - which features should you create?

Answer: Year, Month, Day, DayOfWeek, Hour, IsWeekend, Quarter, Season

Code example:

Python

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
df['hour'] = df['date'].dt.hour
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['quarter'] = df['date'].dt.quarter

Advanced:

Cyclic encoding (sin/cos) for hour, month
Time since event (days_since_signup)
Seasonality features
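Two of the advanced ideas can be sketched briefly. The timestamps and the signup date below are hypothetical values chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical timestamps: 23:00, midnight, and noon
ts = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2023-01-01 23:00", "2023-01-02 00:00", "2023-01-02 12:00"]
    )
})

# Cyclic encoding: map the hour onto a circle so 23:00 and 00:00 end up close
ts["hour"] = ts["timestamp"].dt.hour
ts["hour_sin"] = np.sin(2 * np.pi * ts["hour"] / 24)
ts["hour_cos"] = np.cos(2 * np.pi * ts["hour"] / 24)

def circular_distance(i, j):
    """Euclidean distance between rows i and j in (sin, cos) space."""
    return float(np.hypot(
        ts.loc[i, "hour_sin"] - ts.loc[j, "hour_sin"],
        ts.loc[i, "hour_cos"] - ts.loc[j, "hour_cos"],
    ))

print(f"23:00 vs 00:00: {circular_distance(0, 1):.3f}")
print(f"00:00 vs 12:00: {circular_distance(1, 2):.3f}")

# Time since event: whole days since a hypothetical signup date
signup_date = pd.Timestamp("2022-12-25")
ts["days_since_signup"] = (ts["timestamp"] - signup_date).dt.days
print(ts[["timestamp", "days_since_signup"]])
```

With the raw hour, 23 and 0 look 23 units apart. In (sin, cos) space they are neighbours, which is the reason for encoding periodic features this way.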

Question 9: How many features does a degree=2 polynomial expansion produce?

Answer: From n features → n(n+1)/2 + n features (not counting the constant bias column)

Explanation:

Original: [x1, x2, x3]

Polynomial degree=2:

Original: x1, x2, x3 (3)
Squares: x1², x2², x3² (3)
Interactions: x1×x2, x1×x3, x2×x3 (3)
Total: 3 + 3 + 3 = 9 features

Formula: n=3 → 3(3+1)/2 + 3 = 6 + 3 = 9

⚠️ Be careful:

n=10 → 65 features
n=20 → 230 features
Curse of dimensionality!
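The counts above can be verified with scikit-learn's PolynomialFeatures. A minimal sketch follows; include_bias=False is set so that the constant column, which scikit-learn adds by default, is not counted.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def expected_count(n):
    """Originals (n) plus squares and pairwise interactions (n(n+1)/2)."""
    return n + n * (n + 1) // 2

counts = {}
for n in (3, 10, 20):
    poly = PolynomialFeatures(degree=2, include_bias=False)
    # Only the number of input columns matters for the output width
    counts[n] = poly.fit_transform(np.zeros((1, n))).shape[1]
    print(f"n={n}: {counts[n]} features (formula: {expected_count(n)})")
```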

Question 10: What is data leakage? Give examples.

Answer: Information from the test set or from the future leaks into training → artificially high accuracy

Examples of data leakage:

Scaling before the split:

Python

X_scaled = scaler.fit_transform(X)  # Leak!
X_train, X_test = train_test_split(X_scaled)

Feature engineering on the whole dataset:

Python

df['mean_price'] = df.groupby('city')['price'].transform('mean')  # Leak!

Including target-related features:
- Predicting credit default while using a feature such as "payment_after_default"

Temporal leakage:
- Train data: 2020-2022
- Test data: 2019 → the model learns from the future!

💡 Prevention:

Fit on train only
Do feature engineering separately for train and test
Validate with a time-based split
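As a sketch of "fit on train only" for the per-city mean feature, the statistic can be computed on the training rows and then mapped onto the test rows. The toy DataFrame and the fallback to the global training mean for unseen cities are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (hypothetical values)
df_demo = pd.DataFrame({
    "city": ["HN", "HN", "HCM", "HCM", "DN", "DN", "HN", "HCM"],
    "price": [100, 120, 200, 220, 150, 170, 110, 210],
})

train, test = train_test_split(df_demo, test_size=0.25, random_state=0)
train, test = train.copy(), test.copy()

# Learn the statistic from the training rows only
city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Apply the learned mapping to both parts; unseen cities fall back to the global mean
train["mean_price"] = train["city"].map(city_means)
test["mean_price"] = test["city"].map(city_means).fillna(global_mean)

print(test)
```

If price is itself the prediction target, each training row's own price still contributes to its city mean, so stricter schemes such as out-of-fold means are often preferred. The sketch only shows how to keep test rows out of the computation.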

💪 Hands-on Exercises

Exercise 1: Handling Missing Values - Titanic Dataset

Task: Load the Titanic dataset and handle the missing values in the Age and Embarked columns.

Template Code:

Python

import pandas as pd
import seaborn as sns

# Load data
df = sns.load_dataset('titanic')

# TODO:
# 1. Check missing values
# 2. Visualize the missing pattern
# 3. Fill Age with the median per Pclass
# 4. Fill Embarked with the mode
# 5. Verify that nothing is missing anymore

Solution:

Python

# 1. Check missing
print(df.isnull().sum())
print(f"Age missing: {df['age'].isnull().sum()} ({df['age'].isnull().mean()*100:.1f}%)")

# 2. Visualize
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Values Pattern')
plt.show()

# 3. Fill Age by Pclass (higher class → older)
df['age'] = df.groupby('pclass')['age'].transform(lambda x: x.fillna(x.median()))

# 4. Fill Embarked with mode
# Assign the result back: fillna(inplace=True) on a selected column is
# deprecated in recent pandas versions and may not modify df
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# 5. Verify
print("\nAfter imputation:")
print(df[['age', 'embarked']].isnull().sum())

💡 What you learned:

Group-based imputation (filling per group)
Mode for categorical columns
Visualizing missing patterns

Exercise 2: Detecting and Handling Outliers

Task: Create a house-price dataset, detect outliers with the IQR method, and handle them by capping.

Template:

Python

import numpy as np
import pandas as pd

# Create data with outliers
np.random.seed(42)
prices = np.concatenate([
    np.random.normal(300000, 50000, 95),  # Normal prices
    np.array([1000000, 1200000, 1500000, 2000000, 2500000])  # Outliers
])
df = pd.DataFrame({'price': prices})

# TODO:
# 1. Calculate IQR and bounds
# 2. Identify outliers
# 3. Cap outliers at Q1-1.5*IQR and Q3+1.5*IQR
# 4. Compare before/after distribution

Solution:

Python

# 1. IQR method
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"IQR: {IQR:,.0f}")
print(f"Bounds: [{lower_bound:,.0f}, {upper_bound:,.0f}]")

# 2. Identify outliers
outliers = df[(df['price'] < lower_bound) | (df['price'] > upper_bound)]
print(f"\nOutliers: {len(outliers)} ({len(outliers)/len(df)*100:.1f}%)")
print(outliers['price'].values)

# 3. Cap outliers
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)

# 4. Compare
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].boxplot(df['price'])
axes[0].set_title('Before Capping')
axes[0].set_ylabel('Price')

axes[1].boxplot(df['price_capped'])
axes[1].set_title('After Capping')

plt.tight_layout()
plt.show()

print(f"\nBefore: Mean={df['price'].mean():,.0f}, Std={df['price'].std():,.0f}")
print(f"After: Mean={df['price_capped'].mean():,.0f}, Std={df['price_capped'].std():,.0f}")

💡 What you learned:

The IQR method is more robust than the Z-score
Capping keeps all data points
Visualizing the impact of capping outliers

Exercise 3: Encoding Categorical Variables

Task: Encode the categorical features of a dataset so that it is ready for model training.

Template:

Python

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Sample data
data = {
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'M', 'L'],
    'education': ['High School', 'Bachelor', 'Master', 'Bachelor', 'High School'],
    'price': [100, 200, 300, 150, 250]
}
df = pd.DataFrame(data)

# TODO:
# 1. Label encode 'size' and 'education' (ordinal)
# 2. One-hot encode 'color' (nominal)
# 3. Repeat both encodings with a ColumnTransformer

Solution:

Python

# 1. Label Encoding (ordinal)
size_mapping = {'S': 1, 'M': 2, 'L': 3}
df['size_encoded'] = df['size'].map(size_mapping)

education_mapping = {'High School': 1, 'Bachelor': 2, 'Master': 3}
df['education_encoded'] = df['education'].map(education_mapping)

# 2. One-Hot Encoding (nominal)
df_onehot = pd.get_dummies(df, columns=['color'], prefix='color')

print("After encoding:")
print(df_onehot)

# 3. Using ColumnTransformer (Production way)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Note: sparse_output requires scikit-learn >= 1.2 (older versions use sparse=False)
ct = ColumnTransformer([
    ('onehot', OneHotEncoder(sparse_output=False), ['color']),
    ('ordinal_size', OrdinalEncoder(categories=[['S', 'M', 'L']]), ['size']),
    ('ordinal_edu', OrdinalEncoder(categories=[['High School', 'Bachelor', 'Master']]), ['education'])
], remainder='passthrough')

X_encoded = ct.fit_transform(df[['color', 'size', 'education', 'price']])
print("\nEncoded features shape:", X_encoded.shape)
print(X_encoded)

💡 What you learned:

Distinguishing nominal vs ordinal
Manual mapping vs sklearn
ColumnTransformer for pipelines

Exercise 4: Feature Engineering - DateTime

Task: Create datetime features from a timestamp column to predict sales.

Template:

Python

import numpy as np
import pandas as pd

# Create sales data with datetime
np.random.seed(42)  # reproducible random sales
dates = pd.date_range('2023-01-01', '2023-12-31', freq='D')
sales = np.random.randint(100, 1000, len(dates))
df = pd.DataFrame({'date': dates, 'sales': sales})

# TODO:
# 1. Extract: year, month, day, dayofweek (the data is daily, so hour is always 0)
# 2. Create: is_weekend, is_month_start, is_month_end
# 3. Cyclic encoding for month (sin/cos)
# 4. Visualize sales pattern by dayofweek

Solution:

Python

# 1. Basic extraction
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # 0=Monday, 6=Sunday
df['dayofweek_name'] = df['date'].dt.day_name()

# 2. Boolean features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['quarter'] = df['date'].dt.quarter

# 3. Cyclic encoding (month)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# 4. Visualize
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sales by day of week (reindex so bars follow Monday..Sunday, not alphabetical order)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df.groupby('dayofweek_name')['sales'].mean().reindex(day_order).plot(kind='bar', ax=axes[0,0])
axes[0,0].set_title('Average Sales by Day of Week')
axes[0,0].set_ylabel('Sales')

# Weekend vs Weekday
df.groupby('is_weekend')['sales'].mean().plot(kind='bar', ax=axes[0,1])
axes[0,1].set_title('Weekend vs Weekday Sales')
axes[0,1].set_xticks([0, 1])
axes[0,1].set_xticklabels(['Weekday', 'Weekend'])

# Monthly pattern
df.groupby('month')['sales'].mean().plot(kind='line', ax=axes[1,0])
axes[1,0].set_title('Average Sales by Month')
axes[1,0].set_xlabel('Month')

# Cyclic encoding visualization
axes[1,1].scatter(df['month_sin'], df['month_cos'], c=df['month'], cmap='hsv')
axes[1,1].set_title('Cyclic Encoding (Month)')
axes[1,1].set_xlabel('sin(month)')
axes[1,1].set_ylabel('cos(month)')

plt.tight_layout()
plt.show()

print("Created features:")
print(df[['date', 'month', 'month_sin', 'month_cos', 'is_weekend']].head(10))

💡 What you learned:

Comprehensive datetime extraction
Cyclic encoding for periodic features
Pattern analysis with visualization

Exercise 5: Complete Data Pipeline

Task: Build a complete data-processing pipeline from raw data to model-ready data.

Template:

Python

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# TODO: Build pipeline with:
# 1. Impute missing
# 2. Encode categorical
# 3. Scale numerical
# 4. Train model

Solution:

Python

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Sample data
data = {
    'age': [25, 30, np.nan, 35, 40],
    'salary': [50000, 60000, 75000, np.nan, 90000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'HR'],
    'city': ['HN', 'HCM', 'HN', 'DN', 'HCM'],
    'performance': [1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)

# Separate features and target
X = df.drop('performance', axis=1)
y = df['performance']

# Define column types
numeric_features = ['age', 'salary']
categorical_features = ['department', 'city']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
# handle_unknown='ignore' avoids an error if the test set contains an unseen category
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False))
])

# Combined preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline with model
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit pipeline (imputer, encoder, and scaler are all fitted on train only)
full_pipeline.fit(X_train, y_train)

# Predict
y_pred = full_pipeline.predict(X_test)

print("Pipeline steps:", full_pipeline.steps)
# With only 5 rows the test set holds a single sample, so accuracy is either 0 or 1
print(f"Accuracy: {full_pipeline.score(X_test, y_test)}")

# Inspect the preprocessed data using the preprocessor already fitted inside the pipeline
fitted_preprocessor = full_pipeline.named_steps['preprocessor']
X_train_processed = fitted_preprocessor.transform(X_train)
print(f"\nOriginal features: {X_train.shape}")
print(f"Processed features: {X_train_processed.shape}")
print(f"Feature names: {fitted_preprocessor.get_feature_names_out()}")

💡 What you learned:

A Pipeline automates the entire preprocessing flow
ColumnTransformer for mixed data types
fit_transform on train, transform on test
A reusable pipeline for production

Task 10

📝 Self-Practice Exercises

TB5 min

Exercise 1: Load the Titanic dataset and handle the missing values in Age and Embarked
Exercise 2: Detect and handle outliers in the Fare column
Exercise 3: One-hot encode the Sex and Embarked columns, then train a model
Exercise 4: Create datetime features from a date column (year, month, dayofweek, is_weekend)
Exercise 5: Implement a complete EDA workflow for a new dataset
Exercise 6: Create polynomial features (degree 2) and compare model performance
Exercise 7: Encode cyclic features (hour, month) with sin/cos
Exercise 8: Perform feature selection with 3 methods: Filter, Wrapper, Embedded

Task 11

📝 Summary

TB5 min

✅ What you have learned

1️⃣ Handling Missing Values

✅ Detecting and visualizing missing patterns
✅ Imputation methods: Mean, Median, Mode, Forward/Backward Fill
✅ When to drop vs fill
✅ MCAR vs MAR vs MNAR

2️⃣ Detecting and Handling Outliers

✅ IQR method (robust, recommended)
✅ Z-score method (for normal distributions)
✅ Capping, Removing, Transformation
✅ When to keep outliers

3️⃣ Encoding Categorical Variables

✅ One-Hot Encoding for nominal data (Color, City)
✅ Label Encoding for ordinal data (Education, Size)
✅ Target Encoding, Frequency Encoding
✅ Avoiding the dummy variable trap

4️⃣ Feature Scaling

✅ StandardScaler (mean=0, std=1)
✅ MinMaxScaler ([0, 1] range)
✅ RobustScaler (for outliers)
✅ IMPORTANT: Fit on train, transform on test

5️⃣ Feature Engineering

✅ Polynomial features to learn non-linear patterns
✅ Interaction features (A × B)
✅ DateTime extraction (year, month, dayofweek, is_weekend)
✅ Cyclic encoding (sin/cos for periodic features)
✅ Binning/Discretization

6️⃣ Exploratory Data Analysis

✅ Statistical summary (describe, info)
✅ Correlation analysis (heatmap, pairplot)
✅ Distribution analysis (histogram, boxplot)
✅ Discovering patterns and relationships

🔑 Key Takeaways

✅ ALWAYS REMEMBER:

• Garbage In, Garbage Out
• Do EDA before preprocessing
• Fit on train, transform on test
• Check for data leakage
• Visualize to understand the data

⚠️ AVOID:

• Scaling before the train-test split
• Label encoding nominal variables
• Dropping outliers hastily
• Mean imputation when there are outliers
• Ignoring missing data patterns

📊 Preprocessing Workflow Checklist

Preprocessing Workflow: 1️⃣ Load Data → 2️⃣ Initial Exploration → 3️⃣ Handle Missing Values → 4️⃣ Handle Outliers → 5️⃣ Encode Categorical → 6️⃣ Feature Engineering → 7️⃣ Feature Selection → 8️⃣ Train-Test Split → 9️⃣ Feature Scaling → 🔟 Build Model

Step-by-step details

Load Data: df = pd.read_csv('data.csv')
Exploration: df.info(), df.describe(), df.isnull().sum()
Missing Values: Visualize → Decide Drop/Fill → Impute (mean/median/mode)
Outliers: Detect (IQR, Z-score) → Visualize (boxplot) → Treat (cap/remove/transform)
Encode: Nominal → One-hot, Ordinal → Label/Ordinal encoding
Feature Engineering: Datetime, interactions, polynomial, binning
Feature Selection: Remove high correlation (>0.9), low variance, domain knowledge
Train-Test Split: train_test_split(...) → 70-80% train, 10-15% val, 10-15% test
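Scaling: scaler.fit(X_train), then transform(X_train) and transform(X_test). Never fit on test! The same rule applies to imputers and encoders, which is why wrapping them in a Pipeline is convenient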
Build Model: model.fit(X_train_scaled, y_train)

💡 Next Steps

What to learn next:

Feature Selection - choosing the important features
Dimensionality Reduction - PCA, t-SNE
Handling Imbalanced Data - SMOTE, undersampling
Pipeline & Automation - sklearn Pipeline
Model Training - Supervised Learning algorithms

Practice:

Kaggle competitions: Titanic, House Prices
Real datasets: UCI ML Repository
Personal projects with real-world data

Checkpoint

Have you mastered these data-processing techniques?

References

Source	Link
Pandas Documentation	pandas.pydata.org
Scikit-learn Preprocessing	scikit-learn.org
Python Data Science Handbook	jakevdp.github.io
Seaborn Tutorial	seaborn.pydata.org

Self-check questions

What methods are there for handling Missing Values? When should you impute with the mean, the median, or the mode?
Distinguish One-Hot Encoding from Label Encoding: when should each be used?
Why must the scaler be fit on the train data and only then applied to the test data (rather than fit on the whole dataset)?
How does EDA (Exploratory Data Analysis) help when building an ML model?

🎉 Great job! You have completed the lesson Data Processing with Pandas!

Next: Linear Regression, the first Supervised Learning algorithm!

Task 12