Feature Engineering: Creating Features for ML

🎯 Learning Objectives

After this lesson, you will be able to:

✅ Create new features from existing data (mathematical, datetime, text, aggregation)

✅ Encode categorical variables (Label, One-Hot, Ordinal, Target Encoding)

✅ Scale numeric variables (Standard, MinMax, Robust)

✅ Transform distributions (Log, Box-Cox, Binning)

✅ Select important features (Filter, Wrapper, Embedded)

✅ Build a Feature Engineering Pipeline with Sklearn

Duration: 3 hours | Difficulty: Intermediate → Advanced | Prerequisite: Data Cleaning (Lesson 10)

📖 Key Terminology

Term	Description
Feature	An input variable for an ML model
Feature Creation	Deriving new features from the original data
Encoding	Converting categorical values to numeric
Scaling	Bringing numeric features onto a comparable range
One-Hot Encoding	Creating one 0/1 column per category
Target Encoding	Replacing each category with mean(target)
Binning	Converting a continuous variable into a categorical one
Feature Selection	Removing unimportant features
Data Leakage	Using test-set information during training, which gives misleading results
Pipeline	An automated chain of processing steps that helps avoid data leakage

Checkpoint

"Feature Engineering is the art of turning data into information." Andrew Ng considers this the most important skill of a Data Scientist!

🔢 Feature Creation: Building New Features

What is Feature Engineering? It is the art of creating new variables from existing data so that an ML model performs better. Example: from the price and quantity columns, create revenue = price × quantity. Good features often matter more than a complex model!

Andrew Ng: "Applied ML is basically feature engineering." Feature Engineering is arguably the number one skill of a Data Scientist.

Mathematical Features

Python

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [10, 5, 8],
    'discount': [0.1, 0.2, 0.15],
    'cost': [80, 160, 120]
})

# Arithmetic
df['revenue'] = df['price'] * df['quantity']
df['profit'] = df['revenue'] - (df['cost'] * df['quantity'])
df['profit_margin'] = df['profit'] / df['revenue']
df['final_price'] = df['price'] * (1 - df['discount'])

# Ratios
df['cost_ratio'] = df['cost'] / df['price']

# Statistical
df['price_zscore'] = (df['price'] - df['price'].mean()) / df['price'].std()
df['price_percentile'] = df['price'].rank(pct=True)

# Log transform (for skewed data)
df['price_log'] = np.log1p(df['price'])

Date/Time Features

Python

df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D')
})

# Basic components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek    # Mon=0, Sun=6
df['quarter'] = df['date'].dt.quarter
df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)

# Boolean
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)

# Cyclical encoding (important for ML: keeps December next to January)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

# Date differences
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
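Why the sin/cos pair matters can be checked numerically. The sketch below is an illustration only: the helpers `encode_cyclical` and `cyclical_distance` are hypothetical names, not part of the lesson. On the raw scale December (12) and January (1) sit 11 units apart, while on the circle they are exactly as close as January and February.

```python
import numpy as np

def encode_cyclical(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

def cyclical_distance(a, b, period):
    """Euclidean distance between two values after sin/cos encoding."""
    diff = encode_cyclical(a, period) - encode_cyclical(b, period)
    return float(np.linalg.norm(diff))

raw_gap_dec_jan = abs(12 - 1)                     # 11 units on the raw month scale
cyc_gap_dec_jan = cyclical_distance(12, 1, 12)    # one month step on the circle
cyc_gap_jan_feb = cyclical_distance(1, 2, 12)     # another one month step
cyc_gap_jan_jul = cyclical_distance(1, 7, 12)     # opposite sides of the year

print(raw_gap_dec_jan, cyc_gap_dec_jan, cyc_gap_jan_feb, cyc_gap_jan_jul)
```

Both columns are needed: sin alone (or cos alone) maps two different months to the same value.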

Text Features

Python

df = pd.DataFrame({
    'text': ["This is GREAT product!",
             "Not good at all...",
             "Amazing quality, love it!!!"]
})

df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['exclamation_count'] = df['text'].str.count('!')
# Punctuation is counted as part of the words here
df['avg_word_length'] = df['text'].str.replace(' ', '').str.len() / df['word_count']

Aggregation Features

Python

orders = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'amount': [100, 200, 150, 300, 250, 500],
    'date': pd.date_range('2024-01-01', periods=6)
})

# Customer-level aggregations
customer_feats = orders.groupby('customer_id').agg(
    txn_count = ('amount', 'count'),
    total_amount = ('amount', 'sum'),
    avg_amount = ('amount', 'mean'),
    std_amount = ('amount', 'std'),   # NaN for customers with a single order
    max_amount = ('amount', 'max'),
).reset_index()

# Merge back
orders = orders.merge(customer_feats, on='customer_id', how='left')

Aggregation features are very powerful in Kaggle competitions. Pattern: group by entity → compute count, sum, mean, std, min, max → merge back.

Checkpoint

Create at least 5 new features for a sales dataset: revenue, profit_margin, is_weekend, month_sin, customer_total_orders.
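One possible way to approach this checkpoint is sketched below. The `sales` frame, its column names and its values are all assumptions made up for illustration; adapt them to your own dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sales data; column names and values are assumed for illustration
sales = pd.DataFrame({
    'customer_id': [1, 1, 2, 2, 3],
    'order_date': pd.to_datetime(['2024-01-05', '2024-01-06', '2024-02-10',
                                  '2024-03-17', '2024-12-01']),
    'price': [100.0, 200.0, 150.0, 80.0, 120.0],
    'quantity': [2, 1, 4, 5, 3],
    'unit_cost': [70.0, 150.0, 100.0, 60.0, 90.0],
})

# 1. Mathematical features
sales['revenue'] = sales['price'] * sales['quantity']
sales['profit_margin'] = (sales['price'] - sales['unit_cost']) / sales['price']

# 2. Datetime features (dayofweek: Mon=0, Sun=6)
sales['is_weekend'] = (sales['order_date'].dt.dayofweek >= 5).astype(int)
sales['month_sin'] = np.sin(2 * np.pi * sales['order_date'].dt.month / 12)

# 3. Aggregation feature: number of orders per customer
sales['customer_total_orders'] = (
    sales.groupby('customer_id')['customer_id'].transform('count')
)

print(sales[['revenue', 'profit_margin', 'is_weekend',
             'month_sin', 'customer_total_orders']])
```

Using `transform('count')` keeps the result aligned with the original rows, so no separate merge step is needed.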

🏷️ Encoding: Categorical Variables

Label Encoding and Manual Ordinal Mapping

Python

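import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small example frame (values assumed for illustration)
df = pd.DataFrame({'color': ['red', 'blue', 'green'],
                   'size': ['S', 'XL', 'M']})

# LabelEncoder assigns integer codes in sorted (alphabetical) order
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])
# blue→0, green→1, red→2

# Manual ordinal mapping preserves the real order of the categories
size_map = {'S': 1, 'M': 2, 'L': 3, 'XL': 4}
df['size_num'] = df['size'].map(size_map)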

One-Hot Encoding

Python

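import pandas as pd

# Example frame (values assumed for illustration)
df = pd.DataFrame({'city': ['Hanoi', 'Da Nang', 'Hanoi'],
                   'color': ['red', 'blue', 'green']})

# Pandas (simplest)
df_encoded = pd.get_dummies(df, columns=['city', 'color'], drop_first=True)

# Sklearn (sparse_output needs scikit-learn >= 1.2; older versions use sparse=False)
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['city']])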

Ordinal Encoding (ordered categories)

Python

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Example values, assumed for illustration
df = pd.DataFrame({'education': ['PhD', 'Bachelor', 'Master', 'High School']})

education_order = [['High School', 'Bachelor', 'Master', 'PhD']]
oe = OrdinalEncoder(categories=education_order)
df['edu_encoded'] = oe.fit_transform(df[['education']]).ravel()
# High School→0, Bachelor→1, Master→2, PhD→3

Target Encoding (high cardinality)

Python

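# Replace each category with mean(target) of its group
# (df is assumed to have 'city' and 'revenue' columns)
def target_encode(df, col, target):
    means = df.groupby(col)[target].mean()
    return df[col].map(means)

# Caution: means computed on rows that include validation/test data leak the target.
# Compute them on the training split only.
df['city_target'] = target_encode(df, 'city', 'revenue')

# Frequency Encoding
df['city_freq'] = df['city'].map(df['city'].value_counts())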
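The simple `target_encode` above encodes every row with a mean that includes that row's own target, which is a form of leakage. A common mitigation is out-of-fold encoding. The sketch below is one minimal way to do it, not a reference implementation: the function name `target_encode_oof`, the fallback to the global mean and the demo data are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5, random_state=42):
    """Out-of-fold target encoding: each row is encoded with group means
    computed from the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for fit_idx, enc_idx in kf.split(df):
        fold_means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df[col].iloc[enc_idx].map(fold_means).to_numpy()
    # Categories never seen in the fitting folds fall back to the global mean
    return encoded.fillna(global_mean)

# Hypothetical demo data: 'Hue' appears only once
demo = pd.DataFrame({
    'city': ['HN', 'HN', 'HN', 'HCM', 'HCM', 'HCM', 'DN', 'DN', 'DN', 'Hue'],
    'revenue': [10, 12, 14, 30, 28, 32, 20, 22, 18, 50],
})

# Naive encoding: the single 'Hue' row is encoded with its own target value
demo['city_target_naive'] = demo['city'].map(demo.groupby('city')['revenue'].mean())

# Out-of-fold encoding: 'Hue' cannot see its own target
demo['city_target_oof'] = target_encode_oof(demo, 'city', 'revenue', n_splits=5)

print(demo)
```

In the naive column the rare category simply memorizes its target, while the out-of-fold column gives it the global mean instead.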

Choosing the right encoding:

Label/Ordinal: ordered variables (size: S < M < L, education: HS < BS < MS). Specify the order explicitly, because LabelEncoder assigns codes alphabetically.
One-Hot: UNORDERED variables with FEW categories (≤ 10), e.g. city, color
Target/Frequency: UNORDERED variables with MANY categories (> 10), e.g. zip code, user_id
One-Hot with 100 categories = 100 columns → Curse of dimensionality!

Checkpoint

Given the column 'education' = ['PhD', 'Bachelor', 'Master', 'High School'], which encoding should you use? Why?

⚖️ Scaling: Numeric Variables

StandardScaler (Z-score)

Python

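import pandas as pd
from sklearn.preprocessing import StandardScaler

# Example values, assumed for illustration
df = pd.DataFrame({'age': [25, 32, 47, 51],
                   'salary': [30000, 42000, 58000, 250000]})

scaler = StandardScaler()
df[['age_scaled', 'salary_scaled']] = scaler.fit_transform(df[['age', 'salary']])
# Each scaled column has mean 0 and standard deviation 1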

MinMaxScaler

Python

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))
df[['age_scaled']] = scaler.fit_transform(df[['age']])
# Range [0, 1]

RobustScaler (for data with outliers)

Python

from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
df[['salary_scaled']] = scaler.fit_transform(df[['salary']])
# Uses the median and IQR instead of mean/std → robust to outliers

Which scaler to use when?

Scaler	When to use
StandardScaler	Data is roughly normal with no outliers; the most common choice
MinMaxScaler	A fixed range such as [0, 1] is needed: Neural Networks, KNN
RobustScaler	Data has outliers; uses the median and IQR instead of the mean and std
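The table can be made concrete with a small experiment. The sketch below uses made-up salary values with one extreme outlier and measures how much spread the five typical values keep after each scaler; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Hypothetical salaries with one extreme outlier (assumed values)
salary = np.array([[30.0], [32.0], [35.0], [38.0], [40.0], [1000.0]])

scaled = {
    'standard': StandardScaler().fit_transform(salary).ravel(),
    'minmax': MinMaxScaler().fit_transform(salary).ravel(),
    'robust': RobustScaler().fit_transform(salary).ravel(),
}

# Spread (max - min) of the five typical values after scaling, outlier excluded
typical_spread = {
    name: float(values[:5].max() - values[:5].min())
    for name, values in scaled.items()
}

for name, spread in typical_spread.items():
    print(f"{name:>8}: spread of typical values = {spread:.4f}")
```

With MinMaxScaler and StandardScaler the outlier compresses the typical values into a narrow band, while RobustScaler keeps them well separated because the median and IQR barely react to a single extreme point.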

📊 Binning and Power Transforms

Binning: Continuous → Categorical

Python

# Equal-width
df['age_bins'] = pd.cut(df['age'], bins=5,
    labels=['Very Young', 'Young', 'Middle', 'Senior', 'Elderly'])

# Equal-frequency (quantiles)
df['income_q'] = pd.qcut(df['income'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins (intervals are right-inclusive: (0, 18], (18, 35], ...)
bins = [0, 18, 35, 55, 100]
labels = ['Teen', 'Young Adult', 'Middle Aged', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

Power Transforms: Reducing Skewness

Python

import numpy as np
from sklearn.preprocessing import PowerTransformer

# Log transform (most common); log1p also handles zeros
df['income_log'] = np.log1p(df['income'])

# Square root (non-negative values only)
df['count_sqrt'] = np.sqrt(df['count'])

# Box-Cox (requires strictly positive values, hence the +1 shift)
pt = PowerTransformer(method='box-cox')
df['income_boxcox'] = pt.fit_transform(df[['income']] + 1).ravel()

# Yeo-Johnson (also works with zero and negative values)
pt = PowerTransformer(method='yeo-johnson')
df['amount_yj'] = pt.fit_transform(df[['amount']]).ravel()

Checkpoint

When a feature has skewness > 1 (right-skewed), which transform should you use: a log transform or StandardScaler?
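A quick experiment helps with this checkpoint. The sketch below generates synthetic right-skewed income data (an assumption, not lesson data) and compares the skewness after standard scaling and after a log transform.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic right-skewed data (log-normal), assumed for illustration
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=5000))

skew_raw = income.skew()

# StandardScaler only shifts and rescales, so the shape of the distribution is unchanged
scaled = StandardScaler().fit_transform(income.to_numpy().reshape(-1, 1)).ravel()
skew_scaled = pd.Series(scaled).skew()

# The log transform compresses the long right tail
skew_log = np.log1p(income).skew()

print(f"raw: {skew_raw:.2f}, scaled: {skew_scaled:.2f}, log: {skew_log:.2f}")
```

Scaling leaves the skewness as it was, while the log transform brings it close to zero. The two steps solve different problems and are often applied together, transform first and scale second.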

🎯 Feature Selection: Choosing Important Features

Filter Methods

Python

# 1. Correlation with the target (numeric columns only; exclude the target itself)
corr = df.corr(numeric_only=True)['target'].drop('target').abs().sort_values(ascending=False)
top_features = corr[corr > 0.3].index.tolist()

# 2. Variance Threshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# 3. Mutual Information
from sklearn.feature_selection import mutual_info_classif
mi_scores = mutual_info_classif(X, y, random_state=42)
mi_df = pd.DataFrame({'feature': X.columns, 'score': mi_scores})
print(mi_df.sort_values('score', ascending=False))

Embedded Methods (Feature Importance)

Python

from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 15
top = importance.head(15)
plt.figure(figsize=(10, 6))
plt.barh(top['feature'], top['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance (Random Forest)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Remove Correlated Features

Python

# Correlation matrix
corr_matrix = X.corr().abs()

# Upper triangle (each pair counted once, diagonal excluded)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Columns with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
X = X.drop(columns=to_drop)
print(f"Dropped {len(to_drop)} highly correlated features")

Feature Selection helps to:

Reduce overfitting
Speed up training
Make the model easier to interpret
Rule of thumb: start with embedded methods (Random Forest importance), which are simple and usually effective
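The lesson objectives also mention Wrapper methods, which repeatedly train a model on feature subsets. One widely used wrapper in scikit-learn is Recursive Feature Elimination (`RFE`). The sketch below runs it on synthetic data; the dataset, the choice of `LogisticRegression` and the number of features to keep are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 features, only 3 of them informative (assumed setup)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3,
    n_redundant=0, random_state=42
)

# RFE fits the estimator, drops the weakest feature, and repeats
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=3, step=1)
rfe.fit(X_demo, y_demo)

selected_mask = rfe.support_      # True for the features that were kept
feature_ranking = rfe.ranking_    # 1 = selected, larger = eliminated earlier

print("Selected feature indices:",
      [i for i, keep in enumerate(selected_mask) if keep])
print("Ranking:", feature_ranking.tolist())
```

Wrapper methods are usually slower than filter or embedded methods because the model is refitted many times, so they fit best when the number of features is moderate.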

Checkpoint

Why should you remove features with correlation > 0.95? How does multicollinearity affect a model?
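As a hint for this checkpoint, the sketch below fits a linear regression with and without a near-duplicate feature. All data is synthetic and the setup is an assumption for illustration: `x2` is almost a copy of `x1`, and the true relationship is `y = 3 * x1` plus noise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # near-duplicate of x1
y = 3 * x1 + rng.normal(scale=0.5, size=n)  # true effect of x1 is 3

corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]

# Model with both correlated features vs. model with x1 only
both = LinearRegression().fit(np.column_stack([x1, x2]), y)
single = LinearRegression().fit(x1.reshape(-1, 1), y)
coef_sum = both.coef_.sum()

print(f"corr(x1, x2)        = {corr_x1_x2:.4f}")
print(f"coef with x1 only   = {single.coef_[0]:.3f}")
print(f"coefs with x1 + x2  = {both.coef_.round(3).tolist()} (sum = {coef_sum:.3f})")
```

The combined effect stays close to 3, but how it is split between `x1` and `x2` is poorly determined and can change a lot with a different random seed. Predictions remain fine while the individual coefficients become hard to interpret.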

🔧 Sklearn Pipeline: Avoiding Data Leakage

The Data Leakage Problem

Python

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: fit the scaler on ALL the data before splitting
scaler = StandardScaler()
df['scaled'] = scaler.fit_transform(df[['amount']])  # uses the test rows too!
X_train, X_test = train_test_split(df)

# ✅ RIGHT: fit on train, transform on test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)     # fit + transform
X_test_scaled = scaler.transform(X_test)            # transform ONLY!
Complete Pipeline

Python

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Define column groups
numeric_features = ['age', 'income', 'balance']
categorical_features = ['gender', 'education', 'occupation']

# Numeric pipeline
numeric_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_pipe, numeric_features),
    ('cat', categorical_pipe, categorical_features)
])

# Full pipeline with model
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])

# Fit and predict: NO DATA LEAKAGE!
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
score = full_pipeline.score(X_test, y_test)
print(f"Accuracy: {score:.3f}")

Data Leakage is one of the most common mistakes in ML projects. A Pipeline addresses it as follows:

fit_transform() runs only on the train data
transform() (without refitting) runs on the test data
Prefer a Pipeline over manual transforms!
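The same guarantee carries over to cross-validation: when a whole Pipeline is passed to `cross_val_score`, the preprocessing steps are refitted on the training part of each fold. The sketch below shows this on synthetic data; the dataset and the choice of `LogisticRegression` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumed setup)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),                 # refitted inside every fold
    ('clf', LogisticRegression(max_iter=1000)),
])

# Each fold: fit scaler + model on 4/5 of the data, score on the held-out 1/5
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5)

print("Fold accuracies:", cv_scores.round(3).tolist())
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Scaling the full dataset first and cross-validating only the classifier would reintroduce the leak, because every fold's validation rows would already have influenced the scaler.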

Checkpoint

Do you understand why the scaler must be fitted on train and ONLY applied with transform on test? This is a critically important concept!

📝 Summary

Feature Engineering Checklist

Phase	Tasks
1. Explore	Understand data types, distributions, missing patterns
2. Create	Math features, datetime, text, aggregations
3. Transform	Encode categoricals, scale numerics, handle skewness
4. Select	Remove low-variance, correlated, unimportant features
5. Validate	Use Pipeline, check for data leakage, cross-validate

Quick Reference

Python

# Encode
pd.get_dummies(df, columns=['cat_col'], drop_first=True)

# Scale
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Select
model.feature_importances_  # Tree-based importance (model must be fitted)

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier())
])
pipe.fit(X_train, y_train)

Next lesson: Data Crawling & Building Web Apps with Streamlit, where you collect data from the web and build a dashboard application! 🌐

Self-check questions

How do One-Hot Encoding and Ordinal Encoding differ? When should you use each?
How does StandardScaler transform the data? Why must you fit on train and only transform on test?
What is Data Leakage? Why does an sklearn Pipeline help avoid it?
Which feature selection family (Filter, Wrapper, or Embedded) does Random Forest feature importance belong to?

🎉 Well done! You have completed the Feature Engineering lesson!