Theory
Lesson 12/14

Feature Engineering

Techniques for creating and transforming features for Machine Learning

1. What is Feature Engineering?

Feature Engineering is the process of creating, selecting, and transforming features to improve the performance of an ML model.

"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng

Types of Feature Engineering

  • Feature Creation: Mathematical, Date/Time, Text, Aggregations
  • Feature Transformation: Scaling, Encoding, Binning, Log/Power
  • Feature Selection: Filter Methods, Wrapper Methods, Embedded Methods

2. Feature Creation

2.1 Mathematical Features

Python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [10, 5, 8],
    'discount': [0.1, 0.2, 0.15],
    'cost': [80, 160, 120]
})

# Arithmetic operations
df['revenue'] = df['price'] * df['quantity']
df['profit'] = df['revenue'] - (df['cost'] * df['quantity'])
df['profit_margin'] = df['profit'] / df['revenue']
df['final_price'] = df['price'] * (1 - df['discount'])

# Ratios
df['price_per_unit'] = df['revenue'] / df['quantity']
df['cost_ratio'] = df['cost'] / df['price']

# Statistical features
df['price_zscore'] = (df['price'] - df['price'].mean()) / df['price'].std()
df['price_percentile'] = df['price'].rank(pct=True)

# Log transform (for skewed data)
df['price_log'] = np.log1p(df['price'])
df['revenue_log'] = np.log1p(df['revenue'])

2.2 Date/Time Features

Python
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D')
})

# Basic date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter

# Boolean features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)

# Cyclical encoding (for ML models)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

# Date differences
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
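
Why cyclical encoding? Treated as plain integers, December (12) and January (1) look maximally far apart even though they are adjacent months; mapping each month onto a circle with sin/cos restores that adjacency. A minimal sketch to verify this, using a hypothetical `month_distance` helper over the same sin/cos formula as above:

Python
# Euclidean distance between two months in (sin, cos) space
def month_distance(m1, m2):
    p1 = (np.sin(2 * np.pi * m1 / 12), np.cos(2 * np.pi * m1 / 12))
    p2 = (np.sin(2 * np.pi * m2 / 12), np.cos(2 * np.pi * m2 / 12))
    return np.hypot(p1[0] - p2[0], p1[1] - p2[1])

print(month_distance(12, 1))  # ~0.52: December and January are neighbors
print(month_distance(12, 6))  # 2.0: half a year apart, maximal distance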

2.3 Text Features

Python
df = pd.DataFrame({
    'text': [
        "This is GREAT product!",
        "Not good at all...",
        "Amazing quality, love it!!!"
    ]
})

# Basic text features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['char_count'] = df['text'].str.replace(' ', '').str.len()

# Character features
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['lowercase_count'] = df['text'].str.count(r'[a-z]')
df['digit_count'] = df['text'].str.count(r'\d')
df['special_count'] = df['text'].str.count(r'[!?.,]')
df['exclamation_count'] = df['text'].str.count('!')

# Word features
df['avg_word_length'] = df['char_count'] / df['word_count']
df['has_exclamation'] = df['text'].str.contains('!').astype(int)
df['has_question'] = df['text'].str.contains(r'\?').astype(int)

# TF-IDF (for ML)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100)
tfidf_features = tfidf.fit_transform(df['text'])
tfidf_df = pd.DataFrame(tfidf_features.toarray(),
                        columns=tfidf.get_feature_names_out())

2.4 Aggregation Features

Python
# Sample data
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'amount': [100, 200, 150, 300, 250, 500],
    'date': pd.date_range('2024-01-01', periods=6)
})

# Customer-level aggregations
customer_features = df.groupby('customer_id').agg({
    'amount': ['count', 'sum', 'mean', 'std', 'min', 'max'],
    'date': ['min', 'max']
}).reset_index()

customer_features.columns = ['customer_id',
    'transaction_count', 'total_amount', 'avg_amount',
    'std_amount', 'min_amount', 'max_amount',
    'first_transaction', 'last_transaction']

# Range and recency
customer_features['amount_range'] = (customer_features['max_amount'] -
                                     customer_features['min_amount'])
customer_features['days_active'] = (customer_features['last_transaction'] -
                                    customer_features['first_transaction']).dt.days

# Merge back
df = df.merge(customer_features, on='customer_id', how='left')

3. Feature Transformation

3.1 Scaling

Python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler - Mean=0, Std=1
scaler = StandardScaler()
df['amount_standard'] = scaler.fit_transform(df[['amount']])

# MinMaxScaler - Range [0, 1]
scaler = MinMaxScaler()
df['amount_minmax'] = scaler.fit_transform(df[['amount']])

# RobustScaler - Robust to outliers
scaler = RobustScaler()
df['amount_robust'] = scaler.fit_transform(df[['amount']])

# Manual scaling
df['amount_normalized'] = (df['amount'] - df['amount'].min()) / (df['amount'].max() - df['amount'].min())
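
StandardScaler and MinMaxScaler are both pulled around by extreme values, while RobustScaler centers on the median and scales by the IQR. A small sketch on toy data with one outlier (values are made up for illustration) shows the difference:

Python
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler

# Toy column: mostly small values plus one extreme outlier
X = np.array([[10.0], [12.0], [11.0], [13.0], [1000.0]])

print(StandardScaler().fit_transform(X).ravel())
# Typical values bunch up near -0.5 because the outlier inflates the std

print(RobustScaler().fit_transform(X).ravel())
# Median/IQR scaling keeps the typical values spread between -1 and 0.5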

3.2 Encoding Categorical Variables

Python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'M']
})

# Label Encoding
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Ordinal Encoding (with an explicit category order)
from sklearn.preprocessing import OrdinalEncoder

size_order = [['S', 'M', 'L', 'XL']]
oe = OrdinalEncoder(categories=size_order)
df['size_ordinal'] = oe.fit_transform(df[['size']])

# Target Encoding (for high cardinality)
def target_encode(df, col, target):
    means = df.groupby(col)[target].mean()
    return df[col].map(means)

# Frequency Encoding
df['color_freq'] = df['color'].map(df['color'].value_counts())
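
The `target_encode` helper replaces each category with the mean of the target within that category. A usage sketch, assuming a hypothetical binary `purchased` target column; note that in practice the category means should be fit on the training split only and then mapped onto validation/test data to avoid leakage:

Python
# Hypothetical target column, for illustration only
df['purchased'] = [1, 0, 1, 1, 0]

# Mean of 'purchased' per color, mapped back onto each row
df['color_target_enc'] = target_encode(df, 'color', 'purchased')
print(df[['color', 'purchased', 'color_target_enc']])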

3.3 Binning

Python
# Equal-width binning
df['amount_bins'] = pd.cut(df['amount'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Equal-frequency binning (quantiles)
df['amount_quantile'] = pd.qcut(df['amount'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins
bins = [0, 100, 200, 500, float('inf')]
labels = ['Small', 'Medium', 'Large', 'Very Large']
df['amount_custom'] = pd.cut(df['amount'], bins=bins, labels=labels)

3.4 Power Transformations

Python
from sklearn.preprocessing import PowerTransformer

# Box-Cox (requires positive values)
pt = PowerTransformer(method='box-cox')
df['amount_boxcox'] = pt.fit_transform(df[['amount']] + 1)

# Yeo-Johnson (works with negative values)
pt = PowerTransformer(method='yeo-johnson')
df['amount_yeojohnson'] = pt.fit_transform(df[['amount']])

# Log transform
df['amount_log'] = np.log1p(df['amount'])

# Square root
df['amount_sqrt'] = np.sqrt(df['amount'])
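
These transforms are aimed at right-skewed distributions. A quick way to decide whether one helps is to compare skewness before and after; a minimal sketch using pandas' built-in `.skew()` on synthetic log-normal data:

Python
import numpy as np
import pandas as pd

# Toy right-skewed data (log-normal)
s = pd.Series(np.random.default_rng(0).lognormal(size=1000))

print(s.skew())            # strongly positive: right-skewed
print(np.log1p(s).skew())  # much closer to 0 after the log transform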

4. Handling Missing Values

Python
# Simple imputation - pick one strategy (each returns a new Series; assign it back)
df['col'].fillna(df['col'].mean())     # Mean
df['col'].fillna(df['col'].median())   # Median
df['col'].fillna(df['col'].mode()[0])  # Mode

# Forward/Backward fill (fillna(method=...) is deprecated in recent pandas)
df['col'].ffill()
df['col'].bfill()

# Interpolation
df['col'].interpolate(method='linear')

# KNN Imputation
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Create indicator for missing (compute this before imputing)
df['col_is_missing'] = df['col'].isnull().astype(int)

5. Feature Selection

5.1 Filter Methods

Python
# Correlation-based
correlation = df.corr()['target'].abs().sort_values(ascending=False)
top_features = correlation[correlation > 0.3].index.tolist()

# Variance threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Chi-square (requires non-negative features, e.g. counts or encoded categories)
from sklearn.feature_selection import chi2, SelectKBest

selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Mutual Information
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X, y)
mi_df = pd.DataFrame({'feature': X.columns, 'mi_score': mi_scores})
mi_df = mi_df.sort_values('mi_score', ascending=False)
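
Another common filter, echoed in the checklist in section 7, is dropping one feature from each highly correlated pair. A minimal sketch, assuming a numeric feature matrix `X` (a pandas DataFrame) and a hypothetical 0.95 threshold:

Python
import numpy as np

# Absolute correlation matrix, upper triangle only (each pair counted once)
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated above 0.95 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)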

5.2 Wrapper Methods

Python
# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

selected_features = X.columns[rfe.support_].tolist()

# Sequential Feature Selection
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(model, n_features_to_select=10)
sfs.fit(X, y)
selected_features = X.columns[sfs.get_support()].tolist()

5.3 Embedded Methods (Feature Importance)

Python
# Tree-based importance
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.barh(importance['feature'][:20], importance['importance'][:20])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.show()

# L1-based selection (Lasso)
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

lasso = LassoCV(cv=5)
sfm = SelectFromModel(lasso)
sfm.fit(X, y)
selected_features = X.columns[sfm.get_support()].tolist()

6. Feature Engineering Pipeline

Python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column groups
numeric_features = ['age', 'income', 'balance']
categorical_features = ['gender', 'education', 'occupation']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline with model
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)

7. Best Practices

7.1 Feature Engineering Checklist

Python
1"""
2Feature Engineering Checklist:
3
41. Data Understanding
5 Understand business context
6 Identify target variable
7 Identify feature types (numeric, categorical, text, date)
8
92. Feature Creation
10 Mathematical combinations (ratios, differences)
11 Date/time features
12 Text features (if applicable)
13 Aggregation features
14
153. Feature Transformation
16 Handle missing values
17 Scale numeric features
18 Encode categorical features
19 Handle outliers
20
214. Feature Selection
22 Remove low-variance features
23 Remove highly correlated features
24 Use feature importance
25 Cross-validate selected features
26
275. Validation
28 No data leakage
29 Consistent with test data
30 Pipeline for reproducibility
31"""

7.2 Avoid Data Leakage

Python
from sklearn.model_selection import train_test_split

# WRONG - Data leakage! Test-set statistics influence the scaling
df['amount_scaled'] = (df['amount'] - df['amount'].mean()) / df['amount'].std()
X_train, X_test = train_test_split(df)

# CORRECT - Split first, then fit the scaler on training data only
X_train, X_test = train_test_split(df)
scaler = StandardScaler()
X_train['amount_scaled'] = scaler.fit_transform(X_train[['amount']])
X_test['amount_scaled'] = scaler.transform(X_test[['amount']])  # Only transform!
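
An even safer pattern is to put the scaler inside a Pipeline (section 6): during cross-validation, scikit-learn then re-fits the scaler on each training fold automatically, so leakage cannot sneak in. A minimal sketch, assuming a feature matrix `X` and target `y`:

Python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),        # fit on each training fold only
    ('classifier', LogisticRegression())
])

# cross_val_score fits the whole pipeline per fold - no leakage
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())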

Summary

In this lesson, you learned:

  • ✅ Feature Creation: Mathematical, Date, Text, Aggregation
  • ✅ Feature Transformation: Scaling, Encoding, Binning
  • ✅ Handling Missing Values
  • ✅ Feature Selection: Filter, Wrapper, Embedded
  • ✅ Building Feature Engineering Pipeline
  • ✅ Best practices and avoiding data leakage

Next lesson: Course wrap-up and hands-on project!