Feature Engineering
1. What is Feature Engineering?
Feature Engineering is the process of creating, selecting, and transforming features to improve the performance of machine learning models.
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng
Types of Feature Engineering
- Feature Creation: Mathematical, Date/Time, Text, Aggregations
- Feature Transformation: Scaling, Encoding, Binning, Log/Power
- Feature Selection: Filter Methods, Wrapper Methods, Embedded Methods
2. Feature Creation
2.1 Mathematical Features
```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'price': [100, 200, 150],
    'quantity': [10, 5, 8],
    'discount': [0.1, 0.2, 0.15],
    'cost': [80, 160, 120]
})

# Arithmetic operations
df['revenue'] = df['price'] * df['quantity']
df['profit'] = df['revenue'] - (df['cost'] * df['quantity'])
df['profit_margin'] = df['profit'] / df['revenue']
df['final_price'] = df['price'] * (1 - df['discount'])

# Ratios
df['price_per_unit'] = df['revenue'] / df['quantity']
df['cost_ratio'] = df['cost'] / df['price']

# Statistical features
df['price_zscore'] = (df['price'] - df['price'].mean()) / df['price'].std()
df['price_percentile'] = df['price'].rank(pct=True)

# Log transform (for skewed data)
df['price_log'] = np.log1p(df['price'])
df['revenue_log'] = np.log1p(df['revenue'])
```
2.2 Date/Time Features
```python
df = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100, freq='D')
})

# Basic date components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['dayofweek'] = df['date'].dt.dayofweek  # Monday=0, Sunday=6
df['dayofyear'] = df['date'].dt.dayofyear
df['weekofyear'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter

# Boolean features
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)
df['is_quarter_end'] = df['date'].dt.is_quarter_end.astype(int)

# Cyclical encoding (for ML models)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
df['dayofweek_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dayofweek_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

# Date differences
df['days_since_start'] = (df['date'] - df['date'].min()).dt.days
```
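Why the sin/cos pair? A raw month column tells the model that December (12) and January (1) are 11 apart, when seasonally they are neighbors. A quick sanity check (a minimal sketch; `month_to_xy` is a hypothetical helper for illustration):

```python
import numpy as np

def month_to_xy(m):
    """Map a month number (1-12) onto the unit circle."""
    angle = 2 * np.pi * m / 12
    return np.array([np.sin(angle), np.cos(angle)])

# Raw values say December and January are 11 apart...
print(abs(12 - 1))  # 11
# ...but on the circle they are neighbors (~0.52),
# while June sits on the opposite side (2.0)
print(np.linalg.norm(month_to_xy(12) - month_to_xy(1)))  # ~0.52
print(np.linalg.norm(month_to_xy(12) - month_to_xy(6)))  # 2.0
```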
2.3 Text Features
```python
df = pd.DataFrame({
    'text': [
        "This is GREAT product!",
        "Not good at all...",
        "Amazing quality, love it!!!"
    ]
})

# Basic text features
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['char_count'] = df['text'].str.replace(' ', '').str.len()

# Character features
df['uppercase_count'] = df['text'].str.count(r'[A-Z]')
df['lowercase_count'] = df['text'].str.count(r'[a-z]')
df['digit_count'] = df['text'].str.count(r'\d')
df['special_count'] = df['text'].str.count(r'[!?.,]')
df['exclamation_count'] = df['text'].str.count('!')

# Word features
df['avg_word_length'] = df['char_count'] / df['word_count']
df['has_exclamation'] = df['text'].str.contains('!').astype(int)
df['has_question'] = df['text'].str.contains(r'\?').astype(int)

# TF-IDF (for ML)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=100)
tfidf_features = tfidf.fit_transform(df['text'])
tfidf_df = pd.DataFrame(tfidf_features.toarray(),
                        columns=tfidf.get_feature_names_out())
```
2.4 Aggregation Features
```python
# Sample data
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3],
    'amount': [100, 200, 150, 300, 250, 500],
    'date': pd.date_range('2024-01-01', periods=6)
})

# Customer-level aggregations
customer_features = df.groupby('customer_id').agg({
    'amount': ['count', 'sum', 'mean', 'std', 'min', 'max'],
    'date': ['min', 'max']
}).reset_index()

customer_features.columns = ['customer_id',
    'transaction_count', 'total_amount', 'avg_amount',
    'std_amount', 'min_amount', 'max_amount',
    'first_transaction', 'last_transaction']

# Range and recency
customer_features['amount_range'] = (customer_features['max_amount'] -
                                     customer_features['min_amount'])
customer_features['days_active'] = (customer_features['last_transaction'] -
                                    customer_features['first_transaction']).dt.days

# Merge back
df = df.merge(customer_features, on='customer_id', how='left')
```
3. Feature Transformation
3.1 Scaling
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# StandardScaler - Mean=0, Std=1
scaler = StandardScaler()
df['amount_standard'] = scaler.fit_transform(df[['amount']])

# MinMaxScaler - Range [0, 1]
scaler = MinMaxScaler()
df['amount_minmax'] = scaler.fit_transform(df[['amount']])

# RobustScaler - Robust to outliers
scaler = RobustScaler()
df['amount_robust'] = scaler.fit_transform(df[['amount']])

# Manual scaling
df['amount_normalized'] = (df['amount'] - df['amount'].min()) / (df['amount'].max() - df['amount'].min())
```
3.2 Encoding Categorical Variables
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'M']
})

# Label Encoding
le = LabelEncoder()
df['color_encoded'] = le.fit_transform(df['color'])

# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['color', 'size'], drop_first=True)

# Ordinal Encoding (with an explicit order)
from sklearn.preprocessing import OrdinalEncoder

size_order = [['S', 'M', 'L', 'XL']]
oe = OrdinalEncoder(categories=size_order)
df['size_ordinal'] = oe.fit_transform(df[['size']])

# Target Encoding (for high-cardinality features)
def target_encode(df, col, target):
    means = df.groupby(col)[target].mean()
    return df[col].map(means)

# Frequency Encoding
df['color_freq'] = df['color'].map(df['color'].value_counts())
```
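Note that the simple `target_encode` above computes category means on the whole dataset, so each row's encoding leaks its own target value. A common remedy is out-of-fold encoding; here is a minimal sketch (the helper name `target_encode_oof` and the fold count are assumptions, not a library API):

```python
import pandas as pd
from sklearn.model_selection import KFold

def target_encode_oof(df, col, target, n_splits=5):
    """Out-of-fold target encoding: each row is encoded with category
    means computed on the *other* folds, so no row sees its own target."""
    encoded = pd.Series(index=df.index, dtype=float)
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        fold_means = df.iloc[train_idx].groupby(col)[target].mean()
        # Categories unseen in a fold fall back to the global mean
        encoded.iloc[val_idx] = (df.iloc[val_idx][col]
                                 .map(fold_means)
                                 .fillna(global_mean)
                                 .values)
    return encoded
```
3.3 Binning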
```python
# Equal-width binning
df['amount_bins'] = pd.cut(df['amount'], bins=5,
                           labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Equal-frequency binning (quantiles)
df['amount_quantile'] = pd.qcut(df['amount'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins
bins = [0, 100, 200, 500, float('inf')]
labels = ['Small', 'Medium', 'Large', 'Very Large']
df['amount_custom'] = pd.cut(df['amount'], bins=bins, labels=labels)
```
3.4 Power Transformations
```python
from sklearn.preprocessing import PowerTransformer

# Box-Cox (requires strictly positive values)
pt = PowerTransformer(method='box-cox')
df['amount_boxcox'] = pt.fit_transform(df[['amount']] + 1)

# Yeo-Johnson (works with negative values)
pt = PowerTransformer(method='yeo-johnson')
df['amount_yeojohnson'] = pt.fit_transform(df[['amount']])

# Log transform
df['amount_log'] = np.log1p(df['amount'])

# Square root
df['amount_sqrt'] = np.sqrt(df['amount'])
```
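These transforms are only worth keeping if they actually reduce skew. A quick before/after check, as a minimal sketch on synthetic log-normal data (the distribution parameters are arbitrary assumptions for illustration):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
amount = rng.lognormal(mean=4, sigma=1, size=1_000)  # heavily right-skewed

print(f"skew before: {skew(amount):.2f}")            # strongly positive (roughly 5-6)
print(f"skew after : {skew(np.log1p(amount)):.2f}")  # close to 0
```
4. Handling Missing Values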
```python
# Simple imputation
df['col'].fillna(df['col'].mean())     # Mean
df['col'].fillna(df['col'].median())   # Median
df['col'].fillna(df['col'].mode()[0])  # Mode

# Forward/Backward fill (fillna(method=...) is deprecated in pandas 2.x)
df['col'].ffill()
df['col'].bfill()

# Interpolation
df['col'].interpolate(method='linear')

# KNN Imputation
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=5)
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Create an indicator for missingness
df['col_is_missing'] = df['col'].isnull().astype(int)
```
5. Feature Selection
5.1 Filter Methods
```python
# Correlation-based
correlation = df.corr()['target'].abs().sort_values(ascending=False)
top_features = correlation[correlation > 0.3].index.tolist()

# Variance threshold
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.01)
X_selected = selector.fit_transform(X)

# Chi-square (for non-negative / categorical features)
from sklearn.feature_selection import chi2, SelectKBest

selector = SelectKBest(chi2, k=10)
X_selected = selector.fit_transform(X, y)

# Mutual Information
from sklearn.feature_selection import mutual_info_classif

mi_scores = mutual_info_classif(X, y)
mi_df = pd.DataFrame({'feature': X.columns, 'mi_score': mi_scores})
mi_df.sort_values('mi_score', ascending=False)
```
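Filter methods also cover redundancy between features (see the checklist in section 7.1): when two features are almost perfectly correlated with each other, one of them adds little. A minimal sketch that drops one feature from each highly correlated pair, assuming `X` is a DataFrame as above (the 0.95 threshold is an arbitrary choice):

```python
import numpy as np

# Upper triangle of the absolute correlation matrix,
# so each pair is inspected only once
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop any column correlated > 0.95 with an earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
X_reduced = X.drop(columns=to_drop)
```
5.2 Wrapper Methods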
```python
# Recursive Feature Elimination
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
rfe = RFE(model, n_features_to_select=10)
rfe.fit(X, y)

selected_features = X.columns[rfe.support_].tolist()

# Sequential Feature Selection
from sklearn.feature_selection import SequentialFeatureSelector

sfs = SequentialFeatureSelector(model, n_features_to_select=10)
sfs.fit(X, y)
selected_features = X.columns[sfs.get_support()].tolist()
```
5.3 Embedded Methods (Feature Importance)
```python
# Tree-based importance
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(X, y)

importance = pd.DataFrame({
    'feature': X.columns,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.barh(importance['feature'][:20], importance['importance'][:20])
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.gca().invert_yaxis()
plt.show()

# L1-based selection (Lasso)
from sklearn.linear_model import LassoCV
from sklearn.feature_selection import SelectFromModel

lasso = LassoCV(cv=5)
sfm = SelectFromModel(lasso)
sfm.fit(X, y)
selected_features = X.columns[sfm.get_support()].tolist()
```
6. Feature Engineering Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Define column groups
numeric_features = ['age', 'income', 'balance']
categorical_features = ['gender', 'education', 'occupation']

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Full pipeline with model
from sklearn.ensemble import RandomForestClassifier

full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
```
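A bonus of wrapping everything in one Pipeline: cross-validation stays leak-free, because imputation, scaling, and encoding are re-fit on each training fold only. A short usage sketch:

```python
from sklearn.model_selection import cross_val_score

# Each fold re-fits the preprocessor on its own training split,
# so no statistics leak in from the validation rows
scores = cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='accuracy')
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```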
7. Best Practices
7.1 Feature Engineering Checklist
```python
"""
Feature Engineering Checklist:

1. Data Understanding
   □ Understand business context
   □ Identify target variable
   □ Identify feature types (numeric, categorical, text, date)

2. Feature Creation
   □ Mathematical combinations (ratios, differences)
   □ Date/time features
   □ Text features (if applicable)
   □ Aggregation features

3. Feature Transformation
   □ Handle missing values
   □ Scale numeric features
   □ Encode categorical features
   □ Handle outliers

4. Feature Selection
   □ Remove low-variance features
   □ Remove highly correlated features
   □ Use feature importance
   □ Cross-validate selected features

5. Validation
   □ No data leakage
   □ Consistent with test data
   □ Pipeline for reproducibility
"""
```
7.2 Avoid Data Leakage
```python
from sklearn.model_selection import train_test_split

# WRONG - data leakage! Statistics computed on the full dataset
# let information from the test rows shape the training features.
df['amount_scaled'] = (df['amount'] - df['amount'].mean()) / df['amount'].std()
X_train, X_test = train_test_split(df)

# CORRECT - scale after the split
X_train, X_test = train_test_split(df)
scaler = StandardScaler()
X_train['amount_scaled'] = scaler.fit_transform(X_train[['amount']])
X_test['amount_scaled'] = scaler.transform(X_test[['amount']])  # Only transform!
```
Summary
In this lesson, you learned:
- ✅ Feature Creation: Mathematical, Date, Text, Aggregation
- ✅ Feature Transformation: Scaling, Encoding, Binning
- ✅ Handling Missing Values
- ✅ Feature Selection: Filter, Wrapper, Embedded
- ✅ Building a Feature Engineering Pipeline
- ✅ Best practices and avoiding data leakage
Next up: a wrap-up and hands-on project!
