Theory
4-5 hours
Lesson 5/15

Data Processing with Pandas

Data Cleaning, Missing Values, and Feature Engineering


Learning Objectives

After this lesson, learners will be able to:

  • Manipulate data with Pandas DataFrames
  • Handle missing values and outliers
  • Perform basic feature engineering
  • Encode categorical variables

1. Pandas DataFrame Basics

1.1 Creating and Reading DataFrames

Python
import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
data = {
    'Name': ['An', 'Binh', 'Chi', 'Đúng'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000, 60000, 75000, 80000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
print(df)

# Read from CSV
# df = pd.read_csv('data.csv')

# Read from Excel
# df = pd.read_excel('data.xlsx')

1.2 Basic Operations

Python
# Inspect the data
print(df.info())
print(df.describe())
print(df.shape)
print(df.columns)
print(df.dtypes)

# Access data
print(df['Age'])               # One column
print(df[['Name', 'Age']])     # Multiple columns
print(df.iloc[0])              # First row
print(df.loc[0, 'Name'])       # A specific cell

# Filter data
print(df[df['Age'] > 25])
print(df[(df['Age'] > 25) & (df['Department'] == 'IT')])

2. Handling Missing Values

2.1 Detecting Missing Values

Python
# Check for missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Visualize missing values
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()

2.2 Strategies for Handling Missing Values

| Strategy | When to use | Code |
| --- | --- | --- |
| Drop rows | Few missing values, plenty of data | `df.dropna()` |
| Drop column | Column is mostly missing (> 50%) | `df.drop(columns=['col'])` |
| Fill with mean | Numerical, roughly normal distribution | `df.fillna(df.mean())` |
| Fill with median | Numerical, has outliers | `df.fillna(df.median())` |
| Fill with mode | Categorical | `df['col'].fillna(df['col'].mode()[0])` |
| Forward fill | Time series | `df.ffill()` |
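
Two strategies from the table that section 2.3 does not demonstrate are mode fill and forward fill; here is a minimal self-contained sketch (the toy series are made up for illustration):

Python
import pandas as pd
import numpy as np

# Mode fill for a categorical column: replace NaN with the most frequent value
dept = pd.Series(['IT', 'HR', np.nan, 'IT'])
print(dept.fillna(dept.mode()[0]))  # NaN becomes 'IT'

# Forward fill for a time series: carry the last observation forward
ts = pd.Series([1.0, np.nan, np.nan, 4.0])
print(ts.ffill())  # 1.0, 1.0, 1.0, 4.0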

2.3 Practice

Python
# Handle missing values
df_clean = df.copy()

# Fill Age with the median (plain assignment; chained inplace=True
# is deprecated in recent pandas)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Or use SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_clean['Age'] = imputer.fit_transform(df_clean[['Age']])

print(df_clean)

3. Handling Outliers

3.1 Detecting Outliers

The IQR (Interquartile Range) method:

$$IQR = Q3 - Q1$$
$$\text{Lower bound} = Q1 - 1.5 \times IQR$$
$$\text{Upper bound} = Q3 + 1.5 \times IQR$$

Figure: Boxplot and how outliers are identified

3.2 Detection in Practice

Python
import numpy as np

def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

# Example
data = pd.DataFrame({
    'value': [10, 12, 14, 15, 16, 18, 100]  # 100 is an outlier
})

outliers, lower, upper = detect_outliers_iqr(data, 'value')
print(f"Outliers:\n{outliers}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")

# Visualize
import matplotlib.pyplot as plt
plt.boxplot(data['value'])
plt.title('Boxplot - Outlier Detection')
plt.show()

3.3 Treating Outliers

| Method | Code |
| --- | --- |
| Remove | `df = df[(df['col'] >= lower) & (df['col'] <= upper)]` |
| Cap (Winsorization) | `df['col'] = df['col'].clip(lower, upper)` |
| Transform | `df['col_log'] = np.log1p(df['col'])` |
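
As a quick illustration, the capping and transform options could be applied to the sample `data` DataFrame from section 3.2, reusing the `lower`/`upper` bounds computed there (a sketch, not the only way to do it):

Python
# Cap extreme values at the IQR bounds (Winsorization)
data['value_capped'] = data['value'].clip(lower, upper)

# Compress the scale with log1p so the outlier has less influence
data['value_log'] = np.log1p(data['value'])

print(data)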

4. Feature Engineering

4.1 Encoding Categorical Variables

Label Encoding:

Python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Department_encoded'] = le.fit_transform(df['Department'])
print(df)
print(f"Classes: {le.classes_}")

One-Hot Encoding:

Python
# Pandas
df_onehot = pd.get_dummies(df, columns=['Department'], prefix='Dept')
print(df_onehot)

# Sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['Department']])
print(encoded)

4.2 Which Encoding to Use When?

| Encoding | When to use | Example |
| --- | --- | --- |
| Label Encoding | Ordinal data (has an inherent order) | Education level: Low < Medium < High |
| One-Hot Encoding | Nominal data (no order) | Color: Red, Blue, Green |
| Target Encoding | High-cardinality features | Zip code, City |
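
One caveat: `LabelEncoder` assigns codes in alphabetical order, which may not match the real ordinal order. For truly ordinal data, passing an explicit category order to `OrdinalEncoder` is safer; a minimal sketch with a hypothetical `Education` column:

Python
from sklearn.preprocessing import OrdinalEncoder
import pandas as pd

# Hypothetical ordinal data for illustration
edu = pd.DataFrame({'Education': ['Low', 'High', 'Medium', 'Low']})

# Spell out the order so Low < Medium < High is preserved in the codes
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
edu['Education_encoded'] = oe.fit_transform(edu[['Education']]).ravel()
print(edu)  # Low=0.0, Medium=1.0, High=2.0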

4.3 Creating New Features

Python
# Create features from dates
df['Date'] = pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['IsWeekend'] = df['DayOfWeek'].isin([5, 6]).astype(int)

# Binning
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 25, 35, 100], labels=['Young', 'Middle', 'Senior'])

# Interaction features
df['Age_Salary'] = df['Age'] * df['Salary']

5. A Complete Data Pipeline

Python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Assume we have loaded data
# df = pd.read_csv('data.csv')

# Define columns
numerical_cols = ['Age', 'Salary']
categorical_cols = ['Department']

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# Apply
# X_processed = preprocessor.fit_transform(X_train)

6. Pros and Cons of the Methods

6.1 Missing Value Handling

| Method | Pros | Cons |
| --- | --- | --- |
| Drop | Simple | Loses data |
| Mean/Median | Keeps all rows | Can introduce bias |
| Model-based (KNN) | More accurate | Slow |
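
The model-based option above is available in scikit-learn as `KNNImputer`. A minimal sketch on the sample `df` from section 1 (assuming `Age` still contains its NaN):

Python
from sklearn.impute import KNNImputer

# Each missing value is filled with the mean of its k nearest rows,
# where distance is measured on the other numeric columns
imputer = KNNImputer(n_neighbors=2)
df[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])
print(df)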

6.2 Encoding

| Method | Pros | Cons |
| --- | --- | --- |
| Label | Simple, one column | Creates a false order |
| One-Hot | No artificial order | Many columns (curse of dimensionality) |


7. Advanced Feature Engineering

7.1 Workflow Overview

Text
Raw Data → Extract → Transform → Select → Model

Goal: create features that correlate strongly with the target, remove noise, and satisfy model assumptions.

7.2 DateTime & Cyclic Features

7.2.1 Basic DateTime Extraction

Python
# Create a datetime column
df['dt'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month
df['day'] = df['dt'].dt.day
df['dayofweek'] = df['dt'].dt.dayofweek  # 0=Monday, 6=Sunday
df['hour'] = df['dt'].dt.hour
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['dt'].dt.is_month_start.astype(int)
df['is_month_end'] = df['dt'].dt.is_month_end.astype(int)

# Time differences
df['days_since'] = (df['dt'] - df['dt'].min()).dt.days

7.2.2 Cyclic Encoding (Sin/Cos)

Problem: hour 23 and hour 0 are adjacent in time, but numerically far apart (23 vs 0).

Solution: encode the value as sin/cos pairs to preserve this closeness.

Python
import numpy as np

# Hour cyclic encoding (0-23 hours)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Month cyclic encoding (1-12 months)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Day-of-week cyclic encoding (0-6)
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)

Visualization:

Text
Hour 23 ≈ Hour 0

Sin/Cos: close
    vs
Linear: 23 - 0 = 23 (far)
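
A quick numeric check (a sketch, reusing the hour encoding above) makes this concrete: in (sin, cos) space the distance between hours 23 and 0 equals the distance between any other pair of adjacent hours:

Python
import numpy as np

def hour_point(h):
    # Map an hour of day onto the unit circle
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Euclidean distance in (sin, cos) space
print(np.linalg.norm(hour_point(23) - hour_point(0)))   # ~0.26
print(np.linalg.norm(hour_point(11) - hour_point(12)))  # ~0.26 (same)
# On the raw hour axis, |23 - 0| = 23 even though the hours are adjacent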

7.3 Polynomial & Interaction Features

7.3.1 Polynomial Features

Creates non-linear relationships: $x^2$, $xy$, $y^2$

Python
from sklearn.preprocessing import PolynomialFeatures

# Degree 2: x1, x2 → x1, x2, x1², x2², x1·x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())
# Output: ['x1', 'x2', 'x1^2', 'x1 x2', 'x2^2']

When to use:

  • Linear models that need to capture non-linear relationships
  • Example: House price = Area + Area² (diminishing returns); see the sketch after this list
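
To make the house-price intuition concrete, here is a small sketch on synthetic data (the quadratic coefficient and noise level are invented for illustration): a plain linear fit underfits the curvature, while the degree-2 fit captures it.

Python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Synthetic prices with diminishing returns to area
rng = np.random.default_rng(42)
area = rng.uniform(50, 300, size=200).reshape(-1, 1)
price = 1000 * area[:, 0] - 1.5 * area[:, 0] ** 2 + rng.normal(0, 5000, 200)

linear = LinearRegression().fit(area, price)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(area, price)

print(f"Linear R²: {linear.score(area, price):.3f}")
print(f"Degree-2 R²: {poly.score(area, price):.3f}")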

7.3.2 Domain-Specific Interactions

Python
# Ratio features
df['price_per_sqft'] = df['price'] / df['sqft']
df['income_to_loan'] = df['income'] / (df['loan_amount'] + 1)

# Sum/Difference
df['total_income'] = df['applicant_income'] + df['spouse_income']
df['income_diff'] = df['applicant_income'] - df['spouse_income']

# Multiplication
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# Binning interactions
df['age_income_group'] = df['age_group'] + '_' + df['income_group']

7.4 Text Features (NLP)

7.4.1 Basic Stats

Python
# Word count
df['word_count'] = df['text'].apply(lambda x: len(str(x).split()))

# Character count
df['char_count'] = df['text'].str.len()

# Average word length
df['avg_word_len'] = df['char_count'] / df['word_count']

# Number of uppercase words
df['uppercase_count'] = df['text'].apply(
    lambda x: sum(1 for w in str(x).split() if w.isupper())
)

7.4.2 TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF reflects how important a word is within a collection of documents.

Python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with top 100 features
tfidf = TfidfVectorizer(
    max_features=100,
    stop_words='english',
    ngram_range=(1, 2)  # Unigrams + bigrams
)

X_text = tfidf.fit_transform(df['text'])

# Get feature names
feature_names = tfidf.get_feature_names_out()
print(f"Created {len(feature_names)} text features")

7.5 Feature Selection

7.5.1 Filter Methods (Statistical Tests)

Remove High Correlation:

Python
# Correlation matrix
corr_matrix = df.corr().abs()

# Upper triangle
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Find features with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Dropping {len(to_drop)} highly correlated features")

df_filtered = df.drop(columns=to_drop)

7.5.2 Wrapper Methods (RFE)

Recursive Feature Elimination iteratively removes the weakest features.

Python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Select the top 10 features
selector = RFE(
    estimator=LinearRegression(),
    n_features_to_select=10,
    step=1
)
selector.fit(X, y)

# Get selected features
selected_features = X.columns[selector.support_]
print(f"Selected features: {selected_features}")

# Transform data
X_selected = selector.transform(X)

7.5.3 Embedded Methods (Model-based)

Random Forest Feature Importance:

Python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Select top features
top_features = importances.head(15)['feature'].values
X_important = X[top_features]

Lasso Regularization (L1):

Python
from sklearn.linear_model import LassoCV

# Lasso automatically selects features (sets weak coefficients to 0)
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X, y)

# Non-zero coefficients
selected_mask = lasso.coef_ != 0
selected_features = X.columns[selected_mask]
print(f"Lasso selected {len(selected_features)} features")

8. Exploratory Data Analysis (EDA)

8.1 EDA Workflow

Text
Load → Clean → Analyze → Visualize → Insights

8.2 Step-by-Step EDA

8.2.1 Load & Initial Exploration

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure
plt.style.use('ggplot')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

# Load data
df = pd.read_csv('data.csv')

# Quick inspection
print("=== Shape ===")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n=== First 5 Rows ===")
print(df.head())

print("\n=== Data Types ===")
print(df.dtypes)

print("\n=== Info ===")
print(df.info())

print("\n=== Summary Statistics ===")
print(df.describe())

8.2.2 Missing Values Analysis

Python
# Missing count & percentage
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing = missing[missing['count'] > 0].sort_values('count', ascending=False)
print(missing)

# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Missing-value patterns with missingno
import missingno as msno
msno.matrix(df)
plt.show()

8.2.3 Univariate Analysis

Numerical Features:

Python
# Distribution plots for all numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df[col].hist(bins=30, ax=axes[idx], edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)

plt.tight_layout()
plt.show()

# Boxplots for outliers
fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()

Categorical Features:

Python
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    plt.figure(figsize=(10, 5))

    # Value counts
    value_counts = df[col].value_counts()

    # Bar plot
    value_counts.plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # Print stats
    print(f"\n=== {col} ===")
    print(f"Unique values: {df[col].nunique()}")
    print(value_counts.head(10))

8.2.4 Bivariate Analysis

Correlation Heatmap:

Python
# Correlation matrix
plt.figure(figsize=(12, 10))
corr = df[numerical_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# Top correlations with target
if 'target' in df.columns:
    target_corr = corr['target'].abs().sort_values(ascending=False)
    print("\n=== Top 10 Correlations with Target ===")
    print(target_corr.head(10))

Scatter Matrix (Pair Plot):

Python
# Select top features for pair plot
from sklearn.feature_selection import SelectKBest, f_regression

if 'target' in df.columns:
    selector = SelectKBest(f_regression, k=5)
    selector.fit(df[numerical_cols], df['target'])
    top_features = df[numerical_cols].columns[selector.get_support()]

    # Pair plot
    sns.pairplot(df[list(top_features) + ['target']], diag_kind='kde')
    plt.show()

Categorical vs Target:

Python
for col in categorical_cols:
    if df[col].nunique() < 10:  # Only if not too many categories
        plt.figure(figsize=(10, 6))

        # Group by and plot
        df.groupby(col)['target'].mean().sort_values().plot(kind='barh')
        plt.title(f'Average Target by {col}')
        plt.xlabel('Average Target')
        plt.tight_layout()
        plt.show()

8.2.5 Multivariate Analysis

Crosstab for 2 Categorical:

Python
# Example: Gender vs Department
if 'gender' in df.columns and 'department' in df.columns:
    ct = pd.crosstab(df['gender'], df['department'], normalize='index')
    ct.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title('Department Distribution by Gender')
    plt.ylabel('Proportion')
    plt.xticks(rotation=0)
    plt.legend(title='Department')
    plt.show()

3D Scatter:

Python
from mpl_toolkits.mplot3d import Axes3D

if len(numerical_cols) >= 3:
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    scatter = ax.scatter(df[numerical_cols[0]],
                         df[numerical_cols[1]],
                         df[numerical_cols[2]],
                         c=df['target'] if 'target' in df.columns else None,
                         cmap='viridis')

    ax.set_xlabel(numerical_cols[0])
    ax.set_ylabel(numerical_cols[1])
    ax.set_zlabel(numerical_cols[2])
    plt.colorbar(scatter)
    plt.show()

8.3 EDA Report Generation

Python
# Automated EDA report; the pandas-profiling package has been renamed
# to ydata-profiling (on older installs: from pandas_profiling import ProfileReport)
from ydata_profiling import ProfileReport

# Generate a comprehensive report
profile = ProfileReport(df, title='EDA Report', explorative=True)
profile.to_file("eda_report.html")

print("✓ EDA Report generated: eda_report.html")

9. Advanced Feature Engineering Techniques

9.1 Target Encoding (Mean Encoding)

Replaces each category with the average target value for that category. Powerful, but risky (prone to overfitting).

Python
# Calculate the mean target per category
means = df.groupby('city')['target'].mean()

# Map onto a new feature
df['city_encoded'] = df['city'].map(means)

# With cross-validation to avoid overfitting
from sklearn.model_selection import KFold

def target_encode_cv(df, col, target, n_folds=5):
    """Target encode with CV to prevent overfitting."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    df[f'{col}_encoded'] = 0.0

    for train_idx, val_idx in kf.split(df):
        # Calculate means on the training folds only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Apply to the validation fold (convert positions to index labels)
        val_labels = df.index[val_idx]
        df.loc[val_labels, f'{col}_encoded'] = df.loc[val_labels, col].map(means)

    return df

df = target_encode_cv(df, 'city', 'target')

9.2 Frequency Encoding

Replaces each category with its count or frequency.

Python
# Frequency encoding
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)

# Normalized frequency (proportion)
freq_norm = df['color'].value_counts(normalize=True)
df['color_freq_norm'] = df['color'].map(freq_norm)

9.3 Binning (Discretization)

Converts continuous values into categorical buckets.

Python
# Fixed-width bins
df['age_bin'] = pd.cut(df['age'], bins=5, labels=['Very Young', 'Young', 'Middle', 'Old', 'Very Old'])

# Quantile-based bins (equal frequency)
df['salary_quartile'] = pd.qcut(df['salary'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins
bins = [0, 18, 30, 50, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)

Practice Exercises

  1. Exercise 1: Load the Titanic dataset and handle missing values in Age and Embarked
  2. Exercise 2: Detect and treat outliers in the Fare column
  3. Exercise 3: One-hot encode the Sex and Embarked columns, then train a model
  4. Exercise 4: Create datetime features from a date column (year, month, dayofweek, is_weekend)
  5. Exercise 5: Implement a complete EDA workflow on a new dataset
  6. Exercise 6: Create polynomial features (degree 2) and compare model performance
  7. Exercise 7: Encode cyclic features (hour, month) with sin/cos
  8. Exercise 8: Perform feature selection with three methods: Filter, Wrapper, and Embedded
