Data Processing with Pandas
Learning Objectives
After this lesson, learners will be able to:
- Manipulate data with Pandas DataFrames
- Handle missing values and outliers
- Perform basic feature engineering
- Encode categorical variables
1. Pandas DataFrame Basics
1.1 Creating and Reading DataFrames
```python
import pandas as pd
import numpy as np

# Create a DataFrame from a dictionary
data = {
    'Name': ['An', 'Binh', 'Chi', 'Đúng'],
    'Age': [25, 30, 35, np.nan],
    'Salary': [50000, 60000, 75000, 80000],
    'Department': ['IT', 'HR', 'IT', 'Finance']
}
df = pd.DataFrame(data)
print(df)

# Read from CSV
# df = pd.read_csv('data.csv')

# Read from Excel
# df = pd.read_excel('data.xlsx')
```
1.2 Basic Operations
```python
# Inspect the DataFrame
df.info()  # info() prints directly; no need to wrap it in print()
print(df.describe())
print(df.shape)
print(df.columns)
print(df.dtypes)

# Access data
print(df['Age'])            # Single column
print(df[['Name', 'Age']])  # Multiple columns
print(df.iloc[0])           # First row
print(df.loc[0, 'Name'])    # Specific cell

# Filter data
print(df[df['Age'] > 25])
print(df[(df['Age'] > 25) & (df['Department'] == 'IT')])
```
2. Handling Missing Values
2.1 Detecting Missing Values
```python
# Check for missing values
print(df.isnull().sum())
print(df.isnull().sum() / len(df) * 100)  # Percentage

# Visualize missing values
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False)
plt.title('Missing Values Heatmap')
plt.show()
```
2.2 Strategies for Handling Missing Values
| Strategy | When to use | Code |
|---|---|---|
| Drop rows | Few missing values, plenty of data | df.dropna() |
| Drop column | Column is mostly missing (> 50%) | df.drop(columns=['col']) |
| Fill with mean | Numerical, roughly normal distribution | df['col'].fillna(df['col'].mean()) |
| Fill with median | Numerical, has outliers | df['col'].fillna(df['col'].median()) |
| Fill with mode | Categorical | df['col'].fillna(df['col'].mode()[0]) |
| Forward fill | Time series | df['col'].ffill() |
2.3 Practice
```python
# Handle missing values
df_clean = df.copy()

# Fill Age with the median (assignment avoids chained-assignment warnings)
df_clean['Age'] = df_clean['Age'].fillna(df_clean['Age'].median())

# Or use SimpleImputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')
df_clean['Age'] = imputer.fit_transform(df_clean[['Age']])

print(df_clean)
```
3. Handling Outliers
3.1 Detecting Outliers
The IQR (Interquartile Range) method: with IQR = Q3 − Q1, any value outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR] is flagged as an outlier.
Figure: boxplot illustrating how outliers are identified.
3.2 Detection in Practice
```python
import numpy as np
import pandas as pd

def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    outliers = data[(data[column] < lower) | (data[column] > upper)]
    return outliers, lower, upper

# Example
data = pd.DataFrame({
    'value': [10, 12, 14, 15, 16, 18, 100]  # 100 is an outlier
})

outliers, lower, upper = detect_outliers_iqr(data, 'value')
print(f"Outliers:\n{outliers}")
print(f"Bounds: [{lower:.2f}, {upper:.2f}]")

# Visualize
import matplotlib.pyplot as plt
plt.boxplot(data['value'])
plt.title('Boxplot - Outlier Detection')
plt.show()
```
3.3 Handling Outliers
| Method | Code |
|---|---|
| Remove | df = df[(df['col'] >= lower) & (df['col'] <= upper)] |
| Cap (Winsorization) | df['col'] = df['col'].clip(lower, upper) |
| Transform | df['col_log'] = np.log1p(df['col']) |
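A minimal sketch applying all three treatments to the `data` frame from section 3.2, reusing the `lower`/`upper` bounds computed there:

```python
import numpy as np

# Remove: keep only the rows inside the IQR bounds
data_removed = data[(data['value'] >= lower) & (data['value'] <= upper)]

# Cap (winsorize): pull extreme values back to the bounds
data_capped = data.copy()
data_capped['value'] = data_capped['value'].clip(lower, upper)

# Transform: log1p compresses the right tail (requires values > -1)
data_log = data.copy()
data_log['value_log'] = np.log1p(data_log['value'])

print(data_removed.shape, data_capped['value'].max(), data_log['value_log'].max())
```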
4. Feature Engineering
4.1 Encoding Categorical Variables
Label Encoding:
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['Department_encoded'] = le.fit_transform(df['Department'])
print(df)
print(f"Classes: {le.classes_}")
```
One-Hot Encoding:
```python
# Pandas
df_onehot = pd.get_dummies(df, columns=['Department'], prefix='Dept')
print(df_onehot)

# Sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse_output=False, drop='first')
encoded = ohe.fit_transform(df[['Department']])
print(encoded)
```
4.2 Which Encoding to Use When?
| Encoding | When to use | Example |
|---|---|---|
| Label Encoding | Ordinal data (has an order) | Education level: Low < Medium < High |
| One-Hot Encoding | Nominal data (no order) | Color: Red, Blue, Green |
| Target Encoding | High cardinality | Zip code, City |
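One caveat: sklearn's LabelEncoder assigns integers alphabetically, so it will not respect a real ordering like Low < Medium < High. For truly ordinal data, an explicit mapping is safer; a minimal sketch with a hypothetical Education column:

```python
import pandas as pd

# Hypothetical ordinal column
edu = pd.Series(['Low', 'High', 'Medium', 'Low'], name='Education')

# An explicit mapping preserves the intended order Low < Medium < High
order = {'Low': 0, 'Medium': 1, 'High': 2}
edu_encoded = edu.map(order)

# Equivalent with an ordered categorical
edu_cat = pd.Categorical(edu, categories=['Low', 'Medium', 'High'], ordered=True)
print(edu_encoded.tolist(), edu_cat.codes)
```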
4.3 Creating New Features
```python
# Create features from dates
df['Date'] = pd.to_datetime(['2024-01-15', '2024-02-20', '2024-03-10', '2024-04-05'])
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['DayOfWeek'] = df['Date'].dt.dayofweek
df['IsWeekend'] = df['DayOfWeek'].isin([5, 6]).astype(int)

# Binning
df['Age_Group'] = pd.cut(df['Age'], bins=[0, 25, 35, 100], labels=['Young', 'Middle', 'Senior'])

# Interaction features
df['Age_Salary'] = df['Age'] * df['Salary']
```
5. A Complete Data Pipeline
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Assume we have data
# df = pd.read_csv('data.csv')

# Define columns
numerical_cols = ['Age', 'Salary']
categorical_cols = ['Department']

# Numerical pipeline
numerical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(drop='first', sparse_output=False))
])

# Combine
preprocessor = ColumnTransformer([
    ('num', numerical_pipeline, numerical_cols),
    ('cat', categorical_pipeline, categorical_cols)
])

# Apply
# X_processed = preprocessor.fit_transform(X_train)
```
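In practice the preprocessor is usually chained with an estimator so preprocessing is fit only on the training data; a minimal sketch, assuming splits X_train, X_test, y_train already exist and the target is categorical:

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Full pipeline: preprocessing + model in one object
model = Pipeline([
    ('preprocess', preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

# fit() runs fit_transform on the preprocessor, then fits the classifier;
# predict() reuses the fitted transformers, so there is no train/test leakage
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
```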
6. Pros and Cons of Each Method
6.1 Missing Value Handling
| Method | Pros | Cons |
|---|---|---|
| Drop | Simple | Loses data |
| Mean/Median | Keeps all rows | Can introduce bias |
| Model-based (KNN) | More accurate | Slow |
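For the model-based row, scikit-learn provides KNNImputer out of the box; a minimal sketch on a toy array:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [7.0, 8.0]])

# Each missing value is filled with the mean of that feature over the
# k nearest rows (distances are computed ignoring missing entries)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```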
6.2 Encoding
| Method | Pros | Cons |
|---|---|---|
| Label | Simple, one column | Creates a false order |
| One-Hot | No artificial order | Many columns (curse of dimensionality) |
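To make the dimensionality cost concrete, a quick illustration (on hypothetical data) of the column blow-up one-hot encoding causes on a high-cardinality column:

```python
import pandas as pd
import numpy as np

# Hypothetical high-cardinality column: up to 1,000 distinct zip codes
rng = np.random.default_rng(0)
zips = pd.DataFrame({'zip': rng.integers(10000, 11000, size=5000).astype(str)})

onehot = pd.get_dummies(zips, columns=['zip'])
print(f"1 column became {onehot.shape[1]} columns")  # roughly one per distinct zip
```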
7. Advanced Feature Engineering
7.1 Workflow Overview
Raw Data → Extract → Transform → Select → Model
Goal: create features that correlate strongly with the target, remove noise, and satisfy model assumptions.
7.2 DateTime & Cyclic Features
7.2.1 Basic DateTime Extraction
```python
# Create a datetime column
df['dt'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['dt'].dt.year
df['month'] = df['dt'].dt.month
df['day'] = df['dt'].dt.day
df['dayofweek'] = df['dt'].dt.dayofweek  # 0=Monday, 6=Sunday
df['hour'] = df['dt'].dt.hour
df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int)
df['is_month_start'] = df['dt'].dt.is_month_start.astype(int)
df['is_month_end'] = df['dt'].dt.is_month_end.astype(int)

# Time differences
df['days_since'] = (df['dt'] - df['dt'].min()).dt.days
```
7.2.2 Cyclic Encoding (Sin/Cos)
Problem: hour 23 and hour 0 are adjacent in time, but numerically far apart (23 vs 0).
Solution: encode the value as a sin/cos pair so that this closeness is preserved.
```python
import numpy as np

# Hour cyclic encoding (0-23 hours)
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)

# Month cyclic encoding (1-12 months)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

# Day-of-week cyclic encoding (0-6)
df['dow_sin'] = np.sin(2 * np.pi * df['dayofweek'] / 7)
df['dow_cos'] = np.cos(2 * np.pi * df['dayofweek'] / 7)
```
Visualization:
```
Hour 23 ≈ Hour 0
        ↓
Sin/Cos: close
vs
Linear: 23 - 0 = 23 (far)
```
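A quick numeric check of that picture, measuring Euclidean distance in the 2-D sin/cos plane:

```python
import numpy as np

def hour_to_xy(h):
    """Map an hour (0-23) onto the unit circle."""
    angle = 2 * np.pi * h / 24
    return np.array([np.sin(angle), np.cos(angle)])

# Hour 23 vs hour 0: adjacent points on the circle
print(np.linalg.norm(hour_to_xy(23) - hour_to_xy(0)))  # ~0.26
# Hour 12 vs hour 0: opposite sides of the circle
print(np.linalg.norm(hour_to_xy(12) - hour_to_xy(0)))  # 2.0
```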
7.3 Polynomial & Interaction Features
7.3.1 Polynomial Features
Creates non-linear terms such as x², x³, and x₁·x₂:
```python
from sklearn.preprocessing import PolynomialFeatures

# Degree 2: x1, x2 → x1, x2, x1², x2², x1·x2
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out())
# Output: ['x1', 'x2', 'x1^2', 'x1 x2', 'x2^2']
```
When to use:
- Linear models that need to capture non-linear relationships
- Example: house price = Area + Area² (diminishing returns); see the sketch below
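A minimal sketch of that example on synthetic, hypothetical data, comparing a plain linear fit against one with an added squared term:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical diminishing-returns data: price grows sub-linearly with area
rng = np.random.default_rng(42)
area = rng.uniform(30, 200, size=200).reshape(-1, 1)
price = 1000 * np.sqrt(area[:, 0]) + rng.normal(0, 200, size=200)

# Plain linear fit vs. fit with an added squared term
lin = LinearRegression().fit(area, price)
area_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(area)
quad = LinearRegression().fit(area_poly, price)

print(f"R² linear:    {lin.score(area, price):.3f}")
print(f"R² quadratic: {quad.score(area_poly, price):.3f}")
```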
7.3.2 Domain-Specific Interactions
```python
# Ratio features
df['price_per_sqft'] = df['price'] / df['sqft']
df['income_to_loan'] = df['income'] / (df['loan_amount'] + 1)

# Sum/Difference
df['total_income'] = df['applicant_income'] + df['spouse_income']
df['income_diff'] = df['applicant_income'] - df['spouse_income']

# Multiplication
df['sqft_x_bedrooms'] = df['sqft'] * df['bedrooms']

# Binning interactions
df['age_income_group'] = df['age_group'] + '_' + df['income_group']
```
7.4 Text Features (NLP)
7.4.1 Basic Stats
```python
# Word count
df['word_count'] = df['text'].apply(lambda x: len(str(x).split()))

# Character count
df['char_count'] = df['text'].str.len()

# Average word length
df['avg_word_len'] = df['char_count'] / df['word_count']

# Number of uppercase words
df['uppercase_count'] = df['text'].apply(
    lambda x: sum(1 for w in str(x).split() if w.isupper())
)
```
7.4.2 TF-IDF (Term Frequency - Inverse Document Frequency)
Reflects how important a word is to a document within a collection of documents.
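For reference, the standard definition (sklearn's TfidfVectorizer additionally applies smoothing and L2 normalization):

tf-idf(t, d) = tf(t, d) × log(N / df(t))

where tf(t, d) is the count of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t.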
```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF with top 100 features
tfidf = TfidfVectorizer(
    max_features=100,
    stop_words='english',
    ngram_range=(1, 2)  # Unigrams + bigrams
)

X_text = tfidf.fit_transform(df['text'])

# Get feature names
feature_names = tfidf.get_feature_names_out()
print(f"Created {len(feature_names)} text features")
```
7.5 Feature Selection
7.5.1 Filter Methods (Statistical Tests)
Remove High Correlation:
```python
import numpy as np

# Correlation matrix (absolute values, numeric columns only)
corr_matrix = df.corr(numeric_only=True).abs()

# Upper triangle
upper = corr_matrix.where(
    np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
)

# Find features with correlation > 0.95
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
print(f"Dropping {len(to_drop)} highly correlated features")

df_filtered = df.drop(columns=to_drop)
```
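Filter methods also include univariate statistical tests; a minimal sketch using SelectKBest with an ANOVA F-test, assuming a feature DataFrame X (with at least 10 columns) and a categorical target y:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature against the target with an ANOVA F-test
# and keep the 10 highest-scoring ones
selector = SelectKBest(score_func=f_classif, k=10)
X_best = selector.fit_transform(X, y)

selected = X.columns[selector.get_support()]
print(f"Selected features: {list(selected)}")
```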
7.5.2 Wrapper Methods (RFE)
Recursive Feature Elimination iteratively drops the weakest features until the requested number remain.
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Select top 10 features
selector = RFE(
    estimator=LinearRegression(),
    n_features_to_select=10,
    step=1
)
selector.fit(X, y)

# Get selected features
selected_features = X.columns[selector.support_]
print(f"Selected features: {selected_features}")

# Transform data
X_selected = selector.transform(X)
```
7.5.3 Embedded Methods (Model-based)
Random Forest Feature Importance:
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Feature importances
importances = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Select top features
top_features = importances.head(15)['feature'].values
X_important = X[top_features]
```
Lasso Regularization (L1):
```python
from sklearn.linear_model import LassoCV

# Lasso automatically selects features (sets weak coefficients to 0)
lasso = LassoCV(cv=5, random_state=42)
lasso.fit(X, y)

# Non-zero coefficients
selected_mask = lasso.coef_ != 0
selected_features = X.columns[selected_mask]
print(f"Lasso selected {len(selected_features)} features")
```
8. Exploratory Data Analysis (EDA)
8.1 EDA Workflow
Load → Clean → Analyze → Visualize → Insights
8.2 Step-by-Step EDA
8.2.1 Load & Initial Exploration
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure
plt.style.use('ggplot')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

# Load data
df = pd.read_csv('data.csv')

# Quick inspection
print("=== Shape ===")
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")

print("\n=== First 5 Rows ===")
print(df.head())

print("\n=== Data Types ===")
print(df.dtypes)

print("\n=== Info ===")
df.info()  # info() prints directly

print("\n=== Summary Statistics ===")
print(df.describe())
```
8.2.2 Missing Values Analysis
```python
# Missing count & percentage
missing = pd.DataFrame({
    'count': df.isnull().sum(),
    'percent': (df.isnull().sum() / len(df) * 100).round(2)
})
missing = missing[missing['count'] > 0].sort_values('count', ascending=False)
print(missing)

# Visualize missing values
plt.figure(figsize=(12, 6))
sns.heatmap(df.isnull(), cbar=True, yticklabels=False, cmap='viridis')
plt.title('Missing Values Heatmap')
plt.show()

# Missing-value patterns (requires the missingno package)
import missingno as msno
msno.matrix(df)
plt.show()
```
8.2.3 Univariate Analysis
Numerical Features:
```python
# Distribution plots for all numerical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns

fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df[col].hist(bins=30, ax=axes[idx], edgecolor='black')
    axes[idx].set_title(f'Distribution of {col}')
    axes[idx].set_xlabel(col)

plt.tight_layout()
plt.show()

# Boxplots for outliers
fig, axes = plt.subplots(nrows=len(numerical_cols)//3 + 1, ncols=3,
                         figsize=(15, 10))
axes = axes.ravel()

for idx, col in enumerate(numerical_cols):
    df.boxplot(column=col, ax=axes[idx])
    axes[idx].set_title(f'Boxplot of {col}')

plt.tight_layout()
plt.show()
```
Categorical Features:
```python
categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    plt.figure(figsize=(10, 5))

    # Value counts
    value_counts = df[col].value_counts()

    # Bar plot
    value_counts.plot(kind='bar')
    plt.title(f'Distribution of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()

    # Print stats
    print(f"\n=== {col} ===")
    print(f"Unique values: {df[col].nunique()}")
    print(value_counts.head(10))
```
8.2.4 Bivariate Analysis
Correlation Heatmap:
```python
# Correlation matrix
plt.figure(figsize=(12, 10))
corr = df[numerical_cols].corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1)
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

# Top correlations with target
if 'target' in df.columns:
    target_corr = corr['target'].abs().sort_values(ascending=False)
    print("\n=== Top 10 Correlations with Target ===")
    print(target_corr.head(10))
```
Scatter Matrix (Pair Plot):
```python
# Select top features for the pair plot
from sklearn.feature_selection import SelectKBest, f_regression

if 'target' in df.columns:
    features = numerical_cols.drop('target')  # exclude the target itself
    selector = SelectKBest(f_regression, k=5)
    selector.fit(df[features], df['target'])
    top_features = features[selector.get_support()]

    # Pair plot
    sns.pairplot(df[list(top_features) + ['target']], diag_kind='kde')
    plt.show()
```
Categorical vs Target:
```python
for col in categorical_cols:
    if df[col].nunique() < 10:  # Only if not too many categories
        plt.figure(figsize=(10, 6))

        # Group by and plot
        df.groupby(col)['target'].mean().sort_values().plot(kind='barh')
        plt.title(f'Average Target by {col}')
        plt.xlabel('Average Target')
        plt.tight_layout()
        plt.show()
```
8.2.5 Multivariate Analysis
Crosstab for Two Categorical Variables:
```python
# Example: Gender vs Department
if 'gender' in df.columns and 'department' in df.columns:
    ct = pd.crosstab(df['gender'], df['department'], normalize='index')
    ct.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title('Department Distribution by Gender')
    plt.ylabel('Proportion')
    plt.xticks(rotation=0)
    plt.legend(title='Department')
    plt.show()
```
3D Scatter:
```python
from mpl_toolkits.mplot3d import Axes3D

if len(numerical_cols) >= 3:
    fig = plt.figure(figsize=(10, 8))
    ax = fig.add_subplot(111, projection='3d')

    scatter = ax.scatter(df[numerical_cols[0]],
                         df[numerical_cols[1]],
                         df[numerical_cols[2]],
                         c=df['target'] if 'target' in df.columns else None,
                         cmap='viridis')

    ax.set_xlabel(numerical_cols[0])
    ax.set_ylabel(numerical_cols[1])
    ax.set_zlabel(numerical_cols[2])
    plt.colorbar(scatter)
    plt.show()
```
8.3 EDA Report Generation
```python
# Automated EDA report; the pandas-profiling package has been renamed ydata-profiling
from ydata_profiling import ProfileReport

# Generate a comprehensive report
profile = ProfileReport(df, title='EDA Report', explorative=True)
profile.to_file("eda_report.html")

print("✓ EDA Report generated: eda_report.html")
```
9. Advanced Feature Engineering Techniques
9.1 Target Encoding (Mean Encoding)
Replaces each category with the average target value for that category. Powerful, but risky: without care it leaks target information and overfits.
```python
# Calculate mean target per category
means = df.groupby('city')['target'].mean()

# Map to new feature
df['city_encoded'] = df['city'].map(means)

# With cross-validation to avoid overfitting
from sklearn.model_selection import KFold

def target_encode_cv(df, col, target, n_folds=5):
    """Target encode with CV to prevent overfitting.

    Note: assumes a default RangeIndex, since KFold yields positional indices.
    """
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    df[f'{col}_encoded'] = 0.0

    for train_idx, val_idx in kf.split(df):
        # Calculate category means on the training folds only
        means = df.iloc[train_idx].groupby(col)[target].mean()
        # Apply them to the held-out validation fold
        df.loc[val_idx, f'{col}_encoded'] = df.loc[val_idx, col].map(means)

    return df

df = target_encode_cv(df, 'city', 'target')
```
9.2 Frequency Encoding
Replace category with its count/frequency
```python
# Frequency encoding
freq = df['color'].value_counts()
df['color_freq'] = df['color'].map(freq)

# Normalized frequency (proportion)
freq_norm = df['color'].value_counts(normalize=True)
df['color_freq_norm'] = df['color'].map(freq_norm)
```
9.3 Binning (Discretization)
Convert continuous → categorical buckets
```python
# Fixed-width bins
df['age_bin'] = pd.cut(df['age'], bins=5,
                       labels=['Very Young', 'Young', 'Middle', 'Old', 'Very Old'])

# Quantile-based bins (equal frequency)
df['salary_quartile'] = pd.qcut(df['salary'], q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])

# Custom bins
bins = [0, 18, 30, 50, 100]
labels = ['Child', 'Young Adult', 'Adult', 'Senior']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels)
```
Practice Exercises
- Exercise 1: Load the Titanic dataset and handle missing values in Age and Embarked
- Exercise 2: Detect and handle outliers in the Fare column
- Exercise 3: One-hot encode the Sex and Embarked columns, then train a model
- Exercise 4: Create datetime features from a date column (year, month, dayofweek, is_weekend)
- Exercise 5: Implement a complete EDA workflow on a new dataset
- Exercise 6: Create polynomial features (degree 2) and compare model performance
- Exercise 7: Encode cyclic features (hour, month) with sin/cos
- Exercise 8: Perform feature selection with three methods: filter, wrapper, and embedded
