Data Quality & Validation
1. Why Does Data Quality Matter?
Garbage In, Garbage Out
Poor-quality data leads to incorrect analyses, bad decisions, and wasted money and time.
Common data quality problems (a small illustration follows this list):
- Missing values: data that is simply absent
- Duplicates: the same record appearing more than once
- Outliers: abnormal or extreme values
- Inconsistent formats: the same field encoded in different ways
- Invalid values: values that violate business rules
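To make these concrete, here is a minimal sketch of a hypothetical orders table that exhibits all five problems at once; the column names and values are invented purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical orders table showing all five issue types at once
orders = pd.DataFrame({
    'order_id': [101, 102, 102, 103, 104],           # duplicate: 102 appears twice
    'amount': [25.0, np.nan, np.nan, 30.0, 9999.0],  # missing values; 9999.0 looks like an outlier
    'order_date': ['2024-01-05', '2024/01/05', '2024/01/05',
                   '2024-01-06', '2024-01-07'],      # inconsistent date formats
    'status': ['active', ' ACTIVE', ' ACTIVE', 'pending', '???'],  # inconsistent case/whitespace; '???' is invalid
})

# Inconsistent formats are usually fixed first, by normalizing:
orders['status'] = orders['status'].str.strip().str.lower()
orders['order_date'] = pd.to_datetime(orders['order_date'], format='mixed')  # format='mixed' needs pandas >= 2.0

# Invalid values can then be flagged against an allowed set:
valid_status = ['active', 'pending', 'completed']
orders['status_valid'] = orders['status'].isin(valid_status)
```

The sections below deal with each of these problems in turn.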
2. Data Profiling
2.1 Quick Overview
```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('sales_data.csv')

# Basic info
print(df.shape)   # (rows, columns)
print(df.dtypes)  # Data types
df.info()         # Memory usage, non-null counts (prints directly; returns None)

# Statistical summary
print(df.describe())                  # Numeric columns
print(df.describe(include='object'))  # Categorical columns
```
2.2 Missing Values Analysis
```python
# Count missing values
missing_counts = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100

# Summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct.round(2)
}).sort_values('Missing %', ascending=False)

print(missing_summary[missing_summary['Missing Count'] > 0])
```
2.3 Automated Profiling with ydata-profiling
```python
# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate profile report
profile = ProfileReport(df, title='Data Quality Report', minimal=True)
profile.to_file('data_quality_report.html')

# Or display it inside a notebook
profile.to_notebook_iframe()
```
3. Handling Missing Values
3.1 Handling Strategies
```python
# 1. Drop rows with missing values
df_clean = df.dropna()                          # Drop any row with any NaN
df_clean = df.dropna(subset=['important_col'])  # Drop if a specific column is NaN
df_clean = df.dropna(thresh=3)                  # Keep rows with at least 3 non-null values

# 2. Drop columns with too many missing values
threshold = 0.5  # 50% missing
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_clean = df.drop(columns=cols_to_drop)
```
3.2 Imputation (Filling in Values)
```python
# Simple imputation (assign back instead of using inplace=True,
# which is deprecated on column slices in recent pandas)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())    # Mean
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())  # Median
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])  # Mode

# Forward/backward fill (time series); fillna(method=...) is deprecated
df['value'] = df['value'].ffill()  # Forward fill
df['value'] = df['value'].bfill()  # Backward fill

# Interpolation
df['value'] = df['value'].interpolate(method='linear')
```
3.3 Group-wise Imputation
```python
# Fill with the median of each group
df['sales'] = df.groupby('category')['sales'].transform(
    lambda x: x.fillna(x.median())
)

# Fill with the mode of each group (categorical)
df['status'] = df.groupby('region')['status'].transform(
    lambda x: x.fillna(x.mode()[0] if len(x.mode()) > 0 else 'Unknown')
)
```
3.4 Advanced Imputation with scikit-learn
```python
from sklearn.impute import SimpleImputer, KNNImputer

# Select the numeric columns to impute
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Simple Imputer
imputer = SimpleImputer(strategy='median')  # mean, median, most_frequent, constant
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# KNN Imputer (based on similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
```
4. Handling Duplicates
4.1 Identify Duplicates
```python
# Check for duplicates
print(f"Total duplicates: {df.duplicated().sum()}")

# Check duplicates based on specific columns
print(f"Duplicates by ID: {df.duplicated(subset=['customer_id']).sum()}")

# View duplicate rows
duplicates = df[df.duplicated(keep=False)]  # keep=False shows all duplicates
print(duplicates.sort_values('customer_id'))
```
4.2 Remove Duplicates
```python
# Remove exact duplicates (keep first)
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

# Custom duplicate handling: keep the row with the highest value
df_clean = df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])
```
5. Handling Outliers
5.1 Detecting Outliers
```python
import numpy as np

# Method 1: IQR (Interquartile Range)
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

outliers, lb, ub = detect_outliers_iqr(df, 'sales')
print(f"Outliers found: {len(outliers)}")
print(f"Range: [{lb:.2f}, {ub:.2f}]")
```

```python
# Method 2: Z-Score
from scipy import stats

def detect_outliers_zscore(df, column, threshold=3):
    # Compute z-scores on non-null values and keep their index,
    # so the mask aligns with the original rows even when NaNs exist
    col = df[column].dropna()
    z_scores = np.abs(stats.zscore(col))
    return df.loc[col.index[z_scores > threshold]]

outliers = detect_outliers_zscore(df, 'sales')
```
5.2 Handling Outliers
```python
# Option 1: Remove outliers
df_clean = df[(df['sales'] >= lb) & (df['sales'] <= ub)]

# Option 2: Cap outliers (Winsorization)
df['sales_capped'] = df['sales'].clip(lower=lb, upper=ub)

# Option 3: Log transformation (for right-skewed data)
df['sales_log'] = np.log1p(df['sales'])  # log(1+x) handles zeros

# Option 4: Flag outliers (keep for analysis)
df['is_outlier'] = ((df['sales'] < lb) | (df['sales'] > ub)).astype(int)
```
5.3 Visualization
```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot
axes[0].boxplot(df['sales'].dropna())
axes[0].set_title('Box Plot - Sales')

# Histogram
axes[1].hist(df['sales'].dropna(), bins=50, edgecolor='black')
axes[1].axvline(lb, color='r', linestyle='--', label='Lower Bound')
axes[1].axvline(ub, color='r', linestyle='--', label='Upper Bound')
axes[1].set_title('Histogram - Sales')
axes[1].legend()

plt.tight_layout()
plt.show()
```
6. Data Validation
6.1 Custom Validation Rules
```python
def validate_dataframe(df, customers_df):
    """Validate DataFrame against business rules."""
    errors = []

    # Rule 1: No negative prices
    if (df['price'] < 0).any():
        count = (df['price'] < 0).sum()
        errors.append(f"Found {count} negative prices")

    # Rule 2: Valid date range
    if (df['order_date'] > pd.Timestamp.now()).any():
        errors.append("Found future dates in order_date")

    # Rule 3: Valid category
    valid_categories = ['A', 'B', 'C', 'D']
    invalid = ~df['category'].isin(valid_categories)
    if invalid.any():
        errors.append(f"Found {invalid.sum()} invalid categories")

    # Rule 4: Unique constraint
    if df['order_id'].duplicated().any():
        errors.append("Duplicate order_id found")

    # Rule 5: Referential integrity
    # customer_id must exist in the customers table (passed in as customers_df)
    valid_customers = customers_df['customer_id'].unique()
    invalid_refs = ~df['customer_id'].isin(valid_customers)
    if invalid_refs.any():
        errors.append(f"Found {invalid_refs.sum()} invalid customer references")

    return errors

# Run validation
errors = validate_dataframe(df, customers_df)
if errors:
    print("Validation Errors:")
    for e in errors:
        print(f" - {e}")
else:
    print("✅ All validations passed!")
```
6.2 Schema Validation with Pandera
```python
# Install: pip install pandera
import pandera as pa
from pandera import Column, Check, DataFrameSchema

# Define schema
schema = DataFrameSchema({
    "customer_id": Column(int, Check.ge(1)),
    "name": Column(str, Check.str_length(min_value=1, max_value=100)),
    "email": Column(str, Check.str_matches(r'^[\w\.-]+@[\w\.-]+\.\w+$')),
    "age": Column(int, Check.in_range(18, 120), nullable=True),
    "balance": Column(float, Check.ge(0)),
    "status": Column(str, Check.isin(['active', 'inactive', 'pending'])),
    "created_at": Column(pa.DateTime)
})

# Validate
try:
    validated_df = schema.validate(df)
    print("✅ Schema validation passed!")
except pa.errors.SchemaError as e:
    print(f"❌ Schema validation failed: {e}")
```
6.3 Great Expectations (Enterprise-grade)
```python
# Install: pip install great_expectations
# Note: this fluent API matches Great Expectations 0.17/0.18; newer releases may differ
import great_expectations as gx

# Create context
context = gx.get_context()

# Create a validator from the DataFrame
validator = context.sources.pandas_default.read_dataframe(df)

# Add expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("price", min_value=0, max_value=10000)
validator.expect_column_values_to_be_in_set("status", ['active', 'pending', 'completed'])

# Validate
results = validator.validate()
print(results)
```
7. Data Quality Pipeline
```python
class DataQualityPipeline:
    """Comprehensive data quality pipeline."""

    def __init__(self, df):
        self.df = df.copy()
        self.report = {}

    def profile(self):
        """Generate a data profile."""
        self.report['shape'] = self.df.shape
        self.report['missing'] = self.df.isnull().sum().to_dict()
        self.report['dtypes'] = self.df.dtypes.astype(str).to_dict()
        return self

    def handle_missing(self, strategy='drop', numeric_fill='median',
                       categorical_fill='mode'):
        """Handle missing values."""
        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            # Numeric columns
            numeric_cols = self.df.select_dtypes(include=[np.number]).columns
            for col in numeric_cols:
                if numeric_fill == 'median':
                    self.df[col] = self.df[col].fillna(self.df[col].median())
                elif numeric_fill == 'mean':
                    self.df[col] = self.df[col].fillna(self.df[col].mean())

            # Categorical columns
            cat_cols = self.df.select_dtypes(include=['object']).columns
            for col in cat_cols:
                if categorical_fill == 'mode':
                    mode_val = self.df[col].mode()
                    if len(mode_val) > 0:
                        self.df[col] = self.df[col].fillna(mode_val[0])

        return self

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows."""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        self.report['duplicates_removed'] = before - len(self.df)
        return self

    def handle_outliers(self, columns, method='cap'):
        """Handle outliers using the IQR method."""
        for col in columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR

            if method == 'cap':
                self.df[col] = self.df[col].clip(lower, upper)
            elif method == 'remove':
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]

        return self

    def validate(self, rules):
        """Run validation rules."""
        self.report['validation_errors'] = []
        for rule_name, rule_func in rules.items():
            if not rule_func(self.df):
                self.report['validation_errors'].append(rule_name)
        return self

    def get_result(self):
        """Return the cleaned DataFrame and the report."""
        return self.df, self.report


# Usage
pipeline = DataQualityPipeline(df)

rules = {
    'no_negative_prices': lambda df: (df['price'] >= 0).all(),
    'valid_quantity': lambda df: (df['quantity'] > 0).all(),
    'unique_orders': lambda df: not df['order_id'].duplicated().any()
}

clean_df, report = (
    pipeline
    .profile()
    .handle_missing(strategy='fill')
    .remove_duplicates(subset=['order_id'])
    .handle_outliers(['price', 'quantity'], method='cap')
    .validate(rules)
    .get_result()
)

print("Report:", report)
```
8. Practice
Exercises
Exercise 1: Data Profiling
```python
# Create a comprehensive data profile report for the dataset
# Include: missing values, duplicates, data types, statistics

# Sample data
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', None, 'Eve', 'Frank'],
    'age': [25, 30, 30, 35, -5, 150],
    'salary': [50000, 60000, 60000, None, 70000, 80000]
})

# YOUR CODE HERE
```
💡 Show solution
```python
def create_profile(df):
    profile = {
        'shape': df.shape,
        'columns': list(df.columns),
        'dtypes': df.dtypes.to_dict(),
        'missing': df.isnull().sum().to_dict(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2).to_dict(),
        'duplicates': df.duplicated().sum(),
        'unique_counts': df.nunique().to_dict(),
        'numeric_stats': df.describe().to_dict() if len(df.select_dtypes(include=[np.number]).columns) > 0 else {}
    }
    return profile

profile = create_profile(df)
for key, value in profile.items():
    print(f"{key}: {value}")
```
Exercise 2: Build a Validation Function
```python
# Write a function that validates employee data against these rules:
# - employee_id: unique, positive integer
# - name: not null, length 2-100
# - department: one of ['IT', 'HR', 'Sales', 'Finance']
# - salary: positive, max 500000

# YOUR CODE HERE
```
💡 Show solution
```python
def validate_employee_data(df):
    errors = []

    # Rule 1: employee_id unique and positive
    if df['employee_id'].duplicated().any():
        errors.append("Duplicate employee_id found")
    if (df['employee_id'] <= 0).any():
        errors.append("Non-positive employee_id found")

    # Rule 2: name not null and length 2-100
    if df['name'].isnull().any():
        errors.append(f"Null names: {df['name'].isnull().sum()}")
    invalid_names = ~df['name'].dropna().str.len().between(2, 100)
    if invalid_names.any():
        errors.append(f"Invalid name length: {invalid_names.sum()}")

    # Rule 3: valid department
    valid_depts = ['IT', 'HR', 'Sales', 'Finance']
    invalid_dept = ~df['department'].isin(valid_depts)
    if invalid_dept.any():
        errors.append(f"Invalid departments: {df[invalid_dept]['department'].unique()}")

    # Rule 4: salary positive and max 500000
    if (df['salary'] <= 0).any():
        errors.append("Non-positive salary found")
    if (df['salary'] > 500000).any():
        errors.append("Salary exceeds maximum")

    return errors

# Test
test_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 3],
    'name': ['John', 'A', None, 'Bob'],
    'department': ['IT', 'HR', 'Unknown', 'Sales'],
    'salary': [50000, -1000, 600000, 45000]
})

errors = validate_employee_data(test_df)
print("Validation Errors:", errors)
```
9. Summary
| Issue | Detection | Handling |
|---|---|---|
| Missing values | `isnull()`, `info()` | Drop, fill, interpolate |
| Duplicates | `duplicated()` | `drop_duplicates()` |
| Outliers | IQR, Z-score | Remove, cap, transform |
| Invalid data | Custom rules, schema | Fix or flag |
Best practices:
- Profile the data first, before doing any cleaning
- Document every data-cleaning decision
- Keep the raw data untouched and create a separate cleaned version (see the sketch after this list)
- Automate the steps with pipelines for reproducibility
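A minimal sketch of the raw-versus-cleaned separation, assuming hypothetical file names:

```python
import pandas as pd

# Keep the raw file as-is; write the cleaned data to a separate file
raw_df = pd.read_csv('sales_data.csv')  # hypothetical raw input

clean_df = (
    raw_df
    .copy()  # never mutate the raw frame in place
    .drop_duplicates(subset=['order_id'])
    .dropna(subset=['price'])
)

clean_df.to_csv('sales_data_clean.csv', index=False)  # raw file stays untouched
```

Wrapping steps like these in something like the DataQualityPipeline from section 7 makes the whole run repeatable.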
Tool recommendations:
- Quick profiling: `pandas-profiling` / `ydata-profiling`
- Schema validation: `pandera`
- Enterprise: `great_expectations`
Next lesson: ETL Pipelines with Python
