Theory
Lesson 4/15

Data Quality & Validation

Handling missing values, outliers, data validation, and data profiling


1. Why Does Data Quality Matter?

Garbage In, Garbage Out

Poor-quality data leads to wrong analyses, wrong decisions, and wasted money and time.

Common data quality problems (see the sketch after this list):

  • Missing Values: data that is absent
  • Duplicates: repeated records
  • Outliers: abnormal values
  • Inconsistent Formats: the same information represented in different ways
  • Invalid Values: values that violate business rules
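
A minimal sketch of what these problems look like in practice, using hypothetical toy data (df_demo and its columns are made up for illustration):

Python
import pandas as pd

# Hypothetical toy data exhibiting all five problems at once
df_demo = pd.DataFrame({
    'order_id': [1, 2, 2, 3, 4],                        # order_id 2 appears twice
    'customer': ['Alice', 'BOB ', 'BOB ', None, 'eve'], # missing value + inconsistent case/whitespace
    'order_date': ['2024-01-05', '05/01/2024', '05/01/2024',
                   '2024-01-07', '2024-01-08'],         # mixed date formats
    'amount': [120.0, 85.5, 85.5, -40.0, 99999.0],      # invalid negative value + extreme outlier
})

print(df_demo.isnull().sum())          # missing values per column
print(df_demo.duplicated().sum())      # exact duplicate rows
print((df_demo['amount'] < 0).sum())   # invalid (negative) amounts

# Normalize one kind of inconsistent format: case and whitespace
df_demo['customer'] = df_demo['customer'].str.strip().str.title()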

2. Data Profiling

2.1 Quick Overview

Python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('sales_data.csv')

# Basic info
print(df.shape)   # (rows, columns)
print(df.dtypes)  # Data types
df.info()         # Memory usage, non-null counts (prints directly; returns None)

# Statistical summary
print(df.describe())                  # Numeric columns
print(df.describe(include='object')) # Categorical columns

2.2 Missing Values Analysis

Python
# Count missing values
missing_counts = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100

# Summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct.round(2)
}).sort_values('Missing %', ascending=False)

print(missing_summary[missing_summary['Missing Count'] > 0])
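
To see where the gaps cluster (whole rows missing vs. one sparse column), plotting the null mask can help. A minimal sketch, assuming matplotlib is installed; dedicated tools such as missingno give the same view with less code:

Python
import matplotlib.pyplot as plt

# Dark cells mark missing values; row/column patterns stand out visually
plt.figure(figsize=(10, 4))
plt.imshow(df.isnull(), aspect='auto', interpolation='nearest', cmap='gray_r')
plt.xticks(range(len(df.columns)), df.columns, rotation=45, ha='right')
plt.xlabel('Column')
plt.ylabel('Row')
plt.title('Missing-value pattern')
plt.tight_layout()
plt.show()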

2.3 Automated Profiling with ydata-profiling (formerly pandas-profiling)

Python
# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate profile report
profile = ProfileReport(df, title='Data Quality Report', minimal=True)
profile.to_file('data_quality_report.html')

# Or display inside a notebook
profile.to_notebook_iframe()

3. Handling Missing Values

3.1 Handling Strategies

Python
# 1. Drop rows with missing values
df_clean = df.dropna()                          # Drop any row with any NaN
df_clean = df.dropna(subset=['important_col'])  # Drop if a specific column is NaN
df_clean = df.dropna(thresh=3)                  # Keep rows with at least 3 non-null values

# 2. Drop columns with too many missing values
threshold = 0.5  # 50% missing
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_clean = df.drop(columns=cols_to_drop)

3.2 Imputation (Filling in Values)

Python
# Simple imputation (assign back; inplace=True on a column is deprecated in pandas 2.x)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())                  # Mean
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())                # Median
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])   # Mode

# Forward/Backward fill (time series); fillna(method=...) is deprecated
df['value'] = df['value'].ffill()  # Forward fill
df['value'] = df['value'].bfill()  # Backward fill

# Interpolation
df['value'] = df['value'].interpolate(method='linear')

3.3 Group-wise Imputation

Python
# Fill with the median of each group
df['sales'] = df.groupby('category')['sales'].transform(
    lambda x: x.fillna(x.median())
)

# Fill with the mode of each group (categorical)
df['status'] = df.groupby('region')['status'].transform(
    lambda x: x.fillna(x.mode()[0] if len(x.mode()) > 0 else 'Unknown')
)

3.4 Advanced Imputation with scikit-learn

Python
from sklearn.impute import SimpleImputer, KNNImputer

# Columns to impute (numeric only)
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Simple Imputer
imputer = SimpleImputer(strategy='median')  # mean, median, most_frequent, constant
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# KNN Imputer (fills based on similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
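
One caveat when imputation feeds a machine-learning model: fit the imputer on the training split only, then reuse it on the test split, so test-set statistics don't leak into the fill values. A minimal sketch, assuming train_df, test_df, and numeric_cols already exist:

Python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

# Learn the fill values from the training data only...
train_df[numeric_cols] = imputer.fit_transform(train_df[numeric_cols])

# ...then apply those same values to the test data
test_df[numeric_cols] = imputer.transform(test_df[numeric_cols])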

4. Handling Duplicates

4.1 Identify Duplicates

Python
# Check for duplicates
print(f"Total duplicates: {df.duplicated().sum()}")

# Check duplicates based on specific columns
print(f"Duplicates by ID: {df.duplicated(subset=['customer_id']).sum()}")

# View duplicate rows
duplicates = df[df.duplicated(keep=False)]  # keep=False shows all duplicates
print(duplicates.sort_values('customer_id'))

4.2 Remove Duplicates

Python
# Remove exact duplicates (keep first occurrence)
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

# Custom duplicate handling:
# keep the row with the highest value
df_clean = df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])
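
When duplicates carry partial information (the same order logged twice with different fields filled in), consolidating them can beat dropping. A minimal sketch with hypothetical column names:

Python
# Merge duplicate orders instead of discarding them:
# latest date, summed amount, first non-null status per order_id
df_merged = (
    df.groupby('order_id', as_index=False)
      .agg({'order_date': 'max', 'amount': 'sum', 'status': 'first'})
)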

5. Handling Outliers

5.1 Detecting Outliers

Python
import numpy as np

# Method 1: IQR (Interquartile Range)
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

outliers, lb, ub = detect_outliers_iqr(df, 'sales')
print(f"Outliers found: {len(outliers)}")
print(f"Range: [{lb:.2f}, {ub:.2f}]")
Python
# Method 2: Z-Score
from scipy import stats

def detect_outliers_zscore(df, column, threshold=3):
    # Compute z-scores on non-null values, then map back to the original
    # rows via the index (otherwise NaNs misalign the boolean mask)
    col = df[column].dropna()
    z_scores = np.abs(stats.zscore(col))
    return df.loc[col.index[z_scores > threshold]]

outliers = detect_outliers_zscore(df, 'sales')

5.2 Handling Outliers

Python
# Option 1: Remove outliers
df_clean = df[(df['sales'] >= lb) & (df['sales'] <= ub)]

# Option 2: Cap outliers (Winsorization)
df['sales_capped'] = df['sales'].clip(lower=lb, upper=ub)

# Option 3: Log transformation (for right-skewed data)
df['sales_log'] = np.log1p(df['sales'])  # log(1+x) handles zeros

# Option 4: Flag outliers (keep them for analysis)
df['is_outlier'] = ((df['sales'] < lb) | (df['sales'] > ub)).astype(int)

5.3 Visualization

Python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot
axes[0].boxplot(df['sales'].dropna())
axes[0].set_title('Box Plot - Sales')

# Histogram
axes[1].hist(df['sales'].dropna(), bins=50, edgecolor='black')
axes[1].axvline(lb, color='r', linestyle='--', label='Lower Bound')
axes[1].axvline(ub, color='r', linestyle='--', label='Upper Bound')
axes[1].set_title('Histogram - Sales')
axes[1].legend()

plt.tight_layout()
plt.show()

6. Data Validation

6.1 Custom Validation Rules

Python
def validate_dataframe(df, customers_df):
    """Validate DataFrame against business rules."""
    errors = []

    # Rule 1: No negative prices
    if (df['price'] < 0).any():
        count = (df['price'] < 0).sum()
        errors.append(f"Found {count} negative prices")

    # Rule 2: Valid date range
    if (df['order_date'] > pd.Timestamp.now()).any():
        errors.append("Found future dates in order_date")

    # Rule 3: Valid category
    valid_categories = ['A', 'B', 'C', 'D']
    invalid = ~df['category'].isin(valid_categories)
    if invalid.any():
        errors.append(f"Found {invalid.sum()} invalid categories")

    # Rule 4: Unique constraint
    if df['order_id'].duplicated().any():
        errors.append("Duplicate order_id found")

    # Rule 5: Referential integrity
    # customer_id must exist in the customers table
    valid_customers = customers_df['customer_id'].unique()
    invalid_refs = ~df['customer_id'].isin(valid_customers)
    if invalid_refs.any():
        errors.append(f"Found {invalid_refs.sum()} invalid customer references")

    return errors

# Run validation (customers_df is the reference table of known customers)
errors = validate_dataframe(df, customers_df)
if errors:
    print("Validation Errors:")
    for e in errors:
        print(f" - {e}")
else:
    print("✅ All validations passed!")

6.2 Schema Validation with Pandera

Python
# Install: pip install pandera
import pandera as pa
from pandera import Column, Check, DataFrameSchema

# Define schema
schema = DataFrameSchema({
    "customer_id": Column(int, Check.ge(1)),
    "name": Column(str, Check.str_length(min_value=1, max_value=100)),
    "email": Column(str, Check.str_matches(r'^[\w\.-]+@[\w\.-]+\.\w+$')),
    "age": Column(int, Check.in_range(18, 120), nullable=True),
    "balance": Column(float, Check.ge(0)),
    "status": Column(str, Check.isin(['active', 'inactive', 'pending'])),
    "created_at": Column(pa.DateTime)
})

# Validate
try:
    validated_df = schema.validate(df)
    print("✅ Schema validation passed!")
except pa.errors.SchemaError as e:
    print(f"❌ Schema validation failed: {e}")

6.3 Great Expectations (Enterprise-grade)

Python
# Install: pip install great_expectations
import great_expectations as gx

# Create context
context = gx.get_context()

# Create a validator from the DataFrame
validator = context.sources.pandas_default.read_dataframe(df)

# Add expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("price", min_value=0, max_value=10000)
validator.expect_column_values_to_be_in_set("status", ['active', 'pending', 'completed'])

# Validate
results = validator.validate()
print(results)

7. Data Quality Pipeline

Python
class DataQualityPipeline:
    """Comprehensive data quality pipeline."""

    def __init__(self, df):
        self.df = df.copy()
        self.report = {}

    def profile(self):
        """Generate data profile."""
        self.report['shape'] = self.df.shape
        self.report['missing'] = self.df.isnull().sum().to_dict()
        self.report['dtypes'] = self.df.dtypes.astype(str).to_dict()
        return self

    def handle_missing(self, strategy='drop', numeric_fill='median',
                       categorical_fill='mode'):
        """Handle missing values."""
        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            # Numeric columns
            numeric_cols = self.df.select_dtypes(include=[np.number]).columns
            for col in numeric_cols:
                if numeric_fill == 'median':
                    self.df[col] = self.df[col].fillna(self.df[col].median())
                elif numeric_fill == 'mean':
                    self.df[col] = self.df[col].fillna(self.df[col].mean())

            # Categorical columns
            cat_cols = self.df.select_dtypes(include=['object']).columns
            for col in cat_cols:
                if categorical_fill == 'mode':
                    mode_val = self.df[col].mode()
                    if len(mode_val) > 0:
                        self.df[col] = self.df[col].fillna(mode_val[0])

        return self

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows."""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        self.report['duplicates_removed'] = before - len(self.df)
        return self

    def handle_outliers(self, columns, method='cap'):
        """Handle outliers using the IQR method."""
        for col in columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR

            if method == 'cap':
                self.df[col] = self.df[col].clip(lower, upper)
            elif method == 'remove':
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]

        return self

    def validate(self, rules):
        """Run validation rules."""
        self.report['validation_errors'] = []
        for rule_name, rule_func in rules.items():
            if not rule_func(self.df):
                self.report['validation_errors'].append(rule_name)
        return self

    def get_result(self):
        """Return cleaned DataFrame and report."""
        return self.df, self.report


# Usage
pipeline = DataQualityPipeline(df)

rules = {
    'no_negative_prices': lambda df: (df['price'] >= 0).all(),
    'valid_quantity': lambda df: (df['quantity'] > 0).all(),
    'unique_orders': lambda df: not df['order_id'].duplicated().any()
}

clean_df, report = (
    pipeline
    .profile()
    .handle_missing(strategy='fill')
    .remove_duplicates(subset=['order_id'])
    .handle_outliers(['price', 'quantity'], method='cap')
    .validate(rules)
    .get_result()
)

print("Report:", report)

8. Practice

Exercises

Exercise 1: Data Profiling

Python
# Create a comprehensive data profile report for the dataset
# Include: missing values, duplicates, data types, statistics

# Sample data
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', None, 'Eve', 'Frank'],
    'age': [25, 30, 30, 35, -5, 150],
    'salary': [50000, 60000, 60000, None, 70000, 80000]
})

# YOUR CODE HERE
💡 Show answer
Python
def create_profile(df):
    profile = {
        'shape': df.shape,
        'columns': list(df.columns),
        'dtypes': df.dtypes.to_dict(),
        'missing': df.isnull().sum().to_dict(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2).to_dict(),
        'duplicates': df.duplicated().sum(),
        'unique_counts': df.nunique().to_dict(),
        'numeric_stats': df.describe().to_dict() if len(df.select_dtypes(include=[np.number]).columns) > 0 else {}
    }
    return profile

profile = create_profile(df)
for key, value in profile.items():
    print(f"{key}: {value}")

Exercise 2: Build Validation Function

Python
# Write a function to validate employee data with these rules:
# - employee_id: unique, positive integer
# - name: not null, length 2-100
# - department: one of ['IT', 'HR', 'Sales', 'Finance']
# - salary: positive, max 500000

# YOUR CODE HERE
💡 Show answer
Python
def validate_employee_data(df):
    errors = []

    # Rule 1: employee_id unique and positive
    if df['employee_id'].duplicated().any():
        errors.append("Duplicate employee_id found")
    if (df['employee_id'] <= 0).any():
        errors.append("Non-positive employee_id found")

    # Rule 2: name not null and length 2-100
    if df['name'].isnull().any():
        errors.append(f"Null names: {df['name'].isnull().sum()}")
    invalid_names = ~df['name'].dropna().str.len().between(2, 100)
    if invalid_names.any():
        errors.append(f"Invalid name length: {invalid_names.sum()}")

    # Rule 3: valid department
    valid_depts = ['IT', 'HR', 'Sales', 'Finance']
    invalid_dept = ~df['department'].isin(valid_depts)
    if invalid_dept.any():
        errors.append(f"Invalid departments: {df[invalid_dept]['department'].unique()}")

    # Rule 4: salary positive and at most 500000
    if (df['salary'] <= 0).any():
        errors.append("Non-positive salary found")
    if (df['salary'] > 500000).any():
        errors.append("Salary exceeds maximum")

    return errors

# Test
test_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 3],
    'name': ['John', 'A', None, 'Bob'],
    'department': ['IT', 'HR', 'Unknown', 'Sales'],
    'salary': [50000, -1000, 600000, 45000]
})

errors = validate_employee_data(test_df)
print("Validation Errors:", errors)

9. Summary

Issue          | Detection            | Handling
Missing Values | isnull(), info()     | Drop, Fill, Interpolate
Duplicates     | duplicated()         | drop_duplicates()
Outliers       | IQR, Z-score         | Remove, Cap, Transform
Invalid Data   | Custom rules, Schema | Fix or Flag

Best Practices:

  • Profile the data first, before any cleaning
  • Document every data-cleaning decision
  • Keep the raw data; save the cleaned version separately
  • Automate with pipelines for reproducibility

Recommended Tools:

  • Quick profiling: pandas-profiling / ydata-profiling
  • Schema validation: pandera
  • Enterprise: great_expectations

Next lesson: ETL Pipelines with Python