Data Quality & Validation
1. Why Does Data Quality Matter?
Garbage In, Garbage Out
Poor-quality data leads to incorrect analyses, bad decisions, and wasted money and time.
Common data quality problems (a small illustration follows this list):
- Missing values: data that is simply absent
- Duplicates: the same record appearing more than once
- Outliers: abnormal or extreme values
- Inconsistent formats: the same field encoded in different ways
- Invalid values: values that violate business rules
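To make these concrete, here is a minimal sketch of a hypothetical orders table that exhibits all five problems at once; the column names and values are invented purely for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical orders table showing all five issue types at once
orders = pd.DataFrame({
    'order_id': [101, 102, 102, 103, 104],           # duplicate: 102 appears twice
    'amount': [25.0, np.nan, np.nan, 30.0, 9999.0],  # missing values; 9999.0 looks like an outlier
    'order_date': ['2024-01-05', '2024/01/05', '2024/01/05',
                   '2024-01-06', '2024-01-07'],      # inconsistent date formats
    'status': ['active', ' ACTIVE', ' ACTIVE', 'pending', '???'],  # inconsistent case/whitespace; '???' is invalid
})

# Inconsistent formats are usually fixed first, by normalizing:
orders['status'] = orders['status'].str.strip().str.lower()
orders['order_date'] = pd.to_datetime(orders['order_date'], format='mixed')  # format='mixed' needs pandas >= 2.0

# Invalid values can then be flagged against an allowed set:
valid_status = ['active', 'pending', 'completed']
orders['status_valid'] = orders['status'].isin(valid_status)
```

The sections below deal with each of these problems in turn.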
2. Data Profiling
2.1 Quick Overview
```python
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('sales_data.csv')

# Basic info
print(df.shape)   # (rows, columns)
print(df.dtypes)  # Data types
df.info()         # Memory usage, non-null counts (prints directly; returns None)

# Statistical summary
print(df.describe())                  # Numeric columns
print(df.describe(include='object'))  # Categorical columns
```
2.2 Missing Values Analysis
```python
# Count missing values
missing_counts = df.isnull().sum()
missing_pct = (df.isnull().sum() / len(df)) * 100

# Summary DataFrame
missing_summary = pd.DataFrame({
    'Missing Count': missing_counts,
    'Missing %': missing_pct.round(2)
}).sort_values('Missing %', ascending=False)

print(missing_summary[missing_summary['Missing Count'] > 0])
```
2.3 Automated Profiling with ydata-profiling
```python
# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate profile report
profile = ProfileReport(df, title='Data Quality Report', minimal=True)
profile.to_file('data_quality_report.html')

# Or display it inside a notebook
profile.to_notebook_iframe()
```
3. Handling Missing Values
3.1 Handling Strategies
```python
# 1. Drop rows with missing values
df_clean = df.dropna()                          # Drop any row with any NaN
df_clean = df.dropna(subset=['important_col'])  # Drop if a specific column is NaN
df_clean = df.dropna(thresh=3)                  # Keep rows with at least 3 non-null values

# 2. Drop columns with too many missing values
threshold = 0.5  # 50% missing
cols_to_drop = df.columns[df.isnull().mean() > threshold]
df_clean = df.drop(columns=cols_to_drop)
```
3.2 Imputation (Filling in Values)
```python
# Simple imputation (assign back instead of using inplace=True,
# which is deprecated on column slices in recent pandas)
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].mean())    # Mean
df['numeric_col'] = df['numeric_col'].fillna(df['numeric_col'].median())  # Median
df['categorical_col'] = df['categorical_col'].fillna(df['categorical_col'].mode()[0])  # Mode

# Forward/backward fill (time series); fillna(method=...) is deprecated
df['value'] = df['value'].ffill()  # Forward fill
df['value'] = df['value'].bfill()  # Backward fill

# Interpolation
df['value'] = df['value'].interpolate(method='linear')
```
3.3 Group-wise Imputation
```python
# Fill with the median of each group
df['sales'] = df.groupby('category')['sales'].transform(
    lambda x: x.fillna(x.median())
)

# Fill with the mode of each group (categorical)
df['status'] = df.groupby('region')['status'].transform(
    lambda x: x.fillna(x.mode()[0] if len(x.mode()) > 0 else 'Unknown')
)
```
3.4 Advanced Imputation with scikit-learn
```python
from sklearn.impute import SimpleImputer, KNNImputer

# Select the numeric columns to impute
numeric_cols = df.select_dtypes(include=[np.number]).columns

# Simple Imputer
imputer = SimpleImputer(strategy='median')  # mean, median, most_frequent, constant
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])

# KNN Imputer (based on similar rows)
knn_imputer = KNNImputer(n_neighbors=5)
df[numeric_cols] = knn_imputer.fit_transform(df[numeric_cols])
```
4. Handling Duplicates
4.1 Identify Duplicates
```python
# Check for duplicates
print(f"Total duplicates: {df.duplicated().sum()}")

# Check duplicates based on specific columns
print(f"Duplicates by ID: {df.duplicated(subset=['customer_id']).sum()}")

# View duplicate rows
duplicates = df[df.duplicated(keep=False)]  # keep=False shows all duplicates
print(duplicates.sort_values('customer_id'))
```
4.2 Remove Duplicates
```python
# Remove exact duplicates (keep first)
df_clean = df.drop_duplicates()

# Remove duplicates based on specific columns
df_clean = df.drop_duplicates(subset=['customer_id', 'order_date'], keep='last')

# Custom duplicate handling: keep the row with the highest value
df_clean = df.sort_values('amount', ascending=False).drop_duplicates(subset=['order_id'])
```
5. Handling Outliers
5.1 Detecting Outliers
```python
import numpy as np

# Method 1: IQR (Interquartile Range)
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

outliers, lb, ub = detect_outliers_iqr(df, 'sales')
print(f"Outliers found: {len(outliers)}")
print(f"Range: [{lb:.2f}, {ub:.2f}]")
```

```python
# Method 2: Z-Score
from scipy import stats

def detect_outliers_zscore(df, column, threshold=3):
    # Compute z-scores on non-null values and keep their index,
    # so the mask aligns with the original rows even when NaNs exist
    col = df[column].dropna()
    z_scores = np.abs(stats.zscore(col))
    return df.loc[col.index[z_scores > threshold]]

outliers = detect_outliers_zscore(df, 'sales')
```
5.2 Handling Outliers
```python
# Option 1: Remove outliers
df_clean = df[(df['sales'] >= lb) & (df['sales'] <= ub)]

# Option 2: Cap outliers (Winsorization)
df['sales_capped'] = df['sales'].clip(lower=lb, upper=ub)

# Option 3: Log transformation (for right-skewed data)
df['sales_log'] = np.log1p(df['sales'])  # log(1+x) handles zeros

# Option 4: Flag outliers (keep for analysis)
df['is_outlier'] = ((df['sales'] < lb) | (df['sales'] > ub)).astype(int)
```
5.3 Visualization
```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Box plot
axes[0].boxplot(df['sales'].dropna())
axes[0].set_title('Box Plot - Sales')

# Histogram
axes[1].hist(df['sales'].dropna(), bins=50, edgecolor='black')
axes[1].axvline(lb, color='r', linestyle='--', label='Lower Bound')
axes[1].axvline(ub, color='r', linestyle='--', label='Upper Bound')
axes[1].set_title('Histogram - Sales')
axes[1].legend()

plt.tight_layout()
plt.show()
```
6. Data Validation
6.1 Custom Validation Rules
```python
def validate_dataframe(df, customers_df):
    """Validate DataFrame against business rules."""
    errors = []

    # Rule 1: No negative prices
    if (df['price'] < 0).any():
        count = (df['price'] < 0).sum()
        errors.append(f"Found {count} negative prices")

    # Rule 2: Valid date range
    if (df['order_date'] > pd.Timestamp.now()).any():
        errors.append("Found future dates in order_date")

    # Rule 3: Valid category
    valid_categories = ['A', 'B', 'C', 'D']
    invalid = ~df['category'].isin(valid_categories)
    if invalid.any():
        errors.append(f"Found {invalid.sum()} invalid categories")

    # Rule 4: Unique constraint
    if df['order_id'].duplicated().any():
        errors.append("Duplicate order_id found")

    # Rule 5: Referential integrity
    # customer_id must exist in the customers table (passed in as customers_df)
    valid_customers = customers_df['customer_id'].unique()
    invalid_refs = ~df['customer_id'].isin(valid_customers)
    if invalid_refs.any():
        errors.append(f"Found {invalid_refs.sum()} invalid customer references")

    return errors

# Run validation
errors = validate_dataframe(df, customers_df)
if errors:
    print("Validation Errors:")
    for e in errors:
        print(f" - {e}")
else:
    print("✅ All validations passed!")
```
6.2 Schema Validation with Pandera
```python
# Install: pip install pandera
import pandera as pa
from pandera import Column, Check, DataFrameSchema

# Define schema
schema = DataFrameSchema({
    "customer_id": Column(int, Check.ge(1)),
    "name": Column(str, Check.str_length(min_value=1, max_value=100)),
    "email": Column(str, Check.str_matches(r'^[\w\.-]+@[\w\.-]+\.\w+$')),
    "age": Column(int, Check.in_range(18, 120), nullable=True),
    "balance": Column(float, Check.ge(0)),
    "status": Column(str, Check.isin(['active', 'inactive', 'pending'])),
    "created_at": Column(pa.DateTime)
})

# Validate
try:
    validated_df = schema.validate(df)
    print("✅ Schema validation passed!")
except pa.errors.SchemaError as e:
    print(f"❌ Schema validation failed: {e}")
```
6.3 Great Expectations (Enterprise-grade)
```python
# Install: pip install great_expectations
# Note: this fluent API matches Great Expectations 0.17/0.18; newer releases may differ
import great_expectations as gx

# Create context
context = gx.get_context()

# Create a validator from the DataFrame
validator = context.sources.pandas_default.read_dataframe(df)

# Add expectations
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("price", min_value=0, max_value=10000)
validator.expect_column_values_to_be_in_set("status", ['active', 'pending', 'completed'])

# Validate
results = validator.validate()
print(results)
```
7. Data Quality Pipeline
```python
class DataQualityPipeline:
    """Comprehensive data quality pipeline."""

    def __init__(self, df):
        self.df = df.copy()
        self.report = {}

    def profile(self):
        """Generate a data profile."""
        self.report['shape'] = self.df.shape
        self.report['missing'] = self.df.isnull().sum().to_dict()
        self.report['dtypes'] = self.df.dtypes.astype(str).to_dict()
        return self

    def handle_missing(self, strategy='drop', numeric_fill='median',
                       categorical_fill='mode'):
        """Handle missing values."""
        if strategy == 'drop':
            self.df = self.df.dropna()
        elif strategy == 'fill':
            # Numeric columns
            numeric_cols = self.df.select_dtypes(include=[np.number]).columns
            for col in numeric_cols:
                if numeric_fill == 'median':
                    self.df[col] = self.df[col].fillna(self.df[col].median())
                elif numeric_fill == 'mean':
                    self.df[col] = self.df[col].fillna(self.df[col].mean())

            # Categorical columns
            cat_cols = self.df.select_dtypes(include=['object']).columns
            for col in cat_cols:
                if categorical_fill == 'mode':
                    mode_val = self.df[col].mode()
                    if len(mode_val) > 0:
                        self.df[col] = self.df[col].fillna(mode_val[0])

        return self

    def remove_duplicates(self, subset=None):
        """Remove duplicate rows."""
        before = len(self.df)
        self.df = self.df.drop_duplicates(subset=subset)
        self.report['duplicates_removed'] = before - len(self.df)
        return self

    def handle_outliers(self, columns, method='cap'):
        """Handle outliers using the IQR method."""
        for col in columns:
            Q1 = self.df[col].quantile(0.25)
            Q3 = self.df[col].quantile(0.75)
            IQR = Q3 - Q1
            lower = Q1 - 1.5 * IQR
            upper = Q3 + 1.5 * IQR

            if method == 'cap':
                self.df[col] = self.df[col].clip(lower, upper)
            elif method == 'remove':
                self.df = self.df[(self.df[col] >= lower) & (self.df[col] <= upper)]

        return self

    def validate(self, rules):
        """Run validation rules."""
        self.report['validation_errors'] = []
        for rule_name, rule_func in rules.items():
            if not rule_func(self.df):
                self.report['validation_errors'].append(rule_name)
        return self

    def get_result(self):
        """Return the cleaned DataFrame and the report."""
        return self.df, self.report


# Usage
pipeline = DataQualityPipeline(df)

rules = {
    'no_negative_prices': lambda df: (df['price'] >= 0).all(),
    'valid_quantity': lambda df: (df['quantity'] > 0).all(),
    'unique_orders': lambda df: not df['order_id'].duplicated().any()
}

clean_df, report = (
    pipeline
    .profile()
    .handle_missing(strategy='fill')
    .remove_duplicates(subset=['order_id'])
    .handle_outliers(['price', 'quantity'], method='cap')
    .validate(rules)
    .get_result()
)

print("Report:", report)
```
8. Practice
Exercises
Exercise 1: Data Profiling
```python
# Create a comprehensive data profile report for the dataset
# Include: missing values, duplicates, data types, statistics

# Sample data
df = pd.DataFrame({
    'id': [1, 2, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Bob', None, 'Eve', 'Frank'],
    'age': [25, 30, 30, 35, -5, 150],
    'salary': [50000, 60000, 60000, None, 70000, 80000]
})

# YOUR CODE HERE
```
💡 Show solution
```python
def create_profile(df):
    profile = {
        'shape': df.shape,
        'columns': list(df.columns),
        'dtypes': df.dtypes.to_dict(),
        'missing': df.isnull().sum().to_dict(),
        'missing_pct': (df.isnull().sum() / len(df) * 100).round(2).to_dict(),
        'duplicates': df.duplicated().sum(),
        'unique_counts': df.nunique().to_dict(),
        'numeric_stats': df.describe().to_dict() if len(df.select_dtypes(include=[np.number]).columns) > 0 else {}
    }
    return profile

profile = create_profile(df)
for key, value in profile.items():
    print(f"{key}: {value}")
```
Exercise 2: Build a Validation Function
```python
# Write a function that validates employee data against these rules:
# - employee_id: unique, positive integer
# - name: not null, length 2-100
# - department: one of ['IT', 'HR', 'Sales', 'Finance']
# - salary: positive, max 500000

# YOUR CODE HERE
```
💡 Show solution
```python
def validate_employee_data(df):
    errors = []

    # Rule 1: employee_id unique and positive
    if df['employee_id'].duplicated().any():
        errors.append("Duplicate employee_id found")
    if (df['employee_id'] <= 0).any():
        errors.append("Non-positive employee_id found")

    # Rule 2: name not null and length 2-100
    if df['name'].isnull().any():
        errors.append(f"Null names: {df['name'].isnull().sum()}")
    invalid_names = ~df['name'].dropna().str.len().between(2, 100)
    if invalid_names.any():
        errors.append(f"Invalid name length: {invalid_names.sum()}")

    # Rule 3: valid department
    valid_depts = ['IT', 'HR', 'Sales', 'Finance']
    invalid_dept = ~df['department'].isin(valid_depts)
    if invalid_dept.any():
        errors.append(f"Invalid departments: {df[invalid_dept]['department'].unique()}")

    # Rule 4: salary positive and max 500000
    if (df['salary'] <= 0).any():
        errors.append("Non-positive salary found")
    if (df['salary'] > 500000).any():
        errors.append("Salary exceeds maximum")

    return errors

# Test
test_df = pd.DataFrame({
    'employee_id': [1, 2, 3, 3],
    'name': ['John', 'A', None, 'Bob'],
    'department': ['IT', 'HR', 'Unknown', 'Sales'],
    'salary': [50000, -1000, 600000, 45000]
})

errors = validate_employee_data(test_df)
print("Validation Errors:", errors)
```
9. Summary
| Issue | Detection | Handling |
|---|---|---|
| Missing values | `isnull()`, `info()` | Drop, fill, interpolate |
| Duplicates | `duplicated()` | `drop_duplicates()` |
| Outliers | IQR, Z-score | Remove, cap, transform |
| Invalid data | Custom rules, schema | Fix or flag |
Best practices:
- Profile the data first, before doing any cleaning
- Document every data-cleaning decision
- Keep the raw data untouched and create a separate cleaned version (see the sketch after this list)
- Automate the steps with pipelines for reproducibility
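A minimal sketch of the raw-versus-cleaned separation, assuming hypothetical file names:

```python
import pandas as pd

# Keep the raw file as-is; write the cleaned data to a separate file
raw_df = pd.read_csv('sales_data.csv')  # hypothetical raw input

clean_df = (
    raw_df
    .copy()  # never mutate the raw frame in place
    .drop_duplicates(subset=['order_id'])
    .dropna(subset=['price'])
)

clean_df.to_csv('sales_data_clean.csv', index=False)  # raw file stays untouched
```

Wrapping steps like these in something like the DataQualityPipeline from section 7 makes the whole run repeatable.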
Tool recommendations:
- Quick profiling: `pandas-profiling` / `ydata-profiling`
- Schema validation: `pandera`
- Enterprise: `great_expectations`
Next lesson: ETL Pipelines with Python
