📉 Descriptive Statistics - Measures of Dispersion

Learning Objectives

After this lesson, you will be able to:

Understand and compute Range, Variance, and Standard Deviation
Distinguish Population vs. Sample Variance
Understand the meaning of the Coefficient of Variation
Apply the Empirical Rule

1. Why Do We Need Measures of Dispersion?

Consider two datasets with the same Mean = 50:

Dataset A	Dataset B
48, 49, 50, 51, 52	10, 30, 50, 70, 90
Mean = 50	Mean = 50
Concentrated	Dispersed

Insight

The Mean alone is not enough! We also need to know how the data are spread around the mean.
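
As a quick check of the two datasets above, the following sketch compares their means and spreads. It uses Python's standard `statistics` module, which is an assumption made here for brevity (the lesson's own examples use NumPy), and it borrows the population standard deviation that is only defined later, in section 4.

```python
import statistics

dataset_a = [48, 49, 50, 51, 52]
dataset_b = [10, 30, 50, 70, 90]

for name, values in (("A", dataset_a), ("B", dataset_b)):
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)  # population standard deviation
    print(f"Dataset {name}: mean = {mean}, std = {spread:.2f}")
```

Both datasets report the same mean, while the spread of Dataset B comes out roughly twenty times larger than that of Dataset A.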

2. Range

2.1 Formula

$Range = Max - Min$

2.2 Example

Dataset A: [48, 49, 50, 51, 52] $Range_A = 52 - 48 = 4$

Dataset B: [10, 30, 50, 70, 90] $Range_B = 90 - 10 = 80$

2.3 Pros and Cons

Pros	Cons
Simple, easy to compute	Uses only 2 values
Fast	Very sensitive to outliers

2.4 Python Code

Python

import numpy as np

data = [10, 30, 50, 70, 90]
range_value = np.max(data) - np.min(data)
print(f"Range: {range_value}")  # 80

# Or use np.ptp (peak to peak)
print(f"Range (ptp): {np.ptp(data)}")  # 80
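
To make the outlier sensitivity listed in 2.3 concrete, here is a minimal sketch in plain Python. The helper `value_range` and the extra value 500 are hypothetical, introduced only for illustration.

```python
def value_range(values):
    """Return max - min for a non-empty sequence of numbers."""
    return max(values) - min(values)

clean = [48, 49, 50, 51, 52]
with_outlier = clean + [500]  # hypothetical outlier added for illustration

print(value_range(clean))         # 4
print(value_range(with_outlier))  # 452
```

A single extreme value changes the range from 4 to 452, even though the other five values are unchanged.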

3. Variance

3.1 Formula

Population Variance:

$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$

Sample Variance:

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Bessel's Correction

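Sample variance divides by (n-1) instead of n so that it is an unbiased estimator of the population variance. Deviations are measured from the sample mean $\bar{x}$, which is itself computed from the same data, so they are on average slightly smaller than deviations from the true mean $\mu$; dividing by n-1 compensates for this. This is called Bessel's correction.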
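
The effect of Bessel's correction can be seen in a small simulation. The sketch below is illustrative only: the population (normal with $\sigma = 2$, so variance 4), the sample size, the number of trials, and the seed are all assumptions chosen for the demonstration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

true_variance = 4.0  # variance of the simulated population (sigma = 2)
n = 5                # small sample size, where the bias is most visible
trials = 20_000

sum_biased = 0.0
sum_unbiased = 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    sample_mean = sum(sample) / n
    ss = sum((x - sample_mean) ** 2 for x in sample)  # sum of squared deviations
    sum_biased += ss / n          # divide by n
    sum_unbiased += ss / (n - 1)  # divide by n - 1 (Bessel's correction)

avg_biased = sum_biased / trials
avg_unbiased = sum_unbiased / trials
print(f"Average when dividing by n:     {avg_biased:.2f}")
print(f"Average when dividing by n - 1: {avg_unbiased:.2f}")
```

Averaged over many samples, dividing by n should land near $4 \times \frac{n-1}{n} = 3.2$, while dividing by n-1 should land near the true value of 4.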

3.2 Worked Example

Data: [2, 4, 4, 4, 5, 5, 7, 9]

Step 1: Compute the mean $\bar{x} = \frac{2+4+4+4+5+5+7+9}{8} = \frac{40}{8} = 5$

Step 2: Compute the deviation of each value

$x_i$	$x_i - \bar{x}$	$(x_i - \bar{x})^2$
2	-3	9
4	-1	1
4	-1	1
4	-1	1
5	0	0
5	0	0
7	2	4
9	4	16
Total	0	32

Step 3: Compute the variance

Population: $\sigma^2 = \frac{32}{8} = 4$

Sample: $s^2 = \frac{32}{7} \approx 4.57$
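
The three steps above can be sketched in plain Python without any library, which makes it easy to check each intermediate value against the table. The variable names are illustrative.

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: mean
mean = sum(data) / len(data)

# Step 2: deviations and squared deviations
deviations = [x - mean for x in data]
squared = [d ** 2 for d in deviations]
sum_of_squares = sum(squared)

# Step 3: divide by N (population) or n - 1 (sample)
population_variance = sum_of_squares / len(data)
sample_variance = sum_of_squares / (len(data) - 1)

print(mean)                       # 5.0
print(sum(deviations))            # 0.0
print(sum_of_squares)             # 32.0
print(population_variance)        # 4.0
print(round(sample_variance, 2))  # 4.57
```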

3.3 Python Code

Python

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population Variance (ddof=0, the NumPy default)
pop_var = np.var(data, ddof=0)
print(f"Population Variance: {pop_var}")  # 4.0

# Sample Variance (ddof=1, the pandas default)
sample_var = np.var(data, ddof=1)
print(f"Sample Variance: {sample_var:.2f}")  # 4.57

4. Standard Deviation

4.1 Formula

$\sigma = \sqrt{\sigma^2} \quad \text{(Population)}$

$s = \sqrt{s^2} \quad \text{(Sample)}$

4.2 Meaning

The standard deviation measures the typical distance of values from the mean, expressed in the same units as the data.

From the example above:

$\sigma = \sqrt{4} = 2$
Values typically lie about 2 units from the mean
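
Strictly speaking, the standard deviation is the square root of the mean squared deviation, not a plain average of distances, so large deviations weigh more heavily. The sketch below contrasts it with the mean absolute deviation, a measure not otherwise covered in this lesson and used here only for comparison.

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = sum(data) / len(data)

# Mean absolute deviation: the literal average distance from the mean
mean_abs_dev = sum(abs(x - mean) for x in data) / len(data)

# Population standard deviation: square root of the mean squared deviation
pop_std = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))

print(mean_abs_dev)  # 1.5
print(pop_std)       # 2.0
```

The literal average distance is 1.5, while the standard deviation is 2 because the values 2 and 9 contribute disproportionately once their deviations are squared.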

4.3 Python Code

Python

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population Std
pop_std = np.std(data, ddof=0)
print(f"Population Std: {pop_std}")  # 2.0

# Sample Std
sample_std = np.std(data, ddof=1)
print(f"Sample Std: {sample_std:.2f}")  # 2.14

5. Coefficient of Variation

5.1 Formula

$CV = \frac{\sigma}{\mu} \times 100\%$

5.2 Meaning

Because CV is a unitless ratio, it allows dispersion to be compared between datasets that have different units or different means.

5.3 Example

Compare the variability of:

Height (cm): Mean = 170, Std = 10
Weight (kg): Mean = 65, Std = 8

$CV_{height} = \frac{10}{170} \times 100\% = 5.88\%$

$CV_{weight} = \frac{8}{65} \times 100\% = 12.31\%$

→ Weight has the larger relative variability.

5.4 Python Code

Python

from scipy.stats import variation

# Height
height_mean, height_std = 170, 10
cv_height = (height_std / height_mean) * 100

# Weight
weight_mean, weight_std = 65, 8
cv_weight = (weight_std / weight_mean) * 100

print(f"CV Height: {cv_height:.2f}%")  # 5.88%
print(f"CV Weight: {cv_weight:.2f}%")  # 12.31%

# Using scipy (variation uses the population std, ddof=0, by default)
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"CV: {variation(data) * 100:.2f}%")  # 40.00%

6. Empirical Rule (68-95-99.7 Rule)

6.1 Definition

For a normal distribution, approximately the following share of the data falls within 1, 2, and 3 standard deviations of the mean:

Interval	% of Data
μ ± 1σ	~68%
μ ± 2σ	~95%
μ ± 3σ	~99.7%

6.2 Example

IQ scores have Mean = 100, Std = 15

Coverage	Calculation	Result
68%	100 ± 15	85 - 115
95%	100 ± 30	70 - 130
99.7%	100 ± 45	55 - 145

→ About 68% of people have an IQ between 85 and 115
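
The rounded figures 68, 95, and 99.7 come from the normal distribution itself: the probability of falling within $k$ standard deviations of the mean is $\operatorname{erf}(k/\sqrt{2})$. A minimal sketch using only the standard library, with the hypothetical helper `normal_coverage`, reproduces the IQ intervals together with the more precise percentages.

```python
import math

mu, sigma = 100, 15  # IQ example from the text

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    low, high = mu - k * sigma, mu + k * sigma
    print(f"{low}-{high}: {normal_coverage(k) * 100:.2f}%")
# 85-115: 68.27%
# 70-130: 95.45%
# 55-145: 99.73%
```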

6.3 Illustrative Code

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate normal distribution
mu, sigma = 100, 15
x = np.linspace(mu - 4*sigma, mu + 4*sigma, 1000)
y = stats.norm.pdf(x, mu, sigma)

plt.figure(figsize=(12, 6))
plt.plot(x, y, 'b-', linewidth=2)

# Fill regions
plt.fill_between(x, y, where=(x >= mu-sigma) & (x <= mu+sigma),
                 alpha=0.3, color='green', label='68% (μ±1σ)')
plt.fill_between(x, y, where=(x >= mu-2*sigma) & (x <= mu+2*sigma),
                 alpha=0.2, color='yellow', label='95% (μ±2σ)')
plt.fill_between(x, y, where=(x >= mu-3*sigma) & (x <= mu+3*sigma),
                 alpha=0.1, color='red', label='99.7% (μ±3σ)')

# Add vertical lines
for i, color in zip([1, 2, 3], ['green', 'orange', 'red']):
    plt.axvline(mu - i*sigma, color=color, linestyle='--', alpha=0.5)
    plt.axvline(mu + i*sigma, color=color, linestyle='--', alpha=0.5)

plt.xlabel('IQ Score')
plt.ylabel('Probability Density')
plt.title('Empirical Rule (68-95-99.7 Rule)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

# Verify with simulated data (results vary slightly from run to run)
print("=== Empirical Rule Verification ===")
data = np.random.normal(mu, sigma, 100000)
print(f"% within 1σ: {np.mean((data >= mu-sigma) & (data <= mu+sigma))*100:.1f}%")
print(f"% within 2σ: {np.mean((data >= mu-2*sigma) & (data <= mu+2*sigma))*100:.1f}%")
print(f"% within 3σ: {np.mean((data >= mu-3*sigma) & (data <= mu+3*sigma))*100:.1f}%")

7. Z-Score (Standard Score)

7.1 Formula

$z = \frac{x - \mu}{\sigma}$

7.2 Meaning

The Z-score tells how many standard deviations a value lies from the mean.

Z-score	Meaning
z = 0	Equal to the mean
z = 1	1 std above the mean
z = -2	2 std below the mean
|z| > 3	Potential outlier
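
The last row of the table can be turned into a simple outlier check. The sketch below is one possible approach, not a definitive method: the function `flag_outliers`, its `threshold` parameter, and the sample data are hypothetical, and it assumes the values are not all identical (otherwise the standard deviation is 0 and the division fails).

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Return the values whose |z-score| exceeds the threshold."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [x for x in values if abs((x - mu) / sigma) > threshold]

# Hypothetical data: 20 values near 50 plus one extreme value
values = [48, 49, 50, 51, 52] * 4 + [120]
print(flag_outliers(values))  # [120]
```

Note that with the population standard deviation, no value in a dataset of n points can have |z| above $\sqrt{n-1}$, so the |z| > 3 rule can only flag anything in datasets with more than 10 values.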

7.3 Example

IQ = 130, with μ = 100, σ = 15

$z = \frac{130 - 100}{15} = 2$

→ An IQ of 130 is 2 standard deviations above the mean

7.4 Python Code

Python

import numpy as np
from scipy import stats

# Data
mu, sigma = 100, 15
iq_score = 130

# Z-score
z = (iq_score - mu) / sigma
print(f"Z-score: {z:.2f}")  # 2.00

# Standardize entire array
# (stats.zscore uses the array's own mean and population std, not mu and sigma above)
data = [85, 100, 115, 130, 145]
z_scores = stats.zscore(data)
print(f"Z-scores: {z_scores}")

# Percentile from the Z-score (assumes a normal distribution)
percentile = stats.norm.cdf(z) * 100
print(f"Percentile: {percentile:.1f}%")  # 97.7%

8. Putting All the Measures Together

Combined code

Python

import numpy as np
import pandas as pd
from scipy import stats

def descriptive_stats(data, name="Data"):
    """Compute all the descriptive statistics covered in this lesson."""

    result = {
        'Count': len(data),
        'Mean': np.mean(data),
        'Median': np.median(data),
        'Mode': stats.mode(data, keepdims=True).mode[0],
        'Min': np.min(data),
        'Max': np.max(data),
        'Range': np.ptp(data),
        'Variance (pop)': np.var(data, ddof=0),
        'Variance (sample)': np.var(data, ddof=1),
        'Std (pop)': np.std(data, ddof=0),
        'Std (sample)': np.std(data, ddof=1),
        'CV (%)': (np.std(data, ddof=0) / np.mean(data)) * 100  # population CV
    }

    print(f"\n=== {name} ===")
    for key, value in result.items():
        print(f"{key}: {value:.4f}" if isinstance(value, float) else f"{key}: {value}")

    return result

# Example usage
data = [2, 4, 4, 4, 5, 5, 7, 9]
descriptive_stats(data, "Example Data")

# Or use pandas (describe() reports the sample std, ddof=1)
df = pd.DataFrame({'values': data})
print("\n=== Pandas describe() ===")
print(df.describe())

9. Practice Exercises

Exercise 1: Compute the measures

Given the data: [12, 15, 18, 22, 25, 28, 30, 35, 40, 100]

Compute the Range
Compute the Population Variance and Std
Compute the Sample Variance and Std
Compute the CV

Exercise 2: Compare two classes

Class A	Class B
Mean = 75	Mean = 80
Std = 10	Std = 5

Which class has more uniform scores?

Exercise 3: Z-score

Heights of male students: μ = 170 cm, σ = 6 cm

Compute the Z-score of a person who is 182 cm tall
How tall is a person with Z = -1.5?

Summary

Measure	Formula	Characteristics
Range	Max - Min	Simple, sensitive to outliers
Variance	$\frac{\sum(x_i-\mu)^2}{N}$ (population; divide by $n-1$ for a sample)	Squared units
Std	$\sqrt{Variance}$	Same units as the data
CV	$\frac{\sigma}{\mu} \times 100\%$	Relative comparison
Z-score	$\frac{x-\mu}{\sigma}$	Standardizes data

Key Takeaways

Range is quick to compute but unreliable, because it depends only on the two extreme values
Variance/Std are the most widely used measures of dispersion
CV is used to compare dispersion across datasets with different units or means
Z-score standardizes values so they can be compared
The Empirical Rule applies to normal distributions