📏 Confidence Intervals

Learning Objectives

After this lesson, you will be able to:

Explain what a confidence interval means
Compute CIs for a mean and a proportion
Decide when to use the Z versus the T distribution
Interpret a CI correctly

1. Introduction to Confidence Intervals

1.1 The Problem with a Point Estimate

A point estimate such as x̄ is a SINGLE value; it says nothing about how uncertain the estimate is.

Example: x̄ = 50

→ On its own, this does not tell us the range in which the population mean μ might lie!

1.2 Interval Estimate

A confidence interval provides a range of values that is likely to contain the population parameter.

$\text{CI} = \text{Point Estimate} \pm \text{Margin of Error}$

1.3 General Formula

$CI = \bar{x} \pm z^* \cdot SE$

where:

x̄ = sample mean
z* = critical value for the chosen confidence level
SE = standard error of the sample mean

2. Confidence Level

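2.1 Meaning

A 95% confidence level means:

If you draw many samples and compute a CI from each one,
about 95% of those CIs will contain the population mean μ

A WRONG interpretation

❌ "There is a 95% probability that μ lies in this CI"

μ is a fixed (though unknown) value, so it has no probability attached to it. It is the CI that varies from sample to sample.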

2.2 Critical Values

Confidence Level	z*
90%	1.645
95%	1.96
99%	2.576

Python

from scipy import stats

# Critical values
for conf in [0.90, 0.95, 0.99]:
    z_star = stats.norm.ppf((1 + conf) / 2)
    print(f"{conf*100:.0f}% CI: z* = {z_star:.3f}")

2.3 Visualization

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulate many CIs drawn from the same population
np.random.seed(42)
population_mean = 100
population_std = 15
n = 30
n_samples = 100

z_star = stats.norm.ppf(0.975)     # 95% critical value (about 1.96)
se = population_std / np.sqrt(n)   # σ is known, so SE is the same for every sample

fig, ax = plt.subplots(figsize=(12, 8))

contains_mean = 0
for i in range(n_samples):
    sample = np.random.normal(population_mean, population_std, n)
    x_bar = np.mean(sample)
    ci_lower = x_bar - z_star * se
    ci_upper = x_bar + z_star * se

    # Check whether this CI contains μ
    if ci_lower <= population_mean <= ci_upper:
        color = 'blue'
        contains_mean += 1
    else:
        color = 'red'

    ax.plot([ci_lower, ci_upper], [i, i], color=color, linewidth=1)
    ax.plot(x_bar, i, 'o', color=color, markersize=3)

# Convert the count to a percentage so the label stays correct if n_samples changes
pct = 100 * contains_mean / n_samples

ax.axvline(population_mean, color='green', linestyle='--', linewidth=2, label=f'μ = {population_mean}')
ax.set_xlabel('Value')
ax.set_ylabel('Sample #')
ax.set_title(f'{n_samples} Confidence Intervals (95% CI)\n{pct:.0f}% contain the true mean')
ax.legend()
plt.show()

print(f"Percentage containing μ: {pct:.0f}%")
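
With only 100 intervals, the observed share can easily land a few points above or below 95%. As a rough check on the long-run claim, the sketch below repeats the same experiment 10,000 times without plotting. The population settings match the plot above; the seed and the number of repetitions are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)      # arbitrary seed, for reproducibility only
mu, sigma, n = 100, 15, 30          # same population settings as the plot
n_sims = 10_000                     # assumed number of repeated samples

z_star = stats.norm.ppf(0.975)
me = z_star * sigma / np.sqrt(n)    # σ known, so the margin of error is constant

# One row per simulated sample; compute every sample mean at once
samples = rng.normal(mu, sigma, size=(n_sims, n))
x_bars = samples.mean(axis=1)

# A CI covers μ when μ lies between its lower and upper bounds
covered = (x_bars - me <= mu) & (mu <= x_bars + me)
coverage = covered.mean()
print(f"Empirical coverage over {n_sims} samples: {coverage:.3f}")
```

The printed share should sit close to 0.95, and it gets closer as `n_sims` grows.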

3. CI for Mean (σ known) - Z-interval

3.1 Formula

$CI = \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$

3.2 Conditions

σ is known (rare in practice)
n ≥ 30 or the population is Normal

3.3 Example

Student heights: σ = 8 cm. A sample of n = 64 gives x̄ = 170 cm. Compute the 95% CI.

$CI = 170 \pm 1.96 \cdot \frac{8}{\sqrt{64}} = 170 \pm 1.96 = [168.04, 171.96]$

Python

from scipy import stats
import numpy as np

x_bar = 170        # sample mean (cm)
sigma = 8          # known population standard deviation (cm)
n = 64
confidence = 0.95

z_star = stats.norm.ppf((1 + confidence) / 2)   # two-sided critical value
se = sigma / np.sqrt(n)                         # standard error of the mean
margin_error = z_star * se

ci_lower = x_bar - margin_error
ci_upper = x_bar + margin_error

print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Margin of Error: {margin_error:.2f}")
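
As a cross-check on the manual calculation, the same interval can be obtained from `scipy.stats.norm.interval`, which takes the confidence level plus a location and scale. This is a minimal sketch using the numbers from the example; the confidence level is passed positionally because its keyword name has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

x_bar, sigma, n = 170, 8, 64
se = sigma / np.sqrt(n)   # standard error of the mean

# loc is the point estimate, scale is the standard error
ci_lower, ci_upper = stats.norm.interval(0.95, loc=x_bar, scale=se)
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
```

Rounded to two decimals, this gives the same [168.04, 171.96] as the hand calculation.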

4. CI for Mean (σ unknown) - T-interval

4.1 T-Distribution

When σ is unknown, use the sample standard deviation (s) in its place and take the critical value from the t-distribution.

$CI = \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$

4.2 Degrees of Freedom

$df = n - 1$
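
To see how the degrees of freedom affect the critical value, the sketch below prints the 95% t* for a few sample sizes next to z*. The sample sizes are arbitrary and chosen only for illustration.

```python
from scipy import stats

confidence = 0.95
z_star = stats.norm.ppf((1 + confidence) / 2)

t_stars = {}
for n in [5, 10, 30, 100, 1000]:   # illustrative sample sizes
    df = n - 1
    t_stars[n] = stats.t.ppf((1 + confidence) / 2, df)
    print(f"n = {n:>4}, df = {df:>3}: t* = {t_stars[n]:.3f}")

print(f"z* = {z_star:.3f}")
```

t* is always larger than z* and shrinks toward it as df grows, so the t-interval is noticeably wider than the z-interval only for small samples.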

4.3 T vs Z Comparison

Property	Z	T
σ	Known	Unknown
Shape	Fixed	Depends on df
Tails	Thinner	Thicker (more conservative)
df → ∞	-	T → Z

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-4, 4, 1000)

plt.figure(figsize=(10, 5))
plt.plot(x, stats.norm.pdf(x), label='Z (Standard Normal)', linewidth=2)
# Lower df gives heavier tails; the t curve approaches Z as df grows
for df in [3, 10, 30]:
    plt.plot(x, stats.t.pdf(x, df), label=f't (df={df})', linestyle='--')
plt.title('T-distribution vs Standard Normal')
plt.legend()
plt.grid(True)
plt.show()
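
The heavier tails matter in practice. The following sketch, under assumed settings (Normal population, n = 5, 20,000 repetitions, arbitrary seed), compares how often an interval built with z* and one built with t* actually cover μ when s is used in place of σ.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)               # arbitrary seed
mu, sigma, n, n_sims = 100, 15, 5, 20_000    # small n makes the gap visible

samples = rng.normal(mu, sigma, size=(n_sims, n))
x_bars = samples.mean(axis=1)
se = samples.std(axis=1, ddof=1) / np.sqrt(n)   # SE estimated from each sample

z_star = stats.norm.ppf(0.975)
t_star = stats.t.ppf(0.975, df=n - 1)

# An interval covers μ when |x̄ - μ| is within its margin of error
cover_z = np.mean(np.abs(x_bars - mu) <= z_star * se)
cover_t = np.mean(np.abs(x_bars - mu) <= t_star * se)

print(f"Coverage using z* with s: {cover_z:.3f}")
print(f"Coverage using t* with s: {cover_t:.3f}")
```

In this setup the z-based interval covers μ noticeably less often than the nominal 95%, while the t-based interval stays close to it.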

4.4 Example

A sample of 16 students: x̄ = 75 points, s = 10. Compute the 95% CI.

Python

from scipy import stats
import numpy as np

x_bar = 75         # sample mean (points)
s = 10             # sample standard deviation
n = 16
confidence = 0.95
df = n - 1         # degrees of freedom

t_star = stats.t.ppf((1 + confidence) / 2, df)   # two-sided critical value
se = s / np.sqrt(n)
margin_error = t_star * se

ci_lower = x_bar - margin_error
ci_upper = x_bar + margin_error

print(f"t* (df={df}): {t_star:.3f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")

5. CI for Proportion

5.1 Formula

$CI = \hat{p} \pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

5.2 Conditions

n × p̂ ≥ 10
n × (1 - p̂) ≥ 10
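
These two conditions are easy to check before computing the interval. The helper below is a hypothetical sketch (the name `normal_approx_ok` and its `threshold` parameter are not from any library); it relies on the fact that n × p̂ is simply the number of successes and n × (1 - p̂) the number of failures.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Return True if both n*p_hat and n*(1 - p_hat) reach the threshold.

    Hypothetical helper for illustration: n*p_hat equals the success count
    and n*(1 - p_hat) equals the failure count.
    """
    failures = n - successes
    return successes >= threshold and failures >= threshold

# 280 successes and 220 failures: both counts are at least 10
print(normal_approx_ok(280, 500))

# 3 successes out of 40: too few successes for the normal approximation
print(normal_approx_ok(3, 40))
```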

5.3 Example

In a survey of 500 people, 280 are in favor (56%). Compute the 95% CI.

$\hat{p} = \frac{280}{500} = 0.56$

$SE = \sqrt{\frac{0.56 \times 0.44}{500}} = 0.0222$

$CI = 0.56 \pm 1.96 \times 0.0222 = [0.516, 0.604]$

Python

from scipy import stats
import numpy as np

x = 280  # successes
n = 500
p_hat = x / n
confidence = 0.95

z_star = stats.norm.ppf((1 + confidence) / 2)
se = np.sqrt(p_hat * (1 - p_hat) / n)
margin_error = z_star * se

ci_lower = p_hat - margin_error
ci_upper = p_hat + margin_error

print(f"p̂ = {p_hat:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"As percentage: [{ci_lower*100:.1f}%, {ci_upper*100:.1f}%]")

6. Margin of Error and Sample Size

6.1 Margin of Error

$ME = z^* \cdot SE$

ME depends on:

Confidence level: higher → larger ME
Sample size: larger n → smaller ME
Variability: larger σ → larger ME

6.2 Trade-off

Factors that affect the width of a CI:

📈 Confidence ↑ → Width ↑
📊 Sample size ↑ → Width ↓
📉 Variability ↑ → Width ↑
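
The three effects can be checked numerically. The sketch below uses a hypothetical helper, `margin_of_error`, and made-up baseline values (σ = 15, n = 100, 95%), changing one factor at a time.

```python
import numpy as np
from scipy import stats

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error for a z-interval (hypothetical helper for illustration)."""
    z_star = stats.norm.ppf((1 + confidence) / 2)
    return z_star * sigma / np.sqrt(n)

base = margin_of_error(sigma=15, n=100, confidence=0.95)
higher_conf = margin_of_error(sigma=15, n=100, confidence=0.99)  # confidence up
larger_n = margin_of_error(sigma=15, n=400, confidence=0.95)     # sample size x4
more_var = margin_of_error(sigma=30, n=100, confidence=0.95)     # variability x2

print(f"Baseline ME:         {base:.2f}")
print(f"99% confidence:      {higher_conf:.2f}")
print(f"n = 400:             {larger_n:.2f}")
print(f"sigma = 30:          {more_var:.2f}")
```

Because ME is proportional to σ/√n, quadrupling n halves the margin of error, while doubling σ doubles it.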

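6.3 Sample Size for a Desired ME

For a mean: $n = \left(\frac{z^* \cdot \sigma}{ME}\right)^2$

For a proportion: $n = \frac{(z^*)^2 \cdot p(1-p)}{ME^2}$

If no prior estimate of p is available, p = 0.5 gives the largest (most conservative) n. Always round n up to the next whole number.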
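
Both formulas come from solving the margin-of-error expression for n. A short derivation for the mean case, followed by a worked instance with assumed values σ = 10, ME = 2 and z* ≈ 1.96:

$ME = z^* \cdot \frac{\sigma}{\sqrt{n}} \;\Rightarrow\; \sqrt{n} = \frac{z^* \cdot \sigma}{ME} \;\Rightarrow\; n = \left(\frac{z^* \cdot \sigma}{ME}\right)^2$

$n = \left(\frac{1.96 \times 10}{2}\right)^2 = 9.8^2 = 96.04 \;\Rightarrow\; n = 97 \text{ after rounding up}$

The proportion case follows the same steps with $\sqrt{p(1-p)}$ in place of σ.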

Python

from scipy import stats
import numpy as np

def required_sample_size_mean(sigma, margin_error, confidence=0.95):
    z_star = stats.norm.ppf((1 + confidence) / 2)
    n = (z_star * sigma / margin_error) ** 2
    return int(np.ceil(n))  # round up so the ME target is met

def required_sample_size_prop(margin_error, confidence=0.95, p=0.5):
    # p = 0.5 maximizes p * (1 - p), giving the most conservative n
    z_star = stats.norm.ppf((1 + confidence) / 2)
    n = (z_star ** 2 * p * (1 - p)) / margin_error ** 2
    return int(np.ceil(n))

# Examples
print("Sample size for mean (σ=10, ME=2, 95%):", required_sample_size_mean(10, 2))
print("Sample size for proportion (ME=3%, 95%):", required_sample_size_prop(0.03))

7. Python Functions for CIs

Python

import numpy as np
from scipy import stats

def ci_mean_z(data, confidence=0.95, sigma=None):
    """CI for the mean using z*. Pass the known σ; if it is omitted, the
    sample std is substituted, which is only a large-sample approximation."""
    n = len(data)
    x_bar = np.mean(data)
    if sigma is None:
        sigma = np.std(data, ddof=1)

    z_star = stats.norm.ppf((1 + confidence) / 2)
    se = sigma / np.sqrt(n)
    me = z_star * se

    return (x_bar - me, x_bar + me)

def ci_mean_t(data, confidence=0.95):
    """CI for mean with unknown σ (t-distribution)"""
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)   # ddof=1 gives the sample standard deviation
    df = n - 1

    t_star = stats.t.ppf((1 + confidence) / 2, df)
    se = s / np.sqrt(n)
    me = t_star * se

    return (x_bar - me, x_bar + me)

def ci_proportion(successes, n, confidence=0.95):
    """CI for proportion"""
    p_hat = successes / n
    z_star = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se

    return (p_hat - me, p_hat + me)

# Example usage
np.random.seed(42)  # fixed seed so the output is reproducible
sample = np.random.normal(100, 15, 50)

print("Z-interval:", ci_mean_z(sample, sigma=15))
print("T-interval:", ci_mean_t(sample))
print("Proportion CI:", ci_proportion(280, 500))

# Using scipy directly (df is derived from the sample instead of hard-coded)
print("\nUsing scipy.stats:")
print("T-interval:", stats.t.interval(0.95, df=len(sample) - 1, loc=np.mean(sample), scale=stats.sem(sample)))

8. Interpreting a CI Correctly

8.1 CORRECT Statements ✅

"We are 95% confident that the population mean lies in the interval [a, b]"

"If we drew many samples and computed a CI from each, about 95% of those CIs would contain the population mean"

8.2 WRONG Statements ❌

"There is a 95% probability that μ lies in this CI" (μ is fixed, so it has no probability attached to it)

"95% of the data lie in the CI" (the CI is for a parameter, not for individual data points)

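The second misconception is easy to test numerically. The sketch below, using simulated data with assumed settings (Normal(100, 15), 200 observations, arbitrary seed), computes a 95% t-interval for the mean and then counts how many individual observations fall inside it.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)        # arbitrary seed
data = rng.normal(100, 15, 200)       # simulated observations (assumed settings)
n = len(data)

# 95% t-interval for the mean
ci_lower, ci_upper = stats.t.interval(
    0.95, df=n - 1, loc=data.mean(), scale=stats.sem(data)
)

# Share of individual data points that fall inside the CI for the mean
share_inside = np.mean((data >= ci_lower) & (data <= ci_upper))

print(f"95% CI for the mean: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Share of observations inside the CI: {share_inside:.1%}")
```

Only a small fraction of the observations land inside the interval, because its width reflects the uncertainty of the mean (s/√n), not the spread of the data (s).
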
9. Practice Exercises

Exercise 1: Z-interval

σ = 12. Sample: n = 100, x̄ = 85.

Compute the 90% CI
Compute the 99% CI
Compare the widths

Exercise 2: T-interval

Sample: [78, 82, 85, 89, 74, 91, 80, 77, 86, 83]

Compute the 95% CI for the mean
How many additional observations are needed to halve the ME?

Exercise 3: Proportion CI

300 of 400 customers are satisfied:

Compute the 95% CI for the proportion
Does the CI contain 80%?

Summary

CI Type	Formula	When to use
Z (mean)	x̄ ± z* × σ/√n	σ known
T (mean)	x̄ ± t* × s/√n	σ unknown
Proportion	p̂ ± z* × √(p̂(1−p̂)/n)	Proportions, with n·p̂ ≥ 10 and n·(1−p̂) ≥ 10

Key Takeaways

CI = Point Estimate ± Margin of Error
Confidence level ≠ Probability of μ in CI
Use the t-distribution when σ is unknown
Width increases as confidence increases and decreases as n increases
Required sample size scales as 1/ME²: halving the margin of error takes about four times the sample size