Confidence Intervals | MinAI Learning

🎯 Learning Objectives

After this lesson, you will be able to:

✅ Understand what a confidence interval means

✅ Compute CIs for a mean and a proportion

✅ Decide when to use the Z versus the T distribution

✅ Interpret a CI correctly

Duration: 1 hour | Difficulty: Intermediate | Prerequisites: Lessons 09-10

Task 0

📖 Glossary of Key Terms

Term	Vietnamese	Description
Confidence Interval	Khoảng tin cậy	A range of plausible values for a parameter
Confidence Level	Mức tin cậy	90%, 95%, 99%
Margin of Error	Biên sai số	z* × SE
Critical Value	Giá trị tới hạn	z* or t*
Point Estimate	Ước lượng điểm	x̄ or p̂
T-distribution	Phân phối T	Used when σ is unknown
Degrees of Freedom	Bậc tự do	df = n - 1

Checkpoint

A 95% CI does NOT mean "there is a 95% probability that μ lies in this interval". This is a common misconception!

Task 1

📏 Introduction to Confidence Intervals

1.1 The Problem with a Point Estimate

A point estimate such as x̄ is only ONE value; it says nothing about how uncertain the estimate is.

Example: x̄ = 50

→ On its own, this does not tell us what range the population mean μ could plausibly lie in!

1.2 Interval Estimate

A confidence interval gives a range of plausible values for the population parameter.

$\text{CI} = \text{Point Estimate} \pm \text{Margin of Error}$

1.3 General Formula

$CI = \bar{x} \pm z^* \cdot SE$

Where:

x̄ = sample mean
z* = critical value
SE = standard error
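
As a minimal sketch of the general formula, the helper below turns a point estimate, a critical value, and a standard error into an interval. The function name `confidence_interval` and the standard error of 2.0 are assumptions made for illustration; they do not come from the lesson.

```python
def confidence_interval(point_estimate, critical_value, standard_error):
    """Return (lower, upper) for point_estimate ± critical_value * standard_error."""
    margin_of_error = critical_value * standard_error
    return point_estimate - margin_of_error, point_estimate + margin_of_error


# x̄ = 50 as in the example above; SE = 2.0 is an assumed value for illustration
lower, upper = confidence_interval(50, 1.96, 2.0)
print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
```

The same helper works for any interval of the form "estimate ± critical value × SE"; only the critical value and the SE formula change between the Z, T, and proportion cases.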

Task 2

📊 Confidence Level

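2.1 Meaning

A 95% confidence level means:

If we draw many samples and compute a CI from each one,
about 95% of those CIs will contain the population mean μ.

Incorrect interpretation

❌ "There is a 95% probability that μ lies in this particular CI"

μ is a fixed value, not a random one. It is the interval that varies from sample to sample.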

2.2 Critical Values

Confidence Level	z*
90%	1.645
95%	1.96
99%	2.576

Python

from scipy import stats

# Two-sided critical values: z* leaves (1 - conf) / 2 in each tail
for conf in [0.90, 0.95, 0.99]:
    z_star = stats.norm.ppf((1 + conf) / 2)
    print(f"{conf*100:.0f}% CI: z* = {z_star:.3f}")

2.3 Visualization

Python

import numpy as np
import matplotlib.pyplot as plt

# Simulate many CIs, one per repeated sample
np.random.seed(42)
population_mean = 100
population_std = 15
n = 30
n_samples = 100

fig, ax = plt.subplots(figsize=(12, 8))

contains_mean = 0
for i in range(n_samples):
    sample = np.random.normal(population_mean, population_std, n)
    x_bar = np.mean(sample)
    se = population_std / np.sqrt(n)  # σ is known here, so this is a Z-interval
    ci_lower = x_bar - 1.96 * se
    ci_upper = x_bar + 1.96 * se

    # Check whether this CI contains μ
    if ci_lower <= population_mean <= ci_upper:
        color = 'blue'
        contains_mean += 1
    else:
        color = 'red'

    ax.plot([ci_lower, ci_upper], [i, i], color=color, linewidth=1)
    ax.plot(x_bar, i, 'o', color=color, markersize=3)

# Convert the count to a percentage so the label stays correct if n_samples changes
pct = 100 * contains_mean / n_samples

ax.axvline(population_mean, color='green', linestyle='--', linewidth=2, label=f'μ = {population_mean}')
ax.set_xlabel('Value')
ax.set_ylabel('Sample #')
ax.set_title(f'{n_samples} Confidence Intervals (95% CI)\n{pct:.0f}% contain the true mean')
ax.legend()
plt.show()

print(f"Percentage containing μ: {pct:.0f}%")
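
With only 100 intervals, the observed share can drift a few points away from 95%. The sketch below repeats the same experiment without plotting and with many more samples, to show the long-run coverage that the confidence level describes. The seed, the 10,000 repetitions, and the use of NumPy's `default_rng` generator are choices made for this illustration, not part of the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)
population_mean, population_std = 100, 15
n, n_samples = 30, 10_000

# Each row is one sample of size n
samples = rng.normal(population_mean, population_std, size=(n_samples, n))
x_bars = samples.mean(axis=1)

se = population_std / np.sqrt(n)  # σ is treated as known (Z-interval)
lower = x_bars - 1.96 * se
upper = x_bars + 1.96 * se

# Fraction of intervals that capture the true mean
coverage = np.mean((lower <= population_mean) & (population_mean <= upper))
print(f"Share of CIs containing μ: {coverage:.1%}")
```

The printed share should land close to 95%, which is a statement about the procedure over many samples rather than about any single interval.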

Task 3

📊 CI for Mean (σ known) - Z-interval

3.1 Formula

$CI = \bar{x} \pm z^* \cdot \frac{\sigma}{\sqrt{n}}$

3.2 Conditions

σ is known (rare in practice)
n ≥ 30 or the population is Normal (so the sampling distribution of x̄ is approximately Normal)

3.3 Example

Student heights: σ = 8 cm. A sample of n = 64 has x̄ = 170 cm. Compute the 95% CI.

$CI = 170 \pm 1.96 \cdot \frac{8}{\sqrt{64}} = 170 \pm 1.96 = [168.04, 171.96]$

Python

from scipy import stats
import numpy as np

x_bar = 170  # sample mean (cm)
sigma = 8    # known population standard deviation (cm)
n = 64
confidence = 0.95

z_star = stats.norm.ppf((1 + confidence) / 2)  # two-sided critical value
se = sigma / np.sqrt(n)                        # standard error of the mean
margin_error = z_star * se

ci_lower = x_bar - margin_error
ci_upper = x_bar + margin_error

print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Margin of Error: {margin_error:.2f}")

Task 4

📊 CI for Mean (σ unknown) - T-interval

4.1 T-Distribution

When σ is unknown, estimate it with the sample standard deviation s and take the critical value from the t-distribution.

$CI = \bar{x} \pm t^* \cdot \frac{s}{\sqrt{n}}$

4.2 Degrees of Freedom

$df = n - 1$

4.3 Comparing T and Z

Feature	Z	T
σ	Known	Unknown
Shape	Fixed	Depends on df
Tails	Thinner	Thicker (more conservative)
df → ∞	-	T → Z

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(-4, 4, 1000)

plt.figure(figsize=(10, 5))
plt.plot(x, stats.norm.pdf(x), label='Z (Standard Normal)', linewidth=2)
# Overlay t-distributions; the tails thin out as df grows
for df in [3, 10, 30]:
    plt.plot(x, stats.t.pdf(x, df), label=f't (df={df})', linestyle='--')
plt.title('T-distribution vs Standard Normal')
plt.legend()
plt.grid(True)
plt.show()
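
The same convergence can be read off the critical values directly. The sketch below prints t* at 95% confidence for a few degrees of freedom next to z*; the particular df values (including 1000) are picked only for illustration.

```python
from scipy import stats

confidence = 0.95
z_star = stats.norm.ppf((1 + confidence) / 2)
print(f"z* = {z_star:.3f}")

# t* is larger than z* for small df and shrinks toward z* as df grows
t_stars = {}
for df in [3, 10, 30, 1000]:
    t_stars[df] = stats.t.ppf((1 + confidence) / 2, df)
    print(f"df = {df:>4}: t* = {t_stars[df]:.3f}")
```

A larger critical value means a wider interval, which is how the T-interval accounts for the extra uncertainty of estimating σ from a small sample.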

4.4 Example

A sample of 16 students: x̄ = 75 points, s = 10. Compute the 95% CI.

Python

from scipy import stats
import numpy as np

x_bar = 75  # sample mean (points)
s = 10      # sample standard deviation
n = 16
confidence = 0.95
df = n - 1  # degrees of freedom

t_star = stats.t.ppf((1 + confidence) / 2, df)  # two-sided critical value
se = s / np.sqrt(n)                             # estimated standard error
margin_error = t_star * se

ci_lower = x_bar - margin_error
ci_upper = x_bar + margin_error

print(f"t* (df={df}): {t_star:.3f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
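
The same example can be worked by hand, assuming the t-table value t* ≈ 2.131 for df = 15 at 95% confidence:

$SE = \frac{s}{\sqrt{n}} = \frac{10}{\sqrt{16}} = 2.5$

$ME = t^* \cdot SE \approx 2.131 \times 2.5 \approx 5.33$

$CI = 75 \pm 5.33 = [69.67, 80.33]$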

Checkpoint

n = 25, x̄ = 80, s = 10. Compute the 95% CI using the t-distribution (df = 24).

Task 5

📊 CI for Proportion

5.1 Formula

$CI = \hat{p} \pm z^* \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$

5.2 Conditions

n × p̂ ≥ 10
n × (1 - p̂) ≥ 10
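
These two conditions can be checked before computing the interval. The sketch below uses a hypothetical helper named `normal_approx_ok` (not from the lesson); since n × p̂ is the number of successes and n × (1 - p̂) is the number of failures, it compares both counts against the threshold of 10.

```python
def normal_approx_ok(successes, n, threshold=10):
    """Check the conditions n*p̂ >= threshold and n*(1 - p̂) >= threshold."""
    failures = n - successes
    # n * p̂ equals the success count; n * (1 - p̂) equals the failure count
    return successes >= threshold and failures >= threshold


print(normal_approx_ok(280, 500))  # 280 successes and 220 failures
print(normal_approx_ok(3, 50))     # only 3 successes
```

When the check fails, the Normal approximation behind this formula is unreliable and the resulting interval should not be trusted.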

5.3 Example

In a survey of 500 people, 280 are in favor (56%). Compute the 95% CI.

$\hat{p} = \frac{280}{500} = 0.56$

$SE = \sqrt{\frac{0.56 \times 0.44}{500}} = 0.0222$

$CI = 0.56 \pm 1.96 \times 0.0222 = [0.516, 0.604]$

Python

from scipy import stats
import numpy as np

x = 280  # successes
n = 500
p_hat = x / n
confidence = 0.95

z_star = stats.norm.ppf((1 + confidence) / 2)  # two-sided critical value
se = np.sqrt(p_hat * (1 - p_hat) / n)          # standard error of p̂
margin_error = z_star * se

ci_lower = p_hat - margin_error
ci_upper = p_hat + margin_error

print(f"p̂ = {p_hat:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
print(f"As percentage: [{ci_lower*100:.1f}%, {ci_upper*100:.1f}%]")

Task 6

📏 Margin of Error and Sample Size

6.1 Margin of Error

$ME = z^* \cdot SE$

ME depends on:

Confidence level: higher confidence → larger ME
Sample size: larger n → smaller ME
Variability: larger σ → larger ME

6.2 Trade-off

Factors that affect CI width

📈 Confidence ↑ → Width ↑

📊 Sample size ↑ → Width ↓

📉 Variability ↑ → Width ↑
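
A small numerical sketch of these three effects, using a Z-interval for the mean. The helper name `ci_width` and the baseline values (σ = 15, n = 30) are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats


def ci_width(sigma, n, confidence=0.95):
    """Width of a Z-interval for the mean: 2 * z* * sigma / sqrt(n)."""
    z_star = stats.norm.ppf((1 + confidence) / 2)
    return 2 * z_star * sigma / np.sqrt(n)


# Change one factor at a time relative to the baseline
baseline = ci_width(sigma=15, n=30, confidence=0.95)
higher_confidence = ci_width(sigma=15, n=30, confidence=0.99)
larger_sample = ci_width(sigma=15, n=120, confidence=0.95)
more_variability = ci_width(sigma=30, n=30, confidence=0.95)

print(f"Baseline (σ=15, n=30, 95%): {baseline:.2f}")
print(f"Confidence 99%:             {higher_confidence:.2f}")
print(f"Sample size n=120:          {larger_sample:.2f}")
print(f"Variability σ=30:           {more_variability:.2f}")
```

Because the width is proportional to 1/√n, quadrupling the sample size (30 → 120) halves the width, while doubling σ doubles it.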

6.3 Sample Size for a Desired ME

For a mean: $n = \left(\frac{z^* \cdot \sigma}{ME}\right)^2$

For a proportion: $n = \frac{(z^*)^2 \cdot p(1-p)}{ME^2}$

Python

from scipy import stats
import numpy as np

def required_sample_size_mean(sigma, margin_error, confidence=0.95):
    """Smallest n whose Z-interval for the mean has ME <= margin_error."""
    z_star = stats.norm.ppf((1 + confidence) / 2)
    n = (z_star * sigma / margin_error) ** 2
    return int(np.ceil(n))  # always round up

def required_sample_size_prop(margin_error, confidence=0.95, p=0.5):
    """Smallest n for a proportion; p=0.5 is the most conservative choice."""
    z_star = stats.norm.ppf((1 + confidence) / 2)
    n = (z_star ** 2 * p * (1-p)) / margin_error ** 2
    return int(np.ceil(n))  # always round up

# Examples
print("Sample size for mean (σ=10, ME=2, 95%):", required_sample_size_mean(10, 2))
print("Sample size for proportion (ME=3%, 95%):", required_sample_size_prop(0.03))
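
The two examples can also be checked by hand, using the rounded value z* = 1.96 and always rounding n up:

$n = \left(\frac{1.96 \times 10}{2}\right)^2 = 9.8^2 = 96.04 \Rightarrow n = 97$

$n = \frac{1.96^2 \times 0.5 \times 0.5}{0.03^2} = \frac{0.9604}{0.0009} \approx 1067.1 \Rightarrow n = 1068$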

Task 7

🐍 Python Functions for CIs

Python

import numpy as np
from scipy import stats

def ci_mean_z(data, confidence=0.95, sigma=None):
    """Z-interval for the mean. Pass the known σ; if it is omitted, the
    sample std is used instead, which is only a large-sample approximation."""
    n = len(data)
    x_bar = np.mean(data)
    if sigma is None:
        sigma = np.std(data, ddof=1)

    z_star = stats.norm.ppf((1 + confidence) / 2)
    se = sigma / np.sqrt(n)
    me = z_star * se

    return (x_bar - me, x_bar + me)

def ci_mean_t(data, confidence=0.95):
    """CI for mean with unknown σ (t-distribution)"""
    n = len(data)
    x_bar = np.mean(data)
    s = np.std(data, ddof=1)  # ddof=1 gives the sample standard deviation
    df = n - 1

    t_star = stats.t.ppf((1 + confidence) / 2, df)
    se = s / np.sqrt(n)
    me = t_star * se

    return (x_bar - me, x_bar + me)

def ci_proportion(successes, n, confidence=0.95):
    """CI for proportion"""
    p_hat = successes / n
    z_star = stats.norm.ppf((1 + confidence) / 2)
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    me = z_star * se

    return (p_hat - me, p_hat + me)

# Example usage (seeded so the output is reproducible)
np.random.seed(42)
sample = np.random.normal(100, 15, 50)

print("Z-interval:", ci_mean_z(sample, sigma=15))
print("T-interval:", ci_mean_t(sample))
print("Proportion CI:", ci_proportion(280, 500))

# Using scipy directly; df is derived from the sample instead of hard-coded
print("\nUsing scipy.stats:")
print("T-interval:", stats.t.interval(0.95, df=len(sample) - 1, loc=np.mean(sample), scale=stats.sem(sample)))
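
As a sanity check, a hand-computed T-interval should agree with `stats.t.interval`. The sketch below compares the two on simulated data; the seed and the sample are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(100, 15, 50)  # simulated data, for illustration only

n = len(sample)
x_bar = np.mean(sample)
se = np.std(sample, ddof=1) / np.sqrt(n)  # same quantity as stats.sem(sample)
t_star = stats.t.ppf(0.975, n - 1)
manual_ci = (x_bar - t_star * se, x_bar + t_star * se)

scipy_ci = stats.t.interval(0.95, df=n - 1, loc=x_bar, scale=stats.sem(sample))

print("Manual:", manual_ci)
print("SciPy: ", scipy_ci)
print("Match: ", np.allclose(manual_ci, scipy_ci))
```

The two agree because `stats.sem` uses `ddof=1` by default, the same divisor as the sample standard deviation in the manual calculation.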

Task 8

✅ Interpreting a CI Correctly

8.1 Correct Statements ✅

"We are 95% confident that the population mean lies in the interval [a, b]"

"If we draw many samples and compute a CI from each, about 95% of those CIs will contain the population mean"

8.2 Incorrect Statements ❌

"There is a 95% probability that μ lies in this CI" (μ is fixed, not random)

"95% of the data lies within the CI" (a CI describes a parameter, not the individual data points)

Task 9

🧩 Practice Exercises

Exercise 1: Z-interval

σ = 12. A sample of n = 100 has x̄ = 85.

Compute the 90% CI
Compute the 99% CI
Compare their widths

Exercise 2: T-interval

Sample: [78, 82, 85, 89, 74, 91, 80, 77, 86, 83]

Compute the 95% CI for the mean
How many additional observations are needed to halve the ME?

Exercise 3: Proportion CI

300 of 400 customers are satisfied:

Compute the 95% CI for the proportion
Does the CI contain 80%?

Task 10

📝 Summary

CI Type	Formula	When to use
Z (mean)	x̄ ± z* × σ/√n	σ known
T (mean)	x̄ ± t* × s/√n	σ unknown
Proportion	p̂ ± z* × √(p̂(1 - p̂)/n)	Proportions

Self-Check Questions

What does a 95% confidence level mean, and what does it NOT mean?
When should you use the T-distribution instead of the Z-distribution to compute a confidence interval?
How can you narrow a confidence interval without lowering the confidence level?
How are margin of error and sample size related?

Key Takeaways

CI = Point Estimate ± Margin of Error
Confidence level ≠ Probability of μ in CI
Use the T-distribution when σ is unknown
Width grows as confidence increases and shrinks as n increases
Required sample size scales as 1/ME²: halving the margin of error takes about four times the sample size

🎉 Great job! You have completed the Confidence Intervals lesson!

Next: we will learn about Hypothesis Testing, a method for making decisions based on data.

Task 11