Continuous Distributions | MinAI Learning

🎯 Learning Objectives


After this lesson, you will be able to:

✅ Understand the Normal Distribution and its properties

✅ Use the Z-table and the Standard Normal distribution

✅ Apply the Central Limit Theorem

✅ Understand the Exponential Distribution

Duration: 1.5 hours | Difficulty: Intermediate | Prerequisite: Lesson 08

Task 0

📖 Key Terms


Term	Vietnamese	Description
PDF	Hàm mật độ xác suất	Probability density function f(x); the total area under the curve equals 1
Normal Distribution	Phân phối chuẩn	Bell-shaped, symmetric about the mean
Standard Normal	Chuẩn tắc	N(0,1)
Z-score	Điểm chuẩn	(X-μ)/σ
CLT	Định lý giới hạn trung tâm	Sample mean → Normal as n grows
Standard Error	Sai số chuẩn	σ/√n
Exponential	Phân phối mũ	Waiting times, memoryless

Checkpoint

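CLT: Whatever the population's distribution (provided its variance is finite), the distribution of the sample mean approaches Normal as n grows. n ≥ 30 is a common rule of thumb, not a strict threshold.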

Task 1

📊 Continuous Random Variable


1.1 Difference from Discrete

Discrete	Continuous
P(X = x) can be positive	P(X = x) = 0 for every x
PMF: P(X = x)	PDF: f(x)
$\sum P(X=x) = 1$	$\int f(x)dx = 1$

1.2 Probability Density Function (PDF)

$P(a \leq X \leq b) = \int_a^b f(x) dx$

Important note

For a continuous variable, P(X = x) = 0 for every specific x. Probabilities are computed only over intervals.
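To make the "area under the curve" idea concrete, here is a minimal sketch that integrates a density numerically. The density f(x) = 2x on [0, 1] and the helper names `pdf` and `integrate` are illustrative assumptions, not part of the lesson. The sketch uses a plain midpoint rule from the standard library rather than a dedicated integration routine.

```python
def pdf(x):
    """Illustrative density (an assumption for this sketch): f(x) = 2x on [0, 1]."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    if a == b:
        return 0.0  # a zero-width interval has no area
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


p_total = integrate(pdf, 0.0, 1.0)     # total area under the density
p_interval = integrate(pdf, 0.2, 0.5)  # P(0.2 <= X <= 0.5)
p_point = integrate(pdf, 0.5, 0.5)     # P(X = 0.5)

print(f"Total area         = {p_total:.4f}")
print(f"P(0.2 <= X <= 0.5) = {p_interval:.4f}")
print(f"P(X = 0.5)         = {p_point:.4f}")
```

The interval probability matches the exact integral 0.5² − 0.2² = 0.21, while the single point contributes no area at all.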

Task 2

📀 Normal Distribution


2.1 PDF Formula

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Notation: $X \sim N(\mu, \sigma^2)$
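The formula can be checked by coding it term by term and comparing against a library implementation. The sketch below is one way to do that. The function name `normal_pdf` and the values μ = 100, σ = 15 are chosen only for illustration, and it uses the standard library's `statistics.NormalDist` as the reference rather than scipy.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Evaluate the Normal density directly from the formula."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


mu, sigma = 100.0, 15.0  # illustrative parameters
reference = NormalDist(mu, sigma)
points = [70.0, 85.0, 100.0, 115.0, 130.0]

for x in points:
    by_hand = normal_pdf(x, mu, sigma)
    print(f"x = {x:5.1f}  formula = {by_hand:.6f}  NormalDist = {reference.pdf(x):.6f}")
```

Note that the density is largest at x = μ and takes equal values at points equally far from μ on either side, which is the symmetry of the bell curve.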

2.2 Properties

[Diagram: Properties of the Normal Distribution]

2.3 Parameters

Parameter	Effect
μ (mean)	Location of the center
σ (std)	Width / spread

2.4 Code Example

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Compare several normal distributions side by side
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vary μ (σ fixed at 2): the curve shifts left or right
x = np.linspace(-10, 15, 1000)
for mu in [0, 3, 6]:
    axes[0].plot(x, stats.norm.pdf(x, mu, 2), label=f'μ={mu}, σ=2')
axes[0].set_title('Effect of μ (mean)')
axes[0].legend()
axes[0].grid(True)

# Vary σ (μ fixed at 0): the curve gets wider and flatter
for sigma in [1, 2, 3]:
    axes[1].plot(x, stats.norm.pdf(x, 0, sigma), label=f'μ=0, σ={sigma}')
axes[1].set_title('Effect of σ (std)')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

Task 3

🎯 Standard Normal Distribution


3.1 Definition

The Standard Normal has $\mu = 0$ and $\sigma = 1$:

$Z \sim N(0, 1)$

3.2 Standardization

Convert any Normal variable to the Standard Normal:

$Z = \frac{X - \mu}{\sigma}$
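The practical consequence of standardization is that a probability about X can be read off the Standard Normal. Here is a small sketch under assumed numbers: the exam-score distribution N(70, 10²) and the cutoff 85 are hypothetical, and `statistics.NormalDist` from the standard library stands in for a Z-table.

```python
from statistics import NormalDist

mu, sigma = 70.0, 10.0  # hypothetical exam scores, X ~ N(70, 10²)
x = 85.0

# Standardize: how many standard deviations is x above the mean?
z = (x - mu) / sigma

# The same probability, computed on the original scale and on the Z scale
p_original = NormalDist(mu, sigma).cdf(x)
p_standard = NormalDist(0.0, 1.0).cdf(z)

print(f"z = {z}")
print(f"P(X <= {x}) = {p_original:.4f}")
print(f"P(Z <= {z}) = {p_standard:.4f}")
```

Both probabilities agree, so a single table for N(0, 1) is enough for every Normal distribution.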

3.3 Using the Z-table

The Z-table gives $P(Z \leq z)$, i.e. CDF(z).

Z	P(Z ≤ z)
-2	0.0228
-1	0.1587
0	0.5000
1	0.8413
2	0.9772
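The table entries can be reproduced programmatically. The following sketch does so with `statistics.NormalDist`. This is an assumption about tooling, since a printed Z-table or `scipy.stats.norm.cdf` gives the same numbers.

```python
from statistics import NormalDist

standard_normal = NormalDist(0.0, 1.0)
z_values = [-2, -1, 0, 1, 2]

# Build the table: z -> P(Z <= z), rounded to 4 decimals like a printed Z-table
table = {z: round(standard_normal.cdf(z), 4) for z in z_values}

print("Z\tP(Z <= z)")
for z, p in table.items():
    print(f"{z}\t{p:.4f}")
```

By symmetry, P(Z ≤ −z) = 1 − P(Z ≤ z), which is why many printed tables list only non-negative z.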

3.4 Worked Example

IQ scores ~ N(100, 15²). What is P(IQ > 130)?

Step 1: Standardize: $Z = \frac{130 - 100}{15} = 2$

Step 2: Look up the table: $P(Z > 2) = 1 - P(Z \leq 2) = 1 - 0.9772 = 0.0228$

→ About 2.28% of people have an IQ above 130

3.5 Python Code

Python

from scipy import stats

# IQ ~ N(100, 15²); scipy takes the standard deviation (15), not the variance
mu, sigma = 100, 15
normal = stats.norm(mu, sigma)

# P(IQ > 130)
P_greater_130 = 1 - normal.cdf(130)
print(f"P(IQ > 130) = {P_greater_130:.4f}")

# P(85 < IQ < 115)
P_between = normal.cdf(115) - normal.cdf(85)
print(f"P(85 < IQ < 115) = {P_between:.4f}")

# Percentile: what IQ marks the top 5%?
iq_top5 = normal.ppf(0.95)
print(f"Top 5% IQ: {iq_top5:.1f}")

# Same probability via the Standard Normal
Z = (130 - mu) / sigma
P_z = 1 - stats.norm.cdf(Z)
print(f"Using Z-score: P(Z > {Z}) = {P_z:.4f}")

3.6 Computing Interval Probabilities

Python

from scipy import stats

norm = stats.norm(0, 1)  # Standard Normal

# P(Z < 1.5)
print(f"P(Z < 1.5) = {norm.cdf(1.5):.4f}")

# P(Z > -1)
print(f"P(Z > -1) = {1 - norm.cdf(-1):.4f}")

# P(-1 < Z < 1)
print(f"P(-1 < Z < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")

# Find z such that P(Z < z) = 0.95
z_95 = norm.ppf(0.95)
print(f"z for P(Z < z) = 0.95: {z_95:.4f}")

Checkpoint

IQ ~ N(100, 15²). P(IQ > 130) = ? Z = (130-100)/15 = 2 → P ≈ 2.28%

Task 4

🔔 Empirical Rule (68-95-99.7)


4.1 The Rule

Interval	% of data
μ ± 1σ	68.27%
μ ± 2σ	95.45%
μ ± 3σ	99.73%

4.2 Code Example

Python

from scipy import stats

norm = stats.norm(0, 1)

# Verify the Empirical Rule on the Standard Normal
print("=== Empirical Rule ===")
print(f"P(-1 < Z < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")  # ~0.6827
print(f"P(-2 < Z < 2) = {norm.cdf(2) - norm.cdf(-2):.4f}")  # ~0.9545
print(f"P(-3 < Z < 3) = {norm.cdf(3) - norm.cdf(-3):.4f}")  # ~0.9973

Task 5

⭐ Central Limit Theorem (CLT)


5.1 Statement

Central Limit Theorem

If random samples of size n are drawn from any population with mean μ and finite standard deviation σ, then for sufficiently large n (n ≥ 30 is a common rule of thumb), the distribution of the sample mean $\bar{X}$ is approximately Normal:

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

5.2 Standard Error

$SE = \frac{\sigma}{\sqrt{n}}$

SE is the standard deviation of the sample mean.
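Because n sits under a square root, the standard error shrinks slowly: quadrupling the sample size only halves it. A minimal sketch, assuming a hypothetical population standard deviation σ = 12:

```python
import math

sigma = 12.0  # hypothetical population standard deviation
sample_sizes = [4, 16, 64, 256]  # each size is 4x the previous one

# SE = sigma / sqrt(n) for each sample size
standard_errors = {n: sigma / math.sqrt(n) for n in sample_sizes}

for n, se in standard_errors.items():
    print(f"n = {n:3d}  SE = {se:.2f}")
```

Each fourfold increase in n cuts the SE in half, so gaining one more digit of precision in the sample mean costs roughly 100 times more data.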

5.3 Interpretation

[Diagram: Central Limit Theorem] 📊 Any distribution → 📈 take samples → sample means → 🌟 Normal distribution (n large), with 🎯 mean = μ and 📏 SE = σ/√n

5.4 CLT Code Example

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# A non-Normal population (Uniform distribution on [0, 10])
population = np.random.uniform(0, 10, 100000)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Population distribution
axes[0, 0].hist(population, bins=50, density=True, alpha=0.7)
axes[0, 0].set_title('Population (Uniform)')

# Sample means for different sample sizes n
sample_sizes = [5, 10, 30, 50, 100]

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]

    row, col = (i+1) // 3, (i+1) % 3
    axes[row, col].hist(sample_means, bins=30, density=True, alpha=0.7)
    axes[row, col].set_title(f'Sample Means (n={n})')

    # Overlay the theoretical normal: mean 5, SE = (10/√12)/√n for Uniform(0, 10)
    x = np.linspace(min(sample_means), max(sample_means), 100)
    theoretical = stats.norm.pdf(x, 5, 10/np.sqrt(12)/np.sqrt(n))
    axes[row, col].plot(x, theoretical, 'r-', linewidth=2)

plt.tight_layout()
plt.show()

# Verify CLT
print("=== CLT Verification ===")
print(f"Population mean: {np.mean(population):.4f}")
print(f"Population std: {np.std(population):.4f}")

for n in [30, 100]:
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(10000)]
    print(f"\nn = {n}:")
    print(f"Mean of sample means: {np.mean(sample_means):.4f}")
    print(f"SE (theoretical): {np.std(population)/np.sqrt(n):.4f}")
    print(f"SE (empirical): {np.std(sample_means):.4f}")

5.5 Applying the CLT

Example: Waiting times have mean μ = 10 minutes and standard deviation σ = 4 minutes. For a sample of 64 customers, the sample mean is approximately:

$\bar{X} \sim N\left(10, \frac{16}{64}\right) = N(10, 0.25)$

SE = $\frac{4}{\sqrt{64}}$ = 0.5

Python

import numpy as np
from scipy import stats

mu, sigma = 10, 4
n = 64
SE = sigma / np.sqrt(n)

# P(sample mean > 11 minutes)?
P = 1 - stats.norm.cdf(11, mu, SE)
print(f"P(X̄ > 11) = {P:.4f}")

Checkpoint

The CLT is the foundation of Confidence Intervals and Hypothesis Testing. Do you see why?

Task 6

⏰ Exponential Distribution


6.1 Definition

Models the waiting time between events in a Poisson process.

$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$

6.2 Properties

Statistic	Formula
E(X)	1/λ
Var(X)	1/λ²
CDF	$1 - e^{-\lambda x}$ for $x \geq 0$
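These formulas can be sanity-checked by simulation. The sketch below draws exponential samples with the standard library's `random.expovariate` and compares the sample statistics with the table. The rate λ = 2, the seed, and the sample size are arbitrary choices for illustration.

```python
import math
import random

lam = 2.0  # illustrative rate, so E(X) = 0.5 and Var(X) = 0.25
random.seed(0)  # fixed seed so the run is repeatable
samples = [random.expovariate(lam) for _ in range(100_000)]

# Sample mean and (population-style) sample variance
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)

# Empirical vs. theoretical CDF at one point
x = 1.0
empirical_cdf = sum(s <= x for s in samples) / len(samples)
theoretical_cdf = 1.0 - math.exp(-lam * x)

print(f"mean: sample = {sample_mean:.4f}, theory = {1 / lam:.4f}")
print(f"var:  sample = {sample_var:.4f}, theory = {1 / lam ** 2:.4f}")
print(f"CDF({x}): sample = {empirical_cdf:.4f}, theory = {theoretical_cdf:.4f}")
```

The simulated values should land close to the theoretical ones, with small differences due to sampling noise.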

6.3 Memoryless Property

$P(X > s + t | X > s) = P(X > t)$

The probability of waiting an additional time t does not depend on the time s already waited.
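The property follows from the survival function $P(X > x) = e^{-\lambda x}$: the conditional probability is $e^{-\lambda(s+t)} / e^{-\lambda s} = e^{-\lambda t}$. Here is a short numerical check, both from the formula and by simulation. The values λ = 0.5, s = 4, t = 3 and the helper name `survival` are assumptions made for this sketch.

```python
import math
import random

lam = 0.5        # illustrative rate
s, t = 4.0, 3.0  # already waited s, asking about t more


def survival(x):
    """P(X > x) for an Exponential(lam) variable."""
    return math.exp(-lam * x)


# Analytic check: P(X > s + t | X > s) versus P(X > t)
conditional = survival(s + t) / survival(s)
unconditional = survival(t)

# Simulation check: among waits longer than s, how many exceed s + t?
random.seed(1)
waits = [random.expovariate(lam) for _ in range(200_000)]
already_waited = [w for w in waits if w > s]
simulated = sum(w > s + t for w in already_waited) / len(already_waited)

print(f"P(X > s+t | X > s) = {conditional:.4f}")
print(f"P(X > t)           = {unconditional:.4f}")
print(f"Simulated          = {simulated:.4f}")
```

All three numbers should agree up to simulation noise: having already waited 4 units does not change the chance of waiting 3 more.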

6.4 Python Code

Python

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# λ = 0.5 (on average 2 minutes per event)
lam = 0.5
exp = stats.expon(scale=1/lam)  # scipy uses scale = 1/λ

# P(X < 3)
print(f"P(X < 3) = {exp.cdf(3):.4f}")

# P(X > 5)
print(f"P(X > 5) = {1 - exp.cdf(5):.4f}")

# E(X) and Var(X)
print(f"E(X) = {exp.mean():.4f}")  # 2
print(f"Var(X) = {exp.var():.4f}")  # 4

# Visualization
x = np.linspace(0, 10, 1000)
plt.figure(figsize=(10, 5))
plt.plot(x, exp.pdf(x), label='PDF')
plt.fill_between(x, exp.pdf(x), where=(x < 3), alpha=0.3, label='P(X < 3)')
plt.xlabel('Time')
plt.ylabel('Density')
plt.title(f'Exponential Distribution (λ={lam})')
plt.legend()
plt.grid(True)
plt.show()

Task 7

🔍 Checking for Normality


7.1 Visual Methods

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate data
np.random.seed(42)
normal_data = np.random.normal(50, 10, 1000)
skewed_data = np.random.exponential(10, 1000)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram + Normal curve
axes[0, 0].hist(normal_data, bins=30, density=True, alpha=0.7)
x = np.linspace(10, 90, 100)
axes[0, 0].plot(x, stats.norm.pdf(x, 50, 10), 'r-', linewidth=2)
axes[0, 0].set_title('Normal Data: Histogram')

axes[0, 1].hist(skewed_data, bins=30, density=True, alpha=0.7)
axes[0, 1].set_title('Skewed Data: Histogram')

# Q-Q Plot
stats.probplot(normal_data, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Normal Data: Q-Q Plot')

stats.probplot(skewed_data, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Skewed Data: Q-Q Plot')

plt.tight_layout()
plt.show()

7.2 Statistical Tests

Python

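import numpy as np
from scipy import stats

# Same data as in 7.1, regenerated so this block runs on its own
np.random.seed(42)
normal_data = np.random.normal(50, 10, 1000)

# Shapiro-Wilk Test (the p-value may be inaccurate for n > 5000)
stat, p_value = stats.shapiro(normal_data[:500])
print("Shapiro-Wilk Test:")
print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")
# p > 0.05 means normality is not rejected; it does not prove the data are Normal
print(f"Normality not rejected: {p_value > 0.05}")

# D'Agostino's K² Test
stat, p_value = stats.normaltest(normal_data)
print("\nD'Agostino's K² Test:")
print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")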

Task 8

🧩 Practice Exercises


Exercise 1: Standard Normal

P(Z < 1.64)?
P(Z > -0.5)?
P(-1.96 < Z < 1.96)?
Find z such that P(Z > z) = 0.10

Exercise 2: Normal Distribution

Heights of male students ~ N(170 cm, 6²):

P(taller than 180 cm)?
P(height between 165 and 175 cm)?
What height marks the top 10%?

Exercise 3: CLT

Order processing time: μ = 15 minutes, σ = 5 minutes. For a sample of 100 orders:

SE = ?
P(mean processing time > 16 minutes)?

Task 9

📝 Summary


Distribution	PDF/Properties	E(X)	Var(X)
Normal	Bell curve, symmetric	μ	σ²
Standard Normal	N(0,1)	0	1
Exponential	Memoryless	1/λ	1/λ²

Self-check Questions

Why is the Normal distribution considered the most important distribution in statistics?
What does the Central Limit Theorem (CLT) state, and what is its practical significance?
How does Standard Error (SE) differ from Standard Deviation (SD)?
The Exponential distribution has the "memoryless" property. What does this mean?

Key Takeaways

The Normal is the most important distribution in statistics
The Z-score standardizes a value to the Standard Normal scale
CLT: the sample mean → Normal as n grows
SE = σ/√n decreases as n increases
The Exponential models waiting times

🎉 Well done! You have completed the Continuous Distributions lesson!

Next: We will learn about Sampling: how to choose representative samples and make accurate estimates.

Task 10