📈 Continuous Distributions

Learning objectives

After this lesson, you will be able to:

Understand the Normal distribution and its properties
Use the Z-table and the Standard Normal distribution
Apply the Central Limit Theorem
Understand the Exponential distribution

1. Continuous Random Variable

1.1 How it differs from Discrete

Discrete	Continuous
P(X = x) can be positive	P(X = x) = 0
PMF: P(X = x)	PDF: f(x)
$\sum_x P(X=x) = 1$	$\int_{-\infty}^{\infty} f(x)\,dx = 1$

1.2 Probability Density Function (PDF)

$P(a \leq X \leq b) = \int_a^b f(x) dx$

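Important note

For a continuous variable, P(X = x) = 0 for every specific value x, so probabilities are only computed over intervals. As a consequence, $P(a \leq X \leq b) = P(a < X < b)$: including or excluding the endpoints does not change the result.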
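The following is a minimal sketch of the two points above: an interval probability is the area under the PDF, and that area shrinks to 0 as the interval narrows to a single point. The Standard Normal is used only as a convenient example, and the `integrate` helper is a hypothetical trapezoidal-rule function written for this illustration; only the Python standard library is assumed.

```python
from statistics import NormalDist


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


dist = NormalDist(mu=0, sigma=1)  # illustrative choice of distribution

# P(a <= X <= b) as the area under the PDF, compared with CDF(b) - CDF(a)
area = integrate(dist.pdf, -1.0, 1.0)
exact = dist.cdf(1.0) - dist.cdf(-1.0)
print(f"integral of pdf on [-1, 1] = {area:.6f}")
print(f"cdf(1) - cdf(-1)           = {exact:.6f}")

# The probability of an interval centered at 0 shrinks with its width
for width in [1.0, 0.1, 0.001]:
    p = dist.cdf(width / 2) - dist.cdf(-width / 2)
    print(f"P(|X| <= {width / 2}) = {p:.6f}")
```

The numerical integral and the CDF difference should agree to several decimal places, which is why the rest of the lesson works with the CDF directly.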

2. Normal Distribution

2.1 Công thức PDF

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Notation: $X \sim N(\mu, \sigma^2)$, where the second parameter is the variance $\sigma^2$, not the standard deviation.
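As a quick check of the PDF formula, the sketch below codes it term by term and compares it with `statistics.NormalDist.pdf` from the Python standard library. The function name `normal_pdf` and the evaluation points are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def normal_pdf(x, mu, sigma):
    """Normal PDF written out directly from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


mu, sigma = 100, 15  # note: sigma is the standard deviation, not the variance
for x in [70, 100, 130]:
    manual = normal_pdf(x, mu, sigma)
    library = NormalDist(mu, sigma).pdf(x)
    print(f"x = {x:>3}: formula = {manual:.6f}, NormalDist = {library:.6f}")
```

The density is largest at x = μ, where it equals 1/(σ√(2π)), and takes equal values at points symmetric about μ (here 70 and 130).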

2.2 Properties

[Figure: properties of the Normal distribution (bell-shaped curve, symmetric about μ)]

2.3 Parameters

Parameter	Effect
μ (mean)	Location of the center
σ (std)	Width/spread

2.4 Illustrative code

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Compare several normal distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Vary μ (σ fixed at 2): the curve shifts left or right
x = np.linspace(-10, 15, 1000)
for mu in [0, 3, 6]:
    axes[0].plot(x, stats.norm.pdf(x, mu, 2), label=f'μ={mu}, σ=2')
axes[0].set_title('Effect of μ (mean)')
axes[0].legend()
axes[0].grid(True)

# Vary σ (μ fixed at 0): the curve becomes wider and flatter
for sigma in [1, 2, 3]:
    axes[1].plot(x, stats.norm.pdf(x, 0, sigma), label=f'μ=0, σ={sigma}')
axes[1].set_title('Effect of σ (std)')
axes[1].legend()
axes[1].grid(True)

plt.tight_layout()
plt.show()

3. Standard Normal Distribution

3.1 Definition

The Standard Normal distribution has $\mu = 0$ and $\sigma = 1$:

$Z \sim N(0, 1)$

3.2 Standardization

Convert any Normal variable into a Standard Normal one:

$Z = \frac{X - \mu}{\sigma}$
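A small sketch of why standardization is useful: the probability computed on the original scale equals the Standard Normal probability at the corresponding z-score. The values μ = 50, σ = 10 and x = 65 are made up for illustration, and only the Python standard library is assumed.

```python
from statistics import NormalDist

mu, sigma = 50, 10  # hypothetical parameters
x = 65

# Standardize: how many standard deviations x lies above the mean
z = (x - mu) / sigma

p_original = NormalDist(mu, sigma).cdf(x)  # P(X <= x) on the original scale
p_standard = NormalDist(0, 1).cdf(z)       # P(Z <= z) on the standard scale

print(f"z = {z}")
print(f"P(X <= {x}) = {p_original:.4f}")
print(f"P(Z <= {z}) = {p_standard:.4f}")
```

Because the two probabilities coincide, a single table for N(0, 1) is enough to answer questions about any Normal distribution.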

3.3 Using the Z-table

The Z-table gives $P(Z \leq z)$, the CDF of the Standard Normal evaluated at z:

Z	P(Z ≤ z)
-2	0.0228
-1	0.1587
0	0.5000
1	0.8413
2	0.9772
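The table rows above can be reproduced without a printed table. The sketch below uses the identity Φ(z) = ½(1 + erf(z/√2)) with `math.erf` from the standard library; the function name `phi` is an illustrative choice.

```python
import math


def phi(z):
    """Standard Normal CDF, P(Z <= z), computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


print("Z\tP(Z <= z)")
for z in [-2, -1, 0, 1, 2]:
    print(f"{z}\t{phi(z):.4f}")
```

Note the symmetry in the output: P(Z ≤ -z) = 1 - P(Z ≤ z), so a table covering only non-negative z is sufficient.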

3.4 Worked example

IQ scores ~ N(100, 15²). What is P(IQ > 130)?

Step 1: Standardize: $Z = \frac{130 - 100}{15} = 2$

Step 2: Look up the table: $P(Z > 2) = 1 - P(Z \leq 2) = 1 - 0.9772 = 0.0228$

→ About 2.28% of people have an IQ above 130

3.5 Python code

Python

from scipy import stats

# IQ ~ N(100, 15²); scipy takes the standard deviation (15), not the variance
mu, sigma = 100, 15
normal = stats.norm(mu, sigma)

# P(IQ > 130)
P_greater_130 = 1 - normal.cdf(130)
print(f"P(IQ > 130) = {P_greater_130:.4f}")

# P(85 < IQ < 115)
P_between = normal.cdf(115) - normal.cdf(85)
print(f"P(85 < IQ < 115) = {P_between:.4f}")

# Percentile: IQ threshold for the top 5% (the 95th percentile)
iq_top5 = normal.ppf(0.95)
print(f"Top 5% IQ: {iq_top5:.1f}")

# Same probability via the Standard Normal
Z = (130 - mu) / sigma
P_z = 1 - stats.norm.cdf(Z)
print(f"Using Z-score: P(Z > {Z}) = {P_z:.4f}")

3.6 Computing interval probabilities

Python

from scipy import stats

norm = stats.norm(0, 1)  # Standard Normal

# P(Z < 1.5)
print(f"P(Z < 1.5) = {norm.cdf(1.5):.4f}")

# P(Z > -1)
print(f"P(Z > -1) = {1 - norm.cdf(-1):.4f}")

# P(-1 < Z < 1)
print(f"P(-1 < Z < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")

# Find z such that P(Z < z) = 0.95
z_95 = norm.ppf(0.95)
print(f"z for P(Z < z) = 0.95: {z_95:.4f}")

4. Empirical Rule (68-95-99.7)

4.1 The rule

Interval	% of data
μ ± 1σ	68.27%
μ ± 2σ	95.45%
μ ± 3σ	99.73%

4.2 Illustrative code

Python

from scipy import stats

norm = stats.norm(0, 1)

# Verify Empirical Rule
print("=== Empirical Rule ===")
print(f"P(-1 < Z < 1) = {norm.cdf(1) - norm.cdf(-1):.4f}")  # ~0.6827
print(f"P(-2 < Z < 2) = {norm.cdf(2) - norm.cdf(-2):.4f}")  # ~0.9545
print(f"P(-3 < Z < 3) = {norm.cdf(3) - norm.cdf(-3):.4f}")  # ~0.9973

5. Central Limit Theorem (CLT)

5.1 Statement

Central Limit Theorem

If independent random samples of size n are drawn from any population with mean μ and finite standard deviation σ, then for sufficiently large n (n ≥ 30 is a common rule of thumb) the distribution of the sample mean $\bar{X}$ is approximately Normal:

$\bar{X} \overset{\text{approx.}}{\sim} N\left(\mu, \frac{\sigma^2}{n}\right)$

5.2 Standard Error

$SE = \frac{\sigma}{\sqrt{n}}$

SE is the standard deviation of the sample mean.
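To make the formula concrete, the sketch below compares σ/√n with the spread of simulated sample means. It assumes a hypothetical Uniform(0, 10) population (standard deviation 10/√12), 2000 repetitions per sample size, and a fixed seed; these choices are illustrative only, and the code uses just the standard library.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical population: Uniform(0, 10), with standard deviation 10/sqrt(12)
sigma = 10 / math.sqrt(12)


def sample_mean(n):
    """Mean of n independent draws from Uniform(0, 10)."""
    return sum(random.uniform(0, 10) for _ in range(n)) / n


results = {}
for n in [25, 100]:
    means = [sample_mean(n) for _ in range(2000)]
    se_theory = sigma / math.sqrt(n)         # SE from the formula
    se_empirical = statistics.stdev(means)   # spread of the simulated means
    results[n] = (se_theory, se_empirical)
    print(f"n = {n:>3}: SE (theory) = {se_theory:.4f}, "
          f"SE (simulated) = {se_empirical:.4f}")
```

Going from n = 25 to n = 100 multiplies the sample size by 4 and halves the theoretical SE, since SE scales with 1/√n.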

5.3 Interpretation

Central Limit Theorem: any distribution → take samples → sample means → approximately Normal distribution (n large), with mean = μ and SE = σ/√n

5.4 CLT illustration code

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)  # fixed seed for reproducible plots and numbers

# A population that is not Normal (Uniform distribution)
population = np.random.uniform(0, 10, 100000)

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

# Population distribution
axes[0, 0].hist(population, bins=50, density=True, alpha=0.7)
axes[0, 0].set_title('Population (Uniform)')

# Sample means for different sample sizes n
sample_sizes = [5, 10, 30, 50, 100]

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]

    # Panel 0 holds the population, so the i-th sample size goes in panel i+1
    row, col = (i+1) // 3, (i+1) % 3
    axes[row, col].hist(sample_means, bins=30, density=True, alpha=0.7)
    axes[row, col].set_title(f'Sample Means (n={n})')

    # Overlay theoretical normal: Uniform(0, 10) has mean 5 and std 10/sqrt(12)
    x = np.linspace(min(sample_means), max(sample_means), 100)
    theoretical = stats.norm.pdf(x, 5, 10/np.sqrt(12)/np.sqrt(n))
    axes[row, col].plot(x, theoretical, 'r-', linewidth=2)

plt.tight_layout()
plt.show()

# Verify CLT
print("=== CLT Verification ===")
print(f"Population mean: {np.mean(population):.4f}")
print(f"Population std: {np.std(population):.4f}")

for n in [30, 100]:
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(10000)]
    print(f"\nn = {n}:")
    print(f"Mean of sample means: {np.mean(sample_means):.4f}")
    print(f"SE (theoretical): {np.std(population)/np.sqrt(n):.4f}")
    print(f"SE (empirical): {np.std(sample_means):.4f}")

5.5 Applying the CLT

Example: Waiting time ~ N(10 minutes, 4²), i.e. μ = 10 and σ = 4. For a sample of 64 customers:

$\bar{X} \sim N\left(10, \frac{16}{64}\right) = N(10, 0.25)$

$SE = \frac{4}{\sqrt{64}} = 0.5$ minutes

Python

import numpy as np
from scipy import stats

mu, sigma = 10, 4
n = 64
SE = sigma / np.sqrt(n)  # standard error of the sample mean

# P(sample mean > 11 minutes)?
P = 1 - stats.norm.cdf(11, mu, SE)
print(f"P(X̄ > 11) = {P:.4f}")

6. Exponential Distribution

6.1 Definition

Models the waiting time between events in a Poisson process with rate λ > 0.

$f(x) = \lambda e^{-\lambda x}, \quad x \geq 0$

6.2 Properties

Statistic	Formula
E(X)	1/λ
Var(X)	1/λ²
CDF	$F(x) = 1 - e^{-\lambda x}, \quad x \geq 0$

6.3 Memoryless Property

$P(X > s + t \mid X > s) = P(X > t)$

The probability of waiting an additional time t does not depend on having already waited s.
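The memoryless property can be checked numerically from the survival function P(X > x) = e^{-λx}, which follows from the CDF in 6.2. In the sketch below, λ = 0.5, s = 3 and t = 2 are arbitrary illustrative values, and `survival` is a hypothetical helper name.

```python
import math


def survival(x, lam):
    """P(X > x) for an Exponential distribution with rate lam."""
    return math.exp(-lam * x)


lam = 0.5        # illustrative rate
s, t = 3.0, 2.0  # already waited s, asking about t more

# P(X > s + t | X > s) = P(X > s + t) / P(X > s)
conditional = survival(s + t, lam) / survival(s, lam)
unconditional = survival(t, lam)

print(f"P(X > s + t | X > s) = {conditional:.6f}")
print(f"P(X > t)             = {unconditional:.6f}")
```

The two values match because e^{-λ(s+t)} / e^{-λs} = e^{-λt}: the factor depending on s cancels.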

6.4 Python code

Python

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

# λ = 0.5 (on average 2 minutes between events)
lam = 0.5
exp = stats.expon(scale=1/lam)  # scipy parameterizes by scale = 1/λ

# P(X < 3)
print(f"P(X < 3) = {exp.cdf(3):.4f}")

# P(X > 5)
print(f"P(X > 5) = {1 - exp.cdf(5):.4f}")

# E(X) and Var(X)
print(f"E(X) = {exp.mean():.4f}")  # 2
print(f"Var(X) = {exp.var():.4f}")  # 4

# Visualization
x = np.linspace(0, 10, 1000)
plt.figure(figsize=(10, 5))
plt.plot(x, exp.pdf(x), label='PDF')
plt.fill_between(x, exp.pdf(x), where=(x < 3), alpha=0.3, label='P(X < 3)')
plt.xlabel('Time')
plt.ylabel('Density')
plt.title(f'Exponential Distribution (λ={lam})')
plt.legend()
plt.grid(True)
plt.show()

7. Checking for Normality

7.1 Visual Methods

Python

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Generate data
np.random.seed(42)
normal_data = np.random.normal(50, 10, 1000)
skewed_data = np.random.exponential(10, 1000)

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Histogram + Normal curve
axes[0, 0].hist(normal_data, bins=30, density=True, alpha=0.7)
x = np.linspace(10, 90, 100)
axes[0, 0].plot(x, stats.norm.pdf(x, 50, 10), 'r-', linewidth=2)
axes[0, 0].set_title('Normal Data: Histogram')

axes[0, 1].hist(skewed_data, bins=30, density=True, alpha=0.7)
axes[0, 1].set_title('Skewed Data: Histogram')

# Q-Q Plot
stats.probplot(normal_data, dist="norm", plot=axes[1, 0])
axes[1, 0].set_title('Normal Data: Q-Q Plot')

stats.probplot(skewed_data, dist="norm", plot=axes[1, 1])
axes[1, 1].set_title('Skewed Data: Q-Q Plot')

plt.tight_layout()
plt.show()

7.2 Statistical Tests

Python

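import numpy as np
from scipy import stats

# Same data as in 7.1, regenerated so this block runs on its own
np.random.seed(42)
normal_data = np.random.normal(50, 10, 1000)

# Shapiro-Wilk test (the p-value may be inaccurate for n > 5000)
stat, p_value = stats.shapiro(normal_data[:500])
print("Shapiro-Wilk Test:")
print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")
# p > 0.05 means we fail to reject normality; it does not prove the data are Normal
print(f"Fail to reject normality at α = 0.05: {p_value > 0.05}")

# D'Agostino's K² test
stat, p_value = stats.normaltest(normal_data)
print("\nD'Agostino's K² Test:")
print(f"Statistic: {stat:.4f}, p-value: {p_value:.4f}")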

8. Practice exercises

Exercise 1: Standard Normal

P(Z < 1.64)?
P(Z > -0.5)?
P(-1.96 < Z < 1.96)?
Find z such that P(Z > z) = 0.10

Exercise 2: Normal Distribution

Height of male students ~ N(170 cm, 6²):

P(taller than 180 cm)?
P(height between 165 and 175 cm)?
What height marks the top 10%?

Exercise 3: CLT

Order processing time: μ = 15 minutes, σ = 5 minutes. For a sample of 100 orders:

SE = ?
P(mean processing time > 16 minutes)?

Summary

Distribution	PDF/Properties	E(X)	Var(X)
Normal	Bell curve, symmetric	μ	σ²
Standard Normal	N(0,1)	0	1
Exponential	Memoryless	1/λ	1/λ²

Key Takeaways

The Normal distribution is the most important distribution in statistics
A Z-score standardizes a value to the Standard Normal scale
CLT: the sample mean becomes approximately Normal as n grows
SE = σ/√n decreases as n increases
The Exponential distribution models waiting times