🎯 Sampling and Sampling Distributions

Learning Objectives

After this lesson, you will be able to:

Explain the main sampling methods
Understand sampling distributions
Distinguish a parameter from a statistic
Compute the standard error

1. Population vs Sample

1.1 Definitions

Concept	Population	Sample
Definition	The entire group under study	A subset selected from the population
Size notation	N	n
Example	All university students in Vietnam	1,000 surveyed students

1.2 Parameter vs Statistic

Aspect	Parameter	Statistic
Source	Population	Sample
Mean	μ (mu)	x̄ (x-bar)
Standard deviation	σ (sigma)	s
Proportion	p	p̂ (p-hat)
Behavior	Fixed (usually unknown)	Varies from sample to sample

Key Insight

A statistic is an estimate of a parameter. The goal of inferential statistics is to use a statistic, computed from a sample, to infer the corresponding population parameter.
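
To make the distinction concrete, the sketch below treats a simulated array as the whole population, computes the parameter μ once, and then computes x̄ for several samples. The heights, the seed, and the sample size are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 simulated heights in cm
population = rng.normal(loc=170, scale=8, size=100_000)
mu = population.mean()  # parameter: one fixed value for this population

# Statistic: recomputed for every sample, so it varies
sample_means = [
    rng.choice(population, size=100, replace=False).mean()
    for _ in range(3)
]

print(f"Parameter mu = {mu:.2f}")
for i, x_bar in enumerate(sample_means, start=1):
    print(f"Sample {i}: x-bar = {x_bar:.2f}")
```

Each run of the loop gives a slightly different x̄, while μ does not change.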

2. Sampling Methods

2.1 Probability Sampling

a) Simple Random Sampling

Every element has an equal chance of being selected
Use a random number generator to make the selection

Python

import numpy as np

population = list(range(1, 1001))  # IDs for 1,000 people
# replace=False: each person can be selected at most once
sample = np.random.choice(population, size=100, replace=False)
print(f"Sample size: {len(sample)}")
print(f"First 10: {sample[:10]}")

b) Stratified Sampling

Divide the population into strata (groups)
Draw a sample from each stratum

Python

import numpy as np
import pandas as pd

# Assume roughly 60% male, 40% female
np.random.seed(42)
population = pd.DataFrame({
    'id': range(1000),
    'gender': np.random.choice(['M', 'F'], 1000, p=[0.6, 0.4])
})

# Stratified sampling: draw 10% from each group.
# GroupBy.sample draws within each stratum and keeps all columns.
sample = (
    population.groupby('gender')
    .sample(frac=0.1, random_state=42)
    .reset_index(drop=True)
)

print(f"Sample size: {len(sample)}")
print(sample['gender'].value_counts())

c) Cluster Sampling

Divide the population into clusters
Randomly select some of the clusters
Include every element in the selected clusters
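
The three steps above can be sketched as follows. This is a minimal illustration, assuming a made-up population of 1,000 students spread over 20 schools; the column names `id` and `school` and the choice of 4 clusters are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical population: 1,000 students spread over 20 schools (clusters)
population = pd.DataFrame({
    "id": range(1000),
    "school": rng.integers(0, 20, size=1000),
})

# Step 1: randomly choose 4 of the clusters
all_schools = population["school"].unique()
chosen_schools = rng.choice(all_schools, size=4, replace=False)

# Step 2: keep every element that belongs to a chosen cluster
cluster_sample = population[population["school"].isin(chosen_schools)]

print(f"Chosen clusters: {sorted(chosen_schools.tolist())}")
print(f"Sample size: {len(cluster_sample)}")
```

Unlike stratified sampling, the final sample size is not fixed in advance here: it depends on how large the chosen clusters happen to be.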

d) Systematic Sampling

Select every k-th element, starting from a random position among the first k
k = N/n (rounded down to an integer)

Python

import numpy as np

population = list(range(1, 1001))
n = 100
k = len(population) // n  # sampling interval, k = 10

# Pick a random starting index in [0, k)
start = np.random.randint(0, k)
systematic_sample = population[start::k]
print(f"Sample size: {len(systematic_sample)}")

2.2 Non-Probability Sampling

Method	Description	Drawback
Convenience	Select whoever is easiest to reach	High bias
Purposive	Select deliberately, by judgment	Subjective
Snowball	Participants refer one another	Not representative
Quota	Fill a quota for each group	Not random

3. Sampling Bias

3.1 Types of Bias

Type	Cause	Example
Selection Bias	The sample is not selected at random	An online survey misses people who do not use the internet
Non-response Bias	Non-respondents differ from respondents	Busy people do not take part
Survivorship Bias	Only the "survivors" are considered	Analyzing successful companies while ignoring the ones that failed

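3.2 How to Reduce Bias

✅ Use random sampling
✅ Increase the sample size (this reduces random sampling error; it does not correct a biased selection process)
✅ Make sure the sample is representative
✅ Check the response rate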
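
A small simulation can show what selection bias does to an estimate. The setup below is an assumption made for illustration only: scores are uniform on 0 to 100, and the biased survey can only reach people whose score is above 50.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population: a score between 0 and 100 for 100,000 people
population = rng.uniform(0, 100, size=100_000)
pop_mean = population.mean()

# Biased selection (assumed): the survey only reaches people with score > 50
reachable = population[population > 50]
biased_sample = rng.choice(reachable, size=1000, replace=False)

# Simple random sample of the same size from the whole population
random_sample = rng.choice(population, size=1000, replace=False)

print(f"Population mean:    {pop_mean:.2f}")
print(f"Biased sample mean: {biased_sample.mean():.2f}")
print(f"Random sample mean: {random_sample.mean():.2f}")
```

The biased sample lands well above the population mean because everyone below 50 is excluded, while the random sample of the same size stays close to it.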

4. Sampling Distribution

4.1 Definition

A sampling distribution is the probability distribution of a statistic computed over many samples of the same size drawn from the same population.

Figure: Sampling Distribution. Population → Sample 1, Sample 2, Sample 3, …, Sample k → sample means x̄₁, x̄₂, x̄₃, …, x̄ₖ → distribution of x̄.

4.2 Sampling Distribution of the Mean

If the population has mean μ and standard deviation σ, then for samples of size n:

Statistic	Value
Mean of x̄	μ
Std of x̄ (SE)	σ/√n

4.3 Illustrative Code

Python

import numpy as np
import matplotlib.pyplot as plt

# Population (not normal: uniform on [0, 100])
np.random.seed(42)
population = np.random.uniform(0, 100, 100000)
pop_mean = np.mean(population)
pop_std = np.std(population)  # ddof=0: this array is the full population

print(f"Population Mean: {pop_mean:.2f}")
print(f"Population Std: {pop_std:.2f}")

# Draw many samples of each size and record the mean of each one
sample_sizes = [10, 30, 100]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, n in enumerate(sample_sizes):
    sample_means = [np.mean(np.random.choice(population, n)) for _ in range(1000)]

    axes[i].hist(sample_means, bins=30, density=True, alpha=0.7)
    axes[i].axvline(pop_mean, color='red', linestyle='--', label=f'μ = {pop_mean:.1f}')
    axes[i].set_title(f'n = {n}\nSE = {pop_std/np.sqrt(n):.2f}')
    axes[i].legend()

plt.suptitle('Sampling Distribution of the Mean')
plt.tight_layout()
plt.show()

5. Standard Error (SE)

5.1 Formula

$SE = \frac{\sigma}{\sqrt{n}}$

When σ is unknown, use s (the sample standard deviation) instead:

$SE = \frac{s}{\sqrt{n}}$

5.2 Interpretation

SE measures how precisely the sample mean estimates the population mean. As n increases, SE decreases and the estimate becomes more precise.

n	SE	Change in SE vs. n = 25
n = 25	SE = σ/5	(baseline)
n = 100	SE = σ/10	↓ 50%
n = 400	SE = σ/20	↓ 75%

The Square-Root Rule

To cut SE in half, you need to increase n fourfold!
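
The rule can be checked numerically with a few lines. This is a minimal sketch; the value σ = 8 and the helper name `standard_error` are assumptions for illustration.

```python
import numpy as np

sigma = 8.0  # assumed population standard deviation

def standard_error(sigma, n):
    """SE of the mean for a sample of size n."""
    return sigma / np.sqrt(n)

for n in [25, 100, 400]:
    print(f"n = {n:>3}: SE = {standard_error(sigma, n):.2f}")

# Quadrupling n halves SE, whatever the starting n is
ratio = standard_error(sigma, 4 * 25) / standard_error(sigma, 25)
print(f"SE(4n) / SE(n) = {ratio:.2f}")
```

The ratio does not depend on σ, since σ cancels: SE(4n)/SE(n) = √n/√(4n) = 1/2.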

5.3 Code to Compute SE

Python

import numpy as np

def calculate_se(data=None, sigma=None, n=None):
    """Compute the standard error of the mean."""
    if data is not None:
        # σ unknown: estimate it with the sample std (ddof=1)
        return np.std(data, ddof=1) / np.sqrt(len(data))
    elif sigma is not None and n is not None:
        # σ known
        return sigma / np.sqrt(n)
    else:
        raise ValueError("Provide either data or (sigma, n)")

# Example
sample = [45, 52, 48, 55, 50, 53, 47, 51, 49, 54]
se = calculate_se(data=sample)
print(f"SE from sample: {se:.4f}")

# When the population std is known
se_known = calculate_se(sigma=10, n=100)
print(f"SE with known σ: {se_known:.4f}")

6. Sampling Distribution of Proportion

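6.1 Formula

If the population proportion is p, then for large n the sample proportion p̂ is approximately normal with mean p and variance p(1-p)/n: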

$\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)$

6.2 Standard Error of Proportion

$SE_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$
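
The formula can be compared against a simulation. The sketch below draws many samples and measures how much p̂ actually varies; the values p = 0.3, n = 200, and the number of repetitions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

p, n = 0.3, 200        # assumed population proportion and sample size
n_samples = 10_000     # number of repeated samples

# Each binomial draw is the number of "successes" in one sample of size n
p_hats = rng.binomial(n, p, size=n_samples) / n

se_theory = np.sqrt(p * (1 - p) / n)
se_simulated = p_hats.std(ddof=1)

print(f"Mean of p-hat:  {p_hats.mean():.4f} (p = {p})")
print(f"SE (formula):   {se_theory:.4f}")
print(f"SE (simulated): {se_simulated:.4f}")
```

The simulated spread of p̂ should sit close to the value given by the formula, and the mean of p̂ close to p.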

6.3 Conditions

The normal approximation is reasonable when both of the following hold:

np ≥ 10
n(1-p) ≥ 10
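
The two conditions can be wrapped in a small helper. The function name `normal_approx_ok` and the (p, n) pairs below are hypothetical, chosen only to show one case that passes and two that fail.

```python
def normal_approx_ok(p, n, threshold=10):
    """Return True when both np and n(1-p) reach the threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

# (p, n) pairs chosen only for illustration
for p, n in [(0.52, 500), (0.02, 100), (0.5, 15)]:
    print(f"p = {p}, n = {n}: np = {n * p:.1f}, "
          f"n(1-p) = {n * (1 - p):.1f}, ok = {normal_approx_ok(p, n)}")
```

A very small or very large p needs a much larger n before both counts reach 10.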

6.4 Example

Support for candidate A in the population is 52%. In a survey of 500 people:

Python

import numpy as np
from scipy import stats

p = 0.52  # population proportion
n = 500   # sample size

SE = np.sqrt(p * (1 - p) / n)
print(f"SE = {SE:.4f}")

# P(p̂ > 0.55) under the normal approximation
z = (0.55 - p) / SE
prob = 1 - stats.norm.cdf(z)
print(f"P(p̂ > 0.55) = {prob:.4f}")
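
As a cross-check on the normal approximation used in this example, the same probability can be computed from the exact binomial distribution. This is a sketch under the assumption that the 500 responses are independent, so the number of supporters is Binomial(500, 0.52).

```python
from scipy import stats

p, n = 0.52, 500

# Normal approximation, as in the example
se = (p * (1 - p) / n) ** 0.5
prob_normal = stats.norm.sf((0.55 - p) / se)

# Exact binomial: p-hat > 0.55 means more than 275 of 500 respondents
prob_exact = stats.binom.sf(275, n, p)

print(f"Normal approximation: {prob_normal:.4f}")
print(f"Exact binomial:       {prob_exact:.4f}")
```

The two values should be close but not identical, because the normal curve is a continuous approximation to a discrete count.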

7. Sample Size Determination

7.1 For a Mean

$n = \left(\frac{z \cdot \sigma}{E}\right)^2$

where:

z = z-score for the chosen confidence level
σ = population standard deviation (or an estimate of it)
E = desired margin of error
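
The formula follows from setting the margin of error equal to z standard errors and solving for n. This is a short derivation, assuming σ is known and the sampling distribution of x̄ is approximately normal:

```latex
E = z \cdot SE = z \cdot \frac{\sigma}{\sqrt{n}}
\;\Longrightarrow\;
\sqrt{n} = \frac{z \cdot \sigma}{E}
\;\Longrightarrow\;
n = \left(\frac{z \cdot \sigma}{E}\right)^2
```

For example, with σ = 15, E = 3, and z ≈ 1.96 (95% confidence), n = (1.96 × 15 / 3)² = 9.8² = 96.04, which is rounded up to 97.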

7.2 For a Proportion

$n = \frac{z^2 \cdot p(1-p)}{E^2}$

If p is unknown, use p = 0.5 (the worst case, because p(1-p) is largest there)

7.3 Code to Compute Sample Size

Python

import numpy as np
from scipy import stats

def sample_size_mean(sigma, margin_error, confidence=0.95):
    """Compute the sample size needed to estimate a mean."""
    z = stats.norm.ppf((1 + confidence) / 2)  # two-sided critical value
    n = (z * sigma / margin_error) ** 2
    return int(np.ceil(n))  # round up so the margin of error is met

def sample_size_proportion(margin_error, confidence=0.95, p=0.5):
    """Compute the sample size needed to estimate a proportion."""
    z = stats.norm.ppf((1 + confidence) / 2)
    n = z**2 * p * (1 - p) / margin_error**2
    return int(np.ceil(n))

# Example: mean
# σ = 15, margin of error = 3, 95% confidence
n_mean = sample_size_mean(sigma=15, margin_error=3, confidence=0.95)
print(f"Sample size for mean: {n_mean}")

# Example: proportion
# Margin of error = 3%, 95% confidence
n_prop = sample_size_proportion(margin_error=0.03, confidence=0.95)
print(f"Sample size for proportion: {n_prop}")

8. Practice Exercises

Exercise 1: Standard Error

Population height: μ = 170 cm, σ = 8 cm.

What is SE when n = 25?
What is SE when n = 100?
What n is needed for SE ≤ 1 cm?

Exercise 2: Sampling Distribution

Processing time: μ = 10 minutes, σ = 3 minutes. For a sample of n = 36:

E(x̄) = ?
SE = ?
P(x̄ > 11) = ?

Exercise 3: Sample Size

A survey of the satisfaction rate:

Margin of error: 2%
Confidence: 95%
How many people need to be surveyed?

Summary

Concept	Formula	Meaning
SE (mean)	σ/√n	Precision of x̄
SE (proportion)	√(p(1-p)/n)	Precision of p̂
Sample size (mean)	(zσ/E)²	Required n
Sample size (prop)	z²p(1-p)/E²	Required n

Key Takeaways

Random sampling reduces bias
The sampling distribution of the mean approaches a normal distribution as n grows (CLT)
SE decreases as n increases (in proportion to 1/√n)
Sample size depends on the margin of error, the confidence level, and the variability (σ or p)
Doubling precision (halving SE) requires 4 times the sample size