MinAI - Về trang chủ
Hướng dẫn
7/132.5 giờ
Đang tải...

Time Series Analysis

Phân tích trends, seasonality, và forecasting trong dữ liệu theo thời gian

0

🎯 Mục tiêu bài học

TB5 min
Sau bài học này, bạn sẽ:
  • Chuẩn bị và xử lý time series data (datetime operations, missing dates)
  • Thực hiện resampling, moving averages, rolling statistics
  • Phân tích trend và seasonality (decomposition, STL, seasonal indices)
  • Xây dựng forecasting models (Naive, SES, Holt-Winters)
📋 Thông tin bài học
Thông tinChi tiết
⏱️ Thời lượng2.5 giờ
📖 Chủ đề chínhTime Series Decomposition, Forecasting
💡 Kiến thức cần cóPandas, Matplotlib cơ bản
🎯 OutputPhân tích và forecast time series data thực tế
1

📖 Thuật ngữ quan trọng

TB5 min
Thuật ngữTiếng ViệtMô tả
Time SeriesChuỗi thời gianDữ liệu có thứ tự theo thời gian (daily sales, monthly revenue)
TrendXu hướngHướng dài hạn — upward, downward, hoặc flat
SeasonalityTính mùa vụPatterns lặp lại theo chu kỳ (daily, weekly, yearly)
ResidualPhần dưRandom noise sau khi loại bỏ trend và seasonality
ResamplingLấy mẫu lạiChuyển đổi tần suất (daily → monthly, monthly → daily)
Rolling WindowCửa sổ trượtTính statistics trên N rows gần nhất (moving average)
EMATrung bình mũExponential Moving Average — weighted trung bình gần đây hơn
DecompositionPhân táchTách time series thành Trend + Seasonal + Residual
STLPhân tách STLSeasonal-Trend decomposition using LOESS — robust hơn
Holt-WintersMô hình HWTriple Exponential Smoothing — forecast trend + seasonality

Checkpoint

Time series = Trend + Seasonality + Residual. Decomposition tách 3 components để hiểu pattern. Holt-Winters forecast cả trend và seasonality. Time series analysis là foundation của business analytics — sales forecasting, inventory management, demand planning đều dùng nó!

2

📊 Time Series Data Preparation

TB5 min
Time Series trong Business

Time series analysis là foundation của business analytics — từ sales forecasting, inventory management đến demand planning. Master kỹ năng này để predict và plan hiệu quả.

Creating Time Series

Python
1import pandas as pd
2import numpy as np
3import matplotlib.pyplot as plt
4
5# Create datetime index
6dates = pd.date_range(start='2022-01-01', end='2024-12-31', freq='D')
7
8# Generate sample data: trend + seasonality + noise
9np.random.seed(42)
10n = len(dates)
11trend = np.linspace(100, 200, n)
12seasonality = 30 * np.sin(2 * np.pi * np.arange(n) / 365)
13noise = np.random.normal(0, 10, n)
14
15sales = trend + seasonality + noise
16
17df = pd.DataFrame({'date': dates, 'sales': sales})
18df.set_index('date', inplace=True)
19
20print(df.head())
21print(f"\nShape: {df.shape}")
22print(f"Date range: {df.index.min()} to {df.index.max()}")

Datetime Operations

Python
1# Date components
2df['year'] = df.index.year
3df['month'] = df.index.month
4df['quarter'] = df.index.quarter
5df['day_of_week'] = df.index.dayofweek
6df['week_of_year'] = df.index.isocalendar().week
7df['is_weekend'] = df.index.dayofweek >= 5
8df['is_month_start'] = df.index.is_month_start
9df['is_month_end'] = df.index.is_month_end
10
11# Parse dates from strings
12df['date'] = pd.to_datetime(df['date_string'], format='%Y-%m-%d')
13
14# Set datetime index
15df = df.set_index('date').sort_index()

Handling Missing Dates

Python
1# Create complete date range
2full_range = pd.date_range(df.index.min(), df.index.max(), freq='D')
3
4# Reindex to fill missing dates
5df = df.reindex(full_range)
6
7# Fill missing values
8df['sales'] = df['sales'].fillna(method='ffill') # Forward fill
9df['sales'] = df['sales'].interpolate(method='linear') # Interpolate
10df['sales'] = df['sales'].fillna(df['sales'].mean()) # Mean
11
12# Check for gaps
13missing_dates = full_range.difference(df.index)
14print(f"Missing dates: {len(missing_dates)}")
Ví dụ
1┌─────────────────────────────────────────────────────────┐
2│ Time Series Decomposition │
3├─────────────────────────────────────────────────────────┤
4│ Original = Trend + Seasonality + Residual │
5│ │
6│ Trend: Long-term direction (↗↘→) │
7│ Seasonality: Regular patterns (daily/weekly/yearly) │
8│ Residual: Random noise, unpredictable │
9└─────────────────────────────────────────────────────────┘

Checkpoint

Time series cần datetime index. pd.date_range tạo complete date range. .reindex() + fillna()/interpolate() xử lý missing dates. Components: Trend + Seasonality + Residual. Missing dates là vấn đề phổ biến — bạn chọn forward fill hay interpolation tùy vào data type!

3

📈 Visualization & Resampling

TB5 min

Basic Time Series Plots

Python
1fig, axes = plt.subplots(3, 1, figsize=(14, 10))
2
3# Line plot
4axes[0].plot(df.index, df['sales'], linewidth=0.8)
5axes[0].set_title('Daily Sales')
6
7# Monthly resampling
8monthly = df['sales'].resample('M').mean()
9axes[1].plot(monthly.index, monthly.values, marker='o')
10axes[1].set_title('Monthly Average Sales')
11
12# Year-over-year comparison
13for year in df.index.year.unique():
14 yearly = df[df.index.year == year]['sales']
15 yearly.index = yearly.index.dayofyear
16 axes[2].plot(yearly.index, yearly.values, label=str(year), alpha=0.7)
17axes[2].set_title('Year-over-Year Comparison')
18axes[2].legend()
19
20plt.tight_layout()
21plt.show()

Heatmap

Python
1heatmap_data = df.pivot_table(
2 values='sales',
3 index=df.index.year,
4 columns=df.index.month,
5 aggfunc='mean'
6)
7
8plt.figure(figsize=(12, 6))
9plt.imshow(heatmap_data, cmap='YlOrRd', aspect='auto')
10plt.colorbar(label='Average Sales')
11plt.xticks(range(12), ['Jan','Feb','Mar','Apr','May','Jun',
12 'Jul','Aug','Sep','Oct','Nov','Dec'])
13plt.title('Sales Heatmap by Year and Month')
14plt.show()

Downsampling

Python
1# Daily → Weekly
2weekly = df['sales'].resample('W').agg(['mean', 'sum', 'min', 'max'])
3
4# Daily → Monthly
5monthly = df['sales'].resample('M').agg(['mean', 'std', 'min', 'max'])
6
7# Daily → Quarterly
8quarterly = df['sales'].resample('Q').sum()
9
10# Custom aggregation
11monthly_stats = df.resample('M').agg({
12 'sales': ['sum', 'mean', 'std']
13})

Upsampling

Python
1monthly_data = pd.DataFrame({
2 'date': pd.date_range('2024-01-01', periods=12, freq='M'),
3 'value': [100, 120, 115, 130, 145, 160, 155, 170, 165, 180, 190, 200]
4})
5monthly_data.set_index('date', inplace=True)
6
7# Forward fill
8daily = monthly_data.resample('D').ffill()
9
10# Linear interpolation
11daily_interp = monthly_data.resample('D').interpolate(method='linear')
12
13# Spline interpolation (smoother)
14daily_spline = monthly_data.resample('D').interpolate(method='spline', order=3)

Checkpoint

YoY comparison plots hiệu quả nhất cho seasonal data. .resample() chuyển đổi tần suất — downsampling dùng agg, upsampling dùng ffill/interpolate. Heatmap là visualization tuyệt vời để spot seasonal patterns — bạn thấy pattern nào trong data của mình?

4

📉 Moving Statistics & Trend Analysis

TB5 min

Rolling Windows

Python
1# Simple moving average
2df['SMA_7'] = df['sales'].rolling(window=7).mean()
3df['SMA_30'] = df['sales'].rolling(window=30).mean()
4
5# Exponential moving average
6df['EMA_7'] = df['sales'].ewm(span=7, adjust=False).mean()
7df['EMA_30'] = df['sales'].ewm(span=30, adjust=False).mean()
8
9# Rolling statistics
10df['rolling_std'] = df['sales'].rolling(window=30).std()
11df['rolling_min'] = df['sales'].rolling(window=30).min()
12df['rolling_max'] = df['sales'].rolling(window=30).max()
13
14# Plot comparison
15fig, ax = plt.subplots(figsize=(14, 6))
16ax.plot(df.index, df['sales'], alpha=0.3, label='Actual')
17ax.plot(df.index, df['SMA_30'], label='30-day SMA')
18ax.plot(df.index, df['EMA_30'], label='30-day EMA')
19ax.legend()
20ax.set_title('Sales with Moving Averages')
21plt.show()

Expanding Windows

Python
1# Cumulative statistics
2df['cumsum'] = df['sales'].cumsum()
3df['cummean'] = df['sales'].expanding().mean()
4df['cummax'] = df['sales'].expanding().max()
5
6# Year-to-date
7df['ytd_sum'] = df.groupby(df.index.year)['sales'].cumsum()

Linear Trend

Python
1from scipy import stats
2
3x = np.arange(len(df))
4y = df['sales'].values
5
6slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
7df['trend'] = intercept + slope * x
8
9print(f"Slope: {slope:.4f} per day")
10print(f"R-squared: {r_value**2:.4f}")
11print(f"P-value: {p_value:.4e}")
12
13plt.figure(figsize=(12, 6))
14plt.plot(df.index, df['sales'], alpha=0.5, label='Actual')
15plt.plot(df.index, df['trend'], 'r-', linewidth=2, label='Trend')
16plt.legend()
17plt.title(f'Sales Trend (slope: {slope:.4f}/day)')
18plt.show()

Detrending

Python
1# Remove linear trend
2df['detrended'] = df['sales'] - df['trend']
3
4# Using differencing
5df['diff_1'] = df['sales'].diff(1) # First difference
6df['diff_7'] = df['sales'].diff(7) # Weekly difference
7
8# Percentage change
9df['pct_change'] = df['sales'].pct_change()
10df['pct_change_7'] = df['sales'].pct_change(periods=7)

Checkpoint

SMA smooth noise, EMA giữ gần recent values hơn. Rolling operations cho real-time statistics. Linear regression xác định trend slope. Differencing giúp detrend data. SMA vs EMA — bạn chọn cái nào? EMA react nhanh hơn với changes gần đây!

5

🔬 Seasonality Analysis

TB5 min

Decomposition

Python
1from statsmodels.tsa.seasonal import seasonal_decompose
2
3# Additive: Original = Trend + Seasonal + Residual
4result_add = seasonal_decompose(df['sales'], model='additive', period=365)
5
6# Multiplicative: Original = Trend × Seasonal × Residual
7result_mult = seasonal_decompose(df['sales'], model='multiplicative', period=365)
8
9# Plot components
10fig = result_add.plot()
11fig.set_size_inches(14, 10)
12plt.tight_layout()
13plt.show()
14
15# Extract components
16trend = result_add.trend
17seasonal = result_add.seasonal
18residual = result_add.resid

STL Decomposition (More Robust)

Python
1from statsmodels.tsa.seasonal import STL
2
3stl = STL(df['sales'], period=365, robust=True)
4result = stl.fit()
5
6fig = result.plot()
7fig.set_size_inches(14, 10)
8plt.show()
9
10# Get seasonally adjusted data
11df['seasonally_adjusted'] = df['sales'] - result.seasonal

Seasonal Indices

Python
1def calculate_seasonal_indices(df, column, freq='month'):
2 """Calculate seasonal indices"""
3 if freq == 'month':
4 grouped = df.groupby(df.index.month)[column].mean()
5 elif freq == 'dayofweek':
6 grouped = df.groupby(df.index.dayofweek)[column].mean()
7 elif freq == 'quarter':
8 grouped = df.groupby(df.index.quarter)[column].mean()
9
10 overall_mean = df[column].mean()
11 seasonal_index = grouped / overall_mean
12 return seasonal_index
13
14monthly_index = calculate_seasonal_indices(df, 'sales', 'month')
15print("Monthly Seasonal Indices:")
16print(monthly_index)
17
18# Visualize
19plt.figure(figsize=(10, 5))
20monthly_index.plot(kind='bar')
21plt.axhline(y=1, color='r', linestyle='--')
22plt.title('Monthly Seasonal Indices')
23plt.xlabel('Month')
24plt.ylabel('Index (1.0 = average)')
25plt.show()

Checkpoint

Additive decomposition khi seasonal amplitude cố định, multiplicative khi amplitude tăng theo trend. STL robust hơn với outliers. Seasonal indices > 1 = above average. Seasonal index = 1.2 nghĩa là tháng đó cao hơn trung bình 20%. Bạn có thể dùng nó để adjust forecasts!

6

🔮 Forecasting

TB5 min

Simple Methods

Python
1# Naive forecast (last value)
2df['forecast_naive'] = df['sales'].shift(1)
3
4# Seasonal naive (same day last year)
5df['forecast_seasonal_naive'] = df['sales'].shift(365)
6
7# Moving average forecast
8df['forecast_ma'] = df['sales'].rolling(window=30).mean().shift(1)
9
10# Exponential smoothing forecast
11df['forecast_ewm'] = df['sales'].ewm(span=30).mean().shift(1)

Simple Exponential Smoothing (SES)

Python
1from statsmodels.tsa.holtwinters import SimpleExpSmoothing
2
3model = SimpleExpSmoothing(df['sales']).fit(smoothing_level=0.2)
4df['ses_fitted'] = model.fittedvalues
5forecast = model.forecast(30) # Next 30 days
6print(f"Smoothing level (alpha): {model.params['smoothing_level']:.4f}")

Holt-Winters (Triple Exponential Smoothing)

Python
1from statsmodels.tsa.holtwinters import ExponentialSmoothing
2
3model = ExponentialSmoothing(
4 df['sales'],
5 trend='add', # Additive trend
6 seasonal='add', # Additive seasonality
7 seasonal_periods=365 # Yearly seasonality
8).fit()
9
10df['hw_fitted'] = model.fittedvalues
11forecast = model.forecast(90) # Next 90 days
12
13# Plot
14fig, ax = plt.subplots(figsize=(14, 6))
15ax.plot(df.index[-365:], df['sales'][-365:], label='Actual')
16ax.plot(forecast.index, forecast.values, 'r--', label='Forecast')
17ax.legend()
18ax.set_title('Holt-Winters Forecast')
19plt.show()

Forecast Evaluation

Python
1from sklearn.metrics import mean_absolute_error, mean_squared_error
2
3def evaluate_forecast(actual, predicted, model_name='Model'):
4 """Calculate forecast metrics"""
5 mask = ~(np.isnan(actual) | np.isnan(predicted))
6 actual, predicted = actual[mask], predicted[mask]
7
8 mae = mean_absolute_error(actual, predicted)
9 rmse = np.sqrt(mean_squared_error(actual, predicted))
10 mape = np.mean(np.abs((actual - predicted) / actual)) * 100
11
12 print(f"\n{model_name} Performance:")
13 print(f" MAE: {mae:.2f}")
14 print(f" RMSE: {rmse:.2f}")
15 print(f" MAPE: {mape:.2f}%")
16 return {'mae': mae, 'rmse': rmse, 'mape': mape}
17
18# Compare methods
19evaluate_forecast(df['sales'], df['forecast_naive'], 'Naive')
20evaluate_forecast(df['sales'], df['forecast_ma'], 'Moving Average')
21evaluate_forecast(df['sales'], df['hw_fitted'], 'Holt-Winters')

Checkpoint

Naive/seasonal naive là baseline. SES cho no-trend data. Holt-Winters forecast cả trend + seasonality. Evaluate với MAE, RMSE, MAPE. MAPE < 10% = excellent forecast. Bạn nên so sánh model phức tạp với Naive baseline — nếu không better thì không đáng dùng!

7

💾 Time Series SQL

TB5 min

Date Functions

SQL
1-- PostgreSQL date operations
2SELECT
3 order_date,
4 EXTRACT(YEAR FROM order_date) as year,
5 EXTRACT(MONTH FROM order_date) as month,
6 DATE_TRUNC('month', order_date) as month_start,
7 order_date - INTERVAL '7 days' as week_ago
8FROM orders;
9
10-- Generate date series
11SELECT generate_series(
12 '2024-01-01'::date,
13 '2024-12-31'::date,
14 '1 day'::interval
15)::date as date;

Time Series Aggregation

SQL
1-- Monthly aggregation
2SELECT
3 DATE_TRUNC('month', order_date) as month,
4 COUNT(*) as orders,
5 SUM(amount) as revenue
6FROM orders
7GROUP BY DATE_TRUNC('month', order_date)
8ORDER BY month;
9
10-- Moving average with window functions
11SELECT
12 order_date, amount,
13 AVG(amount) OVER (
14 ORDER BY order_date
15 ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
16 ) as moving_avg_7d
17FROM orders;
18
19-- YoY comparison
20WITH monthly_sales AS (
21 SELECT DATE_TRUNC('month', order_date) as month,
22 SUM(amount) as revenue
23 FROM orders GROUP BY 1
24)
25SELECT month, revenue,
26 LAG(revenue, 12) OVER (ORDER BY month) as revenue_last_year,
27 ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY month)) /
28 LAG(revenue, 12) OVER (ORDER BY month) * 100, 2) as yoy_growth
29FROM monthly_sales;

Fill Missing Dates

SQL
1WITH date_series AS (
2 SELECT generate_series(
3 (SELECT MIN(order_date) FROM orders),
4 (SELECT MAX(order_date) FROM orders),
5 '1 day'::interval
6 )::date as date
7),
8daily_sales AS (
9 SELECT order_date::date as date, SUM(amount) as revenue
10 FROM orders GROUP BY 1
11)
12SELECT ds.date, COALESCE(d.revenue, 0) as revenue
13FROM date_series ds
14LEFT JOIN daily_sales d ON ds.date = d.date
15ORDER BY ds.date;

Checkpoint

DATE_TRUNC và EXTRACT cho time grouping. Window functions cho moving averages và YoY growth trong SQL. generate_series + LEFT JOIN cho fill missing dates. Kết hợp SQL time series aggregation với Python forecasting cho pipeline analytics hoàn chỉnh!

8

📋 Tổng kết

TB5 min

Kiến thức đã học

Chủ đềNội dung chính
Data Preparationdatetime index, date components, missing dates handling
VisualizationLine plots, YoY comparison, heatmaps, seasonal plots
Resamplingresample() — downsampling (agg), upsampling (interpolate)
Moving Statisticsrolling(), ewm(), expanding(), cumsum
Trend AnalysisLinear regression, polynomial fit, detrending, differencing
Seasonalityseasonal_decompose(), STL, seasonal indices
ForecastingNaive, SES, Holt-Winters, forecast evaluation
SQL FunctionsDATE_TRUNC, EXTRACT, generate_series, window functions
MetricsMAE, RMSE, MAPE

Key Takeaways

  1. ✅ Time Series = Trend + Seasonality + Residual
  2. STL decomposition robust hơn standard decomposition
  3. Holt-Winters forecast cả trend và seasonality
  4. ✅ Luôn so sánh model với Naive baseline
  5. MAPE là metric dễ hiểu nhất (% error)
  6. ✅ Kết hợp SQL aggregation + Python analysis cho pipeline hoàn chỉnh

Câu hỏi tự kiểm tra

  1. Trend, Seasonality, Residual là gì?
  2. STL decomposition khác seasonal_decompose() thế nào?
  3. Holt-Winters forecast được những thành phần nào?
  4. MAPE dễ hiểu hơn MAE/RMSE vì sao?

Bài tiếp theo: Cohort Analysis →

🎉 Tuyệt vời! Bạn đã nắm vững Time Series Analysis!

Nhớ: Time Series = Trend + Seasonality + Residual. Decompose trước, forecast sau — luôn so với Naive baseline!