Advanced SQL: Window Functions, CTEs & Optimization

🎯 Mục tiêu bài học

TB5 min

Sau bài học này, bạn sẽ:

Sử dụng thành thạo Window Functions (ranking, value, aggregate)
Viết Recursive CTEs cho hierarchical data và graph traversal
Đọc hiểu EXPLAIN plans và tối ưu query performance
Áp dụng indexing strategies và query rewriting techniques

📋 Thông tin bài học

Thông tin	Chi tiết
⏱️ Thời lượng	3 giờ
📖 Chủ đề chính	Window Functions, CTEs, Query Optimization
💡 Kiến thức cần có	SQL cơ bản (SELECT, JOIN, GROUP BY)
🎯 Output	Viết được advanced SQL queries cho business analytics

Task 0

📖 Thuật ngữ quan trọng

TB5 min

Thuật ngữ	Tiếng Việt	Mô tả
Window Function	Hàm cửa sổ	Tính toán trên tập rows liên quan đến current row mà không collapse kết quả
PARTITION BY	Phân nhóm	Chia data thành partitions cho window function (tương tự GROUP BY)
Frame Clause	Khung dữ liệu	Xác định rows nào được include trong window calculation
ROW_NUMBER	Số thứ tự hàng	Gán unique sequential number cho mỗi row trong partition
RANK / DENSE_RANK	Xếp hạng	Xếp hạng rows — RANK có gap, DENSE_RANK không gap
LAG / LEAD	Giá trị trước/sau	Truy cập value từ row trước (LAG) hoặc sau (LEAD)
CTE	Biểu thức bảng chung	Common Table Expression — temporary named result set
Recursive CTE	CTE đệ quy	CTE tự tham chiếu cho hierarchical/sequential data
EXPLAIN	Giải thích query	Hiển thị execution plan của database engine
Sargable	Tìm kiếm với index	Query có thể tận dụng index (Search ARGument able)

Checkpoint

Window Functions cho phép tính toán across rows mà giữ nguyên all rows (khác GROUP BY collapse rows). CTEs giúp organize complex queries, recursive CTEs xử lý hierarchical data. EXPLAIN giúp đọc execution plans để optimize. Bạn đã từng gặp trường hợp cần tính rank hoặc running total trong SQL chưa? Window Functions sẽ giải quyết những bài toán đó!

Task 1

🏆 Ranking Functions

TB5 min

Window Functions là gì?

Window Functions thực hiện calculations across rows liên quan đến current row, mà không collapse kết quả thành 1 row như GROUP BY. Đây là một trong những features mạnh nhất của SQL.

Window Function Syntax

Cú pháp: function_name(column) OVER (PARTITION BY ... ORDER BY ... frame_clause)

Thành phần	Mô tả
PARTITION BY	Define groups (like GROUP BY)
ORDER BY	Define row order within partition
frame_clause	Define which rows to include

ROW_NUMBER()

SQL

1-- Unique sequential number for each row
2SELECT 
3    employee_id, name, department, salary,
4    ROW_NUMBER() OVER (ORDER BY salary DESC) as salary_rank
5FROM employees;
6
7-- Row number within each department
8SELECT 
9    employee_id, name, department, salary,
10    ROW_NUMBER() OVER (
11        PARTITION BY department 
12        ORDER BY salary DESC
13    ) as dept_rank
14FROM employees;
15
16-- Use Case: Get top N per group
17WITH ranked AS (
18    SELECT 
19        *,
20        ROW_NUMBER() OVER (
21            PARTITION BY department 
22            ORDER BY salary DESC
23        ) as rn
24    FROM employees
25)
26SELECT * FROM ranked WHERE rn <= 3;  -- Top 3 per department

RANK() và DENSE_RANK()

SQL

1-- RANK: Ties get same rank, skip numbers
2-- DENSE_RANK: Ties get same rank, no gaps
3SELECT 
4    name, department, salary,
5    ROW_NUMBER() OVER (ORDER BY salary DESC) as row_num,
6    RANK() OVER (ORDER BY salary DESC) as rank,
7    DENSE_RANK() OVER (ORDER BY salary DESC) as dense_rank
8FROM employees;
9
10-- Example output:
11-- | name    | salary | row_num | rank | dense_rank |
12-- |---------|--------|---------|------|------------|
13-- | Alice   | 100000 |    1    |  1   |     1      |
14-- | Bob     | 100000 |    2    |  1   |     1      |  <- tie
15-- | Charlie |  90000 |    3    |  3   |     2      |  <- rank skips to 3
16-- | Diana   |  80000 |    4    |  4   |     3      |

NTILE() và PERCENT_RANK()

SQL

1-- Divide rows into N buckets
2SELECT 
3    name, salary,
4    NTILE(4) OVER (ORDER BY salary DESC) as salary_quartile
5FROM employees;
6
7-- PERCENT_RANK: Relative rank (0-1)
8-- CUME_DIST: Cumulative distribution
9SELECT 
10    name, salary,
11    RANK() OVER (ORDER BY salary) as rank,
12    PERCENT_RANK() OVER (ORDER BY salary) as pct_rank,
13    CUME_DIST() OVER (ORDER BY salary) as cume_dist
14FROM employees;

Checkpoint

ROW_NUMBER gán số unique, RANK/DENSE_RANK xử lý ties khác nhau, NTILE chia buckets, PERCENT_RANK cho relative position. ROW_NUMBER vs RANK vs DENSE_RANK — bạn thấy sự khác biệt khi xử lý tied values không? Khi nào dùng cái nào?

Task 2

📊 Value & Aggregate Window Functions

TB5 min

LAG() và LEAD()

SQL

1-- LAG: Access previous row | LEAD: Access next row
2SELECT 
3    date, sales,
4    LAG(sales, 1) OVER (ORDER BY date) as prev_day_sales,
5    LEAD(sales, 1) OVER (ORDER BY date) as next_day_sales,
6    sales - LAG(sales, 1) OVER (ORDER BY date) as daily_change
7FROM daily_sales;
8
9-- Multiple periods
10SELECT 
11    date, sales,
12    LAG(sales, 1) OVER (ORDER BY date) as prev_1,
13    LAG(sales, 7) OVER (ORDER BY date) as prev_7,   -- Week ago
14    LAG(sales, 30) OVER (ORDER BY date) as prev_30   -- Month ago
15FROM daily_sales;
16
17-- With default value for NULL
18SELECT 
19    date, sales,
20    LAG(sales, 1, 0) OVER (ORDER BY date) as prev_sales
21FROM daily_sales;

FIRST_VALUE() và LAST_VALUE()

SQL

1SELECT 
2    employee_id, department, salary,
3    FIRST_VALUE(salary) OVER (
4        PARTITION BY department 
5        ORDER BY salary DESC
6    ) as highest_in_dept,
7    LAST_VALUE(salary) OVER (
8        PARTITION BY department 
9        ORDER BY salary DESC
10        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
11    ) as lowest_in_dept
12FROM employees;
13-- Note: LAST_VALUE requires frame specification to work correctly!

Running Totals & Moving Averages

SQL

1-- Cumulative sum
2SELECT 
3    date, amount,
4    SUM(amount) OVER (ORDER BY date) as running_total
5FROM transactions;
6
7-- Year-to-date (reset each year)
8SELECT 
9    date, amount,
10    SUM(amount) OVER (
11        PARTITION BY YEAR(date) 
12        ORDER BY date
13    ) as ytd_total
14FROM transactions;
15
16-- 7-day moving average
17SELECT 
18    date, sales,
19    AVG(sales) OVER (
20        ORDER BY date 
21        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
22    ) as ma_7
23FROM daily_sales;
24
25-- 30-day moving average
26SELECT 
27    date, sales,
28    AVG(sales) OVER (
29        ORDER BY date 
30        ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
31    ) as ma_30
32FROM daily_sales;

Frame Clauses

Frame Specification

Frame clause xác định rows nào được include trong calculation. Default là RANGE UNBOUNDED PRECEDING khi có ORDER BY, cần explicit frame cho control tốt hơn.

SQL

1-- ROWS: Physical row count
2-- RANGE: Logical value range
3
4-- Frame boundary options:
5-- UNBOUNDED PRECEDING  → From partition start
6-- n PRECEDING          → n rows before current
7-- CURRENT ROW          → Current row
8-- n FOLLOWING          → n rows after current
9-- UNBOUNDED FOLLOWING  → To partition end
10
11-- Named Windows: Define once, use multiple times
12SELECT 
13    date, sales,
14    SUM(sales) OVER sales_window as running_total,
15    AVG(sales) OVER sales_window as running_avg,
16    COUNT(*) OVER sales_window as row_count
17FROM daily_sales
18WINDOW sales_window AS (
19    ORDER BY date 
20    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
21);

Checkpoint

LAG/LEAD truy cập rows trước/sau, FIRST_VALUE/LAST_VALUE lấy biên giá trị. Running totals dùng SUM OVER, moving averages dùng AVG với ROWS frame. Bạn có thể nghĩ ra business case nào dùng LAG để so sánh month-over-month không?

Task 3

💼 Window Functions — Ứng dụng thực tế

TB5 min

Year-over-Year Analysis

SQL

1WITH monthly_sales AS (
2    SELECT 
3        DATE_TRUNC('month', date) as month,
4        SUM(amount) as sales
5    FROM transactions
6    GROUP BY DATE_TRUNC('month', date)
7)
8SELECT 
9    month, sales,
10    LAG(sales, 12) OVER (ORDER BY month) as prev_year_sales,
11    ROUND(
12        (sales - LAG(sales, 12) OVER (ORDER BY month)) 
13        / LAG(sales, 12) OVER (ORDER BY month) * 100, 2
14    ) as yoy_growth_pct
15FROM monthly_sales
16ORDER BY month;

Session Analysis

SQL

1-- Identify user sessions (30 min gap = new session)
2WITH activity_with_gaps AS (
3    SELECT 
4        user_id, timestamp,
5        CASE 
6            WHEN timestamp - LAG(timestamp) OVER (
7                PARTITION BY user_id ORDER BY timestamp
8            ) > INTERVAL '30 minutes'
9            THEN 1 ELSE 0 
10        END as new_session
11    FROM user_activity
12),
13sessions AS (
14    SELECT *,
15        SUM(new_session) OVER (
16            PARTITION BY user_id ORDER BY timestamp
17        ) + 1 as session_id
18    FROM activity_with_gaps
19)
20SELECT 
21    user_id, session_id,
22    MIN(timestamp) as session_start,
23    MAX(timestamp) as session_end,
24    COUNT(*) as events_in_session
25FROM sessions
26GROUP BY user_id, session_id;

Gaps and Islands

SQL

1-- Identify consecutive sequences
2WITH numbered AS (
3    SELECT 
4        date, status,
5        ROW_NUMBER() OVER (ORDER BY date) -
6        ROW_NUMBER() OVER (PARTITION BY status ORDER BY date) as grp
7    FROM events
8)
9SELECT 
10    status,
11    MIN(date) as start_date,
12    MAX(date) as end_date,
13    COUNT(*) as consecutive_days
14FROM numbered
15GROUP BY status, grp
16ORDER BY start_date;

Percentile Analysis

SQL

1SELECT 
2    department, name, salary,
3    PERCENT_RANK() OVER (ORDER BY salary) as overall_percentile,
4    PERCENT_RANK() OVER (
5        PARTITION BY department ORDER BY salary
6    ) as dept_percentile,
7    CASE 
8        WHEN PERCENT_RANK() OVER (ORDER BY salary) >= 0.9 THEN 'Top 10%'
9        WHEN PERCENT_RANK() OVER (ORDER BY salary) >= 0.75 THEN 'Top 25%'
10        WHEN PERCENT_RANK() OVER (ORDER BY salary) >= 0.5 THEN 'Top 50%'
11        ELSE 'Bottom 50%'
12    END as percentile_group
13FROM employees;

Window Function Performance

Window functions có thể expensive cho large datasets. Optimize với proper indexing trên PARTITION BY và ORDER BY columns.

SQL

1-- Create index for window function optimization
2CREATE INDEX idx_sales_date ON sales(date);
3CREATE INDEX idx_sales_category_date ON sales(category, date);
4CREATE INDEX idx_transactions_user_date ON transactions(user_id, date);

Checkpoint

Window Functions giải quyết nhiều bài toán analytics thực tế: YoY growth, session analysis, gaps & islands, percentile ranking. Gaps & Islands là pattern rất powerful — bạn có thể áp dụng nó để detect consecutive active days của users không?

Task 4

🔄 CTEs & Recursive CTEs

TB5 min

Common Table Expressions

CTEs (WITH clause) tạo temporary named result sets, giúp organize complex queries, improve readability, và enable recursive operations.

Basic CTE — Multiple CTEs

SQL

1WITH 
2monthly_sales AS (
3    SELECT 
4        DATE_TRUNC('month', order_date) as month,
5        SUM(amount) as sales
6    FROM orders
7    GROUP BY DATE_TRUNC('month', order_date)
8),
9monthly_targets AS (
10    SELECT month, target_amount FROM targets
11)
12SELECT 
13    m.month, m.sales, t.target_amount,
14    m.sales - t.target_amount as variance
15FROM monthly_sales m
16JOIN monthly_targets t ON m.month = t.month;

Recursive CTE — Hierarchy Traversal

Recursive CTEs

Recursive CTEs reference themselves để iterate qua hierarchical hoặc sequential data. Cần base case và recursive case.

SQL

1-- Employee - Manager hierarchy
2WITH RECURSIVE org_chart AS (
3    -- Base case: Top-level managers
4    SELECT 
5        id, name, manager_id,
6        1 as level,
7        name as path
8    FROM employees
9    WHERE manager_id IS NULL
10    
11    UNION ALL
12    
13    -- Recursive case
14    SELECT 
15        e.id, e.name, e.manager_id,
16        oc.level + 1,
17        oc.path || ' > ' || e.name
18    FROM employees e
19    JOIN org_chart oc ON e.manager_id = oc.id
20)
21SELECT * FROM org_chart ORDER BY path;

Recursive CTE — Tree & BOM

SQL

1-- Categories with subcategories
2WITH RECURSIVE category_tree AS (
3    SELECT 
4        id, name, parent_id, 0 as depth,
5        ARRAY[id] as path_ids,
6        name as full_path
7    FROM categories
8    WHERE parent_id IS NULL
9    
10    UNION ALL
11    
12    SELECT 
13        c.id, c.name, c.parent_id,
14        ct.depth + 1,
15        ct.path_ids || c.id,
16        ct.full_path || ' > ' || c.name
17    FROM categories c
18    JOIN category_tree ct ON c.parent_id = ct.id
19)
20SELECT 
21    id,
22    REPEAT('  ', depth) || name as indented_name,
23    full_path
24FROM category_tree
25ORDER BY path_ids;

Number/Date Series Generation

SQL

1-- Generate date range
2WITH RECURSIVE dates AS (
3    SELECT DATE '2024-01-01' as date
4    UNION ALL
5    SELECT date + INTERVAL '1 day'
6    FROM dates
7    WHERE date < DATE '2024-12-31'
8)
9SELECT date FROM dates;
10
11-- Fill missing dates in time series
12WITH RECURSIVE all_dates AS (
13    SELECT MIN(date) as date FROM sales
14    UNION ALL
15    SELECT date + INTERVAL '1 day'
16    FROM all_dates
17    WHERE date < (SELECT MAX(date) FROM sales)
18),
19daily_sales AS (
20    SELECT date, SUM(amount) as sales FROM sales GROUP BY date
21)
22SELECT 
23    ad.date,
24    COALESCE(ds.sales, 0) as sales
25FROM all_dates ad
26LEFT JOIN daily_sales ds ON ad.date = ds.date
27ORDER BY ad.date;

Graph Traversal

SQL

1-- Find all paths from A to B
2WITH RECURSIVE paths AS (
3    SELECT 
4        from_node, to_node, weight,
5        ARRAY[from_node, to_node] as path,
6        weight as total_weight
7    FROM edges
8    WHERE from_node = 'A'
9    
10    UNION ALL
11    
12    SELECT 
13        p.from_node, e.to_node, e.weight,
14        p.path || e.to_node,
15        p.total_weight + e.weight
16    FROM paths p
17    JOIN edges e ON p.to_node = e.from_node
18    WHERE NOT e.to_node = ANY(p.path)  -- Avoid cycles
19)
20SELECT * FROM paths
21WHERE to_node = 'Z'
22ORDER BY total_weight;

Complex CTE Patterns

SQL

1-- CLV Analysis with multiple CTEs
2WITH 
3first_purchase AS (
4    SELECT customer_id, MIN(order_date) as first_order_date
5    FROM orders GROUP BY customer_id
6),
7customer_metrics AS (
8    SELECT 
9        o.customer_id,
10        fp.first_order_date,
11        COUNT(DISTINCT o.order_id) as total_orders,
12        SUM(o.amount) as total_revenue,
13        COUNT(DISTINCT DATE_TRUNC('month', o.order_date)) as active_months
14    FROM orders o
15    JOIN first_purchase fp ON o.customer_id = fp.customer_id
16    GROUP BY o.customer_id, fp.first_order_date
17),
18cohort_analysis AS (
19    SELECT 
20        customer_id,
21        DATE_TRUNC('month', first_order_date) as cohort_month,
22        total_orders, total_revenue,
23        total_revenue / GREATEST(active_months, 1) as monthly_value
24    FROM customer_metrics
25)
26SELECT 
27    cohort_month,
28    COUNT(*) as customers,
29    ROUND(AVG(total_orders), 1) as avg_orders,
30    ROUND(AVG(total_revenue), 2) as avg_revenue,
31    ROUND(AVG(monthly_value), 2) as avg_monthly_value
32FROM cohort_analysis
33GROUP BY cohort_month
34ORDER BY cohort_month;

Checkpoint

Basic CTEs organize complex queries. Recursive CTEs xử lý hierarchical data (org charts, trees, BOM), generate series, và traverse graphs. Luôn cần base case + termination condition. Recursive CTEs rất hữu ích cho org charts — bạn có thể nghĩ ra cách dùng nó cho product categories trên e-commerce không?

Task 5

🔍 Query Optimization — EXPLAIN & Indexing

TB5 min

Tại sao Optimization quan trọng?

Query chậm = User experience kém, server costs cao, bottlenecks trong pipelines. Query optimization là kỹ năng essential cho mọi Data professional.

Query Execution Flow

Query Execution Pipeline

📝SQL Query

🔍Parser

⚡Optimizer (Cost-based)

▶️Executor

📊Results

EXPLAIN Command

SQL

1-- PostgreSQL
2EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;
3
4/*
5Seq Scan on orders  (cost=0.00..1520.00 rows=50 width=120)
6  (actual time=0.015..12.450 rows=48 loops=1)
7  Filter: (customer_id = 100)
8  Rows Removed by Filter: 49952
9Planning Time: 0.085 ms
10Execution Time: 12.480 ms
11*/
12
13-- Key metrics:
14-- cost: Estimated cost (startup..total)
15-- rows: Estimated rows returned
16-- actual time: Real execution time (ms)
17-- Rows Removed by Filter: Wasted work!

Scan Types

SQL

1-- Sequential Scan: Reads entire table (slowest)
2Seq Scan on orders
3
4-- Index Scan: Uses index to find rows (fast)
5Index Scan using idx_customer_id on orders
6
7-- Index Only Scan: All data from index (fastest)
8Index Only Scan using idx_customer_date on orders
9
10-- Bitmap Scan: For multiple conditions
11Bitmap Heap Scan on orders
12  -> Bitmap Index Scan on idx_customer_id

Indexing Strategies

SQL

1-- Single column index
2CREATE INDEX idx_customer_id ON orders(customer_id);
3
4-- Composite index (column order matters!)
5CREATE INDEX idx_customer_date ON orders(customer_id, order_date);
6
7-- Partial index (PostgreSQL)
8CREATE INDEX idx_active_orders ON orders(order_date)
9WHERE status = 'active';
10
11-- Covering index
12CREATE INDEX idx_covering ON orders(customer_id)
13INCLUDE (order_date, amount);

Composite Index — Leftmost Prefix Rule

SQL

1CREATE INDEX idx_a_b_c ON table(a, b, c);
2
3-- This index supports:
4WHERE a = ?                      -- ✅ Uses index
5WHERE a = ? AND b = ?            -- ✅ Uses index  
6WHERE a = ? AND b = ? AND c = ?  -- ✅ Uses index
7WHERE b = ? AND c = ?            -- ❌ Cannot use index (missing 'a')
8WHERE a = ? AND c = ?            -- ⚠️ Partial use (only 'a')

When to Create Index

Nên tạo Index ✅	Không nên ❌
Primary keys	Small tables (dưới 1000 rows)
Foreign keys	Low cardinality columns
WHERE/JOIN columns	Frequently updated columns
ORDER BY columns	Every column
High-cardinality columns

Checkpoint

EXPLAIN ANALYZE cho biết execution plan thực tế. Index Scan >> Seq Scan cho large tables. Composite index tuân theo leftmost prefix rule. Covering index cho Index Only Scan. Bạn có biết tại sao column order trong composite index lại quan trọng không?

Task 6

⚡ Query Optimization — Kỹ thuật tối ưu

TB5 min

Sargable Queries

SARG = Search ARGument. Sargable queries can use indexes. Non-sargable queries force full table scans.

SQL

1-- ❌ Non-sargable (can't use index)
2SELECT * FROM orders WHERE YEAR(order_date) = 2024;
3SELECT * FROM users WHERE UPPER(email) = 'TEST@EMAIL.COM';
4SELECT * FROM products WHERE price + 10 > 100;
5
6-- ✅ Sargable (can use index)
7SELECT * FROM orders 
8WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
9SELECT * FROM users WHERE email = 'test@email.com';
10SELECT * FROM products WHERE price > 90;

EXISTS vs IN

SQL

1-- For large subqueries, EXISTS often faster
2-- ❌ IN with large result set
3SELECT * FROM orders 
4WHERE customer_id IN (SELECT id FROM customers WHERE region = 'North');
5
6-- ✅ EXISTS (stops at first match)
7SELECT * FROM orders o
8WHERE EXISTS (
9    SELECT 1 FROM customers c 
10    WHERE c.id = o.customer_id AND c.region = 'North'
11);

JOIN Optimization

SQL

1-- Filter early (before join)
2SELECT o.*, c.name
3FROM orders o
4JOIN customers c ON o.customer_id = c.id
5WHERE o.order_date >= '2024-01-01'  -- Applied before join
6AND c.status = 'active';
7
8-- Avoid unnecessary joins
9-- ❌ If you don't need customer data
10SELECT o.* FROM orders o JOIN customers c ON o.customer_id = c.id;
11-- ✅ Skip the join
12SELECT * FROM orders;

Efficient Pagination

SQL

1-- ❌ OFFSET becomes slow for large offsets
2SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;
3
4-- ✅ Keyset pagination (use last seen value)
5SELECT * FROM orders 
6WHERE id > 12345  -- Last ID from previous page
7ORDER BY id LIMIT 20;

Query Rewriting

SQL

1-- Subquery → JOIN
2-- ❌ Correlated subquery (runs for each row)
3SELECT o.*,
4    (SELECT name FROM customers WHERE id = o.customer_id) as customer_name
5FROM orders o;
6
7-- ✅ JOIN (single operation)
8SELECT o.*, c.name as customer_name
9FROM orders o
10LEFT JOIN customers c ON o.customer_id = c.id;
11
12-- UNION ALL vs UNION
13-- UNION removes duplicates (requires sort) — use UNION ALL when possible

Partitioning

SQL

1-- Range partitioning (PostgreSQL)
2CREATE TABLE orders (
3    order_id SERIAL,
4    order_date DATE,
5    amount DECIMAL
6) PARTITION BY RANGE (order_date);
7
8CREATE TABLE orders_2023 PARTITION OF orders
9    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
10CREATE TABLE orders_2024 PARTITION OF orders
11    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
12
13-- Queries automatically use only relevant partition
14SELECT * FROM orders WHERE order_date = '2024-06-15';
15-- Only scans orders_2024

Common Anti-Patterns

SQL

1-- N+1 Query Problem
2-- ❌ Loop with individual queries
3-- FOR each customer: SELECT * FROM orders WHERE customer_id = ?;
4-- ✅ Single query with JOIN
5SELECT c.*, o.* FROM customers c LEFT JOIN orders o ON c.id = o.customer_id;
6
7-- Implicit Type Conversion
8-- ❌ String to number (can't use index)
9SELECT * FROM orders WHERE order_id = '12345';  -- order_id is INT
10-- ✅ Correct type
11SELECT * FROM orders WHERE order_id = 12345;

Optimization Checklist

Technique	Impact	When to Use
Indexing	High	Frequent WHERE/JOIN columns
EXPLAIN	Essential	Always before optimization
Sargable queries	High	All queries with functions
Avoid SELECT *	Medium	Always
JOIN optimization	High	Multi-table queries
Partitioning	High	Very large tables (>100M rows)

Checkpoint

Sargable queries tận dụng được index. EXISTS thường nhanh hơn IN cho large subqueries. Keyset pagination hiệu quả hơn OFFSET. Tránh N+1 problem, implicit type conversion. Optimization checklist: EXPLAIN → Check scan types → Verify index usage → Make sargable → Optimize JOINs. Bạn sẽ bắt đầu từ bước nào?

Task 7

📋 Tổng kết

TB5 min

Kiến thức đã học

Chủ đề	Nội dung chính
Ranking Functions	ROW_NUMBER, RANK, DENSE_RANK, NTILE, PERCENT_RANK
Value Functions	LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTH_VALUE
Aggregate Windows	Running totals, Moving averages, Frame clauses, Named windows
Applications	YoY analysis, Session detection, Gaps & Islands, Percentiles
Basic CTEs	Multi-CTE organization, CTE vs subquery, Materialization
Recursive CTEs	Hierarchy traversal, Tree/BOM, Date series, Graph paths
EXPLAIN	Execution plans, Scan types, Join types, Cost analysis
Indexing	Single/Composite/Partial/Covering indexes, Leftmost prefix rule
Optimization	Sargable queries, EXISTS vs IN, Pagination, Partitioning

Best Practices

✅ PARTITION BY = GROUP BY without collapsing rows
✅ Always specify frame clause for aggregate window functions
✅ Use Named Windows khi dùng cùng window nhiều lần
✅ Recursive CTEs cần base case + termination condition
✅ Run EXPLAIN ANALYZE trước khi optimize
✅ Make queries sargable — không wrap indexed columns trong functions

Câu hỏi tự kiểm tra

ROW_NUMBER, RANK, DENSE_RANK khác nhau thế nào?
CTE và subquery: khi nào dùng cái nào?
Recursive CTE dùng để giải quyết bài toán gì?
EXPLAIN ANALYZE cho biết thông tin gì?

Bài tiếp theo: SQL + Python Integration →

🎉 Tuyệt vời! Bạn đã master Advanced SQL!

Nhớ: Window Functions + CTEs + Indexes = SQL Pro. Luôn chạy EXPLAIN trước khi deploy query lên production!

Task 8