Lý thuyết
Bài 7/17

Query Optimization

EXPLAIN, Indexes, và Performance Tuning cho SQL queries

Query Optimization

SQL Query Performance Optimization

1. Understanding Query Performance

Tại sao Optimization quan trọng?

Query chậm = User experience kém, server costs cao, bottlenecks trong pipelines. Query optimization là kỹ năng essential cho mọi Data professional.

1.1 Query Execution Flow

Text
1┌─────────────────────────────────────────────────────────┐
2│ Query Execution Pipeline │
3├─────────────────────────────────────────────────────────┤
4│ │
5│ SQL Query │
6│ │ │
7│ v │
8│ ┌────────────────┐ │
9│ │ Parser │ Check syntax │
10│ └───────┬────────┘ │
11│ v │
12│ ┌────────────────┐ │
13│ │ Optimizer │ Generate execution plans │
14│ │ │ Choose best plan (cost-based) │
15│ └───────┬────────┘ │
16│ v │
17│ ┌────────────────┐ │
18│ │ Executor │ Execute chosen plan │
19│ └───────┬────────┘ │
20│ v │
21│ Results │
22│ │
23└─────────────────────────────────────────────────────────┘

1.2 Common Performance Issues

IssueSymptomCommon Cause
Full Table ScanSlow on large tablesMissing index
Cartesian ProductExponential row growthMissing JOIN condition
N+1 ProblemMany small queriesApplication logic issue
Lock ContentionTimeoutsConcurrent writes
Memory IssuesOut of memoryLarge sorts/joins

2. EXPLAIN Command

2.1 Basic EXPLAIN

SQL
1-- PostgreSQL
2EXPLAIN SELECT * FROM orders WHERE customer_id = 100;
3
4-- With actual runtime stats
5EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;
6
7-- Verbose output
8EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
9SELECT * FROM orders WHERE customer_id = 100;
10
11-- MySQL
12EXPLAIN SELECT * FROM orders WHERE customer_id = 100;
13EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 100;

2.2 Reading EXPLAIN Output

SQL
1-- Example output (PostgreSQL)
2EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;
3
4/*
5Seq Scan on orders (cost=0.00..1520.00 rows=50 width=120) (actual time=0.015..12.450 rows=48 loops=1)
6 Filter: (customer_id = 100)
7 Rows Removed by Filter: 49952
8Planning Time: 0.085 ms
9Execution Time: 12.480 ms
10*/
11
12-- Key metrics:
13-- cost: Estimated cost (startup..total)
14-- rows: Estimated rows returned
15-- actual time: Real execution time (ms)
16-- loops: Times this node was executed
17-- Rows Removed by Filter: Filtered out rows (wasted work!)

2.3 Scan Types

SQL
1-- Sequential Scan: Reads entire table
2-- Worst case for large tables
3Seq Scan on orders
4
5-- Index Scan: Uses index to find rows
6-- Much faster for selective queries
7Index Scan using idx_customer_id on orders
8
9-- Index Only Scan: All data from index
10-- Fastest - no table access needed
11Index Only Scan using idx_customer_date on orders
12
13-- Bitmap Scan: For multiple conditions
14Bitmap Heap Scan on orders
15 -> Bitmap Index Scan on idx_customer_id
16 -> Bitmap Index Scan on idx_status

2.4 Join Types in EXPLAIN

SQL
1-- Nested Loop: Good for small datasets
2Nested Loop
3 -> Seq Scan on orders
4 -> Index Scan on customers
5
6-- Hash Join: Good for larger datasets
7Hash Join
8 -> Seq Scan on orders
9 -> Hash
10 -> Seq Scan on customers
11
12-- Merge Join: Good for sorted data
13Merge Join
14 -> Sort
15 -> Seq Scan on orders
16 -> Sort
17 -> Seq Scan on customers

3. Indexing Strategies

3.1 Index Basics

SQL
1-- Single column index
2CREATE INDEX idx_customer_id ON orders(customer_id);
3
4-- Composite index (column order matters!)
5CREATE INDEX idx_customer_date ON orders(customer_id, order_date);
6
7-- Unique index
8CREATE UNIQUE INDEX idx_email ON users(email);
9
10-- Partial index (PostgreSQL)
11CREATE INDEX idx_active_orders ON orders(order_date)
12WHERE status = 'active';
13
14-- Expression index
15CREATE INDEX idx_lower_email ON users(LOWER(email));

3.2 Index Selection Guidelines

Text
1┌─────────────────────────────────────────────────────────┐
2│ When to Create Index │
3├─────────────────────────────────────────────────────────┤
4│ │
5│ ✅ DO index: │
6│ • Primary keys (automatic) │
7│ • Foreign keys │
8│ • Columns in WHERE clauses │
9│ • Columns in JOIN conditions │
10│ • Columns in ORDER BY (if large result sets) │
11│ • High-cardinality columns (many unique values) │
12│ │
13│ ❌ DON'T index: │
14│ • Small tables (< 1000 rows) │
15│ • Columns with low cardinality (e.g., boolean) │
16│ • Frequently updated columns │
17│ • Wide columns (long text) │
18│ • Every column (index maintenance overhead) │
19│ │
20└─────────────────────────────────────────────────────────┘

3.3 Composite Index Order

SQL
1-- The leftmost prefix rule
2CREATE INDEX idx_a_b_c ON table(a, b, c);
3
4-- This index supports:
5WHERE a = ? Uses index
6WHERE a = ? AND b = ? Uses index
7WHERE a = ? AND b = ? AND c = ? Uses index
8WHERE b = ? AND c = ? Cannot use index (missing 'a')
9WHERE a = ? AND c = ? Partial use (only 'a')
10
11-- Best practice: Most selective column first
12-- If customer_id is more selective than status:
13CREATE INDEX idx_orders ON orders(customer_id, status);

3.4 Covering Index

SQL
1-- Include all columns needed by query
2CREATE INDEX idx_covering ON orders(customer_id)
3INCLUDE (order_date, amount);
4
5-- Query uses Index Only Scan (no table access)
6SELECT order_date, amount
7FROM orders
8WHERE customer_id = 100;

4. Query Optimization Techniques

4.1 Sargable Queries

Sargable Queries

SARG = Search ARGument. Sargable queries can use indexes. Non-sargable queries force full table scans.

SQL
1-- ❌ Non-sargable (can't use index)
2SELECT * FROM orders WHERE YEAR(order_date) = 2024;
3SELECT * FROM users WHERE UPPER(email) = 'TEST@EMAIL.COM';
4SELECT * FROM products WHERE price + 10 > 100;
5
6-- ✅ Sargable (can use index)
7SELECT * FROM orders
8WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
9
10SELECT * FROM users WHERE email = 'test@email.com';
11-- Or create expression index: CREATE INDEX idx_lower_email ON users(LOWER(email));
12
13SELECT * FROM products WHERE price > 90;

4.2 Avoid SELECT *

SQL
1-- ❌ Bad: Returns all columns
2SELECT * FROM orders WHERE customer_id = 100;
3
4-- ✅ Good: Only needed columns
5SELECT order_id, order_date, amount
6FROM orders
7WHERE customer_id = 100;
8
9-- Benefits:
10-- Less data transfer
11-- Can use covering index
12-- Less memory usage

4.3 EXISTS vs IN

SQL
1-- For large subqueries, EXISTS often faster
2-- ❌ IN with large result set
3SELECT * FROM orders
4WHERE customer_id IN (SELECT id FROM customers WHERE region = 'North');
5
6-- ✅ EXISTS (stops at first match)
7SELECT * FROM orders o
8WHERE EXISTS (
9 SELECT 1 FROM customers c
10 WHERE c.id = o.customer_id AND c.region = 'North'
11);
12
13-- For small static lists, IN is fine
14SELECT * FROM orders WHERE status IN ('pending', 'processing');

4.4 JOIN Optimization

SQL
1-- Join on indexed columns
2SELECT o.*, c.name
3FROM orders o
4JOIN customers c ON o.customer_id = c.id; -- Both should be indexed
5
6-- Filter early (before join)
7SELECT o.*, c.name
8FROM orders o
9JOIN customers c ON o.customer_id = c.id
10WHERE o.order_date >= '2024-01-01' -- Applied before join
11AND c.status = 'active';
12
13-- Avoid unnecessary joins
14-- ❌ If you don't need customer data
15SELECT o.*
16FROM orders o
17JOIN customers c ON o.customer_id = c.id;
18
19-- ✅ Skip the join
20SELECT * FROM orders;

4.5 LIMIT and Pagination

SQL
1-- Use LIMIT to reduce result set
2SELECT * FROM orders
3ORDER BY order_date DESC
4LIMIT 100;
5
6-- Efficient pagination with keyset (cursor)
7-- ❌ OFFSET becomes slow for large offsets
8SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;
9
10-- ✅ Keyset pagination (use last seen value)
11SELECT * FROM orders
12WHERE id > 12345 -- Last ID from previous page
13ORDER BY id
14LIMIT 20;

5. Query Rewriting

5.1 Subquery to JOIN

SQL
1-- ❌ Correlated subquery (runs for each row)
2SELECT
3 o.*,
4 (SELECT name FROM customers WHERE id = o.customer_id) as customer_name
5FROM orders o;
6
7-- ✅ JOIN (single operation)
8SELECT o.*, c.name as customer_name
9FROM orders o
10LEFT JOIN customers c ON o.customer_id = c.id;

5.2 Aggregate Optimization

SQL
1-- Count optimization
2-- ❌ COUNT(*)
3SELECT COUNT(*) FROM orders WHERE status = 'completed';
4
5-- ✅ COUNT(indexed_column) can sometimes use index
6SELECT COUNT(order_id) FROM orders WHERE status = 'completed';
7
8-- For existence check
9-- ❌ Count all
10IF (SELECT COUNT(*) FROM orders WHERE customer_id = 100) > 0
11
12-- ✅ EXISTS (stops at first row)
13IF EXISTS (SELECT 1 FROM orders WHERE customer_id = 100)

5.3 UNION vs UNION ALL

SQL
1-- UNION: Removes duplicates (requires sort)
2SELECT customer_id FROM orders_2023
3UNION
4SELECT customer_id FROM orders_2024;
5
6-- UNION ALL: Keeps duplicates (faster)
7SELECT customer_id FROM orders_2023
8UNION ALL
9SELECT customer_id FROM orders_2024;
10
11-- Use UNION ALL when duplicates are OK or impossible

6. Table Design Impact

6.1 Normalization Trade-offs

SQL
1-- Normalized: Multiple tables, requires joins
2SELECT
3 o.order_id,
4 c.name as customer_name,
5 p.name as product_name
6FROM orders o
7JOIN customers c ON o.customer_id = c.id
8JOIN products p ON o.product_id = p.id;
9
10-- Denormalized: Redundant data, no joins
11SELECT order_id, customer_name, product_name
12FROM orders_denormalized;
13
14-- Consider denormalization for:
15-- - Read-heavy workloads
16-- - Reporting tables
17-- - Data warehouses

6.2 Partitioning

SQL
1-- Range partitioning (PostgreSQL)
2CREATE TABLE orders (
3 order_id SERIAL,
4 order_date DATE,
5 amount DECIMAL
6) PARTITION BY RANGE (order_date);
7
8CREATE TABLE orders_2023 PARTITION OF orders
9 FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
10
11CREATE TABLE orders_2024 PARTITION OF orders
12 FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
13
14-- Query automatically uses only relevant partition
15SELECT * FROM orders WHERE order_date = '2024-06-15';
16-- Only scans orders_2024

7. Monitoring & Analysis

7.1 Slow Query Log

SQL
1-- PostgreSQL: Enable slow query logging
2-- In postgresql.conf:
3-- log_min_duration_statement = 1000 -- Log queries > 1 second
4
5-- MySQL: Enable slow query log
6SET GLOBAL slow_query_log = 'ON';
7SET GLOBAL long_query_time = 1;
8
9-- View slow queries
10SELECT * FROM pg_stat_statements
11ORDER BY total_time DESC
12LIMIT 10;

7.2 Index Usage Statistics

SQL
1-- PostgreSQL: Check index usage
2SELECT
3 schemaname,
4 tablename,
5 indexname,
6 idx_scan as times_used,
7 idx_tup_read as tuples_read,
8 idx_tup_fetch as tuples_fetched
9FROM pg_stat_user_indexes
10ORDER BY idx_scan DESC;
11
12-- Find unused indexes
13SELECT
14 schemaname,
15 tablename,
16 indexname
17FROM pg_stat_user_indexes
18WHERE idx_scan = 0
19AND schemaname = 'public';

7.3 Table Statistics

SQL
1-- Update statistics (helps optimizer)
2ANALYZE orders;
3
4-- PostgreSQL: View table statistics
5SELECT
6 schemaname,
7 relname as table_name,
8 n_live_tup as live_rows,
9 n_dead_tup as dead_rows,
10 last_vacuum,
11 last_autovacuum
12FROM pg_stat_user_tables
13ORDER BY n_dead_tup DESC;
14
15-- High dead_rows = needs VACUUM
16VACUUM ANALYZE orders;

8. Common Anti-Patterns

8.1 N+1 Query Problem

SQL
1-- ❌ Anti-pattern: Loop with individual queries
2FOR each customer IN customers:
3 SELECT * FROM orders WHERE customer_id = customer.id;
4
5-- ✅ Single query with JOIN or IN
6SELECT c.*, o.*
7FROM customers c
8LEFT JOIN orders o ON c.id = o.customer_id
9WHERE c.region = 'North';

8.2 Implicit Type Conversion

SQL
1-- ❌ String to number conversion (can't use index)
2SELECT * FROM orders WHERE order_id = '12345'; -- order_id is INT
3
4-- ✅ Correct type
5SELECT * FROM orders WHERE order_id = 12345;

8.3 OR Conditions

SQL
1-- ❌ OR can prevent index usage
2SELECT * FROM orders
3WHERE customer_id = 100 OR product_id = 200;
4
5-- ✅ UNION for better index usage
6SELECT * FROM orders WHERE customer_id = 100
7UNION
8SELECT * FROM orders WHERE product_id = 200;
9
10-- Or create composite index
11CREATE INDEX idx_customer_product ON orders(customer_id, product_id);

9. Thực hành

Optimization Exercise

Exercise: Optimize Slow Queries

SQL
1-- Given table:
2-- orders(order_id, customer_id, product_id, order_date, amount, status)
3-- 10 million rows
4
5-- Optimize these queries:
6
7-- Query 1: Slow aggregate
8SELECT
9 DATE_TRUNC('month', order_date) as month,
10 COUNT(*) as order_count,
11 SUM(amount) as total_amount
12FROM orders
13WHERE YEAR(order_date) = 2024
14GROUP BY DATE_TRUNC('month', order_date);
15
16-- Query 2: Slow join
17SELECT o.*, c.name
18FROM orders o
19LEFT JOIN customers c ON o.customer_id = c.id
20WHERE o.status = 'pending';
21
22-- Query 3: Slow subquery
23SELECT *
24FROM orders
25WHERE customer_id IN (
26 SELECT customer_id
27 FROM orders
28 GROUP BY customer_id
29 HAVING SUM(amount) > 10000
30);
31
32-- YOUR OPTIMIZED QUERIES HERE
💡 Xem đáp án
SQL
1-- Query 1: Fix non-sargable WHERE, add index
2-- Create index:
3CREATE INDEX idx_orders_date ON orders(order_date);
4
5-- Optimized query (sargable):
6SELECT
7 DATE_TRUNC('month', order_date) as month,
8 COUNT(*) as order_count,
9 SUM(amount) as total_amount
10FROM orders
11WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
12GROUP BY DATE_TRUNC('month', order_date);
13
14-- Query 2: Add indexes
15CREATE INDEX idx_orders_status ON orders(status);
16CREATE INDEX idx_orders_customer ON orders(customer_id);
17
18-- Or covering index for status + customer_id:
19CREATE INDEX idx_orders_status_customer ON orders(status, customer_id);
20
21-- Query structure is OK, just needed indexes
22
23-- Query 3: Use CTE or temp table
24-- Option A: CTE
25WITH high_value_customers AS (
26 SELECT customer_id
27 FROM orders
28 GROUP BY customer_id
29 HAVING SUM(amount) > 10000
30)
31SELECT o.*
32FROM orders o
33JOIN high_value_customers hvc ON o.customer_id = hvc.customer_id;
34
35-- Option B: EXISTS (often faster)
36SELECT o.*
37FROM orders o
38WHERE EXISTS (
39 SELECT 1
40 FROM orders o2
41 WHERE o2.customer_id = o.customer_id
42 GROUP BY o2.customer_id
43 HAVING SUM(o2.amount) > 10000
44);
45
46-- EXPLAIN output comparison
47EXPLAIN ANALYZE
48SELECT /* original query */;
49
50EXPLAIN ANALYZE
51SELECT /* optimized query */;

10. Tổng kết

TechniqueImpactWhen to Use
IndexingHighFrequent WHERE/JOIN columns
EXPLAINEssentialAlways before optimization
Sargable queriesHighAll queries with functions
**Avoid SELECT ***MediumAlways
JOIN optimizationHighMulti-table queries
PartitioningHighVery large tables

Optimization Checklist:

  1. ✅ Run EXPLAIN ANALYZE first
  2. ✅ Check for full table scans
  3. ✅ Verify index usage
  4. ✅ Make queries sargable
  5. ✅ Optimize JOIN order
  6. ✅ Use appropriate data types

Bài tiếp theo: SQL + Python Integration