Query Optimization

SQL Query Performance Optimization

1. Understanding Query Performance

Tại sao Optimization quan trọng?

Query chậm = User experience kém, server costs cao, bottlenecks trong pipelines. Query optimization là kỹ năng essential cho mọi Data professional.

1.1 Query Execution Flow

Text

1┌─────────────────────────────────────────────────────────┐
2│                Query Execution Pipeline                  │
3├─────────────────────────────────────────────────────────┤
4│                                                          │
5│   SQL Query                                              │
6│      │                                                   │
7│      v                                                   │
8│   ┌────────────────┐                                    │
9│   │    Parser      │  Check syntax                      │
10│   └───────┬────────┘                                    │
11│           v                                              │
12│   ┌────────────────┐                                    │
13│   │   Optimizer    │  Generate execution plans          │
14│   │                │  Choose best plan (cost-based)     │
15│   └───────┬────────┘                                    │
16│           v                                              │
17│   ┌────────────────┐                                    │
18│   │   Executor     │  Execute chosen plan               │
19│   └───────┬────────┘                                    │
20│           v                                              │
21│   Results                                                │
22│                                                          │
23└─────────────────────────────────────────────────────────┘

1.2 Common Performance Issues

Issue	Symptom	Common Cause
Full Table Scan	Slow on large tables	Missing index
Cartesian Product	Exponential row growth	Missing JOIN condition
N+1 Problem	Many small queries	Application logic issue
Lock Contention	Timeouts	Concurrent writes
Memory Issues	Out of memory	Large sorts/joins

2. EXPLAIN Command

2.1 Basic EXPLAIN

SQL

1-- PostgreSQL
2EXPLAIN SELECT * FROM orders WHERE customer_id = 100;
3
4-- With actual runtime stats
5EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;
6
7-- Verbose output
8EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
9SELECT * FROM orders WHERE customer_id = 100;
10
11-- MySQL
12EXPLAIN SELECT * FROM orders WHERE customer_id = 100;
13EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 100;

2.2 Reading EXPLAIN Output

SQL

1-- Example output (PostgreSQL)
2EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;
3
4/*
5Seq Scan on orders  (cost=0.00..1520.00 rows=50 width=120) (actual time=0.015..12.450 rows=48 loops=1)
6  Filter: (customer_id = 100)
7  Rows Removed by Filter: 49952
8Planning Time: 0.085 ms
9Execution Time: 12.480 ms
10*/
11
12-- Key metrics:
13-- cost: Estimated cost (startup..total)
14-- rows: Estimated rows returned
15-- actual time: Real execution time (ms)
16-- loops: Times this node was executed
17-- Rows Removed by Filter: Filtered out rows (wasted work!)

2.3 Scan Types

SQL

1-- Sequential Scan: Reads entire table
2-- Worst case for large tables
3Seq Scan on orders
4
5-- Index Scan: Uses index to find rows
6-- Much faster for selective queries
7Index Scan using idx_customer_id on orders
8
9-- Index Only Scan: All data from index
10-- Fastest - no table access needed
11Index Only Scan using idx_customer_date on orders
12
13-- Bitmap Scan: For multiple conditions
14Bitmap Heap Scan on orders
15  -> Bitmap Index Scan on idx_customer_id
16  -> Bitmap Index Scan on idx_status

2.4 Join Types in EXPLAIN

SQL

1-- Nested Loop: Good for small datasets
2Nested Loop
3  -> Seq Scan on orders
4  -> Index Scan on customers
5
6-- Hash Join: Good for larger datasets
7Hash Join
8  -> Seq Scan on orders
9  -> Hash
10       -> Seq Scan on customers
11
12-- Merge Join: Good for sorted data
13Merge Join
14  -> Sort
15       -> Seq Scan on orders
16  -> Sort
17       -> Seq Scan on customers

3. Indexing Strategies

3.1 Index Basics

SQL

1-- Single column index
2CREATE INDEX idx_customer_id ON orders(customer_id);
3
4-- Composite index (column order matters!)
5CREATE INDEX idx_customer_date ON orders(customer_id, order_date);
6
7-- Unique index
8CREATE UNIQUE INDEX idx_email ON users(email);
9
10-- Partial index (PostgreSQL)
11CREATE INDEX idx_active_orders ON orders(order_date)
12WHERE status = 'active';
13
14-- Expression index
15CREATE INDEX idx_lower_email ON users(LOWER(email));

3.2 Index Selection Guidelines

Text

1┌─────────────────────────────────────────────────────────┐
2│                 When to Create Index                     │
3├─────────────────────────────────────────────────────────┤
4│                                                          │
5│   ✅ DO index:                                          │
6│   • Primary keys (automatic)                            │
7│   • Foreign keys                                         │
8│   • Columns in WHERE clauses                            │
9│   • Columns in JOIN conditions                          │
10│   • Columns in ORDER BY (if large result sets)          │
11│   • High-cardinality columns (many unique values)       │
12│                                                          │
13│   ❌ DON'T index:                                       │
14│   • Small tables (< 1000 rows)                          │
15│   • Columns with low cardinality (e.g., boolean)        │
16│   • Frequently updated columns                           │
17│   • Wide columns (long text)                            │
18│   • Every column (index maintenance overhead)           │
19│                                                          │
20└─────────────────────────────────────────────────────────┘

3.3 Composite Index Order

SQL

1-- The leftmost prefix rule
2CREATE INDEX idx_a_b_c ON table(a, b, c);
3
4-- This index supports:
5WHERE a = ?                    ✅ Uses index
6WHERE a = ? AND b = ?          ✅ Uses index  
7WHERE a = ? AND b = ? AND c = ? ✅ Uses index
8WHERE b = ? AND c = ?          ❌ Cannot use index (missing 'a')
9WHERE a = ? AND c = ?          ⚠️ Partial use (only 'a')
10
11-- Best practice: Most selective column first
12-- If customer_id is more selective than status:
13CREATE INDEX idx_orders ON orders(customer_id, status);

3.4 Covering Index

SQL

1-- Include all columns needed by query
2CREATE INDEX idx_covering ON orders(customer_id)
3INCLUDE (order_date, amount);
4
5-- Query uses Index Only Scan (no table access)
6SELECT order_date, amount 
7FROM orders 
8WHERE customer_id = 100;

4. Query Optimization Techniques

4.1 Sargable Queries

Sargable Queries

SARG = Search ARGument. Sargable queries can use indexes. Non-sargable queries force full table scans.

SQL

1-- ❌ Non-sargable (can't use index)
2SELECT * FROM orders WHERE YEAR(order_date) = 2024;
3SELECT * FROM users WHERE UPPER(email) = 'TEST@EMAIL.COM';
4SELECT * FROM products WHERE price + 10 > 100;
5
6-- ✅ Sargable (can use index)
7SELECT * FROM orders 
8WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';
9
10SELECT * FROM users WHERE email = 'test@email.com';
11-- Or create expression index: CREATE INDEX idx_lower_email ON users(LOWER(email));
12
13SELECT * FROM products WHERE price > 90;

4.2 Avoid SELECT *

SQL

1-- ❌ Bad: Returns all columns
2SELECT * FROM orders WHERE customer_id = 100;
3
4-- ✅ Good: Only needed columns
5SELECT order_id, order_date, amount 
6FROM orders 
7WHERE customer_id = 100;
8
9-- Benefits:
10-- Less data transfer
11-- Can use covering index
12-- Less memory usage

4.3 EXISTS vs IN

SQL

1-- For large subqueries, EXISTS often faster
2-- ❌ IN with large result set
3SELECT * FROM orders 
4WHERE customer_id IN (SELECT id FROM customers WHERE region = 'North');
5
6-- ✅ EXISTS (stops at first match)
7SELECT * FROM orders o
8WHERE EXISTS (
9    SELECT 1 FROM customers c 
10    WHERE c.id = o.customer_id AND c.region = 'North'
11);
12
13-- For small static lists, IN is fine
14SELECT * FROM orders WHERE status IN ('pending', 'processing');

4.4 JOIN Optimization

SQL

1-- Join on indexed columns
2SELECT o.*, c.name
3FROM orders o
4JOIN customers c ON o.customer_id = c.id;  -- Both should be indexed
5
6-- Filter early (before join)
7SELECT o.*, c.name
8FROM orders o
9JOIN customers c ON o.customer_id = c.id
10WHERE o.order_date >= '2024-01-01'  -- Applied before join
11AND c.status = 'active';
12
13-- Avoid unnecessary joins
14-- ❌ If you don't need customer data
15SELECT o.* 
16FROM orders o
17JOIN customers c ON o.customer_id = c.id;
18
19-- ✅ Skip the join
20SELECT * FROM orders;

4.5 LIMIT and Pagination

SQL

1-- Use LIMIT to reduce result set
2SELECT * FROM orders 
3ORDER BY order_date DESC 
4LIMIT 100;
5
6-- Efficient pagination with keyset (cursor)
7-- ❌ OFFSET becomes slow for large offsets
8SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;
9
10-- ✅ Keyset pagination (use last seen value)
11SELECT * FROM orders 
12WHERE id > 12345  -- Last ID from previous page
13ORDER BY id 
14LIMIT 20;

5. Query Rewriting

5.1 Subquery to JOIN

SQL

1-- ❌ Correlated subquery (runs for each row)
2SELECT 
3    o.*,
4    (SELECT name FROM customers WHERE id = o.customer_id) as customer_name
5FROM orders o;
6
7-- ✅ JOIN (single operation)
8SELECT o.*, c.name as customer_name
9FROM orders o
10LEFT JOIN customers c ON o.customer_id = c.id;

5.2 Aggregate Optimization

SQL

1-- Count optimization
2-- ❌ COUNT(*)  
3SELECT COUNT(*) FROM orders WHERE status = 'completed';
4
5-- ✅ COUNT(indexed_column) can sometimes use index
6SELECT COUNT(order_id) FROM orders WHERE status = 'completed';
7
8-- For existence check
9-- ❌ Count all
10IF (SELECT COUNT(*) FROM orders WHERE customer_id = 100) > 0
11
12-- ✅ EXISTS (stops at first row)
13IF EXISTS (SELECT 1 FROM orders WHERE customer_id = 100)

5.3 UNION vs UNION ALL

SQL

1-- UNION: Removes duplicates (requires sort)
2SELECT customer_id FROM orders_2023
3UNION
4SELECT customer_id FROM orders_2024;
5
6-- UNION ALL: Keeps duplicates (faster)
7SELECT customer_id FROM orders_2023
8UNION ALL
9SELECT customer_id FROM orders_2024;
10
11-- Use UNION ALL when duplicates are OK or impossible

6. Table Design Impact

6.1 Normalization Trade-offs

SQL

1-- Normalized: Multiple tables, requires joins
2SELECT 
3    o.order_id,
4    c.name as customer_name,
5    p.name as product_name
6FROM orders o
7JOIN customers c ON o.customer_id = c.id
8JOIN products p ON o.product_id = p.id;
9
10-- Denormalized: Redundant data, no joins
11SELECT order_id, customer_name, product_name
12FROM orders_denormalized;
13
14-- Consider denormalization for:
15-- - Read-heavy workloads
16-- - Reporting tables
17-- - Data warehouses

6.2 Partitioning

SQL

1-- Range partitioning (PostgreSQL)
2CREATE TABLE orders (
3    order_id SERIAL,
4    order_date DATE,
5    amount DECIMAL
6) PARTITION BY RANGE (order_date);
7
8CREATE TABLE orders_2023 PARTITION OF orders
9    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
10
11CREATE TABLE orders_2024 PARTITION OF orders
12    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
13
14-- Query automatically uses only relevant partition
15SELECT * FROM orders WHERE order_date = '2024-06-15';
16-- Only scans orders_2024

7. Monitoring & Analysis

7.1 Slow Query Log

SQL

1-- PostgreSQL: Enable slow query logging
2-- In postgresql.conf:
3-- log_min_duration_statement = 1000  -- Log queries > 1 second
4
5-- MySQL: Enable slow query log
6SET GLOBAL slow_query_log = 'ON';
7SET GLOBAL long_query_time = 1;
8
9-- View slow queries
10SELECT * FROM pg_stat_statements
11ORDER BY total_time DESC
12LIMIT 10;

7.2 Index Usage Statistics

SQL

1-- PostgreSQL: Check index usage
2SELECT 
3    schemaname,
4    tablename,
5    indexname,
6    idx_scan as times_used,
7    idx_tup_read as tuples_read,
8    idx_tup_fetch as tuples_fetched
9FROM pg_stat_user_indexes
10ORDER BY idx_scan DESC;
11
12-- Find unused indexes
13SELECT 
14    schemaname,
15    tablename,
16    indexname
17FROM pg_stat_user_indexes
18WHERE idx_scan = 0
19AND schemaname = 'public';

7.3 Table Statistics

SQL

1-- Update statistics (helps optimizer)
2ANALYZE orders;
3
4-- PostgreSQL: View table statistics
5SELECT 
6    schemaname,
7    relname as table_name,
8    n_live_tup as live_rows,
9    n_dead_tup as dead_rows,
10    last_vacuum,
11    last_autovacuum
12FROM pg_stat_user_tables
13ORDER BY n_dead_tup DESC;
14
15-- High dead_rows = needs VACUUM
16VACUUM ANALYZE orders;

8. Common Anti-Patterns

8.1 N+1 Query Problem

SQL

1-- ❌ Anti-pattern: Loop with individual queries
2FOR each customer IN customers:
3    SELECT * FROM orders WHERE customer_id = customer.id;
4
5-- ✅ Single query with JOIN or IN
6SELECT c.*, o.*
7FROM customers c
8LEFT JOIN orders o ON c.id = o.customer_id
9WHERE c.region = 'North';

8.2 Implicit Type Conversion

SQL

1-- ❌ String to number conversion (can't use index)
2SELECT * FROM orders WHERE order_id = '12345';  -- order_id is INT
3
4-- ✅ Correct type
5SELECT * FROM orders WHERE order_id = 12345;

8.3 OR Conditions

SQL

1-- ❌ OR can prevent index usage
2SELECT * FROM orders 
3WHERE customer_id = 100 OR product_id = 200;
4
5-- ✅ UNION for better index usage
6SELECT * FROM orders WHERE customer_id = 100
7UNION
8SELECT * FROM orders WHERE product_id = 200;
9
10-- Or create composite index
11CREATE INDEX idx_customer_product ON orders(customer_id, product_id);

9. Thực hành

Optimization Exercise

Exercise: Optimize Slow Queries

SQL

1-- Given table:
2-- orders(order_id, customer_id, product_id, order_date, amount, status)
3-- 10 million rows
4
5-- Optimize these queries:
6
7-- Query 1: Slow aggregate
8SELECT 
9    DATE_TRUNC('month', order_date) as month,
10    COUNT(*) as order_count,
11    SUM(amount) as total_amount
12FROM orders
13WHERE YEAR(order_date) = 2024
14GROUP BY DATE_TRUNC('month', order_date);
15
16-- Query 2: Slow join
17SELECT o.*, c.name
18FROM orders o
19LEFT JOIN customers c ON o.customer_id = c.id
20WHERE o.status = 'pending';
21
22-- Query 3: Slow subquery
23SELECT *
24FROM orders
25WHERE customer_id IN (
26    SELECT customer_id
27    FROM orders
28    GROUP BY customer_id
29    HAVING SUM(amount) > 10000
30);
31
32-- YOUR OPTIMIZED QUERIES HERE

💡 Xem đáp án

SQL

1-- Query 1: Fix non-sargable WHERE, add index
2-- Create index:
3CREATE INDEX idx_orders_date ON orders(order_date);
4
5-- Optimized query (sargable):
6SELECT 
7    DATE_TRUNC('month', order_date) as month,
8    COUNT(*) as order_count,
9    SUM(amount) as total_amount
10FROM orders
11WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'
12GROUP BY DATE_TRUNC('month', order_date);
13
14-- Query 2: Add indexes
15CREATE INDEX idx_orders_status ON orders(status);
16CREATE INDEX idx_orders_customer ON orders(customer_id);
17
18-- Or covering index for status + customer_id:
19CREATE INDEX idx_orders_status_customer ON orders(status, customer_id);
20
21-- Query structure is OK, just needed indexes
22
23-- Query 3: Use CTE or temp table
24-- Option A: CTE
25WITH high_value_customers AS (
26    SELECT customer_id
27    FROM orders
28    GROUP BY customer_id
29    HAVING SUM(amount) > 10000
30)
31SELECT o.*
32FROM orders o
33JOIN high_value_customers hvc ON o.customer_id = hvc.customer_id;
34
35-- Option B: EXISTS (often faster)
36SELECT o.*
37FROM orders o
38WHERE EXISTS (
39    SELECT 1
40    FROM orders o2
41    WHERE o2.customer_id = o.customer_id
42    GROUP BY o2.customer_id
43    HAVING SUM(o2.amount) > 10000
44);
45
46-- EXPLAIN output comparison
47EXPLAIN ANALYZE 
48SELECT /* original query */;
49
50EXPLAIN ANALYZE
51SELECT /* optimized query */;

10. Tổng kết

Technique	Impact	When to Use
Indexing	High	Frequent WHERE/JOIN columns
EXPLAIN	Essential	Always before optimization
Sargable queries	High	All queries with functions
Avoid SELECT *	Medium	Always
JOIN optimization	High	Multi-table queries
Partitioning	High	Very large tables

Optimization Checklist:

✅ Run EXPLAIN ANALYZE first
✅ Check for full table scans
✅ Verify index usage
✅ Make queries sargable
✅ Optimize JOIN order
✅ Use appropriate data types

Bài tiếp theo: SQL + Python Integration