Query Optimization
1. Understanding Query Performance
Tại sao Optimization quan trọng?
Query chậm = User experience kém, server costs cao, bottlenecks trong pipelines. Query optimization là kỹ năng essential cho mọi Data professional.
1.1 Query Execution Flow
Text
1┌─────────────────────────────────────────────────────────┐2│ Query Execution Pipeline │3├─────────────────────────────────────────────────────────┤4│ │5│ SQL Query │6│ │ │7│ v │8│ ┌────────────────┐ │9│ │ Parser │ Check syntax │10│ └───────┬────────┘ │11│ v │12│ ┌────────────────┐ │13│ │ Optimizer │ Generate execution plans │14│ │ │ Choose best plan (cost-based) │15│ └───────┬────────┘ │16│ v │17│ ┌────────────────┐ │18│ │ Executor │ Execute chosen plan │19│ └───────┬────────┘ │20│ v │21│ Results │22│ │23└─────────────────────────────────────────────────────────┘1.2 Common Performance Issues
| Issue | Symptom | Common Cause |
|---|---|---|
| Full Table Scan | Slow on large tables | Missing index |
| Cartesian Product | Exponential row growth | Missing JOIN condition |
| N+1 Problem | Many small queries | Application logic issue |
| Lock Contention | Timeouts | Concurrent writes |
| Memory Issues | Out of memory | Large sorts/joins |
2. EXPLAIN Command
2.1 Basic EXPLAIN
SQL
1-- PostgreSQL2EXPLAIN SELECT * FROM orders WHERE customer_id = 100;34-- With actual runtime stats5EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;67-- Verbose output8EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)9SELECT * FROM orders WHERE customer_id = 100;1011-- MySQL12EXPLAIN SELECT * FROM orders WHERE customer_id = 100;13EXPLAIN FORMAT=JSON SELECT * FROM orders WHERE customer_id = 100;2.2 Reading EXPLAIN Output
SQL
1-- Example output (PostgreSQL)2EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 100;34/*5Seq Scan on orders (cost=0.00..1520.00 rows=50 width=120) (actual time=0.015..12.450 rows=48 loops=1)6 Filter: (customer_id = 100)7 Rows Removed by Filter: 499528Planning Time: 0.085 ms9Execution Time: 12.480 ms10*/1112-- Key metrics:13-- cost: Estimated cost (startup..total)14-- rows: Estimated rows returned15-- actual time: Real execution time (ms)16-- loops: Times this node was executed17-- Rows Removed by Filter: Filtered out rows (wasted work!)2.3 Scan Types
SQL
1-- Sequential Scan: Reads entire table2-- Worst case for large tables3Seq Scan on orders45-- Index Scan: Uses index to find rows6-- Much faster for selective queries7Index Scan using idx_customer_id on orders89-- Index Only Scan: All data from index10-- Fastest - no table access needed11Index Only Scan using idx_customer_date on orders1213-- Bitmap Scan: For multiple conditions14Bitmap Heap Scan on orders15 -> Bitmap Index Scan on idx_customer_id16 -> Bitmap Index Scan on idx_status2.4 Join Types in EXPLAIN
SQL
1-- Nested Loop: Good for small datasets2Nested Loop3 -> Seq Scan on orders4 -> Index Scan on customers56-- Hash Join: Good for larger datasets7Hash Join8 -> Seq Scan on orders9 -> Hash10 -> Seq Scan on customers1112-- Merge Join: Good for sorted data13Merge Join14 -> Sort15 -> Seq Scan on orders16 -> Sort17 -> Seq Scan on customers3. Indexing Strategies
3.1 Index Basics
SQL
1-- Single column index2CREATE INDEX idx_customer_id ON orders(customer_id);34-- Composite index (column order matters!)5CREATE INDEX idx_customer_date ON orders(customer_id, order_date);67-- Unique index8CREATE UNIQUE INDEX idx_email ON users(email);910-- Partial index (PostgreSQL)11CREATE INDEX idx_active_orders ON orders(order_date)12WHERE status = 'active';1314-- Expression index15CREATE INDEX idx_lower_email ON users(LOWER(email));3.2 Index Selection Guidelines
Text
1┌─────────────────────────────────────────────────────────┐2│ When to Create Index │3├─────────────────────────────────────────────────────────┤4│ │5│ ✅ DO index: │6│ • Primary keys (automatic) │7│ • Foreign keys │8│ • Columns in WHERE clauses │9│ • Columns in JOIN conditions │10│ • Columns in ORDER BY (if large result sets) │11│ • High-cardinality columns (many unique values) │12│ │13│ ❌ DON'T index: │14│ • Small tables (< 1000 rows) │15│ • Columns with low cardinality (e.g., boolean) │16│ • Frequently updated columns │17│ • Wide columns (long text) │18│ • Every column (index maintenance overhead) │19│ │20└─────────────────────────────────────────────────────────┘3.3 Composite Index Order
SQL
1-- The leftmost prefix rule2CREATE INDEX idx_a_b_c ON table(a, b, c);34-- This index supports:5WHERE a = ? ✅ Uses index6WHERE a = ? AND b = ? ✅ Uses index 7WHERE a = ? AND b = ? AND c = ? ✅ Uses index8WHERE b = ? AND c = ? ❌ Cannot use index (missing 'a')9WHERE a = ? AND c = ? ⚠️ Partial use (only 'a')1011-- Best practice: Most selective column first12-- If customer_id is more selective than status:13CREATE INDEX idx_orders ON orders(customer_id, status);3.4 Covering Index
SQL
1-- Include all columns needed by query2CREATE INDEX idx_covering ON orders(customer_id)3INCLUDE (order_date, amount);45-- Query uses Index Only Scan (no table access)6SELECT order_date, amount 7FROM orders 8WHERE customer_id = 100;4. Query Optimization Techniques
4.1 Sargable Queries
Sargable Queries
SARG = Search ARGument. Sargable queries can use indexes. Non-sargable queries force full table scans.
SQL
1-- ❌ Non-sargable (can't use index)2SELECT * FROM orders WHERE YEAR(order_date) = 2024;3SELECT * FROM users WHERE UPPER(email) = 'TEST@EMAIL.COM';4SELECT * FROM products WHERE price + 10 > 100;56-- ✅ Sargable (can use index)7SELECT * FROM orders 8WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01';910SELECT * FROM users WHERE email = 'test@email.com';11-- Or create expression index: CREATE INDEX idx_lower_email ON users(LOWER(email));1213SELECT * FROM products WHERE price > 90;4.2 Avoid SELECT *
SQL
1-- ❌ Bad: Returns all columns2SELECT * FROM orders WHERE customer_id = 100;34-- ✅ Good: Only needed columns5SELECT order_id, order_date, amount 6FROM orders 7WHERE customer_id = 100;89-- Benefits:10-- Less data transfer11-- Can use covering index12-- Less memory usage4.3 EXISTS vs IN
SQL
1-- For large subqueries, EXISTS often faster2-- ❌ IN with large result set3SELECT * FROM orders 4WHERE customer_id IN (SELECT id FROM customers WHERE region = 'North');56-- ✅ EXISTS (stops at first match)7SELECT * FROM orders o8WHERE EXISTS (9 SELECT 1 FROM customers c 10 WHERE c.id = o.customer_id AND c.region = 'North'11);1213-- For small static lists, IN is fine14SELECT * FROM orders WHERE status IN ('pending', 'processing');4.4 JOIN Optimization
SQL
1-- Join on indexed columns2SELECT o.*, c.name3FROM orders o4JOIN customers c ON o.customer_id = c.id; -- Both should be indexed56-- Filter early (before join)7SELECT o.*, c.name8FROM orders o9JOIN customers c ON o.customer_id = c.id10WHERE o.order_date >= '2024-01-01' -- Applied before join11AND c.status = 'active';1213-- Avoid unnecessary joins14-- ❌ If you don't need customer data15SELECT o.* 16FROM orders o17JOIN customers c ON o.customer_id = c.id;1819-- ✅ Skip the join20SELECT * FROM orders;4.5 LIMIT and Pagination
SQL
1-- Use LIMIT to reduce result set2SELECT * FROM orders 3ORDER BY order_date DESC 4LIMIT 100;56-- Efficient pagination with keyset (cursor)7-- ❌ OFFSET becomes slow for large offsets8SELECT * FROM orders ORDER BY id LIMIT 20 OFFSET 10000;910-- ✅ Keyset pagination (use last seen value)11SELECT * FROM orders 12WHERE id > 12345 -- Last ID from previous page13ORDER BY id 14LIMIT 20;5. Query Rewriting
5.1 Subquery to JOIN
SQL
1-- ❌ Correlated subquery (runs for each row)2SELECT 3 o.*,4 (SELECT name FROM customers WHERE id = o.customer_id) as customer_name5FROM orders o;67-- ✅ JOIN (single operation)8SELECT o.*, c.name as customer_name9FROM orders o10LEFT JOIN customers c ON o.customer_id = c.id;5.2 Aggregate Optimization
SQL
1-- Count optimization2-- ❌ COUNT(*) 3SELECT COUNT(*) FROM orders WHERE status = 'completed';45-- ✅ COUNT(indexed_column) can sometimes use index6SELECT COUNT(order_id) FROM orders WHERE status = 'completed';78-- For existence check9-- ❌ Count all10IF (SELECT COUNT(*) FROM orders WHERE customer_id = 100) > 01112-- ✅ EXISTS (stops at first row)13IF EXISTS (SELECT 1 FROM orders WHERE customer_id = 100)5.3 UNION vs UNION ALL
SQL
1-- UNION: Removes duplicates (requires sort)2SELECT customer_id FROM orders_20233UNION4SELECT customer_id FROM orders_2024;56-- UNION ALL: Keeps duplicates (faster)7SELECT customer_id FROM orders_20238UNION ALL9SELECT customer_id FROM orders_2024;1011-- Use UNION ALL when duplicates are OK or impossible6. Table Design Impact
6.1 Normalization Trade-offs
SQL
1-- Normalized: Multiple tables, requires joins2SELECT 3 o.order_id,4 c.name as customer_name,5 p.name as product_name6FROM orders o7JOIN customers c ON o.customer_id = c.id8JOIN products p ON o.product_id = p.id;910-- Denormalized: Redundant data, no joins11SELECT order_id, customer_name, product_name12FROM orders_denormalized;1314-- Consider denormalization for:15-- - Read-heavy workloads16-- - Reporting tables17-- - Data warehouses6.2 Partitioning
SQL
1-- Range partitioning (PostgreSQL)2CREATE TABLE orders (3 order_id SERIAL,4 order_date DATE,5 amount DECIMAL6) PARTITION BY RANGE (order_date);78CREATE TABLE orders_2023 PARTITION OF orders9 FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');1011CREATE TABLE orders_2024 PARTITION OF orders12 FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');1314-- Query automatically uses only relevant partition15SELECT * FROM orders WHERE order_date = '2024-06-15';16-- Only scans orders_20247. Monitoring & Analysis
7.1 Slow Query Log
SQL
1-- PostgreSQL: Enable slow query logging2-- In postgresql.conf:3-- log_min_duration_statement = 1000 -- Log queries > 1 second45-- MySQL: Enable slow query log6SET GLOBAL slow_query_log = 'ON';7SET GLOBAL long_query_time = 1;89-- View slow queries10SELECT * FROM pg_stat_statements11ORDER BY total_time DESC12LIMIT 10;7.2 Index Usage Statistics
SQL
1-- PostgreSQL: Check index usage2SELECT 3 schemaname,4 tablename,5 indexname,6 idx_scan as times_used,7 idx_tup_read as tuples_read,8 idx_tup_fetch as tuples_fetched9FROM pg_stat_user_indexes10ORDER BY idx_scan DESC;1112-- Find unused indexes13SELECT 14 schemaname,15 tablename,16 indexname17FROM pg_stat_user_indexes18WHERE idx_scan = 019AND schemaname = 'public';7.3 Table Statistics
SQL
1-- Update statistics (helps optimizer)2ANALYZE orders;34-- PostgreSQL: View table statistics5SELECT 6 schemaname,7 relname as table_name,8 n_live_tup as live_rows,9 n_dead_tup as dead_rows,10 last_vacuum,11 last_autovacuum12FROM pg_stat_user_tables13ORDER BY n_dead_tup DESC;1415-- High dead_rows = needs VACUUM16VACUUM ANALYZE orders;8. Common Anti-Patterns
8.1 N+1 Query Problem
SQL
1-- ❌ Anti-pattern: Loop with individual queries2FOR each customer IN customers:3 SELECT * FROM orders WHERE customer_id = customer.id;45-- ✅ Single query with JOIN or IN6SELECT c.*, o.*7FROM customers c8LEFT JOIN orders o ON c.id = o.customer_id9WHERE c.region = 'North';8.2 Implicit Type Conversion
SQL
1-- ❌ String to number conversion (can't use index)2SELECT * FROM orders WHERE order_id = '12345'; -- order_id is INT34-- ✅ Correct type5SELECT * FROM orders WHERE order_id = 12345;8.3 OR Conditions
SQL
1-- ❌ OR can prevent index usage2SELECT * FROM orders 3WHERE customer_id = 100 OR product_id = 200;45-- ✅ UNION for better index usage6SELECT * FROM orders WHERE customer_id = 1007UNION8SELECT * FROM orders WHERE product_id = 200;910-- Or create composite index11CREATE INDEX idx_customer_product ON orders(customer_id, product_id);9. Thực hành
Optimization Exercise
Exercise: Optimize Slow Queries
SQL
1-- Given table:2-- orders(order_id, customer_id, product_id, order_date, amount, status)3-- 10 million rows45-- Optimize these queries:67-- Query 1: Slow aggregate8SELECT 9 DATE_TRUNC('month', order_date) as month,10 COUNT(*) as order_count,11 SUM(amount) as total_amount12FROM orders13WHERE YEAR(order_date) = 202414GROUP BY DATE_TRUNC('month', order_date);1516-- Query 2: Slow join17SELECT o.*, c.name18FROM orders o19LEFT JOIN customers c ON o.customer_id = c.id20WHERE o.status = 'pending';2122-- Query 3: Slow subquery23SELECT *24FROM orders25WHERE customer_id IN (26 SELECT customer_id27 FROM orders28 GROUP BY customer_id29 HAVING SUM(amount) > 1000030);3132-- YOUR OPTIMIZED QUERIES HERE💡 Xem đáp án
SQL
1-- Query 1: Fix non-sargable WHERE, add index2-- Create index:3CREATE INDEX idx_orders_date ON orders(order_date);45-- Optimized query (sargable):6SELECT 7 DATE_TRUNC('month', order_date) as month,8 COUNT(*) as order_count,9 SUM(amount) as total_amount10FROM orders11WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01'12GROUP BY DATE_TRUNC('month', order_date);1314-- Query 2: Add indexes15CREATE INDEX idx_orders_status ON orders(status);16CREATE INDEX idx_orders_customer ON orders(customer_id);1718-- Or covering index for status + customer_id:19CREATE INDEX idx_orders_status_customer ON orders(status, customer_id);2021-- Query structure is OK, just needed indexes2223-- Query 3: Use CTE or temp table24-- Option A: CTE25WITH high_value_customers AS (26 SELECT customer_id27 FROM orders28 GROUP BY customer_id29 HAVING SUM(amount) > 1000030)31SELECT o.*32FROM orders o33JOIN high_value_customers hvc ON o.customer_id = hvc.customer_id;3435-- Option B: EXISTS (often faster)36SELECT o.*37FROM orders o38WHERE EXISTS (39 SELECT 140 FROM orders o241 WHERE o2.customer_id = o.customer_id42 GROUP BY o2.customer_id43 HAVING SUM(o2.amount) > 1000044);4546-- EXPLAIN output comparison47EXPLAIN ANALYZE 48SELECT /* original query */;4950EXPLAIN ANALYZE51SELECT /* optimized query */;10. Tổng kết
| Technique | Impact | When to Use |
|---|---|---|
| Indexing | High | Frequent WHERE/JOIN columns |
| EXPLAIN | Essential | Always before optimization |
| Sargable queries | High | All queries with functions |
| **Avoid SELECT *** | Medium | Always |
| JOIN optimization | High | Multi-table queries |
| Partitioning | High | Very large tables |
Optimization Checklist:
- ✅ Run EXPLAIN ANALYZE first
- ✅ Check for full table scans
- ✅ Verify index usage
- ✅ Make queries sargable
- ✅ Optimize JOIN order
- ✅ Use appropriate data types
Bài tiếp theo: SQL + Python Integration
