Query Optimization Techniques

Query optimization is about making SQL queries run more efficiently. The database figures out the best way to execute a query so it uses fewer resources and runs faster. This helps keep the system responsive and makes things smoother for the users and applications that depend on the data.

After reading the material, you should be able to answer the following questions:

  1. What is query optimization, and why is it essential for improving the efficiency and performance of SQL queries in a database system?
  2. What are the various query optimization techniques, such as indexing, query rewriting, join optimization, partitioning, materialized views, caching, and maintaining statistics, and how does each technique contribute to enhancing query performance?
  3. How do indexes improve query performance, and what are the best practices for selecting which columns to index and creating effective indexes in SQL?
  4. How can tools like the EXPLAIN command be used to analyze and optimize SQL queries, and what insights can they provide into query execution plans?
  5. What are the best practices for query optimization, including balancing read and write operations, avoiding excessive indexing, rewriting complex queries, and regularly reviewing and maintaining query performance?

Indexing

Indexes are like a smart, alphabetized cheat-sheet for your tables. Instead of rifling through every row (a full table scan), the database jumps straight to where the matches live (an index seek).

Without index (slow)                  With index (fast)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  customers    β”‚                     β”‚   idx(last)   β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€     scan…scan…      β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€  seek β†’ hits
β”‚ [millions…]   β”‚  ─────────────────▢ β”‚ A..B..C..S..  β”‚ ─────────────▢ rows
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                     β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

How Indexes Improve Query Performance

An index lets the database jump to matching rows by key instead of reading every row, so the payoff is largest when the predicate is selective, i.e., it matches only a small fraction of the table.

Quick visual on selectivity:

High selectivity (great)         Low selectivity (meh)
email β†’ 1 match                   is_active β†’ 900k matches
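
To put rough numbers on selectivity, here is a quick sketch (assuming a users table with email and is_active columns, used purely for illustration):

-- Ratio of distinct values to rows: close to 1.0 means highly selective
SELECT
  COUNT(DISTINCT email)::numeric     / NULLIF(COUNT(*), 0) AS email_selectivity,
  COUNT(DISTINCT is_active)::numeric / NULLIF(COUNT(*), 0) AS is_active_selectivity
FROM users;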

Creating an Index Example

CREATE INDEX idx_customers_lastname ON customers(last_name);

That helps queries that filter or sort by last_name:

SELECT * 
FROM customers 
WHERE last_name = 'Smith';

SELECT * 
FROM customers 
ORDER BY last_name;

Level up with common patterns:

Composite index (left-prefix matters):

CREATE INDEX idx_orders_cust_date ON orders(customer_id, created_at);

Helps: WHERE customer_id = ? (and optionally AND created_at >= ?) and ORDER BY created_at.
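
A minimal sketch of what the left-prefix rule means in practice, using the index above:

-- Can use idx_orders_cust_date: the leading column customer_id is constrained
SELECT * FROM orders
WHERE customer_id = 42 AND created_at >= '2025-01-01';

-- Usually cannot use it efficiently: the leading column is missing
SELECT * FROM orders
WHERE created_at >= '2025-01-01';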

Covering index (the query lives in the index):

CREATE INDEX idx_orders_cov ON orders(customer_id, status, total_amount);
-- then a query like:
SELECT status, total_amount 
FROM orders 
WHERE customer_id = 42;

If the DB can answer from just the index, it avoids touching the table (a β€œheap/cluster” visit).
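
One way to confirm it: EXPLAIN the query and look for an index-only access path. The plan below is illustrative PostgreSQL output, not a guaranteed result (index-only scans also depend on how recently the table was vacuumed):

EXPLAIN SELECT status, total_amount FROM orders WHERE customer_id = 42;

-- Index Only Scan using idx_orders_cov on orders  (cost=0.43..8.45 rows=3 width=12)
--   Index Cond: (customer_id = 42)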

Functional/partial index (when you can’t change the query shape):

-- functional
CREATE INDEX idx_lower_email ON users(LOWER(email));

-- partial (only active users)
CREATE INDEX idx_active_users ON users(last_login) WHERE is_active = true;

Tip: Avoid functions on the left side of predicates unless you have a matching functional index:

-- Slows down (kills index use):
WHERE LOWER(email) = LOWER('A@B.COM')

-- Better:
WHERE email = 'a@b.com'   -- store emails lowercased OR use functional index

Verifying Index Usage with EXPLAIN

EXPLAIN SELECT * FROM customers WHERE last_name = 'Smith';

Example output:

Index Scan using idx_customers_lastname on customers  (cost=0.29..8.31 rows=1 width=83)

If you check with EXPLAIN ANALYZE, you’ll see real timings. A typical before/after on a ~10M-row table:

Before (no index)  : Seq Scan on customers  (actual time=0.000..4210.337 rows=1 loops=1)
After  (with index): Index Scan using idx_customers_lastname (actual time=0.031..0.049 rows=1 loops=1)

That’s roughly 4.2 s β†’ 0.05 ms in this illustrative run, a speedup of several orders of magnitude.

Query Rewriting

You can often get big wins by rephrasing the same question so the optimizer can pick a cheaper path.

Complicated shape                 Simpler shape
   (nested subquery)    β†’         (join/exists)  β†’   better plan

Simplifying Complex Queries

Nested subqueries, SELECT * everywhere, and redundant DISTINCTs often hide cheaper plans; expressing the same logic as a join or EXISTS usually gives the optimizer more room to work.

Rewriting Example

Inefficient query:

SELECT * 
FROM orders 
WHERE customer_id IN (
  SELECT customer_id 
  FROM customers 
  WHERE city = 'London'
);

Optimized with a JOIN:

SELECT o.*
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id
WHERE c.city = 'London';

Why it’s faster (with supporting indexes like customers(city, customer_id) and orders(customer_id)): the engine can seek customers by city, then probe orders by customer_id for just those customers, instead of materializing and re-scanning a subquery result. Many modern planners rewrite the IN form into the same semi-join automatically, but the explicit join states the intent and makes the supporting indexes obvious.

Alternative that also performs well:

SELECT o.*
FROM orders AS o
WHERE EXISTS (
  SELECT 1
  FROM customers AS c
  WHERE c.customer_id = o.customer_id
    AND c.city = 'London'
);

EXISTS can short-circuit on the first match and often pairs nicely with indexes.

Potential duplicate rows? If customers.customer_id isn’t unique in your schema, either fix the model or add SELECT DISTINCT o.*.

Handy rewrites that routinely help:

OR β†’ UNION ALL (when predicates are selective and independent):

-- Before
WHERE (status = 'paid' OR shipped_at IS NULL)

-- After (lets each branch use its own index)
SELECT ... WHERE status = 'paid'
UNION ALL
SELECT ... WHERE shipped_at IS NULL AND status <> 'paid';

Pre-aggregate then join (shrinks data early):

-- Before: join then group (heavy)
SELECT c.id, SUM(o.total)
FROM customers c
JOIN orders o ON o.customer_id = c.id
GROUP BY c.id;

-- After: group first, then join (lighter)
WITH sums AS (
  SELECT customer_id, SUM(total) AS total_sum
  FROM orders
  GROUP BY customer_id
)
SELECT c.id, s.total_sum
FROM customers c
JOIN sums s ON s.customer_id = c.id;

Window vs subquery (often clearer + faster):

-- Top order per customer
SELECT *
FROM (
  SELECT o.*,
         ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY total DESC) AS rn
  FROM orders o
) x
WHERE rn = 1;

Tiny plan-reading cheat sheet:

Seq Scan          β†’ table scan (usually slow on big tables)
Index Scan/Seek   β†’ using index (good)
Bitmap Index/Heap β†’ many matches; sometimes still OK
Hash/Sort         β†’ watch for big memory use; try to index to avoid
Nested Loop       β†’ great when outer is small & inner is indexed
Merge/Hash Join   β†’ better for big sets; order/hash considerations

Pair the right indexes with query shapes that can use them, and you’ll usually see order-of-magnitude wins. When in doubt, run EXPLAIN ANALYZE, compare before/after timings, and check whether the plan switched from Seq Scan to an index-driven path.
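
For example, a minimal before/after check on the London rewrite from earlier (same data, same output rows; compare the actual times and plan nodes):

-- Baseline: the IN-subquery version
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE customer_id IN (SELECT customer_id FROM customers WHERE city = 'London');

-- Candidate: the JOIN version
EXPLAIN ANALYZE
SELECT o.* FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.city = 'London';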

Join Optimization

Joins are common in SQL queries but can be resource-intensive. Optimizing joins can have a substantial impact on performance.

High-level mental model
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”   join key   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LEFT   │◀────────────▢│  RIGHT  β”‚
β”‚  rows   β”‚              β”‚  rows   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   filter early, index keys, pick join that matches row counts

Choosing the Right Join Type

Different join types (INNER, LEFT, RIGHT, FULL) serve different purposes. Selecting the appropriate type ensures that only the necessary data is processed.

INNER:   L β‹‚ R     keep matches only
LEFT:    L βŸ• R     keep all L + matches from R (NULLs when no match)
RIGHT:   L βŸ– R     keep all R + matches from L
FULL:    L βŸ— R     keep everything, matched or not
SEMI:    L where a match exists in R (EXISTS)
ANTI:    L where no match exists in R (NOT EXISTS)

Quick tips: use INNER joins when you only need matching rows (an unnecessary LEFT JOIN forces the engine to carry non-matching rows), filter the driving table as early as possible, select only the columns you need, and keep join keys the same data type on both sides so indexes stay usable.

Join algorithms (what the engine actually runs):

Nested Loop  : great when outer is small & inner is indexed (fast seeks)
Hash Join    : shines on large, unsorted sets; needs memory for hash
Merge Join   : fast if both inputs are pre-sorted on join keys (or can be)
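
To see which algorithm the planner actually chose, read the join node at the top of the plan; in PostgreSQL you can also toggle strategies per session as a diagnostic experiment (a sketch only, not something to leave enabled):

EXPLAIN
SELECT o.* FROM orders o
JOIN customers c ON c.customer_id = o.customer_id
WHERE c.city = 'London';
-- look for "Nested Loop", "Hash Join", or "Merge Join" in the output

SET enable_nestloop = off;   -- diagnostic only: forbid nested loops for this session
-- re-run the same EXPLAIN and compare estimated costs, then:
RESET enable_nestloop;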

Example of Join Order Impact

Suppose you have two tables, large_table and small_table. Joining small_table to large_table can be more efficient than the reverse when the engine uses a nested loop and can seek into large_table by key.

Optimized join:

SELECT lt.*, st.info
FROM small_table AS st
JOIN large_table AS lt
  ON st.id = lt.st_id;

But the bigger lever is indexes on the join keys:

-- On the large side, index the join key it’s probed on:
CREATE INDEX idx_large_st_id ON large_table(st_id);

-- If you filter the small side first, index its filter too:
CREATE INDEX idx_small_status ON small_table(status);

The heuristic of putting the smaller table first in the JOIN mainly helps nested loop joins. Modern databases usually reorder joins themselves, so this rule matters mostly for engines with weaker optimizers, or for cases where you deliberately pin the join order (sketch below).
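
If you do need to pin the order while experimenting, one PostgreSQL-specific sketch is to stop the planner from reordering explicit joins, then put the setting back:

SET join_collapse_limit = 1;   -- planner keeps the textual JOIN order
-- run EXPLAIN on the query, compare plans, then restore the default:
RESET join_collapse_limit;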

Extra patterns that help:

Filter early on the driving table:

SELECT lt.*, st.info
FROM (SELECT id FROM small_table WHERE status = 'active') st
JOIN large_table lt ON st.id = lt.st_id;

Semi-join for existence checks:

SELECT lt.*
FROM large_table lt
WHERE EXISTS (
  SELECT 1 FROM small_table st WHERE st.id = lt.st_id AND st.status='active'
);

Tiny cardinality + index checklist:

[ ] Index join keys on BOTH sides
[ ] Apply filters BEFORE the join (CTE/derived table ok)
[ ] Return only needed columns (enables covering indexes)
[ ] Watch out for exploding rows (1:N:N); aggregate early if possible

Using EXPLAIN to Analyze Queries

Most databases provide an EXPLAIN command that shows how a query will be executed. This tool is invaluable for understanding and optimizing query performance.

EXPLAIN SELECT * FROM customers WHERE last_name = 'Smith';

Example output:

Seq Scan on customers  (cost=0.00..12.00 rows=1 width=83)
  Filter: (last_name = 'Smith')

Level up the analysis:

Use runtime variants to see actual work:
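
For example (PostgreSQL and MySQL 8.0.18+ shown; exact syntax varies by engine):

-- PostgreSQL: real timings, row counts, and buffer hits/reads
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM customers WHERE last_name = 'Smith';

-- MySQL 8.0.18+: actual execution with per-operator timing
EXPLAIN ANALYZE SELECT * FROM customers WHERE last_name = 'Smith';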

Watch for: large gaps between estimated and actual row counts, Seq Scans on big tables, sorts or hashes that spill to disk, and nested loops whose inner side executes a huge number of times.

Handy EXPLAIN interpretation cheat:

Node          What it hints
------------  --------------------------------------------
Seq Scan      Missing/ignored index or low selectivity
Index Scan    Good selectivity / usable index
Bitmap Heap   Many matches; ok but maybe add composite index
Nested Loop   Outer small + inner indexed; or missing index (slow)
Hash Join     Large sets; ensure enough memory to avoid spills
Merge Join    Inputs sorted; consider indexes to keep them sorted

Partitioning

Partitioning divides a large table into smaller, more manageable pieces. This can improve query performance by allowing the database to scan only relevant partitions (aka partition pruning), and can make maintenance (loads, archiving) safer and faster.

One big table β†’ many smaller, date-sliced chunks
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ orders ────────────────────────┐
β”‚ 2021 | 2022 | 2023 | 2024 | 2025 | default(fallback)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
   ↑ prune to only what your WHERE clause needs

Common strategies: range (usually by date, the most common choice for append-mostly data), list (by region, tenant, or category), and hash (to spread rows evenly when there is no natural range).

Benefits: queries that filter on the partition key touch only the relevant partitions, old data can be detached or dropped almost instantly instead of running huge DELETEs, and maintenance jobs work on smaller objects.

Trade-offs: more objects to create and monitor, queries that do not filter on the partition key gain nothing (and may pay extra planning overhead), and unique constraints generally must include the partition key.

Partitioning Example

Partitioning a table by date:

-- PostgreSQL style (parent + partitions)
CREATE TABLE orders (
  id BIGSERIAL,
  customer_id BIGINT NOT NULL,
  order_date DATE NOT NULL,
  total NUMERIC(12,2) NOT NULL,
  PRIMARY KEY (id, order_date)   -- the PK must include the partition key
) PARTITION BY RANGE (order_date);

CREATE TABLE orders_2021 PARTITION OF orders
  FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- Always keep a current + default partition
CREATE TABLE orders_2025 PARTITION OF orders
  FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

CREATE TABLE orders_default PARTITION OF orders DEFAULT;

-- Index per partition (or on parent in engines that propagate)
CREATE INDEX ON orders_2021 (order_date, customer_id);
CREATE INDEX ON orders_2025 (order_date, customer_id);

Now queries that filter by date can target the specific partition, reducing the amount of data scanned:

SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND order_date <  DATE '2025-02-01'
GROUP BY customer_id;
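
To verify pruning, EXPLAIN the query and confirm that only the matching partition shows up; the plan below is an illustrative PostgreSQL shape, not exact output:

EXPLAIN SELECT customer_id, SUM(total)
FROM orders
WHERE order_date >= DATE '2025-01-01'
  AND order_date <  DATE '2025-02-01'
GROUP BY customer_id;

-- HashAggregate
--   ->  Seq Scan on orders_2025 orders        -- other partitions are pruned away
--         Filter: ((order_date >= '2025-01-01') AND (order_date < '2025-02-01'))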

Engine notes: in PostgreSQL 11+ an index created on the partitioned parent is propagated to each partition automatically; in MySQL every unique key (including the primary key) must contain all columns of the partitioning expression; SQL Server implements this with partition functions and partition schemes.

Operational patterns: create the next period's partition ahead of time (plus a DEFAULT partition as a safety net), load and archive by attaching or detaching whole partitions, and drop expired partitions instead of deleting rows.

Materialized Views

Materialized views store the result of a query on disk so future lookups skip heavy joins/aggregations.

Raw tables ──(expensive query)──▢ result
                 β–²                     β”‚
                 └─────── stored as materialized view β”€β”€β”€β”€β”˜

When they shine: dashboards and reports that re-run the same heavy joins and aggregations, rollups over large, mostly append-only fact tables, and any result that can tolerate being minutes or hours stale.

Trade-offs: results are stale until the next refresh, the stored result takes disk space, and refreshes themselves cost time and I/O (and can block readers, depending on the engine and refresh mode).

Creating a Materialized View Example

-- Base example
CREATE MATERIALIZED VIEW sales_summary AS
SELECT
  product_id,
  SUM(quantity) AS total_quantity
FROM sales
GROUP BY product_id;

-- Index it like a real table (fast lookups/joins):
CREATE INDEX ON sales_summary (product_id);

Level-up variants you’ll likely want:

-- (PostgreSQL) Create without initial load, then refresh later off-peak
CREATE MATERIALIZED VIEW sales_summary WITH NO DATA AS
SELECT product_id, SUM(quantity) AS total_quantity
FROM sales
GROUP BY product_id;

-- Track freshness for UIs/tooling (a materialized view's columns come from its
-- query and can't be ALTERed/UPDATEd directly, so use a small side table)
CREATE TABLE mv_refresh_log (mv_name text PRIMARY KEY, last_refreshed timestamptz NOT NULL);
INSERT INTO mv_refresh_log VALUES ('sales_summary', clock_timestamp())
ON CONFLICT (mv_name) DO UPDATE SET last_refreshed = EXCLUDED.last_refreshed;

-- Typical β€œrollup with time buckets”
CREATE MATERIALIZED VIEW sales_summary_daily AS
SELECT
  product_id,
  date_trunc('day', sold_at) AS day,
  SUM(quantity) AS qty,
  SUM(price * quantity) AS revenue
FROM sales
GROUP BY product_id, date_trunc('day', sold_at);

CREATE INDEX ON sales_summary_daily (product_id, day);

Tip: Query the MV directly (FROM sales_summary_daily) or wire a view/feature flag so you can flip between the MV and the base query during rollout.
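
A hedged sketch of that indirection in PostgreSQL, with sales_summary_api as an illustrative name:

-- Callers query the view, never the MV or the base tables directly
CREATE VIEW sales_summary_api AS
SELECT product_id, total_quantity
FROM sales_summary;

-- Rollback path: repoint the view at the live query without an app deploy
CREATE OR REPLACE VIEW sales_summary_api AS
SELECT product_id, SUM(quantity) AS total_quantity
FROM sales
GROUP BY product_id;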

Refreshing the Materialized View

-- Basic refresh (in PostgreSQL this blocks readers while it runs)
REFRESH MATERIALIZED VIEW sales_summary;

-- (PostgreSQL) Concurrent refresh (readers don’t block).
-- Requires a UNIQUE index that covers all rows.
CREATE UNIQUE INDEX ON sales_summary (product_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary;

-- Record the refresh time in the side table after each refresh
INSERT INTO mv_refresh_log VALUES ('sales_summary', clock_timestamp())
ON CONFLICT (mv_name) DO UPDATE SET last_refreshed = EXCLUDED.last_refreshed;

Other refresh patterns (engine-specific): schedule full refreshes off-peak (cron, pg_cron, or your job runner), use incremental/fast refresh where the engine supports it (for example Oracle materialized view logs), or rebuild into a new object and swap names to avoid blocking readers.
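
For instance, a minimal scheduling sketch assuming the pg_cron extension is installed (the 15-minute cadence is an arbitrary placeholder):

-- Refresh from inside the database on a fixed schedule
SELECT cron.schedule(
  '*/15 * * * *',
  $$REFRESH MATERIALIZED VIEW CONCURRENTLY sales_summary$$
);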

Common β€œgotchas” checklist:

[ ] Index the MV for your read patterns
[ ] Guarantee a UNIQUE key if you want concurrent/fast refresh
[ ] Size the refresh window to fit your SLA
[ ] Document staleness (UI badge: β€œUpdated 08:15”)
[ ] Schedule around load; stagger multiple MVs

Caching

Caching keeps recent/frequent results close to your app so you avoid repeating expensive work.

client β†’ app β†’ (cache?) β†’ db
           └── hit β†’ fast
           └── miss β†’ compute β†’ store β†’ fast next time

Where to cache: in the application tier (Redis, Memcached, in-process memory), in front of it (HTTP/CDN caching for whole responses), and inside the database (prepared statements, buffer/page caches, materialized results).

When it shines: read-heavy workloads, hot keys that are requested far more often than they change, and expensive computations whose results can tolerate a bounded amount of staleness.

Risks & mitigations: stale reads (use TTLs plus explicit invalidation on writes), cache stampedes when a hot key expires (locks, request coalescing, TTL jitter), and memory pressure (size limits with LRU/LFU eviction, monitor the hit rate).

Application-Level Caching Example

Basic cache-aside with Redis (Python):

import json, time, random
import redis
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

# fetch_product_from_db(product_id) below stands in for your real DB query (not shown)
def get_product_details(product_id):
    key = f"product:{product_id}:v1"             # versioned key for schema changes
    val = cache.get(key)
    if val:
        return json.loads(val)                   # cache hit

    # stampede guard: short lock so only one worker recomputes
    lock_key = f"lock:{key}"
    if cache.set(lock_key, "1", nx=True, ex=15):  # acquire
        try:
            product = fetch_product_from_db(product_id)  # slow path
            ttl = 3600 + int(random.random() * 300)      # jitter to avoid herd
            cache.set(key, json.dumps(product), ex=ttl)
            return product
        finally:
            cache.delete(lock_key)
    else:
        # another worker is fetching; brief wait then retry
        time.sleep(0.05)
        val = cache.get(key)
        return json.loads(val) if val else fetch_product_from_db(product_id)

Useful variations: read-through/write-through wrappers that hide the cache behind your data-access layer, negative caching for "not found" lookups, and short per-request memoization for values reused many times within a single request.

Patterns & pitfalls:

[ ] Version keys (product:{id}:v2) for painless schema changes
[ ] Add TTL jitter (Β±5–10%) to avoid synchronized expiries
[ ] Protect heavy keys with a lock/single-flight
[ ] Size & evict wisely (LRU/LFU); monitor hit rate & memory
[ ] Don’t cache giant blobs; compress or split

Statistics and Histograms

Optimizers aren’t psychicβ€”they bet on plans using statistics about your data. When those stats are fresh and detailed, the planner picks smarter joins, uses the right indexes, and avoids ugly scans.

Planner’s crystal ball
   data sample  β†’  MCV list + histogram  β†’  cardinality guess  β†’  plan

What Postgres tracks (via pg_stats): null_frac, n_distinct, most_common_vals with most_common_freqs (the MCV list), histogram_bounds for the remaining values, and correlation between the column's sort order and the physical row order.

Tiny visual:

MCV (top-k):    ['New York'(0.18), 'LA'(0.12), 'Chicago'(0.09), ...]
Histogram:      |---|----|--|-----|---|----|---|--|-----|---|
Value space β†’   A                B          C           D
(equi-height buckets β‰ˆ same row count per bucket)

Why you care: these numbers drive the planner's cardinality estimates, which in turn decide index usage, join order, and join algorithm; stale or too-coarse statistics are one of the most common causes of a suddenly bad plan.

Updating Statistics Example

In PostgreSQL:

ANALYZE customers;              -- quick, safe, runs online
-- or see what it's doing:
ANALYZE VERBOSE customers;

You can raise detail for skewed columns:

ALTER TABLE customers ALTER COLUMN city SET STATISTICS 1000;
ANALYZE customers (city);

(Per-column settings override default_statistics_target.)

Auto-analyze will kick in after changes: roughly autovacuum_analyze_threshold + autovacuum_analyze_scale_factor * reltuples. If tables churn a lot, consider lowering those for just the hot tables.
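
For example, a per-table override sketch (the thresholds here are illustrative, not recommendations):

-- Re-analyze this hot table after ~2% of rows change instead of the default 10%
ALTER TABLE orders SET (
  autovacuum_analyze_scale_factor = 0.02,
  autovacuum_analyze_threshold    = 1000
);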

Verifying Updated Statistics

SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename  = 'customers';

Handy ops views:

SELECT relname, last_analyze, n_live_tup
FROM pg_stat_all_tables
WHERE schemaname = 'public' AND relname IN ('customers','orders');

If correlation is low on an index key that you frequently range-scan, you can sometimes boost locality by physically reordering the table:

-- Cautious: exclusive lock on table during operation; schedule off-peak
CLUSTER customers USING idx_customers_city;  -- physically order table by index

Practical Examples

Let’s bring it together.

Optimizing a Slow Query

SELECT o.*
FROM orders o
JOIN customers c
  ON o.customer_id = c.customer_id
WHERE c.city = 'New York';

Initial Execution Plan

EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.city = 'New York';

Example output:

Nested Loop  (cost=0.00..5000.00 rows=100 width=...)
  -> Seq Scan on customers c  (cost=0.00..1000.00 rows=50 width=...)
        Filter: (city = 'New York')
  -> Seq Scan on orders o     (cost=0.00..80.00 rows=1 width=...)
        Filter: (o.customer_id = c.customer_id)

Step-by-step Fix

I. Add/confirm the right indexes

Think of this as laying down fast lanes. We add one index to jump straight to β€œcustomers in this city” and grab their IDs, and another to find all their orders without wandering the whole table. Do it so the database stops scanning everything; expect quick, targeted lookups instead of sluggish full scans.

-- Filter & join key on customers
CREATE INDEX IF NOT EXISTS idx_customers_city_id
  ON customers(city, customer_id);

-- Join key on orders
CREATE INDEX IF NOT EXISTS idx_orders_customer_id
  ON orders(customer_id);

II. Update stats (and make them richer on skewed columns)

This is giving the planner fresh glasses. We bump the stats target for city (because some cities are way more common) and re-run ANALYZE. Do it so the optimizer guesses row counts realistically; expect better join choices and fewer β€œwhy did it do that?” plans.

ALTER TABLE customers ALTER COLUMN city SET STATISTICS 1000;
ANALYZE customers;        -- refresh customers stats
ANALYZE orders;           -- refresh orders stats too

III. (Optional) Extended statistics for multi-column estimates (PG 10+)

Now we clue the planner in on how columns relate. By telling it about distinctness across (city, customer_id), it stops making naive assumptions. Do it if your data has relationships across columns; expect smarter estimates and, in turn, smarter plans.

-- Helps the planner understand distinctness across columns
CREATE STATISTICS stat_c_city_id (ndistinct) ON city, customer_id FROM customers;
ANALYZE customers;

IV. Shape the query so the plan can win

Same result, clearer intent. A plain join is fine, but EXISTS often nudges the planner into a lean β€œjust check if there’s a matching customer” mindset. Do it to avoid generating extra rows or over-complicating the join; expect simpler, index-friendly execution.

-- Equivalent, often better estimates:
SELECT o.*
FROM orders o
WHERE EXISTS (
  SELECT 1
  FROM customers c
  WHERE c.customer_id = o.customer_id
    AND c.city = 'New York'
);

V. Re-check the plan

Quick sanity check. We run EXPLAIN to make sure our indexes and stats actually changed the strategy. Do it to confirm we’re not guessing; expect to see index scans and either a tidy nested loop (when NYC is small) or a hash/bitmap plan (when NYC is big).

EXPLAIN
SELECT o.*
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.city = 'New York';

Typical improved plan A (seek + nested loop):

Nested Loop
  -> Index Only Scan using idx_customers_city_id on customers c
        Index Cond: (city = 'New York')             -- fast, few rows
  -> Index Scan using idx_orders_customer_id on orders o
        Index Cond: (o.customer_id = c.customer_id) -- fast per customer

Typical improved plan B (bitmap/hash, if NYC is β€œbig”):

Hash Join
  Hash Cond: (o.customer_id = c.customer_id)
  -> Index Scan using idx_orders_customer_id on orders o
  -> Bitmap Heap Scan on customers c
       Recheck Cond: (city = 'New York')
       -> Bitmap Index Scan on idx_customers_city_id
