
Database Indexing Strategies: A Comprehensive Guide to PostgreSQL Performance

Indexes can make or break your application's performance. Learn when to use B-tree, hash, and composite indexes, and master EXPLAIN ANALYZE to make data-driven decisions.

Lisa Anderson, Data Engineer & Backend Specialist


An index is not a silver bullet. I've seen engineers slap indexes on every column and wonder why their database is slower than before. I've also seen production systems brought to their knees by missing indexes on tables with millions of rows. The difference between these outcomes is understanding.

This guide covers the indexing strategies that matter for PostgreSQL. We'll examine the data structures, analyze execution plans, and establish clear criteria for when to index—and when not to.

Understanding Index Fundamentals

An index is a separate data structure that maintains pointers to rows in your table, organized for fast lookup. Without an index, the database must perform a sequential scan—reading every row to find matches. With the right index, it can jump directly to relevant rows.

Consider this table:

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    order_date TIMESTAMP NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    status VARCHAR(20) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

A query filtering by customer_id on a table with 10 million rows will scan all 10 million rows without an index. With an index, it might read only a few dozen pages.
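To put "a few dozen pages" in perspective, here's a back-of-the-envelope page count. The numbers are illustrative assumptions, not measurements: 8 KB pages each holding about 100 rows of this shape, a 3-level index, and 3 matching rows.

```python
import math

rows = 10_000_000
rows_per_page = 100  # assumed: ~100 rows of this shape fit in one 8 KB page

# Sequential scan: every heap page must be read.
seq_scan_pages = math.ceil(rows / rows_per_page)

# Index scan: one page per B-tree level (assume 3) plus one heap page
# per matching row (assume 3 matching orders).
index_scan_pages = 3 + 3

print(seq_scan_pages)    # 100000
print(index_scan_pages)  # 6
```

Even if the per-page assumptions are off by an order of magnitude, the gap between "every page" and "a handful of pages" dominates.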

B-Tree Indexes: The Default Workhorse

B-tree (balanced tree) indexes are PostgreSQL's default and most versatile index type. They maintain data in sorted order, enabling efficient lookups for equality and range queries.

Structure and Performance Characteristics

A B-tree index organizes data in a balanced tree structure with a configurable branching factor. For a typical page size, each internal node can reference hundreds of child nodes, meaning a B-tree with millions of entries rarely exceeds 3-4 levels deep.

  • Lookup complexity: O(log n)
  • Insert complexity: O(log n)
  • Space overhead: typically 2-3x the size of the indexed column data
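The depth claim follows directly from the branching factor. A quick sketch, assuming a fanout of roughly 300 entries per node (the exact value depends on key width and page size):

```python
import math

def btree_depth(entries: int, fanout: int = 300) -> int:
    """Approximate tree levels when each node references ~`fanout` children."""
    return max(1, math.ceil(math.log(entries) / math.log(fanout)))

print(btree_depth(1_000_000))      # -> 3
print(btree_depth(10_000_000))     # -> 3
print(btree_depth(1_000_000_000))  # -> 4
```

Because depth grows with the logarithm base ~300, the tree gains a level only when the table grows by a factor of hundreds, which is why even billion-row indexes stay shallow.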

When to Use B-Tree Indexes

B-tree indexes excel at:

  • Equality comparisons: WHERE customer_id = 123
  • Range queries: WHERE order_date > '2026-01-01'
  • Sorting: ORDER BY created_at DESC
  • Pattern matching with left-anchored patterns: WHERE name LIKE 'John%'

Create a B-tree index:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);

Let's verify it's being used:

EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 12345;

Output (on a table with 5 million rows):

Index Scan using idx_orders_customer_id on orders  (cost=0.43..8.45 rows=1 width=52) (actual time=0.028..0.031 rows=3 loops=1)
  Index Cond: (customer_id = 12345)
Planning Time: 0.152 ms
Execution Time: 0.058 ms

Compare to without the index:

Seq Scan on orders  (cost=0.00..142857.00 rows=1 width=52) (actual time=847.231..1523.456 rows=3 loops=1)
  Filter: (customer_id = 12345)
  Rows Removed by Filter: 4999997
Planning Time: 0.089 ms
Execution Time: 1523.512 ms

The difference: 0.058ms versus 1523ms. That's a 26,000x improvement.
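The ratio is just the two execution times from the plans divided:

```python
seq_scan_ms = 1523.512   # Execution Time from the Seq Scan plan
index_scan_ms = 0.058    # Execution Time from the Index Scan plan

print(round(seq_scan_ms / index_scan_ms))  # -> 26267, i.e. ~26,000x
```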

Hash Indexes

Hash indexes use a hash function to map values to buckets. They're optimized for equality comparisons only.

When to Use Hash Indexes

Hash indexes are appropriate when:

  • You only need equality lookups (no ranges, no sorting)
  • The column has high cardinality (many distinct values)
  • Memory is not a constraint (hash indexes can be larger than B-tree)

CREATE INDEX idx_orders_status_hash ON orders USING HASH (status);

In practice, I rarely use hash indexes. B-tree indexes handle equality comparisons nearly as fast while also supporting range queries. The marginal performance benefit of hash indexes seldom justifies losing that flexibility.

However, for columns where you will never need range queries—like UUIDs or API keys—hash indexes can provide a small performance advantage.
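The trade-off can be sketched with ordinary data structures. This is a toy contrast, not PostgreSQL internals: a dict stands in for a hash index, a sorted list for a B-tree.

```python
import bisect

# A dict models a hash index: fast equality lookups, no ordering.
hash_index = {"key-a": 1, "key-b": 2, "key-c": 3}
print(hash_index["key-b"])  # equality lookup -> 2

# A sorted list models a B-tree: equality AND range via binary search.
btree_keys = sorted(hash_index)
lo = bisect.bisect_left(btree_keys, "key-b")
hi = bisect.bisect_right(btree_keys, "key-c")
print(btree_keys[lo:hi])  # range scan -> ['key-b', 'key-c']

# The only way to answer a range query on the hash structure is to
# scan every key -- exactly the flexibility you give up.
range_scan = sorted(k for k in hash_index if "key-b" <= k <= "key-c")
```

The dict answers equality lookups without maintaining order, which is its whole advantage; the sorted structure pays for ordering on every insert but answers both query shapes.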

Composite Indexes: Multiple Columns

A composite index covers multiple columns. Column order matters significantly.

CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

This index efficiently supports:

-- Uses the full index
SELECT * FROM orders
WHERE customer_id = 123 AND order_date > '2026-01-01';

-- Uses only the first column of the index
SELECT * FROM orders WHERE customer_id = 123;

-- Cannot use this index efficiently
SELECT * FROM orders WHERE order_date > '2026-01-01';

The Leftmost Prefix Rule

A composite index can be used for queries that filter on a leftmost prefix of the indexed columns. An index on (a, b, c) can be used for queries filtering on:

  • a
  • a, b
  • a, b, c

It cannot be used efficiently for queries filtering only on b or c.
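The rule falls out of how a B-tree orders composite keys. A toy model with hypothetical rows, using a sorted list of tuples as the "index":

```python
import bisect

# Composite index on (a, b, c), modeled as tuples kept in sorted order --
# the same order a B-tree on (a, b, c) maintains. Data is made up.
index = sorted([(1, 10, "x"), (1, 20, "y"), (2, 10, "z"), (3, 30, "w")])

# Leftmost-prefix filter (a = 1): matches form one contiguous slice,
# located with two binary searches.
lo = bisect.bisect_left(index, (1,))
hi = bisect.bisect_left(index, (2,))
print(index[lo:hi])  # -> [(1, 10, 'x'), (1, 20, 'y')]

# Filter on b alone (b = 10): matches are scattered through the sort
# order, so every entry must be inspected -- no shortcut exists.
matches = [entry for entry in index if entry[1] == 10]
print(matches)  # -> [(1, 10, 'x'), (2, 10, 'z')]
```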

Column Order Guidelines

  1. Place columns used in equality conditions first
  2. Place columns used in range conditions last
  3. Consider the selectivity of each column

Example: For a query WHERE status = 'pending' AND order_date > '2026-01-01':

-- Better: equality column first
CREATE INDEX idx_status_date ON orders(status, order_date);

-- Worse: range column first
CREATE INDEX idx_date_status ON orders(order_date, status);

With status first, the index narrows to all 'pending' orders, then scans the date range within that subset. With order_date first, it scans all orders in the date range, then filters by status.
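A rough simulation makes the difference concrete. The workload below is entirely synthetic (assumed 10% pending rate, dates uniform over three years, filter covering the last year); the counts approximate how many index entries each column order forces the scan to touch.

```python
import random

random.seed(0)

# Synthetic orders: 100k rows, ~10% 'pending', order dates as day
# offsets 0-1095 (~3 years); the query wants offsets > 730 (~last year).
rows = [
    ("pending" if random.random() < 0.10 else "shipped",
     random.randint(0, 1095))
    for _ in range(100_000)
]
cutoff = 730

# (status, order_date): seek to ('pending', cutoff), scan to the end of
# the pending subtree -- touches only pending rows in the date range.
touched_status_first = sum(1 for s, d in rows if s == "pending" and d > cutoff)

# (order_date, status): range-scan every entry past the cutoff, then
# discard the ~90% that aren't pending.
touched_date_first = sum(1 for s, d in rows if d > cutoff)

print(touched_status_first)  # roughly 3,300 entries touched
print(touched_date_first)    # roughly 33,000 entries touched
```

Same result set, roughly a tenfold difference in entries examined, which is the intuition behind guideline 1.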

Partial Indexes: Index Only What Matters

A partial index includes only rows matching a specified condition. This reduces index size and maintenance overhead.

CREATE INDEX idx_pending_orders ON orders(customer_id, order_date) WHERE status = 'pending';

This index is smaller than a full index and is used only for queries that include WHERE status = 'pending':

-- Uses the partial index
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'pending' AND customer_id = 123;

Index Scan using idx_pending_orders on orders  (cost=0.29..8.30 rows=1 width=52) (actual time=0.019..0.021 rows=2 loops=1)
  Index Cond: (customer_id = 123)
Planning Time: 0.183 ms
Execution Time: 0.042 ms

Effective Partial Index Use Cases

  • Status columns where you query specific values: WHERE is_active = true
  • Soft deletes: WHERE deleted_at IS NULL
  • Recent data: WHERE created_at > '2025-01-01'
  • Specific partitions of data: WHERE tenant_id = 1

Partial indexes are underutilized. If 90% of your queries filter by status = 'active', a partial index on active rows will be smaller and faster than a full index.

Covering Indexes: Avoiding Table Lookups

When an index contains all columns needed by a query, PostgreSQL can satisfy the query using only the index. This is called an index-only scan.

CREATE INDEX idx_orders_covering ON orders(customer_id, order_date, total_amount);

For this query:

SELECT order_date, total_amount FROM orders WHERE customer_id = 123;

The execution plan:

Index Only Scan using idx_orders_covering on orders  (cost=0.43..4.45 rows=1 width=16) (actual time=0.025..0.027 rows=3 loops=1)
  Index Cond: (customer_id = 123)
  Heap Fetches: 0
Planning Time: 0.121 ms
Execution Time: 0.048 ms

Note Heap Fetches: 0. The query never touched the table—all data came from the index.

The INCLUDE Clause (PostgreSQL 11+)

PostgreSQL 11 introduced the INCLUDE clause for covering indexes without affecting the index's sort order:

CREATE INDEX idx_orders_customer_include ON orders(customer_id) INCLUDE (order_date, total_amount, status);

The included columns are stored in the index leaf pages but aren't part of the B-tree structure. This provides covering index benefits without the overhead of maintaining sort order for those columns.

Expression Indexes

You can index the result of an expression or function:

CREATE INDEX idx_orders_year ON orders(EXTRACT(YEAR FROM order_date));
CREATE INDEX idx_customers_email_lower ON customers(LOWER(email));

These indexes are used when queries match the exact expression:

-- Uses idx_orders_year
SELECT * FROM orders WHERE EXTRACT(YEAR FROM order_date) = 2026;

-- Uses idx_customers_email_lower
SELECT * FROM customers WHERE LOWER(email) = '[email protected]';

Expression indexes are essential for case-insensitive searches and queries on computed values.

Reading EXPLAIN ANALYZE Output

The ability to read execution plans separates competent database work from guesswork. Let's analyze a complex example:

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.*, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'pending'
  AND o.order_date > '2026-01-01'
ORDER BY o.order_date DESC
LIMIT 100;

Output:

Limit  (cost=1234.56..1234.81 rows=100 width=128) (actual time=2.341..2.367 rows=100 loops=1)
  Buffers: shared hit=892
  ->  Sort  (cost=1234.56..1247.89 rows=5332 width=128) (actual time=2.339..2.351 rows=100 loops=1)
        Sort Key: o.order_date DESC
        Sort Method: top-N heapsort  Memory: 47kB
        Buffers: shared hit=892
        ->  Nested Loop  (cost=0.87..1089.23 rows=5332 width=128) (actual time=0.045..1.892 rows=2847 loops=1)
              Buffers: shared hit=892
              ->  Index Scan using idx_pending_orders on orders o  (cost=0.43..456.78 rows=5332 width=52) (actual time=0.023..0.634 rows=2847 loops=1)
                    Index Cond: (order_date > '2026-01-01'::timestamp)
                    Buffers: shared hit=234
              ->  Index Scan using customers_pkey on customers c  (cost=0.43..8.45 rows=1 width=76) (actual time=0.003..0.003 rows=1 loops=2847)
                    Index Cond: (id = o.customer_id)
                    Buffers: shared hit=658
Planning Time: 0.412 ms
Execution Time: 2.421 ms

Key observations:

  1. Index Scan on idx_pending_orders: The partial index is being used effectively
  2. Nested Loop join: For each order, it looks up the customer—appropriate for this cardinality
  3. Buffers: shared hit=892: All data came from cache; no disk reads
  4. top-N heapsort: Smart optimization for LIMIT queries—it doesn't sort everything
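The top-N heapsort in observation 4 is easy to demonstrate outside the database. A sketch using Python's heap utilities in place of the planner's sort node:

```python
import heapq
import random

random.seed(42)

# 5,332 synthetic sort keys, matching the row count the plan reported.
order_dates = [random.randint(1, 10_000_000) for _ in range(5332)]

# Naive approach: sort everything, then take the first 100 -- O(n log n).
top_full_sort = sorted(order_dates, reverse=True)[:100]

# top-N heapsort: stream rows through a 100-element heap, never
# materializing a full sort -- O(n log 100).
top_heap = heapq.nlargest(100, order_dates)

print(top_heap == top_full_sort)  # True: same rows, far less work
```

This is why ORDER BY ... LIMIT queries stay fast even over large intermediate results: memory use is bounded by the limit, not the row count (note the 47kB sort memory in the plan).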

Warning Signs in Execution Plans

Watch for these indicators of performance problems:

  • Seq Scan on large tables: Missing index or outdated statistics
  • Sort Method: external merge Disk: Sorting exceeded work_mem
  • Nested Loop with high loops count: Consider a hash or merge join
  • Rows estimates far from actual: Run ANALYZE to update statistics
  • Buffers: shared read: Data not cached, expect slower queries

When NOT to Index

More indexes are not always better. Every index has costs:

  1. Storage: Indexes consume disk space
  2. Write overhead: Every INSERT, UPDATE, DELETE must update all relevant indexes
  3. Maintenance: Indexes require VACUUM and can become bloated
  4. Planning overhead: More indexes means more options for the planner to evaluate

Don't Index When:

Low selectivity columns: A boolean column with 50/50 distribution rarely benefits from an index. The database will likely choose a sequential scan anyway.

-- Probably wasteful
CREATE INDEX idx_orders_is_active ON orders(is_active);

Small tables: Tables under ~10,000 rows often scan faster than index lookup due to overhead. Sequential scan of 1,000 rows is negligible.

Write-heavy tables with read-light patterns: If a table receives 1,000 writes per second but only 10 reads, indexes hurt more than they help.

Columns rarely used in WHERE, JOIN, or ORDER BY: Only index columns you actually filter or sort by.

Identifying Unused Indexes

PostgreSQL tracks index usage:

SELECT schemaname, relname, indexrelname,
       idx_scan, idx_tup_read, idx_tup_fetch,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

This query reveals indexes that have never been used since the last statistics reset. Before dropping them, verify over a sufficient time period that covers all query patterns.

Index Maintenance

Indexes degrade over time. Dead tuples from updates and deletes cause bloat.

Monitor Index Bloat

SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan, idx_tup_read
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY pg_relation_size(indexrelid) DESC;

Rebuild Bloated Indexes

-- Blocking rebuild
REINDEX INDEX idx_orders_customer_id;

-- Non-blocking rebuild (PostgreSQL 12+)
REINDEX INDEX CONCURRENTLY idx_orders_customer_id;

REINDEX CONCURRENTLY is essential for production systems—it rebuilds the index without blocking writes.

Practical Indexing Checklist

Before creating an index, answer these questions:

  1. What queries will use this index? Verify with EXPLAIN.
  2. What's the selectivity? High-cardinality columns benefit more.
  3. Is there an existing index that could be extended instead?
  4. What's the write volume on this table?
  5. Can a partial index suffice?
  6. Would a covering index eliminate heap fetches?

After creating an index:

  1. Run ANALYZE on the table
  2. Verify usage with EXPLAIN ANALYZE
  3. Monitor pg_stat_user_indexes for actual usage
  4. Check for increased write latency

Conclusion

Effective indexing is empirical work. You form a hypothesis about which index will help, create it, and measure the result. EXPLAIN ANALYZE is your primary tool. Statistics are your foundation.

Start with the queries that matter most—the slow ones, the frequent ones. Add indexes incrementally. Monitor their usage. Remove the ones that don't pull their weight.

The goal is not maximum indexes. The goal is optimal performance with minimal overhead. Every index should justify its existence with measurable query improvement.

There's no substitute for understanding your data and your access patterns. The strategies in this guide will help, but ultimately, your execution plans tell the truth.
