Optimizing PostgreSQL Query Performance: From 30 Seconds to 50 Milliseconds
Our analytics dashboard was basically broken. Users would click to load their engagement metrics and then... wait. And wait some more. Thirty seconds later, they'd get their data. Some gave up before then. I watched one product manager close the tab, mutter something unprintable, and go back to manually exporting CSVs. Not exactly the "data-driven culture" we were aiming for.
The database server wasn't having a great time either—pinned at 90% CPU, connections piling up, and our on-call rotation dreading the inevitable "dashboard is down again" alerts at 2am.
Two weeks of digging through query plans and fighting with indexes later, that same 30-second query runs in 50 milliseconds. The dashboard feels instant. The database server is practically idle. Here's how we got there, including the dead ends and false starts.
What We Were Up Against
The main query powering our dashboard seemed reasonable at first glance. Pull user info, count their events and sessions, calculate some averages. Simple stuff:
SELECT
    u.id,
    u.email,
    u.created_at,
    COUNT(DISTINCT e.id) AS event_count,
    COUNT(DISTINCT s.id) AS session_count,
    MAX(e.created_at) AS last_activity,
    AVG(s.duration) AS avg_session_duration
FROM users u
LEFT JOIN events e ON e.user_id = u.id
LEFT JOIN sessions s ON s.user_id = u.id
WHERE u.created_at >= '2023-01-01'
  AND u.status = 'active'
GROUP BY u.id, u.email, u.created_at
ORDER BY event_count DESC
LIMIT 100;
Turns out "simple" doesn't mean "fast" when you're joining 500,000 users against 10 million events and 2 million sessions. Execution time clocked in at 32,456 milliseconds. More than half a minute to show 100 rows.
Actually Understanding the Problem
My first instinct was to just throw some indexes at it and call it done. That's what you do with slow queries, right? Add indexes, problem solved?
Yeah, that didn't work. I created indexes on every foreign key, reran the query, and it was still taking 30+ seconds. Time to actually figure out what was happening.
EXPLAIN ANALYZE became my new best friend. If you're not familiar with it, EXPLAIN ANALYZE runs your query and shows you exactly what PostgreSQL is doing under the hood—which indexes it's using (or not using), how many rows it's scanning, how long each step takes. It's like getting X-ray vision into your database.
EXPLAIN (ANALYZE, BUFFERS, VERBOSE) SELECT ...
The output was... not encouraging:
Hash Left Join (cost=458392.23..892341.67 rows=100 width=88) (actual time=28453.234..31234.892 rows=100 loops=1)
Hash Cond: (u.id = e.user_id)
Buffers: shared hit=12543 read=234567
-> Seq Scan on users u (cost=0.00..12453.22 rows=156789 width=44)
Filter: ((created_at >= '2023-01-01'::date) AND (status = 'active'::text))
Rows Removed by Filter: 300000
-> Hash (cost=234567.45..234567.45 rows=9876543 width=16)
Buckets: 131072 Batches: 128 Memory Usage: 4567kB
-> Seq Scan on events e
See all those "Seq Scan" entries? That's PostgreSQL-speak for "reading the entire table." We were scanning the full users table, the full events table, and the full sessions table just to return 100 rows. The hash join had 128 batches, meaning it was spilling to disk because it couldn't fit everything in memory. Over 230,000 blocks read from disk for a single query.
No wonder it was slow.
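If you want to confirm that the batching really is a memory problem, one quick check (not something we kept as a fix, just a diagnostic) is to raise work_mem for your session only and re-run the EXPLAIN. If the batch count drops to 1, the hash table fits in memory:

SET work_mem = '256MB';                   -- session-only, not a permanent change
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;    -- the same dashboard query as above
RESET work_mem;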
Getting Indexes Right
So my initial assumption about indexes wasn't wrong—I just did it badly. Creating random indexes and hoping one sticks isn't a strategy. You need to think about what the query planner needs to make good decisions.
The users table was getting filtered by both status and created_at. Instead of separate indexes on each column, we needed them together:
CREATE INDEX idx_users_status_created ON users (status, created_at) WHERE status = 'active';
That WHERE status = 'active' clause makes it a partial index: it only covers active users, which made it roughly 60% smaller than an index over everyone. Smaller index means more of it fits in memory, which means faster lookups. The sequential scan on users dropped from 2,345ms to 12ms. Not bad.
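If you're curious how much the partial predicate actually saves, pg_stat_user_indexes will list every index on the table along with its on-disk size and how often the planner has used it. Something like this (the column aliases are just for readability):

SELECT indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan AS times_used
FROM pg_stat_user_indexes
WHERE relname = 'users';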
But here's where it gets interesting—those foreign keys I indexed earlier? They weren't actually helping. I'd created them, sure, but PostgreSQL wasn't using them. Turns out the query planner wasn't convinced they'd be helpful given the query structure. We needed more specific indexes:
CREATE INDEX idx_events_user_id ON events (user_id);
CREATE INDEX idx_sessions_user_id ON sessions (user_id);
CREATE INDEX idx_events_user_created ON events (user_id, created_at);
That last one acts as a covering index for this query: it includes both columns we need from the events table, so the database can answer our query entirely from the index without touching the actual table data. Index-only scans are stupidly fast.
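Since PostgreSQL 11 you can also build a covering index explicitly with an INCLUDE clause, which keeps the extra column out of the search key while still allowing index-only scans. We didn't end up needing it here, but a sketch of the same idea (the index name is just an example):

-- Alternative (PostgreSQL 11+): user_id stays the key, created_at is
-- carried as payload so index-only scans can still return it.
CREATE INDEX idx_events_user_incl_created
    ON events (user_id) INCLUDE (created_at);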
After adding these indexes, we were down to about 4 seconds. Much better than 32, but still not great. The problem was the query itself.
Rewriting the Query
Here's something I learned the hard way: even with perfect indexes, a poorly structured query is still slow. Our original query was doing COUNT(DISTINCT) on millions of rows, which is expensive. Multiple aggregations from the same large tables, all in one big messy join.
We rewrote it using CTEs (Common Table Expressions—basically subqueries with names) to break the work into manageable chunks:
WITH user_metrics AS (
    SELECT user_id,
           COUNT(*) AS event_count,
           MAX(created_at) AS last_activity
    FROM events
    WHERE user_id IN (
        SELECT id FROM users
        WHERE status = 'active' AND created_at >= '2023-01-01'
    )
    GROUP BY user_id
),
session_metrics AS (
    SELECT user_id,
           COUNT(*) AS session_count,
           AVG(duration) AS avg_session_duration
    FROM sessions
    WHERE user_id IN (
        SELECT id FROM users
        WHERE status = 'active' AND created_at >= '2023-01-01'
    )
    GROUP BY user_id
)
SELECT u.id,
       u.email,
       u.created_at,
       COALESCE(em.event_count, 0) AS event_count,
       COALESCE(sm.session_count, 0) AS session_count,
       em.last_activity,
       sm.avg_session_duration
FROM users u
LEFT JOIN user_metrics em ON em.user_id = u.id
LEFT JOIN session_metrics sm ON sm.user_id = u.id
WHERE u.status = 'active'
  AND u.created_at >= '2023-01-01'
ORDER BY event_count DESC
LIMIT 100;
Now we're filtering first (using those nice indexes), then aggregating on smaller datasets, then joining the results. PostgreSQL can also bring parallel workers to bear on the heavy aggregation steps, so multiple CPU cores share the work.
Execution time: 892ms. Getting closer.
The Materialized View Trick
At this point, I had to ask a question: does this dashboard really need real-time data? Are our metrics changing so fast that a 5-minute delay would matter?
The answer was no. Nobody was using the dashboard to make split-second decisions. They were checking trends, looking at aggregate patterns. Data from 5 minutes ago was just as valuable as data from right now.
So we built a materialized view—essentially a cached version of our query results that refreshes periodically:
CREATE MATERIALIZED VIEW user_engagement_summary AS
WITH user_metrics AS (
    SELECT user_id,
           COUNT(*) AS event_count,
           MAX(created_at) AS last_activity
    FROM events
    GROUP BY user_id
),
session_metrics AS (
    SELECT user_id,
           COUNT(*) AS session_count,
           AVG(duration) AS avg_session_duration
    FROM sessions
    GROUP BY user_id
)
SELECT u.id,
       u.email,
       u.created_at,
       u.status,
       COALESCE(em.event_count, 0) AS event_count,
       COALESCE(sm.session_count, 0) AS session_count,
       em.last_activity,
       sm.avg_session_duration
FROM users u
LEFT JOIN user_metrics em ON em.user_id = u.id
LEFT JOIN session_metrics sm ON sm.user_id = u.id;

CREATE INDEX idx_engagement_status_events
    ON user_engagement_summary (status, event_count DESC)
    WHERE status = 'active';
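The refresh itself is a single statement. If you want reads to keep working while it runs, REFRESH MATERIALIZED VIEW CONCURRENTLY does that, at the cost of requiring a unique index on the view. A sketch of that approach (the unique index name is just an example):

-- CONCURRENTLY needs a unique index on the materialized view.
CREATE UNIQUE INDEX idx_engagement_summary_id
    ON user_engagement_summary (id);

-- Run this on a schedule, e.g. from cron or pg_cron.
REFRESH MATERIALIZED VIEW CONCURRENTLY user_engagement_summary;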
We set up a cron job to refresh it every 5 minutes. Now the dashboard query is trivial:
SELECT *
FROM user_engagement_summary
WHERE status = 'active'
  AND created_at >= '2023-01-01'
ORDER BY event_count DESC
LIMIT 100;
Eighteen milliseconds. We'd basically won at this point, but there was one more problem to fix.
Connection Chaos
While the query itself was fast, users were still seeing delays. Turns out we had a different problem: connection overhead. Every time someone loaded the dashboard, the app would open a new database connection, run the query, then close it. Opening a connection to PostgreSQL takes 50-200ms of handshaking and authentication. With hundreds of concurrent users, we were hitting connection limits and the database was spending more time managing connections than running queries.
Enter PgBouncer, a lightweight connection pooler. It sits between your application and PostgreSQL, maintaining a pool of open connections and routing queries through them. Instead of each request opening its own connection, they all share from a pool of maybe 25-30 connections.
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 10000
default_pool_size = 25
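You can watch the pool doing its job from PgBouncer's admin console, assuming you've listed an admin user under admin_users in the config (PgBouncer listens on port 6432 by default; the user name below is just an example):

-- Connect to the virtual "pgbouncer" database, e.g.:
--   psql -h localhost -p 6432 -U pgbouncer_admin pgbouncer
SHOW POOLS;   -- active and waiting client connections per pool
SHOW STATS;   -- queries and traffic routed through each database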
Connection time dropped from 150ms to 2ms. The database went from handling 1,000+ simultaneous connections to about 30. CPU usage plummeted.
We also tuned PostgreSQL's own configuration. The defaults are conservative because they need to work on any hardware, from a Raspberry Pi to a datacenter server. But we knew our hardware—32GB of RAM, SSDs, 8 cores—so we adjusted accordingly:
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 50MB
maintenance_work_mem = 2GB
max_parallel_workers_per_gather = 4
random_page_cost = 1.1
That random_page_cost setting is interesting—it tells PostgreSQL how expensive random disk access is. The default assumes spinning hard drives (where random access is slow), but we're on SSDs where random and sequential access are nearly equivalent. Lowering this value made the query planner more willing to use indexes.
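These are plain postgresql.conf settings, but most of them can also be applied with ALTER SYSTEM plus a config reload, which is a convenient way to try a value like random_page_cost before committing to it (shared_buffers still needs a full restart):

ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();   -- picks up the change without a restart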
Final total time, including all network overhead: 50 milliseconds.
Keeping It Fast
Database performance isn't a "set it and forget it" thing. It degrades over time as data grows and access patterns change. We set up monitoring to catch problems early.
PostgreSQL has a built-in extension called pg_stat_statements that tracks every query execution:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

SELECT query,
       calls,
       total_exec_time / 1000 AS total_time_seconds,
       mean_exec_time / 1000 AS mean_time_seconds
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
This shows us which queries are slow over time, not just right now. Sometimes a query is fine on Monday but terrible on Friday after the weekly data import. This catches those patterns.
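One habit worth copying: reset the counters after a deploy or a big schema change, so the averages reflect the current code rather than months of old history. The extension ships a function for exactly this:

SELECT pg_stat_statements_reset();   -- clears the accumulated query statistics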
We also check for unused indexes every month:
SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
Indexes aren't free—they take up disk space and slow down writes. If an index isn't being used, it's just dead weight. We've dropped several over time.
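When we do retire one, we drop it with CONCURRENTLY so the operation doesn't take a lock that blocks writes on the table (the index name below is just an example):

DROP INDEX CONCURRENTLY IF EXISTS idx_events_legacy_lookup;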
What Actually Mattered
Looking back, the biggest wins came from understanding the problem before trying to fix it. EXPLAIN ANALYZE showed us exactly what was slow. Most of the improvement (maybe 60%) came from proper indexing. Another 25% from restructuring the query. The materialized view gave us another 10%, and connection pooling plus configuration tuning accounted for the last 5%.
But here's the thing—if we'd guessed at solutions without measuring first, we'd probably still be stuck with a slow dashboard. I spent the first day trying "obvious" fixes that did nothing because I was solving imaginary problems instead of real ones.
The partial index trick was something I'd never used before this. I always thought of indexes as "index the whole column or don't," but partial indexes can be massively more efficient when you're usually filtering on the same values. That WHERE status = 'active' clause on the index saved gigabytes of space and made queries faster.
Materialized views feel almost like cheating. You're trading freshness for speed, but for a lot of use cases, that's exactly the right tradeoff. Real-time data is overrated if nobody's actually making real-time decisions with it.
The Numbers
Before: 32,456ms query time, 90% CPU usage, 1,200+ connections, database server constantly on fire.
After: 50ms query time, 15% CPU usage, 30 connections, database server basically idle.
That's 649 times faster. The dashboard loads in under a second now instead of 35 seconds. Users are actually using it. The product manager who was manually exporting CSVs? She's happy. The on-call rotation? Much quieter.
Tools That Helped
A few things made this possible: EXPLAIN ANALYZE for understanding query plans, pg_stat_statements for long-term monitoring, PgBouncer for connection pooling, and pgBadger for analyzing log files. All free, all essential.
If you're dealing with slow PostgreSQL queries, start with EXPLAIN ANALYZE. Don't guess. Look at what's actually happening, find the bottleneck, fix that specific thing, then measure again. Add indexes strategically, not randomly. Consider materialized views for read-heavy workloads. Use connection pooling. Tune your configuration for your actual hardware.
And maybe most importantly: question your assumptions. We built this dashboard assuming everything needed to be real-time, but that assumption was costing us 30 seconds per page load. Sometimes the best optimization is realizing you don't need to optimize everything.