For PGSQL Phriday #016, Ryan’s asked us to “…discuss your process for tuning difficult queries. Specifically, try to focus on that one problematic query that really challenged you and you always use it as an example when helping or teaching others your methods.”
(Yes, I am two months late publishing this.)
Here are the generic steps I take, mostly in order.
- Pause and reflect on the holy trinity of troubleshooting questions: what did you expect to happen, what actually happened, and how long has it been acting this way?
- Just like working on any other codebase, check the query into source control so you can track your changes.
- Run it through pg_format (https://github.com/darold/pgFormatter) for readability. This is especially useful if multiple people have worked on the query and brought their own ideas about SQL style.
- Verify that the tables used in the query have been ANALYZEd recently.
- Create a few passing and failing tests. This will help verify that you are still getting the results you expect. I use pgTAP (https://pgtap.org/).
- Give the query a quick scan for obvious things like LIMIT without ORDER BY, DISTINCT on columns that already have unique constraints, and UNION where a UNION ALL would work.
- If I can get the output of EXPLAIN (ANALYZE, BUFFERS, TIMING) from a production or production-like system, great. (Don’t forget to run it in a transaction if it will alter data.) Plug it into https://explain.depesz.com, and look for wildly differing estimated vs actual row counts, loops > 1 on large datasets, and any red areas. All of those are good starting points.*
- Unless it’s a fairly small query to start with, break it into its component pieces. CTEs, subselects, and UNIONs are all good breakpoints. This is a critical step to finding slow points if I don’t have access to EXPLAIN output.
- For each piece, review the FROM clause first, since it’s first in SQL order of operations. Check for tables which have no columns in the SELECT statement. If it’s not needed as a join table, try removing it. (This is where it’s crucial to have reliable tests.)
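To make the EXPLAIN step above concrete, here’s the general shape I use when the query modifies data; the table and UPDATE are purely illustrative, not from the query in question:

```sql
BEGIN;

-- EXPLAIN ANALYZE actually executes the statement, so wrap anything
-- that writes data in a transaction you can throw away.
EXPLAIN (ANALYZE, BUFFERS, TIMING)
UPDATE orders
   SET status = 'shipped'
 WHERE shipped_at IS NOT NULL;

ROLLBACK;  -- the UPDATE never commits; you keep only the plan output
```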
After that I just make it up as I go.
As a specific example, there’s a query that certainly wasn’t the worst I’ve ever worked on, but it was the most memorable, because I learned about join_collapse_limit.
It was part of a migration from another RDBMS, and took 20 minutes to return a hundred or so rows on Postgres. It featured multiple UNIONs, and each query had INNER and OUTER JOINs on a dozen+ tables. Throw in very similarly-named tables (imagine “service_server_servers”, “service_servers_server”, “servers_service_servers” etc), aliases that bore no apparent relation to their table names, and multiple different SQL styles, and I knew immediately I would be entertained for days.
I had a couple of things going for me: 1) a fixed dataset (so I could be sure I wasn’t getting incorrect values back) and 2) the query was written by actual humans who were really invested in making it better.
I’d gone through my usual steps: set up a whiteboard and used every color of dry erase marker I owned to diagram the JOINs, removed some unneeded JOINs, and converted some subqueries to JOINs. This shaved off a few minutes, but didn’t provide the gains I was hoping for.
Looking at the EXPLAIN plan for one of the UNIONed queries, the working dataset from the first two joined tables was several hundred thousand records. I checked pg_stats for the various tables and query parameters and thought that if I re-ordered the JOINs to select a smaller dataset first, that would limit the size of the subsequent datasets. But it didn’t have any reliably positive effect.
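The kind of pg_stats reconnaissance I mean looks something like this; pg_stats is the system view populated by ANALYZE, but the table and column names below are made up to echo the example above:

```sql
-- Estimate selectivity before deciding which join should run first.
-- n_distinct < 0 means "a fraction of the row count" (e.g. -1 = all unique).
SELECT tablename,
       attname,
       n_distinct,
       null_frac,
       most_common_vals
  FROM pg_stats
 WHERE schemaname = 'public'
   AND tablename IN ('service_server_servers', 'service_servers_server')
   AND attname = 'server_id';  -- hypothetical join column
```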
A coworker (thanks, Jeff!) tipped me off to join_collapse_limit; if you’re JOINing more than join_collapse_limit tables, the planner won’t try every single potential plan, because it’s not a good use of system resources. The default value is 8.
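You can check what your session is currently using with a one-liner:

```sql
-- Defaults to 8: past that many items in the FROM list, the planner
-- stops exhaustively reordering joins.
SHOW join_collapse_limit;
```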
I set join_collapse_limit to 1 in my session (don’t set this globally) to force Postgres to respect my written JOIN order. This made a significant improvement in speed. I’ve used this method successfully a few times since then, but note that if the distribution of data changes significantly, you may need to re-order the joins.
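In practice the session-level SET is what I described above; SET LOCAL is a slightly safer variant I’d suggest considering, since it resets on its own:

```sql
-- Session-scoped: affects only the current connection, not the server.
SET join_collapse_limit = 1;

-- Or transaction-scoped: reverts automatically at COMMIT/ROLLBACK.
BEGIN;
SET LOCAL join_collapse_limit = 1;
-- run the problem query here; joins execute in written order
COMMIT;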
* Henrietta Dombrovskaya’s book is a good reference to learn about different scan and join types; this is helpful for interpreting EXPLAIN plans beyond these three tips. https://www.goodreads.com/book/show/57285570-postgresql-query-optimization