Grouping pg_stat_statements, the right way
pg_stat_statements is one of the most used extensions in PostgreSQL, and for a good reason: it gives fast and very valuable insights into how the backend is spending its precious time.There is however at least one issue that has been nagging at me for a while, enough so that i recently set aside an evening to, once and for all i hope, solve it.
The problem.
First make sure that the pg_stat_statements extension is installed:SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements' ;
-- if it is not installed, do
CREATE EXTENSION pg_stat_statements;
if that didn't work, google it. The problem is probably a missing postgresqlNN-contrib package install (NN is version number) or a missing
shared_preload_libraries = 'pg_stat_statements'
in the postgres.conf.Now that that is out of the way, lets set up the test.
We make a table test with some test data:
CREATE TABLE test (
id SERIAL PRIMARY KEY,
a INT,
b INT
);
WITH generated_data AS (select generate_series(0,99) AS i)
INSERT INTO test(a,b)
SELECT 1+FLOOR(generated_data.i/8), 1+FLOOR(generated_data.i%9) FROM generated_data;
SELECT pg_stat_statements_reset(); -- clean up the statistics to get less clutter later on. Not per se needed, but practical.
And on this test table we run some sample queries:
CREATE TABLE test (
SELECT count(id) FROM test WHERE a IN (1) AND b IN (1);
SELECT count(id) FROM test WHERE a IN (2) AND b IN (2);
SELECT count(id) FROM test WHERE a IN (3) AND b IN (3);
SELECT count(id) FROM test WHERE a IN (1,2) AND b IN (1);
SELECT count(id) FROM test WHERE a IN (2,3) AND b IN (2);
SELECT count(id) FROM test WHERE a IN (3,4) AND b IN (3);
SELECT count(id) FROM test WHERE a IN (1) AND b IN (1,2);
SELECT count(id) FROM test WHERE a IN (2) AND b IN (2,3);
SELECT count(id) FROM test WHERE a IN (3) AND b IN (3,4);
SELECT count(id) FROM test WHERE a IN (1, 2) AND b IN (1,2);
SELECT count(id) FROM test WHERE a IN (2,3) AND b IN (2, 3);
SELECT count(id) FROM test WHERE a IN (3 , 4 ) AND b IN (3 , 4 );
SELECT count(id) FROM test WHERE a IN (3,4,5, 6,7) AND b IN (3, 4,5,6,7);
Note the funky spaces in some places, that was to trip up bad regexes. Sloppy data in tests leads to better code.A typical scan of the current pg_stat_statements may look something like:
SELECT userid, dbid, toplevel, queryid, query, calls, total_exec_time, min_exec_time, max_exec_time, mean_exec_time, stddev_exec_time, rows
FROM pg_stat_statements WHERE query ILIKE 'SELECT count(id)%';
Where you see something like
+------+-----+--------+--------------------+-----------------------------------------------------------------------------------+-----+------------------+------------------+------------------+------------------+--------------------+----+
|userid|dbid |toplevel|queryid |query |calls|total_exec_time |min_exec_time |max_exec_time |mean_exec_time |stddev_exec_time |rows|
+------+-----+--------+--------------------+-----------------------------------------------------------------------------------+-----+------------------+------------------+------------------+------------------+--------------------+----+
|10 |16384|true |5932379513563548079 |SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2,$3) |3 |3.298 |0.5394 |1.6688 |1.0993333333333333|0.4611248951085692 |3 |
|10 |16384|true |-3349543303082417606|SELECT count(id) FROM test WHERE a IN ($1,$2,$3, $4,$5) AND b IN ($6, $7,$8,$9,$10)|1 |0.6315999999999999|0.6315999999999999|0.6315999999999999|0.6315999999999999|0 |1 |
|10 |16384|true |302831636317716902 |SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2) |3 |2.473 |0.6579 |1.0988 |0.8243333333333334|0.19553619159179256 |3 |
|10 |16384|true |5465811682035262379 |SELECT count(id) FROM test WHERE a IN ($1,$2) AND b IN ($3) |3 |1.7217000000000002|0.4392 |0.7081000000000001|0.5739000000000001|0.10977853463526774 |3 |
|10 |16384|true |4173960216733908951 |SELECT count(id) FROM test WHERE a IN ($1, $2) AND b IN ($3,$4) |3 |2.1121 |0.6792 |0.7444 |0.7040333333333333|0.028794945544120876|3 |
+------+-----+--------+--------------------+-----------------------------------------------------------------------------------+-----+------------------+------------------+------------------+------------------+--------------------+----+
Which is annoying, since it is basically 5 times the same query with a different number of parameters.In this trivial example it is quite obvious what is going on, but imagine running a top 100 on a loaded system, you can easily either miss a query that has many mutations like this, or be swamped by dozens of times the same query, neither of which is ideal.
Solution
To resolve all this in one swoop I created the view below. I admit it is a bit convoluted.It is leaning on 3 principles:
- A regex to hunt down parameter-lists,
- The aggregation of statistics (explained in two previous blog-posts part one and part two) which shows by the formulas that are used to combine the stddev_plan_time and the stddev_exec_time values
- And the use of CTEs.
CREATE OR REPLACE VIEW pg_stat_statements_grouped AS
WITH pgsss AS (
WITH pgss AS (
SELECT queryid,
length(regexp_replace(query, '[^\$]', '', 'g'))::int2 AS nr_of_params,
REGEXP_REPLACE(query, '\((\$[0-9]+)(\ *,\ *\$[0-9]+)*\)', '(###)', 'g') AS abstract_query,
plans, total_plan_time, min_plan_time, max_plan_time, mean_plan_time, stddev_plan_time,
calls, total_exec_time, min_exec_time, max_exec_time, mean_exec_time, stddev_exec_time,
rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written,
local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written,
temp_blks_read, temp_blks_written, blk_read_time, blk_write_time, temp_blk_read_time, temp_blk_write_time, wal_records, wal_fpi, wal_bytes,
jit_functions, jit_generation_time, jit_inlining_count, jit_inlining_time, jit_optimization_count, jit_optimization_time, jit_emission_count, jit_emission_time
FROM pg_stat_statements
ORDER BY abstract_query ASC, nr_of_params ASC)
SELECT array_agg(queryid) AS queryid,
count(DISTINCT pgss.queryid) AS nr_queryids,
MIN(nr_of_params) AS min_nr_of_params,
MAX(nr_of_params) AS max_nr_of_params,
SUM(plans) AS plans,
SUM(total_plan_time) AS total_plan_time,
MIN(min_plan_time) AS min_plan_time,
MAX(max_plan_time) AS max_plan_time,
SUM(total_plan_time)/SUM(calls) AS mean_plan_time,
CASE WHEN SUM(plans)=0 THEN 0 ELSE
SQRT (
( SUM((POWER(stddev_plan_time,2) + POWER(mean_plan_time,2)) * plans) / SUM( plans ) )
- POWER(( SUM(total_plan_time) / SUM(plans) ), 2)
) END AS stddev_plan_time,
SUM(calls) AS calls,
SUM(total_exec_time) AS total_exec_time,
MIN(min_exec_time) AS min_exec_time,
MAX(max_exec_time) AS max_exec_time,
SUM(total_exec_time)/SUM(calls) AS mean_exec_time,
CASE WHEN SUM(calls)=0 THEN 0 ELSE
SQRT (
( SUM((POWER(stddev_exec_time,2) + POWER(mean_exec_time,2)) * calls) / SUM( calls ) )
- POWER(( SUM(total_exec_time) / SUM(calls) ), 2)
) END AS stddev_exec_time,
SUM(rows) AS rows,
SUM(shared_blks_hit) AS shared_blks_hit,
SUM(shared_blks_read) AS shared_blks_read,
SUM(shared_blks_dirtied) AS shared_blks_dirtied,
SUM(shared_blks_written) AS shared_blks_written,
SUM(local_blks_hit) AS local_blks_hit,
SUM(local_blks_read) AS local_blks_read,
SUM(local_blks_dirtied) AS local_blks_dirtied,
SUM(local_blks_written) AS local_blks_written,
SUM(temp_blks_read) AS temp_blks_read,
SUM(temp_blks_written) AS temp_blks_written,
SUM(blk_read_time) AS blk_read_time,
SUM(blk_write_time) AS blk_write_time,
SUM(wal_records) AS wal_records,
SUM(wal_fpi) AS wal_fpi,
SUM(wal_bytes) AS wal_bytes_m,
SUM(jit_functions) AS jit_functions,
SUM(jit_generation_time) AS jit_generation_time,
SUM(jit_inlining_count) AS jit_inlining_count,
SUM(jit_inlining_time) AS jit_inlining_time,
SUM(jit_optimization_count) AS jit_optimization_count,
SUM(jit_optimization_time) AS jit_optimization_time,
SUM(jit_emission_count) AS jit_emission_count,
SUM(jit_emission_time) AS jit_emission_time
FROM pgss
GROUP BY pgss.abstract_query)
SELECT pss.userid AS rep_userid,
pss.dbid AS rep_dbid,
pss.toplevel AS rep_toplevel,
pgsss.queryid[1] AS rep_queryid,
pss.query AS rep_query,
array_to_string(pgsss.queryid, ',') queryids,
pgsss.nr_queryids,
pgsss.min_nr_of_params,
pgsss.max_nr_of_params,
pgsss.plans, pgsss.total_plan_time, pgsss.min_plan_time, pgsss.max_plan_time, pgsss.mean_plan_time, pgsss.stddev_plan_time,
pgsss.calls, pgsss.total_exec_time, pgsss.min_exec_time, pgsss.max_exec_time, pgsss.mean_exec_time, pgsss.stddev_exec_time,
pgsss.rows, pgsss.shared_blks_hit, pgsss.shared_blks_read, pgsss.shared_blks_dirtied, pgsss.shared_blks_written,
pgsss.local_blks_hit, pgsss.local_blks_read, pgsss.local_blks_dirtied, pgsss.local_blks_written,
pgsss.temp_blks_read, pgsss.temp_blks_written, pgsss.blk_read_time, pgsss.blk_write_time,
pgsss.wal_records, pgsss.wal_fpi, pgsss.wal_bytes_m,
pgsss.jit_functions, pgsss.jit_generation_time, pgsss.jit_inlining_count, pgsss.jit_inlining_time,
pgsss.jit_optimization_count, pgsss.jit_optimization_time, pgsss.jit_emission_count, pgsss.jit_emission_time
FROM pgsss INNER JOIN pg_stat_statements pss ON pss.queryid = pgsss.queryid[1];
GRANT SELECT ON pg_stat_statements_grouped TO PUBLIC;
It is perhaps not the most elegant query I have made of late (well its a pig okay), but it seems to work well.SELECT * FROM pg_stat_statements_grouped WHERE rep_query ILIKE 'SELECT count(id)%';
Gives a result like this: (everything about planning and also everything after 'rows' omitted for clarity)
+----------+--------+------------+------------------+--------------------------------------------------------+---------------------------------------------------------------------------------------------------+-----------+----------------+----------------+-----+---------------+-------------+-------------+--------------+----------------+-----+------------------+-------------+-------------+------------------+-------------------+----+
|rep_userid|rep_dbid|rep_toplevel|rep_queryid |rep_query |queryids |nr_queryids|min_nr_of_params|max_nr_of_params|plans|total_plan_time|min_plan_time|max_plan_time|mean_plan_time|stddev_plan_time|calls|total_exec_time |min_exec_time|max_exec_time|mean_exec_time |stddev_exec_time |rows|
+----------+--------+------------+------------------+--------------------------------------------------------+---------------------------------------------------------------------------------------------------+-----------+----------------+----------------+-----+---------------+-------------+-------------+--------------+----------------+-----+------------------+-------------+-------------+------------------+-------------------+----+
|10 |16384 |true |302831636317716902|SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2)|302831636317716902,5932379513563548079,5465811682035262379,4173960216733908951,-3349543303082417606|5 |2 |10 |0 |0 |0 |0 |0 |0 |13 |10.236400000000001|0.4392 |1.6688 |0.7874153846153847|0.31244974450339436|13 |
+----------+--------+------------+------------------+--------------------------------------------------------+---------------------------------------------------------------------------------------------------+-----------+----------------+----------------+-----+---------------+-------------+-------------+--------------+----------------+-----+------------------+-------------+-------------+------------------+-------------------+----+
Note that there are a couple columns that are different from the traditional pg_stat_statements: Everything with the rep_ prefix is a representative of the grouped rows (where applicable). The representative is chosen by selecting the version of the query (within a group) that has the lowest number of parameters. This is for convenience, especially when trying to test generic query-plans. (more on that subject in a later blog-post)
- rep_userid -> the userid field as it is in the representative row
- rep_dbid -> the dbid field as it is in the representative row
- rep_toplevel -> the toplevel field as it is in the representative row
- rep_queryid -> the queryid of the representative row
- rep_query -> the query of the representative row
- queryids -> a comma-separated list of the queryids in this group
- nr_queryids -> the number of queries in pg_stat_statements that this group is based on
- min_nr_of_params -> the minimum number of parameters found in this query-group. Note that this is a count of all parameters found in the query-body, including values that one would normaly see as static.
- max_nr_of_params -> the maximum number of parameters found in this query-group.
I hope this helps, and if it does, or if you encounter a problem or spot an error, please do not hesitate to let me know.
Ellert van Koperen, January 2023.