Grouping pg_stat_statements, the right way

pg_stat_statements is one of the most used extensions in PostgreSQL, and for a good reason: it gives fast and very valuable insights into how the backend is spending its precious time.
There is however at least one issue that has been nagging at me for a while, enough so that i recently set aside an evening to, once and for all i hope, solve it.

The problem.

First make sure that the pg_stat_statements extension is installed:
SELECT * FROM pg_available_extensions WHERE name = 'pg_stat_statements' ;
-- if it is not installed, do
CREATE EXTENSION pg_stat_statements;
if that didn't work, google it. The problem is probably a missing postgresqlNN-contrib package install (NN is version number) or a missing
shared_preload_libraries = 'pg_stat_statements'
in the postgres.conf.

Now that that is out of the way, lets set up the test.
We make a table test with some test data:
	a INT,
	b INT

WITH generated_data AS (select generate_series(0,99) AS i)
INSERT INTO test(a,b)
SELECT 1+FLOOR(generated_data.i/8), 1+FLOOR(generated_data.i%9) FROM generated_data;

SELECT pg_stat_statements_reset(); -- clean up the statistics to get less clutter later on. Not per se needed, but practical.

And on this test table we run some sample queries:
SELECT count(id) FROM test WHERE a IN (1) AND b IN (1);
SELECT count(id) FROM test WHERE a IN (2) AND b IN (2);
SELECT count(id) FROM test WHERE a IN (3) AND b IN (3);

SELECT count(id) FROM test WHERE a IN (1,2) AND b IN (1);
SELECT count(id) FROM test WHERE a IN (2,3) AND b IN (2);
SELECT count(id) FROM test WHERE a IN (3,4) AND b IN (3);

SELECT count(id) FROM test WHERE a IN (1) AND b IN (1,2);
SELECT count(id) FROM test WHERE a IN (2) AND b IN (2,3);
SELECT count(id) FROM test WHERE a IN (3) AND b IN (3,4);

SELECT count(id) FROM test WHERE a IN (1, 2) AND b IN (1,2);
SELECT count(id) FROM test WHERE a IN (2,3) AND b IN (2, 3);
SELECT count(id) FROM test WHERE a IN (3 , 4 ) AND b IN (3 , 4 );

SELECT count(id) FROM test WHERE a IN (3,4,5, 6,7) AND b IN (3, 4,5,6,7);
Note the funky spaces in some places, that was to trip up bad regexes. Sloppy data in tests leads to better code.

A typical scan of the current pg_stat_statements may look something like:
SELECT userid, dbid, toplevel, queryid, query, calls, total_exec_time, min_exec_time, max_exec_time, mean_exec_time, stddev_exec_time, rows 
												FROM pg_stat_statements WHERE query ILIKE 'SELECT count(id)%';

Where you see something like
|userid|dbid |toplevel|queryid             |query                                                                              |calls|total_exec_time   |min_exec_time     |max_exec_time     |mean_exec_time    |stddev_exec_time    |rows|
|10    |16384|true    |5932379513563548079 |SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2,$3)                        |3    |3.298             |0.5394            |1.6688            |1.0993333333333333|0.4611248951085692  |3   |
|10    |16384|true    |-3349543303082417606|SELECT count(id) FROM test WHERE a IN ($1,$2,$3, $4,$5) AND b IN ($6, $7,$8,$9,$10)|1    |0.6315999999999999|0.6315999999999999|0.6315999999999999|0.6315999999999999|0                   |1   |
|10    |16384|true    |302831636317716902  |SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2)                           |3    |2.473             |0.6579            |1.0988            |0.8243333333333334|0.19553619159179256 |3   |
|10    |16384|true    |5465811682035262379 |SELECT count(id) FROM test WHERE a IN ($1,$2) AND b IN ($3)                        |3    |1.7217000000000002|0.4392            |0.7081000000000001|0.5739000000000001|0.10977853463526774 |3   |
|10    |16384|true    |4173960216733908951 |SELECT count(id) FROM test WHERE a IN ($1, $2) AND b IN ($3,$4)                    |3    |2.1121            |0.6792            |0.7444            |0.7040333333333333|0.028794945544120876|3   |
Which is annoying, since it is basically 5 times the same query with a different number of parameters.
In this trivial example it is quite obvious what is going on, but imagine running a top 100 on a loaded system, you can easily either miss a query that has many mutations like this, or be swamped by dozens of times the same query, neither of which is ideal.


To resolve all this in one swoop I created the view below. I admit it is a bit convoluted.
It is leaning on 3 principles:
- A regex to hunt down parameter-lists,
- The aggregation of statistics (explained in two previous blog-posts part one and part two) which shows by the formulas that are used to combine the stddev_plan_time and the stddev_exec_time values
- And the use of CTEs.

CREATE OR REPLACE VIEW pg_stat_statements_grouped AS
WITH pgsss AS (
	WITH pgss AS (
		SELECT	queryid,
				length(regexp_replace(query, '[^\$]', '', 'g'))::int2 AS nr_of_params,
				REGEXP_REPLACE(query, '\((\$[0-9]+)(\ *,\ *\$[0-9]+)*\)', '(###)', 'g') AS abstract_query,
				plans, total_plan_time, min_plan_time, max_plan_time, mean_plan_time, stddev_plan_time,
				calls, total_exec_time, min_exec_time, max_exec_time, mean_exec_time, stddev_exec_time,
				rows, shared_blks_hit, shared_blks_read, shared_blks_dirtied, shared_blks_written,
				local_blks_hit, local_blks_read, local_blks_dirtied, local_blks_written,
				temp_blks_read, temp_blks_written, blk_read_time, blk_write_time, temp_blk_read_time, temp_blk_write_time, wal_records, wal_fpi, wal_bytes,
				jit_functions, jit_generation_time, jit_inlining_count, jit_inlining_time, jit_optimization_count, jit_optimization_time, jit_emission_count, jit_emission_time
		FROM pg_stat_statements
		ORDER BY abstract_query ASC, nr_of_params ASC)
	SELECT	array_agg(queryid) AS queryid,
			count(DISTINCT pgss.queryid) AS nr_queryids,
			MIN(nr_of_params) AS min_nr_of_params,
			MAX(nr_of_params) AS max_nr_of_params,
			SUM(plans) AS plans,
			SUM(total_plan_time) AS total_plan_time,
			MIN(min_plan_time) AS min_plan_time,
			MAX(max_plan_time) AS max_plan_time,
			SUM(total_plan_time)/SUM(calls) AS mean_plan_time,
				SQRT (
					( SUM((POWER(stddev_plan_time,2) + POWER(mean_plan_time,2)) * plans) / SUM( plans ) )
					- POWER(( SUM(total_plan_time) / SUM(plans) ), 2)
				) END AS stddev_plan_time,
			SUM(calls) AS calls,
			SUM(total_exec_time) AS total_exec_time,
			MIN(min_exec_time) AS min_exec_time,
			MAX(max_exec_time) AS max_exec_time,
			SUM(total_exec_time)/SUM(calls) AS mean_exec_time,
				SQRT (
					( SUM((POWER(stddev_exec_time,2) + POWER(mean_exec_time,2)) * calls) / SUM( calls ) )
					- POWER(( SUM(total_exec_time) / SUM(calls) ), 2)
				) END AS stddev_exec_time,
			SUM(rows) AS rows,
			SUM(shared_blks_hit) AS shared_blks_hit,
			SUM(shared_blks_read) AS shared_blks_read,
			SUM(shared_blks_dirtied) AS shared_blks_dirtied,
			SUM(shared_blks_written) AS shared_blks_written,
			SUM(local_blks_hit) AS local_blks_hit,
			SUM(local_blks_read) AS local_blks_read,
			SUM(local_blks_dirtied) AS local_blks_dirtied,
			SUM(local_blks_written) AS local_blks_written,
			SUM(temp_blks_read) AS temp_blks_read,
			SUM(temp_blks_written) AS temp_blks_written,
			SUM(blk_read_time) AS blk_read_time,
			SUM(blk_write_time) AS blk_write_time,
			SUM(wal_records) AS wal_records,
			SUM(wal_fpi) AS wal_fpi,
			SUM(wal_bytes) AS wal_bytes_m,
			SUM(jit_functions) AS jit_functions,
			SUM(jit_generation_time) AS jit_generation_time,
			SUM(jit_inlining_count) AS jit_inlining_count,
			SUM(jit_inlining_time) AS jit_inlining_time,
			SUM(jit_optimization_count) AS jit_optimization_count,
			SUM(jit_optimization_time) AS jit_optimization_time,
			SUM(jit_emission_count) AS jit_emission_count,
			SUM(jit_emission_time) AS jit_emission_time
	FROM pgss
	GROUP BY pgss.abstract_query)
SELECT	pss.userid AS rep_userid,
		pss.dbid AS rep_dbid,
		pss.toplevel AS rep_toplevel,
		pgsss.queryid[1] AS rep_queryid,
		pss.query AS rep_query,
		array_to_string(pgsss.queryid, ',') queryids,
		pgsss.plans, pgsss.total_plan_time, pgsss.min_plan_time, pgsss.max_plan_time, pgsss.mean_plan_time, pgsss.stddev_plan_time,
		pgsss.calls, pgsss.total_exec_time, pgsss.min_exec_time, pgsss.max_exec_time, pgsss.mean_exec_time, pgsss.stddev_exec_time,
		pgsss.rows, pgsss.shared_blks_hit, pgsss.shared_blks_read, pgsss.shared_blks_dirtied, pgsss.shared_blks_written,
		pgsss.local_blks_hit, pgsss.local_blks_read, pgsss.local_blks_dirtied, pgsss.local_blks_written,
		pgsss.temp_blks_read, pgsss.temp_blks_written, pgsss.blk_read_time, pgsss.blk_write_time,
		pgsss.wal_records, pgsss.wal_fpi, pgsss.wal_bytes_m,
		pgsss.jit_functions, pgsss.jit_generation_time, pgsss.jit_inlining_count, pgsss.jit_inlining_time,
		pgsss.jit_optimization_count, pgsss.jit_optimization_time, pgsss.jit_emission_count, pgsss.jit_emission_time
FROM pgsss INNER JOIN pg_stat_statements pss ON pss.queryid = pgsss.queryid[1];

GRANT SELECT ON pg_stat_statements_grouped TO PUBLIC;
It is perhaps not the most elegant query I have made of late (well its a pig okay), but it seems to work well.

SELECT * FROM pg_stat_statements_grouped WHERE rep_query ILIKE 'SELECT count(id)%';
Gives a result like this: (everything about planning and also everything after 'rows' omitted for clarity)
|rep_userid|rep_dbid|rep_toplevel|rep_queryid       |rep_query                                               |queryids                                                                                           |nr_queryids|min_nr_of_params|max_nr_of_params|plans|total_plan_time|min_plan_time|max_plan_time|mean_plan_time|stddev_plan_time|calls|total_exec_time   |min_exec_time|max_exec_time|mean_exec_time    |stddev_exec_time   |rows|
|10        |16384   |true        |302831636317716902|SELECT count(id) FROM test WHERE a IN ($1) AND b IN ($2)|302831636317716902,5932379513563548079,5465811682035262379,4173960216733908951,-3349543303082417606|5          |2               |10              |0    |0              |0            |0            |0             |0               |13   |10.236400000000001|0.4392       |1.6688       |0.7874153846153847|0.31244974450339436|13  |

Note that there are a couple columns that are different from the traditional pg_stat_statements: Everything with the rep_ prefix is a representative of the grouped rows (where applicable). The representative is chosen by selecting the version of the query (within a group) that has the lowest number of parameters. This is for convenience, especially when trying to test generic query-plans. (more on that subject in a later blog-post)
  • rep_userid -> the userid field as it is in the representative row
  • rep_dbid -> the dbid field as it is in the representative row
  • rep_toplevel -> the toplevel field as it is in the representative row
  • rep_queryid -> the queryid of the representative row
  • rep_query -> the query of the representative row
  • queryids -> a comma-separated list of the queryids in this group
  • nr_queryids -> the number of queries in pg_stat_statements that this group is based on
  • min_nr_of_params -> the minimum number of parameters found in this query-group. Note that this is a count of all parameters found in the query-body, including values that one would normaly see as static.
  • max_nr_of_params -> the maximum number of parameters found in this query-group.
All other parameters have the same name and the same meaning as their corresponding brethren in pg_stat_statements.

I hope this helps, and if it does, or if you encounter a problem or spot an error, please do not hesitate to let me know.

Ellert van Koperen, January 2023.