img

About cardinality in PostgeSQL

All tables in any database contain data on which lookups of the values contained within it are performed. Using an index to do such a lookup usually greatly improves the speed of this lookup. The most important factor, next to the actual existence of an index covering (at least) the data that is being looked up, is the cardinality of the data in the table. The term cardinality in this context translates basically into 'diversity', or 'distinctiveness', and is always applied to individual columns.
As a general rule: the more distinctive the data in a column, the more effective it will be to get only rows that are 'interresting' when using an index that is based on that column. In PostgreSQL there is a very simple method to get a good impression of the cardinality of data in a table:
SELECT schemaname, tablename, attname as colname, inherited, null_frac, avg_width, n_distinct, correlation FROM pg_stats WHERE tablename = 'interesting_table';
Note that this query uses the meta-data that is stored within PostgreSQL, which is why it is very fast, but also relies on VACUUM ANALYZE "interesting_table"; having ran. If the contents of interesting_table is brand new, running a VACUUM ANALYZE on it must be done manually to ensure the correctness of the outcome of the cardinality-investigation query.

Example result

 schemaname |     tablename     |       colname        | inherited |   null_frac   | avg_width |   n_distinct   |  correlation  
------------+-------------------+----------------------+-----------+---------------+-----------+----------------+---------------
 public     | interesting_table | id                   | f         |             0 |         8 |             -1 |      0.994938 
 public     | interesting_table | version              | f         |             0 |         4 |              2 |     0.9413431 
 public     | interesting_table | external_id          | f         |             0 |        16 |             -1 | 0.00067209493 
 public     | interesting_table | sub_item_id          | f         |             0 |         8 |    -0.58149874 |     0.6830048 
 public     | interesting_table | valid_from           | f         |             0 |         8 |            133 |    0.68429905 
 public     | interesting_table | created_by           | f         |             0 |         9 |          19498 |    0.08784102 
 public     | interesting_table | last_mutations       | f         |     0.9123667 |         5 |             20 |    0.30380416 
 public     | interesting_table | auditing_initialized | f         |        0.1513 |         1 |              1 |             1 
 public     | interesting_table | freeze_end_date      | f         |     0.9996667 |         8 | -0.00033330917 |           0.6 
(9 rows)

How to read the outcome of this query
There are only a couple columns with information that is important to interpret.

null_frac is the fraction of this column that contains NULL values, 0 means there are none, and 1 means all of it is NULL.

n_distinct If this value is a positive integer it is the amount of distinct values that are estimated to be present in this column, so an absolute amount of 'distinctiveness'. If it is a negative number then it must be interpreted as the fraction of not-NULL values that are distinct (after removing the - sign). So -1 means all of the cells in this column are distinct, ignoring the rows that contain a NULL-value.

correlation is the relation between the value and the sequential position in the table. Often auto-increment (serial) fields have a high correlation and things like guids have a very low correlation.

So, looking at the example above, the external_id looks very distinct and thus a good candidate to put an index on if there are lookups based off of it, but also has a very low correlation. Inspecting its contents reveals quickly that this column contains guids from elsewhere.
On the other hand, columns like version, auditing_initialized and freeze_end_date have a terrible cardinality. Even if you would put an index on these columns, that index will never be used by the PostgreSQL planner, as that would only cause overhead.
The PostgreSQL query-planner is a very clever piece of kit, but you still need to create the correct indexes to enable it to make the right choices.


Ellert van Koperen, July 2022.