Embedding a date into a Serial
(with the purpose to effectively partition on date or age)This article uses PostgeSQL as an example. It is probably possible to use the same technique in other database systems too.
In PostgreSQL there is one very important caveat when considering a range-partitioning of tables: every lookup of rows must contain the partition-key if the planner is to know what partition to use. (The partition-key is the field (or occasionally) that the partitioning is taking place on). Stated differently: any primary index must cover at least all partition-keys.
As partitions are often preferably done on the date of insertion (to facilitate easy deletion of big chunks of data past their retention-limit) it would seem obvious to use the timestamp (or a derivative of that) of the time of insertion as the partition-key.
That however, poses a problem: the database has to know in which partition to look if it is to perform optimally. This would mean that all lookups in these tables must be done with both the primary key and also the partition-key/timestamp. Even on for example a join, this timestamp must be included, or all partitions will be visited. That is far from optimal, especially so as it means including this timestamp everywhere in any SQL code, and that gets (much) worse when it is existing code that must be modified.
In order for effective partitioning on date to still take place, it would be very helpful for the date (at time of data insertion) to be part of the primary key.
And since there should be full control over the primary-key Serials/Sequences used, there is a solution.
Please note a Serial column is in fact a pseudo-type, as it really is an Int field with an implicitly created sequence (named tablename_colname_seq) attached.
The same goes for Bigserial.
We will need to use Bigserial / Bigint fields (63-bit int, as the negative part of the range is ignored) as there is enough space in this field to apply a bit of trickery to accomplish just what we would need to resolve the issue at hand.
There is so much space in a Bigint field, ranging from -9223372036854775808 to 9223372036854775807, that it is entirely possible to use some of that bitspace for our own purpose. Note that a Bigint sequence will never be a negative number, so the range is actually from 0 to 9223372036854775807, and the 64th bit is never used. Still, space aplenty.
Update: it came to my attention that the universally used 'now()' statements may or may not return the actual current time at the instant the function is called. To avoid any confusion I therefore changed all uses of 'now()' into 'CURRENT_DATE', which is guaranteed to return the start time of the current transaction.
Method A: the best method for human readability, modifying the sequence itself
There is actually enough space in the Bigint value to simply put in a complete 6 digit date.
This date is notated as 2-digit year notation, for 4 digit year values the space is possibly available, but that is really overkill and does take away perhaps too much out of the 19 available digits (ymmv).
If we use the left 6 decimal digits of the bigint field, the maximum number we can support for the age is 9.223.372.036.854.775.807 -> 922337.
We can again put YYMMDD in that, so this will work until the year 2092.
The first 2 digits are now the year, the next 2 digits are the month number (with leading zero) and the 5th and 6th digits are the day within the month.
Note that it is safe to execute this statement more then once since the date part is separate from the sequential value, which is retained.
SELECT setval('"mydatabase".myschema.mytable_id_sequence'::regclass, (currval('"mydatabase".myschema.mytable_id_sequence'::regclass)::bigint % 10000000000000::bigint + ((EXTRACT(year from CURRENT_DATE)::bigint-2000)*10000::bigint + (EXTRACT(month from CURRENT_DATE)::bigint)*100::bigint + (EXTRACT(day from CURRENT_DATE)::bigint) ) * 10000000000000::bigint )::bigint, true);
Note 2: as the date now takes up 6 digits, the sequential value has one less digit to work with. This means that the sequential values are now limited to be at most 1.0E13.
It would require a continuous 3000 inserts per second over a period of 97 years to reach this limit, so we can state with certainty that this astronomical number will not be reached in our lifetimes.
With this method an original sequence value of 1234567890 would be converted to 2208030001234567890. (time of writing is 2022-08-03)
Now, with the sequences of the tables that are to be partitioned prepared with the proposed method, partitioning on date, for example a partition per month to very easily drop a chunk of data after a certain retention period has passed, has become trivial, and all of that without the need to change anything to any application-generated SQL code.
As a bonus benefit (at least with method A) a human that gets to see one of these IDs can now see at a glance how old this data actually is, which can be a huge plus for intermediate or end-users.
In part 2 to this post several methods to get the date value into the sequence are discussed and compared.
Ellert van Koperen, original article created on August 2022, updated on February 2023.