Database.one Blog - In this episode: some cool ways to ensure the correctnes of an embedded date-in-sequence.

Embedding a date into a Serial

part 2: updating the sequence correctly

This article uses PostgreSQL as the primary database. It is probably possible to use the same technique in other database systems too.

Update: it came to my attention that the universally used 'now()' statements may or may not return the actual current time at the instant the function is called. To avoid any confusion I therefore changed all uses of 'now()' into 'CURRENT_DATE', which is guaranteed to return the start time of the current transaction.

After the decision has been made to employ the trick of embedding a date into a sequence (or bigserial, which is more or less the same) as explained here, some thought must go into ensuring the correct values will be coming out of this sequence.
Note: as these methods work with direct access to the sequence, the more modern bigint GENERATED ALWAYS AS IDENTITY need a bespoke trick that i address in part 3.

Let me get something out of the way first:
Q: why not simply update the sequence from a cron job at midnight?
A: this can not be accurate. If there is an insert done just after 00:00 and the cronjob has not run yet, then this row will have the wrong date in its sequence, and may even land in an incorrect partition. That is just sloppy and not acceptable.

Lets first set up a test-bed for the (speed-)testing of the different methods.
I created a PostgreSQL Docker container using https://hub.docker.com/_/postgres and spun it up using
docker run --name postgrescontainer -p 5432:5432 -e POSTGRES_PASSWORD=*************** -d postgres
This runs on my late 2014 Macbook (2.2 GHz Core i7), so not the fastest hardware, nor an exact replication of a professional hosting environment, but that is quite irrelevant when comparing differences, as you will see soon enough.

Setting up some prerequisites:

CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE OR REPLACE FUNCTION sha1(text) returns text as $$
    SELECT encode(digest($1, 'sha1'), 'hex')
$$ LANGUAGE SQL STRICT IMMUTABLE;

This creates an easy to use sha1 function, that helps nicely when generating complex kind-of random data.

Baseline

Lets first define the baselines. We create a simple table and a more complex table (with more columns and more indexes) and in both insert one million rows of random data.

This is the part that must be re-run every test. To save space I will not copy-paste it in every code example below, but dont forget to re-run this when doing the tests yourself.

DROP TABLE if exists test;
CREATE TABLE test
( id bigserial primary key,
  something text,
  tstamp timestamp default now() );

DROP TABLE if exists testw;
CREATE TABLE testw
( id bigserial primary key,
  something text,
  somenumber int,
  somethingelse text,
  tstamp timestamp default now() );
CREATE INDEX ON testw (something, somenumber);
CREATE INDEX ON testw (somenumber, somethingelse);
CREATE INDEX ON testw (somethingelse, tstamp);
CREATE INDEX ON testw (tstamp, id);

Perform the baseline test:


INSERT INTO test(something, tstamp)
  SELECT concat('nummer ', i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 5 s 454 ms

INSERT INTO testw (something, somenumber, somethingelse, tstamp)
  SELECT concat('nummer ', i::text), i::int, sha1(i::text),
         '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 31 s 468 ms

Results	Simple table	More complex table
Baseline	5 s 454 ms	31 s 468 ms

Method one: always update the sequence

In this bruteforce method we simply update the sequence to have the current date embedded, whether it is needed or not.

-- recreate test table, and then

CREATE OR REPLACE FUNCTION test__BI() RETURNS trigger AS $$
BEGIN
  PERFORM setval('public.test_id_seq'::regclass,
                  (currval('public.test_id_seq'::regclass)::bigint
                    % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)::bigint-2000)*10000::bigint
                        + (EXTRACT(month from CURRENT_DATE)::bigint)*100::bigint
                        + (EXTRACT(day from CURRENT_DATE)::bigint)
                    ) * 10000000000000::bigint
                )::bigint,
                true);
  NEW.id = currval('public.test_id_seq'::regclass)::bigint;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER test__BI BEFORE INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test__BI();

INSERT INTO test(something, tstamp)
SELECT concat('nummer ', i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 20 s 635 ms

-- recreate testw table, and then

CREATE OR REPLACE FUNCTION testw__BI() RETURNS trigger AS $$
BEGIN
  PERFORM setval('public.testw_id_seq'::regclass,
                  (currval('public.testw_id_seq'::regclass)::bigint
                    % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)::bigint-2000)*10000::bigint
                        + (EXTRACT(month from CURRENT_DATE)::bigint)*100::bigint
                        + (EXTRACT(day from CURRENT_DATE)::bigint)
                    ) * 10000000000000::bigint
                )::bigint,
                true);
  NEW.id = currval('public.testw_id_seq'::regclass)::bigint;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER testw__BI BEFORE INSERT ON testw FOR EACH ROW EXECUTE PROCEDURE testw__BI();

INSERT INTO testw (something, somenumber, somethingelse, tstamp)
  SELECT concat('nummer ', i::text), i::int, sha1(i::text),
         '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 47 s 809 ms

Results	Simple table	More complex table
Baseline	5 s 454 ms	31 s 468 ms
Always update sequence	20 s 635 ms	47 s 809 ms

The difference is 15 to 16 seconds. That is quite a significant slowdown.
On the positive side, the trigger will ensure that the sequence is always set correctly.

Method two: conditionally update the sequence

In this somewhat more subtle method we update the sequence if it is deemed needed.

-- recreate test table, and then
	
CREATE OR REPLACE FUNCTION test__BI() RETURNS trigger AS $$
BEGIN
IF currval('public.test_id_seq'::regclass)::bigint <
          ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint*10000::bigint
              + EXTRACT(month from CURRENT_DATE)::bigint*100::bigint
              + EXTRACT(day from CURRENT_DATE)::bigint
          )::bigint * 10000000000000::bigint
THEN
    PERFORM setval('public.test_id_seq'::regclass,
                (currval('public.test_id_seq'::regclass)::bigint % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint*10000::bigint
                         + EXTRACT(month from CURRENT_DATE)::bigint*100::bigint
                         + EXTRACT(day from CURRENT_DATE)::bigint
                      )::bigint * 10000000000000::bigint
                )::bigint,
                true);
    NEW.id = currval('public.test_id_seq'::regclass)::bigint;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER test__BI BEFORE INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test__BI();

INSERT INTO test(something, tstamp)
SELECT concat('nummer ', i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 9 s 745 ms

-- recreate testw table, and then

CREATE OR REPLACE FUNCTION testw__BI() RETURNS trigger AS $$
BEGIN
IF currval('public.testw_id_seq'::regclass)::bigint <
          (( (EXTRACT(year from CURRENT_DATE)-2000)*10000
                + EXTRACT(month from CURRENT_DATE)*100
                + EXTRACT(day from CURRENT_DATE)
          )::bigint * 10000000000000::bigint)
THEN
    PERFORM setval('public.testw_id_seq'::regclass,
                (currval('public.testw_id_seq'::regclass)::bigint
                    % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint * 10000::bigint
                         + EXTRACT(month from CURRENT_DATE)::bigint * 100::bigint
                         + EXTRACT(day from CURRENT_DATE)::bigint
                      )::bigint * 10000000000000::bigint
                )::bigint,
                true);
    NEW.id = currval('public.testw_id_seq'::regclass)::bigint;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER testw__BI BEFORE INSERT ON testw FOR EACH ROW EXECUTE PROCEDURE testw__BI();

INSERT INTO testw (something, somenumber, somethingelse, tstamp)
  SELECT concat('nummer ', i::text), i::int, sha1(i::text),
         '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 36 s 407 ms

Results	Simple table	More complex table
Baseline	5 s 454 ms	31 s 468 ms
Always update sequence	20 s 635 ms	47 s 809 ms
Conditionally update sequence	9 s 745 ms	36 s 407 ms

The difference is about 5 seconds. Much better.
On the positive side, the trigger will ensure that the sequence is always set correctly and the check shaves off 2/3rd of the overhead. Not bad.

Method three: only at certain times conditionally update the sequence

This is the most evolved method. Using this trick shaves off a little extra overhead, but does rely on one additional step, as explained below.

-- recreate test table, and then

CREATE OR REPLACE FUNCTION test__BI() RETURNS trigger AS $$
BEGIN
IF ((EXTRACT(hour from CURRENT_DATE) = 0) AND (EXTRACT(minute from CURRENT_DATE) < 10)) OR (currval('public.test_id_seq'::regclass)::bigint < 10000000000000::bigint) THEN
  IF currval('public.test_id_seq'::regclass)::bigint <
          ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint*10000::bigint
              + EXTRACT(month from CURRENT_DATE)::bigint*100::bigint
              + EXTRACT(day from CURRENT_DATE)::bigint
          )::bigint * 10000000000000::bigint
  THEN
    PERFORM setval('public.test_id_seq'::regclass,
                (currval('public.test_id_seq'::regclass)::bigint % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint*10000::bigint
                         + EXTRACT(month from CURRENT_DATE)::bigint*100::bigint
                         + EXTRACT(day from CURRENT_DATE)::bigint
                      )::bigint * 10000000000000::bigint
                )::bigint,
                true);
    NEW.id = currval('public.test_id_seq'::regclass)::bigint;
  END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER test__BI BEFORE INSERT ON test FOR EACH ROW EXECUTE PROCEDURE test__BI();
truncate test;
INSERT INTO test(something, tstamp)
SELECT concat('nummer ', i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 7 s 680 ms

Have a good look at the code. The first bit of the first IF statement restricts any further code from running outside of the allowed timewindow between 00:00 and 00:10. This is intentional.

-- recreate test table, and then

CREATE OR REPLACE FUNCTION testw__BI() RETURNS trigger AS $$
BEGIN
IF ((EXTRACT(hour from CURRENT_DATE) = 0) AND (EXTRACT(minute from CURRENT_DATE) < 10)) OR (currval('public.testw_id_seq'::regclass)::bigint < 10000000000000::bigint) THEN
  IF currval('public.testw_id_seq'::regclass)::bigint <
          (( (EXTRACT(year from CURRENT_DATE)-2000)*10000
                + EXTRACT(month from CURRENT_DATE)*100
                + EXTRACT(day from CURRENT_DATE)
          )::bigint * 10000000000000::bigint)
  THEN
    PERFORM setval('public.testw_id_seq'::regclass,
                (currval('public.testw_id_seq'::regclass)::bigint
                    % 10000000000000::bigint
                    + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint * 10000::bigint
                         + EXTRACT(month from CURRENT_DATE)::bigint * 100::bigint
                         + EXTRACT(day from CURRENT_DATE)::bigint
                      )::bigint * 10000000000000::bigint
                )::bigint,
                true);
    NEW.id = currval('public.testw_id_seq'::regclass)::bigint;
  END IF;
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE TRIGGER testw__BI BEFORE INSERT ON testw FOR EACH ROW EXECUTE PROCEDURE testw__BI();

INSERT INTO testw (something, somenumber, somethingelse, tstamp)
  SELECT concat('nummer ', i::text), i::int, sha1(i::text),
         '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 34 s 072 ms

Also here the first bit of the first IF statement restricts any further code from running outside of the allowed timewindow between 00:00 and 00:10. This is intentional.

This gives an either external or internal cronjob time to run and update the sequences where needed for sequences that did not see an update already, and if an update is done in the critical timewindow between midnight and 10 minutes after midnight these will always be checked and modified where needed before being inserted.
This cronjob can update the sequence with a function similar to what was used in Method one, where a PERFORM setval simply sets the sequence correctly and that is that.
It is important that this cronjob does run however, if there is an issue there and it does NOT run for whatever reason, this may lead to incorrect values in the sequence, which would be mean inconsistent data !

Results	Simple table	More complex table
Baseline	5 s 454 ms	31 s 468 ms
Always update sequence	20 s 635 ms	47 s 809 ms
Conditionally update sequence	9 s 745 ms	36 s 407 ms
Only at certain times conditionally update the sequence	7 s 680 ms	34 s 072 ms

The difference is less then 3 seconds, the fastest by a long shot sofar.

Method four: change the sequence-value on the fly

In this even more subtle method we change the sequence-value on the fly, i.e. we directly pad the data into it.

-- for this method we must change something to the creation of the table

DROP TABLE IF EXISTS test;
DROP SEQUENCE IF EXISTS test_serial;
CREATE SEQUENCE test_serial START 1;

CREATE OR REPLACE FUNCTION gen_test_key() RETURNS BIGINT AS $$
DECLARE
  new_id bigint;
BEGIN
  new_id = (nextval('public.test_serial'::regclass)::bigint % 10000000000000::bigint
            + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint * 10000::bigint
                + EXTRACT(month from CURRENT_DATE)::bigint * 100::bigint
                + EXTRACT(day from CURRENT_DATE)::bigint
              )::bigint * 10000000000000::bigint
           )::bigint;
RETURN new_id;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE test
( id bigint primary key default gen_test_key(),
  something text,
  tstamp timestamp default now());


INSERT INTO test(something, tstamp)
SELECT concat('nummer ', i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 7 s 609 ms

-- for this method we must change something to the creation of the table

DROP TABLE IF EXISTS testw;
DROP SEQUENCE IF EXISTS testw_serial;
CREATE SEQUENCE testw_serial START 1;

CREATE OR REPLACE FUNCTION gen_testw_key() RETURNS BIGINT AS $$
DECLARE
  new_id bigint;
BEGIN
  new_id = (nextval('public.testw_serial'::regclass)::bigint % 10000000000000::bigint
            + ( (EXTRACT(year from CURRENT_DATE)-2000)::bigint * 10000::bigint
                + EXTRACT(month from CURRENT_DATE)::bigint * 100::bigint
                + EXTRACT(day from CURRENT_DATE)::bigint
              )::bigint * 10000000000000::bigint
           )::bigint;
RETURN new_id;
END;
$$ LANGUAGE plpgsql;

CREATE TABLE testw
( id bigint primary key default gen_testw_key(),
  something text,
  somenumber int,
  somethingelse text,
  tstamp timestamp default now());
CREATE INDEX ON testw (something, somenumber);
CREATE INDEX ON testw (somenumber, somethingelse);
CREATE INDEX ON testw (somethingelse, tstamp);
CREATE INDEX ON testw (tstamp, id);


INSERT INTO testw (something, somenumber, somethingelse, tstamp)
  SELECT concat('nummer ', i::text), i::int, sha1(i::text), '2000-01-01'::timestamp + interval '2' day * (i::int)
  FROM generate_series(1,1000000) as t(i);

-- 1,000,000 rows affected in 33 s 667 ms

In a previous iteration of this method a trigger was used to change the id before inserting it, but by simply using the function directly as the default value it became twice as quick.
The only downside is that with this method it is possible to insert incorrect id values. That would be a stupid thing to do, but can not be prevented without an additional trigger.

Results	Simple table	More complex table
Baseline	5 s 454 ms	31 s 468 ms
Always update sequence	20 s 635 ms	47 s 809 ms
Conditionally update sequence	9 s 745 ms	36 s 407 ms
Only at certain times conditionally update the sequence	7 s 680 ms	34 s 072 ms
Directly pad the sequence	7 s 609 ms	33 s 667 ms

The difference is just over 2 seconds.
Note: adding-in the date on the fly leaves the sequence as-is, which you may like or dislike, depending on your preferences and standards.

Conclusion

Of all the proposed techniques, method 1 (always update sequence) is too simplistic, and off the table.

Method 2 (conditionally update sequence) is a robust method that I would be happy to use without ever having to worry about it anymore.

Method 3 (only at certain times conditionally update the sequence) has the advantage that it performs better then method 2, but it does add complexity, which can break.

Method 4 (directly pad the sequence) is an elegant method which surprisingly is also the fastest method. The only downside is that the sequence itself is not updated, but if that is something that you can live with, this method is definitely a very good choice.

Ellert van Koperen, original article created on September 2022, updated on February 2023.

There is a part 3 to this article, solving the same issue for generated always as identity fields.

Sharing is caring.