Aggregation and statistics, part 1: fundamentals

Aggregation is a very powerful tool. Beyond the well-known simple aggregations like sum and count, there are others that can be very useful.
It is also possible to cascade aggregations, drastically reducing the volume of records needed in follow-up calculations. That advanced topic will be handled in part 2; first the basics.
This is best explained using an example.

The examples are written for MySQL/MariaDB but are also available for PostgreSQL.

We have a table with (bank)transfers:
DROP TABLE IF EXISTS transfers;
CREATE TABLE transfers (
  id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Moment datetime NOT NULL,
  Quantity float NOT NULL,
  Account varchar(20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO transfers(Moment, Quantity, Account) VALUES('2021-02-08 23:00:00',2,'111'),
 ('2021-02-09 08:00:01',6,'111'), ('2021-02-09 11:11:11',9,'111'),
 ('2021-02-09 22:22:21',15,'111'), ('2021-02-09 23:33:33',5,'222');
This gives us the following table:
id (Primary_key) | Moment (datetime)    | Quantity (float) | Account (varchar)
1                | 2021-02-08 23:00:00  | 2                | 111
2                | 2021-02-09 08:00:01  | 6                | 111
3                | 2021-02-09 11:11:11  | 9                | 111
4                | 2021-02-09 22:22:21  | 15               | 111
5                | 2021-02-09 23:33:33  | 5                | 222
Note: there is only one transfer on 2021-02-08, and only one transfer of account 222 on 2021-02-09. This will become important later.

It would be useful to have totals on a day-by-day basis for each account. That means grouping on the date and account using GROUP BY DATE(Moment), Account.
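A quick query against the raw table already shows what that grouping produces (a sketch; the column aliases are just for readability):
SELECT Date(Moment) AS Datum, Account, SUM(Quantity) AS Total_Quantity
  FROM transfers
  GROUP BY Date(Moment), Account;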
Let's make a table to hold some essentials, and add some extra columns to hold statistical values:
DROP TABLE IF EXISTS daily_transfers;
CREATE TABLE daily_transfers (
  id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Datum date NOT NULL,
  Total_Quantity float NOT NULL,
  Account varchar(20) NOT NULL,
  Nr_of_transfers int NOT NULL,
  Avg_Quantity float NOT NULL,
  Std_dev float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Filling up such a totals table is very easy. In a (cron) job, something like the following statement can be executed just after each day has ticked over (for this example the dates are hardcoded):
DELETE FROM daily_transfers WHERE Datum = date('2021-02-09');
INSERT INTO daily_transfers(Datum, Total_Quantity, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
SELECT 
	Date(t.Moment), 
	SUM(t.Quantity), 
	t.Account, 
	COUNT(t.id), 
	avg(t.Quantity), 
	stddev_pop(t.Quantity) 
  FROM transfers as t 
  WHERE Date(t.Moment) = date('2021-02-09') 
  GROUP BY Date(t.Moment), Account;
The first time this runs, the WHERE clause should be removed. On every run thereafter, the WHERE clause prevents needless double work.
Note that the functions avg and stddev_pop are available in most RDBMSes, certainly in PostgreSQL, MySQL and MariaDB.

That query uses the most simple, and also some less simple, aggregation functions:
- Date(t.Moment) takes the date part of the timestamp, and since we group on it no further aggregation is needed. Otherwise something like the MAX() function would be in order.
- SUM(t.Quantity) sums up all the Quantity numbers
- t.Account is another grouped-on value, so no aggregation needed
- COUNT(t.id) counts how many records are processed per grouping
- avg(t.Quantity) calculates the average per grouping. The average is simply the sum of all values divided by the count of records processed (see the sketch below this list)
- stddev_pop(t.Quantity) is the statistical standard deviation function, which tells you something about the 'spread' of the values. That is why it is 0 if there is only 1 row of data.
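As a quick sanity check (a sketch on the example data, not part of the daily job), the built-in functions can be compared against a manual calculation for account 111 on 2021-02-09 (the values 6, 9 and 15):
SELECT
	SUM(Quantity) / COUNT(id) AS manual_avg,    -- 30 / 3 = 10
	avg(Quantity) AS builtin_avg,               -- 10
	SQRT(AVG(Quantity*Quantity) - AVG(Quantity)*AVG(Quantity)) AS manual_stddev,  -- ~3.74166
	stddev_pop(Quantity) AS builtin_stddev      -- ~3.74166
  FROM transfers
  WHERE Date(Moment) = date('2021-02-09') AND Account = '111';
Both pairs return the same numbers, which is a nice way to convince yourself of what these functions actually do.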


So after running the above aggregation query the daily_transfers table should contain something like:
id (Primary_key) | Datum (date) | Total_Quantity (float) | Account (varchar) | Nr_of_transfers (int) | Avg_Quantity (float) | Std_dev (float)
1                | 2021-02-08   | 2                      | 111               | 1                     | 2                    | 0
2                | 2021-02-09   | 30                     | 111               | 3                     | 10                   | 3.74166
3                | 2021-02-09   | 5                      | 222               | 1                     | 5                    | 0
An important note: statistics people may perhaps argue that stddev_samp should be used here. The sample-based standard deviation (as opposed to the population-based standard deviation) includes a factor to account for increased uncertainty with smaller sample sizes. But as we are NOT working with samples, but with all data of each inspected day, stddev_pop is just fine. More on the difference between these two, and another reason to choose the stddev_pop function, will be clarified in part 2 of this blogpost.
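To make the difference concrete, here is a quick comparison on the three transfers of account 111 on 2021-02-09 (a sketch; both functions exist in MySQL, MariaDB and PostgreSQL):
SELECT
	stddev_pop(Quantity) AS pop_stddev,    -- sqrt(42/3), about 3.74166: divides by n
	stddev_samp(Quantity) AS samp_stddev   -- sqrt(42/2), about 4.58258: divides by n-1
  FROM transfers
  WHERE Date(Moment) = date('2021-02-09') AND Account = '111';
The sample-based flavour is always a bit larger, and the difference shrinks as the number of rows grows.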

At the end of a larger period (week, month, year) we may want to calculate similar numbers, but then over that longer period.
The obvious thing to do is to take a similar query as before and simply extend the span of the date selection. But that is far from optimal, especially if we are working with millions or even billions of records, or multiple joins. In fact such an operation can bring data warehouses to their knees, begging for forgiveness.
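For the record, that heavy variant would look roughly like this for a whole month (a sketch in MySQL/MariaDB syntax; PostgreSQL would use to_char(t.Moment, 'YYYY-MM') instead of DATE_FORMAT, and the alias Maand is just an illustrative name):
SELECT
	DATE_FORMAT(t.Moment, '%Y-%m') AS Maand,
	SUM(t.Quantity),
	t.Account,
	COUNT(t.id),
	avg(t.Quantity),
	stddev_pop(t.Quantity)
  FROM transfers AS t
  WHERE t.Moment >= '2021-02-01' AND t.Moment < '2021-03-01'
  GROUP BY DATE_FORMAT(t.Moment, '%Y-%m'), t.Account;
Every single raw transfer of the month is read and grouped again, even though the daily totals were already calculated.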

Is there a better way to accomplish this then?
There sure is!
And that I will explain in part 2 of this post, which you can read here!



Ellert, February 2021.