Aggregation and statistics, part 1: fundamentals
Aggregation is a very powerful tool. Beyond the well-known simple aggregations like sum and count, there are some others that can be very useful. It is also possible to cascade aggregations, drastically reducing the volume of records needed in follow-up calculations. That advanced topic will be handled in part 2; first the basics.
This is best explained using an example.
The examples are written for MySQL/MariaDB, but PostgreSQL versions are available as well.
We have a table with (bank) transfers:
DROP TABLE IF EXISTS transfers;
CREATE TABLE transfers (
id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Moment datetime NOT NULL,
Quantity float NOT NULL,
Account varchar(20) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO transfers(Moment, Quantity, Account) VALUES('2021-02-08 23:00:00',2,'111'),
('2021-02-09 08:00:01',6,'111'), ('2021-02-09 11:11:11',9,'111'),
('2021-02-09 22:22:21',15,'111'), ('2021-02-09 23:33:33',5,'222');
id (Primary_key) | Moment (datetime) | Quantity (float) | Account (varchar) |
---|---|---|---|
1 | 2021-02-08 23:00:00 | 2 | 111 |
2 | 2021-02-09 08:00:01 | 6 | 111 |
3 | 2021-02-09 11:11:11 | 9 | 111 |
4 | 2021-02-09 22:22:21 | 15 | 111 |
5 | 2021-02-09 23:33:33 | 5 | 222 |
It would be useful to have totals on a day-by-day basis for each account. That means grouping on the date and the account using GROUP BY DATE(Moment), Account.
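As a quick check, a plain grouped query run directly against the transfers table (just a sketch of the grouping, nothing stored yet) already shows the daily totals per account:
SELECT
DATE(Moment) AS Datum,
Account,
SUM(Quantity) AS Total_Quantity -- total per day and account
FROM transfers
GROUP BY DATE(Moment), Account;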
Let's make a table to hold some essentials, and add some extra columns to hold statistical values:
DROP TABLE IF EXISTS daily_transfers;
CREATE TABLE daily_transfers (
id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
Datum date NOT NULL,
Total_Quantity float NOT NULL,
Account varchar(20) NOT NULL,
Nr_of_transfers int NOT NULL,
Avg_Quantity float NOT NULL,
Std_dev float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Filling up such a totals table is very easy. In a (cron) job, something like the following statement can be executed just after each day has ticked over (for this example with hardcoded dates):
DELETE FROM daily_transfers WHERE Datum = date('2021-02-09');
INSERT INTO daily_transfers(Datum, Total_Quantity, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
SELECT
Date(t.Moment),
SUM(t.Quantity),
t.Account,
COUNT(t.id),
avg(t.Quantity),
stddev_pop(t.Quantity)
FROM transfers as t
WHERE Date(t.Moment) = date('2021-02-09')
GROUP BY Date(t.Moment), Account;
Note that the functions avg and stddev_pop are available in most RDBMSes, and certainly in PostgreSQL, MySQL and MariaDB.
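One portability detail worth knowing (an aside, not needed for the example): the generic STDDEV() means different things in different engines, population standard deviation in MySQL/MariaDB but sample standard deviation in PostgreSQL, so spelling out _POP or _SAMP is the safer habit. A small sketch to see both variants side by side:
-- population (divide by n) versus sample (divide by n - 1) standard deviation
SELECT
STDDEV_POP(Quantity) AS sd_population,
STDDEV_SAMP(Quantity) AS sd_sample
FROM transfers
WHERE DATE(Moment) = DATE('2021-02-09') AND Account = '111';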
The query above uses both the most simple and some less simple aggregation functions:
- Date(t.Moment) takes the date part of the timestamp, and since the query groups on it, no further aggregation is needed. Otherwise something like the MAX() function would be in order.
- SUM(t.Quantity) sums up all the Quantity numbers
- t.Account is another grouped-on value, so no aggregation needed
- COUNT(t.id) counts how many records are processed per grouping
- avg(t.Quantity) calculates the average per grouping. The average is simply the sum of all values divided by the number of records processed
- stddev_pop(t.Quantity) is the statistical standard deviation function, which tells you something about the 'spread' of the values. That is why it is 0 if there is only 1 row of data. (A small verification of this calculation follows right after this list.)
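That verification: the population standard deviation is the square root of 'the average of the squares minus the square of the average', so it can be reproduced by hand and compared to the built-in function. Just a sketch against the example table:
SELECT
DATE(Moment) AS Datum,
Account,
STDDEV_POP(Quantity) AS built_in,
-- by hand: sqrt(avg of the squares minus square of the avg)
SQRT(AVG(Quantity * Quantity) - AVG(Quantity) * AVG(Quantity)) AS by_hand
FROM transfers
GROUP BY DATE(Moment), Account;
For the three transfers of account 111 on 2021-02-09, both columns come out at roughly 3.74166, which matches the table below.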
So after running the above aggregation query (and assuming the same job already ran the day before for 2021-02-08), the daily_transfers table should contain something like:
id (Primary_key) | Datum (date) | Total_Quantity (float) | Account (varchar) | Nr_of_transfers (int) | Avg_Quantity (float) | Std_dev (float) |
---|---|---|---|---|---|---|
1 | 2021-02-08 | 2 | 111 | 1 | 2 | 0 |
2 | 2021-02-09 | 30 | 111 | 3 | 10 | 3.74166 |
3 | 2021-02-09 | 5 | 222 | 1 | 5 | 0 |
At the end of a larger period (week, month, year) we may want to calculate similar numbers, but then over that longer period.
The obvious thing to do is to take a similar query as before and just extend the span of the date selection. But that is far from optimal, especially if we are working with millions or even billions of records, or multiple joins. In fact, such an operation can bring data warehouses to their knees, begging for forgiveness.
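For illustration, the straightforward but heavy version for a whole month would look something like this (just a sketch, MySQL/MariaDB syntax, with hardcoded dates); it has to scan every raw transfer of the month again:
-- same pattern as the daily query, only with a wider date window
SELECT
YEAR(t.Moment),
MONTH(t.Moment),
SUM(t.Quantity),
t.Account,
COUNT(t.id),
AVG(t.Quantity),
STDDEV_POP(t.Quantity)
FROM transfers AS t
WHERE t.Moment >= '2021-02-01' AND t.Moment < '2021-03-01'
GROUP BY YEAR(t.Moment), MONTH(t.Moment), t.Account;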
Is there a better way to accomplish this then?
There sure is!
And that I will explain in part 2 of this post, which you can read here!
Ellert, February 2021.