Aggregation and statistics, part 2: the juicy bits

Aggregation is a very powerful tool. Beyond the well-known simple aggregations like sum and count, there are some others that can be very useful.
After the introduction in part 1 (which you can read here) we will continue with the more advanced subject of combining and cascading aggregations, drastically reducing the volume of records needed in follow-up calculations.

Disclaimer: the example is somewhat simplified. For example, using a DELETE WHERE BETWEEN is not ideal; upserting would be better, but requires a unique key on something other than the Serial / AUTO_INCREMENT. Also, storing the date part of the timestamps as a Date can increase performance, but those tweaks are left out for simplicity's sake.
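A minimal sketch of that upsert alternative, assuming a unique key on (Datum, Account) is added to the daily_transfers table defined further below (MySQL/MariaDB syntax):

-- Assumes: ALTER TABLE daily_transfers ADD UNIQUE KEY uq_day_account (Datum, Account);
INSERT INTO daily_transfers (Datum, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
SELECT Date(t.Moment), t.Account, COUNT(t.id), AVG(t.Quantity), STDDEV_POP(t.Quantity)
  FROM transfers AS t
  WHERE Date(t.Moment) = date('2021-02-09')
  GROUP BY Date(t.Moment), t.Account
ON DUPLICATE KEY UPDATE
  Nr_of_transfers = VALUES(Nr_of_transfers),
  Avg_Quantity = VALUES(Avg_Quantity),
  Std_dev = VALUES(Std_dev);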

The examples are written for MySQL/MariaDB, but PostgreSQL equivalents exist for all of them.

Let's revisit the example.

So we had a table with (bank)transfers:
| id (Primary key) | Moment (datetime) | Quantity (float) | Account (varchar) |
|---|---|---|---|
| 1 | 2021-02-08 23:00:00 | 2 | 111 |
| 2 | 2021-02-09 08:00:01 | 6 | 111 |
| 3 | 2021-02-09 11:11:11 | 9 | 111 |
| 4 | 2021-02-09 22:22:21 | 15 | 111 |
| 5 | 2021-02-09 23:33:33 | 5 | 222 |
Note: there is only one transfer on 2021-02-08, and only one transfer of account 222 on 2021-02-09. This will become important later.

We then created a table to hold the aggregations:
DROP TABLE IF EXISTS daily_transfers;
CREATE TABLE daily_transfers (
  id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Datum date NOT NULL,
  Account varchar(20) NOT NULL,
  Nr_of_transfers int NOT NULL,
  Avg_Quantity float NOT NULL,
  Std_dev float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Filling up such a totals table is very easy. In a (cron) job, something like the following statement can be executed just after each day has ticked over (for this example, with hardcoded dates):
DELETE FROM daily_transfers WHERE Datum = date('2021-02-09');
INSERT INTO daily_transfers(Datum, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
SELECT 
	Date(t.Moment), 
	t.Account, 
	COUNT(t.id), 
	AVG(t.Quantity), 
	STDDEV_POP(t.Quantity) 
  FROM transfers AS t 
  WHERE Date(t.Moment) = date('2021-02-09') 
  GROUP BY Date(t.Moment), t.Account;
The first time this runs, the WHERE clause limiting the date should be removed. On every run thereafter, the WHERE clause prevents needless double work.
Note that the functions AVG and STDDEV_POP are available in most RDBMSes, certainly in PostgreSQL, MySQL and MariaDB.
Note 2: in the original example there was also a column for the total amount, a grand total as it were, which I have removed from this example.

Let's analyse the generated columns again:
- Date(t.Moment) and t.Account are grouping columns, meaning that the segmentation of the data is based upon them.
- COUNT(t.id) counts how many records are processed per grouping.
- AVG(t.Quantity) calculates the average per grouping. The average is simply the sum of all values divided by the number of records processed.
- STDDEV_POP(t.Quantity) is the statistical standard deviation function, which tells you something about the 'spread' of the values. That is why it is 0 if there is only one row of data.


So after running the above aggregation query the daily_transfers table should contain something like:
| id (Primary key) | Datum (date) | Account (varchar) | Nr_of_transfers (int) | Avg_Quantity (float) | Std_dev (float) |
|---|---|---|---|---|---|
| 1 | 2021-02-08 | 111 | 1 | 2 | 0 |
| 2 | 2021-02-09 | 111 | 3 | 10 | 3.74166 |
| 3 | 2021-02-09 | 222 | 1 | 5 | 0 |
An important note: statisticians may argue that STDDEV_SAMP should be used here. The sample-based standard deviation (as opposed to the population-based standard deviation) includes a factor to account for increased uncertainty with smaller sample sizes. But as we are NOT working with samples, but with all data of each inspected day, STDDEV_POP is just fine.
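To make the difference concrete with the three transfers of account 111 on 2021-02-09 (quantities 6, 9 and 15, average 10): the population variant divides the summed squared deviations by n, the sample variant by n - 1:
`stddev_pop = sqrt(((6-10)^2 + (9-10)^2 + (15-10)^2) / 3) = sqrt(42 / 3) ~~ 3.74166`
`stddev_samp = sqrt(42 / 2) ~~ 4.58258`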

Now, let's have a look at how we could further combine these figures.
First we create a new table to hold the coarser totals:
DROP TABLE IF EXISTS weekly_transfers;
CREATE TABLE weekly_transfers (
  id int(10) UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  Datum_first date NOT NULL,
  Account varchar(20) NOT NULL,
  Nr_of_transfers int NOT NULL,
  Avg_Quantity float NOT NULL,
  Std_dev float NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Note: how you denote which week it is can be considered a matter of taste; I generally use the date of the first day of the week. Even whether that first day should be a Sunday or a Monday (or even a Saturday) differs from place to place. Be sure to agree on a convention, and to stay consistent from there on.
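As an illustrative sketch for the Monday convention: in MySQL/MariaDB, WEEKDAY() returns 0 for Monday through 6 for Sunday, so the first day of the week containing a given moment can be computed like this:

-- Yields 2021-02-08 (a Monday) for every Moment in our example week.
SELECT DATE_SUB(Date(Moment), INTERVAL WEEKDAY(Moment) DAY) AS week_start
  FROM transfers;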

So how do we go about combining the records from the daily_transfers table?
Let's analyse them column by column.

Datum & Account
The grouping of the weekly_transfers table will again be on a combination of the date (now the week) and the Account column, so no aggregation is needed on them.

Total_Quantity
Although the grand-total column was removed from this example (see Note 2 above), a day's total quantity can always be found back from the next two numbers: multiply the average by the row count for each row, then sum those products per group.

Nr_of_transfers (= n)
Nr_of_transfers (or n) is actually a simple row count. To get the total row count of a week, simply add the daily rows up within each group using SUM().

Avg_Quantity
Now it becomes a bit more interesting.
An average is calculated as `avg = mu = (sum_(i=1)^n Quantity_i) / n`, and to combine averages they need to be properly weighted. We already know the calculated averages, and also the row counts of all the totals that are being combined.
In other words, over the D daily rows being combined, `avg = (sum_(d=1)^D DailySum_d) / (sum_(d=1)^D DailyTrans_d)` or, put simply,
Avg_Quantity = SUM(DailySum) / SUM(DailyTrans)
Since we have not stored the sums per day, we can calculate them back by multiplying the average by the row count: `avg_tot = (sum_(d=1)^D DailyAvg_d * DailyTrans_d) / (sum_(d=1)^D DailyTrans_d)`, which comes down to
Avg_Quantity = SUM(DailyAvg * DailyTrans) / SUM(DailyTrans)
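Checking this against the daily rows of account 111 from the table above (day 1: average 2 over 1 transfer; day 2: average 10 over 3 transfers):
`Avg_Quantity = (2*1 + 10*3) / (1 + 3) = 32 / 4 = 8`
which is indeed the plain average of the four original quantities 2, 6, 9 and 15.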

Std_dev
Now here is the real tough nut. For a population standard deviation there are two formulas: `sigma = sqrt((sum_(i=1)^n (x_i - mu)^2) / n)`, which is exactly equal to `sigma = sqrt((sum_(i=1)^n (x_i)^2) / n - mu^2)`.
They are functionally the same, but the second form is easier to work with in our case. Let me explain:
To get the standard deviation of the combined populations we need to combine the inner sums, `(x_i - mu)^2` or `(x_i)^2`, by adding them up, and then re-apply the rest of the formula with the combined values (avg, n).
The second one is not just easier: with it we can avoid re-calculating the inner sum from the original values entirely, as we can simply invert the formula: `sigma^2 = (sum_(i=1)^n (x_i)^2) / n - mu^2` gives `sum_(i=1)^n (x_i)^2 = (sigma^2 + mu^2) * n`.
So we can just fill in the numbers and get the original squared sum back. Handy!
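Verifying with the daily row of account 111 on 2021-02-09 (n = 3, mu = 10, sigma = sqrt(14) ~~ 3.74166):
`(sigma^2 + mu^2) * n = (14 + 100) * 3 = 342`
and directly from the original quantities: `6^2 + 9^2 + 15^2 = 36 + 81 + 225 = 342`. The squared sum is recovered exactly.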
Let me show an example of how to combine two standard deviations; working out from that how to do it with 7 or 28 or whatever number of rows is quite simple.
For each day d you need: `sum_(i=1)^(n_d) (x_i)^2, mu_d, n_d`.
- The `n_d` is stored in Nr_of_transfers for each record.
- The `mu_d` is stored in Avg_Quantity for each record.
- And the `sum_(i=1)^(n_d) (x_i)^2` we can calculate back from the stored Std_dev, Avg_Quantity and Nr_of_transfers, as shown above.

Combined: `Xsq_(combined) = Xsq_(day1) + Xsq_(day2)`, where `Xsq_d = sum_(i=1)^(n_d) (x_i)^2` is the squared sum of day d. In plain language: calculate the core back for each day and add them together. Not too hard.
In SQL that would look something like:
XsquareSumCombined = SUM((POWER(Std_dev,2) + POWER(Avg_Quantity,2)) * Nr_of_transfers) (from daily_transfers)

Now we have all the pieces to combine everything into a final Std_dev:
Std_dev_Combined = SQRT(
	( SUM((POWER(Std_dev,2) + POWER(Avg_Quantity,2)) * Nr_of_transfers) 
		/ SUM(Nr_of_transfers) ) 
	- POWER(( SUM(Avg_Quantity * Nr_of_transfers) / SUM(Nr_of_transfers) ), 2)
)

I admit, at first glance that looks complicated, but if you carefully follow the parts you can see what happens: the partial sums of squares are calculated back from the daily numbers and added together (second line); the total average is found by calculating the partial totals back from the daily averages and Nr_of_transfers (the inner part of the fourth line); and then the standard deviation of the whole is calculated by following the normal formula.
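A quick check with the two daily rows of account 111 (day 1: n = 1, mu = 2, sigma = 0; day 2: n = 3, mu = 10, sigma = sqrt(14)):
`Xsq = (0 + 4) * 1 + (14 + 100) * 3 = 4 + 342 = 346`
`mu_(week) = (2*1 + 10*3) / 4 = 8`
`sigma_(week) = sqrt(346 / 4 - 8^2) = sqrt(86.5 - 64) = sqrt(22.5) ~~ 4.74342`
exactly the value we will see in the weekly result below.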


Action!


Let's put it in SQL:

DELETE FROM weekly_transfers WHERE Datum_first BETWEEN date('2021-02-08') AND date_add(date('2021-02-08'), interval 6 day);
INSERT INTO weekly_transfers(Datum_first, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
  SELECT 
	MIN(t.Datum), 
	t.Account, 
	SUM(t.Nr_of_transfers), 
	SUM(t.Avg_Quantity * t.Nr_of_transfers) / SUM(t.Nr_of_transfers), 
	SQRT(
		( SUM((POWER(t.Std_dev,2) + POWER(t.Avg_Quantity,2)) * t.Nr_of_transfers) 
			/ SUM(t.Nr_of_transfers) ) 
		- POWER(( SUM(t.Avg_Quantity * t.Nr_of_transfers) / SUM(t.Nr_of_transfers) ), 2)
	)
  FROM daily_transfers AS t 
  WHERE t.Datum BETWEEN date('2021-02-08') AND date_add(date('2021-02-08'), interval 6 day) 
  GROUP BY t.Account;

Note: I always prefer date calculations in SQL over programmatically messing around with them. Just one rule to remember: a BETWEEN is inclusive on both the lower and the upper bound, so to get one week we count from date+0 to date+6.
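For reference: date_add is MySQL/MariaDB-specific. In PostgreSQL an integer can be added to a date directly, so the same one-week window would read:

-- PostgreSQL flavour of the same inclusive window
WHERE t.Datum BETWEEN date '2021-02-08' AND date '2021-02-08' + 6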

This results in:
| id (Primary key) | Datum_first (date) | Account (varchar) | Nr_of_transfers (int) | Avg_Quantity (float) | Std_dev (float) |
|---|---|---|---|---|---|
| 1 | 2021-02-08 | 111 | 4 | 8 | 4.74342 |
| 2 | 2021-02-09 | 222 | 1 | 5 | 0 |
With all the correct values, including the Std_dev.
Note: the Datum_first field has a quirk: it is the date of the first recorded transfer of that cluster. If you would rather have the start of the period, put that date directly into the query instead of the MIN() value. YMMV.

Recap
Using this method we can combine aggregated data without the need to access the original data.
Obviously this is a lot of hassle for just 5 records as in the example, but this same method also works when there are millions of datapoints per day, and then it is going to save you a huge amount of database stress.
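And because the weekly rows store the same three ingredients (row count, average, standard deviation), the very same statement shape cascades one level further. A sketch, assuming a hypothetical monthly_transfers table with the same columns as weekly_transfers:

-- Roll the weekly totals of February 2021 up into monthly totals.
INSERT INTO monthly_transfers(Datum_first, Account, Nr_of_transfers, Avg_Quantity, Std_dev)
  SELECT 
	MIN(t.Datum_first), 
	t.Account, 
	SUM(t.Nr_of_transfers), 
	SUM(t.Avg_Quantity * t.Nr_of_transfers) / SUM(t.Nr_of_transfers), 
	SQRT(
		SUM((POWER(t.Std_dev,2) + POWER(t.Avg_Quantity,2)) * t.Nr_of_transfers) / SUM(t.Nr_of_transfers)
		- POWER(SUM(t.Avg_Quantity * t.Nr_of_transfers) / SUM(t.Nr_of_transfers), 2)
	)
  FROM weekly_transfers AS t 
  WHERE t.Datum_first BETWEEN date('2021-02-01') AND date('2021-02-28') 
  GROUP BY t.Account;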


Ellert, February 2021.