group by – Mellowhost Blog

Let’s imagine the following scenario:

We have a table named ‘sale_line’ with three columns and the following values:

----------------------------------------
product_id	qty_done	sale_id
----------------------------------------
1010		5.0		101
1010		1.0		102
1010		2.0		103
1012		1.0		104
1012		4.0		105
1012		4.0		106
1012		9.0		107

The product_id column identifies the product, qty_done holds the quantity sold, and sale_id refers to the sale order in which that product was dispatched. The example table therefore shows how many units of each product were dispatched in different sale orders. We can write a query that groups the rows by product_id and computes the total quantity sold per product using aggregation:

select product_id, sum(qty_done) as total_sold from sale_line group by product_id;

The result would be the following:

---------------------------
product_id	total_sold
---------------------------
1010		8.0
1012		18.0
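To try the query yourself, the table can be created and filled as in the sketch below. The column types (integer for the ids, numeric for qty_done) are assumptions, since the original table definition is not shown here.

```sql
-- Assumed schema: the column types are guesses, not the original definition.
CREATE TABLE sale_line (
    product_id integer NOT NULL,
    qty_done   numeric NOT NULL,
    sale_id    integer NOT NULL
);

-- The seven example rows from the table above.
INSERT INTO sale_line (product_id, qty_done, sale_id) VALUES
    (1010, 5.0, 101),
    (1010, 1.0, 102),
    (1010, 2.0, 103),
    (1012, 1.0, 104),
    (1012, 4.0, 105),
    (1012, 4.0, 106),
    (1012, 9.0, 107);

-- Total quantity per product; ORDER BY makes the row order deterministic.
SELECT product_id, sum(qty_done) AS total_sold
FROM sale_line
GROUP BY product_id
ORDER BY product_id;
```

Without the ORDER BY, Postgres is free to return the groups in any order.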

We can see how group by gives us the benefit of aggregation. But notice that we have a column called ‘sale_id’ that isn’t used with any aggregate function like ‘count’, ‘sum’ or ‘avg’. Can we use it for any purpose?

Sometimes you may want to track down the orders behind each product in the grouped result, so that each product row carries the list of orders that make up its total sold. Can we accomplish that in one query?

We actually can. Instead of a familiar aggregate function like ‘count’ or ‘sum’, we can collect the ids of each group into an array. Postgres provides an array aggregate function called ‘array_agg’ that produces exactly this result. Have a look at the following query:

select product_id, sum(qty_done) as total_sold, array_agg(sale_id) as sale_ids from sale_line group by product_id;

The result would be the following:

-----------------------------------------
product_id	total_sold	sale_ids
-----------------------------------------
1010		8.0		[101, 102, 103]
1012		18.0		[104, 105, 106, 107]

As you can see, Postgres returns an array containing the sale_ids that make up each total_sold, even though the sale_id column isn’t something we would conventionally aggregate, that is, it doesn’t fit the usual numeric pattern. (In psql the arrays are displayed with braces, e.g. {101,102,103}.)
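Two details are worth knowing when using array_agg this way: the order of the elements inside each array is not guaranteed unless an ORDER BY is given inside the aggregate call, and duplicate ids are kept unless DISTINCT is specified. A minimal sketch against the same sale_line table:

```sql
-- ORDER BY inside the aggregate fixes the element order within each array.
SELECT product_id,
       sum(qty_done) AS total_sold,
       array_agg(sale_id ORDER BY sale_id) AS sale_ids
FROM sale_line
GROUP BY product_id;

-- DISTINCT drops repeated ids, e.g. when one sale order has several
-- lines for the same product.
SELECT product_id,
       array_agg(DISTINCT sale_id) AS sale_ids
FROM sale_line
GROUP BY product_id;
```

In the example data every sale_id appears only once, so DISTINCT changes nothing there; it only matters once the same order shows up on several rows of a group.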

This technique has many use cases. For example, suppose you have separate models/tables for sale orders, purchase orders, pickings and invoices, where the sale orders contain the price information and the pickings don’t, but the pickings contain the actual number of products dispatched. You can first aggregate total_sold from the picking table, then compute an average price from the sale orders using the sale_ids collected with group by and array aggregation, and finally merge the two. This seems quite complicated, I understand, but again, it’s all about keeping your reporting time in check and your application scalable: the report is still produced by a single SQL query, even if you process 10 times more orders in the future.
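The picking and sale order scenario could look roughly like the sketch below. The table and column names (picking_line, sale_order_line, price_unit) are hypothetical and chosen only for illustration; adapt them to your own schema.

```sql
-- Hypothetical tables, named for illustration only:
--   picking_line(product_id, qty_done, sale_id)        dispatched quantities
--   sale_order_line(sale_id, product_id, price_unit)   price information
WITH dispatched AS (
    -- Step 1: total dispatched quantity and contributing orders per product.
    SELECT product_id,
           sum(qty_done)      AS total_sold,
           array_agg(sale_id) AS sale_ids
    FROM picking_line
    GROUP BY product_id
)
-- Step 2: average the prices of the order lines listed in sale_ids
-- and merge them with the totals from step 1.
SELECT d.product_id,
       d.total_sold,
       d.sale_ids,
       avg(s.price_unit) AS avg_price
FROM dispatched d
JOIN sale_order_line s
  ON s.product_id = d.product_id
 AND s.sale_id = ANY (d.sale_ids)
GROUP BY d.product_id, d.total_sold, d.sale_ids;
```

Note that avg_price here is a plain, unweighted average over the matching order lines; a quantity-weighted price would need the quantities to be carried along per order instead.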