PostgreSQL has two functions that are not standard SQL, but are very helpful in certain situations : GREATEST and LEAST. In SQL MIN
and MAX
are aggregate functions that operate across all rows. But there are times where you want the smallest or largest value in a finite number of columns, a scalar min/max. That is where GREATEST
and LEAST
come in. Both allow for more than two values to be passed in, and also handle NULLs. From the docs:
The
PostgreSQL 11 DocumentationGREATEST
andLEAST
functions select the largest or smallest value from a list of any number of expressions. The expressions must all be convertible to a common data type, which will be the type of the result (see Section 10.5 for details). NULL values in the list are ignored. The result will be NULL only if all the expressions evaluate to NULL.
What about Other Database Vendors?
MySQL also has GREATEST and LEAST functions. Like Postgres, they accepts any number of arguments. However, the NULL semantics are different:
GREATEST()
returnsNULL
if any argument isNULL
.
SQLServer has the same functions, GREATEST and LEAST, and have similar NULL semantics as Postgres.
SQLite has scalar MAX and MIN functions, with MySQL-like NULL semantics.
Real World Example
I worked on an order purchase system that had a VIEW that joined together multiple tables. This view had an UPDATED_AT column which represents the most recent UPDATED_AT value from all of the tables that backed the view. This was necessary because the view was being used to snapshot data into another system, and would periodically be checked by comparing the UPDATED_AT
field (ie: WHERE vw.UPDATED_AT > ?
).
Here is a simplified version of that view:
CREATE VIEW vw_customer_order_line_items (customer_id, order_id, amount, sku,
item_price, item_quantity, most_recent_update)
AS
SELECT c.id,
o.id,
o.amount,
li.sku,
li.price,
li.quantity,
(SELECT MAX(updatecheck.updated_at)
FROM ( VALUES (c.updated_at),
(o.updated_at),
(li.updated_at) ) updatecheck(updated_at))
FROM customers c
JOIN orders o
ON o.customer_id = c.id
JOIN line_items li
ON li.orders_id = o.id;
SELECT *
FROM vw_customer_order_line_items
WHERE updated_at > '2019-12-01';
The nested SELECT statement creates an aliased nested query, updatecheck, with a single column - updated_at. Let's look at the query plan to see what is going on:
- SELECT
- Merge Join : 119200
- Index Scan : orders_pk : 18047
- Temporary (Materialize) : 111767
- Sort: 111413
- Hash Join : 86875
- Full Scan : customers : 14310
- Transformation Hash : 24128
- Full scan : line_items : 24128
- Aggregate : 0.4
- Value : 0.3
- Hash Join : 86875
- Sort: 111413
- Aggregate : 0.4
- Value: 0.3
- Merge Join : 119200
A query plan cost of 119,200? That is pretty high, but, not surprising given the lack of other WHERE clauses. With different WHERE clauses added, like on a customer_id, that cost would come down to 25.
I was working on trying to optimize this query, and one of the portions that stood out was the nested table for computing the maximum updated_at
value. I knew that GREATEST
was available, so I refactored the query to use it and re-ran the plan:
CREATE VIEW vw_customer_order_line_items (customer_id, order_id, amount, sku,
item_price, item_quantity, most_recent_update)
AS
SELECT c.id,
o.id,
o.amount,
li.sku,
li.price,
li.quantity,
GREATEST(c.updated_at, o.updated_at, li.updated_at)
FROM customers c
JOIN orders o
ON o.customer_id = c.id
JOIN line_items li
ON li.orders_id = o.id;
SELECT *
FROM vw_customer_order_line_items
WHERE updated_at > '2019-12-01';
After optimization:
- SELECT
- Gather : 81701
- Hash Join : 66558
- Hash Join : 50999
- Full Scan : line_items : 19226
- Transformation Hash : 11718
- Full Scan : orders : 11718
- Hash Join : 50999
- Transformation Hash : 8252
- Full Scan : customers : 8252
- Hash Join : 66558
- Gather : 81701
The cost went from 119,200 down to 81,701, a roughly 32% reduction. What is interesting is that the query optimizer chose to do three full scans this time, where as previously it only needed to do two plus an index scan (further optimizations took care of this). But, for a quick one-line change to the query, it still was a good improvement in query execution time, and made the SQL much more readable.
Next time you need to find the largest (or smallest) value in a list of scalar values, give GREATEST
and LEAST
a try.
Pingback: Self Join a Table to Get a Pair of Value But Could Not Receive the Expected Results