A common problem I have worked on in my career has been to troubleshoot and improve slow, complex SQL queries. Usually these are part of codebases with few or no tests. If there are tests, they usually are testing some other part of the code and assume that the query is correct. It is rare to see tests that prove the columns being selected are correct and the rows are in the correct order.
Because of this, modifying an old query to improve its performance usually involves running the old query and comparing the results with a newly refactored version. The naïve way to do this is to just output two sets of rows and do a diff, or worse pasted the results into a spreadsheet and compare the results there.
There is a quick way that I have used in PostgreSQL to see if I have achieved exact results without having to export to an external tool: hash the results of the two queries, and see if the resulting hashes are identical.
PostgreSQL makes it easy to export the results of a query to JSON. This JSON blob can then be hashed. This involves two json functions - first, to_json
to convert each row into a JSON object, then json_agg
to concatenate all of the rows into an array. Finally, the JSON array is converted to a byte array and passed into the sha256
function to produce a hash.
If the result sets are small - on the order of thousands of rows - the hash step could be skipped. I've just done that because sometimes I emit the hash and save it off, and just do a quick visual check of the first few characters. But below, that is not needed.
If the hashes are identical, then it is extremely likely that all of the rows are identical and are in the same order.
Here is what that looks like:
with
experimentQuery as ( select * from ...),
controlQuery as ( select * from ...),
experimentJson as (
select encode(
sha256(json_agg(to_json(q.*))::text::bytea)
, 'hex') as ej from experimentQuery q
),
controlJson as (
select encode(
sha256(json_agg(to_json(q.*))::text::bytea)
, 'hex') as cj from controlQuery q
)
select case when (cj = jej2) then 'equal' else 'different' end as result
from experimentJson cross join controlJson
If using json is too much, then string_agg
can also be used on a single column, such as the id
column. This can be done by replacing json_agg(to_json(q.*))::text::bytea
with string_agg(q.id::text,',')::bytea
with
experimentQuery as ( select * from ...),
controlQuery as ( select * from ...),
experimentStragg as (
select encode(
sha256(string_agg(q.id::text,',')::bytea
, 'hex') as ej from experimentQuery q
),
controlStragg as (
select encode(
sha256(string_agg(q.id::text,',')::bytea
, 'hex') as cj from controlQuery q
)
select case when (cj = ej) then 'equal' else 'different' end as result
from experimentStragg cross join controlStragg
Pingback: Getting a quick hash of a query in PostgreSQL – danwatt.org