
Optimizing Away a LEFT JOIN in a Pinch

A couple of years ago, the company I worked for had just launched a free consumer-facing application and had been pushing it hard on social media. They were seeing hundreds to thousands of active users. It was painfully slow, and the culprit appeared to be the database.

One of the main queries on the application was pushing the database server to its limits. Some users were waiting 10+ seconds for results, and the system was only getting slower under the load.

This was a team of very solid developers, but to get the problem fixed fast they called in a few experts from across the company. Two of us came to help with the SQL itself, and others came to assist with the Ruby code. The desire was to have a fix out within a day or two, so a complete re-work was out of the question.

This app was running with New Relic, so we could see exactly which queries were performing poorly. The application also had a nearly identical database in the QA environment, so we could easily run EXPLAIN PLANs for the problematic queries.

Source Code

A Docker-enabled version of this post is available on GitHub

Requirements of the App - Dog Finder

I’m going to change up the scenario to protect the identity of the company and team, but the underlying technical problem is the same.

Let’s say the application was a tool to find dogs that were available for adoption from area shelters. Among all of the search parameters, one of the options would be the breed, such as a Labrador or Poodle. Dogs can be a mix of breeds, and that mix could be expressed as a many-to-many relationship if the components of the mix were known. It is also possible for the breed to not be known, and in that case no breed information is stored in the database.

This database is trimmed down as much as possible, other than having a full set of reference data about various breeds. No information about shelters, dog ages, descriptions, etc. has been added, to keep it simple.

  • If the dog has multiple breeds, the search will only consider one of them. For example, searching for “Labrador” would return a dog that is a mix of Labrador and Poodle. Searching for specific breed combinations is not a requirement.
  • If a dog has no breed recorded, whether due to clerical error or because the breed is simply not known, it should be returned for all users. The front-end will show these dogs separately at the bottom of the search results as “Other dogs that need a home”.
  • Data is only added to the system on a periodic basis, not in real-time. It is primarily a read-only system.
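
Putting the data model together, here is a minimal sketch of the schema implied by the queries in this post. The table and column names come from those queries; the column types and the serial primary keys are assumptions for illustration.

create table breed (
  id    serial primary key,
  breed text not null
);

create table dog (
  id   serial primary key,
  name text not null
);

-- Many-to-many join table. A dog with no known breed has no rows here at all.
create table dog_breed (
  dog_id   integer not null references dog (id),
  breed_id integer not null references breed (id),
  unique (dog_id, breed_id)
);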

The Problem

The query was generated by the ORM. Stripping away all of the other WHERE clauses and complexities, the critical part of it looked something like this:

select d.*, b.breed
from dog d
  left join dog_breed db on d.id = db.dog_id
  left join breed b on db.breed_id = b.id
where b.breed = 'Labrador Retriever'
  or b.id is null
group by d.id, b.breed
order by b.breed nulls last,
  d.name;
Query Plan

The database was well indexed. There were no missing indexes on the relevant foreign key columns. The data types all looked good, and any low-hanging fruit that I typically look for had been taken care of already.

The problem came down to the following statement: OR b.id IS NULL.

This query is asking the database to select two sets of records - dogs that have a specific breed, or no breed whatsoever. Even if we split this into two queries - one for specific breeds and one for no breeds - the query for no breeds would have the same poor performance.

The EXPLAIN PLAN for the query above reveals three full table scans, or Seq Scans - every row of all three tables is being evaluated. One of the tables in the actual application had nearly a million rows in it.

Even though there are indexes, they are being ignored because they do not help the LEFT JOIN ... WHERE b.id IS NULL combination. We are asking the database to select all entries from the left side of a many-to-many relationship that have no corresponding entry on the right side. In order to do this, the database has to gather all of the values that do exist and then exclude them.
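
To make that concrete, the “no breed whatsoever” half on its own is an anti-join that looks roughly like this sketch (same table names as above, not code from the original application):

-- Dogs with no corresponding rows in the join table at all.
select d.*
from dog d
  left join dog_breed db on d.id = db.dog_id
where db.dog_id is null;

The indexes do not help here, because the rows we want are precisely the ones that have no entry in dog_breed.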

The Simple Solution

The simple solution we came up with was to find a way to eliminate the LEFT JOIN. To do this, we needed to make sure every dog had at least one record in the join table dog_breed. We added a special Unknown breed and added a join record for any dog that was missing a dog_breed row. We updated the WHERE clause to include Unknown and removed the NULL check. Finally, the sort was adjusted so that Unknown results showed up after all other results.
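
A minimal sketch of that data change, assuming the patched join table is the dog_breed_unknown table used in the improved query below (in the live application this would simply have been the existing dog_breed table):

-- The placeholder breed (assumes breed.id is generated automatically).
insert into breed (breed)
values ('Unknown');

-- Give every dog with no breed at all a join row pointing at 'Unknown'.
insert into dog_breed_unknown (dog_id, breed_id)
select d.id,
       (select id from breed where breed = 'Unknown')
from dog d
where not exists (
  select 1
  from dog_breed_unknown db
  where db.dog_id = d.id
);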

One final change was behind the scenes. The original application had a unique index on the join table, with dog_id as the first column and breed_id as the second. We also added an index that contained only the breed_id. Because the query to find the breeds is very fast - it loops over a table with just a few hundred rows - the database server can get the primary keys from the breed table very efficiently. Given those keys, it needs to traverse the relationship back to the dog table, and it can eliminate a lot more rows from the join table if the index is structured with the known values first.

Having both indexes can be useful. When the dog is known, having the dog_id first in the index will allow for looking up the breed quickly. But when only the breed is known, having the breed_id index will speed up finding dogs for that breed.
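
In DDL terms, that amounts to something like the following, shown here against the dog_breed_unknown table from the query below; the index names are purely illustrative, and only the single-column index was actually added (the unique one already existed):

-- Dog first, breed second: fast when the dog is known.
create unique index dog_breed_unknown_dog_breed_uq
  on dog_breed_unknown (dog_id, breed_id);

-- Breed first: fast when only the breed is known.
create index dog_breed_unknown_breed_idx
  on dog_breed_unknown (breed_id);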

select d.*, nullif(b.breed, 'Unknown') as breed
from dog d
  join dog_breed_unknown db on d.id = db.dog_id
  join breed b on db.breed_id = b.id
where b.breed in ('Labrador Retriever', 'Unknown')
group by d.id, b.breed
order by breed asc nulls last,
  d.name asc;
Improved Query Plan

At the end of the queries you can also see a test for equivalence, which I talked about in a previous post.

Takeaways

I don't have exact numbers anymore, but our solution brought the query cost down from almost 10,000 to close to 100 - a 99% reduction in query cost. That app had on the order of a million rows. In the trivial example I have set up in Docker, with just a few thousand rows, the improvement is a little more modest - a cost of 916 down to 260 - still a 71% reduction in cost.

One lesson to learn from all of this is that most tools that use an index of some sort will be much faster when searching for a known value than when searching for the absence of one.

Also, if you have a two-column index on a many-to-many join table, consider adding an index on just the second column if you plan on running queries where the table referenced by the second column contains your known values.

Now again, to be very clear, given more time we could have re-architected the application. The query to find dogs with no known breed could have been split into its own query, and if that query relied on a LEFT JOIN or NOT EXISTS, it could still benefit from adding an Unknown record. Maybe the solution there would just be to have a cached set of dogs to feature, or some other column or tagging mechanism to flag certain dogs to show at the bottom of every search page.
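
For example, a stand-alone “dogs with no known breed” query could look something like this sketch, using NOT EXISTS against the original dog_breed table:

select d.*
from dog d
where not exists (
  select 1
  from dog_breed db
  where db.dog_id = d.id
)
order by d.name;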

Can You Make it Better?

I have shared the code for this example on GitHub for several reasons. A few years ago I encountered some developers who were very familiar with MySQL and knew how to bootstrap it, but were fearful of trying out PostgreSQL. Docker makes this super easy. I am also sharing it because maybe someone out there looks at this and knows how to do it even better. I would welcome the feedback!

Source Code

A Docker-enabled version of this post is available on GitHub
