A couple years ago the company I worked for had had just launched a free consumer-facing application, and had been pushing it hard on social media. They were seeing hundreds to thousands of active users. It was painfully slow, and the culprit appeared to be the database.
One of the main queries on the application was pushing the database server to its limits. Some users were waiting 10+ seconds for results, and the system was only getting slower under the load.
This was a team of very solid developers, but to get the problem fixed fast they called in a few experts from across the company. Two of us came to help with the SQL itself, and others came to assist with the Ruby code. The desire was to have a fix out within a day or two, so a complete re-work was out of the question.
This app was running with NewRelic, so we could see exactly what queries were performing poorly. And the way this application was architected, it had a nearly identical database in the QA environment so we could easily run EXPLAIN PLANs for the problematic queries.
Requirements of the App - Dog Finder
I’m going to change up the scenario to protect the identity of the company and team, but the underlying technical is the same.
Let’s say the application was a tool to find dogs that were available for adoption from area shelters. Among all of the search parameters, one of the options would be the breed, such as a Labrador or Poodle. Dogs can be a mix of breeds, and that mix could be expressed as a many-many relationship if the components of their mix were known. It is also possible for the breed to not be known, and in that case, no breed information is stored in the database.
This database is trimmed down as much as possible, other than having a full set of reference data about various breeds. No information about shelters, dog age, descriptions, etc have been added to keep it simple.
- If the dog has multiple breeds, the search will only consider one of them. Ie: A dog that is a mix of Labrador and Poodle, searching for “Labrador” would return this dog. Searching for specific breed combinations is not a requirement.
- If a dog has no breed, either due to clerical error or it is not known, those dogs should be returned for all users. The front-end will show these separately at the bottom of the search results as “Other dogs that need a home”
- Data is only added to the system on a periodic basis, not in real-time. It is primarily a read-only system.
The Problem
The query was generated by the ORM. Stripping away all other WHERE clauses or complexities, the critical part of looked something like this:
select d.*, b.breed
from dog d
left join dog_breed db on d.id = db.dog_id
left join breed b on db.breed_id = b.id
where b.breed = 'Labrador Retriever'
or b.id is null
group by d.id, b.breed
order by b.breed nulls last,
d.name;
The database was well indexed. There were no missing indexes on the relevant foreign key columns. The data types all looked good, and any low hanging fruit that I typically look for had been taken care of already.
The problem came down to the following statement: OR b.id IS NULL
.
This query is asking the database to select two sets of records - dogs that have a specific breed, or no breed whatsoever. Even if we split this into two queries - one for specific breeds and one for no breeds - the query for no breeds would have similar same poor performance.
The EXPLAIN PLAN
for the query above reveals three full table scans, or SEQ SCAN
- every row of the three tables is being evaluated. One of the tables in the actual application had nearly a million rows in it.
Even though there are indexes, they are being ignored because they do not help the LEFT JOIN ... WHERE b.id IS NULL
combination. We are asking the database to select all entries from the left side of a many-many relationship that has no corresponding entry on the right side. In order to do this, the database has to gather all of the values that do exist and then exclude them.
The Simple Solution
The simple solution we came up with was to find a way to eliminate the LEFT JOIN
. To do this, we needed to make sure every dog
had at least one record in the join table dog_breed
. We added a special Unknown
breed and added a join record for any dog that was missing a dog_breed
. We updated the WHERE
clause to include Unknown
and remove the NULL
check. To make sure that Unknown
showed up after all other results, a sort was added to ensure that Unknown
results were put at the end of the results.
One final change was behind the scenes. The original application had a unique index on the join table, with dog_id
as the first column and breed_id
as the second column. We also added an index that contained only the breed_id. Because the query to find the breeds is very fast - looping over a table with just a few hundred rows, the database server can get the primary keys for the breed table very efficiently. Given those, it needs to traverse the relationship back to the dog table, and it can eliminate a lot more rows from the join table if the key is structured with the known values first.
Having both indexes can be useful. When the dog is known, having the dog_id
first in the index will allow for looking up the breed quickly. But when only the breed is known, having the breed_id
index will speed up finding dogs for that breed.
select d.*, nullif(b.breed,'Unknown')
from dog d
join dog_breed_unknown db on d.id = db.dog_id
join breed b on db.breed_id = b.id
where b.breed in ('Labrador Retriever', 'Unknown')
group by d.id, b.breed
order by breed asc nulls last,
d.name asc;
You can also see at the end of the queries a test for equivalence, which I talked about in a previous post.
Takeaways
I don't have exact numbers anymore, but our solution brought the query cost down from almost 10,000 down to close to 100 - a 99% reduction in query cost. That app had on the order of a million rows. In the trivial example I have set up in Docker with just a few thousand rows the improvement is a little more modest - a cost of 916 down to 260 - still a 71% reduction in cost.
One lesson to learn from all of this is that most tools that use an index of some sort will be much faster when searching for a known value.
Also, if you have a two-column index on a many-many joiner table, consider adding an index on just the second column if you plan on doing queries where the table referenced by the second column contains your known values.
Now again, to be very clear, given more time we could have re-architected the application. The query to find dogs with no known breed could have been split into its own query, which if it relied on a LEFT JOIN
or NOT EXISTS
could still benefit from adding an Unknown
record. Maybe the solution there would just be to have a cached set of dogs to feature, or have some other column or tagging mechanism to flag certain dogs to show at the bottom of every search page.
Can You Make it Better?
I have shared the code for this example on GitHub for several reasons. A few years ago I encountered some developers who were very familiar with MySQL and knew how to bootstrap it, but were fearful of trying out PostgreSQL. Docker makes this super easy. I am also sharing it because maybe someone out there looks at this and knows how to do it even better. I would welcome the feedback!