Troubleshooting a SQL Server Implicit Conversion Issue

Over the last couple of weeks, I have been spending some time troubleshooting performance issues in some of our Java-based RESTful services. This week I came across one that required some different steps to troubleshoot. First, the tl;dr - our JDBC driver (jTDS) converts CHAR to NCHAR, which causes an index SCAN on a CHAR column instead of an index SEEK.

The Architecture

Here is a simplified version of the architecture of this specific service:

Java 6, Spring 4, Hibernate 4, Tomcat
An endpoint in this service GET /myservice/{NaturalKey}
SQL Server 2014 Enterprise
jTDS Database Driver
A table:
- id : integer primary key
- NaturalKey: CHAR(16), randomly generated exposed over the RESTful service.
- There are a few other columns, but these are the important ones.
A UNIQUE INDEX on the CHAR(16) NaturalKey column.
~2 million rows

The Problem

Every few days I log into NewRelic to check on some of the services that I maintain. NewRelic gives insight into which web transactions are slow, which database calls are slow, and even captures some profiling information from unusually slow transactions.

One happened to catch my my: GET /myservice/RANDOMSTRING. This request is called thousands of times per day, and takes, on average, 500ms to respond. One of the apps that calls it passes in multiple keys, to a slightly different endpoint GET /myservice/keys=RS1,RS2,RS3, etc - those calls average 5 SECONDS. Now, I know that this service has a very simple design - there is one main table with a few columns on it, and the RANDOMSTRING references an indexed field.

When NewRelic captures a slow transaction, it also grabs all of the SQL statements that were executed, as well as a profile of where in the Java code the most CPU time was spent. 95% of the time was spent waiting on the database. Hibernate does a pretty good job of making ugly SQL, but after applying a little formatting to the offending query, it is doing what I expect it to do:

SELECT Id, NaturalKey, OtherField
FROM MyTable
WHERE NaturalKey = ?

Alright, that seems very straight forward. There is an index on NaturalKey, so it should do a seek in that index to find the primary key, then do a clustered index seek to get the actual row from the table.

I've been using the preview release of 0xDBE by JetBrains lately, which does not have the visual execution plan that SQL Management Studio has. I turn on the profiler, which after running a query in 0xDBE, will produce two tables of output - one for the results of the query, one for the execution plan:

SET STATISTICS PROFILE ON;

After replacing the one parameter with a known NaturalKey value and executing the query, the result comes back instantly - far faster than the 500ms that the service is taking. The execution plan is also what I expect it to be:

index seek on idx_naturalkey
clustered index seek on pk_id
key lookup

At this point one of our database administrators got involved. I do have the ability to view execution plans on our servers, but I do not have access to the profiling tools. He fired up the profiler to watch this application, and caught one very important hint as to what the problem was: The actual query that was being executed was different than the one Hibernate was issuing. See if you can spot the difference:

SELECT Id, NaturalKey, OtherField
FROM MyTable
WHERE NaturalKey = N?

The little N before the parameter indicates that the parameter is being passed as Unicode (N for Natural language character set).

Looking at the execution plan for this query is very different:

index scan on idx_naturalkey
clustered index seek on pk_id
key lookup

Now, I am not a database expert by any stretch of the imagination. I know enough to do quite a bit of damage, but until recently, I didn't really think much about the differences of a scan versus a seek. Use the Index, Luke, has a great breakdown of the differences, which I will attempt to summarize:

seek - the index is traversed as a tree, quickly going through the leaf nodes to find matching values. This is more or less a binary search. In Big-O, this is O(log(n)).
scan - every row of the index is scanned, a linear search. O(n).

We are dealing with indexes - they are structures designed for quick traversal, even with a scan. A full table scan, where every row of the table is read, is far slower than an index scan, but an index scan is far slower than an index seek.

Our problem was that in order to find the entry in the index for RANDOMSTRING, SQL Server had to read each row of the index, convert the value from a CHAR(16) to an NCHAR(16), and then perform a string comparison.

I did say that it was our database driver that was doing it. We use jTDS as our driver, and it turns out it has an option that affects this behavior, sendStringParametersAsUnicode, which defaults to true. It turns out that if this is set to false, the performance of this service does indeed speed up dramatically.

The Solution(s)

We explored several different options on how to fix this, each with different tradeoffs:

Option 1: Change the driver configuration

Since the database driver is what is causing this, it makes sense that we should change it there. We have profiled setting sendStringParametersAsUnicode to false, and it gives us the performance that we want. Unfortunately, this service does store other data than this one key (which I have excluded for the sake of simplicity), and some of those columns do in fact store unicode values. We need more time to evaluate the greater implications of changing this setting.

Option 2: Change the column type

A simple ALTER TABLE command could be used to change the column in question from a CHAR(16) to an NCHAR(16). This column would only ever store ASCII alpha-numeric values, so there would be some wasted space here. This change is only being done to satisfy a requirement of the database driver, not usually something you do that affects long-term storage if there are other options. One other downside is that it will incur a brief amount of downtime - we have to drop the associated index, change the column definition, and add the index back.

On the plus side, there are several other NCHAR / NVARCHAR columns in this table, so this column will match those now.

Option 3: Add a computed index

A no-downtime alternative option would be to add a computed column, with an associated index. The SQL would look something like this:

ALTER TABLE myservice ADD [NaturalKey_nchar] AS (CAST(NaturalKey AS nchar(16))); 
create index ix_NaturalKey_nchar on myservice(NaturalKey_nchar);

It turns out that the query optimizer is smart enough to use this index automatically (I am not sure if this is a feature new to 2014, or if it is specific to Enterprise edition). If I type the normal query in, it will use the original index; if the parameter is prefixed with "N", it will use this alternate index.

The downside of this approach is that there are two indexes, and every insert to this table will result in an additional write. Fortunately this table sees a lot more reads that writes, but it does need to be taken into consideration.

For now, we are pursuing option 2 - changing the column definition. Longer term, I am going to spend some more time looking into the implications of changing the jTDS option.

Takeaways

For a database non-expert, there were some very useful tools that came in handy in troubleshooting this issue:

NewRelic - to see at a high level where the problem was, as well as capture the offending SQL traces
SQL Server Profiler - if you have access to them, they can see the actual queries that are being issued
STATISTICS PROFILE ON, or Management Studio's visual execution plan - to see what indexes are being used, and how they are used