Optimizing a SQLite Database

TL:DR; takeaways:

These seem very obvious, but with rushed deadlines, and “it works well enough”, these things are sometimes overlooked.

  • Use the right data type for the job. If the documentation says VARCHAR when the values are always integers, maybe you should use INT.
  • Generate a proper primary key when you can, skip indexes that overlap with the primary key
  • Sort your data before importing (especially if there will not be a primary key to enforce an ordering), it will likely compress better
  • sqlite3_analyzer is a handy tool

The story

I’ve been working on a project that involves taking flat files, importing them into SQLite, and exporting that database for use on low-powered client devices. I’ve had two such projects over the past couple of years, and each one has provided some small challenges in storing data in an optimal way for use on constrained devices.

For the current file, there was an initial pass at importing the data that was was quite naive. It produced large files (large as far as a low-powered devices are concerned), but nobody was complaining, except me (and I am not a user of this particular application). I was not at all satisfied with how big the database was, so I set out to make some incremental improvements.

The file is compressed (ZIP) before sending to the phone. The phone decompresses the file before first use.

  • Raw file size: 791mb
  • Initial SQLite size: 768mb (299mb compressed)

There is a handy tool for discovering where the space is being used in a SQLite database – sqlite3_analyzer (brew install sqlite-analyzer). It showed that one table consumed most of the space:

Table A consumes almost 95% of the database. That looks like the place to spend some time optimizing.

The documentation I had been given indicated that the flat files originated from a SQL database, and included the column types from that database. Most columns were CHAR/VARCHAR. However, after looking through the data, it appeared that the values in these columns were numeric. So, I changed the column definitions to be integers and re-imported the data (sqlite does not let you change column types – your option is to select into a new table with the same columns but different types, drop the original, and rename, or just re-import the data).

  • With integers columns wherever possible: 736mb (298mb compressed)

The uncompressed size shrunk by about 7%, but the compressed size barely changed.

The client developers informed me around this time that they were adding an index on the device that covered some of the varchar columns, and were wondering if I could go ahead and pre-compute the index on my end. I added it to the version of the database with integers:

  • With baked in index: 813mb (324mb compressed)

This added about 7% to the compressed size

After looking at the columns that were being indexed, and looking over the data, I realized that the two indexed columns were part of a unique key that was not documented. By adding two more columns, for a total of four, I could make a composite primary key, and ditch the index. I also could disable the built-in ROWID that SQLite generates. This was the trick I needed:

  • With integers and primary key: 722mb (129mb compressed)

This was huge. The uncompressed file shrank a little bit more, but the compressed file shrank by 57%. I believe the biggest savings is that the data was now ordered. For whatever reason the files I was supplied with did not have any specific ordering for it – rows that should have been grouped together were dispersed throughout the file. Now that there was a primary key, the rows would be ordered by that primary key, and like data would be stored together, allowing for better compressibility.

There is still a lot of room for improvement in this file. One big optimization is that the remaining VARCHAR columns have few distinct values compared to the total number of rows in the table. It would make sense to normalize these values, store them in a lookup table and use an integer foreign key. I believe I could save about 60% of the uncompressed file, and another 50% of the compressed file.

 

Leave a Reply

Your email address will not be published. Required fields are marked *