TL;DR takeaways:
These seem very obvious, but between rushed deadlines and "it works well enough," they are sometimes overlooked.
- Use the right data type for the job. If the documentation says VARCHAR when the values are always integers, maybe you should use INT.
- Generate a proper primary key when you can, and skip indexes that overlap with it
- Sort your data before importing (especially if there will not be a primary key to enforce an ordering); it will likely compress better
- sqlite3_analyzer is a handy tool
The story
I've been working on a project that involves taking flat files, importing them into SQLite, and exporting that database for use on low-powered client devices. I've had two such projects over the past couple of years, and each one has provided some small challenges in storing data in an optimal way for use on devices with limited storage and network bandwidth.
For the current file, the initial pass at importing the data was quite naive. It produced large files (large as far as low-powered devices are concerned), but nobody was complaining except me (and I am not a user of this particular application). I was not at all satisfied with how big the database was, so I set out to make some incremental improvements.
The file is compressed (ZIP) before being sent to the phone. The phone decompresses the file before first use.
- Raw file size: 791 MB
- Initial SQLite size: 768 MB (299 MB compressed)
There is a handy tool for discovering where the space is being used in a SQLite database: sqlite3_analyzer (brew install sqlite-analyzer; run it with the database file as its only argument). It showed that one table consumed most of the space:
*** Page counts for all tables with their indices *****************************
A............................................ 177747 94.8%
B............................................ 5293 2.8%
C............................................ 3755 2.0%
D............................................ 375 0.20%
E............................................ 278 0.15%
F............................................ 50 0.027%
G............................................ 28 0.015%
H............................................ 13 0.007%
I............................................ 4 0.002%
J............................................ 1 0.0%
*** Table A **************************************************************
Percentage of total database...................... 94.8%
Number of entries................................. 4598639
Bytes of storage consumed......................... 728051712
Bytes of payload.................................. 677052035 93.0%
B-tree depth...................................... 4
Average payload per entry......................... 147.23
Average unused bytes per entry.................... 3.13
Average fanout.................................... 374.00
Non-sequential pages.............................. 1 0.0%
Maximum payload per entry......................... 380
Entries that use overflow......................... 0 0.0%
Index pages used.................................. 475
Primary pages used................................ 177272
Overflow pages used............................... 0
Total pages used.................................. 177747
Unused bytes on index pages....................... 248368 12.8%
Unused bytes on primary pages..................... 14124797 1.9%
Unused bytes on overflow pages.................... 0
Unused bytes on all pages......................... 14373165 2.0%
Table A consumes almost 95% of the database. That looks like the place to spend some time optimizing.
The documentation I had been given indicated that the flat files originated from a SQL database, and included the column types from that database. Most columns were CHAR/VARCHAR. However, after looking through the data, it appeared that the values in these columns were numeric. So, I changed the column definitions to integers and re-imported the data. (SQLite does not let you change a column's type - your options are to select into a new table with the same columns but different types, drop the original, and rename the new table into place, or to just re-import the data.)
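A sketch of the select-into route, with hypothetical column names standing in for the real ones:

-- Same columns, but typed as INTEGER instead of VARCHAR.
CREATE TABLE A_new (
  region_code INTEGER,  -- was VARCHAR, but the values are always numeric
  item_code   INTEGER,  -- likewise
  label       TEXT      -- genuinely textual, left alone
);

-- INTEGER column affinity coerces the numeric-looking text on the way in.
INSERT INTO A_new SELECT region_code, item_code, label FROM A;

DROP TABLE A;
ALTER TABLE A_new RENAME TO A;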
- With integer columns wherever possible: 736 MB (298 MB compressed)
The uncompressed size shrank by about 4% (768 MB to 736 MB), but the compressed size barely changed.
The client developers informed me around this time that they were adding an index on the device covering some of the VARCHAR columns, and asked if I could pre-compute the index on my end.
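A sketch of what baking in the index looks like; the index and column names are hypothetical stand-ins for the real ones:

-- Pre-compute the index the client had been building on-device.
CREATE INDEX idx_a_lookup ON A (region_code, item_code);

I added it to the version of the database with integers: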
- With baked-in index: 813 MB (324 MB compressed)
This added about 26 MB, or roughly 9%, to the compressed size.
After looking at the columns being indexed and looking over the data, I realized that the two indexed columns were part of a unique key that was not documented. By adding two more columns, for a total of four, I could make a composite primary key and ditch the index. I could also drop the built-in ROWID that SQLite generates.
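A sketch of the revised table definition, again with hypothetical column names; in a WITHOUT ROWID table, rows are stored in primary-key order rather than in a separate rowid b-tree:

CREATE TABLE A (
  region_code  INTEGER NOT NULL,
  item_code    INTEGER NOT NULL,
  variant      INTEGER NOT NULL,
  effective_on INTEGER NOT NULL,
  price        INTEGER,
  label        TEXT,
  -- the four columns that turned out to form the undocumented unique key
  PRIMARY KEY (region_code, item_code, variant, effective_on)
) WITHOUT ROWID;

This was the trick I needed: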
- With integers and primary key: 722 MB (129 MB compressed)
This was huge. The uncompressed file shrank a little more, but the compressed file shrank by 57%. I believe the biggest savings came from the data now being ordered. For whatever reason, the files I was supplied with had no particular ordering - rows that should have been grouped together were dispersed throughout the file. Now that there was a primary key, the rows would be stored in primary-key order, keeping like data together and allowing for better compressibility.
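For a table with no usable primary key, the same clustering can be approximated by sorting at import time (the pre-sorting takeaway above); a sketch, assuming the raw rows were first loaded into a hypothetical staging table:

-- Copy rows in sorted order so similar rows land on adjacent pages,
-- then throw the staging table away.
INSERT INTO A SELECT * FROM A_staging ORDER BY region_code, item_code;
DROP TABLE A_staging;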
There is still a lot of room for improvement in this file. One big optimization: the remaining VARCHAR columns have few distinct values compared to the total number of rows in the table. It would make sense to normalize these values - store them in a lookup table and reference them with an integer foreign key. I believe I could save about 60% of the uncompressed file, and another 50% of the compressed file.
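A sketch of that normalization, with a hypothetical label column holding the repeated strings:

-- Lookup table of the distinct values (few of them, so it stays tiny).
CREATE TABLE labels (
  id    INTEGER PRIMARY KEY,  -- alias for the rowid
  label TEXT NOT NULL UNIQUE
);
INSERT INTO labels (label) SELECT DISTINCT label FROM A;

-- Rebuild the big table storing a small integer instead of the string.
CREATE TABLE A_normalized (
  region_code  INTEGER NOT NULL,
  item_code    INTEGER NOT NULL,
  variant      INTEGER NOT NULL,
  effective_on INTEGER NOT NULL,
  price        INTEGER,
  label_id     INTEGER NOT NULL REFERENCES labels (id),
  PRIMARY KEY (region_code, item_code, variant, effective_on)
) WITHOUT ROWID;

INSERT INTO A_normalized
SELECT a.region_code, a.item_code, a.variant, a.effective_on, a.price, l.id
FROM A AS a JOIN labels AS l ON l.label = a.label;

For reference, here is the sqlite3_analyzer output for Table A after the primary key change: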
*** Table A **************************************************************
Percentage of total database...................... 93.8%
Number of entries................................. 4598451
Bytes of storage consumed......................... 678645760
Bytes of payload.................................. 647384697 95.4%
B-tree depth...................................... 5
Average payload per entry......................... 140.78
Average unused bytes per entry.................... 2.70
Average fanout.................................... 26.00
Non-sequential pages.............................. 1 0.0%
Maximum payload per entry......................... 374
Entries that use overflow......................... 0 0.0%
Index pages used.................................. 6172
Primary pages used................................ 159513
Overflow pages used............................... 0
Total pages used.................................. 165685
Unused bytes on index pages....................... 500459 2.0%
Unused bytes on primary pages..................... 11914732 1.8%
Unused bytes on overflow pages.................... 0
Unused bytes on all pages......................... 12415191 1.8%