Compressing a Word List on 1980s Hardware

This is a long overdue continuation of a series I started a couple of years ago. Last time I wrote about Huffman codes, with the intention of eventually writing about prefix trees.

Does this research really provide anything useful for the world? Maybe not. It is more of a curiosity. As a computer scientist, I have long found it fascinating how programmers one or two generations before me were able to pack so much functionality into such constrained systems.

I had been focusing on a proprietary, patented device with about 1 MB of memory that stores about 5 MB of text. It turns out, though, that there is a Game Boy cartridge with similar technical limitations.

The cartridge appears to be an unlicensed title from a company that sold specifically into the Christian market while working around Nintendo's licensing restrictions (Wikipedia has the history). The company mainly dealt in NES titles, but released a few for the Game Boy. From what I was able to find online, their Game Boy cartridges used a custom, non-standard memory bank controller. The Game Boy supports several different memory bank controllers on cartridges, offering tradeoffs in available RAM and ROM; this one appears to use a controller that is not normally found in licensed titles. Almost every emulator will crash once the text is accessed, but BGB and Gambatte appear to support this controller.

Using a tool that can visualize a binary using color ranges, it is clear that there are at least three major sections of data. The first, about 45 kB, consists largely of ASCII data. Next is a roughly 120 kB section that looks like gradients of descending numbers in 4-byte blocks - "block b". Finally, there is a section of very high entropy, around 810 kB long, possibly some form of compression.

A sample of "block b", visualized using binocle

The section with a lot of ASCII looked like the lexicon. There were a lot of words, or parts of words, but most words were missing their final letter. That letter was replaced with a byte with the high bit set. Still, it was clear that about 45 kB of memory was devoted to a list of words, with special flags or bits set here and there as a form of rudimentary compression.

I set this word list aside for a long time, but recently went back to decipher what was going on and figured most of it out. What follows is a simplified version of the rules employed by the decoder - there are some edge cases I won't be covering here, and some I have not yet figured out without digging deeper into the disassembly.

Visualization of the Lexicon section, highlighted by byte class (blue=ASCII)

Rule 1: The High Bit Final Letter

The very first word in the word list is the capital letter A. In ASCII, this is represented as one byte, 0x41 (0b01000001). But the word list starts with 0xC1, which happens to be 0x41 with the high bit set - 0b11000001. Looking through the word list in binary, there were many instances of an ASCII letter with the high bit set.

This rule alone eliminates a delimiter after each word, and is responsible for over 10% of the savings in this algorithm.

Rule 1: When decoding, if a byte with the high bit set is encountered, consider it the end of the word. Take the current byte, mask out the high bit, and use it as the last letter of the current word.

Byte    Meaning
0x42    First letter - B
0x6F    Second letter - o
0xE2    Final letter - b with high-bit set - Bob

Example of Rule 1
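
To make the rule concrete, here is a minimal Python sketch of Rule 1 decoding - my reconstruction for illustration, not the cartridge's actual routine:

    def decode_rule1(data: bytes) -> list[str]:
        """Split a byte stream into words, treating any byte with the
        high bit set as the final letter of the current word (Rule 1)."""
        words, current = [], []
        for b in data:
            if b & 0x80:                        # high bit set: this is the word's last letter
                current.append(chr(b & 0x7F))   # mask off the high bit to recover ASCII
                words.append("".join(current))
                current = []
            else:
                current.append(chr(b))
        return words

    print(decode_rule1(bytes([0x42, 0x6F, 0xE2])))   # ['Bob']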

Rule 2: Upper and Lower Case

When trying to figure out how the KJ-21 worked, it seemed like upper- and lower-case versions of words likely had separate codes. This could be observed when doing a search like test? - the KJ-21 slowly prints each word it encounters that matches the pattern, and for some words it would print test, Test, and TEST. The Game Boy cartridge appears to follow a similar pattern.

Rule 2: When lower-case and upper-case versions of a word both exist in the text, a 0x80 byte will appear after the last letter of the word. For example, if both Bob and bob were in the dictionary, they would be represented as follows:

Byte    Meaning
0x42    First letter - B
0x6F    Second letter - o
0xE2    Final letter - b with high-bit set - Bob
0x80    Add the previous word, with case flipped - bob. The word list is Bob, bob

Example of Rule 2

This allows storing two words - six letters plus two terminators (i.e. newlines) - in 4 bytes instead of 8.

There is no rule for words that appear in all upper case. Those are typically stored separately.
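
Layered on top of Rule 1, the case-flip byte only takes a couple of extra lines. This is a sketch in Python, with flip_case as a hypothetical helper of my own, not something from the ROM:

    def flip_case(word: str) -> str:
        """Flip the case of the first letter: Bob -> bob, test -> Test."""
        return word[0].swapcase() + word[1:]

    def decode_rule2(data: bytes) -> list[str]:
        words, current = [], []
        for b in data:
            if b == 0x80:                       # Rule 2: previous word again, case flipped
                words.append(flip_case(words[-1]))
            elif b & 0x80:                      # Rule 1: final letter with the high bit set
                current.append(chr(b & 0x7F))
                words.append("".join(current))
                current = []
            else:
                current.append(chr(b))
        return words

    print(decode_rule2(bytes([0x42, 0x6F, 0xE2, 0x80])))   # ['Bob', 'bob']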

Rule 3: Suffixes and Plurals

The first two rules decoded much of the file, but I kept encountering missing words, all of which appeared to be cases where the root word decoded successfully but a version with a suffix was missing. Many of these root words were followed by bytes outside the ASCII range. After looking for patterns, all of these bytes fell in the range 0x82 through 0x9E, with some values missing. Many appeared to behave identically, but after iterating through them the pattern emerged: each byte adds a version of the current word with a suffix appended, and if the lowest bit is set (i.e. the byte is odd), the capitalization is flipped. For example, 0x82 adds the current word with a suffix in whatever case it is, while 0x83 adds the same suffix but flips the capitalization.

Rule 3: If a byte in the range 0x82 through 0x9E is encountered, append a suffix to the most recent decompressed word, capitalizing or lower-casing it depending on the lowest bit.

Byte    Meaning
0x74    First letter - t
0x65    Second letter - e
0x73    Third letter - s
0xF4    End of word - t - test
0x82    Suffix s - tests
0x83    Suffix s + flip case - Tests. The word list is "test, tests, Tests"

Example of Rule 3

The following suffixes were found: s, 's, s', ed, red, ing, est, eth, ite, ites, ion, ions, ation, ations, able

Additionally, the code appears to handle English spelling rules regarding suffixes and vowels, at least for some of the suffixes. If the "current" word ends in a vowel and the suffix starts with a vowel, the final vowel of the word is removed before the suffix is appended. For example, if the current word is code and the suffix is ed (represented by 0x88), the resulting word is coded.
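
Here is a rough Python sketch of how I model Rule 3. Only two codes are pinned down above (0x82/0x83 = s, 0x88 = ed); the contiguous byte-to-suffix ordering below is purely illustrative, and the real table has gaps:

    VOWELS = set("aeiou")

    # Suffixes found in the lexicon; the ordering beyond "s" and "ed" is an assumption.
    SUFFIXES = ["s", "'s", "s'", "ed", "red", "ing", "est", "eth",
                "ite", "ites", "ion", "ions", "ation", "ations", "able"]

    def apply_suffix(word: str, code: int) -> str:
        """Rule 3: a byte in 0x82-0x9E appends a suffix to the previous word.
        An even code keeps the current case; an odd code flips it."""
        suffix = SUFFIXES[(code - 0x82) // 2]
        if code & 1:                                  # lowest bit set: flip capitalization
            word = word[0].swapcase() + word[1:]
        if word[-1] in VOWELS and suffix[0] in VOWELS:
            word = word[:-1]                          # drop the trailing vowel (code + ed -> coded)
        return word + suffix

    print(apply_suffix("test", 0x82))   # tests
    print(apply_suffix("test", 0x83))   # Tests
    print(apply_suffix("code", 0x88))   # coded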

Rule 4: Hyphens

A lot of words in the word list contain hyphens. This appears to be a design choice by the makers of this software, as many other editions of the KJV omit the hyphens. For example, the name Abi-albon, which appears in 2 Sam. 23:31, is written without the hyphen in many places (including the KJ-21). For the KJ-21 this may have been a choice to keep the built-in keyboard simple.

Rule 4: If the end of a word has been reached and is followed by a hyphen, backspace until encountering the first hyphen, then continue building a new word. For example, to encode Test-ing and Test-ed (ignoring the suffix rule, #3):

Byte    Meaning
0x54    T
0x65    e
0x73    s
0x74    t
0x2D    hyphen -
0x69    i
0x6E    n
0xE7    End of word, g: Test-ing
0x2D    hyphen - backspace until encountering the first hyphen; current word is Test-
0x65    e
0xE4    End of word, d: Test-ed

Example of Rule 4
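
A sketch of the hyphen handling in Python, assuming only Rules 1 and 4 are in play (the real decoder interleaves all of the rules):

    def decode_hyphens(data: bytes) -> list[str]:
        words, current = [], []
        word_done = False                  # True right after a word has been terminated
        for b in data:
            ch = chr(b & 0x7F)
            if word_done and ch == "-":
                # Rule 4: backspace through the previous word until a hyphen is
                # reached, then keep building a new word from that stem.
                stem = words[-1]
                current = list(stem[: stem.rindex("-") + 1])
                word_done = False
                continue
            current.append(ch)
            if b & 0x80:                   # Rule 1: high bit marks the final letter
                words.append("".join(current))
                current = []
                word_done = True
            else:
                word_done = False
        return words

    data = bytes([0x54, 0x65, 0x73, 0x74, 0x2D, 0x69, 0x6E, 0xE7, 0x2D, 0x65, 0xE4])
    print(decode_hyphens(data))            # ['Test-ing', 'Test-ed']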

Rule 5: Trimming / Preserving

When a word is terminated using Rule 1 and is not followed by a suffix or any other byte covered by the rules so far, the decoder simply starts a brand new word. The final rule re-uses some characters from the previously decoded word.

Rule 5: For bytes 0x01 through 0x1C (1 - 28 in decimal), preserve that many characters of the previously decompressed word, and then start a new word using those characters.

Using the example of Test and Team:

Byte    Meaning
0x54    T
0x65    e
0x73    s
0xF4    End of word, t: Test
0x02    Preserve the first 2 characters, Te
0x61    a
0xED    End of word, m: Team. The word list is Test, Team

Example of Rule 5
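
A sketch of Rule 5 in Python, again assuming only Rules 1 and 5 are active:

    def decode_rule5(data: bytes) -> list[str]:
        words, current = [], []
        for b in data:
            if 0x01 <= b <= 0x1C:                 # Rule 5: keep the first N characters of the previous word
                current = list(words[-1][:b])
            elif b & 0x80:                        # Rule 1: final letter, high bit set
                current.append(chr(b & 0x7F))
                words.append("".join(current))
                current = []
            else:
                current.append(chr(b))
        return words

    data = bytes([0x54, 0x65, 0x73, 0xF4, 0x02, 0x61, 0xED])
    print(decode_rule5(data))                     # ['Test', 'Team']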

A Few Technical Observations From Disassembly

I do not consider myself even an amateur when it comes to disassembly or reading assembly code. I took one course in college 20 years ago that dealt with assembly - mostly x86 with a tiny bit of 6502 thrown in. But I do think I know enough to parse through execution traces produced by BizHawk / Gambatte and see a little of what is going on. I don't yet know how the sequence of tokens (integers representing words) is produced from the verse data, but I do at least understand how the words are decoded.

Visualization of the Skip List(s) and surrounding blocks
  • The ROM stores a "skip list", starting very close to the beginning of the cartridge at 0x0C2A. The word number is shifted right by 4 bits, and the remaining 12-bit number (0-4095) is used to look up an entry in this skip list. Each entry is three bytes: the first byte indicates which bank of memory to use (the dictionary spans four), and the next two bytes are an offset within that bank to a word close to the one we want (the word number with the lowest 4 bits zeroed out). A small sketch of this lookup follows the list.
  • In the visualization above, I have filtered for the bytes 0x00-0x03, with anything higher showing up as black. You can see the first byte of each entry in the skip list as purple (indicating the first memory bank of the lexicon), followed by entries for the second bank (orange). ~14,000 words (0x36B0), shifted right by 4 (0x36B = 875), times 3 bytes = 2,625 bytes (give or take) for the skip list.
  • The code then advances from that point to the end of the target word, looks for suffixes, and then starts working backward to unwind all of the compression rules.
  • Occasionally Rule 5 is ignored and a full word will appear without any compression. It took me a while to figure out why. Using the word beginning (the 3rd word in Gen 1:1), the ROM uses word number 1309. My decoder says 1299, so I have a bug or an unaccounted-for edge case somewhere. Using the skip list, execution moves to the start of the word beg (word 1286 in my counting). But because of all of the other rules, the pointer really lands on the letter 'g' in beg. So it appears that every so often the encoder skips Rule 5 and writes a full word out. In the case of decoding beginning, the nearest full word before beg is bedchamber.
  • Putting all of that together: as the code is iterating through (I assume) an array of integers that indicates words, the integer is shifted to derive a skip list lookup value. The skip list gets us close to the word we want in the dictionary, beg in this example. The code finds the end of the word we care about, then works backwards through memory to find the first uncompressed word, bedchamber. It then works forward to construct the actual word. For the word beginning, this takes around 6,000 instructions.
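
To illustrate the lookup, here is a small Python sketch of how I model the skip list. The three-byte bank/offset layout matches my reading of the ROM, but the byte order of the offset is an assumption I have not verified against the disassembly:

    SKIP_LIST_BASE = 0x0C2A        # start of the skip list in the ROM

    def skip_list_entry(rom: bytes, word_number: int) -> tuple[int, int]:
        """Return (bank, offset) of a nearby word: the low 4 bits of the word
        number are dropped, so the entry points at the closest earlier word
        whose number is a multiple of 16."""
        index = word_number >> 4                  # 12-bit index into the skip list
        entry = SKIP_LIST_BASE + index * 3        # three bytes per entry
        bank = rom[entry]                         # which of the four lexicon banks to use
        offset = rom[entry + 1] | (rom[entry + 2] << 8)   # offset within that bank (assumed little-endian)
        return bank, offset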

Final Results

The resulting word list on this device contains about 13,858 words. I say about - this differs from other KJV word lists, and there are a couple of edge cases I haven't deciphered. Expanded out, this would be 111,717 bytes with a single-byte separator after each word. Using just 740 bytes of code (around address 0xC900), the developers were able to store a 111 kB word list in about 45 kB. Including the ~2.6 kB skip list, this works out to a little better than 2:1 compression.
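
As a quick sanity check on those numbers (taking the 45 kB figure at face value):

    expanded   = 111_717                # word list expanded, one separator byte per word
    compressed = 45 * 1024 + 2_625      # ~45 kB lexicon plus the ~2.6 kB skip list
    print(f"{expanded / compressed:.2f}:1")   # about 2.3:1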
