This is a long overdue continuation of a series I stared a couple years ago. Last time I wrote about huffman codes, and eventually was going to write about prefix trees.
Does this research really provide anything useful for the world? Maybe not. It is more of a curiosity. As a computer scientist, I have long found it fascinating how my predecessors of one or two generations before me were able to pack so much functionality into such constrained systems.
I had been focusing on a proprietary, patented device that has about 1MB of memory that stores about 5MB of text. It turns out though that there was a similarly constrained device, a Gameboy cartridge with similar technical limitations.
The cartridge appears to be an unlicensed title from a company that specifically was selling into the Christian market because they were working around licensing issues from Nintendo (Wikipedia has the history). This company mainly dealt with NES titles, but released a few for the GameBoy. From what I was able to find about it online, their GB cartridges used a custom/non-standard memory controller. The Gameboy has several different memory controllers that can be used on cartridges, offering tradeoffs of available RAM and ROM. This one appears to use one that is not normally used by licensed titles. Almost every emulator will crash once the text is accessed, but BGB and Gambatte appear to support this memory controller.
Using a tool that can visualize a binary using color ranges, it is clear that there are at least 3 major sections of data. The first one, 45kb, consists of a lot of bytes of ASCII data. Then there is a section that looks like it has gradients of descending numbers in 4 byte blocks, and is around 120 kb - "block b". Then there is a final section that has very high entropy, maybe some form of compression, around 810 kb in length.
The section with a lot of ASCII looked like the lexicon. There were a lot of words, or parts of words, but most words were missing their final letter. That letter was replaced with byte with a high bit set. But, it was clear that about 45kB of memory was devoted to a list of words, with some special flags or bits set here and here as a form of rudimentary compression.
I set this word list aside for a long time, but recently went back to decipher what was going on and figured most of it out. This is a simplified version of the rules employed by the decoder - there are some edge cases I won't be covering here, and some I have not figured out yet without digging deep into the disassembly.
Rule 1: The High Bit Final Letter
The very first word in the word list is the capital letter A
. In ASCII, this is represented as one byte, 0x41 (0b01000001)
. But, the word list starts with 0xC1
. This happens to be 0x41
with the high bit set - 0b11000001
. Looking through the word list in binary, there were lots of instances where there was a ASCII letter that had a high bit set.
This rule alone eliminates a delimiter after each word, and is responsible for over 10% of the savings in this algorithm.
Rule 1: For decoding, if a high bit is encountered, consider it the end of the word. Take the current byte, masking out the high bit, and use it as the last letter of the current word.
Byte | Meaning |
---|---|
0x42 | First letter - B |
0x6F | Second letter - o |
0xE2 | Final letter - b with high-bit set - Bob |
Rule 2: Upper and Lower Case
When trying to figure out how the KJ-21 worked, it seemed like upper and lower-case versions of words likely had separate codes. This could be observed when doing a search like test?
- the KJ-21 slowly prints each word it encounters that matches the pattern, and for some words would print test
, Test
and TEST
. The Gameboy cartridge appears to follow a similar pattern.
Rule 2: When a lower-case and upper-case version of a word both exist in the text, a 0x80 byte will appear after the last letter of the word. For example, if Bob
was in the dictionary but bob
was as well, it would be represented as follows:
Byte | Meaning |
---|---|
0x42 | First letter - B |
0x6F | Second letter - o |
0xE2 | Final letter - b with high-bit set - Bob |
0x80 | Add the previous word, with case flipped - bob . The word list is Bob,bob |
This allows storing two words - 6 letters plus two additional terminators (ie: newline) in 4 bytes instead of 8 bytes.
There is no rule for words that should be in upper case. Those are typically stored separately.
Rule 3: Suffixes and Plurals
The first two rules decoded much of the file, but I kept encountering a lot of missing words, and they all appeared to be versions of words where the root word was successfully decoded, but a version with a suffix was missing. Many of these words were followed by bytes outside of the ASCII range. After looking for patterns, all of these appeared to be in the range of 0x82
and 0x9E
, with some values missing. It appeared that many of these were identical, but after iterating through them, the pattern was if the lowest bit was set, ie: it was odd, it meant to add a version with the suffix, with the word capitalized or lower-cased, depending on the state of the "current word". 0x82
would add the current word in whatever case it was, 0x83
would add the same suffix but flip the capitalization.
Rule 3: If a byte of 0x82
through 0x9E
is encountered, append a suffix to the most recent decompressed word, capitalizing or lower casing depending on the lowest bit.
Byte | Meaning |
---|---|
0x74 | First letter - t |
0x65 | Second letter - e |
0x73 | Third letter - s |
0xF4 | End of word - t - test |
0x82 | suffix s - tests |
0x83 | suffix s + flip case : Tests . The word list is "test, tests, Tests" |
The following suffixes were found: s, 's, s', ed, red, ing, est, eth, ite, ites, ion, ions, ation, ations, able
Additionally, the code appears to handle English rules regarding suffixes and vowels, at least for some of the suffixes. If the last letter of the "current" word ends in a vowel, and the suffix starts with a vowel, then the last letter of the "current" word is removed before concatenating the suffix. For example, if the current word is code
and the suffix is ed
(represented by 0x88), the resulting word would be coded
.
Rule 4: Hyphens
There are a lot of words that appear in the word list that have hyphens in them. This appears to be a design choice by the makers of this software, as many other versions of the KJV omit the hyphens. For example, the name Abi-albon
, which appears in 2 Sam. 23:31, in many places (including the KJ-21) omit the hyphen. For the KJ-21 this may have been a choice to keep the built-in keyboard simple.
Rule 4: - if an end of word has been reached and is followed by a hyphen, backspace until encountering the first hyphen, and then continue building a new word. For example, if we wanted Test-ing
and Test-ed
, and discarded the suffix rule #3:
Byte | Meaning |
---|---|
0x54 | T |
0x65 | e |
0x73 | s |
0x74 | t |
0x2D | hyphen - |
0x69 | i |
0x6E | n |
0xE7 | End of word, g : Test-ing |
0x2D | hyphen - - backspace until encountering the first hyphen, current word Test- |
0x65 | e |
0xE4 | End of word, d : Test-ed |
Rule 5: Trimming / Preserving
When a word is terminated using rule 1, and it is not followed by a suffix or any other byte that has a special rule that we have encountered so far, the rule is to start a brand new word. The final rule will re-use some characters from the previously decoded word.
Rule 5: For bytes 0x01
through 0x1C
(1 - 28 in decimal), preserve that many characters of the previously decompressed word, and then start a new word using those characters.
Using the example of Test
and Team
:
Byte | Meaning |
---|---|
0x54 | T |
0x65 | e |
0x73 | s |
0xF4 | End of word, t : Test |
0x02 | Preserve first 2 characters, Te |
0x61 | a |
0xED | End of word, m : Team . Word list is Test,Team |
A Few Technical Observations From Disassembly
I do not consider myself even an amateur when it comes to disassembly or reading assembly code. I took one course in college 20 years ago that dealt with assembly - mostly x86 with a tiny bit of 6502 thrown in there. But, I do think I know enough to parse through execution traces produced by BizHawk / Gambatte to see a little bit of what is going on. I don't yet know how the sequence of tokens (integers representing words) is produced from verse data, but I do understand at least how the words are decoded.
- The ROM stores a "skip list", starting very close to the beginning of the cartridge at
0x0C2A
. The word number is shifted right by 4 bits, and that remaining 12 bit number (0-4095) is used to look up a value in this skip list. The skip list contains three-byte entries. The first byte indicates which bank of memory to use (the dictionary spans 4), and then two bytes are an offset within that bank to a word that is close to the one we want (the word number with the lowest 4 bits zeroed out). - In the visualization above, I have filtered the bytes from 0x00-0x03, with anything higher showing up as black. You can see the first byte of each entry in the skip list as a purple (indicating the first memory block for the lexicon), followed by words for the second block (orange). ~14,000 words (0x36B0), shifted right by 4 (0x36B = 875), times 3 bytes = 2,625 bytes (give or take) for the skip list.
- The code then advances from that point to the end of the target word, looks for suffixes, and then starts working backward to unwind all of the compression rules.
- Occasionally Rule 5 is ignored and a full word will appear without any compression. It took me a while to figure out why. Using the word
beginning
(3rd word in Gen 1:1), it uses word number 1309. My decoder says 1299, so I have a bug or unaccounted for edge case somewhere. Using the skip list, execution moves to the start of the wordbeg
(word 1286 in my counting). But, because of all of the other rules, the pointer is really to the letter 'g' inbeg
. So what the code appears to do is every so often it will skip rule 5, and write a full word out. In the case of decoding "beginning", and before the wordbeg
, this full word isbedchamber
. - Putting all of that together: as the code is iterating through (I assume) an array of integers that indicates words, the integer is shifted to derive a skip list lookup value. The skip list gets us close to the word we want in the dictionary,
beg
in this example. The code finds the end of the word we care about, then works backwards through memory to find the first uncompressed word,bedchamber
. It then works forward to construct the actual word. For the wordbeginning
, this takes around 6,000 instructions.
Final Results
The resulting word list on this device is about 13,858 words. I say about - this differs from other lists on the KJV and there are a couple edge cases I haven't deciphered. Expanded out, this would be 111,717 bytes with a single-byte separator. Using just 740 bytes of code (around address 0xC900
), the developers were able to store a 111kb word list in about 45kb. Including the ~2.6kb skip list, this is around 2:1 compression.