Decoding Text Compression on a Gameboy

This will hopefully be my last post in a periodic series on text compression on 1980s hardware.

The previous post covers how the word list on this "game" was compressed.

In this post I will do my best to reconstruct how I figured out how the GB KJV compressed its text. Much of this work was done late at night in short sprints while my wife worked and my infant was in their "not sleeping through the night" phase, and I am writing this up months later.

1. Memory Analysis

To start, let me be clear that I am barely a reverse engineering novice. I took two assembler courses in college, one in x86 and one, I think, in MIPS. That was almost 20 years ago, and I haven't touched assembly or machine language since. As well as I know the JVM, I only have a passing understanding of what goes into JVM bytecode. So, learning the Gameboy instruction set was, well, an interesting challenge.

After loading Genesis 1:1 in an emulator, I was curious whether the plain text was sitting in memory somewhere. Since I knew the lexicon was stored mostly as ASCII (see the previous post), I used an emulator with a debugger to search memory for ASCII text. Sure enough, there were at least two locations with plain text representations of the currently visible verses.

Text for all displayed verses in VRAM at address 0x1800.

One appeared to be a buffer closely tied to what was displayed on the screen; the other was a smaller buffer used to decode the current verse. When the screen starts with Genesis 1:1, there is enough room on and off screen to render verses 1-3, so this second buffer contained the contents of just verse 3.

Text for the last verse loaded, stored in WRAM around address 0x3788

2. Call Analysis

Using the BizHawk emulator, I was able to capture all executed instructions, along with the registers, from the verse selection screen through the first 3 verses being rendered. This produced hundreds of megabytes of logs for just a few seconds of execution.

Aside: The memory mapper on these "games" is so unique that BizHawk explicitly detects these games by hashes of their ROMs. Also, in retrospect, had I a working C# environment, adding custom logging to that code would have been easier than the route I chose to take.

From there, I wrote a script to parse all of the CALL and RET instructions and output a call graph, including the addresses of the call sites and call counts; a rough reconstruction of that script appears at the end of this section. The chart is below, with names added in from later analysis of what the code was doing. (It is worth noting that there are some wonderful decompilation tools that will figure this kind of thing out, just not one I saw at the time for the Gameboy CPU.)

graph LR
    54AA_renderVerse == "54AA 3" ==> 5315_InitVars
    54AA_renderVerse == "54AD 3" ==> 5456_CheckVariableC764
    54AA_renderVerse == "54FD 1" ==> 781C_registerHLmath
    54AA_renderVerse == "5519 3" ==> 7421_renderNumber
    54AA_renderVerse == "5523 3" ==> 7421_renderNumber
    54AA_renderVerse == "552D 3" ==> 7824_sixteenBitSubtract
    54AA_renderVerse == "5509 1" ==> memcpy
    54AA_renderVerse == "5510 1" ==> 56A8_InternalSub1
    54AA_renderVerse == "5587 1" ==> 780D_AddAtoHL
    54AA_renderVerse == "5597 1" ==> memcpy
    54AA_renderVerse == "5534 16" ==> 56B6_InternalSub2
    54AA_renderVerse == "553D 74" ==> 56D0_registerAmath
    54AA_renderVerse == "55E7 61" ==> memcpy
    54AA_renderVerse == "55EC 61" ==> dnt(7365_DecodeNextToken)
    54AA_renderVerse == "55F0 61" ==> C900_decodeToken
    54AA_renderVerse == "5600 61" ==> 6E08_check_C884
    54AA_renderVerse == "5682 3" ==> 56C8_space_fill
    54AA_renderVerse == "568D 3" ==> 6E0F_CheckTwoVarsForZero
    54AA_renderVerse == "5698 12" ==> 56C8_space_fill
    56A8_InternalSub1 == "56A8 1" ==> 56B6_InternalSub2
    56B6_InternalSub2 == "56BC 17" ==> 781C_registerHLmath
    6E0F_CheckTwoVarsForZero == "6E0F 3" ==> 6E08_check_C884
    6E0F_CheckTwoVarsForZero ==> 651C_check_C072
    dnt(7365_DecodeNextToken) == "7372 39" ==> 780D_AddAtoHL
    dnt(7365_DecodeNextToken) == "738A 22" ==> rc(C8AA_ReadSectionC)
    dnt(7365_DecodeNextToken) == "73DB 40" ==> rb(C785_ReadSectionB)
    dnt(7365_DecodeNextToken) == "73E7 40" ==> 780D_AddAtoHL
    dnt(7365_DecodeNextToken) == "implied" ==> 7406_ResetTemporaryTokenBuffer
    style dnt fill:#ffbbbb
    style rb fill:#ffbbbb
    style rc fill:#ffbbbb

Three verses were rendered, accounting for 61 total tokens/words. Those two numbers were great checks to look for - and sure enough, there was a memcpy-type function called 61 times from one call site, as well as a function at 0x7365 that was called 61 times. Several functions were also called 3 times.

It turns out the functions at 0x7365, 0xC785, and 0xC8AA were the most important for actually reading and decompressing the data.
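
For the curious, the trace-parsing script looked roughly like the sketch below. This is a reconstruction rather than the original code, and the trace line format it assumes ("ADDR  MNEMONIC OPERANDS", e.g. "54AA  CALL 5315") is a simplification of what BizHawk actually emits.

// Rough reconstruction of the trace-parsing idea, not the original script.
// Assumes a simplified line format of "ADDR  MNEMONIC OPERANDS".
fun buildCallGraph(traceLines: Sequence<String>): Map<Triple<Int, Int, Int>, Int> {
    val callRegex = Regex("""^([0-9A-Fa-f]{4})\s+CALL\s+(?:\w+,\s*)?([0-9A-Fa-f]{4})""")
    val callStack = ArrayDeque<Int>().apply { addLast(0x0100) }     // assumed entry point
    // (caller function, call site, call target) -> number of times that call executed
    val edgeCounts = mutableMapOf<Triple<Int, Int, Int>, Int>()

    for (line in traceLines) {
        val call = callRegex.find(line)
        if (call != null) {
            val site = call.groupValues[1].toInt(16)
            val target = call.groupValues[2].toInt(16)
            val key = Triple(callStack.last(), site, target)
            edgeCounts[key] = (edgeCounts[key] ?: 0) + 1
            callStack.addLast(target)
        } else if ("RET" in line && callStack.size > 1) {
            callStack.removeLast()      // naive: treats every RET/RETI as taken
        }
    }
    return edgeCounts
}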

3. How The Compression Actually Works

After figuring out what the assembly was doing, I found that the text compression is actually quite simple. There are three ways the text can be compressed.

First, a quick recap from the previous post. There are three significant sections of memory, and a few smaller ones not covered here.

  • A: The lexicon itself
  • B: 4 byte blocks of compressed lookup pairs
  • C: The actual stream of compressed text

The memory mapper on the GBAKJV allows for reading 32 KB blocks, and each block has a small header of code (0x0150 bytes). Because of this, the code has to keep track of a block number and an offset within that block.
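
In code, that bookkeeping might look something like the sketch below. Note that the assumption that the compressed data simply fills each 32 KB bank after its 0x0150-byte header is mine; I have not verified the exact layout.

// Sketch: translate a linear offset into the compressed data into a
// (bank, offset-within-bank) pair. The "data fills the bank after the
// header" layout is an assumption, not a verified detail.
data class BankedAddress(val bank: Int, val offset: Int)

const val BANK_SIZE = 0x8000          // 32 KB per bank
const val HEADER_SIZE = 0x0150        // code header at the start of each bank
const val USABLE_PER_BANK = BANK_SIZE - HEADER_SIZE

fun toBankedAddress(linearOffset: Int) = BankedAddress(
    bank = linearOffset / USABLE_PER_BANK,
    offset = HEADER_SIZE + (linearOffset % USABLE_PER_BANK)
)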

Let's look at the first 24 bytes of the compressed text, and the log of my decoder:

68 AF C5 64 5A 1D BC 44
03 C9 61 EC FA BF 5E 67
A5 EA E4 0C DD 3F A4 F7

68AF/13AF : Standard lookup : Genesis
C564/7064 : Two byte compressed : % | In | the (3 tokens)
5A1D/051D : Standard lookup : beginning
BC44/6744 : Two byte compressed : God | created (2 tokens)
03 : Single byte lookup : the
C961/7461 : Two byte compressed : heaven | and | the | earth (4 tokens)
ECFA/97FA : Two byte compressed : . | @ | And | the | earth (5 tokens)
BF5E/6A5E : Two byte compressed : was | without (2 tokens)
67A5/12A5 : Standard lookup : form
EAE4/95E4 : Two byte compressed : , | and | void (3 tokens)
0C : Single byte lookup : ; | and
DD3F/883F : Two byte compressed : darkness | was (2 tokens)
A4F7/4FF7 : Two byte compressed : upon | the | face | of | the (5 tokens)

Scenario 1: "Standard" Lookup

The first scenario is when a two-byte sequence, or token, is used to look up a token/word. The GBAKJV does not use a List or Map type structure, but instead has a small function (0xC900) that takes in a token integer and returns a word or punctuation symbol. The very first example is the first two bytes: 0x68AF. Notice in the log that the actual token is 0x13AF: the two-byte sequence has had 0x5500 subtracted from it. Put another way, the first byte has had 0x55 subtracted from it, which is what actually happens in the code. More on that in scenario 3.

After subtracting 0x55 and then reading the second byte, a check is done to see if the resulting number is in the range of valid tokens. In this case, the token decoder will accept numbers from 0 to 13870 (or 0x362e in hex). Interestingly, 13875 is also a valid value, and returns "test". The rest of the values from 13871 to 13874, and values higher, appear to be junk. As long as the token is less than or equal to 13870, no further math is done and the word is returned.
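
As a minimal sketch of this path (lookupWord here stands in for the routine at 0xC900; the names are mine):

// Sketch of the scenario-1 "standard lookup" path.
fun decodeStandard(first: Int, second: Int, lookupWord: (Int) -> String): String? {
    val token = ((first - 0x55) shl 8) or second
    // 0x362E (13870) is the largest token that maps directly to a word;
    // anything above that range falls through to the pair table (scenario 2).
    return if (token <= 0x362E) lookupWord(token) else null
}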

Scenario 2: Two-Byte Lookup to Recursive Pairs

If the token, after subtracting 0x55 from the first byte, is greater than 13870, then it indicates that some compression has been done using a lookup table. The lookup table is divided into 4 byte blocks of pairs of two bytes. In the simplest case, this means a 2:1 compression - the token maps to one of these blocks, which in turn maps to two tokens in the 0-13870 range.

The code here is somewhat complicated because the lookup table spans several memory blocks. Much of the code is devoted to converting the two-byte values into offsets in these memory blocks.
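
Converting a token into the location of its pair entry might look something like the sketch below, reusing the toBankedAddress helper sketched earlier. The pair-index formula, the consecutive 4-byte layout, and the sectionBStart parameter are assumptions on my part.

// Sketch: find the 4-byte pair entry for a token above 13870 (0x362E).
// sectionBStart is a hypothetical linear offset for where section B begins.
fun pairEntryAddress(token: Int, sectionBStart: Int): BankedAddress {
    val pairIndex = token - 0x362F            // first pair token sits just above the word range
    return toBankedAddress(sectionBStart + pairIndex * 4)   // two 2-byte tokens per entry
}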

In one of the examples above, C564/7064 : % | In | the, the value in the first position, 0x0003, decodes to a percent sign. There are no percent signs in the KJV, so this must have some special meaning, as do several other values. This is what I have been able to determine:

  • % : Start of a book. This resets the chapter and verse to 1
  • $ : End of a book
  • # : End of a chapter
  • @ : End of a verse
  • { and } - These appear to be for some text outside of a verse, such as a subheading for a chapter or between verses. They start to show up in the Psalms.

The code has a stack that it uses to decompress these types of tokens recursively. The stack only has room for 14 tokens, but given the way the tokens were chosen and the code is implemented, it appears able to handle groups of up to 124 tokens.
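
A minimal sketch of that expansion, done iteratively with an explicit stack the way the ROM appears to do it (readPair stands in for reading a 4-byte entry from section B, lookupWord for the 0xC900 routine, and the push order is my reconstruction so that words come out in reading order):

// Sketch of the recursive pair expansion using an explicit stack.
fun expandToken(
    start: Int,
    readPair: (Int) -> Pair<Int, Int>,
    lookupWord: (Int) -> String
): List<String> {
    val words = mutableListOf<String>()
    val stack = ArrayDeque<Int>()       // the real stack only has room for 14 entries
    stack.addLast(start)
    while (stack.isNotEmpty()) {
        val token = stack.removeLast()
        if (token <= 0x362E) {
            words.add(lookupWord(token))         // plain word or symbol
        } else {
            val (left, right) = readPair(token)  // pair token expands into two more tokens
            stack.addLast(right)
            stack.addLast(left)
        }
    }
    return words
}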

One of the earliest long runs of 14 tokens is (with pipes added to show separation):

years | , | and | begat | sons | and | daughters | : | @ | And | all | the | days | of

6 And Seth lived an hundred and five years, and begat Enos: 7 And Seth lived after he begat Enos eight hundred and seven years, and begat sons and daughters: 8 And all the days of Seth were nine hundred and twelve years: and he died. 9 And Enos lived ninety years, and begat Cainan: 10 And Enos lived after he begat Cainan eight hundred and fifteen years, and begat sons and daughters: 11 And all the days of Enos were nine hundred and five years: and he died. 12 And Cainan lived seventy years, and begat Mahalaleel: 13 And Cainan lived after he begat Mahalaleel eight hundred and forty years, and begat sons and daughters: 14 And all the days of Cainan were nine hundred and ten years: and he died. 15 And Mahalaleel lived sixty and five years, and begat Jared: 16 And Mahalaleel lived after he begat Jared eight hundred and thirty years, and begat sons and daughters: 17 And all the days of Mahalaleel were eight hundred ninety and five years: and he died.

Genesis 5:6-17

In this block the phrase "years, and begat sons and daughters: [verse marker] And all the days of" is repeated 4 times. This is represented as 0x64EA each time, which expands out to:

graph
    64EA ==> 4E08
    4E08 ==> 3FB3
    4E08 ==> 3FCF
    3FB3 ==> 356B(356B: years)
    3FB3 ==> 3D67
    3D67 ==> 362F
    362F ==> 0006(0006: comma)
    362F ==> 0249
    3D67 ==> 0512(0512: begat)
    3FCF ==> 3B56
    3B56 ==> 2CE1(2CE1: sons)
    3B56 ==> 0249(0249: and)
    3FCF ==> 0BF5(0BF5: daughters)
    64EA ==> 5705
    5705 ==> 3670
    3670 ==> 3633
    3633 ==> 0008(0008: colon)
    3633 ==> 000B(000B: verse marker)
    3670 ==> 0248(0248: And)
    5705 ==> 44D7
    44D7 ==> 3636
    3636 ==> 01BB(01BB: all)
    3636 ==> 3018(3018: the)
    44D7 ==> 380C
    380C ==> 0BFE(0BFE: days)
    380C ==> 217E(217E: of)

The longest chains appear to be 122 and 124 tokens long and overlap with one another.

  • 0x8291 (4 times, 124 tokens): , | offered | : | @ | His | offering | was | one | silver | charger | , | the | weight | whereof | was | an | hundred | and | thirty | shekels | , | one | silver | bowl | of | seventy | shekels | , | after | the | shekel | of | the | sanctuary | ; | both | of | them | full | of | fine | flour | mingled | with | oil | for | a | meat | offering | : | @ | One | golden | spoon | of | ten | shekels | , | full | of | incense | : | @ | One | young | bullock | , | one | ram | , | one | lamb | of | the | first | year | , | for | a | burnt | offering | : | @ | One | kid | of | the | goats | for | a | sin | offering | : | @ | And | for | a | sacrifice | of | peace | offerings | , | two | oxen | , | five | rams | , | five | he | goats | , | five | lambs | of | the | first | year | : | this | was | the | offering | of
  • 0x5d7c (3 times, 122 tokens): : | @ | His | offering | was | one | silver | charger | , | the | weight | whereof | was | an | hundred | and | thirty | shekels | , | one | silver | bowl | of | seventy | shekels | , | after | the | shekel | of | the | sanctuary | ; | both | of | them | full | of | fine | flour | mingled | with | oil | for | a | meat | offering | : | @ | One | golden | spoon | of | ten | shekels | , | full | of | incense | : | @ | One | young | bullock | , | one | ram | , | one | lamb | of | the | first | year | , | for | a | burnt | offering | : | @ | One | kid | of | the | goats | for | a | sin | offering | : | @ | And | for | a | sacrifice | of | peace | offerings | , | two | oxen | , | five | rams | , | five | he | goats | , | five | lambs | of | the | first | year | : | this | was | the | offering | of

The last two-byte sequence is 0xAAFF, "in | former | , | and | the | God", which is used 434 times. This makes sense given the subtraction of 0x55: 0xAA + 0x55 = 0xFF, so the encoded form is 0xFFFF, the largest number that can be represented when reserving 0x00-0x54 for one-byte purposes.

The 2-byte patterns range from 0x362F to 0xAA56, for a total of 29,642 pairs.

Scenario 3: Single Byte Compression

Remember that value 0x55? When a byte is read from memory section C and it is greater than 0x55, a second byte is read, 0x55 is subtracted from the first, and the two bytes are joined together to handle scenarios 1 and 2. But if that first byte is less than 0x55, then we are dealing with a single byte lookup. The value is multiplied by two, then looked up in a separate lookup table. That table holds two-byte values, which point into the same lookup table as scenario 2. This also means that a single byte can ultimately decompress to more than one word.
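
Putting the branches together, the top-level read looks roughly like this (readByte pulls the next byte from section C, and singleByteEntry stands in for the small single-byte table, whose exact location I have not confirmed):

// Sketch of the top-level token read.
fun readNextToken(readByte: () -> Int, singleByteEntry: (Int) -> Int): Int {
    val first = readByte()
    return if (first < 0x55) {
        // Scenario 3: one byte, doubled, looked up in the separate table.
        // The result can still be a pair token that expands further.
        singleByteEntry(first * 2)
    } else {
        // Scenarios 1 and 2: drop the 0x55 offset and join with the next byte.
        ((first - 0x55) shl 8) or readByte()
    }
}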

For example, 0x01 is just a comma. 0x02 is "and", 0x03 is "the".

Take 0x1E. It expands to 4 words or symbols:

graph
    1E ==> 2EF4
    2EF4 ==> 3632
    2EF4 ==> 3637
    3632 ==> 0007(0007: period .)
    3632 ==> 000B(000B: verse marker)
    3637 ==> 0248(0248: And)
    3637 ==> 3018(3018: the)

I think it likely was a design decision to make sure that these single-byte values all mapped to word groupings where all of the words or symbols could be considered "stop" words. It may be that these extremely common phrases just happen to include punctuation or common articles (and, the, etc), or it may have been done this way to speed up searching. Many searching or indexing algorithms discard "stop words", so if the search part of the code (which I do not plan on looking into) wanted to be a bit faster, it could just skip these single bytes.

I have not dug into the code that does searching to see if stop-word skipping is actually implemented. Searching for another if (byte <= 0x55) type block would not be too hard to do to confirm whether this is indeed the case.

Does this Algorithm Have A Name?

I did some research to see if there was a name for this algorithm. After a while, it looks like this is an early implementation of https://en.m.wikipedia.org/wiki/Re-Pair, or Recursive Pairing. This algorithm was formalized in 1999, six years after the GBKJV was released. It is a fairly simple algorithm, so it is possible that someone at Wisdom Tree came up with a similar implementation before it was given a name. It would be really interesting to learn whether someone who worked for Wisdom Tree ended up being one of the names associated with Re-Pair.

A copy of the formal algorithm can be found via Google Scholar. The authors used the King James Bible as one of their test cases, and produced a compressed version at around 1.7 MB. Though the paper does not say, I suspect they were performing lossless compression on the English ASCII characters, rather than a lossy tokenized integer compression.

Overall Compression Algorithm

The overall compression algorithm / codec thus appears to be the following (a rough code sketch follows the list):

  1. Tokenize all words and symbols, typically separated by spaces, into integer tokens.
  2. Recursively identify pairs of tokens that appear 3+ times, replacing two adjacent tokens with a single token that points to a lookup table. Keep doing this until the lookup table is full, or there are no more repeated pairs.
  3. Re-number tokens, such that frequent stop-word pairs get a value of 0x00-0x55, then plain word/punctuation tokens, then pair lookups.
  4. Output the tokens
    • 0x00-0x55 get a single byte
    • Otherwise, two bytes
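
A rough sketch of the encoding side is below. This is my reconstruction of the steps above, not Wisdom Tree's actual tooling: the greedy pair selection, the 3+ occurrence cutoff, the firstPairToken and maxPairs values, and the omitted renumbering/byte-emission step are all simplifications.

// Re-Pair style encoder sketched from the steps above.
fun compress(
    tokens: List<Int>,
    firstPairToken: Int = 0x362F,      // assumed first token number for pair entries
    maxPairs: Int = 0x10000            // placeholder table limit
): Pair<List<Int>, List<Pair<Int, Int>>> {
    var stream = tokens.toMutableList()
    val pairTable = mutableListOf<Pair<Int, Int>>()
    while (pairTable.size < maxPairs) {
        // Count adjacent token pairs in the current stream.
        val counts = mutableMapOf<Pair<Int, Int>, Int>()
        for (i in 0 until stream.size - 1) {
            val p = stream[i] to stream[i + 1]
            counts[p] = (counts[p] ?: 0) + 1
        }
        val best = counts.maxByOrNull { it.value } ?: break
        if (best.value < 3) break                  // only replace pairs that repeat 3+ times
        val newToken = firstPairToken + pairTable.size
        pairTable.add(best.key)
        // Replace every non-overlapping occurrence of the pair with the new token.
        val replaced = mutableListOf<Int>()
        var i = 0
        while (i < stream.size) {
            if (i + 1 < stream.size && stream[i] == best.key.first && stream[i + 1] == best.key.second) {
                replaced.add(newToken); i += 2
            } else {
                replaced.add(stream[i]); i += 1
            }
        }
        stream = replaced
    }
    return stream to pairTable                     // renumbering and byte output would follow
}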

Validating

To validate this, I actually emulated parts of the ROM using some custom Kotlin code. I started on this well before I saw that BizHawk was open source and written in C#; otherwise I might have just worked in their codebase.

After identifying where the text decompression was running, I emulated each instruction, and broke functions down based on where RET statements were, or where some jumps were being performed.

I may release this code at some point, maybe. It is a mess.

Here is a sample of what it looks like:

// Implementation of a 16 bit subtractor, since the Gameboy is only capable of 8 bit subtraction
import org.danwatt.gbakjv.*
@ExperimentalStdlibApi
@ExperimentalUnsignedTypes
class SixteenBitSubtraction_7824 : GBFunc {
    override fun run(cpu: CpuState) {
        // cpu.asm is just a comment on what the original statement was, and what address it came from
        cpu.asm("7824 LD   A,E")
        cpu.a = cpu.e

        cpu.asm("7825 SUB  A,C")
        cpu.a = cpu.a.sub(cpu.c, cpu.f)

        cpu.asm("7826 LD   E,A")
        cpu.e = cpu.a

        cpu.asm("7827 LD   A,D")
        cpu.a = cpu.d

        cpu.asm("7828 SBC  A,B")
        cpu.a = cpu.a.subc(cpu.b, cpu.f)

        cpu.asm("7829 LD   D,A")
        cpu.d = cpu.a

        cpu.asm("782A RET")
    }
}
