Using an LLM for text compression

Upon seeing DRINK ME by “J” on o565.com, I remarked on lobste.rs that you can actually go a lot further with this idea — you can make a decent compressor just by hooking the next-token probabilities of an LLM up to an arithmetic coder. This isn't an original idea (it's the basis of Fabrice Bellard's ts_zip), but I wanted to try it myself.

However, it turns out that there are a few too many wrinkles to both LLMs and arithmetic coding for a simple weekend hack, so I decided to do the next best thing. I wrote a compressor that uses any LLM supported by llama.cpp as a pre-processor for zlib. That's not quite as good as arithmetic coding, but I still managed to get some decent results. Just think of it as leaving some room for improvement :)

The models I ended up using were LLaMA-2-7B and LLaMA-3-8B, both quantized with “Q5_K_M”, which means slightly more than 5 bits per coefficient. I also tested Gemma-2B, but didn't get good results, so I left it out.

The code is on GitHub, but it's not intended for any serious use.

How it works

My code takes the recent context (up to 512 LLM tokens' worth of what's already been compressed), passes it to the model, and asks for the 128 most likely candidates for the next token.
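
In sketch form, that query against the llama.cpp HTTP server looks something like the following, assuming its /completion endpoint and the n_probs option (the exact JSON field names vary between llama.cpp versions, so treat this as illustrative rather than my actual code):

```python
import requests

def top_candidates(context: str, n: int = 128) -> list[str]:
    """Ask a local llama.cpp server for the n most likely next tokens."""
    resp = requests.post(
        "http://localhost:8080/completion",
        json={
            "prompt": context,
            "n_predict": 1,      # we only need the distribution for one step
            "n_probs": n,        # return the n most likely tokens
            "temperature": 0.0,  # keep the run deterministic
        },
    )
    resp.raise_for_status()
    # Older servers return a list of per-token entries, each with a "probs"
    # list of {"tok_str", "prob"} objects, most likely first.
    probs = resp.json()["completion_probabilities"][0]["probs"]
    return [p["tok_str"] for p in probs]
```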

If the top most-likely token is a prefix of the remaining input, I choose that token. Otherwise, I choose the longest prefix-matching token from among the 128. In either case, a token is encoded as its index in the list of candidates (the most likely being 0, and the last entry in the list being 127). This is a simple heuristic: a longer token means fewer symbols to code, but a long run of zeroes is even better, because zlib compresses it down to almost nothing.

If none of the top 128 tokens are prefixes of the remaining input, then the next character of input is encoded as its Unicode codepoint plus 128.
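
In rough Python, one encoding step therefore looks something like this (a simplification; the helper name is just for illustration):

```python
def encode_step(candidates: list[str], remaining: str) -> tuple[int, int]:
    """Return (symbol to emit, number of input characters consumed)."""
    # Candidates are ordered most-likely-first; keep only those that actually
    # match the start of the remaining input.
    matches = [(i, tok) for i, tok in enumerate(candidates)
               if tok and remaining.startswith(tok)]
    if matches:
        if matches[0][0] == 0:
            return 0, len(matches[0][1])   # top token matches: emit a zero
        i, tok = max(matches, key=lambda m: len(m[1]))
        return i, len(tok)                 # otherwise take the longest match
    return ord(remaining[0]) + 128, 1      # literal fallback: codepoint + 128
```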

The sequence of integers that comes out of that is encoded using ULEB128. Token indices take up 1 byte each, and literal codepoints take up 2-3 bytes depending on their value.
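
ULEB128 is just the usual unsigned varint: seven bits per byte, low-order group first, with the high bit set on every byte except the last. A minimal encoder:

```python
def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)
```

Token indices 0-127 fit in a single byte, while a literal like ord('é') + 128 = 361 takes two.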

The ULEB128 bytes are then compressed with zlib in straight “deflate” mode, with no gzip headers.
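
From Python, that kind of header-free stream is just a negative wbits value (I'm assuming raw DEFLATE here; the point is only that there's no gzip framing):

```python
import zlib

def deflate_raw(data: bytes, level: int = 9) -> bytes:
    # wbits=-15 selects a raw DEFLATE stream: no zlib or gzip header/trailer.
    co = zlib.compressobj(level=level, wbits=-15)
    return co.compress(data) + co.flush()
```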

Decompression

Yes, it's really possible to reverse this (as long as the LLM is run deterministically). First you zlib-decode, then ULEB128-decode. Then, for each value: if it's 128 or above, subtract 128 and output the corresponding character; if it's less than 128, run the LLM on the current context and output the token at that index in its candidate list.
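
As a sketch, the whole decoder is a small loop, with top_candidates standing in for the same deterministic candidate query used while compressing:

```python
import zlib

def decompress(blob: bytes, top_candidates) -> str:
    data = zlib.decompressobj(wbits=-15).decompress(blob)  # assumes raw DEFLATE
    out, pos = [], 0
    while pos < len(data):
        value, shift = 0, 0
        while True:                        # ULEB128-decode one integer
            byte = data[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        if value >= 128:
            out.append(chr(value - 128))   # literal: codepoint + 128
        else:
            # Token index into the candidate list for the current context
            # (in reality the context is capped at the most recent 512 tokens).
            out.append(top_candidates("".join(out))[value])
    return "".join(out)
```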

Results

Although my compressor needs Unicode text input, its output is binary, so all sizes are given in bytes (unlike J's post). “gzip -9” is gzip -9cn, “Brotli” is brotli --best -c, and the last two columns are my code using the two different LLaMA models.

| Name | Info | Original Size | gzip -9 | Brotli | LLaMA-2-7B | LLaMA-3-8B |
|------|------|---------------|---------|--------|------------|------------|
| Alice in Wonderland | Chapter 1 (must be trimmed slightly differently from J's) | 11,858 | 5,091 (2.33x) | 4,285 (2.77x) | 313 (37.9x) | 203 (58.4x) |
| Alice in Wonderland | Chapter 1 ROT13'd | 11,858 | 5,091 (2.33x) | 4,830 (2.46x) | 4,798 (2.47x) | 4,382 (2.71x) |
| Alice in Wonderland | pg11.txt | 174,355 | 60,907 (2.86x) | 51,603 (3.38x) | 4,404 (39.6x) | 2,945 (59.2x) |
| Alice in Wonderland HTML | pg11-images.html | 192,520 | 63,789 (3.02x) | 54,008 (3.56x) | 7,181 (26.8x) | 5,049 (38.3x) |
| The Short Victorious War | Chapter 1 by David Weber | 18,821 | 8,266 (2.28x) | 6,953 (2.71x) | 3,530 (5.33x) | 3,486 (5.40x) |
| GPL v2 | from /usr/share/common-licenses | 18,092 | 6,824 (2.65x) | 5,289 (3.42x) | 197 (91.8x) | 139 (131x) |
| clippings.go | Some $WORK code that I'm sure isn't verbatim in the corpus | 18,680 | 6,399 (2.92x) | 5,574 (3.35x) | 2,766 (6.75x) | 2,697 (6.93x) |
| packager_ingress.yaml | Kubernetes manifest, with embedded HAproxy config | 5,019 | 1,736 (2.89x) | 1,493 (3.36x) | 964 (5.21x) | 871 (5.76x) |
| ig_rz.dat | A Fortran data file containing 65 years of solar indices | 9,682 | 3,503 (2.76x) | 2,721 (3.56x) | 2,184 (4.43x) | 2,201 (4.40x) |

O'RLY?

And just for fun, a small image (100x100, low color):

| PNGCrush | XPM | XPM gzipped | XPM brotli | XPM Llama 2 | XPM Llama 3 |
|----------|-----|-------------|------------|-------------|-------------|
| 894 | 10,807 | 797 | 664 | 801 | 730 |

Discussion

LLaMA 3 outperforms LLaMA 2 in all but one case. Most text compresses around 2x as well as gzip, and at least 1.5x as well as Brotli, but some examples (like Alice in Wonderland and the GPL) get an extremely high ratio. We can assume those were in LLaMA's training corpus, and absorbed very well, so the model predicts them correctly almost 100% of the time, which gives zlib long strings of zeroes to compress.

ROT13ing Alice in Wonderland is illustrative. Because gzip has no prior, ROT13'd text compresses exactly as well as the original. But ROT13 sets Brotli back to being only slightly better than gzip, because its built-in dictionary is now useless, and it sets LLaMA back to being only slightly better than Brotli, because the tokenizer no longer finds words it knows, so the “language model” has nothing with which to make predictions.

The image example was just for fun... it manages to get into the same ballpark as PNG (PNG is only bigger because, even after pngcrush, it's more structured/metadata-laden), but that's because I used a very small image with a horizontal width that fits within the LLM's context window. On anything more substantial, PNG would win.

I had to use a limited context (only 512 tokens into the past) because of limitations in the llama.cpp HTTP API I was using. There's an input caching mechanism that's supposed to help with that: it recognizes when it has already processed a prefix of what you sent, restores the model state from that prefix, and only evaluates the new data. But that cache seemed to have collisions that prevented the data from round-tripping, so I had to turn it off, which in turn meant limiting the context to keep the speed bearable. Bearable in this case means something like 15-35 bytes per second, depending on how well the input tokenizes.

If I were using the library directly instead of the HTTP interface, I could solve that problem and get more speed and better compression at the same time, but I've sunk enough hours into this silly thing already!