Andrew's Blog

The United States is, or claims to be, a constitutional republic. Let's examine what that means.

A constitutional form of government is one in which the government is obligated to follow the law. The US Constitution is, it says, “the supreme Law of the Land” — the law before all other law. But more than that, it's the law that brings the federal government into existence, and the only font of legitimate government authority. The “will of the people” set the Constitution in motion, and the Constitution set the government in motion. A law needs to be constitutional or it's no law, and any exercise of power that doesn't find its justification in the Constitution is a crime.

Let's Make Amends

The Constitution contains the means of its own amendment. This is important. Since the Constitution is the paramount law of the land, and the Constitution provides the means by which it can be modified, it follows that that is the only way in which the Constitution “evolves”. The operation of the Constitution doesn't change by “finding a new interpretation” compatible with prevailing opinion; it changes by (in the words of George Washington) “an explicit and authentic act of the whole people” — or, in more practical terms, when an amendment is duly proposed, and ratified by three quarters of the states.

If “the majority of the people” want something the Constitution forbids, then it's entirely intentional that the majority should be frustrated. The law isn't a club to be wielded by state against state, or neighbor against neighbor. The Constitution exists to protect the 49% against the 51%. It's the levee that stems the tide of chaotic public opinion unless it finds a consistent, near-unanimous direction.

Most of the great crises in American history have arisen from this conflict between those who believe that the government is obligated to follow the law, and those who believe that the government is obligated to “do what's right” — i.e. bow to a prevailing opinion that doesn't have the clout to achieve its ends legally. In many cases through history the “do what's right” crowd has won, to great celebration. But the continuous erosion of the Constitution occasioned by those victories has left the people without their protection against mobocracy, and eroded away the entire concept of rule of law. People aren't supposed to live in fear that one day they will wake up and find that the law targets them for abuse and destruction — but today they do.

A Matter of Interpretation

So, the Constitution can't be swerved from its original intent except by amending it, but at a remove of hundreds of years, there are sometimes legitimate questions about just what that intent was. We can refer to the minutes of the constitutional convention, where alternatives were discussed and rejected, and there is a great deal of writing by the Framers and their contemporaries, but a crucial source of enlightenment is the ratifying debates in the individual states.

The Constitution wasn't universally loved at its introduction, and its passage was far from certain. There were many who believed (with some justification) that the stronger federal government it created would be able to trample on the rights of the states and the people. They argued long and hard against its ratification, and offered changes that would have to be made before they would accept it. But in the end, the Constitution was adopted, on the condition that the Bill of Rights accompany it.

That ratification passed was largely the result of the efforts of Federalists who set out to address each objection. For instance, when Spencer raised an objection to the extent of federal jurisdiction in the North Carolina convention, Maclaine reassured the group that

The powers of Congress are limited and enumerated. We say we have given them those powers, but we do not say we have given them more. We retain all those rights which we have not given away to the general government. [...] It is as plain a thing as can possibly be, that Congress can have no power but what we expressly give them. There is an express clause, which, however disingenuously it has been perverted from its true meaning, clearly demonstrates that they are confined to those powers which we have given them. This clause enables them to make all laws which shall be necessary and proper for carrying into execution the foregoing powers, and all other powers vested by this Constitution in the government of the United States, or any department or officers thereof. This clause specifies that they shall make laws to carry into execution, all the powers vested by this Constitution, consequently they can make no laws to execute any other power.

This gives a clear exegesis of the “necessary and proper” clause. The Constitution was ratified on the basis of that interpretation. It must rest on that interpretation. The words of Maclaine and others like him should be in the ears of every judge and justice, second only to the text of the Constitution itself, unless they're superseded by amendment to that text.

The President

The Constitution doesn't provide any direct way for the people to elect the President. It doesn't even require that the people be indirectly asked who they want the President to be. It leaves it up to the states to decide how to choose their Presidential electors, and in the early days many state legislatures did so without putting the matter to a vote at all. Why wasn't this a bigger issue than it was?

I'll claim it was because the Constitution did not, and still does not, give the President any power, or any discretion, over the peacetime domestic affairs of any citizen. The President is the administrative servant of Congress, charged with ensuring that the laws are “faithfully executed”, and the public face of Congress, charged with dealing with foreign powers on their behalf.

The President has a head of state's power to make treaties — only with the consent of the Senate. The President makes appointments — only with the consent of the Senate. The President commands the armed forces — but only Congress can declare war. The President has the power to veto bills — but Congress may override it.

The President is responsible for the smooth operation of the executive, but the duties of the executive are set by law. The President can't order anyone to do anything except what the law (rooted in the Constitution) requires of them, nor order them not to perform any duty the law requires of them. Any such order would be illegal and contrary to the oath of office.

The Future and the Past

If there's to be any hope of a future for democracy and civil rights in the US, or anywhere in the world, it needs to proceed from the principle that the government and the people alike are bound by the law, and that the “consent of the governed” is mediated by a Constitution which sometimes (even frequently) ties the hands of the majority in order to protect the diverse individual needs of the minority.

Unfortunately there is no political party which embraces such a principle. In neglecting it they put the freedom and the very lives of their constituents at risk. But perhaps the problem lies in the nature of party itself. In 1796, when George Washington wrote those words I quoted above about the Constitution, he was tired and angry at Thomas Jefferson for gathering an opposition. In his “Farewell Address” he writes about the dangers of parties. He wouldn't acknowledge that, in truth, he was the spokesman for the nation's first party, doing his best to shut down its second. But despite that he was surprisingly prescient when he wrote that, if we don't cut them off at the knees, political parties will result in “the alternate domination of one faction over another, sharpened by the spirit of revenge”, and will put into power “cunning, ambitious, and unprincipled men” who will invite corruption, use government for their own ends, and leave our politics open to meddling by foreign powers. You might think he had a crystal ball on 2025, but the truth is that it was an entirely foreseeable outcome well over two centuries ago, when some clever people tried to build machinery to prevent it. Unfortunately the machine, the Constitution, has been plagued by mechanics who never read the manual, and by vicious drivers.

Upon seeing DRINK ME by “J” on o565.com, I remarked on lobste.rs that you can actually go a lot further with this idea — you can make a decent compressor just by hooking the next-token probabilities of an LLM up to an arithmetic coder. This isn't an original idea (it's the basis of Fabrice Bellard's ts_zip), but I wanted to try it myself.

However, it turns out that there are a few too many wrinkles to both LLMs and arithmetic coding for a simple weekend hack, so I decided to do the next best thing. I wrote a compressor that uses any LLM supported by llama.cpp as a pre-processor for zlib. That's not quite as good as arithmetic coding, but I still managed to get some decent results. Just think of it as leaving some room for improvement :)
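To make the "room for improvement" concrete, here is a rough Python sketch of what an ideal arithmetic coder would spend. It relies only on the standard result that coding a symbol of probability p costs about -log2(p) bits. The probabilities below are made up for illustration and do not come from any model, and `ideal_bits` is a hypothetical helper name.

```python
import math

def ideal_bits(probabilities):
    """Total bits an ideal arithmetic coder would spend on a token sequence,
    given the probability the model assigned to each token that occurred."""
    return sum(-math.log2(p) for p in probabilities)

# Hypothetical probabilities for five consecutive tokens (not measured data).
well_predicted = [0.99, 0.98, 0.99, 0.97, 0.99]
poorly_predicted = [0.05, 0.10, 0.02, 0.08, 0.05]

print(f"well predicted:   {ideal_bits(well_predicted):.2f} bits")
print(f"poorly predicted: {ideal_bits(poorly_predicted):.2f} bits")
```

When the model is nearly certain, the whole run costs a fraction of a bit, which is something a byte-oriented scheme can only approximate.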

The models I ended up using were LLaMA-2-7B and LLaMA-3-8B, both quantized with “Q5_K_M”, which means slightly more than 5 bits per coefficient. I also tested Gemma-2B, but didn't get good results, so I left it out.

The code is on GitHub, but it's not intended for any serious use.

How it works

My code takes the recent context (up to 512 LLM tokens' worth of what's already been compressed), passes it to the model, and asks for the 128 most likely candidates for the next token.

If the most likely token is a prefix of the remaining input, I choose that token. Otherwise, I choose the longest of the 128 candidates that is a prefix of the remaining input. In either case, a token is encoded as its index in the list of candidates (the most likely being 0, and the last entry in the list being 127). This is a simple heuristic: we would prefer to use a longer token so that we can code fewer tokens, but it's even better to code a long string of zeroes.

If none of the top 128 tokens is a prefix of the remaining input, then the next character of input is encoded as its Unicode codepoint plus 128.
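The selection rule above can be sketched roughly like this in Python. This is an illustration and not the code from the repository: `choose_symbol` is a hypothetical name, the candidate list is passed in as plain strings instead of coming from llama.cpp, and both the tie-breaking (the more likely candidate wins among equal lengths) and the skipping of empty tokens are my assumptions.

```python
from typing import List, Tuple

NUM_CANDIDATES = 128  # size of the candidate list described above

def choose_symbol(candidates: List[str], remaining: str) -> Tuple[int, str]:
    """Return (integer to emit, text consumed) for one encoding step.

    `candidates` is the ranked next-token list, most likely first, and
    `remaining` is the not-yet-encoded input (must be non-empty).
    """
    candidates = candidates[:NUM_CANDIDATES]

    # Rule 1: take the top prediction whenever it matches the input.
    if candidates and candidates[0] and remaining.startswith(candidates[0]):
        return 0, candidates[0]

    # Rule 2: otherwise take the longest candidate that matches.
    # Assumption: on equal length, the more likely (earlier) one is kept.
    best = -1
    for index, token in enumerate(candidates):
        if token and remaining.startswith(token):
            if best < 0 or len(token) > len(candidates[best]):
                best = index
    if best >= 0:
        return best, candidates[best]

    # Rule 3: no candidate matches, so emit one literal character.
    char = remaining[0]
    return ord(char) + NUM_CANDIDATES, char

print(choose_symbol([" the", " a"], " the cat"))            # top candidate matches
print(choose_symbol([" a", " th", " there"], " there is"))  # longest match wins
print(choose_symbol([" a", " th"], "Zebra"))                # falls back to a literal
```

Calling this in a loop, appending the consumed text to the context each time, would yield the integer sequence that the later stages encode.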

The sequence of integers that comes out of that is encoded using ULEB128. Token indices take up 1 byte each, and literal codepoints take up 2-3 bytes depending on their value.
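For readers who haven't met ULEB128, here is a minimal Python sketch of the standard encoding (7 payload bits per byte, low bits first, high bit set on every byte except the last). The function names are hypothetical and this is not the repository's code. It also shows why indices take 1 byte and literals take 2 or 3.

```python
from typing import List

def uleb128_encode(value: int) -> bytes:
    """Encode one non-negative integer as ULEB128."""
    if value < 0:
        raise ValueError("ULEB128 only encodes non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def uleb128_decode_all(data: bytes) -> List[int]:
    """Decode a concatenation of ULEB128 values."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    if shift:
        raise ValueError("truncated ULEB128 data")
    return values

print(len(uleb128_encode(127)))             # largest token index: 1 byte
print(len(uleb128_encode(ord("A") + 128)))  # ASCII literal: 2 bytes
print(len(uleb128_encode(0x10FFFF + 128)))  # highest codepoint: 3 bytes
```

Any value below 128 fits in a single byte, so a run of correct top predictions becomes a run of zero bytes.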

The ULEB128 bytes are then encoded using zlib in straight “deflate” mode, with no gzip headers.
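In Python's standard `zlib` module, a headerless deflate stream is requested by passing a negative `wbits` value. The sketch below shows that call; the compression level of 9 is my assumption, since the level isn't stated above, and the helper names are hypothetical.

```python
import zlib

def raw_deflate(data: bytes, level: int = 9) -> bytes:
    """Compress to a raw deflate stream (no zlib or gzip header, no checksum)."""
    # Negative wbits selects raw deflate with a 32 KiB window.
    compressor = zlib.compressobj(level, zlib.DEFLATED, -zlib.MAX_WBITS)
    return compressor.compress(data) + compressor.flush()

def raw_inflate(data: bytes) -> bytes:
    """Inverse of raw_deflate."""
    return zlib.decompress(data, -zlib.MAX_WBITS)

# A stream dominated by zero bytes, like a run of correct top predictions.
stream = b"\x00" * 1000 + b"\x05"
packed = raw_deflate(stream)
print(len(stream), "->", len(packed), "bytes")
```

Skipping the wrapper only saves a handful of bytes per file, but that matters when the whole output is a couple of hundred bytes.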

Decompression

Yes, it's really possible to reverse this (as long as the LLM is run deterministically). First you zlib decode, then ULEB128 decode. Then, for each value: if it's 128 or above, you subtract 128 and output the corresponding character; if it's less than 128, you run the LLM with the current context and output the token at that index in its list of candidates. Either way, the output is appended to the context before the next value is decoded.
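The decoding loop might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the repository's code: `decode` and `toy_candidates` are hypothetical names, the toy function stands in for a deterministic LLM call, and the full decoded text is passed as context, whereas a real decoder would have to apply exactly the same context limit as the encoder.

```python
import zlib
from typing import Callable, List

def decode(blob: bytes, candidates_for: Callable[[str], List[str]]) -> str:
    """Reverse the pipeline: raw inflate, ULEB128 decode, then rebuild the text.

    `candidates_for(context)` must return the same ranked candidate list
    that the encoder saw for that context.
    """
    raw = zlib.decompress(blob, -zlib.MAX_WBITS)  # headerless deflate
    text = ""
    value, shift = 0, 0
    for byte in raw:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:            # continuation bit: value not complete yet
            shift += 7
            continue
        if value >= 128:
            text += chr(value - 128)             # literal character
        else:
            text += candidates_for(text)[value]  # index into candidate list
        value, shift = 0, 0
    return text

def toy_candidates(context: str) -> List[str]:
    """Hypothetical stand-in for the model: always the same ranked list."""
    return ["hello", " world", "!"]

# Values 0, 1, then the literal "?" (63 + 128 = 191, ULEB128 bytes BF 01).
compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
blob = compressor.compress(bytes([0, 1, 0xBF, 0x01])) + compressor.flush()
print(decode(blob, toy_candidates))
```

The candidate lookup runs once per token value and never for literals, so decoding needs as many model calls as there were tokens chosen during compression.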

Results

Although my compressor needs Unicode text input, its output is binary, so all sizes are given in bytes (unlike J's post). “gzip -9” is gzip -9cn, “Brotli” is brotli --best -c, and the last two columns are my code using the two different LLaMA models.

| Name | Info | Original Size | gzip -9 | Brotli | LLaMA-2-7B | LLaMA-3-8B |
| --- | --- | --- | --- | --- | --- | --- |
| Alice in Wonderland Chapter 1 | (must be trimmed slightly differently from J's) | 11,858 | 5,091 (2.33x) | 4,285 (2.77x) | 313 (37.9x) | 203 (58.4x) |
| Alice in Wonderland Chapter 1 | ROT13'd | 11,858 | 5,091 (2.33x) | 4,830 (2.46x) | 4,798 (2.47x) | 4,382 (2.71x) |
| Alice in Wonderland | pg11.txt | 174,355 | 60,907 (2.86x) | 51,603 (3.38x) | 4,404 (39.6x) | 2,945 (59.2x) |
| Alice in Wonderland HTML | pg11-images.html | 192,520 | 63,789 (3.02x) | 54,008 (3.56x) | 7,181 (26.8x) | 5,049 (38.3x) |
| The Short Victorious War Chapter 1 | by David Weber | 18,821 | 8,266 (2.28x) | 6,953 (2.71x) | 3,530 (5.33x) | 3,486 (5.40x) |
| GPL v2 | from /usr/share/common-licenses | 18,092 | 6,824 (2.65x) | 5,289 (3.42x) | 197 (91.8x) | 139 (131x) |
| clippings.go | Some $WORK code that I'm sure isn't verbatim in the corpus | 18,680 | 6,399 (2.92x) | 5,574 (3.35x) | 2,766 (6.75x) | 2,697 (6.93x) |
| packager_ingress.yaml | Kubernetes manifest, with embedded HAproxy config | 5,019 | 1,736 (2.89x) | 1,493 (3.36x) | 964 (5.21x) | 871 (5.76x) |
| ig_rz.dat | A Fortran data file containing 65 years of solar indices | 9,682 | 3,503 (2.76x) | 2,721 (3.56x) | 2,184 (4.43x) | 2,201 (4.40x) |

O'RLY?

And just for fun, a small image (100x100, low color)

| PNGCrush | XPM | XPM gzipped | XPM brotli | XPM Llama 2 | XPM Llama 3 |
| --- | --- | --- | --- | --- | --- |
| 894 | 10,807 | 797 | 664 | 801 | 730 |

Discussion

LLaMA 3 outperforms LLaMA 2 in all but one case. Most text compresses around 2x as well as gzip, and at least 1.5x as well as Brotli, but some examples (like Alice in Wonderland and the GPL) get an extremely high ratio. We can assume that they were in LLaMA's training corpus, and very well absorbed, so the model predicts them correctly almost 100% of the time, which gives gzip long strings of zeroes to compress.

ROT13ing Alice in Wonderland is illustrative. Because gzip has no prior, ROT13'd text compresses exactly as well as the original. But ROT13 sets Brotli back to being only slightly better than gzip, because its dictionary is ineffective, and it sets LLaMA back to being only slightly better than Brotli, because the tokenizer doesn't find words it knows, and so the “language model” has nothing with which to make predictions.
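The gzip half of that experiment is easy to reproduce with Python's standard library. The sketch below is an illustration, not the script used for the table: it uses only the opening sentence of Alice in Wonderland as the sample, and `deflate_size` is a hypothetical helper. On an input this small the two sizes should come out equal or within a few bytes of each other, since ROT13 only relabels the letters.

```python
import codecs
import zlib

def deflate_size(text: str) -> int:
    """Size in bytes of `text` after raw deflate at the highest level."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, -zlib.MAX_WBITS)
    data = text.encode("utf-8")
    return len(compressor.compress(data) + compressor.flush())

sample = (
    "Alice was beginning to get very tired of sitting by her sister on the "
    "bank, and of having nothing to do: once or twice she had peeped into "
    "the book her sister was reading, but it had no pictures or "
    "conversations in it, \"and what is the use of a book,\" thought Alice "
    "\"without pictures or conversations?\""
)
scrambled = codecs.encode(sample, "rot13")  # letter substitution only

print("original:", len(sample), "bytes")
print("deflate, plain:  ", deflate_size(sample))
print("deflate, ROT13'd:", deflate_size(scrambled))
```

A compressor with a built-in prior, whether Brotli's dictionary or an LLM, has no equivalent trick, because the scrambled text no longer matches anything it learned.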

The image example was just for fun... it manages to get into the same ballpark as PNG (PNG is only bigger because, even after pngcrush, it's more structured/metadata-laden), but that's because I used a very small image with a horizontal width that fits within the LLM's context window. On anything more substantial, PNG would win.

I had to use a limited context (only 512 tokens into the past) due to limitations of the HTTP API to llama.cpp that I was using. There's an input caching mechanism that's supposed to help with that: it recognizes if it's already processed a prefix of what you gave it, restores the model state from the prefix, and then only parses the new data. But that cache seemed to have collisions that prevented the data from round-tripping, so I had to turn it off, which meant I had to limit the context to keep the speed bearable. Bearable in this case means something like 15-35 bytes per second, depending on how well the input tokenizes.

If I were using the library directly instead of the HTTP interface, I could solve that problem and get more speed and better performance at the same time, but I've sunk enough hours into this silly thing already!