Is the theoretical limit to compression basically what Kolmogorov complexity is, or is there another Shannon/information-theory theorem that defines the maximum possible lossless compression in terms of symbols over a channel, where the channel represents the algorithm? ("Compression theorem" refers to something else more abstract, I think.)
Intuitively it seems like there would have to be one for lossless compression, and either a proof that a corresponding theorem for the limits of lossy compression is impossible, or that it's an open/hard problem.
Yes, the Kolmogorov complexity of a string X is the minimum length of a program/string that, when fed into a universal computer, will yield back the string X. The Kolmogorov complexity of a string is also undecidable - computing it is essentially equivalent to the halting problem.
"Best lossy compression" is ambiguous - how much loss is acceptable isn't defined in the phrase.
The theorem you are looking for is Shannon's Source Coding Theorem[1]. It states that no encoding scheme can losslessly compress data from a source below that source's Shannon entropy (in expectation, per symbol).
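To make that concrete, here's a rough sketch of mine (not from the theorem statement itself) that estimates a message's entropy under a simple i.i.d.-bytes model; that figure lower-bounds the average bits per symbol any lossless symbol code can achieve for that source:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(data: bytes) -> float:
    """Empirical Shannon entropy, treating each byte as an i.i.d. symbol."""
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

msg = b"aaaaabbbccd" * 100
h = entropy_bits_per_symbol(msg)
print(f"~{h:.2f} bits/symbol -> at best ~{h * len(msg) / 8:.0f} bytes lossless (vs {len(msg)} raw)")
```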
Amazing, thank you, it had to be something like that. It makes much more sense as an encoding limitation than as something as abstract as the "shortest possible program" via Kolmogorov complexity. I was wondering whether KC might just be an inconsistently defined thought experiment compared to concrete ideas from information theory, and, if it's not, how much it necessarily has in common with the coding theorem, since it seems like they're talking about the same thing.
What you're after is "minimum description length": choose some representation, e.g. gzip, and try to find the shortest encoding of our input in that representation.
The simplest representation is just raw data. There's only one way to represent our input in this format: just write it out verbatim. Hence its minimum description length is identical to the input size.
A more powerful representation is run-length encoding, where runs of identical symbols can be stored as the symbol and the number of repetitions; everything else is stored raw. Note that there are multiple ways to represent an input in this format: e.g. "50 zeros" is the same as "25 zeros 25 zeros", "00000 45 zeros", etc. yet the latter aren't minimum descriptions.
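For illustration, here's a minimal run-length coder along those lines (the `(symbol, count)` token format is just my choice here):

```python
from itertools import groupby

def rle_encode(s: str) -> list[tuple[str, int]]:
    # Each maximal run of identical symbols becomes one (symbol, count) pair.
    return [(ch, sum(1 for _ in run)) for ch, run in groupby(s)]

def rle_decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * count for ch, count in pairs)

encoded = rle_encode("0" * 50 + "abc")
print(encoded)  # [('0', 50), ('a', 1), ('b', 1), ('c', 1)]
assert rle_decode(encoded) == "0" * 50 + "abc"
```

Because it only ever emits maximal runs, this particular encoder happens to produce a minimum description in this token format, unlike the "25 zeros 25 zeros" split.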
A more powerful representation is to use back-references: this lets us compress runs of multiple repeated symbols (run-length encoding only handles runs of a single repeated symbol), and re-use commonly-occurring 'phrases'. This is essentially how LZ works. Again, there are multiple ways to represent an input in this format; finding a minimum description is incredibly expensive; most compressors don't bother, and instead have a configurable 'effort' (e.g. 1 to 9).
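A toy back-reference scheme, roughly in the LZ77 spirit (the greedy brute-force matcher and the minimum match length of 3 are my simplifications; real compressors use hash chains and effort levels instead):

```python
def lz_encode(s: str, window: int = 255):
    """Greedily emit literals or (distance, length) back-references."""
    i, tokens = 0, []
    while i < len(s):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(s) and s[j + length] == s[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= 3:  # only worth a token for longer matches
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", s[i]))
            i += 1
    return tokens

def lz_decode(tokens) -> str:
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:  # references may overlap their own output, which covers runs
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return "".join(out)

text = "abcabcabcabcXYZ"
assert lz_decode(lz_encode(text)) == text
```

Note that the greedy matcher doesn't guarantee a minimum description in this format, which is exactly the point about how expensive that search is.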
Adding more power to a representation makes it able to compress more complex patterns, but also makes it harder to find a minimum description. The most powerful representation is a Universal Turing Machine, which can represent any computable pattern; its minimum description length is the Kolmogorov complexity, and finding it is uncomputable in general.
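And although the Kolmogorov complexity itself is uncomputable, any concrete compressor gives a computable upper bound on it (up to the constant-size decompressor); a quick sketch using zlib as the stand-in representation:

```python
import os
import zlib

def description_length_upper_bound(data: bytes) -> int:
    # Compressed length (plus a constant for the decompressor) upper-bounds
    # the Kolmogorov complexity of `data`.
    return len(zlib.compress(data, 9))

patterned = bytes(i % 7 for i in range(10_000))
random_ish = os.urandom(10_000)
print(description_length_upper_bound(patterned))   # small: the pattern is simple
print(description_length_upper_bound(random_ish))  # near 10_000: no structure found
```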
Note that both Shannon information and algorithmic information (i.e. Kolmogorov complexity) can be gamed if we focus on a single message; for example, I can define an encoding like this (a toy sketch follows the list):
- To encode a snapshot of Wikipedia as of 2022-06-30T14:00:00Z, output a single `1`
- To encode any other data, run it through `zip` and prepend a `0`
- To decode data which begins with a `1`, output a snapshot of Wikipedia as of 2022-06-30T14:00:00Z
- To decode data which begins with a `0`, send everything after that `0` through `unzip`
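Here's that codec as code (with zlib standing in for `zip`, and a placeholder string standing in for the actual snapshot):

```python
import zlib

# Placeholder: imagine the full 2022-06-30T14:00:00Z dump baked into the codec.
WIKIPEDIA_SNAPSHOT = b"...the entire snapshot, hard-coded into the decoder..."

def encode(data: bytes) -> bytes:
    if data == WIKIPEDIA_SNAPSHOT:
        return b"\x01"
    return b"\x00" + zlib.compress(data)

def decode(blob: bytes) -> bytes:
    if blob[:1] == b"\x01":
        return WIKIPEDIA_SNAPSHOT
    return zlib.decompress(blob[1:])

assert decode(encode(WIKIPEDIA_SNAPSHOT)) == WIKIPEDIA_SNAPSHOT  # 1 byte!
assert decode(encode(b"anything else")) == b"anything else"      # zip size + 1 byte
```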
Hence we need to focus on the distribution of possible messages, and the big-O behaviour after many messages have been sent, rather than individual messages themselves.
Shannon information is essentially a frequentist approach, based on counting and expectations. Algorithmic information is essentially Bayesian, where the choice of Universal Turing Machine defines a prior, the bits of a message (AKA a program) are evidence, and decoding that message (AKA running the program) updates that prior.
So rewarding, thank you! Naively, doesn't the expectation in the frequentist approach function as a prior in the Bayesian case? It's like there is a logical homomorphism between the two concepts of KC and SI.
In a sense, Kolmogorov complexity and Shannon entropy can basically be considered equivalent concepts: they both define the minimum amount of information required to fully define a given piece of data.