开发者

Bounds on practical compression and Kolmogorov complexity

开发者 https://www.devze.com 2023-02-03 20:00 出处:网络
I\'m looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I\'ve found a strange result when using bzip2; len(compress(corpus)) > len(comp

I'm looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I've found a strange result when using bzip2; len(compress(corpus)) > len(compress(corpus + new_document)). Should this be the case with a practical compression algorithm and is this th开发者_开发知识库eoretically possible when looking at the Kolmogorov complexity of data? (the idea is to use a compression algorithm to approximate the complexity of the data)


Yes, it should be the case with a practical compression algorithm, and is theoretically possible with Kolmogorov complexity. The easiest way to explain why is with an example.

Suppose the following:

  • your document separator character is ,
  • corpus is documents abc,def,abc,def,abc,def,abc,
  • new document is def
  • your compression algorithm (or kolmogorov description language) just allows repetition by prefixing with a repeat count followed by | (this is a variant of run-length encoding)

Then:

  • compress(corpus) = "3|abc,def,1|abc"
  • compress(corpus+new_document) = "4|abc,def,"

So compress(corpus) is longer than compress(corpus+new_document). It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.

Edit It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, using a Turing language you can implement an encoding of runlength in whatever description language you choose to use, with the same result, so the example still holds valid.


Real-life compression algorithms have quirks like this but they only provide for a very crude approximation anyway.

And as for whether it can happen in theory, probably, but the difference isn't significant.

Let's assume you've got two strings, x and y, where x is a prefix of y. Let's say for example that

x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"

y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"

Let's assume furthermore that D is the shortest description of y. (I.e. K(y) = |D|)

In this case, x can be described as |"the number described by D minus 46 characters"|, which is longer than D but only by a small constant and a logarithmic factor (the number of characters in the rest of the instructions basically).

There might even be a shorter description of x but we know that at worst, K(x) <= K(y)+log(|y|-|x|)

However, you have to bear in mind that theoretical Kolmogorov complexity is incomputable and constant differences mean nothing in this field.

(N.b.: The RLE example above isn't valid as RLE isn't a Turing complete language, therefore it can't be used as a description language for Kolmogorov complexity.)

0

精彩评论

暂无评论...
验证码 换一张
取 消