I'm looking into using compression as a way to measure the relation of a document to a corpus of documents. In doing so I've found a strange result when using bzip2; len(compress(corpus)) > len(compress(corpus + new_document)). Should this be the case with a practical compression algorithm and is this th开发者_开发知识库eoretically possible when looking at the Kolmogorov complexity of data? (the idea is to use a compression algorithm to approximate the complexity of the data)
Yes, it should be the case with a practical compression algorithm, and is theoretically possible with Kolmogorov complexity. The easiest way to explain why is with an example.
Suppose the following:
- your document separator character is
,
- corpus is documents abc,def,abc,def,abc,def,abc,
- new document is def
- your compression algorithm (or kolmogorov description language) just allows repetition by prefixing with a repeat count followed by
|
(this is a variant of run-length encoding)
Then:
- compress(corpus) = "3|abc,def,1|abc"
- compress(corpus+new_document) = "4|abc,def,"
So compress(corpus)
is longer than compress(corpus+new_document)
. It's a bit contrived, but hopefully explains how the result could theoretically appear with a simple scheme. I'm not saying this is what happens with bzip2, just showing how it is theoretically possible.
Edit It has been mentioned in another answer that run-length encoding is not Turing complete and so cannot be used for Kolmogorov complexity. While this is true, using a Turing language you can implement an encoding of runlength in whatever description language you choose to use, with the same result, so the example still holds valid.
Real-life compression algorithms have quirks like this but they only provide for a very crude approximation anyway.
And as for whether it can happen in theory, probably, but the difference isn't significant.
Let's assume you've got two strings, x and y, where x is a prefix of y. Let's say for example that
x = "asdfasdfasdfasdfasdfasdfasdfasdfasdf"
y = "asdfasdfasdfasdfasdfasdfasdfasdfasdf23452345234523452344523452452345234524345234"
Let's assume furthermore that D is the shortest description of y. (I.e. K(y) = |D|)
In this case, x can be described as |"the number described by D minus 46 characters"|, which is longer than D but only by a small constant and a logarithmic factor (the number of characters in the rest of the instructions basically).
There might even be a shorter description of x but we know that at worst, K(x) <= K(y)+log(|y|-|x|)
However, you have to bear in mind that theoretical Kolmogorov complexity is incomputable and constant differences mean nothing in this field.
(N.b.: The RLE example above isn't valid as RLE isn't a Turing complete language, therefore it can't be used as a description language for Kolmogorov complexity.)
精彩评论