I'm looking for a compression algorithm that works with symbols smaller than a byte. I did some quick research on compression algorithms, but it's proving hard to find out what symbol size they use. In any case, streams with symbols smaller than 8 bits do exist. Is there a parameter for DEFLATE that defines the size of its symbols?
plaintext symbols smaller than a byte
The original descriptions of LZ77 and LZ78 describe them in terms of a sequence of decimal digits (symbols that are approximately half the size of a byte).
If you google for "DNA compression algorithm", you can find a bunch of information on algorithms specialized for compressing files that consist almost entirely of the 4 letters A, G, C, T: an alphabet of 4 symbols, each carrying about a quarter of the information of a byte. Perhaps one of those algorithms might work for you with relatively little tweaking.
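To illustrate how small that alphabet is, here is a minimal sketch (my own, not taken from any particular DNA compressor) that packs the 4 bases into 2 bits each, a fixed-length code one quarter the size of one-byte-per-letter text:

```python
# 2-bit fixed-length code for the 4 DNA bases (illustrative mapping)
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq: str) -> bytes:
    """Pack a string of A/C/G/T into 2 bits per base, 4 bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for ch in group:
            byte = (byte << 2) | CODE[ch]
        byte <<= 2 * (4 - len(group))  # left-align a final partial group
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from packed data."""
    letters = "ACGT"
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            seq.append(letters[(byte >> shift) & 0b11])
    return "".join(seq[:n])
```

Real DNA compressors go much further, exploiting repeats and statistical structure, but even this naive packing already achieves 4:1 over plain ASCII.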
The LZ77-style compression used in LZMA may appear to use two bytes per symbol for the first few symbols that it compresses. But after compressing a few hundred plaintext symbols (the letters of natural-language text, or sequences of decimal digits, or sequences of the 4 letters that represent DNA bases, etc.), the two-byte compressed "chunks" that LZMA puts out often represent a dozen or more plaintext symbols. (I suspect the same is true for all similar algorithms, such as the LZ77 algorithm used in DEFLATE).
If your files use only a restricted alphabet, far fewer than all 256 possible byte values, then in principle a programmer could adapt a variant of DEFLATE (or some other algorithm) to exploit that alphabet and produce compressed files a few bits smaller than the same files compressed with standard DEFLATE. However, many byte-oriented text compression algorithms (LZ77, LZW, LZMA, DEFLATE, etc.) build a dictionary of common long strings, and for a sufficiently large source file they may come within a few percent of that custom-adapted variant. Often the advantage of using a standard compressed file format is worth sacrificing a few percent of potential space savings.
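You can see this effect with Python's zlib module (a DEFLATE implementation): even with no knowledge of the alphabet, it gets close to the 2-bits-per-symbol floor on random 4-letter data. The exact ratio will vary; the numbers here are only illustrative.

```python
import random
import zlib

random.seed(0)
# 100 kB of uniformly random 4-letter "DNA", one byte per symbol
text = "".join(random.choice("ACGT") for _ in range(100_000)).encode()

compressed = zlib.compress(text, level=9)
# With only 4 distinct byte values, DEFLATE's Huffman stage spends
# roughly 2 bits per literal, so the ratio approaches 0.25.
print(f"ratio = {len(compressed) / len(text):.3f}")
```

On data with actual repeats (real genomes, natural-language text), the LZ77 matching stage kicks in as well and the ratio drops further.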
compressed symbols smaller than a byte
Many compression algorithms, including some that give the best known compression on benchmark files, output compressed information bit-by-bit (such as most of the PAQ series of compressors, and some kinds of arithmetic coders), while others output variable-length compressed information without regard for byte boundaries (such as Huffman compression).
Some ways of describing arithmetic coding talk about pieces of information, such as individual bits or pixels, that are compressed to "less than one bit of information".
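The arithmetic behind that claim is just the Shannon information content: a symbol with probability p carries -log2(p) bits, which is less than one bit whenever p > 1/2. An ideal arithmetic coder spends roughly that much on each symbol. A quick check:

```python
import math

# Ideal code length, in bits, for a symbol of probability p
for p in (0.5, 0.9, 0.99):
    print(f"p = {p}: {-math.log2(p):.4f} bits")
```

So a highly predictable bit or pixel (say, p = 0.99) costs only about 0.0145 bits on average, which is how arithmetic coders squeeze many symbols into a single output byte.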
EDIT: The "counting argument" explains why it's not possible to compress all possible bytes, much less all possible bytes plus a few common sequences of bytes, into codewords that are all less than 8 bits long. Nevertheless, several compression algorithms can and often do represent some bytes or (more rarely) some sequences of bytes with codewords less than 8 bits long, by "sacrificing" or "escaping" less-common bytes, which end up represented by codewords that (including the "escape") are more than 8 bits long.
Such algorithms include:
- Pike's "Text compression using 4 bit coding"
- byte-oriented Huffman coding
- several combination algorithms that do LZ77-like parsing of the file into "symbols", where each symbol represents a sequence of bytes, and then Huffman-compress those symbols -- such as DEFLATE, LZX, LZH, LZHAM, etc.
The Pike algorithm uses the 4 bits "0101" to represent 'e' (or in some contexts 'E'), the 8 bits "0000 0001" to represent the word " the" (4 bytes, including the space before it) (or in some contexts " The" or " THE"), etc. It has a small dictionary of about 200 of the most-frequent English words, including a sub-dictionary of 16 extremely common English words.
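A toy version of that idea (my own sketch, not Pike's actual tables): give the 15 most common symbols a 4-bit code, and reserve one nibble value as an escape that is followed by a literal byte.

```python
# Hypothetical nibble code: 15 common symbols get 4 bits each,
# nibble 0xF escapes to a full literal byte (12 bits total).
COMMON = " etaoinshrdlcum"  # 15 entries, illustrative ordering
ESCAPE = 0xF

def encode(text: str) -> list[int]:
    nibbles = []
    for ch in text:
        idx = COMMON.find(ch)
        if idx >= 0:
            nibbles.append(idx)                    # 4 bits
        else:
            b = ord(ch)
            nibbles += [ESCAPE, b >> 4, b & 0xF]   # escape + literal byte
    return nibbles

def decode(nibbles: list[int]) -> str:
    out, i = [], 0
    while i < len(nibbles):
        n = nibbles[i]
        if n == ESCAPE:
            out.append(chr((nibbles[i + 1] << 4) | nibbles[i + 2]))
            i += 3
        else:
            out.append(COMMON[n])
            i += 1
    return "".join(out)
```

Common symbols cost 4 bits; escaped ones cost 12. That trade only pays off because, in English text, escapes are rare.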
When compressing English text with byte-oriented Huffman coding, the sequence "e " (e followed by a space) typically compresses to two codewords totaling about 6 bits.
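To make that concrete, here is a small Huffman-code builder over made-up English-like letter weights (the numbers are illustrative, not measured from any corpus); the frequent symbols come out with codewords well under 8 bits:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code (symbol -> bit string) from a {symbol: weight} map."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(w, next(tie), {s: ""}) for s, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in a.items()}
        merged.update({s: "1" + c for s, c in b.items()})
        heapq.heappush(heap, (w1 + w2, next(tie), merged))
    return heap[0][2]

# Illustrative weights, not from any real corpus
freqs = {" ": 18, "e": 12, "t": 9, "a": 8, "o": 7, "n": 7, "i": 7,
         "s": 6, "h": 6, "r": 5, "d": 4, "l": 4, "u": 3, "z": 1}
codes = huffman_codes(freqs)
print(codes["e"], codes[" "])  # two short codewords for "e "
```

Note that the exact codewords depend entirely on the weights, which is precisely the point made in the next paragraph: they change from file to file.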
Alas, when Huffman coding is involved, I can't tell you the exact size of those "small" codewords, or even tell you exactly what byte or byte sequence a small codeword represents, because it is different for every file. Often the same codeword represents a different byte (or different byte sequence) at different locations in the same file. The decoder decides which byte or byte sequence a codeword represents based on clues left behind by the compressor in the headers, and on the data decompressed so far. With range coding or arithmetic coding, the "codeword" may not even be an integer number of bits.
You may also want to look into Golomb codes. A Golomb code splits each input value into a quotient and a remainder relative to a tunable parameter M, encoding the quotient in unary and the remainder in binary. It's not a dictionary compressor, but it's worth mentioning.
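A minimal sketch of the encoder (restricted to power-of-two M, which makes it a Rice code):

```python
def golomb_encode(n: int, m: int) -> str:
    """Golomb/Rice-encode non-negative n with power-of-two parameter m.

    Quotient n // m goes out in unary ("1"s terminated by "0"),
    remainder n % m in log2(m) binary bits.
    """
    assert n >= 0 and m > 0 and (m & (m - 1)) == 0
    q, r = divmod(n, m)
    rbits = m.bit_length() - 1  # log2(m) bits for the remainder
    rem = format(r, f"0{rbits}b") if rbits else ""
    return "1" * q + "0" + rem
```

Golomb codes shine when the input values follow a geometric distribution (many small values, few large ones); M is tuned to match that distribution.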