
Is Terra Compression possible? If so, please explain and provide samples


Perhaps a long ASCII text string can be crushed and compressed into a hash-like ASCII "checksum" by a sophisticated mathematical formula/algorithm, just like air can be compressed.

The idea is to compress megabytes of ASCII text into 128 or so bytes by shuffling, then mixing new "patterns" of single "bytes", turn by turn from the first to the last. When decompressing, the last character is extracted first, and then decompression continues using the formula and the sequential keys from the last to the first. The sequential keys, the first and last bytes, the fully updated final compiled string, and the total number of bytes that were compressed must all be exactly known.

This is the terra compression I was thinking of. Is this possible? Can you explain it with examples? I am working on this theory, and it is my own thought.


In general? Absolutely not.

For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not on the order of several megabytes into 128 bytes.
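As a rough sketch of that gap, using Python's standard zlib module (exact byte counts will vary with the zlib version and settings; the orders of magnitude are the point):

    import os
    import zlib

    spaces = b" " * 1_000_000            # a megabyte of nothing but spaces
    random_data = os.urandom(1_000_000)  # a megabyte with no exploitable patterns

    print(len(zlib.compress(spaces)))       # tiny -- on the order of a kilobyte
    print(len(zlib.compress(random_data)))  # about 1,000,000 bytes: no real saving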

Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeonhole principle...
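A quick back-of-the-envelope check of that counting argument (the 1 MB input size is just for illustration):

    # Every possible 1 MB file versus every possible 128-byte "compressed" file.
    inputs = 256 ** 1_000_000   # number of distinct 1 MB byte strings
    outputs = 256 ** 128        # number of distinct 128-byte strings

    print(inputs > outputs)      # True: far more inputs than outputs
    print(inputs.bit_length())   # ~8,000,000 bits of freedom in the input
    print(outputs.bit_length())  # ~1,024 bits of freedom in the output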


If you have n possible input strings and m possible compressed strings, and m is less than n, then at least two input strings must map to the same compressed string. This is called the pigeonhole principle, and it is the fundamental reason why there is a limit on how much you can compress data.

What you are describing is more like a hash function. Many hash functions are designed so that, given the hash of a string, it is extremely unlikely that you can find another string that gives the same hash. But there is no way, given only a hash, to discover the original string. Even if you were able to reverse the hashing operation and produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash, and you wouldn't know which of them is the "correct" one.
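To make that concrete, here is a short sketch with Python's standard hashlib module: the digest has the same fixed size no matter how large the input is, so information is necessarily thrown away.

    import hashlib

    short_input = b"hello"
    long_input = b"x" * 10_000_000  # ten megabytes of input

    # Both digests are exactly 32 bytes (64 hex characters).
    print(hashlib.sha256(short_input).hexdigest())
    print(hashlib.sha256(long_input).hexdigest())

    # There is no "sha256_decompress": countless different inputs share each
    # digest, so the original text cannot be recovered from the hash alone.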


Information theory is the scientific field that addresses questions of this kind. It also lets you calculate the minimum number of bits needed to store a losslessly compressed message. This lower bound is known as the entropy of the message.

The entropy of a piece of text can be estimated using a Markov model. Such a model uses information about how likely a given sequence of characters of the alphabet is.
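As a simplified sketch (a zeroth-order model that only looks at single-character frequencies; a real Markov model would condition on the preceding characters and usually gives a tighter bound):

    from collections import Counter
    from math import log2

    def entropy_bits_per_char(text: str) -> float:
        """Shannon entropy of the single-character distribution (order-0 model)."""
        counts = Counter(text)
        total = len(text)
        return -sum((n / total) * log2(n / total) for n in counts.values())

    sample = "the quick brown fox jumps over the lazy dog " * 1000
    h = entropy_bits_per_char(sample)
    print(f"{h:.2f} bits/char under this model "
          f"(~{h * len(sample) / 8:.0f} bytes if characters were independent)")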


The air analogy is very wrong.

When you compress air you make the molecules come closer to each other; each molecule is given less space.

When you compress data you cannot make the bits smaller (unless you put your hard drive in a hydraulic press). The closest you can get to actually making bits smaller is increasing the bandwidth of a network, but that is not compression.

Compression is about finding a reversible formula for calculating the data. The "rules" of data compression go something like this:

  • The algorithm (including any standard start dictionaries) is shared beforehand and not included in the compressed data.
  • All startup parameters must be included in the compressed data, including:
    • Choice of algorithmic variant
    • Choice of dictionaries
    • All compressed data
  • The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).

To get a feeling for how compression works, you may study some examples, like run-length encoding and Lempel-Ziv-Welch.
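For instance, a minimal run-length encoder/decoder sketch in Python (it wins on long runs of repeated characters and loses on varied text, which is exactly why different domains call for different algorithms):

    def rle_encode(data: str) -> list[tuple[str, int]]:
        """Store each character together with the length of its run."""
        runs: list[tuple[str, int]] = []
        for ch in data:
            if runs and runs[-1][0] == ch:
                runs[-1] = (ch, runs[-1][1] + 1)
            else:
                runs.append((ch, 1))
        return runs

    def rle_decode(runs: list[tuple[str, int]]) -> str:
        """Reverse the encoding exactly -- compression has to be lossless."""
        return "".join(ch * count for ch, count in runs)

    text = "aaaabbbccd"
    encoded = rle_encode(text)
    print(encoded)                      # [('a', 4), ('b', 3), ('c', 2), ('d', 1)]
    assert rle_decode(encoded) == text  # round-trips without any loss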


You may be thinking of fractal compression, which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.

This allows for high compression but is lossy (the output is close to the input but not exactly the same), and compression can be very slow. Even so, ratios of about 170:1 are the highest achieved at the moment.


This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.

Seriously, anyone who claims to have a magical compression algorithm that reduces any megabyte-sized text file to a few hundred bytes is either:

  • a scammer or click-baiter,
  • someone who doesn't understand basic information theory, or
  • both.


You can compress text to a certain degree because it doesn't use all of the available bit patterns (e.g. a-z and A-Z make up only 52 of the 256 possible byte values). Repeating patterns allow some intelligent storage (zip).
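As a rough illustration of the first point: if every character is one of just 52 letters, a fixed-width code already needs fewer than 8 bits per character, before any pattern-based tricks.

    from math import ceil, log2

    # 52 possible letters carry at most log2(52) bits of information each.
    print(log2(52))        # ~5.70 bits per character
    print(ceil(log2(52)))  # 6 bits suffice for a fixed-width code, instead of 8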

There is no way to store arbitrarily large chunks of text in any fixed number of bytes.

You can compress air, but you won't remove its molecules! Its mass stays the same.

