开发者

Variable-byte encoding clarification

开发者 https://www.devze.com 2022-12-25 01:34 出处:网络
I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way.

I am very new to the world of byte encoding so please excuse me (and by all means, correct me) if I am using/expressing simple concepts in the wrong way.

I am trying to understand variable-byte encoding. I have read the Wikipedia article (http://en.wikipedia.org/wiki/Variable-width_encoding) as well as a book chapter fr开发者_JS百科om an Information Retrieval textbook. I think I understand how to encode a decimal integer. For example, if I wanted to provide variable-byte encoding for the integer 60, I would have the following result:

1 0 1 1 1 1 0 0

(please let me know if the above is incorrect). If I understand the scheme, then I'm not completely sure how the information is compressed. Is it because usually we would use 32 bits to represent an integer, so that representing 60 would result in 1 1 1 1 0 0 preceded by 26 zeros, thus wasting that space as opposed to representing it with just 8 bits instead?

Thank you in advance for the clarifications.


The way you do it is by reserving one of the bits to mean "I'm not done with the value." Usually, that's the most significant bit.

When you read a byte, you process the lower 7 bits. If the most significant bit is 1, then you know there's one more byte to read, and you repeat the process, adding the next 7 bits to the current 7 bits.

The MIDI format uses that exact encoding to represent lengths of MIDI events, in the following manner:

  1. ExpectedValue = 0
  2. byte=ReadFromFile
  3. ExpectedValue = ExpectedValue + (byte AND 0x7f)
  4. if byte > 127 then
    1. ExpectedValue = ExpectedValue SHL 7
    2. Goto 2
  5. Done

For example, the value 0x80 would be represented using the bytes 0x81 0x00. You can try running the algorithm on those two bytes, and you see you'll get the right value.

UTF-8 works similarly, but it uses a slightly more complex scheme to tell you how many bytes you should be expecting. This allows for some error correction, since you can easily tell if the bytes you're getting match the length claimed. Wikipedia describes their structure quite well.


You hit the nail on the head.

There are many encoding schemes, such as gamma and delta, which are special cases of elias coding. These are bit-level codes, as opposed to the byte-level code you used, and are useful when you have a strong skew towards small numbers (which can often be achieved by encoding deltas instead of absolute values).

Bit-level encoding schemes are much more difficult to implement than byte-level schemes and the additional CPU burden may outweigh the time saved by having less data to read, though most modern CPUs have "highest-bit" and "lowest-bit" instructions that dramatically improve the performance of bit-level codecs. As CPU speeds continue to outpace RAM speeds, bit-level schemes will become more attractive, though the simplicity of byte-level codecs is a big factor too.


Yes, you are right, you save space by encoding using one byte instead of 4. Generally, you will save memory if the values you are encoding are much smaller than the maximum value that would have fit in your original fixed-width encoding.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号