How to encode an 8-byte block using only digits (numeric characters)?

I need to encode streams of 8-byte blocks such that the encoded stream contains only digits (zero to nine). Is there a standard encoding mechanism for doing this? If there are multiple ways to do it, which one is the most efficient in terms of the length of the encoded string (shorter is better)?

Treat the 8 bytes as a 64-bit unsigned integer, convert it to decimal, and pad it on the left with zeros to 20 digits. That should make for the shortest possible fixed-length string, as it uses all ten digits in every position except the first one.
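As a rough sketch of that in Python: the names `encode_block` and `decode_block` are made up for illustration, and big-endian byte order is an assumption (any order works as long as both sides agree).

```python
def encode_block(block: bytes) -> str:
    """Encode exactly 8 bytes as a 20-digit, zero-padded decimal string."""
    if len(block) != 8:
        raise ValueError("expected exactly 8 bytes")
    # Assumption: big-endian interpretation of the 8 bytes.
    value = int.from_bytes(block, "big")
    return f"{value:020d}"


def decode_block(digits: str) -> bytes:
    """Invert encode_block: 20 decimal digits back to 8 bytes."""
    if len(digits) != 20 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 20 decimal digits")
    value = int(digits)
    if value >= 1 << 64:
        raise ValueError("value does not fit in 64 bits")
    return value.to_bytes(8, "big")
```

Because every block is exactly 20 digits, a stream can be split back into blocks without any separators.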

If your data is not uniformly distributed, there are other alternatives, such as Huffman coding, in which the most common data patterns are represented by shorter strings. One way is to use the first digit to encode the length of the string. Every digit except 1 in the first position can be treated as a length specifier, while a leading 1 marks a full 20-digit number. That way the maximum length of 20 digits will never be exceeded. (The leading digit of a 20-digit value can only be 0 or 1, since the highest 64-bit number is 18,446,744,073,709,551,615.) The exact mapping of the other digits to lengths should be based on the distribution of your patterns. If you have 10 patterns that occur very often, you could, for example, reserve "0" to mean that the single digit that follows represents a complete sequence.

Any such more complicated encoding will however introduce the need for more complex packing/unpacking code and maybe even lookup tables, so it might not be worth the effort.
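To give a feel for how much code the length-prefix idea involves, here is a minimal Python sketch. The `PAYLOAD_LEN` table is purely hypothetical (a real one should be tuned to your data), it encodes plain numeric values rather than looking up frequent patterns, and the function names are invented for illustration.

```python
from typing import List

# Hypothetical mapping: leading digit -> number of payload digits that follow.
# The digit "1" is deliberately absent: it marks a full 20-digit value (>= 10**19).
PAYLOAD_LEN = {"0": 1, "2": 2, "3": 4, "4": 6, "5": 8,
               "6": 10, "7": 13, "8": 16, "9": 19}


def encode_prefixed(value: int) -> str:
    """Encode one 64-bit value in at most 20 digits."""
    if not 0 <= value < 1 << 64:
        raise ValueError("value does not fit in 64 bits")
    if value >= 10**19:
        return str(value)  # exactly 20 digits, always starts with "1"
    text = str(value)
    # The dict is ordered by increasing payload length, so the first fit is the shortest.
    for prefix, width in PAYLOAD_LEN.items():
        if width >= len(text):
            return prefix + text.zfill(width)
    raise AssertionError("unreachable: 19 payload digits cover every value below 10**19")


def decode_prefixed(digits: str) -> List[int]:
    """Decode a concatenated stream produced by encode_prefixed."""
    values, pos = [], 0
    while pos < len(digits):
        lead = digits[pos]
        if lead == "1":
            start, end = pos, pos + 20       # the "1" is part of the number itself
        elif lead in PAYLOAD_LEN:
            start, end = pos + 1, pos + 1 + PAYLOAD_LEN[lead]
        else:
            raise ValueError("unexpected character in stream")
        if end > len(digits):
            raise ValueError("truncated stream")
        value = int(digits[start:end])
        if value >= 1 << 64:
            raise ValueError("value does not fit in 64 bits")
        values.append(value)
        pos = end
    return values
```

With this particular table, values 0 to 9 take two digits and nothing ever takes more than 20.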

The answer to the efficiency question will depend a lot on the typical range of values in the 8-byte blocks. Consider Unicode's UTF-8 and UTF-16. UTF-8 is very efficient for encoding texts written primarily in western scripts, because most characters in those scripts are in the range 0x00 to 0x7F, which UTF-8 can store in a single byte. But it's less efficient for texts written primarily in East Asian scripts, where most characters take three bytes in UTF-8 but only two in UTF-16; UTF-16 is the more compact choice there.

If you read up on the various UTFs, they may inspire a solution. Fundamentally they work by allowing a lot of values to be encoded directly in a byte, but then having a flag (the high-order bit, in the case of UTF-8's first byte) indicating that the byte doesn't tell the whole story and the next byte (or two, or three) is required. The starting point is a byte for UTF-8 and a 16-bit unit for UTF-16, but the concepts are similar.

Now, you're working with a dramatically smaller range of symbols (0-9 rather than 0-255), and obviously I'm not recommending trying to use UTF directly, just the concept. For instance, say most of your values (directly or with some massaging) are less than 9000, quite a few are less than 9000000, and only rare values take you out beyond that. You might take the UTF approach and say that blocks (your 8-byte values) are divided into four-digit segments, and you'll always have at least one segment (four digits) per encoded block. If the first segment's value (aaaa) is between 0000 and 8999 (inclusive), it's a "terminal" segment: that's the actual value. But if it's 9aaa, the leading 9 is a flag meaning there's a second segment, and you should look at aaabbbb (bbbb being the next segment's value). If that value is between 0000000 and 8999999 (inclusive), it's a terminal; but if it's 9aabbbb, it means look at aabbbbcccc (cccc being the next segment); etc. I think that would give us this:

00000000000000000000-00000000000000008999 ->  4 digits (xxxx)
00000000000000009000-00000000000008999999 ->  8 digits (9xxxxxxx)
00000000000009000000-00000000008999999999 -> 12 digits (99xxxxxxxxxx)
00000000009000000000-00000008999999999999 -> 16 digits (999xxxxxxxxxxxxx)
00000009000000000000-00008999999999999999 -> 20 digits (9999xxxxxxxxxxxxxxxx)
00009000000000000000-08999999999999999999 -> 24 digits (99999xxxxxxxxxxxxxxxxxxx)
09000000000000000000-18446744073709551615 -> 28 digits (999999xxxxxxxxxxxxxxxxxxxxxx)
Or special case, just use 26 digits for the last one:  (999999xxxxxxxxxxxxxxxxxxxx)

There your best case is four digits and your worst is 28 or 26, depending on whether you want to special-case the last segment in the block. If most of your values are small, that's (probably) a lot better than using 20 digits for each block.

Now, that's completely off-the-cuff and probably not as efficient as it could be, but you get the idea. It's very easy to deserialize, and probably not that hard to serialize.
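Here's a minimal Python sketch of the segment scheme from the table above, without the 26-digit special case. The function names are made up, and instead of reading one segment at a time the decoder simply counts the leading 9s, which comes to the same thing because the payload that follows the flags never starts with a 9.

```python
from typing import List

MAX_VALUE = (1 << 64) - 1


def encode_segmented(value: int) -> str:
    """Encode one 64-bit value as 4 to 28 digits in four-digit segments."""
    if not 0 <= value <= MAX_VALUE:
        raise ValueError("value does not fit in 64 bits")
    nines = 0
    # With k leading 9s the block has 4*(k+1) digits, leaving 3k+4 payload
    # digits whose first digit must be 0-8, so value < 9 * 10**(3k+3).
    while value >= 9 * 10 ** (3 * nines + 3):
        nines += 1
    return "9" * nines + str(value).zfill(3 * nines + 4)


def decode_segmented(digits: str) -> List[int]:
    """Decode a concatenated stream produced by encode_segmented."""
    values, pos = [], 0
    while pos < len(digits):
        nines = 0
        while digits[pos + nines:pos + nines + 1] == "9":
            nines += 1
        if nines > 6:
            raise ValueError("too many continuation flags")
        end = pos + 4 * (nines + 1)
        if end > len(digits):
            raise ValueError("truncated stream")
        value = int(digits[pos + nines:end])
        if value > MAX_VALUE:
            raise ValueError("value does not fit in 64 bits")
        values.append(value)
        pos = end
    return values
```

For example, 8999 encodes as `8999`, while 9000 needs a second segment and becomes `90009000`.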

You can see why I started with the comment about what your typical values are. If they're typically above 10,000,000,000,000,000,000, the above is not an efficient way to encode them directly. But similar techniques can be used if your typical values are at the high end rather than the low, by massaging the value a bit before encoding.

The shortest result comes from converting the block to decimal directly. The highest value is then 18446744073709551615 (20 digits), but the conversion can be awkward if your language has no unsigned 64-bit or arbitrary-length integer type.

The next longest is to convert it to octal as one chunk. This results in a maximum length of 22, with a value of 1777777777777777777777. This requires only shifts and masks to convert, and can be handled easily enough.
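A minimal Python sketch of the single-chunk octal conversion, using only shifts and masks on the value. The function names and the big-endian byte order are assumptions for illustration.

```python
def encode_octal(block: bytes) -> str:
    """Encode exactly 8 bytes as 22 octal digits."""
    if len(block) != 8:
        raise ValueError("expected exactly 8 bytes")
    value = int.from_bytes(block, "big")  # assumption: big-endian
    digits = []
    for _ in range(22):
        digits.append(str(value & 0b111))  # lowest three bits -> one octal digit
        value >>= 3
    return "".join(reversed(digits))


def decode_octal(digits: str) -> bytes:
    """Invert encode_octal: 22 octal digits back to 8 bytes."""
    if len(digits) != 22 or any(d not in "01234567" for d in digits):
        raise ValueError("expected exactly 22 octal digits")
    if digits[0] > "1":
        # 22 octal digits hold 66 bits; only 0 or 1 can lead a 64-bit value.
        raise ValueError("value does not fit in 64 bits")
    value = 0
    for d in digits:
        value = (value << 3) | int(d)
    return value.to_bytes(8, "big")
```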

The next longest is to convert it to either octal or decimal bytewise. This results in a length of 24, with 8 repetitions of 377 or 255 respectively. Converting back and forth is trivial, and is left as an exercise for the reader.
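For what it's worth, the bytewise decimal variant might look like this in Python (function names invented for illustration; the octal variant is the same with `:03o` and `int(..., 8)`):

```python
def encode_bytewise(block: bytes) -> str:
    """Encode exactly 8 bytes as 24 digits, three decimal digits per byte."""
    if len(block) != 8:
        raise ValueError("expected exactly 8 bytes")
    return "".join(f"{byte:03d}" for byte in block)


def decode_bytewise(digits: str) -> bytes:
    """Invert encode_bytewise: 24 decimal digits back to 8 bytes."""
    if len(digits) != 24 or not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected exactly 24 decimal digits")
    # bytes() raises ValueError if any three-digit group exceeds 255.
    return bytes(int(digits[i:i + 3]) for i in range(0, 24, 3))
```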