What's the best way to compress or encode a list of numbers of arbitrary length and sizes into a single alphanumeric string?
The goal is to be able to convert something like 1,5,8,3,20,212,42 into something like a8D1jN to be used in a URL, and then ba开发者_开发技巧ck to 1,5,8,3,20,212,42.
For the resulting string I'm fine with any number and any ASCII letter, lowercase and uppercase, so: 0-9a-zA-Z. I prefer not to have any punctuation whatsoever.
If you consider your list as a string, then you have 11 different characters to encode (0-9 and comma). This can be expressed in 4 bits. If you were willing to add, say $ and ! to your list of acceptable characters, then you would have 64 different output characters, and thus be able to encode 6 bits per character.
This would mean that you could map the string to an encoded string that would be about 30% shorter than the original one, and fairly obfuscated and random looking.
This way you could transcode the number series [1,5,8,3,20,212,42] to the string "gLQfoIcIeQqq".
UPDATE: I felt inspired and wrote a python solution for this solution (not fast but functional enough...)
ZERO = ord('0')
OUTPUT_CHARACTERS = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789$!"
def encode(numberlist):
# convert to string -> '1,5,8,3,20,212,42'
s = str(numberlist).replace(' ','')[1:-1]
# convert to four bit values -> ['0010', '1011', '0110', ... ]
# (add 1 to avoid the '0000' series used for padding later)
four_bit_ints = [0 <= (ord(ch) - ZERO) <= 9 and (ord(ch) - ZERO) + 1 or 11 for ch in s]
four_bits = [bin(x).lstrip('-0b').zfill(4) for x in four_bit_ints]
# make binary string and pad with 0 to align to 6 -> '00101011011010111001101101...'
bin_str = "".join(four_bits)
bin_str = bin_str + '0' * (6 - len(bin_str) % 6)
# split to 6bit blocks and map those to ints
six_bits = [bin_str[x * 6 : x * 6 + 6] for x in range(0, len(bin_str) / 6)]
six_bit_ints = [int(x, 2) for x in six_bits]
# map the 6bit integers to characters
output = "".join([OUTPUT_CHARACTERS[x] for x in six_bit_ints])
return output
def decode(input_str):
# map the input string from characters to 6bit integers, and convert those to bitstrings
six_bit_ints = [OUTPUT_CHARACTERS.index(x) for x in input_str]
six_bits = [bin(x).lstrip('-0b').zfill(6) for x in six_bit_ints]
# join to a single binarystring
bin_str = "".join(six_bits)
# split to four bits groups, and convert those to integers
four_bits = [bin_str[x * 4 : x * 4 + 4] for x in range(0, len(bin_str) / 4)]
four_bit_ints = [int(x, 2) for x in four_bits]
# filter out 0 values (padding)
four_bit_ints = [x for x in four_bit_ints if x > 0]
# convert back to the original characters -> '1',',','5',',','8',',','3',',','2','0',',','2','1','2',',','4','2'
chars = [x < 11 and str(x - 1) or ',' for x in four_bit_ints]
# join, split on ',' convert to int
output = [int(x) for x in "".join(chars).split(',') if x]
return output
if __name__ == "__main__":
# test
for i in range(100):
numbers = range(i)
out = decode(encode(numbers))
assert out == numbers
# test with original series
numbers = [1,5,8,3,20,212,42]
encoded = encode(numbers)
print encoded # prints 'k2UBsZgZi7uW'
print decode(encoded) # prints [1, 5, 8, 3, 20, 212, 42]
You can use an encoding scheme like the Base64
.
Base64 modules or libraries are common in multiple programming languages.
Instead of comma separating the numbers, you could do a simple encoding where you replace the last digit of each number with 'a'+digit. So, your list [1,5,8,3,20,212,42]
would become mysterious looking bfid2a21c4c
. :)
I would use something like this only if there are a handful of numbers, where compression won't be able to shorten the string a lot. If it s a lot of numbers we are talking about, you could try to perform some sort of compression + base64 encoding on the data instead.
Depends on the range of the numbers -- with a reasonable range a simple dictionary compression scheme could work.
Given your edit and estimate of 10k rows, a dictionary scheme where each number is mapped to a triple of [A-Za-z0-9] could be unique for 62*62*62 different entries.
There might be a super cool and efficient algorithm for your case. However, a very simple, tested, and reliable algorithm would be to use a "common" encoding or compression algorithm on the comma seperated string of numbers.
There are many available to choose from.
'best' depends on what your criteria are.
If best means simple : just string the numbers together seperated by a fixed character:
1a5a8a3a20a212a42
This should also be fast
If you want the resulting string to be small, you can run the string above through some compression algorithm like zip, then the result through some encoding like base64 or similar.
You can use the Chinese Remainder Theorem as well.
The idea is to find a number X such that
X = a1 mod n1
X = a2 mod n2
...
X = ak mod nk
gcd(Ni Nj)=1 for each combination (i j).
The CRT says how to find a smallest number X that satisfies these equations.
Like this, you can encode the numbers a1...ak as X, and keep a list of Ns fixed. Each Ni must be greater than ai, quite so.
精彩评论