开发者

Persistent Hashing of Strings in Python

开发者 https://www.devze.com 2022-12-24 09:28 出处:网络
How would you convert an arbitrary string into a unique integer, which would be the same across Python sessions and platforms? For example开发者_StackOverflow hash(\'my string\') wouldn\'t work becaus

How would you convert an arbitrary string into a unique integer, which would be the same across Python sessions and platforms? For example开发者_StackOverflow hash('my string') wouldn't work because a different value is returned for each Python session and platform.


Use a hash algorithm such as MD5 or SHA1, then convert the hexdigest via int():

>>> import hashlib
>>> int(hashlib.md5('Hello, world!').hexdigest(), 16)
144653930895353261282233826065192032313L


If a hash function really won't work for you, you can turn the string into a number.

my_string = 'my string'
def string_to_int(s):
    ord3 = lambda x : '%.3d' % ord(x)
    return int(''.join(map(ord3, s)))

In[10]: string_to_int(my_string)
Out[11]: 109121032115116114105110103L

This is invertible, by mapping each triplet through chr.

def int_to_string(n)
    s = str(n)
    return ''.join([chr(int(s[i:i+3])) for i in range(0, len(s), 3)])

In[12]: int_to_string(109121032115116114105110103L)
Out[13]: 'my string'


Here are my python27 implementation for algorithms listed here: http://www.cse.yorku.ca/~oz/hash.html. No idea if they are efficient or not.

from ctypes import c_ulong

def ulong(i): return c_ulong(i).value  # numpy would be better if available

def djb2(L):
  """
  h = 5381
  for c in L:
    h = ((h << 5) + h) + ord(c) # h * 33 + c
  return h
  """
  return reduce(lambda h,c: ord(c) + ((h << 5) + h), L, 5381)

def djb2_l(L):
  return reduce(lambda h,c: ulong(ord(c) + ((h << 5) + h)), L, 5381)

def sdbm(L):
  """
  h = 0
  for c in L:
    h = ord(c) + (h << 6) + (h << 16) - h
  return h
  """
  return reduce(lambda h,c: ord(c) + (h << 6) + (h << 16) - h, L, 0)

def sdbm_l(L):
  return reduce(lambda h,c: ulong(ord(c) + (h << 6) + (h << 16) - h), L, 0)

def loselose(L):
  """
  h = 0
  for c in L:
    h += ord(c);
    return h
  """
  return sum(ord(c) for c in L)

def loselose_l(L):
  return reduce(lambda h,c: ulong(ord(c) + h), L, 0)


First off, you probably don't really want the integers to be actually unique. If you do then your numbers might be unlimited in size. If that really is what you want then you could use a bignum library and interpret the bits of the string as the representation of a (potentially very large) integer. If your strings can include the \0 character then you should prepend a 1, so you can distinguish e.g. "\0\0" from "\0".

Now, if you prefer bounded-size numbers you'll be using some form of hashing. MD5 will work but it's overkill for the stated purpose. I recommend using sdbm instead, it works very well. In C it looks like this:

static unsigned long sdbm(unsigned char *str)
{
    unsigned long hash = 0;
    int c;

    while (c = *str++)
        hash = c + (hash << 6) + (hash << 16) - hash;

    return hash;
}

The source, http://www.cse.yorku.ca/~oz/hash.html, also presents a few other hash functions.


Here's another option, quite crude (probably has many collisions) and not very legible.

It worked for the purpose of generating an int (and later on, a random color) for different strings:

aString = "don't panic"
reduce( lambda x,y:x+y, map( lambda x:ord(x[0])*x[1],zip( aString, range( 1, len( aString ) ) ) ) )
0

精彩评论

暂无评论...
验证码 换一张
取 消