开发者

Lists Hash function

开发者 https://www.devze.com 2023-01-01 03:42 出处:网络
I\'m trying to make a hash function so I can tell if too lists with same sizes contain the same elements.

I'm trying to make a hash function so I can tell if too lists with same sizes contain the same elements.

For exemple this is what I want:

f((1 2 3))=f((1 3 2))=f((2 1 开发者_StackOverflow社区3))=f((2 3 1))=f((3 1 2))=f((3 2 1)).

Any ideea how can I approch this problem ? I've tried doing the sum of squares of all elements but it turned out that there are collisions,for exemple f((2 2 5))=33=f((1 4 4)) which is wrong as the lists are not the same.

I'm looking for a simple approach if there is any.


Sort the list and then:

list.each do |current_element|
  hash = (37 * hash + current_element) % MAX_HASH_VALUE
end


You're probably out of luck if you really want no collisions. There are N choose k sets of size k with elements in 1..N (and worse, if you allow repeats). So imagine you have N=256, k=8, then N choose k is ~4 x 10^14. You'd need a very large integer to distinctly hash all of these sets.

Possibly you have N, k such that you could still make this work. Good luck.

If you allow occasional collisions, you have lots of options. From simple things like your suggestion (add squares of elements) and computing xor the elements, to complicated things like sort them, print them to a string, and compute MD5 on them. But since collisions are still possible, you have to verify any hash match by comparing the original lists (if you keep them sorted, this is easy).


So you are looking something provides these properties,

1. If h(x1) == y1, then there is an inverse function h_inverse(y1) == x1

2. Because the inverse function exists, there cannot be a value x2 such that x1 != x2, and h(x2) == y1.

Knuth's Multiplicative Method

In Knuth's "The Art of Computer Programming", section 6.4, a multiplicative hashing scheme is introduced as a way to write hash function. The key is multiplied by the golden ratio of 2^32 (2654435761) to produce a hash result.

hash(i)=i*2654435761 mod 2^32

Since 2654435761 and 2^32 has no common factors in common, the multiplication produces a complete mapping of the key to hash result with no overlap. This method works pretty well if the keys have small values. Bad hash results are produced if the keys vary in the upper bits. As is true in all multiplications, variations of upper digits do not influence the lower digits of the multiplication result.

Robert Jenkins' 96 bit Mix Function

Robert Jenkins has developed a hash function based on a sequence of subtraction, exclusive-or, and bit shift.

All the sources in this article are written as Java methods, where the operator '>>>' represents the concept of unsigned right shift. If the source were to be translated to C, then the Java 'int' data type should be replaced with C 'uint32_t' data type, and the Java 'long' data type should be replaced with C 'uint64_t' data type.

The following source is the mixing part of the hash function.

int mix(int a, int b, int c)
{
  a=a-b;  a=a-c;  a=a^(c >>> 13);
  b=b-c;  b=b-a;  b=b^(a << 8); 
  c=c-a;  c=c-b;  c=c^(b >>> 13);
  a=a-b;  a=a-c;  a=a^(c >>> 12);
  b=b-c;  b=b-a;  b=b^(a << 16);
  c=c-a;  c=c-b;  c=c^(b >>> 5);
  a=a-b;  a=a-c;  a=a^(c >>> 3);
  b=b-c;  b=b-a;  b=b^(a << 10);
  c=c-a;  c=c-b;  c=c^(b >>> 15);
  return c;
}

You can read details from here


If all the elements are numbers and they have a maximum, this is not too complicated, you sort those elements and then you put them together one after the other in the base of your maximum+1.

Hard to describe in words... For example, if your maximum is 9 (that makes it easy to understand), you'd have :

f(2 3 9 8) = f(3 8 9 2) = 2389

If you maximum was 99, you'd have :

f(16 2 76 8) = (0)2081676

In your example with 2,2 and 5, if you know you would never get anything higher than 5, you could "compose" the result in base 6, so that would be :

f(2 2 5) = 2*6^2 + 2*6 + 5 = 89 f(1 4 4) = 1*6^2 + 4*6 + 4 = 64


Combining hash values is hard, I've found this way (no explanation, though perhaps someone would recognize it) within Boost:

template <class T>
void hash_combine(size_t& seed, T const& v)
{
  seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

It should be fast since there is only shifting, additions and xor taking place (apart from the actual hashing).

However the requirement than the order of the list does not influence the end-result would mean that you first have to sort it which is an O(N log N) operation, so it may not fit.

Also, since it's impossible without more stringent boundaries to provide a collision free hash function, you'll still have to actually compare the sorted lists if ever the hash are equals...


I'm trying to make a hash function so I can tell if two lists with same sizes contain the same elements.

[...] but it turned out that there are collisions

These two sentences suggest you are using the wrong tool for the job. The point of a hash (unless it is a 'perfect hash', which doesn't seem appropriate to this problem) is not to guarantee equality, or to provide a unique output for every given input. In the general usual case, it cannot, because there are more potential inputs than potential outputs.

Whatever hash function you choose, your hashing system is always going to have to deal with the possibility of collisions. And while different hashes imply inequality, it does not follow that equal hashes imply equality.

As regards your actual problem: a start might be to sort the list in ascending order, then use the sorted values as if they were the prime powers in the prime decomposition of an integer. Reconstruct this integer (modulo the maximum hash value) and there is a hash value.

For example:

2 1 3

sorted becomes

1 2 3

Treating this as prime powers gives

2^1.3^2.5^3

which construct

2.9.125 = 2250

giving 2250 as your hash value, which will be the same hash value as for any other ordering of 1 2 3, and also different from the hash value for any other sequence of three numbers that do not overflow the maximum hash value when computed.


A naïve approach to solving your essential problem (comparing lists in an order-insensitive manner) is to convert all lists being compared to a set (set in Python or HashSet in Java). This is more effective than making a hash function since a perfect hash seems essential to your problem. For almost any other approach collisions are inevitable depending on input.

0

精彩评论

暂无评论...
验证码 换一张
取 消