Is there a circular hash function?

Source: https://www.devze.com, 2022-12-26 09:16
Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.

h(abcdef) = h(bcdefa) = h(cdefab) etc

Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.

I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?

It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...


I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".

So:

"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
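A direct way to realize this is to generate every rotation, take the lexicographically smallest one as the canonical form, and hash that. A minimal O(n²) sketch in C# (the class and method names are my own, not from the answer):

```csharp
using System;
using System.Linq;

static class CanonicalRotation
{
    // Return the lexicographically smallest rotation of s.
    // Doubling the string lets every rotation be read off as a substring.
    public static string Canonicalize(string s)
    {
        if (s.Length == 0) return s;
        string doubled = s + s;
        return Enumerable.Range(0, s.Length)
                         .Select(i => doubled.Substring(i, s.Length))
                         .OrderBy(r => r, StringComparer.Ordinal)
                         .First();
    }

    // Two strings hash equal exactly when their canonical rotations match.
    public static int RotationHash(string s) => Canonicalize(s).GetHashCode();
}
```

Canonicalize("abaac") yields "aacab" and Canonicalize("cabcab") yields "abcabc", matching the examples above.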


Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.

I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.

static int Hash(string s)
{
    int H = 0;

    if (s.Length > 0)
    {
        //any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;

        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        int[] c = new int[0x100];

        for (int i = 0; i < c.Length; i++)
        {
            c[i] = i + 1;
        }

        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();

        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}

The output is now:

7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71

First version (which isn't complete): use XOR, which is commutative (order doesn't matter), plus a small trick involving coprimes to combine ordered hashes of adjacent pairs of letters in the string. Here is an example in C#:

static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;

    int H = 0;

    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());

        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }

    return H;
}


static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}

The output is:

4587590
4587590
4587590
7077996


You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.


I am sure that you could find a function that generates the same hash regardless of character position in the input; however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean: how do you minimize this risk?)

You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.


Here's an implementation using LINQ

public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder 
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();

    return canonicalOrder;
}

assuming generic generator extension method:

public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}

sample usage:

var sequences = new[]
    {
        "abcdef", "bcdefa", "cdefab", 
        "defabc", "efabcd", "fabcde",
        "abaac", "cabcab"
    };
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}

output:

abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc

then call .GetHashCode() on the result if necessary.

sample usage if ToCanonicalOrder() is converted to an extension method:

sequence.ToCanonicalOrder().GetHashCode();


One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.

More formally, consider

for(int i=0; i<string.length; i++) {
  result^=string.rotatedBy(i).hashCode();
}

Where you could replace the ^= with any other commutative operation.

More concretely, consider the input

"abcd"

to get the hash we take

hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").

As we can see, taking the hash of any of these permutations will only change the order that you are evaluating the XOR, which won't change its value.
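The loop above can be sketched concretely in C#. This is O(n²), since every rotation is hashed from scratch; the class name is illustrative:

```csharp
using System;
using System.Linq;

static class MetaHash
{
    // XOR the hashes of all rotations. Every rotation of s produces the
    // same set of rotations, so the XOR (being commutative) is identical.
    public static int Hash(string s)
    {
        string doubled = s + s;   // each rotation is a window into s + s
        int h = 0;
        for (int i = 0; i < s.Length; i++)
            h ^= doubled.Substring(i, s.Length).GetHashCode();
        return h;
    }
}
```

One caveat: a string whose distinct rotations each occur an even number of times, such as "abab" (rotations abab, baba, abab, baba), XORs to 0, so all such strings collide with each other.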


I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.

If you represent the string as a matrix of associations, then abcdef would look like:

  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x

Any rotation of the string produces the same set of associations, so it would be trivial to compare those matrices.
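One cheap way to compare such matrices is to reduce each string to its multiset of cyclically adjacent character pairs, which rotation never changes. A sketch (the helper is my own, not from the answer):

```csharp
using System;
using System.Linq;

static class AdjacencySet
{
    // Encode the multiset of cyclically adjacent character pairs as a
    // sorted, comma-joined string. Rotation permutes the pairs but never
    // changes the multiset, so equal keys are necessary for rotations.
    public static string PairKey(string s) =>
        string.Join(",",
            Enumerable.Range(0, s.Length)
                      .Select(i => $"{s[i]}{s[(i + 1) % s.Length]}")
                      .OrderBy(p => p, StringComparer.Ordinal));
}
```

Note that with repeated characters this is only a necessary condition, not a sufficient one: "aabbab" and "ababba" share the same pair multiset but are not rotations of each other.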


Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.

Here is some Ruby code:

def normalize_string(string)
  myarray = string.split(//)            # split into an array
  index   = myarray.index(myarray.min)  # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)         # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql?(normalize_string('defabc'))  # should return true


Maybe use a rolling hash for each offset (Rabin-Karp-like) and return the minimum hash value? There could be collisions, though.
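That idea can be sketched with a Rabin-Karp-style polynomial rolling hash slid across the doubled string, returning the minimum window value. The base and modulus below are arbitrary choices of mine:

```csharp
using System;

static class MinRollingHash
{
    // Minimum polynomial rolling hash over all rotations of s.
    // O(n) after the O(n) setup, since each roll is constant time.
    public static ulong Hash(string s)
    {
        const ulong B = 257, M = 1_000_000_007UL;  // arbitrary base/modulus
        int n = s.Length;
        if (n == 0) return 0;

        // hash of the first window, and pow = B^(n-1) mod M
        ulong h = 0, pow = 1;
        for (int i = 0; i < n; i++)
        {
            h = (h * B + s[i]) % M;
            if (i > 0) pow = (pow * B) % M;
        }

        ulong min = h;
        string doubled = s + s;
        for (int i = 1; i < n; i++)  // roll across every rotation
        {
            h = (h + M - (doubled[i - 1] * pow) % M) % M;  // drop leading char
            h = (h * B + doubled[i + n - 1]) % M;          // append next char
            if (h < min) min = h;
        }
        return min;
    }
}
```

All rotations of a string slide over the same set of windows, so the minimum is identical for each of them; collisions between unrelated strings remain possible, as the answer notes.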
