Is it possible to compare two strings by their "hash" numbers?

I have a string which is lost forever. The only thing I have about it is some magic hash number. Now I have a new string, which could be similar or equal to the lost one. I need to find out how close it is.

Integer savedHash = 352736;
String newText = "this is new string";
if (Math.abs(hash(newText) - savedHash) < 100) {
  // wow, they are very close!
}

Are there any algorithms for this purpose?

ps. The length of the text is not fixed.

pps. I know how usual hash codes work. I'm interested in an algorithm that will work differently, giving me the functionality explained above.

ppps. In a very simple scenario this hash() method would look like:

public int hash(String txt) {
  return txt.length();
}

Standard hashing will not work in this case since close hash values do not imply close strings. In fact, most hash functions are designed to give close strings very different values, so as to create a random distribution of hash values for any given set of input strings.

If you had access to both strings, then you could use some kind of string distance function, such as Levenshtein distance. This calculates the edit distance between two strings, or the number of edits required to transform one to the other.
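For reference, here is a minimal sketch of Levenshtein distance in Java, using the standard two-row dynamic-programming formulation. The class and method names are made up for illustration.

```java
// Illustrative sketch; EditDistance and levenshtein are hypothetical names.
class EditDistance {
    // Returns the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
    }
}
```

Note that this needs both strings in full, which is exactly what is missing when only a saved number is left.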

In this case however, the best approach might be to use some kind of fuzzy hashing technique. That way you don't have to store the original string, and can still get some measure of similarity.
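The answer does not name a specific technique, so as one possible illustration here is a rough sketch of SimHash, a similarity-preserving fingerprint, computed over character bigrams. The choice of SimHash, the bigram features, the bit mixer (the MurmurHash3 32-bit finalizer), and all names are assumptions made for this example, not something from the question.

```java
// Illustrative sketch; SimHashSketch and its methods are hypothetical names.
class SimHashSketch {
    // Scrambles the bits of h so every output bit depends on the whole input
    // (this is the MurmurHash3 32-bit finalizer).
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }

    // Builds a 32-bit fingerprint: each bigram votes +1 or -1 on every bit,
    // and a bit of the result is set when its votes sum to a positive number.
    // Strings shorter than two characters have no bigrams and yield 0.
    static int simHash(String text) {
        int[] votes = new int[32];
        for (int i = 0; i + 1 < text.length(); i++) {
            int feature = mix((text.charAt(i) << 16) | text.charAt(i + 1));
            for (int bit = 0; bit < 32; bit++) {
                votes[bit] += ((feature >>> bit) & 1) == 1 ? 1 : -1;
            }
        }
        int fingerprint = 0;
        for (int bit = 0; bit < 32; bit++) {
            if (votes[bit] > 0) {
                fingerprint |= 1 << bit;
            }
        }
        return fingerprint;
    }

    // Hamming distance between two fingerprints: 0 means identical, 32 means all bits differ.
    static int distance(int h1, int h2) {
        return Integer.bitCount(h1 ^ h2);
    }

    public static void main(String[] args) {
        int saved = simHash("The quick brown fox jumps over the lazy dog");
        int fresh = simHash("The slow brown fox jumps over the crazy dog");
        System.out.println(distance(saved, fresh));
    }
}
```

Strings that share most of their bigrams tend to end up a small Hamming distance apart, but a small distance only suggests similarity, it does not prove it. It also only helps if the saved number was produced by such a function in the first place.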

If the hashes don't match then the strings are different.

If the hashes match then the strings are probably the same.

There is nothing else you can infer from the hash value.

No, this isn't going to work. The similarity of a hash bears no relation to the similarity of the original strings. In fact, it is entirely possible for 2 different strings to have the same hash. All you can say for sure is that if the hashes are different the strings were different.

[Edited in light of comment, possibility of collision is of course very real]

Edit for clarification:

If you only have the hash of the old string then there is no way you are going to find the original value of that string. There is no algorithm that would tell you if the hashes of 2 different strings represented strings that were close, and even if there were, it wouldn't help. Even if you find a string that has an exact hash match with your old string, there is still no way you would know if it was your original string, as any number of strings can produce the same hash value. In fact, there is a vast* number of strings that can produce the same hash.

[* In theory this vast number is actually infinite, but on any real storage system you can't generate an infinite number of strings. In any case your chance of matching an unknown string via this approach is very slim unless your hashes are large in relation to the input string, and even then you will need to brute force your way through every possible string]

As others have pointed out, with a typical hash algorithm, it just doesn't work like that at all.

There are, however, a few people who've worked out algorithms that are at least somewhat similar to that. For one example, there's a company called "Xpriori" that has some hashing (or at least hash-like) algorithms that allow things like that. They'll let you compare for degree of similarity, or (for example) let you combine hashes so hash(a) + hash(b) == hash(a+b) (for some definition of +, not just simple addition of the numbers). Like with most hashes, there's always a possibility of collision, so you have some chance of a false positive (but by picking the hash size, you can set that chance to an arbitrarily small value).

As such, if you're dealing with existing data, you're probably out of luck. If you're creating something new, and want capabilities on this order, it is possible -- though trying to do it on your own is seriously non-trivial.
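To give a feel for what "hash(a) + hash(b) == hash(a+b) for some definition of +" can mean, here is a small sketch using Java's own String.hashCode(), which is a polynomial in 31 and so happens to have a combining rule. This is not Xpriori's algorithm (which isn't described here), just an illustration of the property; the class and method names are made up.

```java
// Illustrative sketch; ConcatHash, pow31 and combine are hypothetical names.
class ConcatHash {
    // Computes 31^n with int wraparound, the same arithmetic String.hashCode() uses.
    static int pow31(int n) {
        int result = 1;
        for (int i = 0; i < n; i++) {
            result *= 31;
        }
        return result;
    }

    // Derives the hashCode of (a + b) from the hashCodes of a and b plus the
    // length of b, without looking at either string's characters.
    static int combine(int hashA, int hashB, int lengthB) {
        return hashA * pow31(lengthB) + hashB;
    }

    public static void main(String[] args) {
        String a = "hello ";
        String b = "world";
        int combined = combine(a.hashCode(), b.hashCode(), b.length());
        System.out.println(combined == (a + b).hashCode()); // prints true
    }
}
```

Here the "+" on hashes is "multiply by 31 to the power of the second length, then add", and it needs the length of b as extra information.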

No. Hashes are designed so that minor variations in the input string cause huge differences in the resulting hash. This is very useful for dictionary implementations, as well as verifying the integrity of a file (a single changed bit will cause a completely different hash). So no, it's not some kind of thing you can ever use as an inequality comparison.

If the hashCodes are different it cannot be the same String, however many Strings can have the same hashCode().

Depending on the nature of the Strings, a plain comparison could be more efficient than comparing hashCode() values: hashCode() has to inspect and perform a calculation on every character, whereas a comparison can stop early, e.g. if the lengths differ or as soon as it sees a different character.

Any good hashing algorithm will by definition NEVER yield similar hashes for similar arguments. Otherwise, it would be too easy to crack. If the hashed value of "aaaa" looks similar to that of "aaab", then that is a poor hash. I have cracked ones like that before without too much difficulty (fun puzzle to solve!). But you never know, maybe your hash algorithm is poor. Any idea what it is?

If you have time, you can just brute force this solution by hashing every possible word. Not elegant, but possible. Easier if you know the length of the original word as well.
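A rough sketch of that brute-force idea in Java, assuming the saved number came from String.hashCode() and that you can guess the alphabet and the length (both big assumptions). All names here are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch; HashBruteForce and its methods are hypothetical names.
class HashBruteForce {
    // Collects every string of exactly `length` characters drawn from `alphabet`
    // whose String.hashCode() equals `target`.
    static List<String> findMatches(int target, String alphabet, int length) {
        List<String> matches = new ArrayList<>();
        search(new StringBuilder(), target, alphabet, length, matches);
        return matches;
    }

    // Depth-first enumeration of all candidate strings, one character at a time.
    private static void search(StringBuilder prefix, int target, String alphabet,
                               int length, List<String> matches) {
        if (prefix.length() == length) {
            String candidate = prefix.toString();
            if (candidate.hashCode() == target) {
                matches.add(candidate);
            }
            return;
        }
        for (int i = 0; i < alphabet.length(); i++) {
            prefix.append(alphabet.charAt(i));
            search(prefix, target, alphabet, length, matches);
            prefix.setLength(prefix.length() - 1); // backtrack
        }
    }

    public static void main(String[] args) {
        int savedHash = "Ea".hashCode();
        System.out.println(findMatches(savedHash, "ABCDEFabcdef", 2)); // prints [Ea, FB]
    }
}
```

Even in this tiny search space two candidates come back, so a hash match alone does not tell you which one was the original. The running time also grows exponentially with the length.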

If it is a standard hash algorithm, like MD5, you can find websites that already have large mappings of source and hash, and get the answer that way. Try http://hashcrack.com/

I used this website successfully after one of our devs left and I needed to recover a password.

Cheers,

Daniel

You can treat the string as a really big number, but that's about the extent of your abilities in the general situation. If you have a specific problem domain, you may be able to compress a representation of the string to something smaller without losses, but still it will not be very useful.

For example, if you are working with individual words, you can use soundex to compare how similar two words will sound...
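As a sketch of the soundex idea, here is a compact Java version of American Soundex as it is commonly described (first letter plus three digits). Variants of the rules exist, so treat the details as an assumption; the class and method names are made up.

```java
// Illustrative sketch; SoundexSketch and encode are hypothetical names.
class SoundexSketch {
    // Digit code for each letter A..Z; '0' marks letters that get no digit
    // (the vowels plus H, W and Y).
    private static final String CODES = "01230120022455012623010202";

    // Returns a 4-character code such as "R163", or "" if the word has no letters A-Z.
    static String encode(String word) {
        StringBuilder out = new StringBuilder();
        char lastCode = '0';
        for (int i = 0; i < word.length() && out.length() < 4; i++) {
            char c = Character.toUpperCase(word.charAt(i));
            if (c < 'A' || c > 'Z') {
                continue; // skip anything that is not a plain letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                lastCode = code;
                continue;
            }
            if (code != '0' && code != lastCode) {
                out.append(code); // new consonant group
            }
            if (c != 'H' && c != 'W') {
                lastCode = code; // H and W do not separate equal codes, vowels do
            }
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Robert")); // prints R163
        System.out.println(encode("Rupert")); // prints R163
    }
}
```

Two words with the same code are likely to sound alike, but the code only says "same" or "different", and it is meant for single English-like words, not arbitrary text.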

The best you can do with traditional hash codes will be to compare two strings for equality vs likely inequality. False positives are possible, but there will be no false negatives. You cannot compare for similarity this way, though.
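In Java terms, that check might look like the sketch below, assuming the saved number came from String.hashCode() (the question doesn't say which function produced it). The names are made up for illustration.

```java
// Illustrative sketch; HashEqualityCheck and possiblyEqual are hypothetical names.
class HashEqualityCheck {
    // false: the candidate is definitely not the original string.
    // true:  the candidate may be the original, or may just collide with it.
    static boolean possiblyEqual(int savedHash, String candidate) {
        return candidate.hashCode() == savedHash;
    }

    public static void main(String[] args) {
        int savedHash = "Aa".hashCode(); // 2112
        System.out.println(possiblyEqual(savedHash, "Aa")); // prints true
        System.out.println(possiblyEqual(savedHash, "BB")); // prints true (a false positive)
        System.out.println(possiblyEqual(savedHash, "Ab")); // prints false
    }
}
```

"BB" is a different string with the same hashCode() as "Aa", which is the false-positive case.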

A normal hash code changes a lot when the object changes a bit. It is made to distinguish different objects and doesn't care how much they resemble each other. Therefore the answer is no.

Well, it seems you want not a real hash of the string, but some kind of fingerprint of it. Because you want it to be 32 bits, one way could be:

Calculate the Pearson correlation coefficient between the first and second halves of the string (if the string has an odd number of chars, add some padding) and store this number as a 32-bit floating point number. But I'm not sure how reliable this method will be.

==EDIT==
Here is C example code (un-optimized) which implements this idea (a little bit modified):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>

float mean(char *str) {
  char *x;
  float sum = 0.0;

  for(x=str; *x!='\0'; x++) {
    sum += (float) *x;
  }
  return sum/strlen(str);
}

float stddev(char *str) {
  char *x;
  float sum = 0.0;
  float u = mean(str);

  for(x=str; *x!='\0'; x++) {
    sum += ((float)*x - u)*((float)*x - u);
  }
  return sqrt(sum/strlen(str));
}

float covariance(char *str1, char *str2) {
  int i;
  int im = fmin(strlen(str1),strlen(str2));
  float sum = 0.0;
  float u1 = mean(str1);
  float u2 = mean(str2);

  for(i=0; i<im; i++) {
    sum += ((float)str1[i] - u1)*((float)str2[i] - u2);
  }
  return sum/im;
}

float correlation(char *str1, char *str2) {
  float cov = covariance(str1,str2);
  float dev1 = stddev(str1);
  float dev2 = stddev(str2);
  return cov/(dev1*dev2);
}

float string_fingerprint(char *str) {
  int len = strlen(str);
  char *rot = (char*) malloc((len+1)*sizeof(char));
  int i;
  // rotate string by CHAR_COUNT/2
  for(i=0; i<len; i++){
    rot[i] = str[(i+len/2)%len];
  }
  rot[len] = '\0';
  // now calculate correlation between original and rotated strings
  float corr = correlation(str,rot);
  free(rot);
  return corr;
}

int main() {
  char string1[] = "The quick brown fox jumps over the lazy dog";
  char string2[] = "The slow brown fox jumps over the crazy dog";
  float f1 = string_fingerprint(string1);
  float f2 = string_fingerprint(string2);
  if (fabs(f1 - f2) < 0.2) {
    printf("wow, they are very close!\n");
  }
  return 0;
}

hth!