开发者

Rabin Karp implementation the precomputed hash values and Hash function values don't match

开发者 https://www.devze.com 2022-12-07 19:20 出处:网络
I have Rabin Karp implementation in C++ (Rabin Karp is a string pattern matching algorithm that uses hashing technique to match substrings [Wiki link to the algorithm] (https://en.wikipedia.org/wiki/R

I have Rabin Karp implementation in C++ (Rabin Karp is a string pattern matching algorithm that uses hashing technique to match substrings [Wiki link to the algorithm] (https://en.wikipedia.org/wiki/Rabin%E2%80%93Karp_algorithm)), A trivial test case fails, I suspect precomputation of substring hashes routine is doing something wrong, because I tried replacing the precomputed value with the PolynomialHash function call ie. computing the hash of the substring in flight. I get the correct result with PolynomialHash function call but precomputed doesn't. What might have I done wrong here?

Algorithm pseudo code

1) Choose a large Prime p. 
2) Let x be a random value in range [1, p-1]. 
3) Get the hash value for the pattern (p and x are used here).
4) For each substring which has length equal to the pattern compute the hash value of this substring. 
5) If the hash values of pattern and the substring match then this substring could be a potential match, to ensure, check whether they are in fact the same. 

The reason a good implementation should avoid using PolynomialHash is because the time complexity goes high, instead precomputing the Hash values for each substring is more efficient. So, my problem is the precomputed hashes, and the function call hash values don't match. Below is the relevant code and the failing testcase. Note the large prime I used is 1e9 + 7. ll is typedef for long long int.

Rabin Karp implementation the precomputed hash values and Hash function values don't match

bool AreEqual(const string &text,const string &pattern,int si,int ei){
    for(int i =0;i<pattern.size();i++){
        if(pattern[i]!=text[si+i]) return false; 
    }
    return true;
}

ll PolynomialHash(const string &str,ll prime,ll x,int si,int ei){
    ll hash_val = 0; 
    for(int i = ei;i>=si;i--) hash_val = (hash_val*x+str[i])%prime;
    return hash_val;
} 


ll GetInt(const char &x){
    return x-'a';
}

void PrecomputeSubstringHashes(vector<ll> &substring_hashes,const string &text,const string &pattern,ll prime,ll x){
    int text_len = text.length(); 
    int pattern_len = pattern.length(); 
    substring_hashes[text_len-pattern_len] = PolynomialHash(text,prime,x,text_len-pattern_len,text_len-1); 
    ll y = 1;
    for(int i =0;i<pattern_len;i++) y = (y*x)%prime; // can still be optimized to O(log(|P|)) 
    for(int i = text_len-pattern_len-1;i>=0;i--){
        substring_hashes[i] = (prime + substring_hashes[i+1]*x + GetInt(text[i]) - y*GetInt(text[i+pattern_len]))%prime;
    }
}

void RabinKarpSubstringSearch(const string &text,const string &pattern){
    vector<int>positions; 
    ll prime = Mod; // This large prime used is 1e9 + 7
    ll x = 1 + rand()%prime; 
    ll pattern_hash = PolynomialHash(pattern,prime,x,0,pattern.length()-1);  
    assert(text.length()>=pattern.length());
    vector<ll>substring_hashes(text.length()-pattern.length()+1,0);  
    PrecomputeSubstringHashes(substring_hashes,text,pattern,prime,x);
    for(int i =0;i<=text.length()-pattern.length();i++){
        ll substr_hash = substring_hashes[i]; // Try with PolynomialHash it gives correct results
        if(substr_hash!=pattern_hash) continue;  
       // cout<<"Hitting at "<<i<<' '; // checking if expected and more indices [the wrong ones] hit this line.
        if(AreEqual(text,pattern,i,i+pattern.length()-1)) positions.push_back(i); /// The probab开发者_如何学JAVAlity of this condition being hit is too less thanks to a large prime number.
    }  
    cout<<'\n';
    for(const int &x : positions) cout<<x<<' ';
}
// Test case 

aab 
abxaabbbaaab 

PolynomialHash gives [3,9] as expected 
substring_hashes[i] gives only [9] 

I Did some further analysis I was right to suspect the Hash returned by Function and the precomputed values don't match. Here are both of them with input same as in the testcase. The Pattern Hash value is 850147133 seen at index 3 and 9 in Function hash.

// Function computed Hashes [index,HashValue]
0 45760133
1 921924329
2 423268806
3 **850147133**
4 654436503
5 654436504
6 227558154
7 423268784
8 423268783
9 **850147133**
// Precomputed Hashes [index,HashValue] 
0 317287043
1 159209231
2 90628189
3 807191256
4 327387316
5 934623271
6 740231484
7 428758105
8 399604110
9 **850147133**
0

精彩评论

暂无评论...
验证码 换一张
取 消