开发者

Is there a way to search SQL database for similar words (mean not identical words)?

开发者 https://www.devze.com 2023-01-12 17:50 出处:网络
Is there a way to search MySQL database for similar words (mean not identical words) . For example : user searches in database for word \"abcd\" and there is a word \"abd\" in the database so the sear

Is there a way to search MySQL database for similar words (mean not identical words) . For example : user searches in database for word "abcd" and there is a word "abd" in the database so the search engine 开发者_如何学Cor the program Ask the user "Do you mean [abd] ? " as in most of search engines in the web ? Please notice that the search word is not a part of the existing word (can't use "like")


Have a look at the Damerau-Levenshtein distance algorithm. It calculates the "distance" between two strings and determines how many steps it takes to transform one string into another. The less steps the closer the two strings are.

This article shows the algorithm implemented as a MySQL stored function.

The algorithm is so much better than LIKE or SOUNDEX.

I believe Google uses crowd sourced data rather than an algorithm. ie if a user types in abcd, clicks on the back button and then immediately searches for abd then it establishes a relationship between the two search terms as the user wasn't happy with the results. Once you have a very large community searching then the pattern appears.


Since the link in Dave Barker's answer is dead, here is the code from an archived version of the website:

CREATE FUNCTION LEVENSHTEIN (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        DECLARE cv0, cv1 VARBINARY(256);
        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
        IF s1 = s2 THEN
          RETURN 0;
        ELSEIF s1_len = 0 THEN
          RETURN s2_len;
        ELSEIF s2_len = 0 THEN
          RETURN s1_len;
        ELSE
          WHILE j <= s2_len DO
            SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
          END WHILE;
          WHILE i <= s1_len DO
            SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
            WHILE j <= s2_len DO
                SET c = c + 1;
                IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                IF c > c_temp THEN SET c = c_temp; END IF;
                SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
            END WHILE;
            SET cv1 = cv0, i = i + 1;
          END WHILE;
        END IF;
        RETURN c;
      END

To note:

  • Maximum length of input strings is 255 characters. I’m sure you could edit the function to support more if needed.

  • I’ve tested it with international characters on a utf8_bin column and it seemed to work, but I’ve not tested that capability exstensively.

  • I’ve only tested it on MySQL 5.0+. No idea how it will work on versions less than that.

And as a bonus I also created a helper function that returns the ratio (as a percentage) of different : same characters which can be more helpful than just a straight edit distance (idea from here).

CREATE FUNCTION LEVENSHTEIN_RATIO (s1 VARCHAR(255), s2 VARCHAR(255))
  RETURNS INT
    DETERMINISTIC
      BEGIN
        DECLARE s1_len, s2_len, max_len INT;
        SET s1_len = LENGTH(s1), s2_len = LENGTH(s2);
        IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
        RETURN ROUND((1 - LEVENSHTEIN(s1, s2) / max_len) * 100);
      END


Depends on how far apart they are, you could look into soundex perhaps..

http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex


Check out Levenshtein_distance


Another technique is to create indexes on trigrams.

0

精彩评论

暂无评论...
验证码 换一张
取 消