How should I handle duplicate entries' weights in a MyISAM search index?

Question

I am using the result of a myisam_ftdump to generate a search suggestions table. This process went smoothly, but many words appear in the index multiple times. Clearly, I could just SELECT DISTINCT term FROM suggestions ORDER BY weight, but doesn't this penalize words for showing up more than once?

If it does, is there a concise formula for merging the rows?

If it does not, which rows should I keep (e.g., highest weighted, lowest weighted)?
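For concreteness, the kind of merge I have in mind would look something like the query below (column names follow the example data further down; which aggregate to use, MAX, SUM, or something else, is exactly what I'm unsure about):

  -- A sketch of one possible merge over the suggestions table
  -- shown below; whether MAX is the right way to combine the
  -- weights is the open question.
  SELECT   word,
           COUNT(*)    AS entries,
           MAX(weight) AS weight
  FROM     suggestions
  GROUP BY word
  ORDER BY weight DESC;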

Example Data

+-----+------------+----------+
| id  | word       | weight   |
+-----+------------+----------+
| 670 | young      | 0.416022 |
| 669 | york       |  0.54944 |
| 668 | years      | 0.281683 |
| 667 | years      | 0.416022 |
| 666 | wrote      | 0.416022 |
| 665 | written    |  0.35841 |
| 664 | writing    |  0.29518 |
| 663 | wright     | 0.281683 |
| 662 | witness    | 0.281683 |
| 661 | wiesenthal | 0.452452 |
| 660 | white      |  0.35841 |
| 659 | white      | 0.281683 |
| 658 | wgbh       | 0.369332 |
| 657 | weighs     |  0.35841 |
+-----+------------+----------+

See especially 'white' and 'years'.


Answer

It looks like you ran myisam_ftdump -d. I think you want to use myisam_ftdump -c instead.

That will give you one row per word, along with a count of how many times that word appears in the index, and its global weight.

Here's the doc on -c vs. -d:

  -c, --count         Calculate per-word stats (counts and global weights).
  -d, --dump          Dump index (incl. data offsets and word weights).
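
For example, running something along the lines of myisam_ftdump -c /var/lib/mysql/mydb/articles 1 (the data-file path and index number here are placeholders for your own table and its FULLTEXT index) gives you one line of stats per word. Below is a sketch of how that output could feed your suggestions table, assuming a hypothetical word_stats staging table whose column names are my own invention:

  -- Hypothetical staging table for the -c output; the column
  -- names are assumptions, not anything myisam_ftdump dictates.
  CREATE TABLE word_stats (
    word          VARCHAR(64)  NOT NULL,
    doc_count     INT UNSIGNED NOT NULL,
    global_weight DOUBLE       NOT NULL
  );

  -- The stats are already one row per word, so no GROUP BY or
  -- dedup step is needed when building the suggestions table.
  INSERT INTO suggestions (word, weight)
  SELECT word, global_weight
  FROM   word_stats;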
