Question
I am using the result of a myisam_ftdump to generate a search suggestions table. This process went smoothly, but many words appear in the index multiple times. Clearly, I could just SELECT distinct term FROM suggestions ORDER BY weight, but doesn't this penalize words for showing up more than once?
If it does, is there a concise formula for merging the rows?
If it does not, which rows should I keep (e.g., highest weighted, lowest weighted)?
Example Data
+-----+------------+----------+
| id  | word       | weight   |
+-----+------------+----------+
| 670 | young      | 0.416022 |
| 669 | york       | 0.54944  |
| 668 | years      | 0.281683 |
| 667 | years      | 0.416022 |
| 666 | wrote      | 0.416022 |
| 665 | written    | 0.35841  |
| 664 | writing    | 0.29518  |
| 663 | wright     | 0.281683 |
| 662 | witness    | 0.281683 |
| 661 | wiesenthal | 0.452452 |
| 660 | white      | 0.35841  |
| 659 | white      | 0.281683 |
| 658 | wgbh       | 0.369332 |
| 657 | weighs     | 0.35841  |
+-----+------------+----------+
See especially 'white' and 'years'.
It looks like you ran myisam_ftdump -d. I think you want to use myisam_ftdump -c instead.
That will give you one row per word, along with a count of how many times that word appears in the index, and its global weight.
Here's the doc on -c vs. -d:
-c, --count Calculate per-word stats (counts and global weights).
-d, --dump Dump index (incl. data offsets and word weights).
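If re-running the dump with -c is not convenient, a table already loaded from the -d output can at least be collapsed to one row per word. The following is only a rough sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, takes the table name suggestions from the question and the column names from the example data, and keeps the highest per-row weight for each word. That choice of MAX(weight) is an assumption for illustration; it is not the global weight that -c reports.

```python
import sqlite3

# A few rows copied from the question's example data: (id, word, weight).
ROWS = [
    (670, "young", 0.416022),
    (669, "york", 0.54944),
    (668, "years", 0.281683),
    (667, "years", 0.416022),
    (660, "white", 0.35841),
    (659, "white", 0.281683),
    (658, "wgbh", 0.369332),
]


def collapse_duplicates(rows):
    """Return one (word, occurrences, max_weight) tuple per distinct word."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for the MySQL table
    conn.execute("CREATE TABLE suggestions (id INTEGER, word TEXT, weight REAL)")
    conn.executemany("INSERT INTO suggestions VALUES (?, ?, ?)", rows)
    # GROUP BY yields one row per word; COUNT(*) records how often it appeared.
    # MAX(weight) is an illustrative choice, not the -c global weight.
    result = conn.execute(
        "SELECT word, COUNT(*) AS occurrences, MAX(weight) AS max_weight "
        "FROM suggestions GROUP BY word ORDER BY max_weight DESC, word"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for word, occurrences, max_weight in collapse_duplicates(ROWS):
        print(word, occurrences, max_weight)
```

The same SELECT ... GROUP BY statement should carry over to MySQL largely unchanged, though regenerating the table from the -c output is likely the cleaner route.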