
Ignoring apostrophes in sphinx indexes

Developer https://www.devze.com 2022-12-18 00:08 Source: Internet
In my sphinx config file, I have the following:

ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
  U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
  U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"

(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)

The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling sphinx to exclude ' (single quote/apos) from the index (ab'cd -> abcd). However, in practice, this is not happening.


I believe adding it to ignore_chars has the opposite of the desired effect. It tells Sphinx not to split on that character; instead, the word is collapsed around the ignored character. So kyle's becomes the single word kyles, rather than the two words kyle and s.
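
To make the difference concrete, here is a toy Python model of the two behaviours described above. It is not Sphinx's tokenizer: the function names tokenize_ignoring and tokenize_splitting are made up for illustration, and the sketch assumes that only lowercase letters, digits and the underscore count as word characters.

```python
import re

WORD = re.compile(r"[a-z0-9_]+")  # assumed word characters for this sketch


def tokenize_ignoring(text, ignore_chars="'"):
    """Drop the ignored characters first, then split on everything else."""
    for ch in ignore_chars:
        text = text.replace(ch, "")
    return WORD.findall(text.lower())


def tokenize_splitting(text):
    """Treat every non-word character, including ', as a separator."""
    return WORD.findall(text.lower())


print(tokenize_ignoring("Kyle's bike"))   # ['kyles', 'bike']
print(tokenize_splitting("Kyle's bike"))  # ['kyle', 's', 'bike']
```

In this model, ignoring the apostrophe yields the single term kyles, while treating it as a separator yields kyle plus a stray s.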

The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (you might need 's in there as well; I can't remember). Sphinx seems to split kyle's into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stopwords seems to have the desired effect.
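
The stopword workaround can be sketched the same way. This is a simplified stand-in for match-all behaviour, not Sphinx itself: the helpers tokens and match_all are hypothetical, and the sketch assumes the apostrophe acts as a separator, so the leftover term is a bare s.

```python
import re


def tokens(text, stopwords=frozenset()):
    """Split on non-word characters and drop any stopwords."""
    words = re.findall(r"[a-z0-9_]+", text.lower())
    return [w for w in words if w not in stopwords]


def match_all(query, document, stopwords=frozenset()):
    """Return True only if every query term occurs in the document."""
    doc_terms = set(tokens(document, stopwords))
    return all(term in doc_terms for term in tokens(query, stopwords))


# Without stopwords the stray "s" must also match, so the query fails.
print(match_all("kyle's", "kyle rode home"))                   # False
# With "s" as a stopword only "kyle" has to match.
print(match_all("kyle's", "kyle rode home", stopwords={"s"}))  # True
```

One side effect of this design is that a genuine one-letter term s can no longer be searched for.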

It seems like normal stemming should take care of this, however, so maybe we're both doing something wrong...
