In my sphinx config file, I have the following:
ignore_chars: "U+0027"
charset_table: "0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a,
U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e,
U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i [SNIP]"
(The charset_table entry is from here: http://speeple.com/unicode-maps.txt)
The expected result is that querying kyles will return all records matching kyles and/or kyle's, since I'm telling sphinx to exclude ' (single quote/apos) from the index (ab'cd -> abcd). However, in practice, this is not happening.
I believe adding it to ignore_chars has the opposite of the desired effect. This tells sphinx not to split on that character; instead, it collapses the word around the ignored character. So kyle's will become kyles instead of kyle and s.
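To make the difference concrete, here is a minimal Python sketch of the two behaviours. It is a toy model only, not Sphinx's actual tokenizer: the function name tokenize and its parameters are made up for illustration, and it assumes that any character outside the charset is a separator while an ignored character is simply deleted.

```python
# Toy model of the two behaviours (NOT Sphinx's real tokenizer).
# `tokenize`, `ignore_chars` and `word_chars` are hypothetical names.
WORD_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789_"


def tokenize(text, ignore_chars="", word_chars=WORD_CHARS):
    """Split text into lowercase tokens.

    Characters in ignore_chars are dropped without ending the current
    token; any other character outside word_chars acts as a separator.
    """
    tokens, current = [], []
    for ch in text.lower():
        if ch in ignore_chars:
            continue  # deleted: the letters on both sides are glued together
        if ch in word_chars:
            current.append(ch)
        elif current:
            tokens.append("".join(current))  # separator ends the token
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("kyle's", ignore_chars="'"))  # apostrophe ignored
print(tokenize("kyle's"))                    # apostrophe as separator
```

Under these assumptions, the first call yields a single token and the second yields two, which is the distinction described above.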
The solution I just tried for this issue, which seems to have worked, was to add s to my list of stopwords (it might need 's in there as well, I can't remember). Sphinx seems to split kyle's into the words kyle and 's. Because match-all mode is on, some documents fail on the match for 's. Adding it to the stopwords seems to have the desired effect.
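The effect of the stopword can be sketched the same way. This is again a simplified model under stated assumptions, not how Sphinx is implemented: tokens and match_all are hypothetical helpers, the apostrophe is treated as a separator, and "match all" is modelled as requiring every query term to appear in the document.

```python
import re

# Toy model of match-all search with stopwords (NOT Sphinx's real code).
# `tokens` and `match_all` are hypothetical helper names.


def tokens(text, stopwords=frozenset()):
    """Lowercase, split on anything outside [a-z0-9_], drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9_]+", text.lower())
            if t not in stopwords]


def match_all(query, document, stopwords=frozenset()):
    """True if every non-stopword query term occurs in the document."""
    doc_terms = set(tokens(document, stopwords))
    return all(term in doc_terms for term in tokens(query, stopwords))


doc = "kyle went home"
print(match_all("kyle's", doc))                   # stray "s" term must match too
print(match_all("kyle's", doc, stopwords={"s"}))  # "s" is dropped from the query
```

In this model the query kyle's fails against a document that only contains kyle, because the leftover s term has nothing to match; once s is a stopword, only kyle has to match.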
It seems like normal stemming should take care of this, though, so maybe we're both doing something wrong...