Many languages in Europe are inflectional. This means that one word can appear in text in multiple forms. For example, the word 'computer' in Polish, "komputer", has multiple forms: "komputera", "komputerowi", "komputerem", "komputery", etc.
How should I use django+haystack+whoosh properly to deal with language inflection?
Whenever I search for any form - "komputer", "komputera", "komputerowi" - I mean the same thing -> "komputer".
In NLP there is a basic approach based either on stemming words (cutting off suffixes) or on converting each form to its base form ("komputerowi" => "komputer"). There are some libraries that can help with that.
My first thought was to prepare a special template filter that converts every recognized word in a given variable to its base form. Then I could use it in the search index templates in django+haystack. If the search query is also converted before it is evaluated by the whoosh engine, this should work well. See the example (and a sketch of such a filter below it):
haystack search index template:
{{some_indexed_text|convert_to_base_form_filter}}
text to index: "Nie ma komputera" => "Nie ma komputer" <- this is what actually gets indexed
search query: "komputery" => "komputer" <-- this will match
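For illustration, a minimal sketch of such a filter, assuming a hypothetical lemmatize() helper backed by whatever Polish morphological dictionary is available (the helper below is only a placeholder):

# myapp/templatetags/baseform.py  (hypothetical module name)
from django import template

register = template.Library()

def lemmatize(word):
    # Placeholder: plug in a real Polish lemmatizer here
    # (e.g. a morphological dictionary lookup); for now it returns the word unchanged.
    return word

@register.filter
def convert_to_base_form_filter(text):
    # Replace every word with its base form before it reaches the index.
    return u" ".join(lemmatize(token) for token in text.split())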
But I don't think this is an "elegant" solution to the problem, and some other features won't work either - like spelling suggestions.
So - how should I solve this issue? Maybe I should use a search engine other than whoosh?
Whoosh has, by default, stemming only for the English language.
To implement stemming for another language, first look inside:
/your_path_to_whoosh/whoosh/analysis.py
This is where StemmingAnalyzer (the default analyzer) is defined, and it is an excellent starting point. The stem function, imported from porter.py, is the other important place to look.
So, the three steps are:
1. Implement your own stemming function, taking as a reference the stem function in porter.py and any grammar and language references you will need to get the rules right.
2. Implement your own Analyzer, taking as a reference StemmingAnalyzer inside analysis.py. The file is heavily documented, so you should have no problem navigating through it. You'll see that StemmingAnalyzer is basically a chaining of a Tokenizer with a regex to match words, a lowercase filter and the stemming filter, which basically calls the above stemming function. You'll also see that StemFilter takes the stemming function as a parameter, so you don't have to reimplement the filter.
3. Pass your brand new Analyzer at schema creation time, see here: http://files.whoosh.ca/whoosh/docs/latest/schema.html#creating-a-schema
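A minimal sketch of the three steps, assuming a placeholder polish_stem() function (the suffix rules below are only illustrative; real Polish rules are up to you):

from whoosh.analysis import RegexTokenizer, LowercaseFilter, StemFilter
from whoosh.fields import Schema, ID, TEXT

# Step 1: your own stemming function. The suffix list is only a placeholder.
def polish_stem(word):
    for suffix in (u"owi", u"ami", u"em", u"y", u"a"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

# Step 2: your own analyzer - a tokenizer with a word regex, a lowercase
# filter, and StemFilter with your stemming function plugged in.
polish_analyzer = RegexTokenizer() | LowercaseFilter() | StemFilter(polish_stem)

# Step 3: pass the analyzer at schema creation time.
schema = Schema(id=ID(stored=True, unique=True),
                content=TEXT(analyzer=polish_analyzer))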
For future readers: Whoosh can handle different languages with its snowball stemmers. For example:
from whoosh import fields
from whoosh.analysis import StemmingAnalyzer
from whoosh.lang.snowball.russian import RussianStemmer

stemmer_ru = RussianStemmer()
analyzer = StemmingAnalyzer(stemfn=stemmer_ru.stem)

schema = fields.Schema(
    name=fields.TEXT(analyzer=analyzer),
)
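A quick way to check that inflected forms match, sketched with an in-memory index (the expected result assumes both Russian forms stem to the same base):

from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

storage = RamStorage()
ix = storage.create_index(schema)
with ix.writer() as writer:
    writer.add_document(name=u"компьютеры")   # one inflected form
with ix.searcher() as searcher:
    query = QueryParser("name", ix.schema).parse(u"компьютера")  # another form
    print(len(searcher.search(query)))  # should print 1 if both forms stem to the same base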
Whoosh LanguageAnalyzer:
Configures a simple analyzer for the given language, with a LowercaseFilter, StopFilter, and StemFilter.
https://whoosh.readthedocs.io/en/latest/api/analysis.html#whoosh.analysis.LanguageAnalyzer
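A minimal usage sketch (note: Polish is not among the languages bundled with Whoosh's snowball package, so for Polish a custom stemming function as above is still needed):

from whoosh.analysis import LanguageAnalyzer
from whoosh.fields import Schema, TEXT

# "ru" is one of the languages Whoosh ships stemmers and stopwords for;
# see whoosh.lang.languages for the full list.
schema = Schema(name=TEXT(analyzer=LanguageAnalyzer("ru")))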