I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. e.g. Consider the value:
this_example-has some/weird (chars) 100%
I want it stored right like that (so that I can retrieve exactly t开发者_JS百科hat for showing in the results list), but I want lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed chars by a white-space before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have realy creepy names. I think this would improve search assertiveness.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, @"[\W_]+", " ");
You mention that you need another Analyzer for some searching functionality. Dont forget the PerFieldAnalyzerWrapper which will allow you to use different analyzers within the same document.
public static void Main() {
var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());
IndexWriter writer = null; // TODO: Retrieve these.
Document document = null;
writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the work of the analyzer. And I'd start by using a tool like luke to see what the standard analyzer does with your term before getting into what to use -- it tends to do a good job stripping noise characters and words.
精彩评论