I have to index a lot of documents that contain reference numbers like "aaa.bbb.ddd-fff". The structure can vary, but it is always arbitrary numbers or characters joined by "/", "-", "_", or some other delimiter.
The users want to be able to search for any of the substrings, like "aaa" or "ddd", and also for combinations like "aaa.bbb" or "ddd-fff". The best I have come up with is a custom token filter, modeled after the synonym filter in "Lucene in Action", which emits multiple terms for each input. In my case it returns "aaa.bbb", "bbb.ddd", "bbb.ddd-fff", and all other combinations of the substrings. This works pretty well, but when I index large documents (100 MB) containing many such strings, I tend to get out-of-memory exceptions, because the filter returns multiple terms per input string.
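To make the memory problem concrete, here is a minimal stdlib-only sketch of the combination approach described above (the class and method names are my own, not from the actual filter): splitting on the delimiters and emitting every contiguous run of parts produces O(n²) terms for a reference number with n parts.

```java
import java.util.ArrayList;
import java.util.List;

public class CombinationTerms {
    // Split a reference number on common delimiters, keeping the delimiters,
    // then emit every contiguous run of parts as one term. For n parts this
    // yields n*(n+1)/2 terms, which is what blows up memory on large documents.
    static List<String> combinations(String ref) {
        // Split with zero-width lookarounds so the delimiters survive as tokens.
        String[] parts = ref.split("(?<=[./_-])|(?=[./_-])");
        List<String> terms = new ArrayList<>();
        for (int start = 0; start < parts.length; start += 2) { // parts sit at even indices
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < parts.length; end++) {
                sb.append(parts[end]);
                if (end % 2 == 0) { // a term must end on a part, not a delimiter
                    terms.add(sb.toString());
                }
            }
        }
        return terms;
    }
}
```

For "aaa.bbb.ddd-fff" (four parts) this already emits ten terms, so term output grows quadratically with the length of each reference number.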
Is there a better way to index these strings?
I would try to build a token filter that:
- Splits on the delimiters and emits each part as a token, e.g. aaa, bbb, ddd, fff.
- Extracts the delimiters as separate tokens.
- Maybe adds a separator token to prevent cross-number matches.
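The steps above can be sketched with plain Java string handling (the class name and the choice of "\u0001" as the separator are my own illustration, not part of any Lucene API); a real implementation would live in a Lucene `TokenFilter`, but the token stream it should produce looks like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DelimiterTokenizerSketch {
    // Tokenize one reference number into its parts and its delimiters,
    // emitted as separate, position-adjacent tokens. A trailing sentinel
    // token ("\u0001" here, an arbitrary choice) keeps phrase matches
    // from spilling across two adjacent reference numbers.
    static List<String> tokenize(String ref) {
        List<String> tokens = new ArrayList<>(
            Arrays.asList(ref.split("(?<=[./_-])|(?=[./_-])")));
        tokens.add("\u0001"); // separator token between reference numbers
        return tokens;
    }
}
```

This emits a linear number of tokens per reference number (one per part, one per delimiter, plus the separator), instead of the quadratic number of combination terms, which is what avoids the out-of-memory problem.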
For the query, I would first try a BooleanQuery with SHOULD clauses. If that gives too many false positives, I would change the clauses to MUST. If that is still too loose, I would try a PhraseQuery.
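To illustrate how the three levels of strictness differ, here is a stdlib-only simulation on plain token lists (a stand-in for Lucene's actual BooleanQuery/PhraseQuery classes; the method names are mine):

```java
import java.util.Collections;
import java.util.List;

public class QueryMatchSketch {
    // SHOULD: at least one query token occurs in the document.
    static boolean shouldMatch(List<String> doc, List<String> query) {
        return query.stream().anyMatch(doc::contains);
    }

    // MUST: every query token occurs somewhere in the document,
    // in any order and at any position.
    static boolean mustMatch(List<String> doc, List<String> query) {
        return doc.containsAll(query);
    }

    // Phrase: the query tokens occur contiguously and in order.
    static boolean phraseMatch(List<String> doc, List<String> query) {
        return Collections.indexOfSubList(doc, query) >= 0;
    }
}
```

For a document tokenized as ["aaa", ".", "bbb", ".", "ddd", "-", "fff"], a query for "ddd-fff" (tokens ["ddd", "-", "fff"]) matches at all three levels, while a scrambled query like ["fff", ".", "aaa"] still matches under SHOULD and MUST but is rejected by the phrase check, which is why the phrase level gives the fewest false positives.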