For my experiment I need to define a specific similarity metric for each field of my collection's documents.
For example, I need to measure the Description field similarity with tf.idf, and the Geolocation fields with Haversine distance, etc.
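For reference, by Haversine I mean the standard great-circle distance; a minimal Java sketch (my own illustration, assuming coordinates in degrees and a 6371 km mean Earth radius):

```java
// Great-circle distance between two lat/long points via the Haversine
// formula; inputs in degrees, result in kilometers.
public static double haversineKm(double lat1, double lon1,
                                 double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
             * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 2 * 6371.0 * Math.asin(Math.sqrt(a));
}
```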
I'm now studying the Similarity class. I was wondering if there is any good tutorial or example about this so I can proceed faster...
thanks
EDIT: IIUC, you have a similarity formula per field, and you want to apply it per document, scoring against all other documents. There are several options, all applied at indexing time:
- Extend the DefaultSimilarity class.
- Extend the SimilarityDelegator class, if you only need to override some of its methods.
With either approach, you can make use of payloads to store term-specific information (this could be useful for the lat/long data).
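For instance, a minimal sketch of the first option, assuming the pre-4.0 Lucene 3.x API (the class name FieldAwareSimilarity and the field name "geolocation" are made up for illustration):

```java
import org.apache.lucene.search.DefaultSimilarity;

// Keeps DefaultSimilarity's tf.idf scoring and hooks per-field logic
// into the payload callback (signature as of Lucene 3.0).
public class FieldAwareSimilarity extends DefaultSimilarity {

    @Override
    public float scorePayload(int docId, String fieldName, int start, int end,
                              byte[] payload, int offset, int length) {
        // Hypothetical field name; the payload could carry encoded lat/long.
        if ("geolocation".equals(fieldName) && payload != null) {
            // Decode the coordinates and map Haversine distance to a score
            // here; returning the neutral 1.0f leaves scoring unchanged.
            return 1.0f;
        }
        return super.scorePayload(docId, fieldName, start, end,
                                  payload, offset, length);
    }
}
```

Note that payload scores only kick in when you search with a payload-aware query such as PayloadTermQuery.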
After implementing a Similarity class using one of these methods, use Similarity.setDefault(mySimilarity) to set this as the Similarity instance for indexing and searching.
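Something like this sketch (again assuming the 3.x static hook; SimilaritySetup and FieldAwareSimilarity are made-up names from the sketch above):

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Similarity;

public class SimilaritySetup {
    // Call once, before opening any IndexWriter or IndexSearcher.
    public static void installSimilarity() {
        Similarity.setDefault(new FieldAwareSimilarity());
    }

    // Searchers created afterwards pick up the default; setting it
    // explicitly on each searcher makes the intent obvious.
    public static IndexSearcher openSearcher(IndexReader reader) {
        IndexSearcher searcher = new IndexSearcher(reader);
        searcher.setSimilarity(Similarity.getDefault());
        return searcher;
    }
}
```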
Only then index your text corpus for later searching; you will probably have to extend the Searcher class as well to get at the raw similarity scores.
Having said that, I believe this approach is wrong for your use case: Lucene is optimized to retrieve a few similar documents quickly, not to produce a score for every document against every other, so I predict the runtime will be prohibitive. I hope I am wrong, but I nevertheless suggest you read Mining of Massive Datasets for a better approach: min-hashing and shingling.
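To illustrate what that buys you, here is a toy min-hash sketch (plain Java, not Lucene; all names are my own): each document's shingle set is compressed into a short signature, and the fraction of matching signature slots estimates Jaccard similarity, so you never have to compare every raw pair in full.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Toy MinHash: estimate Jaccard similarity between shingle sets.
public class MinHash {
    private static final int PRIME = 2147483647; // 2^31 - 1
    private final int[] seedsA, seedsB;

    public MinHash(int numHashes, long seed) {
        Random rnd = new Random(seed);
        seedsA = new int[numHashes];
        seedsB = new int[numHashes];
        for (int i = 0; i < numHashes; i++) {
            seedsA[i] = 1 + rnd.nextInt(PRIME - 1); // nonzero multiplier
            seedsB[i] = rnd.nextInt(PRIME);
        }
    }

    // Character shingles of length k, reduced to int hashes.
    public static Set<Integer> shingles(String text, int k) {
        Set<Integer> result = new HashSet<Integer>();
        for (int i = 0; i + k <= text.length(); i++) {
            result.add(text.substring(i, i + k).hashCode());
        }
        return result;
    }

    // Signature = per hash function, the minimum hash over the set.
    public int[] signature(Set<Integer> shingles) {
        int[] sig = new int[seedsA.length];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (int s : shingles) {
            for (int i = 0; i < sig.length; i++) {
                long h = ((long) seedsA[i] * (s & 0x7fffffff) + seedsB[i]) % PRIME;
                if (h < sig[i]) sig[i] = (int) h;
            }
        }
        return sig;
    }

    // Fraction of agreeing slots approximates Jaccard similarity.
    public static double estimateJaccard(int[] a, int[] b) {
        int same = 0;
        for (int i = 0; i < a.length; i++) if (a[i] == b[i]) same++;
        return (double) same / a.length;
    }
}
```

Combined with locality-sensitive hashing over the signatures, this finds candidate pairs without comparing every document to every other; the book covers both.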
Good luck.
Patrick, I will first quote Grant Ingersoll about modifying the Similarity class: "Here be Dragons". Customizing Lucene's Similarity class is hard. I have done this.
It is not fun. Only do this if you absolutely have to.
I suggest you first read Grant's spatial search paper, his findability paper, and his 'debugging relevance' paper. These show other ways to get the hits you need.