I have a set of documents, and I have calculated the following:
- Term frequency (TF) score
- Inverse document frequency (IDF) score
- TF-IDF score
Now I need to calculate the similarity between a specific query and each document, producing a score that ranks the documents from the highest similarity to the lowest similarity to the query.
I have searched for a lot of information, but I do not understand the formula.
source : http://en.wikipedia.org/wiki/Vector_space_model
Can anyone guide me? I just need to know how to proceed from my current progress.
Lucene is an open-source library that does all this for you.
Pangea has already given the correct answer: don't reinvent the wheel, especially a complex wheel like document similarity. That being said, understanding how document similarity is computed is an interesting and worthwhile thing to do if you are going to be working in the field. I'll see if I can help a bit.
The basic assumption of the vector space model you have linked is that each document can be represented as a vector in N-dimensional space, where each dimension corresponds to a different word in the universe of documents. A document's value along a given dimension is that document's weight for the word in question, for example the TF-IDF score you have already computed (and zero for words the document does not contain). In this model, a query can be thought of as a very short document, and thus also represented as a vector in N-space. The cosine measure is simply the cosine of the angle between the query vector and a given document vector.
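For reference, this is the standard cosine formula written out, as I understand it from the Wikipedia page. The symbols are my own labels: q is the query vector, d is a document vector, and q_i, d_i are their weights for word i.

```latex
\cos\theta
  = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}
  = \frac{\sum_{i=1}^{N} q_i \, d_i}
         {\sqrt{\sum_{i=1}^{N} q_i^{2}} \; \sqrt{\sum_{i=1}^{N} d_i^{2}}}
```

The numerator is the dot product of the two vectors and the denominator is the product of their lengths, so the result does not depend on how long either document is, only on the direction of its vector.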
Deriving N-dimensional trigonometry is probably a math course in and of itself, but if you understand the basic idea, you can take the Wikipedia formula on faith for the actual computation (or look it up in a standard text if you prefer). The computational steps (vector dot products and norms) are also well documented individually and not terribly hard to implement. I'm sure there are also standard library implementations available.
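To make the dot product and norm steps concrete, here is a minimal Python sketch. It assumes each document and the query are stored as a dict mapping a word to its TF-IDF weight; the function names `cosine_similarity` and `rank_documents` and the sample weights are made up for illustration and are not taken from any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors given as {word: weight} dicts."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    # Euclidean norm (length) of each vector.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    # An empty (all-zero) vector has no direction; treat it as not similar.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query, docs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical TF-IDF weights, purely for illustration.
docs = {
    "doc1": {"cat": 0.5, "dog": 0.1},
    "doc2": {"dog": 0.7, "fish": 0.2},
    "doc3": {"bird": 0.9},
}
query = {"cat": 1.0}

ranking = rank_documents(query, docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Storing the vectors as dicts means you never have to materialise all N dimensions; words missing from a document simply count as zero. Whether the query weights should be raw term counts or TF-IDF values is a design choice this sketch leaves open.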
The logic behind the cosine is that, as the similarity between two documents (or a query and a document) increases, the angle between the two vectors approaches zero (and thus the cosine approaches 1). You can verify this by hand with a universe of two words on the Cartesian plane. All the vector math does there is extrapolate the same concept into N dimensions.
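Here is one such two-word check worked out by hand. The vectors are made-up values chosen to keep the arithmetic easy: a query q = (1, 1), a document d1 = (2, 2) that uses both words in the same proportion, and a document d2 = (1, 0) that contains only the first word.

```latex
\cos(q, d_1) = \frac{1 \cdot 2 + 1 \cdot 2}{\sqrt{1^2 + 1^2}\,\sqrt{2^2 + 2^2}}
             = \frac{4}{\sqrt{2} \cdot 2\sqrt{2}} = \frac{4}{4} = 1
             \quad (\theta = 0^\circ)

\cos(q, d_2) = \frac{1 \cdot 1 + 1 \cdot 0}{\sqrt{1^2 + 1^2}\,\sqrt{1^2 + 0^2}}
             = \frac{1}{\sqrt{2}} \approx 0.707
             \quad (\theta = 45^\circ)
```

d1 points in exactly the same direction as the query even though it is twice as long, so it scores 1, while d2 sits at a 45 degree angle and ranks lower.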
I hope this clears up some confusion on this interesting topic. For actual implementation, I once again refer you to Pangea's suggestion to use Lucene.