I have the following situation:
String a = "A Web crawler is a computer program that browses the World Wide Web internet automatically"; String b = "Web Crawler computer program browses the World Wide 开发者_Python百科Web";
Is there any standard algorithm, or any idea, for calculating the percentage of similarity between two strings?
For instance, in the case above, the similarity as estimated by eye should be 90% or more.
My idea is to tokenize both strings and compare the number of tokens that match, something like (7 matched tokens / 10 tokens) * 100. But, of course, this method is not effective at all. Comparing the number of matched characters also seems ineffective....
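Here is a minimal sketch of that naive token idea, just splitting on whitespace and counting shared tokens:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class NaiveTokenSimilarity {
        // Naive idea: percentage of the first string's tokens that also
        // appear in the second string, ignoring order and case.
        static double similarityPercent(String a, String b) {
            Set<String> tokensA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
            Set<String> tokensB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
            Set<String> matched = new HashSet<>(tokensA);
            matched.retainAll(tokensB);  // intersection of the two token sets
            return 100.0 * matched.size() / tokensA.size();
        }
    }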
Can anyone give me some guidelines?
The above is part of my project, a Plagiarism Analyzer.
Hence, the matched words will be exactly the same, without any synonyms involved.
The only question in this case is how to calculate a reasonably accurate percentage of similarity.
Thanks a lot for any help.
As Konrad pointed out, your question depends heavily on what you mean by "similar". In general, I would say the following guidelines should be of use:
- normalize the input by reducing each word to its base form and lowercasing it
- use a word frequency list (easily obtainable on the web) and make each word's "similarity relevance" inversely proportional to its frequency, so that very common words contribute little
- calculate the total sentence similarity as the aggregated relevance of the words appearing in both sentences, divided by the total similarity relevance of all words in the sentences
You can refine the technique to include differences between word forms, sentence word order, synonym lists, etc. Although you'll never get perfect results, there are a lot of tweaking possibilities, and I believe that in general you can get quite valuable measures of similarity.
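A minimal sketch of these guidelines, assuming a hypothetical wordFrequency map (lowercased base form → corpus frequency count) and omitting stemming for brevity:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class WeightedWordSimilarity {
        // Hypothetical frequency list: common words like "the" have a huge
        // count and thus a tiny relevance; rare words dominate the score.
        static double relevance(Map<String, Long> wordFrequency, String word) {
            // Words missing from the list are treated as rare (assumption).
            return 1.0 / wordFrequency.getOrDefault(word, 1L);
        }

        static double similarityPercent(String a, String b, Map<String, Long> wordFrequency) {
            Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\W+")));
            Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\W+")));

            Set<String> union = new HashSet<>(wordsA);
            union.addAll(wordsB);

            double shared = 0.0, total = 0.0;
            for (String w : union) {
                double r = relevance(wordFrequency, w);
                total += r;                                    // relevance of all words
                if (wordsA.contains(w) && wordsB.contains(w)) {
                    shared += r;                               // relevance of words in both
                }
            }
            return total == 0.0 ? 0.0 : 100.0 * shared / total;
        }
    }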
That depends on your idea of similarity. Formally, you need to define a metric of what you consider "similar" strings in order to apply statistics to them. Usually, this is done via the hypothetical question: "how likely is it that one string is a modified version of the other, where errors (e.g. from typing) were introduced?"
A very simple yet effective measure of such similarity (or rather, of its inverse) is the edit distance between two strings, which can be computed using dynamic programming in O(nm) time in general, where n and m are the lengths of the strings.
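For reference, a sketch of the standard dynamic-programming implementation, keeping only two rows of the table:

    public class EditDistance {
        // Classic Levenshtein distance: O(n*m) time, O(m) space.
        static int distance(String s, String t) {
            int m = t.length();
            int[] prev = new int[m + 1];
            int[] curr = new int[m + 1];
            for (int j = 0; j <= m; j++) prev[j] = j;  // cost of building t[0..j) from ""

            for (int i = 1; i <= s.length(); i++) {
                curr[0] = i;  // cost of deleting s[0..i) entirely
                for (int j = 1; j <= m; j++) {
                    int substitution = prev[j - 1] + (s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1);
                    curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
                }
                int[] tmp = prev; prev = curr; curr = tmp;  // advance to next row
            }
            return prev[m];
        }
    }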
Depending on your usage, more elaborate measures (or completely unrelated ones, such as the Soundex metric) might be required.
In your case, if you straightforwardly apply a token match (i.e. a mere word count) you will never get a similarity above 90%. Achieving such a high similarity in a meaningful way would require advanced semantic analysis. If you get this done, please publish the paper, because it is as yet a largely unsolved problem.
I second what Konrad Rudolph has already said.
Others may recommend different distance metrics. What I'm going to say accompanies those, but looks more at the problem of matching semantics.
Given what you seem to be looking for, I recommend that you apply some of the standard text-processing methods. All of these have potential downfalls, so I list them in order of both application and how difficult they are to do well:
- Sentence splitting: figure out your units of comparison.
- Stop-word removal: take out "a", "an", "the", "of", etc.
- Bag-of-words percentage: what percentage of the overall words match, independent of ordering (see the sketch after this list).
- (Much more aggressive) synonym expansion, which counts synonyms as matched words.
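A minimal sketch of the stop-word removal and bag-of-words steps, with a tiny hard-coded stop-word list standing in for a real one:

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class BagOfWords {
        // Tiny illustrative stop-word list; a real system would load a fuller one.
        static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("a", "an", "the", "of", "that", "is", "in"));

        static Set<String> bag(String sentence) {
            Set<String> bag = new HashSet<>(Arrays.asList(sentence.toLowerCase().split("\\W+")));
            bag.removeAll(STOP_WORDS);  // stop-word removal
            return bag;
        }

        // Percentage of all distinct non-stop words appearing in both sentences,
        // independent of ordering.
        static double bagOfWordsPercent(String a, String b) {
            Set<String> bagA = bag(a), bagB = bag(b);
            Set<String> union = new HashSet<>(bagA);
            union.addAll(bagB);
            Set<String> intersection = new HashSet<>(bagA);
            intersection.retainAll(bagB);
            return union.isEmpty() ? 0.0 : 100.0 * intersection.size() / union.size();
        }
    }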
The problem with this question is that "similarity" may mean either human-perceived similarity (as in your "90%+ similar") or statistical similarity (as in Konrad Rudolph's answer).
Human-perceived similarity can never be easily calculated. Consider, for instance, these three word pairs:
cellphone car message
mobile automobile post
The statistical similarity of each pair is very low, while to a human they are actually quite similar. Thus this problem is hard to solve, and the only thing I can point you to is Bayesian filtering or artificial intelligence with Bayesian networks.
One common measure is the Levenshtein distance, a special case of the string edit distance. It is also included in the Apache Commons Lang library, as StringUtils.getLevenshteinDistance.
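For example, a rough way to turn that library's distance into the percentage the question asks for (assuming commons-lang3 is on the classpath; in newer versions the method has moved to Apache Commons Text):

    import org.apache.commons.lang3.StringUtils;

    public class LevenshteinSimilarity {
        // Rough conversion of edit distance to a similarity percentage:
        // identical strings score 100, completely different ones approach 0.
        static double percentSimilar(String a, String b) {
            int maxLen = Math.max(a.length(), b.length());
            if (maxLen == 0) return 100.0;  // both strings empty
            int distance = StringUtils.getLevenshteinDistance(a, b);
            return 100.0 * (maxLen - distance) / maxLen;
        }
    }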
The longest common subsequence (LCS) is a well-known string dissimilarity metric, and it is typically implemented with dynamic programming.
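A sketch of the usual dynamic-programming recurrence for the LCS length; a dissimilarity score can then be derived from it, e.g. as 1 - lcsLength / max(n, m):

    public class LongestCommonSubsequence {
        // dp[i][j] = length of the LCS of s[0..i) and t[0..j).
        static int lcsLength(String s, String t) {
            int n = s.length(), m = t.length();
            int[][] dp = new int[n + 1][m + 1];
            for (int i = 1; i <= n; i++) {
                for (int j = 1; j <= m; j++) {
                    if (s.charAt(i - 1) == t.charAt(j - 1)) {
                        dp[i][j] = dp[i - 1][j - 1] + 1;  // extend the common subsequence
                    } else {
                        dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);  // drop one character
                    }
                }
            }
            return dp[n][m];
        }
    }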