Take for example the list (L):
John, John, John, John, Jon
We are to presume one item is correct (e.g. John in this case) and give a probability that it is correct. First (and good!) attempt: MostFrequentItem(L).Count / L.Count (e.g. 4/5, or an 80% likelihood).
But consider the cases:
John, John, Jon, Jonny
John, John, Jon, Jon
I want the likelihood of the correct item being John to be higher in the first list! I know I have to count the SecondMostFrequentItem and compare them.
Any ideas? This is really busting my brain!
Thx, Andrew

As an extremely simple solution, compared to the more correct but complicated solutions above, you could take the count of every variation, square the counts, and use those as weights. So:
[John, John, Jon, Jonny]
would give John a weight of 4 and the other two a weight of 1, for a probability of 66% that John is correct.
[John, John, Jon, Jon]
would give weights of 4 to both John and Jon, so John's probability is only 50%.
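A minimal sketch of this squared-count weighting (the function name is my own):

from collections import Counter

def squared_count_probability(items):
    # Weight each variant by the square of its count, then normalize.
    counts = Counter(items)
    weights = {item: n * n for item, n in counts.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total

print(squared_count_probability(["John", "John", "Jon", "Jonny"]))  # ('John', 0.666...)
print(squared_count_probability(["John", "John", "Jon", "Jon"]))    # ('John', 0.5); the tie is broken arbitrarily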
Maybe Edit Distance? Just a direction to a solution, though...
Just off the top of my head, what if you compare the % occurrence vs. the % if all items had an equal number of occurrences?
In your example above
John, John, Jon, Jonny
50% John
25% Jon
25% Jonny
33.3% Normal? (I'm making up a word because I don't know what to call this. 3 items: 100%/3)
John's score = 50% - 33.3% = 16.7%
John, John, Jon, Jon
50% John
50% Jon
50% Normal (2 items, 100%/2)
John's score = 50% - 50% = 0%
If you had [John, John, John, Jon, Jon] then John's score would be 60%-50% = 10% which is lower than the first case, but higher than the 2nd (hopefully that's the desired result, otherwise you'll need to clarify more what the desired results should be)
In your first case [John, John, John, John, Jon] you'd get 80%-50% = 30%
For [John, John, John, John, Jon, Jonny] you'd get 66.6%-33.3% = 33.3%
That may or may not be what you want.
Where the above might factor in more is if you had John*97+Jon+Jonny+Johnny, that would give you 97%-25% = 72%, but John*99+Jon would only give you a score of 99%-50% = 49%
You'd need to figure out how you want to handle the degenerate case of them all being the same, otherwise you'd get a score of 0% for that which is probably not what you want.
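A rough sketch of this raw score (names are my own):

from collections import Counter

def raw_score(items):
    # Top item's share minus the share it would have if all k variants were equally common.
    counts = Counter(items)
    top_share = max(counts.values()) / len(items)
    uniform_share = 1 / len(counts)  # the "Normal" baseline: 100% / k
    return top_share - uniform_share

print(raw_score(["John", "John", "Jon", "Jonny"]))         # ≈ 0.167 (16.7%)
print(raw_score(["John", "John", "John", "John", "Jon"]))  # ≈ 0.3 (30%)
print(raw_score(["John", "John", "John"]))                 # 0.0 -- the degenerate case mentioned above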
EDIT (okay I made lots of edits, but this one isn't just more examples :p)
To normalize the results, take the score as calculated above and divide it by the upper limit of the possible score given the number of distinct values. (Okay, that sounds way more complicated than it needs to be; example time.)
Example:
[John, John, Jon, Jonny] 50% - 33.3% = 16.7%. That's the previous score, but with 3 items the upper limit of your score would be 100%-33.3% = 66.6%, so if we take that into account, we have 16.7/66.6 = 25%
[John, John, Jon, Jon] gives (50-50)/50 = 0%
[John, John, John, Jon, Jon] gives (60-50)/50 = 20%
[John, John, John, John, Jon] gives (80-50)/50 = 60%
[John, John, John, John, Jon, Jonny] gives (66.6-33.3)/66.6 = 50%
[John*97, Jon, Jonny, Johnny] gives (97-25)/75 = 96%
[John*99, Jon] gives (99-50)/50 = 98%
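The normalized version, as a self-contained sketch (again, names are my own):

from collections import Counter

def normalized_score(items):
    # (top share - 1/k) divided by its maximum possible value, 1 - 1/k.
    counts = Counter(items)
    top_share = max(counts.values()) / len(items)
    uniform_share = 1 / len(counts)
    if uniform_share == 1:
        return 1.0  # all entries identical -- one way to handle the degenerate case
    return (top_share - uniform_share) / (1 - uniform_share)

print(normalized_score(["John", "John", "Jon", "Jonny"]))  # ≈ 0.25
print(normalized_score(["John"] * 4 + ["Jon"]))            # ≈ 0.6
print(normalized_score(["John"] * 99 + ["Jon"]))           # ≈ 0.98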
I think you'd need a kind of scoring system.
Solely identifying the different tokens is not sufficient:
[John, Johen, Jon, Jhon, Johnn]
With your algorithm there is no clear winner here, whereas the most probable name is 'John': the others are each just 1 away from it in Damerau-Levenshtein distance.
So I would do a 2-step process:
- Count the occurrences of each word
- For each word, add a "bonus" for each other word, inversely proportional to their distance
For the bonus, I would propose the following formula:
lhs = 'John'
rhs = 'Johen'
d = distance(lhs,rhs)
D = max( len(lhs), len(rhs) ) # Maximum distance possible
tmp = score[lhs] # score[] starts out as the occurrence counts
score[lhs] += (1-d/D)*score[rhs]
score[rhs] += (1-d/D)*tmp
Note that you should not apply this first for (John, Johen) and then again for (Johen, John): each unordered pair is handled only once.
Example:
# 1. The occurrences
John => 1
Johen => 1
Jon => 1
Jhon => 1
Johnn => 1
# 2. After executing it for John
John => 4.1 = 1 + 0.80 + 0.75 + 0.75 + 0.80
Johen => 1.8 = (1) + 0.80
Jon => 1.75 = (1) + 0.75
Jhon => 1.75 = (1) + 0.75
Johnn => 1.8 = (1) + 0.80
# 3. After executing it for Johen (not recounting John :p)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80) + 0.60 + 0.60 + 0.80
Jon => 2.35 = (1 + 0.75) + 0.60
Jhon => 2.35 = (1 + 0.75) + 0.60
Johnn => 2.6 = (1 + 0.80) + 0.80
# 4. After executing it for Jon (not recounting John and Johen)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon => 3.7 = (1 + 0.75 + 0.60) + 0.75 + 0.60
Jhon => 3.1 = (1 + 0.75 + 0.60) + 0.75
Johnn => 3.2 = (1 + 0.80 + 0.80) + 0.60
# 5. After executing it for Jhon (not recounting John, Johen and Jon)
John => 4.1 = (1 + 0.80 + 0.75 + 0.75 + 0.80)
Johen => 3.8 = (1 + 0.80 + 0.60 + 0.60 + 0.80)
Jon => 3.7 = (1 + 0.75 + 0.60 + 0.75 + 0.60)
Jhon => 3.7 = (1 + 0.75 + 0.60 + 0.75) + 0.60
Johnn => 3.8 = (1 + 0.80 + 0.80 + 0.60) + 0.60
I'm not sure it's perfect and I have no idea how to transform this into some kind of percentage... but I think it gives a pretty accurate idea of the most likely item. Perhaps the bonus ought to be lessened (by which factor?). But check this degenerate case:
[John*99, Jon]
# 1. Occurrences
John => 99
Jon => 1
# 2. Applying the bonus for John (d = 1, D = max(4, 3) = 4, so the factor is 0.75)
John => 99.75 = (99) + 0.75
Jon => 75.25 = (1) + 0.75*99
As you can see, it can't be directly converted into some kind of percentage: John's 99.75 against Jon's 75.25 gives John only about 57% of the total score, which seems low when 99 of the 100 entries are 'John'!
Note: Implementing the distance efficiently is hard, kudos to Peter Norvig for his Python solution!
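A runnable sketch of the whole two-step process. It uses the optimal-string-alignment variant of Damerau-Levenshtein and takes each bonus from the initial occurrence counts, so every unordered pair is handled once and the numbers match the worked example above:

from collections import Counter

def damerau_levenshtein(a, b):
    # Optimal string alignment: insertions, deletions, substitutions, adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def similarity_scores(items):
    # Step 1: count occurrences; step 2: add a distance-weighted bonus per pair.
    counts = Counter(items)
    words = list(counts)
    scores = dict(counts)
    for i, lhs in enumerate(words):
        for rhs in words[i + 1:]:
            d = damerau_levenshtein(lhs, rhs)
            D = max(len(lhs), len(rhs))
            scores[lhs] += (1 - d / D) * counts[rhs]
            scores[rhs] += (1 - d / D) * counts[lhs]
    return scores

print(similarity_scores(["John", "Johen", "Jon", "Jhon", "Johnn"]))
# ≈ {'John': 4.1, 'Johen': 3.8, 'Jon': 3.7, 'Jhon': 3.7, 'Johnn': 3.8}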
First off, I suspect that you are using terms inconsistently. It will help if you use technical terms like "probability" and "likelihood" with strict correctness.
The probability of a thing allows us to reason from a parameter to an outcome. For example, suppose we have an unfair coin which is 60% likely to come up heads. The 60% is a parameter. From that we can reason that the probability of getting two heads in a row is 60% * 60% = 36%.
The likelihood of a thing allows us to reason from an outcome to a parameter. That is, we flip a pair of identical coins a thousand times and discover that we get two heads 36% of the time. We can compute the likelihood that the probability of heads is 60%, given the outcome that 36% of pairs were two heads.
Now, a reasonable question is "how confident can we be that we've deduced the correct parameter given the outcome?" If you flip pairs of coins a million times and get 36% double heads, it seems plausible that we can be very confident that the parameter for one coin is 60%. The likelihood is high. If we flip pairs of coins three times and get double heads 33% of the time, we have very little confidence that the parameter for one coin getting heads is close to 60%. It could be 50% or 40%, and we just got lucky one time in three. The likelihood is low.
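To make that concrete, here is a quick illustrative sketch (my own, using nothing but the binomial formula) comparing how strongly the data favor p = 60% over p = 50% for a tiny sample versus a large one:

from math import comb

def binom_pmf(k, n, p):
    # Probability of k successes in n trials, each succeeding with probability p.
    return comb(n, k) * p**k * (1 - p)**(n - k)

# A pair of flips comes up double heads with probability p*p.
for n, k in [(3, 1), (1000, 360)]:        # one-in-three vs. 36% double-heads outcomes
    l60 = binom_pmf(k, n, 0.6 * 0.6)      # likelihood if heads has probability 60%
    l50 = binom_pmf(k, n, 0.5 * 0.5)      # likelihood if heads has probability 50%
    print(f"n={n}: L(p=0.6)/L(p=0.5) = {l60 / l50:.3g}")
# n=3 barely distinguishes the two (ratio about 1.05); n=1000 is overwhelming (roughly 8e12).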
All this is a preamble to simply asking you to clarify the question. You have an outcome: a bunch of results. Do you wish to make an estimate of the parameters that produced that outcome? Do you wish to get a confidence interval on that estimate? What exactly are you going for here?
I'm not sure why you need to calculate the second most frequent item. In the latter example, couldn't you simply look at (the number of matching entries) / (the total number of entries) and say that it's correct with 4/8 probability? Is that not a sufficient metric? You would then also say that John has a 3/8 probability of being correct and Jonny has 1/8?
Why would that be insufficient for your purposes?
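For what it's worth, that metric is nearly a one-liner; a sketch, treating the two example lists as a single list of eight entries as above:

from collections import Counter

def simple_probability(items):
    # Share of the most common entry among all entries.
    (best, n), = Counter(items).most_common(1)
    return best, n / len(items)

print(simple_probability(["John", "John", "Jon", "Jonny", "John", "John", "Jon", "Jon"]))
# ('John', 0.5) -- the 4/8 above; Jon and Jonny likewise get 3/8 and 1/8.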