I've searched online for a diff algorithm but none of them do what I am looking for. It is for a texting contest (as in cell phone) and I need the entry text compared to the master text recording the errors along the way. I am semi-new to C# and I get most of the string functions and didn't think this was going to be that hard of a problem, but alas I just can't wrap my head around it.
I have a form with 2 rich-text-boxes (one on top of the other) and 2 buttons. The top box is the master text (string) and the bottom box is the entry text (string). Every contestant is sending a text to an email account, from the email we copy and paste the text into the Entry RTB and compare to the Master RTB. For each single word an开发者_如何学God single space counts as a thing to check. A word, no matter how many errors it has, is still 1 error. And for every error add 1 sec. to their time.
Examples:
- Hello there! <= 3 checks (2 words and 1 space)
- Helothere! <= 2 errors (Helo and space)
- Hello there!! <= 1 error (extra ! at end of there!)
- Hello there! How are you? <= 9 checks (5 words and 4 spaces)
- Helothere!! How a re you? <= still 9 checks, 4 errors(helo, no space, extra !, and a space in are)
- Hello there!@ Ho are yu?? <= 3 errors (@ at end of there!, no w, no o and extra ? (all errors are still under the 1 word)
What I have so far: I've created 6 arrays (3 for master, 3 for entry) and they are
- CharArray of all chars
- StringArray of all strings(words) including the spaces
- IntArray with length of the string in each StringArray
My biggest trouble is if the entry text is wrong and it's shorter or longer than the master. I keep getting IndexOutOfRange exceptions (understandably) but can't fathom how to go about checking and writing the code to compensate. I hope I have made myself clear enough as to what I need help with. If anyone could give some code examples or something to shoot me in the right path would be very helpful.
Have you looked into the Levenshtein distance algorithm? It returns the number of differences between two strings, which, in your case would be texting errors. Implementing the algorithm based off the pseudo-code found on the wikipedia page passes the first 3 of your 4 use cases:
Assert.AreEqual(2, LevenshteinDistance("Hello there!", "Helothere!");
Assert.AreEqual(1, LevenshteinDistance("Hello there!", "Hello there!!"));
Assert.AreEqual(4, LevenshteinDistance("Hello there! How are you?", "Helothere!! How a re you?"));
Assert.AreEqual(3, LevenshteinDistance("Hello there! How are you?", "Hello there!@ Ho are yu??")); //fails, returns 4 errors
So while not perfect out of the box, it is probably a good starting point for you. Also, if you have too much trouble implementing your scoring rules, it might be worth revisiting them.
hth
Update:
Here is the result of the string you requested in the comments:
Assert.AreEqual(7, LevenshteinDistance("Hello there! How are you?", "Hlothere!! Hw a reYou?"); //fails, returns 8 errors
And here is my implementation of the Levenshtein Distance algorithm:
int LevenshteinDistance(string left, string right)
{
if (left == null || right == null)
{
return -1;
}
if (left.Length == 0)
{
return right.Length;
}
if (right.Length == 0)
{
return left.Length;
}
int[,] distance = new int[left.Length + 1, right.Length + 1];
for (int i = 0; i <= left.Length; i++)
{
distance[i, 0] = i;
}
for (int j = 0; j <= right.Length; j++)
{
distance[0, j] = j;
}
for (int i = 1; i <= left.Length; i++)
{
for (int j = 1; j <= right.Length; j++)
{
if (right[j - 1] == left[i - 1])
{
distance[i, j] = distance[i - 1, j - 1];
}
else
{
distance[i, j] = Min(distance[i - 1, j] + 1, //deletion
distance[i, j - 1] + 1, //insertion
distance[i - 1, j - 1] + 1); //substitution
}
}
}
return distance[left.Length, right.Length];
}
int Min(int val1, int val2, int val3)
{
return Math.Min(val1, Math.Min(val2, val3));
}
You need to come up with a scoring systems that works for you're situation.
I would make a word array after each space.
- If a word is found on the same index +5.
- If a word is found on the same index +-1 index location +3 (keep a counter how much words differ to increase the +- correction
- If a needed word is found as part of another word +2
etc.etc. Matching words is hard, getting up with a rules engine that works is 'easier'
I once implemented an algorithm (which I can't find at the moment, I'll post code when I find it) which looked at the total number of PAIRS in the target string. i.e. "Hello, World!" would have 11 pairs, { "He", "el", "ll",...,"ld", "d!" }.
You then do the same thing on an input string such as "Helo World" so you have { "He",...,"ld" }.
You can then calculate accuracy as a function of correct pairs (i.e. input pairs that are in the list of target pairs), incorrect pairs (i.e. input pairs that do not exists in the list of target pairs), compared to the total list of target pairs. Over long enough sentences, this measure would be very accurate fair.
A simple algorithm would be to check letter by letter. If the letters differ increment the num of errors. If the next pairing of letters match, its a switched letter so just continue. If the messup matches the next letter, it is an omission and treat it accordingly. If the next letter matches the messed up one, its an insertion and treat it accordingly. Else the person really messed up and continue.
This doesn't get everything but with a few modifications this could become comprehensive.
a weak attempt at pseudocode: edit: new idea. look at comments. I don't know the string functions off the top of my head so you'll have to figure that part out. The algorithm kinda fails for words that repeat a lot though...
string entry; //we'll pretend that this has stuff inside
string master; // this too...
string tempentry = entry; //stuff will be deleted so I need a copy to mess up
int e =0; //word index for entry
int m = 0; //word index for master
int errors = 0;
while(there are words in tempentry) //!tempentry.empty() ?
string mword = the next word in master;
m++;
int eplace = find mword in tempentry; //eplace is the index of where the mword starts in tempentry
if(eplace == -1) //word not there...
continue;
else
errors += m - e;
errors += find number of spaces before eplace
e = m // there is an error
tempentry = stripoff everything between the beginning and the next word// substring?
all words and spaces left in master are considered errors.
There are a couple of bounds checking errors that need to be fixed here but its a good start.
精彩评论