Add spaces between words in spaceless string

Developer https://www.devze.com 2023-01-21 10:43 Source: web
I'm on OS X, and in Objective-C I'm trying to convert, for example, "Bobateagreenapple" into "Bob ate a green apple".

Is there any way to do this efficiently? Would something involving a spell checker work?

EDIT: Some extra information: I'm attempting to build something that takes misformatted text (for example, text copied and pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long, I'm trying to figure out whether this is feasible before I write a system, only to find out that it takes 2 hours to fix a paragraph of text.


One possibility, which I will describe in a non-OS-specific manner, is to search through all the possible words that could make up the collection of letters.

Basically, you chop off the first letter of your letter collection and add it to the word you are currently forming. If that makes a word (e.g. via a dictionary lookup), add it to the current sentence and start a new word. If you manage to use up all the letters in your collection and form words out of all of them, you have a full sentence. But you don't have to stop there. Instead, you keep running, and eventually you will produce all possible sentences.

Pseudo-code would look something like this:

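FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    // Every letter is used and no partial word is pending: s is a full sentence.
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    // The letters ran out in the middle of a word: dead end.
    if (l.empty())
        return;
    move first letter from l onto the end of w;
    if w in dictionary
    {
        // Option 1: end the word here and start a new one.
        add w to s;
        FindWords(sentences, s, empty word, l)
        remove w from s
    }
    // Option 2: keep growing the current word.
    FindWords(sentences, s, w, l)
    // Undo the move so the caller sees l unchanged.
    put last letter from w back onto l
}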
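
For anyone who wants to try the idea quickly, here is a minimal runnable sketch in Python. It is not a translation of the pseudocode line by line: it grows the current word with a loop over end positions instead of moving one letter per call, but it explores the same search space. The function name find_sentences and the tiny WORDS set are made up for illustration, and everything is assumed to be lowercase.

```python
def find_sentences(letters, dictionary):
    """Return every way to split `letters` into words from `dictionary`."""
    sentences = []

    def search(start, words):
        # All letters consumed: the words collected so far form a full sentence.
        if start == len(letters):
            sentences.append(" ".join(words))
            return
        # Grow the current word one letter at a time.
        for end in range(start + 1, len(letters) + 1):
            word = letters[start:end]
            if word in dictionary:
                words.append(word)   # end the word here...
                search(end, words)   # ...and split the rest of the letters
                words.pop()          # backtrack and try a longer word

    search(0, [])
    return sentences


# Hypothetical toy dictionary, just large enough for the example.
WORDS = {"a", "at", "ate", "apple", "bob", "green", "tea"}

for sentence in find_sentences("bobateagreenapple", WORDS):
    print(sentence)
# bob a tea green apple
# bob ate a green apple
```

Even with this tiny dictionary the search returns more than one sentence, so some way of choosing between the results is still needed.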

There are, of course, a number of optimizations you could make to speed this up. For instance, you can check whether the word formed so far is the beginning (a prefix) of any word in the dictionary, and stop extending it as soon as it is not. But this is the basic approach that will give you all possible sentences.
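
One way that kind of pruning could look, as a sketch in Python: precompute every prefix of every dictionary word, and stop growing the current word as soon as it is not in that set. The names build_prefixes and find_sentences_pruned are hypothetical, and a trie would do the same job with less memory; a plain set is used here only to keep the example short.

```python
def build_prefixes(dictionary):
    """Collect every prefix of every dictionary word (the word itself included)."""
    prefixes = set()
    for word in dictionary:
        for i in range(1, len(word) + 1):
            prefixes.add(word[:i])
    return prefixes


def find_sentences_pruned(letters, dictionary):
    """Split `letters` into dictionary words, abandoning hopeless words early."""
    prefixes = build_prefixes(dictionary)
    sentences = []

    def search(start, words):
        if start == len(letters):
            sentences.append(" ".join(words))
            return
        for end in range(start + 1, len(letters) + 1):
            candidate = letters[start:end]
            if candidate not in prefixes:
                # No dictionary word starts like this, so no longer
                # candidate from the same start position can match either.
                break
            if candidate in dictionary:
                words.append(candidate)
                search(end, words)
                words.pop()

    search(0, [])
    return sentences


# Hypothetical toy dictionary for the example.
WORDS = {"a", "at", "ate", "apple", "bob", "green", "tea"}
print(find_sentences_pruned("bobateagreenapple", WORDS))
```

The pruning only cuts down on wasted dictionary lookups. If the input really has a huge number of valid splits, listing all of them is still expensive.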


Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.

A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.

This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.

Edit: In response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than it would to correct what this system might give you, let alone to program it.


I implemented a solution; the code is available on CodeProject:

http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings

My idea was to prioritize results that use up most of the characters (preferably all of them), then favor the ones with the longest words, because words that are 2, 3 or 4 characters long can often come up by chance from leftover characters. Most of the time this gives the correct solution.

To find all the possible splits I used recursion. The code is quite fast even with big dictionaries (tested with 50,000 words).
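
To illustrate the ranking idea only (this is not the CodeProject code), here is a small Python sketch. It assumes each candidate is a list of words, that "uses up most of the characters" means the total length of those words, and that "longest words" can be approximated by the average word length. The names score and rank, and the sample candidates, are hypothetical.

```python
def score(words):
    """Score a candidate split: characters covered first, then average word length."""
    covered = sum(len(word) for word in words)
    average_length = covered / len(words) if words else 0.0
    return (covered, average_length)


def rank(candidates):
    """Return the candidates ordered from most to least plausible."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidate splits for the input "nowhere"; some leave letters unused.
candidates = [
    ["no", "where"],
    ["now"],
    ["nowhere"],
    ["now", "here"],
    ["no", "here"],
]

for words in rank(candidates):
    print(" ".join(words))
```

Under these assumptions, "nowhere" ranks first because it covers all seven letters with a single long word, and "now" ranks last because it leaves four letters unused. A heuristic like this cannot separate "Bob ate a green apple" from "Bob a tea green apple", since both use the same number of characters and words of the same lengths.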
