algorithm for list identification and parsing

I have data which in theory is a list, but historically has been input by the user as a free form text field. Now I need to separate each item of the list so that each element can be analysed.

Simplified examples of my data as input by users:

one, two, three, four, five 

one. two. three, four. five.

"I start with one, then do two, maybe three and four then five"

one  
two  
three  
four  
five.  

one, two. three four five

one two three four - five

"not even a list, no list-elements here! but list item separators may appear. grrr"

So, that's more or less what the data looks like. In reality a list item could be several words long. I need to process these lists (of which there are thousands) so that I end up with arrays like this:

array[0] = "one"  
array[1] = "two"  
array[n] = n  

I accept that sometimes my algorithm will completely fail to parse the list; I don't need a 100% success rate, and 75% would be good. False positives are going to be very expensive for me, so I would rather reject a list completely than generate a list that does not contain real data - assume some users type in meaningless gibberish.

I have some ideas around trying to identify which separator(s) is being used and how regularly data is separated in relation to the size of the content.
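
Roughly, what I have in mind is something like this sketch (the candidate separators and the thresholds are just guesses on my part):

CANDIDATES = [",", ".", ";", "\n", " - "]  # guessed set of likely separators

def guess_separator(text):
    """Return the separator that occurs regularly enough to look deliberate, or None."""
    words = len(text.split())
    best = None
    for sep in CANDIDATES:
        count = text.count(sep)
        # A genuine list should have roughly one separator per item, so the
        # average item length (in words) should stay small.
        if count >= 2 and words / (count + 1) <= 4:
            if best is None or count > best[1]:
                best = (sep, count)
    return best[0] if best else None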

I prefer Java or Python; however, any solution would be welcome :-)


The first step to solving this problem is to analyze, in detail, how it is that humans solve this problem. I'd break the problem down into two parts.

  1. How do humans distinguish between lists and non-lists? For example, is it because non-lists are grammatical English sentences? If that's the case, you may be able to use one of the available natural language processing toolkits to distinguish between lists and non-lists; see the sketch after this list.

  2. How do humans identify the delimiters and list elements in lists? For example, do they recognize the list elements because of some particular domain knowledge? Or do they just recognize one of a small set of common delimiters? Are list elements always single words? If not, in what cases are they multiple words?
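
For point 1, a minimal sketch of the NLP idea using nltk (this assumes nltk plus its 'punkt' and 'averaged_perceptron_tagger' data are installed, and the verb heuristic is only a guess):

import nltk  # assumes nltk and the required tokenizer/tagger data are installed

def looks_like_prose(text):
    # Heuristic: grammatical sentences usually contain at least one verb,
    # while a bare list of items ("one, two, three") usually does not.
    tags = [tag for _, tag in nltk.pos_tag(nltk.word_tokenize(text))]
    return any(tag.startswith('VB') for tag in tags)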

I'd also take a good look at several hundred samples, to see if there are any COMMON patterns that can be easily identified and parsed. If, for example, 30% of your entries are simple comma-separated lists then a regular expression will trivially identify and parse them. Perhaps a small set of regular expressions will address a large part of your corpus.
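
For instance, a pattern along these lines (just a sketch for the single-word, comma-separated case) would pick off that easy subset and leave everything else alone:

import re

# Matches strictly comma-separated lists of single words, e.g. "one, two, three".
COMMA_LIST = re.compile(r'^\s*\w+(\s*,\s*\w+)+\s*$')

def parse_simple_list(text):
    if COMMA_LIST.match(text):
        return [item.strip() for item in text.split(',')]
    return None  # not a simple comma-separated list; handle elsewhere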

Finally, I assume that currently the data is not only entered and recognized by humans but also consumed by humans. Is your reason for breaking items into lists so that humans can be removed from the loop, or just to make the work easier for them? If the latter, I'd recommend providing them with BOTH the broken-up list elements and, as a backup, the originally-entered text. In other words, hedge your bets in case you get it wrong.


If you can't define your data ("The words can be anything, there is no way to know beforehand a dictionary of what any individual list can contain. They will not be only numbers... it could be a list of anything") then you have serious problems.

Specifically, if you can't define your data, your problem cannot be solved.

You can try playing with nltk.

You may be able to discard the "noise words" (",", ".", "i", "start", "with", "then", "do", etc.). What's left may be the undefinable "words can be anything" content.
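
For example, a small sketch using nltk's English stopword list (assuming the 'stopwords' corpus has been downloaded; the extra noise words added here are guesses):

import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

# Standard English stopwords plus a few guesses based on the examples above.
NOISE = set(stopwords.words('english')) | {'start', 'maybe'}

def content_words(text):
    # Tokenise on word characters, drop noise words, keep whatever remains.
    tokens = re.findall(r'\w+', text.lower())
    return [t for t in tokens if t not in NOISE]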

Until you can better define your data, you're probably doomed to a lot of struggle.


I don't know if I understand your problem. If you want to extract alphanumeric strings from a messed-up string in Python, it would be:

>>> import re
>>> re.split(r'\W+', 'abaa, asodf ?. poasid - paosfi sec')
['abaa', 'asodf', 'poasid', 'paosfi', 'sec']

Or if you know the separators:

>>> re.split(r'[,. -]+', 'abaa, asodf, poasid - paosfi sec')
['abaa', 'asodf', 'poasid', 'paosfi', 'sec']


Rather than focusing on the code, how about the method? Building a little on what swillden said...

If your lists are consumed by human users, you could ask them to correct you when you make a mistake (this correction is visible either to the person entering the text or to a later user viewing the text). If a given input looks a lot like a list but not enough to be sure, you show them the list and the raw input and ask them to choose.

To automatically categorise inputs as lists or as text, you could create several metrics to base your decision on:

  • Given a set of candidate separators (e.g. [' ', '\t', ',', '.', 'and']), how many does this phrase use? Expect one or two. Which ones?
  • Is the input composed of fragments (use some sort of grammar system)? Fragments tend to indicate lists.
  • Does this input field (or context in your input) tend to have list items?
  • The words in the list itself (some words might turn out to always mean a sentence or a list in your domain)

You then pass this information into a Bayesian filter and train it using your users' suggestions. Most of the items I mention would boil down to special "keywords" that you tag an item with before you pass it into the filter. If the filter has a clear answer either way, treat the input as a list or as plain text. If the filter is uncertain, ask the user and use their answer to train the filter.
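
A minimal sketch of that tag-and-filter idea (the feature tags and the hand-rolled naive Bayes below are purely illustrative; any Bayesian filter library would do):

import math
from collections import defaultdict

def features(text):
    """Boil the metrics above down into simple keyword-style tags."""
    tags = []
    for name, sep in (('comma', ','), ('period', '.'), ('and', ' and ')):
        if text.count(sep) >= 2:
            tags.append('many-' + name)
    if len(text.split()) > 15:
        tags.append('long-input')
    return tags

class ListFilter:
    """Tiny naive Bayes over the tags, deciding 'list' vs 'text'."""
    def __init__(self):
        self.counts = {'list': defaultdict(int), 'text': defaultdict(int)}
        self.totals = {'list': 0, 'text': 0}

    def train(self, text, label):
        self.totals[label] += 1
        for tag in features(text):
            self.counts[label][tag] += 1

    def score(self, text):
        # Positive means "probably a list", negative "probably free text";
        # values near zero are the uncertain cases to show the user.
        logodds = math.log((self.totals['list'] + 1) / (self.totals['text'] + 1))
        for tag in features(text):
            p_list = (self.counts['list'][tag] + 1) / (self.totals['list'] + 2)
            p_text = (self.counts['text'][tag] + 1) / (self.totals['text'] + 2)
            logodds += math.log(p_list / p_text)
        return logodds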

Edit

You could always train the system manually (i.e. without exposing your system to the users) by first classifying lists using your existing scripts and then checking them by hand. Take a list of 500 inputs, run a filter looking for commas or other easy list separators, and classify the matches as lists. Train the Bayesian filter on those (with everything else as non-list) and then check the output by hand for all 500 for further training.

Each day someone could receive an email with all the edge cases for that day and could click links in the email to correct the system if necessary.

As a side issue (relating to OP comment), in general Bayesian filters are much easier to implement, debug, test, analyse and scale than Neural networks.


In Java, a StringTokenizer will do this (i.e. new StringTokenizer(inputString, delimiterList)):

StringTokenizer st = new StringTokenizer( "A B|C-D", " |-" );
while ( st.hasMoreTokens() ) {
    System.out.println( st.nextToken() );
}

prints

A
B
C
D


The following will "parse" your input string into sequences of "word" characters separated by non-word characters.

String input = ...
String[] parts = input.split("\\W+"); // split on runs of non-word characters

I don't know how you are going to distinguish a list from gibberish. I think you will need to explain your problem domain some more ...

EDIT: If you cannot define the rules which you (as a human) use to distinguish a list from gibberish, then this problem is essentially unsolvable. Computers cannot do magic you know ...

Maybe you should just use the program to deal with the subset that are "definitely" lists, and classify the other ones by hand.


Either you know your dictionary of words or you have a precedence order for list delimiters. Otherwise the problem is too ill defined for a computer to handle.

I suppose your precedence could be commas, dots, hyphens, spaces. So, this means that you split by commas in preference to splitting by dots and so on.

Alternatively you could keep on splitting by each successive delimiter, until you hit a delimiter that isn't in the text.
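
A small sketch of the precedence idea (the delimiter order below is just the one suggested above):

# Try delimiters in order of preference; the first one present in the text wins.
PRECEDENCE = [',', '.', '-']

def split_by_precedence(text):
    for delim in PRECEDENCE:
        if delim in text:
            return [part.strip() for part in text.split(delim) if part.strip()]
    # Fall back to whitespace if none of the preferred delimiters appear.
    return text.split()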


I'm not sure what the best answer really is, but if you need to have few false positives, then perhaps what you should do is define a few patterns that are very likely to be lists, and strictly reject every other datum.

import re

patterns = [
    re.compile(r'^\s*(\w+)(\s*,\s*(\w+))*\s*$'),               # "one, two, three"
    re.compile(r'^\s*(\w+)(\s*\.\s*(\w+))*\s*$'),              # "one. two. three"
    re.compile(r'^\s*(\w+)(\s*,\s*(\w+))*\s+and\s+(\w+)\s*$')  # "one, two and three"
]
acceptSet = [line for line in candidateSet
             if any(pattern.match(line) for pattern in patterns)]


A pyparsing stab at sifting the wheat from the chaff...

rawdata = """\
one, two, three, four, five
one. two. three, four. five.
"I start with one, then do two, maybe three and four then five"
one  
two  
three  
four  
five.  
one, two. three four five
one two three four - five
"not even a list, no list-elements here! but list item separators may appear. grrr"
a dog with a bone is a beautiful twosome""".splitlines()

from pyparsing import oneOf, WordStart, CharsNotIn, alphas, LineEnd
options = (WordStart() + oneOf("one two three four five") + (CharsNotIn(alphas)|LineEnd()))

for userinput in rawdata:
    print(userinput)
    print([opt[0] for opt in options.searchString(userinput)])
    print()

Prints (note the added last line, whose hidden 'one' and 'two' substrings in 'bone' and 'twosome' would not be desirable matches):

one, two, three, four, five
['one', 'two', 'three', 'four', 'five']

one. two. three, four. five.
['one', 'two', 'three', 'four', 'five']

"I start with one, then do two, maybe three and four then five"
['one', 'two', 'three', 'four', 'five']

one  
['one']

two  
['two']

three  
['three']

four  
['four']

five.  
['five']

one, two. three four five
['one', 'two', 'three', 'four', 'five']

one two three four - five
['one', 'two', 'three', 'four', 'five']

"not even a list, no list-elements here! but list item separators may appear. grrr"
[]

a dog with a bone is a beautiful twosome
[]
