I am trying to tokenize a string using the pattern below.
>>> splitter = re.compile(r'((\w*)(\d*)\-\s?(\w*)(\d*)|(?x)\$?\d+(\.\d+)?(\,\d+)?|([A-Z]\.)+|(Mr)\.|(Sen)\.|(Miss)\.|.$|\w+|[^\w\s])')
>>> splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")
I get the following output. Could someone point out what I'd need to correct, please? I'm confused by the whole bunch of None's. Also, if there is a better way to tokenize a string, I'd really appreciate the additional help.
['', 'Hello', None, None, None, None, None, None, None, None, None, None, '', '!', None, None, None, None, None, None, None, None, None, None, ' ', 'Hi', None, None, None, None, None, None, None, None, None, None, '', ',', None, None, None, None, None, None, None, None, None, None, ' ', 'I', None, None, None, None, None, None, None, None, None, None, ' ', 'am', None, None, None, None, None, None, None, None, None, None, ' ', 'debating', None, None, None, None, None, None, None, None, None, None, ' ', 'this', None, None, None, None, None, None, None, None, None, None, ' ', 'predicament', None, None, None, None, None, None, None, None, None, None, ' ', 'called', None, None, None, None, None, None, None, None, None, None, ' ', 'life', None, None, None, None, None, None, None, None, None, None, '', '.', None, None, None, None, None, None, None, None, None, None, ' ', 'Can', None, None, None, None, None, None, None, None, None, None, ' ', 'you', None, None, None, None, None, None, None, None, None, None, ' ', 'help', None, None, None, None, None, None, None, None, None, None, ' ', 'me', None, None, None, None, None, None, None, None, None, None, '', '?', None, None, None, None, None, None, None, None, None, None, '']
The output that I'd like is:-
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
Thank you.
I recommend NLTK's tokenizers. Then you don't need to write tedious regular expressions yourself:
>>> import nltk
>>> nltk.word_tokenize("Hello! Hi, I am debating this predicament called life. Can you help me?")
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me', '?']
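Note that this word_tokenize output keeps the sentence-internal period attached as 'life.'. If that matters, one possible workaround (just a sketch; it assumes NLTK's sent_tokenize and its punkt data are installed) is to split into sentences first and then tokenize each sentence, which should produce the token list the question asks for:

>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> [tok for sent in nltk.sent_tokenize(s) for tok in nltk.word_tokenize(sent)]
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']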
re.split rapidly runs out of puff when used as a tokeniser. Preferable is findall (or match in a loop) with a pattern of alternatives, this|that|another|more:
>>> s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
>>> import re
>>> re.findall(r"\w+|\S", s)
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']
>>>
This defines tokens as either one or more "word" characters, or a single character that's not whitespace. You may prefer [A-Za-z] or [A-Za-z0-9] or something else instead of \w (which allows underscores). You may even want something like r"[A-Za-z]+|[0-9]+|\S".
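For example, on a throwaway string with an underscore and digits (just to illustrate the difference):

>>> t = "foo_bar42 rocks!"
>>> re.findall(r"\w+|\S", t)
['foo_bar42', 'rocks', '!']
>>> re.findall(r"[A-Za-z]+|[0-9]+|\S", t)
['foo', '_', 'bar', '42', 'rocks', '!']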
If things like Sen., Mr. and Miss (what happened to Mrs and Ms?) are significant to you, your regex should not list them out; it should just define a token that ends in '.', and you should have a dictionary or set of probable abbreviations.
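A rough sketch of that idea (the tokenize helper, its pattern and the abbreviation set below are mine, for illustration only): capture an optional trailing period together with the word, and keep it attached only when the word is a known abbreviation.

import re

ABBREVIATIONS = {'Mr', 'Mrs', 'Ms', 'Miss', 'Sen', 'Dr'}  # hypothetical set of likely abbreviations

def tokenize(text):
    tokens = []
    for match in re.finditer(r"(\w+)(\.?)|\S", text):
        word, dot = match.group(1), match.group(2)
        if word is None:
            # A lone non-word, non-space character (punctuation etc.)
            tokens.append(match.group())
        elif dot and word in ABBREVIATIONS:
            # Known abbreviation: keep the period attached ("Sen.")
            tokens.append(word + dot)
        else:
            tokens.append(word)
            if dot:
                # Otherwise the period is (probably) sentence-ending punctuation
                tokens.append(dot)
    return tokens

With that, tokenize("Sen. Smith met Mr. Jones.") should come out as ['Sen.', 'Smith', 'met', 'Mr.', 'Jones', '.'].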
Splitting text into sentences is complicated. You may like to look at the nltk package instead of trying to reinvent the wheel.
Update: if you need/want to distinguish between the types of tokens, you can get an index or a name like this without a (possibly long) chain of if/elif/elif/.../else:
>>> s = "Hello! Hi, I we 0 1 987?"
>>> pattern = r"([A-Za-z]+)|([0-9]+)|(\S)"
>>> list((m.lastindex, m.group()) for m in re.finditer(pattern, s))
[(1, 'Hello'), (3, '!'), (1, 'Hi'), (3, ','), (1, 'I'), (1, 'we'), (2, '0'), (2, '1'), (2, '987'), (3, '?')]
>>> pattern = r"(?P<word>[A-Za-z]+)|(?P<number>[0-9]+)|(?P<other>\S)"
>>> list((m.lastgroup, m.group()) for m in re.finditer(pattern, s))
[('word', 'Hello'), ('other', '!'), ('word', 'Hi'), ('other', ','), ('word', 'I'), ('word', 'we'), ('number', '0'), ('number', '1'), ('number', '987'), ('other', '?')]
>>>
I could be missing something, but I believe something like the following would work:
s = "Hello! Hi, I am debating this predicament called life. Can you help me?"
s.split(" ")
This assumes you want to split on spaces. You should get something along the lines of:
['Hello!', 'Hi,', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life.', 'Can', 'you', 'help', 'me?']
With this, if you needed a specific piece, you could probably loop through it to get what you need.
Hopefully this helps....
The reason you're getting all of those None's is that you have lots of parenthesized groups in your regular expression, separated by |'s. Every time your regular expression finds a match, it's only matching one of the alternatives given by the |'s. The parenthesized groups in the other, unused alternatives get set to None. And re.split by definition reports the values of all parenthesized groups every time it gets a match, hence all the None's in your result.
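A minimal illustration of that mechanism: every match contributes all of the pattern's groups to the result, matched or not.

>>> import re
>>> re.split(r"(a)|(b)", "xaybz")
['x', 'a', None, 'y', None, 'b', 'z']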
You could filter those out pretty easily (e.g. tokens = [t for t in tokens if t] or something similar, as sketched below), but I think split isn't really the tool you want for tokenizing; split is meant for just throwing away whitespace.
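A quick sketch of that cleanup, using the splitter from the question. Note that the bare single-space separators are truthy, so a plain truth test isn't quite enough and they have to be dropped as well:

>>> tokens = splitter.split("Hello! Hi, I am debating this predicament called life. Can you help me?")
>>> [t for t in tokens if t and not t.isspace()]
['Hello', '!', 'Hi', ',', 'I', 'am', 'debating', 'this', 'predicament', 'called', 'life', '.', 'Can', 'you', 'help', 'me', '?']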
If you really want to use regular expressions to tokenize something, here's a toy example of another method (I'm not going to even try to unpack that monster r.e. you're using...use the re.VERBOSE option for the love of Ned...but hopefully this toy example will give you the idea):
tokenpattern = re.compile(r"""
(?P<words>\w+) # Things with just letters and underscores
|(?P<numbers>\d+) # Things with just digits
|(?P<other>.+?) # Anything else
""", re.VERBOSE)
The (?P<something>...) business lets you identify the type of token you're looking for by name in the code below:
for match in tokenpattern.finditer("99 bottles of beer"):
    if match.group('words'):
        # This token is a word
        word = match.group('words')
        #...
    elif match.group('numbers'):
        # This token is a number
        number = int(match.group('numbers'))
    else:
        # Anything else
        other = match.group('other')
Note that this is still a r.e. using a bunch of parenthesized groups separated by |'s, so the same thing is going to happen as in your code: for each match, one group will be defined and the others will be set to None. This method checks for that explicitly.
Perhaps he didn't mean it as such, but John Machin's comment "str.split is NOT a place to get started" (as part of the exchange after Frank V's answer) came as a bit of a challenge. So ...
the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
tokens = the_string.split()
punctuation = ['!', ',', '.', '?']
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
print output_list
This seems to provide the requested output.
Granted, John's answer is simpler in terms of number of lines of code. However, I have a couple of points to make in support of this sort of solution.
I don't completely agree with Jamie Zawinski's 'Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.' (Neither did he from what I've read.) My point in quoting this is that regular expressions can be a pain to get working if you're not accustomed to them.
Also, while it won't normally be an issue, the performance of the above solution was consistently better than the regex solution, when measured with timeit. The above solution (with the print statement removed) came in at about 8.9 seconds; John's regular expression solution came in at about 11.8 seconds. This involved 10 tries each of 1 million iterations on a quad core dual processor system running at 2.4 GHz.
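For what it's worth, here's a sketch of how such a comparison might be set up with timeit. This is an assumption about the harness (the numbers above were not necessarily produced with exactly this script), and the regex statement uses John's re.findall pattern from earlier:

import timeit

setup = '''
import re
the_string = "Hello! Hi, I am debating this predicament called life. Can you help me?"
punctuation = ['!', ',', '.', '?']
'''

# The str.split + punctuation loop from above, minus the print
split_version = '''
tokens = the_string.split()
output_list = []
for token in tokens:
    if token[-1] in punctuation:
        output_list.append(token[:-1])
        output_list.append(token[-1])
    else:
        output_list.append(token)
'''

# The findall-based tokeniser
regex_version = r'tokens = re.findall(r"\w+|\S", the_string)'

# 10 tries of 1 million iterations each, as described above
print(min(timeit.repeat(split_version, setup, repeat=10, number=1000000)))
print(min(timeit.repeat(regex_version, setup, repeat=10, number=1000000)))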