python regex finding all groups of words

Developer https://www.devze.com 2023-01-22 16:10 Source: Internet
Here is what I have so far

import re

text = "Hello world. It is a nice day today. Don't you think so?"
re.findall(r'\w{3,}\s{1,}\w{3,}', text)
#['Hello world', 'nice day', 'you think']

The desired output would be ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']

Can this be done with a simple regex pattern?


import itertools as it
import re 

three_pat=re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"
for key,group in it.groupby(text.split(),lambda x: bool(three_pat.match(x))):
    if key:
        group=list(group)       
        for i in range(0,len(group)-1):
            print(' '.join(group[i:i+2]))

# Hello world.
# nice day
# day today.
# today. Don't
# Don't you
# you think

It is not clear to me what you want done with punctuation. On the one hand, it looks like you want periods removed, but on the other you want single quotation marks kept. Removing periods would be easy to implement, but before I do, would you clarify what should happen to punctuation in general?
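
In the meantime, here is a minimal sketch of one possible reading, assuming that sentence punctuation at the ends of a word is dropped and apostrophes inside a word are kept. The character set passed to strip() and the pairs list are my own choices for illustration, not something the question specifies.

```python
import itertools as it
import re

three_pat = re.compile(r'\w{3}')
text = "Hello world. It is a nice day today. Don't you think so?"

# Assumption: only punctuation at the ends of a word is removed;
# apostrophes inside a word (as in "Don't") are left alone.
words = [w.strip('.,;:!?') for w in text.split()]

pairs = []
for key, group in it.groupby(words, lambda w: bool(three_pat.match(w))):
    if key:
        group = list(group)
        # Pair each word with its right-hand neighbour within the run.
        for i in range(len(group) - 1):
            pairs.append(' '.join(group[i:i + 2]))

print(pairs)
# ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']
```

Note that pairs are still only formed inside a run of consecutive long words, so a short word such as "It" breaks the run.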


list(map(lambda x: x[0] + x[1], re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)))

Maybe you can write the lambda more briefly (for instance, using just '+'). By the way, ' is not matched by \w or \s, so "Don't" will not be picked up as one word.
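
One possible shorter spelling, as a sketch: each findall result is a tuple (word, whitespace plus next word), so ''.join can glue the two parts together without a lambda. The names matches and pairs are only for illustration.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Group 1 is the first word; group 2 (inside the lookahead) is the
# whitespace plus the following word, which is not consumed.
matches = re.findall(r'(\w{3,}(?=(\s{1,}\w{3,})))', text)

pairs = [''.join(m) for m in matches]
print(pairs)
# ['Hello world', 'nice day', 'day today', 'you think']
```

The pairs around "Don't" and "today." are still missing, because neither the apostrophe nor the period is matched by \w or \s.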


Something like this with additional checks for list boundaries should do:

>>> text = "Hello world. It is a nice day today. Don't you think so?"
>>> k = text.split()
>>> k
['Hello', 'world.', 'It', 'is', 'a', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']
>>> z = [x for x in k if len(x) > 2]
>>> z
['Hello', 'world.', 'nice', 'day', 'today.', "Don't", 'you', 'think', 'so?']

>>> [z[n]+ " " + z[n+1] for n in range(0, len(z)-1, 2)]
['Hello world.', 'nice day', "today. Don't", 'you think']
>>> 


There are two problems with your approach:

  1. Neither \w nor \s matches punctuation.
  2. When you match a string with a regular expression using findall, that part of the string is consumed. Searching for the next match commences immediately after the end of the previous match. Because of this a word can't be included in two separate matches.

To solve the first issue you need to decide what you mean by a word. Regular expressions aren't good for this sort of parsing. You might want to look at a natural language parsing library instead.
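
As a rough illustration of what "deciding what a word is" could look like, the sketch below assumes a word is any run of letters, digits, underscores or apostrophes, and that everything else is a separator. That definition is an assumption on my part and will not suit every text.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# Assumption: a "word" is a run of \w characters or apostrophes.
tokens = re.findall(r"[\w']+", text)

# Pair each token with its neighbour and keep pairs where both
# words have at least three characters.
pairs = [
    f"{first} {second}"
    for first, second in zip(tokens, tokens[1:])
    if len(first) >= 3 and len(second) >= 3
]
print(pairs)
# ['Hello world', 'nice day', 'day today', "today Don't", "Don't you", 'you think']
```

Because the pairing is done on the token list rather than inside the regex, a word can appear in two pairs without any overlap problem.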

But assuming that you can come up with a regular expression that works for your needs, to fix the second problem you can use a lookahead assertion to check the second word. This won't return the entire match as you want but you can at least find the first word in each word pair using this method.

 re.findall('\w{3,}(?=\s{1,}\w{3,})',text)
                   ^^^            ^
                  lookahead assertion
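
If you do want both words back, one variation (a sketch, not the only way) is to put the whole pair in a capturing group inside the lookahead. The match itself is then empty, so nothing is consumed and pairs can overlap. The \b is my addition, to stop a match from starting in the middle of a word.

```python
import re

text = "Hello world. It is a nice day today. Don't you think so?"

# The lookahead captures the pair without consuming any characters;
# \b restricts candidate positions to word boundaries.
pairs = re.findall(r'\b(?=(\w{3,}\s{1,}\w{3,}))', text)
print(pairs)
# ['Hello world', 'nice day', 'day today', 'you think']
```

This still uses \w as the definition of a word, so the pairs involving "today." and "Don't" are not found.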