Here is an example of the type of text file I am trying to search (named usefile):
DOCK onomatopoeia DOCK blah blah
blah DOCK blah DOCK blah blah blah onomatopoeia blah blah blah blah blah DOCK DOCK blah blah DOCK blah onomatopoeia
I am using a finditer statement to find everything between DOCK and onomatopoeia as follows:
re.finditer(r'((dock)(.+?)(onomatopoeia))', usefile, re.I|re.DOTALL)
Obviously Dock is a much more common word than onomatopoeia, and I only want the text between onomatopoeia and the nearest Dock before it. The regex above starts at the first Dock it finds and stops at the next onomatopoeia, so I might get Dock Dock Dock Dock onomatopoeia when I really only wanted Dock onomatopoeia.
To be clear what I want from above is:
1. DOCK onomatopoeia
2. DOCK blah blah blah onomatopoeia
3. DOCK blah onomatopoeia
Is there a way to search for onomatopoeia and go UP to the first instance of Dock, or a better way to solve my problem?
Thanks!
A negative lookahead assertion will do the trick.
DOCK((?!DOCK).)+?onomatopoeia
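For illustration, here is a minimal sketch of that pattern run against the sample text from the question. The helper name `nearest_spans` and the `SAMPLE` string are made up for this example, and the `re.I | re.DOTALL` flags are carried over from the question's own call on the assumption that case-insensitive, multi-line matching is still wanted.

```python
import re

# Sample text from the question, with the line break kept as "\n".
SAMPLE = (
    "DOCK onomatopoeia DOCK blah blah\n"
    "blah DOCK blah DOCK blah blah blah onomatopoeia blah blah blah "
    "blah blah DOCK DOCK blah blah DOCK blah onomatopoeia"
)

# (?!DOCK). consumes one character at a time, but only where another
# DOCK does not begin, so a match can never span a second DOCK.
PATTERN = re.compile(r"DOCK((?!DOCK).)+?onomatopoeia", re.I | re.DOTALL)


def nearest_spans(text):
    """Return each span running from a DOCK to the onomatopoeia after it,
    with no other DOCK in between (hypothetical helper)."""
    return [m.group(0) for m in PATTERN.finditer(text)]


for span in nearest_spans(SAMPLE):
    print(span)
```

On the sample this prints the three spans asked for: `DOCK onomatopoeia`, `DOCK blah blah blah onomatopoeia`, and `DOCK blah onomatopoeia`. Note that the pattern treats DOCK as a plain substring, so a word such as "docking" would also count as a start; adding `\b` word boundaries would be one way to tighten that if it matters for the real data.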
Here's an algorithmic approach:
- Set pushing = false.
- Break your text apart into words (e.g. spans of letters) and loop over those.
- On hitting DOCK while pushing == false, push it onto a stack and set pushing = true.
- On hitting DOCK while pushing == true, clear the stack, then push your new DOCK.
- On hitting ono... while pushing == true, print out whatever's on the stack plus ono..., then clear the stack and set pushing = false.
- For any other word, if pushing == true, push it.
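The steps above could be sketched in Python roughly as follows. This is one possible reading of the algorithm, not a definitive implementation: the function name `spans_by_stack`, the case-insensitive comparison, and the choice to collect results in a list rather than print them are all assumptions made for the example.

```python
import re

# Sample text from the question, with the line break kept as "\n".
SAMPLE = (
    "DOCK onomatopoeia DOCK blah blah\n"
    "blah DOCK blah DOCK blah blah blah onomatopoeia blah blah blah "
    "blah blah DOCK DOCK blah blah DOCK blah onomatopoeia"
)


def spans_by_stack(text, start="dock", end="onomatopoeia"):
    """Collect the words from the nearest preceding start word up to each
    end word (hypothetical helper; comparison is case-insensitive)."""
    results = []
    stack = []
    pushing = False
    # "Words" are taken to be runs of ASCII letters, as the answer suggests.
    for word in re.findall(r"[A-Za-z]+", text):
        lowered = word.lower()
        if lowered == start:
            # A new DOCK always restarts the stack, discarding any
            # earlier DOCK that never reached an end word.
            stack = [word]
            pushing = True
        elif lowered == end:
            if pushing:
                results.append(" ".join(stack + [word]))
                stack = []
                pushing = False
        elif pushing:
            stack.append(word)
    return results


for span in spans_by_stack(SAMPLE):
    print(span)
```

Because the words are rejoined with single spaces, the original whitespace and punctuation between them are not preserved; if the exact source text is needed, the regex approach keeps it intact.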