I would like to search through a text file and print out a line and its subsequent 3 lines if a keyword is found in the line AND a different keyword is found within the subsequent 3 lines.
My code right now prints too much information. Is there a way to move forward to the next section of text once a portion is already printed?
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""
text2 = open("tmp.txt","w")
text2.write(text)
text2.close()
searchlines = open("tmp.txt").readlines()
data = []
for m, line in enumerate(searchlines):
    line = line.lower()
    if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
        for line2 in searchlines[m:m+4]:
            data.append(line2)
print ''.join(data)
The output right now is:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
I would like it to print out only:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
So, as someone else has pointed out, your first keyword, keyword, is a substring of your second keyword, keyword2. So I've implemented this using regexp objects, so that you can use the word boundary anchor \b.
import re
from StringIO import StringIO
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""
def my_scan(data,search1,search2):
    buffer = []
    for line in data:
        buffer.append(line)
        if len(buffer) > 4:
            buffer.pop(0)
        if len(buffer) == 4: # Valid search block
            # buffer[0] is the candidate first line; buffer[1:4] are the 3 lines after it
            if search1.search(buffer[0]) and search2.search("\n".join(buffer[1:4])):
                for item in buffer:
                    yield item
                buffer = [] # start a fresh block so printed lines are not reused
# First search term
s1 = re.compile(r'\bkeyword\b')
# Second search term
s2 = re.compile(r'\bkeyword2\b')
for row in my_scan(StringIO(text),s1,s2):
    print row.rstrip()
Produces:
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
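To see why the word boundary matters, here is a minimal sketch comparing a plain substring test with the \b pattern on one of the sample lines. It uses Python 3 print syntax (unlike the Python 2 code above), and the variable names are only illustrative.

```python
import re

line = "print this line since it has a keyword2 3"

# Plain substring test: "keyword" is found inside "keyword2".
substring_hit = "keyword" in line

# Word-boundary test: \b needs a non-word character (or the string edge)
# on each side, so "keyword2" does not count as "keyword".
whole_word = re.compile(r"\bkeyword\b")
boundary_hit = whole_word.search(line) is not None

print(substring_hit, boundary_hit)  # True False
```

The same compiled pattern still matches a line such as "print this line keyword 4", where the keyword stands alone.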
So you want to print out all blocks of 4 lines containing at least 2 different keywords?
Anyway, that's what I've just come up with. Maybe you can use it:
text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
""".splitlines()
keywords = ['keyword', 'keyword2']
buffer, kw = [], set()
for line in text:
    if len(buffer) == 0: # first line of a block
        for k in keywords:
            if k in line:
                kw.add(k)
        if kw: # start a block only if this line contains a keyword
            buffer.append(line)
    else: # continuation lines
        buffer.append(line)
        for k in keywords:
            if k in line:
                kw.add(k)
        if len(buffer) > 3:
            if len(kw) >= 2: # just print blocks with enough keywords
                print '\n'.join(buffer)
            buffer, kw = [], set()
Your keywords are overlapping: "keyword" is a substring of "keyword2", so every line that contains "keyword2" also matches "keyword".
Also, your data implies you don't want to see line 13, but according to the problem statement it should be printed (line 10 has the first keyword and line 11 has the second).
I changed your first keyword from "keyword" to "firstkey" like this and your code works (except for line 13).
$ diff /tmp/q /tmp/q2
4c4
< I want to print out this line and the following 3 lines only once keyword 2
---
> I want to print out this line and the following 3 lines only once firstkey 2
6c6
< print this line keyword 4
---
> print this line firstkey 4
11,12c11,12
< I want to print out this line again and the following 3 lines only once keyword 9
< please print this line keyword 10
---
> I want to print out this line again and the following 3 lines only once firstkey 9
> please print this line firstkey 10
30c30
< if "keyword" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
---
> if "firstkey" in line and any("keyword2" in l.lower() for l in searchlines[m:m+4]):
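As a rough check of the overlap, the sketch below lists the line indices at which the question's condition fires, once with the plain substring test and once with a whole-word test (which has the same effect as renaming the first keyword). It is written for Python 3; the helper name triggers is hypothetical and not part of the original code.

```python
import re

text = """
here is some text 1
I want to print out this line and the following 3 lines only once keyword 2
print this line since it has a keyword2 3
print this line keyword 4
print this line 5
I don't want to print this line but I want to start looking for more text starting at this line 6
Don't print this line 7
Not this line either 8
I want to print out this line again and the following 3 lines only once keyword 9
please print this line keyword 10
please print this line it has the keyword2 11
please print this line 12
Don't print this line 13
Start again searching here 14
etc.
"""
lines = text.splitlines()  # index 0 is the empty first line, as in the question

def triggers(lines, first_test):
    """Indices m where first_test(lines[m]) holds and 'keyword2' occurs in lines[m:m+4]."""
    return [m for m, line in enumerate(lines)
            if first_test(line) and any("keyword2" in l for l in lines[m:m + 4])]

substring_triggers = triggers(lines, lambda l: "keyword" in l)
whole_word_triggers = triggers(lines, lambda l: re.search(r"\bkeyword\b", l) is not None)

print(substring_triggers)   # [2, 3, 9, 10, 11]
print(whole_word_triggers)  # [2, 9, 10]
```

The five substring triggers correspond to the five overlapping blocks in the question's output. With a distinct first keyword, index 10 still fires, which is why line 13 is still printed unless the scan also skips past a block once it has been emitted.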
First, you could correct your code like this:
text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""
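searchlines = map(str.lower,text.splitlines(1))
# splitlines(1) with argument 1 keeps the newlines
# (the lines are lowercased up front, so the printed output is lowercase too)
data,again = [],-1
for m, line in enumerate(searchlines):
    # m>again skips lines that belong to a block already collected
    if "keyword" in line and m>again and "keyword2" in ''.join(searchlines[m+1:m+4]):
        data.extend(searchlines[m:m+4])
        again = m+3 # index of the last line of this block
print ''.join(data)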
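The skip-ahead idea can also be written with an explicit index that jumps past each collected block. The following is only a sketch in Python 3; the function name find_blocks, its span parameter, and the shortened sample lines are assumptions made for illustration, not part of the original post.

```python
import re

def find_blocks(lines, first, second, span=3):
    """Return non-overlapping blocks of 1 + span lines.

    A block starts at a line matching `first` when any of the next `span`
    lines matches `second`; scanning resumes right after the block.
    """
    blocks = []
    i = 0
    while i < len(lines):
        following = lines[i + 1:i + 1 + span]
        if first.search(lines[i]) and any(second.search(l) for l in following):
            blocks.append(lines[i:i + 1 + span])
            i += 1 + span  # jump past the block that was just collected
        else:
            i += 1
    return blocks

# Hypothetical, shortened sample data in the spirit of the question.
sample = [
    "here is some text 1",
    "start here keyword 2",
    "has keyword2 3",
    "another keyword 4",
    "plain 5",
    "resume scanning here 6",
    "start again keyword 7",
    "plain 8",
    "plain 9",
    "has keyword2 10",
    "not printed 11",
]
blocks = find_blocks(sample,
                     re.compile(r"\bkeyword\b", re.IGNORECASE),
                     re.compile(r"\bkeyword2\b", re.IGNORECASE))
for block in blocks:
    print("\n".join(block))
```

Because the index moves past a whole block at once, a line that was already printed can never start a new block, and the original casing of the lines is preserved.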
Second, a short regex solution is:
text = """
0//
1// here is some text 1
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
6// I don't want to print this line but I want to start looking for more text starting at this line 6
7// Don't print this line 7
8// Not this line either 8
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12
13// Don't print this line 13
14// Start again searching here 14
15// etc.
"""
import re
# Group 1 is the whole 4-line block; groups 2 and 3 are the optional keyword2
# matches on the 2nd and 3rd lines. The 4th line must contain keyword2 only
# if neither group 2 nor group 3 matched.
regx = re.compile('(^.*?(?<=[ \t]){0}(?=[ \t]).*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?((?<=[ \t]){1}(?=[ \t]))?.*\r?\n'
                  '.*?(?(2)|(?(3)|{1})).*)'.\
                  format('keyword','keyword2'),re.MULTILINE|re.IGNORECASE)
print '\n'.join(m.group(1) for m in regx.finditer(text))
Result:
A2// I want to print out this line and the following 3 lines only once keyword 2
b3// print this line since it has a keyword2 3
b4// print this line keyword 4
b5// print this line 5
A9// I want to print out this line again and the following 3 lines only once keyword 9
b10// please print this line keyword 10
b11// please print this line it has the keyword2 11
b12// please print this line 12