I am a complete newbie to Python, and I'm stuck with a regex problem. I'm trying to remove the line break character at the end of each line in a text file, but only if it follows a lowercase letter, i.e. [a-z]. If the line ends in a lowercase letter, I want to replace the line break/newline character with a space.
This is what I've got so far:
import re
import sys
textout = open("output.txt","w")
textblock = open(sys.argv[1]).read()
textout.write(re.sub("[a-z]\z","[a-z] ", textblock, re.MULTILINE) )
textout.close()
Try
re.sub(r"(?<=[a-z])\r?\n"," ", textblock)
\Z only matches at the end of the string, after the last linebreak, so it's definitely not what you need here. \z is not recognized by the Python regex engine.
(?<=[a-z]) is a positive lookbehind assertion that checks if the character before the current position is a lowercase ASCII character. Only then will the regex engine try to match a line break.
Also, always use raw strings with regexes. Makes backslashes easier to handle.
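To see the lookbehind version in action without touching any files, here is a minimal sketch. The function name join_lowercase_breaks and the sample text are made up for illustration; only the pattern comes from the answer above.

```python
import re

def join_lowercase_breaks(text):
    # Replace a line break with a space, but only when the character
    # immediately before it is a lowercase ASCII letter.
    return re.sub(r"(?<=[a-z])\r?\n", " ", text)

# Hypothetical sample: two lines end in a lowercase letter, two do not.
sample = "first line\nSecond line.\nthird\r\nEND\n"
print(repr(join_lowercase_breaks(sample)))
# -> 'first line Second line.\nthird END\n'
```

Note also that in the question's code re.MULTILINE is passed as the fourth positional argument of re.sub, which is count, not flags. If a flag is ever needed, pass it by keyword, as in flags=re.MULTILINE.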
Just as an alternative answer, although it takes more lines, I think the following may be clearer since the regular expression is simpler:
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            if re.search('[a-z]$',line):
                ofp.write(line.rstrip("\n\r")+" ")
            else:
                ofp.write(line)
... and that avoids loading the whole file into a string. If you want to use fewer lines, but still avoid positive lookbehind, you could do:
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(re.sub('(?m)([a-z])[\r\n]+$','\\1 ',line))
The parts of that regular expression are:
(?m) [turn on multiline matching]
([a-z]) [match a single lower case character as the first group]
[\r\n]+ [match one or more of carriage returns or newlines, to cover \n, \r\n and \r]
$ [match the end of the string]
... and if that matches line, the lowercase letter and line ending are replaced by '\\1 ', which will be the lower case letter followed by a space.
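As a rough illustration of the group-based replacement, the sketch below applies the same idea to a few made-up lines. The names PATTERN and fix_line are hypothetical, and the pattern is written as a raw string without (?m), which is enough when each string holds a single line.

```python
import re

# Capture the final lowercase letter, then consume the line ending.
PATTERN = re.compile(r"([a-z])[\r\n]+$")

def fix_line(line):
    # r"\1 " puts back the captured letter and appends a space,
    # so only the line ending is effectively replaced.
    return PATTERN.sub(r"\1 ", line)

# Hypothetical sample lines covering \n, \r\n and non-matching endings.
for line in ["ends lower\n", "ENDS UPPER\n", "windows style\r\n", "ends with dot.\n"]:
    print(repr(fix_line(line)))
```

Lines that do not end in a lowercase letter come back unchanged, line ending included.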
"my point was that avoiding using positive lookbehind might make the code more readable"
OK. Though, personally, I don't find it's less readable. It's a matter of taste.
In your EDIT:
First, (?m) is not necessary, since for line in ifp: selects one line at a time, so there is only one newline, at the end of each line's string.
Secondly, $, as it is placed, has no utility, because it will always match at the end of the string line.
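A quick way to check both remarks is to compare the pattern with and without (?m) and $ on single-line strings. This is only a sketch: the function names and sample strings are invented for the comparison.

```python
import re

def sub_with_extras(line):
    # Pattern as posted, with (?m) and a trailing $.
    return re.sub(r"(?m)([a-z])[\r\n]+$", r"\1 ", line)

def sub_without_extras(line):
    # Same pattern with (?m) and $ removed.
    return re.sub(r"([a-z])[\r\n]+", r"\1 ", line)

# Each sample is a single line, as produced by iterating over a file.
samples = ["abc\n", "ABC\n", "abc\r\n", "abc", "\n", "x\r"]
for line in samples:
    assert sub_with_extras(line) == sub_without_extras(line)
print("identical results on all single-line samples")
```

The equivalence only holds because each string is a single line. On a string with several lines, such as "a\n\nb", the two patterns give different results.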
Anyway, adopting your point of view, I found two ways to avoid the lookbehind assertion:
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ante_newline,lower_last = re.match('(.*?([a-z])?$)',line).groups()
            ofp.write(ante_newline+' ' if lower_last else line)
and
import re
import sys
with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(line.rstrip('\r\n')+' ' if re.search('[a-z]$',line) else line)
The second one is better: only one line, a simple match to test, no need for groups(), and the logic reads naturally.
EDIT: Oh, I realize that this second code is simply your first code rewritten in one line, Longair.