Regular expression to remove line breaks

Developer https://www.devze.com 2023-02-12 18:59 Source: Internet
I am a complete newbie to Python, and I'm stuck with a regex problem. I'm trying to remove the line break character at the end of each line in a text file, but only if it follows a lowercase letter, i.e. [a-z]. If the line ends in a lowercase letter, I want to replace the line break/newline character with a space.

This is what I've got so far:

import re
import sys

textout = open("output.txt","w")
textblock = open(sys.argv[1]).read()
textout.write(re.sub("[a-z]\z","[a-z] ", textblock, re.MULTILINE) )
textout.close()


Try

re.sub(r"(?<=[a-z])\r?\n"," ", textblock)

\Z only matches at the end of the string, after the last linebreak, so it's definitely not what you need here. \z is not recognized by the Python regex engine (at least before Python 3.14, which added it as a synonym for \Z).
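To see the difference on a small input, here is a minimal sketch with a made-up sample string. It compares \Z against $ under re.MULTILINE, which is the anchor that matches at the end of every line:

```python
import re

# Hypothetical sample: three lines, each terminated by "\n".
text = "abc\nDEF\nxyz\n"

# \Z anchors only at the very end of the string, i.e. after the final "\n",
# so no lowercase letter sits directly in front of it.
at_string_end = re.findall(r"[a-z]\Z", text)

# With re.MULTILINE, $ also matches just before each newline.
at_line_ends = re.findall(r"[a-z]$", text, flags=re.MULTILINE)

print(at_string_end)  # []
print(at_line_ends)   # ['c', 'z']
```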

(?<=[a-z]) is a positive lookbehind assertion that checks whether the character before the current position is a lowercase ASCII letter. Only then will the regex engine try to match a line break.
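As a quick illustration, the substitution can be wrapped in a small function and run on a sample string. The function name join_after_lowercase and the sample text are assumptions made for this sketch:

```python
import re

def join_after_lowercase(text):
    """Replace a line break with a space when it follows a lowercase letter."""
    return re.sub(r"(?<=[a-z])\r?\n", " ", text)

# Hypothetical sample mixing "\n" and "\r\n" line endings.
sample = "foo\nbar.\nBaz\r\nqux\n"
result = join_after_lowercase(sample)

# The break after "bar." is kept because "." is not a lowercase letter.
print(repr(result))  # 'foo bar.\nBaz qux '
```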

Also, always use raw strings with regexes; they make backslashes easier to handle.
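A small sketch of why raw strings matter, using \b as the example (the sample text is made up). In a normal string literal Python turns "\b" into a backspace character before the regex engine ever sees it:

```python
import re

# In a normal string literal, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string literal, r"\b" stays as backslash + "b", which the regex
# engine reads as a word boundary.
raw = r"\bcat\b"

text = "cat concat cats"
plain_hits = re.findall(plain, text)
raw_hits = re.findall(raw, text)

print(plain_hits)  # []
print(raw_hits)    # ['cat']
```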


Just as an alternative answer, although it takes more lines, I think the following may be clearer since the regular expression is simpler:

import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            if re.search('[a-z]$',line):
                ofp.write(line.rstrip("\n\r")+" ")
            else:
                ofp.write(line)

... and that avoids loading the whole file into a string. If you want to use fewer lines, but still avoid positive lookbehind, you could do:

import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            ofp.write(re.sub('(?m)([a-z])[\r\n]+$','\\1 ',line))

The parts of that regular expression are:

  • (?m) [turn on multiline matching]
  • ([a-z]) [match a single lower case character as the first group]
  • [\r\n]+ [match one or more of carriage returns or newlines, to cover \n, \r\n and \r]
  • $ [match the end of the line; since each line is processed separately, this is also the end of the string]

... and if that matches the line, the lowercase letter and line ending are replaced by '\\1 ', which is the lowercase letter followed by a space.
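To try the per-line substitution without touching any files, here is a minimal sketch that runs it over an in-memory stream. The helper name join_lines and the sample text are made up for illustration; the pattern is the one from the snippet above, written as a raw string:

```python
import io
import re

# Same pattern as above; in a raw string the regex engine itself
# interprets \r and \n as carriage return and newline.
PATTERN = re.compile(r"(?m)([a-z])[\r\n]+$")

def join_lines(lines):
    """Yield each line, with its line ending replaced by a space
    when the line ends in a lowercase letter."""
    for line in lines:
        yield PATTERN.sub(r"\1 ", line)

# io.StringIO stands in for the open input file.
sample = io.StringIO("first line\nSecond LINE\nthird\n")
result = "".join(join_lines(sample))

print(repr(result))  # 'first line Second LINE\nthird '
```

Because the function takes any iterable of lines, an open file object can be passed in the same way as the StringIO used here.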


my point was that avoiding using positive lookbehind might make the code more readable

OK. Though, personally, I don't find it less readable. It's a matter of taste.

In your EDIT:

  • First, (?m) is not necessary, since for line in ifp: yields one line at a time, so each line's string contains only one newline, at its end.

  • Secondly, $ as placed serves no purpose, because [\r\n]+ already runs up to the end of the string line, so $ will always match there.
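A quick check of both points, as a sketch: on strings that each hold a single line, the pattern with and without (?m) and $ produces the same output. The sample lines are invented for illustration, and the equivalence only holds when line breaks appear at the end of each string, as they do when iterating over a file:

```python
import re

original = r"(?m)([a-z])[\r\n]+$"
simplified = r"([a-z])[\r\n]+"

# Hypothetical single-line strings, as "for line in ifp" would produce.
sample_lines = ["abc\n", "ABC\n", "abc\r\n", "no newline", "\n"]

for line in sample_lines:
    with_anchor = re.sub(original, r"\1 ", line)
    without_anchor = re.sub(simplified, r"\1 ", line)
    # Both patterns agree on every single-line input.
    assert with_anchor == without_anchor
    print(repr(line), "->", repr(with_anchor))
```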

Anyway, adopting your point of view, I found two ways to avoid the lookbehind assertion:

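import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            # group 1: the line without its newline; group 2: a trailing lowercase letter, if any
            ante_newline,lower_last = re.match('(.*?([a-z])?$)',line).groups()
            ofp.write(ante_newline+' ' if lower_last else line)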

and

import re
import sys

with open(sys.argv[1]) as ifp:
    with open("output.txt", "w") as ofp:
        for line in ifp:
            # rstrip removes only the trailing line break, never leading characters
            ofp.write(line.rstrip('\r\n')+' ' if re.search('[a-z]$',line) else line)

The second one is better: only one line, a simple match to test, no need for groups(), and naturally logical.

EDIT: Oh, I realize that this second code is simply your first code rewritten in one line, Longair.
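For anyone who wants to compare the two variants side by side, here is a minimal sketch that pulls each per-line expression into a function and runs both on a few invented lines. The function names via_groups and via_search are hypothetical:

```python
import re

def via_groups(line):
    # Group 1 is the line up to (not including) a final newline;
    # group 2 is the trailing lowercase letter, or None if there is none.
    ante_newline, lower_last = re.match('(.*?([a-z])?$)', line).groups()
    return ante_newline + ' ' if lower_last else line

def via_search(line):
    # rstrip is used here so that only the trailing line break is removed.
    return line.rstrip('\r\n') + ' ' if re.search('[a-z]$', line) else line

# Hypothetical sample lines, as yielded by iterating over a text file.
sample_lines = ["abc\n", "ABC\n", "abc", "\n"]

for line in sample_lines:
    assert via_groups(line) == via_search(line)
    print(repr(line), "->", repr(via_search(line)))
```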
