Easiest way to ignore blank lines when reading a file in Python

I have some code that reads a file of names and creates a list:

names_list = open("names", "r").read().splitlines()

Each name is separated by a newline, like so:

Allman
Atkinson

Behlendorf 

I want to ignore any lines that contain only whitespace. I know I can do this by creating a loop, checking each line as I read it, and adding it to a list if it's not blank.

I was just wondering if there was a more Pythonic way of doing it?


I would stack generator expressions:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) # All lines including the blank ones
    lines = (line for line in lines if line) # Non-blank lines

Now, lines is all of the non-blank lines. This will save you from having to call strip on the line twice. If you want a list of lines, then you can just do:

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in) 
    lines = list(line for line in lines if line) # Non-blank lines in a list

You can also do it in a one-liner (excluding the with statement), but it's no more efficient and it's harder to read:

with open(filename) as f_in:
    lines = list(line for line in (l.strip() for l in f_in) if line)

Update:

I agree that this is ugly because of the repetition of tokens. You could just write a generator if you prefer:

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:
            yield line

Then call it like:

with open(filename) as f_in:
    for line in nonblank_lines(f_in):
        # Stuff

Update 2:

with open(filename) as f_in:
    lines = filter(None, (line.rstrip() for line in f_in))

and on CPython (whose deterministic reference counting closes the file promptly) you can even drop the with statement:

lines = filter(None, (line.rstrip() for line in open(filename)))

In Python 2, use itertools.ifilter if you want a generator; in Python 3, just pass the whole thing to list() if you want a list.
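
For example, a minimal Python 3 sketch (not from the original answer, reusing the same filename) could look like this:

with open(filename) as f_in:
    # filter() is lazy in Python 3; list() materialises the non-blank lines
    lines = list(filter(None, (line.rstrip() for line in f_in)))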


You could use a list comprehension:

with open("names", "r") as f:
    names_list = [line.strip() for line in f if line.strip()]

Updated: Removed unnecessary readlines().

To avoid calling line.strip() twice, you can use a generator:

names_list = [l for l in (line.strip() for line in f) if l]


If you want, you can just put what you had into a list comprehension:

names_list = [line for line in open("names.txt", "r").read().splitlines() if line]

or

all_lines = open("names.txt", "r").read().splitlines()
names_list = [name for name in all_lines if name]

splitlines() has already removed the line endings.
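
As a quick illustration (not from the original answer), here is what splitlines() returns for the sample names from the question:

sample = "Allman\nAtkinson\n\nBehlendorf\n"
print(sample.splitlines())    # ['Allman', 'Atkinson', '', 'Behlendorf']
print(sample.split("\n"))     # ['Allman', 'Atkinson', '', 'Behlendorf', ''] - note the extra ''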

I don't think those are as clear as just looping explicitly though:

names_list = []
with open('names.txt', 'r') as _:
    for line in _:
        line = line.strip()
        if line:
            names_list.append(line)

Edit:

filter also looks quite readable and concise, though:

names_list = filter(None, open("names.txt", "r").read().splitlines())


There is a simple solution that I started using recently, after going through so many answers here.

with open(file_name) as f_in:   
    for line in f_in:
        if len(line.split()) == 0:
            continue

This does the same job, ignoring every blank line.
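
For completeness, a minimal sketch along the same lines that actually collects the names (the names_list variable is just illustrative, not from the original answer):

names_list = []
with open(file_name) as f_in:
    for line in f_in:
        if len(line.split()) == 0:   # whitespace-only line
            continue
        names_list.append(line.strip())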


When text has to be processed just to extract data from it, I always think of regexes first, because:

  • as far as I know, regexes were invented for exactly that

  • iterating over lines seems clumsy to me: it essentially means searching for the newlines and then searching for the data to extract in each line, which makes two searches instead of a single direct one with a regex

  • bringing regexes into play is easy; only writing the regex string to be compiled into a regex object is sometimes hard, but in that case the treatment with an iteration over lines would be complicated too

For the problem discussed here, a regex solution is fast and easy to write:

import re
names = re.findall('\S+',open(filename).read())
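
As an illustration (not part of the original answer), this is what re.findall returns for the sample names from the question:

import re
sample = "Allman\nAtkinson\n\nBehlendorf \n"
print(re.findall(r'\S+', sample))   # ['Allman', 'Atkinson', 'Behlendorf']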

I compared the speeds of several solutions:

import re
from time import clock
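# Note: this benchmark is Python 2.7 code (xrange, print statements, time.clock);
# see the note after the timing results.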

A,AA,B1,B2,BS,reg = [],[],[],[],[],[]
D,Dsh,C1,C2 = [],[],[],[]
F1,F2,F3  = [],[],[]

def nonblank_lines(f):
    for l in f:
        line = l.rstrip()
        if line:  yield line

def short_nonblank_lines(f):
    for l in f:
        line = l[0:-1]
        if line:  yield line

for essays in xrange(50):

    te = clock()
    with open('raa.txt') as f:
        names_listA = [line.strip() for line in f if line.strip()] # Felix Kling
    A.append(clock()-te)

    te = clock()
    with open('raa.txt') as f:
        names_listAA = [line[0:-1] for line in f if line[0:-1]] # Felix Kling with line[0:-1]
    AA.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        namesB1 = [ name for name in (l.strip() for l in f_in) if name ] # aaronasterling without list()
    B1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesB2 = [ name for name in (l[0:-1] for l in f_in) if name ] # aaronasterling without list() and with line[0:-1]
    B2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        namesBS = [ name for name in f_in.read().splitlines() if name ] # a list comprehension with read().splitlines()
    BS.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f:
        xreg = re.findall('\S+',f.read()) #  eyquem
    reg.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesC1 = list(line for line in (l.strip() for l in f_in) if line) # aaronasterling
    C1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesC2 = list(line for line in (l[0:-1] for l in f_in) if line) # aaronasterling  with line[0:-1]
    C2.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        yD = [ line for line in nonblank_lines(f_in)  ] # aaronasterling  update
    D.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        yDsh = [ name for name in short_nonblank_lines(f_in)  ] # nonblank_lines with line[0:-1]
    Dsh.append(clock()-te)

    #-------------------------------------------------------
    te = clock()
    with open('raa.txt') as f_in:
        linesF1 = filter(None, (line.rstrip() for line in f_in)) # aaronasterling update 2
    F1.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF2 = filter(None, (line[0:-1] for line in f_in)) # aaronasterling update 2 with line[0:-1]
    F2.append(clock()-te)

    te = clock()
    with open('raa.txt') as f_in:
        linesF3 =  filter(None, f_in.read().splitlines()) # aaronasterling update 2 with read().splitlines()
    F3.append(clock()-te)


print 'names_listA == names_listAA==namesB1==namesB2==namesBS==xreg\n  is ',\
       names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
print 'names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3\n  is ',\
       names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3,'\n\n\n'


def displ((fr,it,what)):  print fr + str( min(it) )[0:7] + '   ' + what

map(displ,(('* ', A,    '[line.strip() for line in f if line.strip()]               * Felix Kling\n'),

           ('  ', B1,   '    [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()'),
           ('* ', C1,   'list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling\n'),          

           ('* ', reg,  're.findall("\S+",f.read())                                 * eyquem\n'),

           ('* ', D,    '[ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update'),
           ('  ', Dsh,  '[ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]\n'),

           ('* ', F1 ,  'filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2\n'),

           ('  ', B2,   '    [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]'),
           ('  ', C2,   'list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]\n'),

           ('  ', AA,   '[line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]\n'),

           ('  ', BS,   '[name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()\n'),

           ('  ', F2 ,  'filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]'),

           ('  ', F3 ,  'filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()'))
    )

The regex solution is straightforward and neat, though it isn't among the fastest ones. aaronasterling's solution with filter() is surprisingly fast for me (I wasn't aware of that particular speed of filter()), and the times of the optimized solutions go down to about 27 % of the slowest time. I wonder what makes the filter-splitlines combination so fast:

names_listA == names_listAA==namesB1==namesB2==namesBS==xreg
  is  True
names_listA == yD==yDsh==linesC1==linesC2==linesF1==linesF2==linesF3
  is  True 



* 0.08266   [line.strip() for line in f if line.strip()]               * Felix Kling

  0.07535       [name for name in (l.strip() for l in f_in) if name ]    aaronasterling without list()
* 0.06912   list(line for line in (l.strip() for l in f_in) if line)   * aaronasterling

* 0.06612   re.findall("\S+",f.read())                                 * eyquem

* 0.06486   [ line for line in       nonblank_lines(f_in)  ]           * aaronasterling  update
  0.05264   [ line for line in short_nonblank_lines(f_in)  ]             nonblank_lines with line[0:-1]

* 0.05451   filter(None, (line.rstrip() for line in f_in))             * aaronasterling update 2

  0.04689       [name for name in (l[0:-1]   for l in f_in) if name ]    aaronasterling without list() and with line[0:-1]
  0.04582   list(line for line in (l[0:-1]   for l in f_in) if line)     aaronasterling  with line[0:-1]

  0.04171   [line[0:-1] for line in f if line[0:-1]  ]                   Felix Kling with line[0:-1]

  0.03265   [name for name in f_in.read().splitlines() if name ]        a list comprehension with read().splitlines()

  0.03638   filter(None, (line[0:-1] for line in f_in))                  aaronasterling update 2 with line[0:-1]
  0.02198   filter(None, f_in.read().splitlines()                        aaronasterling update 2 with read().splitlines()

But this problem is a particular, very simple one: there is only one name per line. So the solutions are just games with lines, splittings and [0:-1] cuts.

By contrast, a regex doesn't care about lines at all: it goes straight for the desired data. I consider that a more natural way to solve the problem, one that applies from the simplest to the more complex cases, and hence it is often the approach to prefer when processing text.

EDIT

I forgot to say that I use Python 2.7 and that I measured the above times with a file containing 500 copies of the following block of names:

SMITH
JONES
WILLIAMS
TAYLOR
BROWN
DAVIES
EVANS
WILSON
THOMAS
JOHNSON

ROBERTS
ROBINSON
THOMPSON
WRIGHT
WALKER
WHITE
EDWARDS
HUGHES
GREEN
HALL

LEWIS
HARRIS
CLARKE
PATEL
JACKSON
WOOD
TURNER
MARTIN
COOPER
HILL

WARD
MORRIS
MOORE
CLARK
LEE
KING
BAKER
HARRISON
MORGAN
ALLEN

JAMES
SCOTT
PHILLIPS
WATSON
DAVIS
PARKER
PRICE
BENNETT
YOUNG
GRIFFITHS

MITCHELL
KELLY
COOK
CARTER
RICHARDSON
BAILEY
COLLINS
BELL
SHAW
MURPHY

MILLER
COX
RICHARDS
KHAN
MARSHALL
ANDERSON
SIMPSON
ELLIS
ADAMS
SINGH

BEGUM
WILKINSON
FOSTER
CHAPMAN
POWELL
WEBB
ROGERS
GRAY
MASON
ALI

HUNT
HUSSAIN
CAMPBELL
MATTHEWS
OWEN
PALMER
HOLMES
MILLS
BARNES
KNIGHT

LLOYD
BUTLER
RUSSELL
BARKER
FISHER
STEVENS
JENKINS
MURRAY
DIXON
HARVEY


Why is everyone doing it the hard way?

with open("myfile") as myfile:
    nonempty = filter(str.rstrip, myfile)

Convert nonempty into a list if you really want one, although I suggest keeping nonempty lazy, since in Python 3.x filter() already returns an iterator.

In Python 2.x you may use itertools.ifilter to do your bidding instead.
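
A minimal Python 2 sketch of that (assuming the same file as above):

from itertools import ifilter

with open("myfile") as myfile:
    nonempty = ifilter(str.rstrip, myfile)   # lazy, like Python 3's filter()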


You can use not to skip blank lines (this assumes the lines have already had their newlines stripped, e.g. by splitlines()):

for line in lines:
    if not line:
        continue


You can use the walrus operator (:=) on Python >= 3.8:

with open('my_file') as fd:
    nonblank = [stripped for line in fd if (stripped := line.strip())]

Read it as: keep stripped (defined as line.strip()) if it is truthy.


@S.Lott

The following code processes the lines one at a time and doesn't hold the whole file in memory:

filename = 'english names.txt'

with open(filename) as f_in:
    lines = (line.rstrip() for line in f_in)
    lines = (line for line in lines if line)
    the_strange_sum = 0
    for l in lines:
        the_strange_sum += 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'.find(l[0])

print(the_strange_sum)

So the stacked generator expressions (line.rstrip() for line in f_in) are just as acceptable as the nonblank_lines() function.


What about gensim's LineSentence class? It will ignore such lines. From its documentation:

Simple format: one sentence = one line; words already preprocessed and separated by whitespace.

source can be either a string or a file object. Clip the file to the first limit lines (or not clipped if limit is None, the default).

from gensim.models.word2vec import LineSentence
text = LineSentence('text.txt')
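
A hedged usage sketch (assuming gensim is installed and that text.txt holds the names): iterating over a LineSentence yields one list of whitespace-separated tokens per line, and, as noted above, blank lines should be skipped.

from gensim.models.word2vec import LineSentence

for tokens in LineSentence('text.txt'):
    print(tokens)   # e.g. ['Allman'], ['Atkinson'], ['Behlendorf']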
