Iterate through words of a file in Python

Developer https://www.devze.com 2023-04-13 02:23 Source: Internet
I need to iterate through the words of a large file that consists of a single, very long line. I am aware of methods for iterating through a file line by line, but they are not applicable in my case because of the file's single-line structure.

Any alternatives?


It really depends on your definition of word. But try this:

f = open("your-filename-here").read()
for word in f.split():
    # do something with word
    print(word)

This will use whitespace characters as word boundaries.

Of course, remember to open and close the file properly; this is just a quick example.
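
To make the whitespace rule concrete, here is a small sketch (the sample string is made up for illustration). With no argument, split() treats any run of spaces, tabs, or newlines as a single boundary, while split(" ") splits on each individual space only:

```python
sample = "alpha  beta\tgamma\ndelta "

# No argument: any run of whitespace is one boundary, and no empty
# strings are produced for leading, trailing, or repeated whitespace.
words_any = sample.split()

# Explicit single space: tabs and newlines stay inside the "words",
# and repeated or trailing spaces produce empty strings.
words_space = sample.split(" ")

print(words_any)
print(words_space)
```

Which behavior counts as correct depends on your definition of a word, as noted above.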


Long long line? I assume the line is too big to reasonably fit in memory, so you want some kind of buffering.

First of all, this is a bad format; if you have any kind of control over the file, make it one word per line.

If not, use something like:

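def iter_words(input_file):
    # Generator: yields space-separated words, reading the file in chunks
    line = ''
    while True:
        word, space, line = line.partition(' ')
        if space:
            # A word was found (skip empty strings from repeated spaces)
            if word:
                yield word
        else:
            # A word was not found; read a chunk of data from file
            next_chunk = input_file.read(1000)
            if next_chunk:
                # Add the chunk to our line
                line = word + next_chunk
            else:
                # No more data; yield the last word (if any) and return
                last_word = word.rstrip('\n')
                if last_word:
                    yield last_word
                return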
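
The loop above splits on single spaces only. If tabs and newlines should also count as boundaries, one possible variation is sketched below. The function name iter_words_chunked and the chunk_size default are assumptions for illustration, not part of the answer above. It reads fixed-size chunks, splits each on any whitespace, and carries a possibly incomplete last word over to the next chunk:

```python
import io


def iter_words_chunked(stream, chunk_size=4096):
    """Yield whitespace-separated words from a text stream, chunk by chunk."""
    tail = ""  # possibly incomplete word carried over from the previous chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of file
        chunk = tail + chunk
        parts = chunk.split()
        if not parts:
            # The chunk was whitespace only; nothing to carry over
            tail = ""
            continue
        if chunk[-1].isspace():
            # The chunk ended on a boundary, so every part is a whole word
            tail = ""
        else:
            # The last part may continue in the next chunk; hold it back
            tail = parts.pop()
        yield from parts
    if tail:
        yield tail


# Usage with an in-memory stream standing in for the large file
for word in iter_words_chunked(io.StringIO("one  two\tthree\nfour"), chunk_size=3):
    print(word)
```

Memory use is bounded by the chunk size plus the length of the longest word, rather than by the length of the line.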


You really should consider using a generator:

def word_gen(file):
    for line in file:
        for word in line.split():
            yield word

with open('somefile') as f:
    # The generator does nothing until it is iterated
    for word in word_gen(f):
        print(word)
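
As a quick check of the generator's laziness, here is a sketch that uses an in-memory io.StringIO as a stand-in for the file (the sample text is made up) and pulls only the first few words with itertools.islice. Note that `for line in file` still reads one whole line at a time, so on a file made of a single huge line this approach holds that entire line in memory:

```python
import io
from itertools import islice


def word_gen(file):
    # Same generator as above, repeated so the example is self-contained
    for line in file:
        for word in line.split():
            yield word


fake_file = io.StringIO("first line here\nsecond line\n")

# Only as many lines as needed are read to produce four words
first_four = list(islice(word_gen(fake_file), 4))
print(first_four)
```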


There are more efficient ways of doing this, but syntactically, this might be the shortest:

words = open('myfile').read().split()

If memory is a concern, you won't want to do this, because it loads the entire file into memory instead of iterating over it.
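
If memory is the concern, one alternative worth trying is to memory-map the file and scan it with a regular expression, so the operating system pages data in as needed rather than Python holding it all as one string. This is only a sketch under assumptions: the function name iter_words_mmap is made up, the file is assumed to be non-empty (mmap rejects empty files), and words are assumed to be separated by ASCII whitespace in a UTF-8 file:

```python
import mmap
import re


def iter_words_mmap(path, encoding="utf-8"):
    """Yield words from the file at `path` by scanning a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; this raises ValueError if it is empty
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # A bytes pattern is required because the map exposes bytes;
            # \S+ matches each run of non-whitespace bytes
            for match in re.finditer(rb"\S+", mm):
                yield match.group().decode(encoding)
```

It would be used like the other generators here, for example `for word in iter_words_mmap('myfile'): ...`.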


I've answered a similar question before, but I have refined the method used in that answer and here is the updated version (copied from a recent answer):

Here is my totally functional approach which avoids having to read and split lines. It makes use of the itertools module:

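Note: this is written for Python 3; on Python 2, replace map with itertools.imap.

import itertools

def readwords(mfile):
    # Read one character at a time until read() returns '' at end of file
    chars = itertools.takewhile(bool, map(mfile.read, itertools.repeat(1)))
    # Group consecutive characters by whether they are whitespace
    byte_stream = itertools.groupby(chars, str.isspace)
    # Join each non-whitespace group into a word
    return ("".join(group) for pred, group in byte_stream if not pred)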

Sample usage:

>>> import sys
>>> for w in readwords(sys.stdin):
...     print (w)
... 
I really love this new method of reading words in python
I
really
love
this
new
method
of
reading
words
in
python
           
It's soo very Functional!
It's
soo
very
Functional!
>>>

I guess in your case, this would be the way to use the function:

with open('words.txt', 'r') as f:
    for word in readwords(f):
        print(word)


Read in the line as normal, then split it on whitespace to break it down into words?

Something like:

word_list = loaded_string.split()


After reading the line you could do:

# line is the string read from the file; pattern is the substring to find
l = len(pattern)
i = 0
while True:
    i = line.find(pattern, i)
    if i == -1:
        break
    print(line[i:i+l])  # or do whatever
    i += l

Alex.
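
The same find() loop can be wrapped in a generator so the caller decides what to do with each hit. This is a sketch only; the name find_all and the choice to yield start indices are assumptions made for illustration:

```python
def find_all(text, pattern):
    """Yield the start index of each non-overlapping occurrence of pattern."""
    if not pattern:
        return  # an empty pattern would otherwise never advance
    step = len(pattern)
    i = text.find(pattern)
    while i != -1:
        yield i
        # Resume the search just past the match that was found
        i = text.find(pattern, i + step)


for start in find_all("abcabcab", "ab"):
    print(start)
```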


What Donald Miner suggested looks good: simple and short. I used the code below in something I wrote some time ago:

l = []
# Default text mode already handles universal newlines in Python 3
with open("filename.txt") as f:
    for line in f:
        for word in line.split():
            l.append(word)

It is a longer version of what Donald Miner suggested.
