Large TXT File Parsing Problem in Python

Developer https://www.devze.com 2023-03-22 13:28 Source: Internet
Been trying to figure this one out all day. I have a large text file (546 MB) that I am trying to parse in Python, looking to pull out the text between the open tag and the close tag, and I keep getting memory problems. With the help of the good folks on this board, this is what I have so far.

answer = ''
output_file = open('/Users/Desktop/Poetrylist.txt','w')

with open('/Users/Desktop/2e.txt','r') as open_file:
    for each_line in open_file:
        if each_line.find('<A>'):
            start_position = each_line.find('<A>')
            start_position = start_position + 3
            end_position = each_line[start_position:].find('</W>')

            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)

output_file.close()

I am getting this error message:

Traceback (most recent call last):
  File "C:\Users\Adam\Desktop\OEDsearch3.py", line 9, in <module>
    end_position = each_line[start_position:].find('</W>')
MemoryError

I have little to no programming experience and I am trying to figure this out for a poetry project I am working on. Any help is greatly appreciated.


  1. Your logic is wrong because .find() returns -1 if the string is not found, and -1 is a truthy value, so your code will think every line has <A> in it. (It will also skip any line that starts with <A>, because .find() returns 0 there and 0 is falsy.)

  2. You don't need to make a new substring to find the '</W>', because .find() also has an optional start argument.

  3. Neither of these explains why you are running out of memory. Do you have an unusually small-memory machine?

  4. Are you sure you're showing us all the code?
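
Points 1 and 2 can be combined into one small helper. This is only a sketch: the function name extract_between and its default arguments are made up for illustration, and it returns only the first match on a line.

```python
def extract_between(line, open_tag='<A>', close_tag='</W>'):
    """Return the text between the first open_tag and the next close_tag, or None."""
    start = line.find(open_tag)
    if start == -1:
        # Compare against -1 explicitly; a bare `if line.find(...)` would
        # treat -1 as True and 0 (match at the start) as False.
        return None
    start += len(open_tag)
    # Pass `start` to find() instead of slicing the line first, so no copy
    # is made and the result is an index into the whole line.
    end = line.find(close_tag, start)
    if end == -1:
        return None
    return line[start:end]


print(extract_between('x<A>poem</W>y'))  # poem
```

Because the second find() is given a start offset, end_position no longer has to be shifted by hand, which is what the slice in the original code got wrong.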

EDITED: OK, now I think your file only has one line in it.

Try changing your code like this:

with open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Desktop/2e.txt','r') as open_file:
        # Read everything at once, since the file appears to be one long line.
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break  # no more opening tags
            start_position += 3  # skip past '<A>'
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                break  # opening tag without a closing tag; stop here
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4  # skip past '</W>'
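
If reading the whole file at once still runs out of memory, the same search can be done over fixed-size chunks. The following is a rough sketch rather than a drop-in replacement: iter_tagged and chunk_size are hypothetical names, and the buffer still grows if an opening tag is never closed.

```python
import io


def iter_tagged(stream, open_tag='<A>', close_tag='</W>', chunk_size=1 << 20):
    """Yield each text found between open_tag and close_tag, reading in chunks."""
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            start = buffer.find(open_tag, pos)
            if start == -1:
                # Keep a short tail in case an open tag is split across chunks.
                keep_from = max(pos, len(buffer) - (len(open_tag) - 1))
                buffer = buffer[keep_from:]
                break
            end = buffer.find(close_tag, start + len(open_tag))
            if end == -1:
                # Incomplete match: keep everything from the open tag onwards.
                buffer = buffer[start:]
                break
            yield buffer[start + len(open_tag):end]
            pos = end + len(close_tag)
        if not chunk:
            break  # end of input; any unfinished match is dropped


sample = io.StringIO('a<A>one</W>b<A>two</W>c')
print(list(iter_tagged(sample, chunk_size=4)))  # ['one', 'two']
```

With a real file, the stream would be the open_file object from the code above, and each yielded string would be written to output_file followed by a newline.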


I think you might be running into a problem with line endings. iter(open_file) is supposed to return each line separately, but it might guess the line terminator incorrectly, since terminators vary from OS to OS. You can get Python to treat the line ending of any OS as a line ending for the purposes of readlines/iter by adding a "U" to the mode flags passed to open. Try this:

with open('/Users/Desktop/2e.txt','rU') as open_file:
#                                   ^

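with the rest all the same (the caret comment is only there for emphasis). Note that this applies to Python 2: in Python 3, text mode handles universal newlines by default, and the "U" flag was removed in Python 3.11.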
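
To check how line endings are handled on a given setup, a small experiment along these lines may help. It assumes Python 3, where plain 'r' mode translates '\r', '\r\n' and '\n' alike; the temporary file is only there to make the sketch self-contained.

```python
import os
import tempfile

# Write raw bytes that mix old Mac (\r), Windows (\r\n) and Unix (\n) endings.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'first\rsecond\r\nthird\n')
    path = tmp.name

try:
    # Default text mode (newline=None) enables universal newlines in Python 3.
    with open(path, 'r') as open_file:
        lines = [line for line in open_file]
finally:
    os.remove(path)

print(lines)  # ['first\n', 'second\n', 'third\n']
```

If the real file still comes back as a single line under these conditions, it most likely contains no line terminators at all.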


Are you sure you want to use

if each_line.find('<A>'):

find() returns -1 if the substring is not found, so the condition will be true even when there are no matches.
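
A quick way to see the problem, using two made-up sample lines:

```python
line_without_tag = 'no tags here'
line_starting_with_tag = '<A>poem</W>'

# find() returns -1 when nothing matches, and -1 is truthy.
missing = line_without_tag.find('<A>')
# find() returns 0 when the match is at the very start, and 0 is falsy.
at_start = line_starting_with_tag.find('<A>')

print(missing, bool(missing))    # -1 True
print(at_start, bool(at_start))  # 0 False

# The `in` operator states the intent directly.
print('<A>' in line_without_tag)        # False
print('<A>' in line_starting_with_tag)  # True
```

So the test is backwards in both directions; `if '<A>' in each_line:` (or an explicit comparison with -1) is the safer choice.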
