Large TXT file Parsing Problem in python

Developer https://www.devze.com 2023-03-22 13:28 Source: Internet

Been trying to figure this one out all day. I have a large text file (546 MB) that I am trying to parse in Python, looking to pull out the text between the open tag and the close tag, and I keep getting memory problems. With the help of the good folks on this board, this is what I have so far.

answer = ''
output_file = open('/Users/Desktop/Poetrylist.txt','w')

with open('/Users/Desktop/2e.txt','r') as open_file:
    for each_line in open_file:
        if each_line.find('<A>'):
            start_position = each_line.find('<A>')
            start_position = start_position + 3
            end_position = each_line[start_position:].find('</W>')

            answer = each_line[start_position:end_position] + '\n'
            output_file.write(answer)

output_file.close()

I am getting this error message:

Traceback (most recent call last):
  File "C:\Users\Adam\Desktop\OEDsearch3.py", line 9, in <module>
    end_position = each_line[start_position:].find('</W>')
MemoryError

I have little to no programming experience and I am trying to figure this out for a poetry project I am working on. Any help is greatly appreciated.

Your logic is wrong because .find() returns -1 if the string is not found, and -1 is a truthy value, so your code will think every line has <A> in it. (The only line it skips is one that starts with <A>, where .find() returns 0, which is falsy.)
You don't need to make a new substring to find the '</W>', because .find() also takes an optional start argument.
Neither of these explains why you are running out of memory. Do you have an unusually small-memory machine?
Are you sure you're showing us all the code?
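
To make the first two points concrete, here is a minimal sketch. The helper name `extract_between` is hypothetical, not something from the question's code. It compares the result of `.find()` against -1 explicitly and passes the start offset to the second `.find()` instead of slicing, which also keeps the end index relative to the whole line rather than to a sliced copy.

```python
def extract_between(line, open_tag='<A>', close_tag='</W>'):
    """Return the text between open_tag and close_tag, or None if either is missing."""
    start = line.find(open_tag)
    if start == -1:                      # -1 means "not found"; it is truthy, so test it explicitly
        return None
    start += len(open_tag)
    end = line.find(close_tag, start)    # search from `start`; no substring copy is made
    if end == -1:
        return None
    return line[start:end]


# .find() returns an index, not a boolean:
print(bool('no tags here'.find('<A>')))    # True, because -1 is truthy
print(bool('<A>poem</W>'.find('<A>')))     # False, because index 0 is falsy
print(extract_between('x <A>poem</W> y'))  # poem
```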

EDITED: OK, now I think your file only has one line in it.

Try changing your code like this:

with open('/Users/Desktop/Poetrylist.txt','w') as output_file:
    with open('/Users/Desktop/2e.txt','r') as open_file:
        # Reads the entire file into memory as one string.
        the_whole_file = open_file.read()
        start_position = 0
        while True:
            start_position = the_whole_file.find('<A>', start_position)
            if start_position < 0:
                break
            start_position += 3
            end_position = the_whole_file.find('</W>', start_position)
            if end_position < 0:
                # No closing tag left; stop rather than loop back to the start.
                break
            output_file.write(the_whole_file[start_position:end_position])
            output_file.write("\n")
            start_position = end_position + 4
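
If holding the whole file in memory is itself the problem, one alternative is to read fixed-size chunks and keep only the unfinished tail between reads. The following is a rough sketch under that assumption, not the method from the answer above; the name `iter_tagged` and the `chunk_size` parameter are hypothetical. It assumes non-empty tags and that each tagged entry is small enough to fit in memory, and it silently drops an entry whose closing tag never appears.

```python
import io


def iter_tagged(stream, open_tag='<A>', close_tag='</W>', chunk_size=1 << 20):
    """Yield each text found between open_tag and close_tag, reading stream in chunks."""
    buffer = ''
    while True:
        chunk = stream.read(chunk_size)
        buffer += chunk
        pos = 0
        while True:
            start = buffer.find(open_tag, pos)
            if start == -1:
                # No open tag left: keep only a short tail, in case an open tag
                # is split across two chunks.
                keep = len(open_tag) - 1
                buffer = buffer[max(pos, len(buffer) - keep):]
                break
            end = buffer.find(close_tag, start + len(open_tag))
            if end == -1:
                # Entry is incomplete: keep it from the open tag on and read more.
                buffer = buffer[start:]
                break
            yield buffer[start + len(open_tag):end]
            pos = end + len(close_tag)
        if not chunk:   # read() returned '', so the end of the file was reached
            return


# Small in-memory demonstration with a deliberately tiny chunk size.
sample = io.StringIO('junk<A>one</W>more<A>two</W>tail<A>unfinished')
print(list(iter_tagged(sample, chunk_size=4)))   # ['one', 'two']
```

With real files, the same generator could be driven by `for text in iter_tagged(open_file): output_file.write(text + "\n")` inside the two `with open(...)` blocks shown above.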

I think you might be running into a problem with line endings. iter(open_file) is supposed to return each line separately, but it might guess the line terminator incorrectly, since it varies from OS to OS. You can get Python to treat any OS's line ending as a line ending for the purposes of readlines/iter by adding a "U" to the flags passed to open. Try this:

with open('/Users/Desktop/2e.txt','rU') as open_file:
#                                   ^

with the rest all the same. (comment added for emphasis).
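
On Python 3 the "U" flag should not be needed: text mode applies universal newlines by default (`newline=None`), and the flag itself was removed in Python 3.11. A small sketch, assuming Python 3 and using a throwaway temporary file rather than the real 2e.txt:

```python
import os
import tempfile

# Write a small file with old Mac-style '\r' line endings.
# Binary mode is used so that nothing is translated on the way out.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'first\rsecond\rthird\r')
    path = tmp.name

# newline=None is the text-mode default: '\r', '\n' and '\r\n' all end a line
# and are translated to '\n' when reading.
with open(path, 'r', newline=None) as open_file:
    lines = [line for line in open_file]

os.remove(path)
print(lines)   # ['first\n', 'second\n', 'third\n']
```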

Are you sure you want to use

if each_line.find('<A>'):

find() returns -1 if the substring is not found, so the condition is true even when there is no match.
