
A RAM error with a big array

I need to get the numbers of one row at random, put each row into another array, and then get the numbers of one column.

I have a big file, more than 400 MB. It contains 13496*13496 numbers, i.e. 13496 rows and 13496 columns. I want to read them into an array. This is my code:

_L1 = [[0 for col in range(13496)] for row in range(13496)]
_L1file = open('distanceCMD.function.txt')
i = 0
while (i < 13496):
    print "i=" + str(i)
    _strlf = _L1file.readline()
    _strlf = _strlf.split('\t')
    _strlf = _strlf[:-1]          # drop the empty field after the trailing tab
    _L1[i] = _strlf
    i += 1
_L1file.close()

And this is my error message:

Traceback (most recent call last):
  File "D:\research\space-function\ART3.py", line 30, in <module>
    _strlf = _strlf.split('\t')
MemoryError


You might want to approach your problem in another way: process the file line by line. I don't see a need to store the whole big file in an array. Otherwise, you might want to tell us what you are actually trying to do.

for line in open("400MB_file"):
    # do something with line.

Or

f=open("file")
for linenum,line in enumerate(f):
    if linenum+1 in [2,3,10]:
         print "there are ", len(line.split())," columns" #assuming you want to split on spaces
         print "100th column value is: ", line.split()[99]
    if linenum+1>10:
         break # break if you want to stop after the 10th line
f.close()


This is a simple case of your program demanding more memory than is available to the computer. An array of 13496x13496 elements requires 182,142,016 'cells', where a cell is a minimum of one byte (if storing chars) and potentially several bytes (if storing floating-point numerics, for example). I'm not even taking your particular runtime's array metadata into account, though this would typically be a tiny overhead on a simple array.

Assuming each array element is just a single byte, your computer needs around 180 MB of RAM to hold it in memory in its entirety. Trying to process it all at once could well be impractical.
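
For a quick sanity check, the arithmetic can be spelled out directly (a rough estimate only: it assumes one byte per cell and ignores any per-object overhead):

cells = 13496 * 13496      # 182,142,016 cells
print(cells / 1e6)         # ~182 MB at one byte per cell, before any overhead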

You need to think about the problem a different way: as has already been mentioned, a line-by-line approach might be a better option. Or you could process the grid in smaller units, say 10x10 or 100x100 blocks, and aggregate the results. Or maybe the problem itself can be expressed in a different form that avoids the need to hold the entire dataset in memory at all?
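
To illustrate the block-by-block idea, here is a minimal sketch that reads a fixed number of rows at a time and keeps a running per-column sum; the block size, the filename and the column-sum aggregation are placeholders for whatever your actual computation needs:

from itertools import islice

BLOCK_ROWS = 100                    # placeholder block size
column_sums = [0.0] * 13496         # running aggregate, one slot per column

with open('distanceCMD.function.txt') as f:
    while True:
        block = list(islice(f, BLOCK_ROWS))   # next BLOCK_ROWS lines (or fewer at EOF)
        if not block:
            break
        for line in block:
            values = line.split('\t')[:-1]    # mirrors the trailing-field handling in the question
            for col, val in enumerate(values):
                column_sums[col] += float(val)

Only one block of rows is ever held in memory at a time, so the peak footprint is a few megabytes rather than the whole grid.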

If you give us a little more detail on the nature of the data and the objective, perhaps someone will have an idea to make the task more manageable.


Short answer: the Python object overhead is killing you. In Python 2.x on a 64-bit machine, a list of strings consumes 48 bytes per list entry even before accounting for the content of the strings. That's over 8.7 GB of overhead for the size of array you describe. On a 32-bit machine it'll be a bit better: only 28 bytes per list entry.

Longer explanation: you should be aware that Python objects themselves can be quite large: even simple objects like ints, floats and strings. In your code you're ending up with a list of lists of strings. On my (64-bit) machine, even an empty string object takes up 40 bytes, and to that you need to add 8 bytes for the list pointer that's pointing to this string object in memory. So that's already 48 bytes per entry, or around 8.7 GB. Given that Python allocates memory in multiples of 8 bytes at a time, and that your strings are almost certainly non-empty, you're actually looking at 56 or 64 bytes (I don't know how long your strings are) per entry.

Possible solutions:

(1) You might do (a little) better by converting your entries from strings to ints or floats as appropriate.

(2) You'd do much better by either using Python's array type (not the same as list!) or by using numpy: then your ints or floats would only take 4 or 8 bytes each (see the numpy sketch below).
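
As an illustration of option (2), numpy.loadtxt can parse the file straight into a compact float array. This is just a sketch: it assumes the file is plain tab-separated numbers and that roughly 730 MB of RAM for a float32 array is acceptable.

import numpy as np

# A float32 takes 4 bytes, so the full 13496 x 13496 grid is ~730 MB
# instead of several GB of Python string/list objects.
# If each line ends with a trailing tab (as the [:-1] in the question suggests),
# add usecols=range(13496) to skip the empty final field.
data = np.loadtxt('distanceCMD.function.txt', dtype=np.float32, delimiter='\t')

row = data[42]       # one whole row
col = data[:, 99]    # one whole column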

Since Python 2.6, you can get basic information about object sizes with the sys.getsizeof function. Note that if you apply it to a list (or other container) then the returned size doesn't include the size of the contained list objects; only of the structure used to hold those objects. Here are some values on my machine.

>>> import sys
>>> sys.getsizeof("")
40
>>> sys.getsizeof(5.0)
24
>>> sys.getsizeof(5)
24
>>> sys.getsizeof([])
72
>>> sys.getsizeof(range(10))  # 72 + 8 bytes for each pointer
152


MemoryError exception:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C’s malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

It seems that, at least in your case, reading the entire file into memory is not a workable option.


Replace this:

_strlf = _strlf[:-1]

with this:

_strlf = [float(val) for val in _strlf[:-1]]

You are making a big array of strings. I can guarantee that the string "123.00123214213" takes a lot less memory when you convert it to floating point.

You might want to include some handling for null values.

You can also go to numpy's array type, but your problem may be too small to bother.
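
A minimal sketch of that conversion, with a guard for blank or missing fields (the 0.0 fallback is just a placeholder; substitute whatever sentinel suits your data):

def parse_row(line):
    # Convert one tab-separated line to floats, tolerating blank fields.
    fields = line.split('\t')[:-1]   # drop the empty field after the trailing tab
    return [float(f) if f.strip() else 0.0 for f in fields]

Used in place of the split and slice lines inside the original loop, it leaves each stored row as a list of floats rather than strings.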

