开发者

Python 3.1 - Memory Error during sampling of a large list

开发者 https://www.devze.com 2023-02-04 07:07 出处:网络
The input list can be more than 1 million numbers. When I run the following code with smaller \'repeats\', its fine;

The input list can be more than 1 million numbers. When I run the following code with smaller 'repeats', its fine;

def sample(x):
    length = 1000000 
    new_array = random.sample((list(x)),length)
    return (new_array)

def repeat_sample(x):    
    i = 0
    repeats = 100
    list_of_samples = []
    for i in range(repeats):
       list_of_samples.append(sample(x))
    return(list_of_samples)

repeat_sample(large_array)

However, using high repeats such as the 100 above, results in MemoryError. Traceback is as follows;

Traceback (most recent call last):
  File "C:\Python31\rnd.py", line 221, in <module>
 开发者_StackOverflow   STORED_REPEAT_SAMPLE = repeat_sample(STORED_ARRAY)
  File "C:\Python31\rnd.py", line 129, in repeat_sample
    list_of_samples.append(sample(x))
  File "C:\Python31\rnd.py", line 121, in sample
    new_array = random.sample((list(x)),length)
  File "C:\Python31\lib\random.py", line 309, in sample
    result = [None] * k
MemoryError

I am assuming I'm running out of memory. I do not know how to get around this problem.

Thank you for your time!


Expanding on my comment:

Let's say the processing you do to each sample is calculate its mean.

def mean(samplelists):
    means = []
    n = float(len(samplelists[0]))
    for sample in samplelists:
        mean = sum(sample)/n
        means.append(mean)
    return means

calc_means(repeat_sample(large_array))

This is going to make you sweat holding all those lists in memory. You can get it much lighter like this:

def mean(sample, n):
    n = float(n)
    mean = sum(sample)/n
    return mean

def sample(x):
    length = 1000000 
    new_array = random.sample(x, length)
    return new_array

def repeat_means(x):    
    repeats = 100
    list_of_means = []
    for i in range(repeats):
        list_of_means.append(mean(sample(x)))
    return list_of_means    

repeat_means(large_array)

But that's still not good enough... You can do it all with only ever constructing your list of results:

import random

def sampling_mean(population, k, times):
    # Part of this is lifted straight from random.py
    _int = int
    _random = random.random

    n = len(population)
    kf = float(k)
    result = []

    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"

    for t in range(times):
        selected = set()
        sum_ = 0
        selected_add = selected.add

        for i in xrange(k):
            j = _int(_random() * n)
            while j in selected:
                j = _int(_random() * n)
            selected_add(j)
            sum_ += population[j]

        mean = sum_/kf
        result.append(mean)
    return result

sampling_mean(x, 1000000, 100)

Now, can your algorithm be streamlined like this?


Two answers:

  1. Unless you're using an old machine, it's unlikely that you actually run out of memory. You get a MemoryError because you're probably using a 32-bit build of Python and that you can't allocate more than 2GB of memory.

  2. Your approach is wrong. You should use a random sample generator instead of building a list of samples.


A generator version of random.sample() would also help:

from random import random
from math import ceil as _ceil, log as _log

def xsample(population, k):
    """A generator version of random.sample"""
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    _int = int
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            yield pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in range(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                yield population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            for x in sample(tuple(population), k):
                yield x


The only improvement you can do is to change your code to:

list_of_samples = [random.sample(x, length) for _ in range(repeats)]

It wouldn't change the fact, however, that you cannot create arbitrary-length list in the real world.


You could try to use array object http://docs.python.org/py3k/library/array.html. It should be much more memory effective than list, but probably a little harder to use.

0

精彩评论

暂无评论...
验证码 换一张
取 消