
Faster replacing in list with a lot of matches

https://www.devze.com 2023-02-24 01:33 (source: web)
Just a small problem with lists and replacing some list entries.

Some background on my problem. My idea is really simple and easy: I use the mmap module to read out bigger files. These are FORTRAN files with 7 columns and one million lines. Some values didn't fit the format of the FORTRAN output, so instead of a number I just get ten stars. I can't change the output format in the source code, so I have to deal with this. After loading the file with mmap I use str.split() to convert the data to a list, and then I search for the bad values. Look at the following source code:

import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()
for i in range(A.count('********')):
    A[A.index('********')] = '0.0'

I know it's probably not the best solution, but it's quick and dirty. Well, it's quick if A.count('********') is small, and that is actually my problem: for some files the replacing method isn't fast at all. If the count is too big, it takes a lot of time. Is there another method, or a totally different approach, to replace my bad values without wasting a lot of time?

Thanks for any help or any suggestions.

EDIT:

How does the method list.count() work? I could also run through the whole list and do the replacing myself:

for k in range(len(A)):
    if A[k] == '**********': A[k] = '0.0'

This would be faster for many replacements. But would it still be faster if I only had one match?
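One way to answer that empirically is a quick timing sketch. This is a self-contained example on synthetic data (the list size, the fraction of bad values, and the helper names are all made up for illustration), comparing the count/index approach from the question against a single linear scan:

```python
import timeit

# Synthetic data: 100,000 tokens, a bad value every 1,000 entries
N = 100_000
bad = '**********'
A = ['1.0'] * N
A[::1000] = [bad] * len(A[::1000])

def replace_with_index(lst):
    """count/index approach from the question: O(n) scan per match."""
    lst = lst[:]
    for _ in range(lst.count(bad)):
        lst[lst.index(bad)] = '0.0'
    return lst

def replace_with_scan(lst):
    """Single linear pass: O(n) total, regardless of how many matches."""
    lst = lst[:]
    for k in range(len(lst)):
        if lst[k] == bad:
            lst[k] = '0.0'
    return lst

print('count/index:', timeit.timeit(lambda: replace_with_index(A), number=1))
print('linear scan:', timeit.timeit(lambda: replace_with_scan(A), number=1))
```

With many matches the single scan should win clearly; with exactly one match the count/index version can be competitive, because count and index run in C, but its cost still grows with where the match sits in the list.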


The main problem in your code is the use of A.index inside the loop. The index method walks linearly through your list, from the start up to the next occurrence of '********', which turns an O(n) problem into an O(n²) one; hence your perceived lack of performance.

In Python, the most obvious way is usually the best way: walking through your list in a Python for loop will in this case undoubtedly beat the O(n²) work done in C by the count and index methods. The not-so-obvious part is the recommended use of the built-in function enumerate to get both an item's value and its index in the for loop.

import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()
for i, value in enumerate(A):
    if value == "********":
        A[i] = "0.0"


If you are eventually going to convert this to an array, you might consider using numpy and np.genfromtxt, which has the ability to deal with missing data:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

With a binary file, you can use np.memmap and then use masked arrays to deal with the missing elements.


fin = open(fname, 'r')
fout = open(fname + '_fixed', 'w')
for line in fin:
    # replace 10 asterisks by 7 spaces + '0.0'
    # If you don't mind losing the fixed-column-width format, 
    # omit the seven spaces
    line = line.replace('**********', '       0.0')
    fout.write(line)
fin.close()
fout.close()

Alternatively, if your file is smallish, replace the loop with this:

fout.write(fin.read().replace('**********', '       0.0'))


If you first change all the bad values in the huge string representation with a single call to the replace() method, and only then split it, you'd get the same result, likely a lot faster. Something like:

import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).replace('********', '0.0').split()

It would use a lot of memory, but that's often the trade-off for speed.
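To see why this is equivalent, here is a small in-memory illustration (the sample string is hypothetical); the replace happens in a single pass in C, while the per-token fix loops in Python:

```python
# Hypothetical sample standing in for the mmapped file contents
raw = "1.5 2.5 ********** 4.5\n********** 6.5 7.5 8.5\n"

# Replace first, then split: one scan over the whole string, in C
fast = raw.replace('**********', '0.0').split()

# Split first, then fix tokens one by one in Python
slow = ['0.0' if v == '**********' else v for v in raw.split()]

assert fast == slow
print(fast)
```

One caveat: str.replace works on substrings, not whole tokens, so this is only safe when the bad field is always exactly the same run of stars, as it is with a fixed-width FORTRAN format.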


Instead of manipulating A, try using a list comprehension to make a new A:

A = [v if v != '********' else '0.0' for v in A]

I think you'll find this surprisingly fast.

