Just a small problem with lists and replacing some list entries.
Maybe some information about my problem first. My idea is really simple and easy. I use the module mmap
to read out bigger files. These are FORTRAN files which have 7 columns and one million lines. Some values don't fulfill the format of the FORTRAN output, and I just get ten stars instead. I can't change the format of the output inside the source code, so I have to deal with this problem. After loading the file with mmap
I use str.split()
to convert the data to a list and then I search for the bad values. Look at the following source code:
import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()
for i in range(A.count('**********')):
    A[A.index('**********')] = '0.0'
I know it's probably not the best solution, but it's quick and dirty. OK, it's quick if A.count('**********')
is small. Actually, this is my problem: for some files the replacing method doesn't run very fast. If the count is too big, it takes a lot of time. Is there any other method, or a totally different approach, to replace my bad values without wasting a lot of time?
Thanks for any help or any suggestions.
EDIT:
How does the method list.count()
work? I could also run through the whole list and do the replacing myself:
for k in range(len(A)):
    if A[k] == '**********': A[k] = '0.0'
This would be faster for many replacements. But would it be faster if I only had one match?
The main problem in your code is the use of A.index inside the loop. The index
method walks linearly through your list, from the start up to the next occurrence of '**********' - this turns an O(n) problem into an O(n²) one; hence your perceived lack of performance.
In Python the most obvious way is usually the best way to do it: walking through your list in a plain Python for
loop will in this case undoubtedly beat the O(n²) loops in C performed by the count and index methods. The not-so-obvious part is the recommended use of the built-in function enumerate to get both an item's value and its index from the list in the for loop.
import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).split()
for i, value in enumerate(A):
    if value == '**********':
        A[i] = '0.0'
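If you want to see the difference yourself, here is a minimal timing sketch; the list contents and sizes are invented purely for illustration:
import timeit

setup = """
import random
A = ['1.0'] * 100000 + ['**********'] * 1000  # made-up sizes, for illustration only
random.shuffle(A)
"""

# count/index version: every index() call rescans the list from the start
slow = """
B = list(A)
for i in range(B.count('**********')):
    B[B.index('**********')] = '0.0'
"""

# enumerate version: a single pass over the list
fast = """
B = list(A)
for i, value in enumerate(B):
    if value == '**********':
        B[i] = '0.0'
"""

print(timeit.timeit(slow, setup=setup, number=10))
print(timeit.timeit(fast, setup=setup, number=10))
The gap grows with the number of bad values, since each index() call in the first version restarts its scan from the beginning of the list.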
If you are eventually going to convert this to an array, you might consider using numpy and np.genfromtxt,
which has the ability to deal with missing data:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
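For example, a minimal sketch, assuming the file holds whitespace-separated float columns and fname is the path from the question:
import numpy as np

# '**********' entries are treated as missing and filled with 0.0
data = np.genfromtxt(fname, missing_values='**********', filling_values=0.0)
print(data.shape)  # should be (1000000, 7) for the file described in the question
If you pass usemask=True instead of filling_values, genfromtxt returns a masked array, which keeps the information about which entries were bad.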
With a binary file, you can use np.memmap
and then use masked arrays to deal with the missing elements.
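A rough sketch of that idea, assuming the binary file contains raw float64 values and that the missing entries were written as NaN (both of these are assumptions, not something given in the question):
import numpy as np

# Map the raw binary file without reading it all into memory
raw = np.memmap(fname, dtype='float64', mode='r')
# Mask non-finite entries (missing values assumed to be stored as NaN)
masked = np.ma.masked_invalid(raw)
print(masked.mean())  # statistics then ignore the masked entries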
fin = open(fname, 'r')
fout = open(fname + '_fixed', 'w')
for line in fin:
    # replace 10 asterisks by 7 spaces + '0.0'
    # If you don't mind losing the fixed-column-width format,
    # omit the seven spaces
    line = line.replace('**********', '       0.0')
    fout.write(line)
fin.close()
fout.close()
Alternatively, if your file is smallish, replace the loop with this:
fout.write(fin.read().replace('**********', '       0.0'))
Before splitting A
into a list, while it is still one huge string, you could first change all the bad values with a single call to the replace('**********', '0.0')
method and then split it; you'd have the same result, likely a lot faster. Something like:
import mmap

f = open(fname, 'r+b')
A = str(mmap.mmap(f.fileno(), 0)[:]).replace('**********', '0.0').split()
It would use a lot of memory, but that's often the trade-off for speed.
Instead of manipulating A, try using a list comprehension to make a new A:
A = [v if v != '**********' else '0.0' for v in A]
I think you'll find this surprisingly fast.