I have a time series data set consisting of 10 Hz data over several years. One year of data has around 3.1*10^8 rows (each row has a time stamp and 8 float values). The data has gaps which I need to identify and fill with 'NaN'. My Python code below does this, but its performance is far too poor for a problem of this size. I cannot get through my data set in anything even close to a reasonable time.
Below is a minimal working example. I have, for example, series (the time series data) and data as lists of the same length:
series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
I would like series to advance in intervals of 1, hence the gaps in series are 4.1, 5.1, 6.1, 11.1, 12.1, 13.1, 17.1, 18.1, 19.1. The data_a and data_b lists shall be filled with float('nan') at those positions, so data_b, for example, should become:
[1.2, 1.2, 1.2, nan, nan, nan, 2.2, 2.2, 2.2, 2.2, nan, nan, nan, 3.2, 3.2, 3.2, nan, nan, nan, 4.2]
I achieved this using:
d_max = 1.0  # Normal increment in series where no gaps shall be filled
shift = 0
for i in range(len(series)-1):
    diff = series[i+1] - series[i]
    if diff > d_max:
        num_fills = int(round(diff/d_max)) - 1  # Number of fills within one gap
        for it in range(num_fills):
            data_a.insert(i+1+it+shift, float('nan'))
            data_b.insert(i+1+it+shift, float('nan'))
        shift = shift + num_fills  # Shift the index by the number of inserts from the previous gap filling
I searched for other solutions to this problem but only came across the use of the find() function, which yields the indices of the gaps. Is find() faster than my solution? And how would I then insert the NaNs into data_a and data_b more efficiently?
First, realize that your innermost loop is not necessary:
for it in range(num_fills):
    data_a.insert(i+1+it+shift, float('nan'))
is the same as
data_a[i+1+shift:i+1+shift] = [float('nan')] * int(num_fills)
That might make it slightly faster, because the list is resized once per gap and the later items are shifted once instead of once per inserted value.
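To check that the two forms really produce the same list, here is a small sketch. The helper names fill_with_inserts and fill_with_slice are made up for this illustration and are not part of the question's code:

```python
import math

NAN = float("nan")

def fill_with_inserts(values, pos, count):
    """Insert `count` NaNs at `pos`, one insert() call per NaN."""
    out = list(values)
    for k in range(count):
        out.insert(pos + k, NAN)
    return out

def fill_with_slice(values, pos, count):
    """Insert `count` NaNs at `pos` with a single slice assignment."""
    out = list(values)
    out[pos:pos] = [NAN] * count
    return out

data = [1.2, 1.2, 1.2, 2.2, 2.2]
via_inserts = fill_with_inserts(data, 3, 3)
via_slice = fill_with_slice(data, 3, 3)
```

Both results have the NaNs in positions 3 to 5 and the original values around them. The slice version still shifts the tail of the list, so it does not change the overall scaling, only the constant factor.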
Then, for large numerical problems, always use NumPy. It may take some effort to learn, but the performance is likely to go up orders of magnitude. Start with something like:
import numpy as np
series = np.array([1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1])
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
d_max = 1.0  # Normal increment in series where no gaps shall be filled
shift = 0
# the following two statements use NumPy's broadcasting
# to implicitly run the loops at the C level
diff = series[1:] - series[:-1]
# convert to int so the counts can be used for list repetition
num_fills = (np.round(diff / d_max) - 1).astype(int)
for i in np.where(diff > d_max)[0]:
    nf = int(num_fills[i])
    nans = [np.nan] * nf
    data_a[i+1+shift:i+1+shift] = nans
    data_b[i+1+shift:i+1+shift] = nans
    shift = shift + nf
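The remaining Python loop and the list inserts can probably be avoided as well. The following is only a sketch under the same gap rule as above (a step of round(diff / d_max) slots whenever diff > d_max, otherwise one slot); the function name fill_gaps is hypothetical. It computes the target position of every existing sample, preallocates NaN-filled arrays and scatters the data into them:

```python
import numpy as np

def fill_gaps(series, columns, d_max=1.0):
    """Return (positions, filled columns) with NaN in the gap slots.

    Hypothetical helper: each sample advances by round(diff / d_max)
    slots, but never by fewer than one slot.
    """
    series = np.asarray(series, dtype=float)
    diff = np.diff(series)
    steps = np.maximum(np.round(diff / d_max), 1).astype(np.int64)
    # position of every original sample in the gap-free output
    idx = np.concatenate(([0], np.cumsum(steps)))
    n = int(idx[-1]) + 1
    filled = []
    for col in columns:
        out = np.full(n, np.nan)  # start with all NaN
        out[idx] = col            # scatter the known values
        filled.append(out)
    return idx, filled

series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
idx, (new_a, new_b) = fill_gaps(series, [data_a, data_b])
```

This touches every element only once instead of shifting the tail of a list for each gap. It assumes the arrays for one processing step fit in memory; for several years of 10 Hz data you would likely still have to work through the file in chunks.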
IIRC, inserts into Python lists are expensive: each insert has to shift every later element, so the cost grows with the size of the list.
I'd recommend not loading your huge data sets into memory, but instead iterating through them with a generator function, something like:
# Python 3: the built-in zip and range are already lazy
# (on Python 2 use itertools.izip and xrange instead)
series = [1.1, 2.1, 3.1, 7.1, 8.1, 9.1, 10.1, 14.1, 15.1, 16.1, 20.1]
data_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
data_b = [1.2, 1.2, 1.2, 2.2, 2.2, 2.2, 2.2, 3.2, 3.2, 3.2, 4.2]
def fillGaps(series, data_a, data_b, d_max=1.0):
    prev = None
    for s, a, b in zip(series, data_a, data_b):
        if prev is not None:
            diff = s - prev
            if diff > d_max:
                # one NaN pair for every missing sample in the gap
                for x in range(int(round(diff/d_max)) - 1):
                    yield (float('nan'), float('nan'))
        prev = s
        yield (a, b)
newA = []
newB = []
for a, b in fillGaps(series, data_a, data_b):
    newA.append(a)
    newB.append(b)
E.g. feed the data into the generator as you read it and write each result straight out, instead of appending to lists.
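A rough sketch of that read-and-write-out idea, assuming (this is not stated in the question) that the data is stored as CSV text with the time stamp in the first column. The function name fill_gaps_stream is hypothetical, and io.StringIO stands in for real files:

```python
import csv
import io
import math

def fill_gaps_stream(rows, d_max=1.0):
    """Yield rows one at a time, inserting NaN rows where samples are missing.

    Assumed layout: row[0] is the time stamp, the rest are values.
    """
    prev = None
    for row in rows:
        t = float(row[0])
        values = [float(v) for v in row[1:]]
        if prev is not None:
            diff = t - prev
            if diff > d_max:
                missing = int(round(diff / d_max)) - 1
                for k in range(1, missing + 1):
                    # reconstructed time stamp plus NaN for every value
                    yield [prev + k * d_max] + [math.nan] * len(values)
        prev = t
        yield [t] + values

# In real use src and dst would be open file objects.
src = io.StringIO("1.1,1,1.2\n2.1,1,1.2\n5.1,1,2.2\n")
dst = io.StringIO()
csv.writer(dst).writerows(fill_gaps_stream(csv.reader(src)))
```

Only one row is held in memory at a time, so memory use stays constant regardless of the file size. The per-row Python overhead remains, so this trades speed for memory compared with an array-based approach.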