I have time-series data with some missing values, over which I run an estimation function on rolling windows. The windows are not of uniform length, and each variable has different start and end dates. I want to discard any window that contains missing data. Because the windows overlap, a single missing observation often invalidates a great many windows. What I need is a mapping from each date to the windows that contain it.
Currently, I have a logical matrix with a row for every possible day and a column for each window, with true values marking the days of that window. I can then subset that matrix to the rows representing the missing data; whichever columns contain any true values are the invalid windows. The problem is that the logical matrix gets large (10k x 10k, ~100 MB) and there can be many of them. I can convert to sparse, which solves the size problem, but the calculation of which windows to remove becomes very slow when the windows are long.
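For concreteness, here is roughly what the current approach does (a NumPy sketch of the MATLAB logic; the sizes and variable names are illustrative, not my actual code):

```python
import numpy as np

n_days, win_len = 100, 10
n_windows = n_days - win_len + 1

# in_window[d, w] is True when day d belongs to window w
in_window = np.zeros((n_days, n_windows), dtype=bool)
for w in range(n_windows):
    in_window[w:w + win_len, w] = True   # window w covers days w..w+9 (0-based)

missing = np.array([16])                 # day 17 in 1-based terms
# any window containing a missing day is invalid
bad = in_window[missing, :].any(axis=0)
```

The full day-by-window matrix is what blows up the memory, even though each column is almost entirely False.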
This doesn't smell like a problem that should be resource-intensive (in either memory or computation). Is there a better way?
Edit: Let me add an example to make this a little clearer. Say the full set of dates runs from 1 to 100. The windows are 1:10, 2:11, 3:12, and so on through 91:100 (these happen to be uniform, but that doesn't matter for the example). I have a series that runs from 5 to 25 but has a NaN at 17.
That one NaN knocks out ten windows (8:17 through 17:26). I want an efficient mapping from observation 17 to windows 8:17. Clearly, it's pretty easy when the windows are of uniform length, but what is an efficient method when the windows are irregular?
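For the uniform case, the mapping is closed-form: observation i falls into the windows starting at max(1, i - w + 1) through min(i, n - w + 1), for window length w over n days. A quick sketch (Python, with the 1-based arithmetic kept explicit):

```python
def windows_hit(i, w, n):
    """1-based start indices of the length-w windows over days 1..n
    that contain day i."""
    first = max(1, i - w + 1)      # earliest window still covering day i
    last = min(i, n - w + 1)       # latest window start that exists and covers i
    return range(first, last + 1)
```

It's the irregular case where this closed form breaks down and some explicit mapping is needed.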
Logical comparisons cost little time and resources. Are you really sure that this is a bottleneck?
If the window creation is time-consuming, you may want to do it in a while-loop, so that you can skip ahead a few entries whenever you encounter a NaN inside a window.
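The skip-ahead idea can be sketched like this (a Python illustration, not MATLAB: when a NaN turns up inside a candidate window, the start jumps just past it instead of advancing by one):

```python
import math

def valid_window_starts(data, w):
    """1-based start indices of length-w windows containing no NaN."""
    n = len(data)
    starts = []
    s = 0                                # 0-based candidate start
    while s <= n - w:
        seg = data[s:s + w]
        nan_pos = [k for k, v in enumerate(seg) if math.isnan(v)]
        if nan_pos:
            s = s + nan_pos[-1] + 1      # jump just past the last NaN in the window
        else:
            starts.append(s + 1)
            s += 1
    return starts
```

Each NaN is examined at most once per jump, so a single NaN invalidates its whole run of windows in one step.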
If window creation is fast, which it would certainly be with uniform lengths, you can simply do
%# startEnd is created according to your example, but can be built however is quick
startEnd = [(1:(100-windowSize+1))', (windowSize:100)'];

nanIdx = find(isnan(data))'; %# row vector of NaN locations

%# This line temporarily creates a logical array of size nWindows-by-numberOfNaNs,
%# which is most likely much smaller than nDays-by-nWindows
badWindows = any(bsxfun(@le, startEnd(:,1), nanIdx) & bsxfun(@ge, startEnd(:,2), nanIdx), 2);
startEnd(badWindows, :) = [];
I coded up Jonas' solution and it ended up being slower than what I already had in place. It did remove the need for the large array, however, and it got me thinking about the problem differently: I just needed to go from a window -> (obs start, obs end) mapping to an obs -> (indices of each window containing that obs) mapping.
So I build a cell array that contains, for each day, the set of indices of the windows containing that day (I could use a NumObs-by-2 matrix, but I want to allow for possibly complicated window definitions). For each time series, I use the indices of the missing data points to subset the mapping, which gives the indices of all the windows that need to be removed. Then cell2mat pulls the indices out of the cell array and I can remove the bad windows (thankfully, Matlab doesn't care about repeated indices in assignments).
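In outline, the obs -> window-indices map and the removal step work like this (a Python sketch of the approach, with a list of lists standing in for the cell array; function names are illustrative):

```python
import math

def build_day_to_windows(windows, n_days):
    """windows: list of (start, end), 1-based inclusive.
    Returns, for each day, the 0-based indices of the windows containing it."""
    day_map = [[] for _ in range(n_days)]
    for w, (s, e) in enumerate(windows):
        for d in range(s - 1, e):        # convert to 0-based days
            day_map[d].append(w)
    return day_map

def bad_window_indices(day_map, data):
    """Union of window indices touched by any NaN in data."""
    bad = set()
    for d, v in enumerate(data):
        if math.isnan(v):
            bad.update(day_map[d])
    return bad
```

The map is built once per window layout and reused across every series, which is where the speedup comes from: each NaN does a single lookup instead of a scan over all windows.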
In my timings, this method is about 10x faster than my original method and 15x faster than Jonas' method. The indices in the map can be stored as uint16, so the memory required is far less than in my original solution (but still more than in Jonas' method).
As a bonus, I can use accumarray to count up the indices if I want to use a more complicated criterion, such as allowing no more than 5 NaNs per window (rather than none, as described above).
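That counting variant can be sketched with np.bincount standing in for accumarray (a Python illustration; the threshold and the day_map argument, a per-day list of containing window indices, are as described above):

```python
import numpy as np

def windows_over_threshold(day_map, data, n_windows, max_nans=5):
    """Count NaNs per window via the day -> windows map; flag windows
    whose NaN count exceeds max_nans.
    day_map[d] lists the 0-based indices of windows containing day d."""
    nan_days = [d for d, v in enumerate(data) if np.isnan(v)]
    if nan_days:
        hits = np.concatenate([np.asarray(day_map[d], dtype=int) for d in nan_days])
    else:
        hits = np.array([], dtype=int)
    counts = np.bincount(hits, minlength=n_windows)   # accumarray-style tally
    return counts > max_nans
```

Setting max_nans to 0 reproduces the original remove-any-NaN behaviour.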