I have some data in 3 arrays with shapes:
docLengths.shape = (10000,)
docIds.shape = (10000,)
docCounts.shape = (68,10000)
I want to obtain relative counts and their means and standard deviations for some i:
docRelCounts = docCounts/docLengths
relCountMeans = docRelCounts[i,:].mean()
relCountDeviations = docRelCounts[i,:].std()
The problem is that some elements of docLengths are zero. This produces NaN elements in docRelCounts, so the means and deviations are also NaN.
I need to remove the data for documents of zero length. I could write a loop that locates zero-length docs and removes them, but I was hoping for some numpy array magic that would do this more efficiently. Any ideas?
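The problem is easy to reproduce with a small toy example (the shapes and values here are made up for illustration, not the real data):

```python
import numpy as np

docLengths = np.array([4.0, 0.0, 2.0])       # second doc has length 0
docCounts = np.array([[1.0, 0.0, 1.0],
                      [2.0, 0.0, 0.0]])

with np.errstate(invalid='ignore'):
    docRelCounts = docCounts / docLengths    # 0/0 produces NaN

print(docRelCounts[0, :])         # contains a NaN in the zero-length column
print(docRelCounts[0, :].mean())  # NaN: a single NaN poisons the mean
```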
Try this:
docRelCounts = docCounts/docLengths
goodDocRelCounts = docRelCounts[i,:][np.invert(np.isnan(docRelCounts[i,:]))]
relCountMeans = goodDocRelCounts.mean()
relCountDeviations = goodDocRelCounts.std()
np.isnan returns a boolean array of the same shape, with True where the original array is NaN and False elsewhere. np.invert (equivalent to the ~ operator for boolean arrays) flips this mask, so indexing with it leaves goodDocRelCounts containing only the values that are not NaN.
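As a self-contained sketch of this masking approach (with hypothetical toy data; the zero-length doc also has zero counts, so the division yields NaN rather than inf):

```python
import numpy as np

docLengths = np.array([10.0, 0.0, 5.0, 20.0])
docCounts = np.array([[2.0, 0.0, 1.0, 4.0],
                      [0.0, 0.0, 2.0, 2.0]])

with np.errstate(invalid='ignore'):
    docRelCounts = docCounts / docLengths    # NaN where docLengths == 0

i = 0
row = docRelCounts[i, :]
good = row[np.invert(np.isnan(row))]  # same as row[~np.isnan(row)]
relCountMeans = good.mean()
relCountDeviations = good.std()
```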
Use nanmean and nanstd from scipy.stats:
from scipy.stats import nanmean, nanstd
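Note that scipy.stats.nanmean and scipy.stats.nanstd were deprecated and later removed from SciPy; NumPy itself now provides np.nanmean and np.nanstd, which do the same thing:

```python
import numpy as np

row = np.array([0.2, np.nan, 0.5, 0.3])
mean = np.nanmean(row)  # mean of the non-NaN values
std = np.nanstd(row)    # std of the non-NaN values
```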
In the end I did this. I'd actually worked it out before I saw eumiro's answer; it's a bit simpler, but otherwise not any better, just different, so I thought I'd include it:
goodData = docLengths != 0                   # mask: True for docs with nonzero length
docLengths = docLengths[goodData]            # keep only nonzero-length docs
docCounts = docCounts[:, goodData]           # drop the matching columns
docRelCounts = docCounts / docLengths
means = [row.mean() for row in docRelCounts]
stds = [row.std() for row in docRelCounts]
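The same filtering approach as a runnable sketch (toy data again); the per-row statistics can also be computed in one vectorized call with the axis argument instead of looping over rows:

```python
import numpy as np

docLengths = np.array([10.0, 0.0, 5.0, 20.0])
docCounts = np.array([[2.0, 0.0, 1.0, 4.0],
                      [0.0, 0.0, 2.0, 2.0]])

goodData = docLengths != 0                # mask of nonzero-length docs
docLengths = docLengths[goodData]         # shape (3,)
docCounts = docCounts[:, goodData]        # shape (2, 3)
docRelCounts = docCounts / docLengths     # no zero division, no NaN

means = docRelCounts.mean(axis=1)         # one mean per row
stds = docRelCounts.std(axis=1)           # one std per row
```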