How do I remove columns from a collection of ndarrays that correspond to zero elements of one of the arrays?

Developer https://www.devze.com 2023-02-25 12:43 Source: Internet
I have some data in 3 arrays with shapes:

docLengths.shape = (10000,)
docIds.shape = (10000,)
docCounts.shape = (68,10000)

I want to obtain relative counts and their means and standard deviations for some i:

docRelCounts = docCounts/docLengths
relCountMeans = docRelCounts[i,:].mean()
relCountDeviations = docRelCounts[i,:].std()

Problem is, some elements of docLengths are zero. This produces NaN elements in docRelCounts and the means and deviations are thus also NaN.

I need to remove the data for documents of zero length. I could write a loop that locates the zero-length docs and removes them, but I was hoping for some numpy array magic that would do this more efficiently. Any ideas?


Try this:

import numpy as np

docRelCounts = docCounts/docLengths

# keep only the entries of row i that are not NaN
goodDocRelCounts = docRelCounts[i,:][np.invert(np.isnan(docRelCounts[i,:]))]
relCountMeans = goodDocRelCounts.mean()
relCountDeviations = goodDocRelCounts.std()

np.isnan returns a boolean array of the same shape, with True where the original array is NaN and False elsewhere. np.invert flips that mask, and indexing with the result leaves goodDocRelCounts containing only the values that are not NaN.
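As a small illustration of the masking described above, here is a sketch on made-up data (the array values, and the choice of i = 0, are assumptions for the demo, not the asker's data). It uses ~, which behaves the same as np.invert on a boolean array.

```python
import numpy as np

# Hypothetical toy data: 2 terms, 4 documents; the third document has length 0.
docCounts = np.array([[1.0, 2.0, 0.0, 4.0],
                      [3.0, 0.0, 0.0, 1.0]])
docLengths = np.array([4.0, 2.0, 0.0, 5.0])

# 0/0 yields NaN; suppress the resulting RuntimeWarning for this demo.
with np.errstate(divide="ignore", invalid="ignore"):
    docRelCounts = docCounts / docLengths

i = 0
row = docRelCounts[i, :]
isBad = np.isnan(row)            # True where the entry is NaN
goodDocRelCounts = row[~isBad]   # boolean indexing keeps only non-NaN entries

relCountMean = goodDocRelCounts.mean()
relCountDeviation = goodDocRelCounts.std()
```

Note that this only filters NaN. A zero-length document with a non-zero count would give inf rather than NaN, and np.isfinite would be the mask to use in that case.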


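Use nanmean and nanstd, which ignore NaN values. They were originally in scipy.stats but have since been removed from SciPy; NumPy provides them as np.nanmean and np.nanstd:

from numpy import nanmean, nanstd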
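For reference, a minimal sketch of the NaN-aware approach using NumPy's own np.nanmean and np.nanstd, computed for all rows at once with axis=1. The toy arrays are assumptions for illustration only.

```python
import numpy as np

# Hypothetical toy data: 2 terms, 4 documents; the third document has length 0.
docCounts = np.array([[1.0, 2.0, 0.0, 4.0],
                      [3.0, 0.0, 0.0, 1.0]])
docLengths = np.array([4.0, 2.0, 0.0, 5.0])

# 0/0 yields NaN; suppress the resulting RuntimeWarning for this demo.
with np.errstate(divide="ignore", invalid="ignore"):
    docRelCounts = docCounts / docLengths

# One mean and one standard deviation per row, skipping NaN entries.
relCountMeans = np.nanmean(docRelCounts, axis=1)
relCountDeviations = np.nanstd(docRelCounts, axis=1)
```

This leaves docRelCounts itself unchanged, so the NaN columns are still present if the array is used elsewhere.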


In the end I did this (I'd actually worked it out before I saw eumiro's answer - it's a bit simpler, but otherwise not any better, just different, so I thought I'd include it :)

goodData = docLengths!=0  # True for documents with non-zero length
docLen = docLengths[goodData]
docCounts = docCounts[:,goodData]  # drop the columns of zero-length documents

docRelCounts = docCounts/docLen
means = docRelCounts.mean(axis=1)  # one mean per row
stds = docRelCounts.std(axis=1)  # one standard deviation per row
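Since the question also has a docIds array, the same boolean mask can be applied to every array so they stay aligned. The following is a sketch on made-up values (the data and the name keep are assumptions for illustration):

```python
import numpy as np

# Hypothetical toy data: 2 terms, 4 documents; the third document has length 0.
docIds = np.array([10, 11, 12, 13])
docLengths = np.array([4.0, 2.0, 0.0, 5.0])
docCounts = np.array([[1.0, 2.0, 0.0, 4.0],
                      [3.0, 0.0, 0.0, 1.0]])

keep = docLengths != 0  # True for documents to retain

# 1-D arrays are indexed directly; the 2-D array is filtered along its columns.
docIds = docIds[keep]
docLengths = docLengths[keep]
docCounts = docCounts[:, keep]

docRelCounts = docCounts / docLengths  # no zero divisors remain
means = docRelCounts.mean(axis=1)
stds = docRelCounts.std(axis=1)
```

Filtering before dividing avoids creating NaN values in the first place, so no warnings need to be suppressed.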