Imagine that we have drawn a random sample y1, y2, ..., yn from some population, so double y[] and int n are known. The population contains groups, but we do not know exactly which group each observation belongs to. So for each yi we introduce an allocation variable zi that tells us which group yi was drawn from. We assume that there are int k groups, so zi in {0, ..., k-1} for all i.
To make inferences about the groups I need to run my algorithm for many iterations, say 50,000 or 100,000. At each iteration every observation is allocated probabilistically to some group, so my array of allocations int z[] keeps changing. In this case, counting the number of observations in each group and finding each group's minimum is very easy:
int nj[k];
double yj_min[k];
/* reset the per-group counts and minima at each iteration */
for(j=0; j<k; j++){
    nj[j]=0;
    yj_min[j]=y[n-1]; /* y[] is sorted in ascending order, so y[n-1] is the maximum */
}
for(i=0; i<n; i++){
    nj[z[i]] = nj[z[i]] + 1;
    /* keep the smallest observation seen so far in group z[i] */
    if(y[i] < yj_min[z[i]]){
        yj_min[z[i]] = y[i];
    }
}
But suppose we introduce a further allocation variable di for each observation yi, indicating the sub-group from which yi has been sampled (this allocation is also drawn probabilistically). There are int m sub-groups, so di in {0, ..., m-1}. Then (zi=j, di=s) indicates that the observation yi has been drawn from group j and sub-group s.
How could I EFFICIENTLY calculate, given that I have to do this at every iteration, the minimum yjs_min over {i: zi=j, di=s}? That is, the minimum over the yi such that zi=j and di=s, for j=0, ..., k-1 and s=0, ..., m-1.
It would be great to do something like
for(i=0; i<n; i++){
    njs[z[i]][d[i]] = njs[z[i]][d[i]] + 1;
    if(yjs_min[z[i]][d[i]] < y[z[i]][d[i]]){
        yjs_min[z[i]][d[i]] = y[z[i]][d[i]];
    }
}
but obviously this is impossible! Any ideas, please?
Cheers, Carlos
It looks like you're trying to do something like a Fisher exact test or a permutation test. If so, you might try using a statistics package like R, which is designed to do this kind of stuff, and is likely to have the most efficient algorithms built in already.
That aside, as I understand it, you are stratifying the sample of n observations (y) into k groups, and then each of those groups into m sub-groups. You want to find the minimum element of each sub-group.
One reasonably efficient solution is: create k*m unique identifiers, one per (group, sub-group) pair, and a map that indicates which sub-group each of them corresponds to. Then randomly allocate these identifiers (using the same distribution) to your sample observations, as you were doing before. Use an efficient in-place sort (like quicksort with a properly selected pivot) to sort the sample by identifier, so that all elements with the same identifier are stored in a contiguous block of memory. This takes log-linear time, O(n log n), so it should be very quick.
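As a rough sketch of the tagging and sorting step, assuming the identifier for group j and sub-group s is simply j*m + s: the names tagged_obs, cmp_tagged and tag_and_sort are made up for illustration, and the standard qsort stands in for "an efficient in-place sort" (the C standard does not promise that it is a quicksort).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical record: one observation plus its combined identifier. */
typedef struct {
    int id;     /* assumed encoding: id = z*m + d, so 0 <= id < k*m */
    double y;   /* the observation itself */
} tagged_obs;

/* Order by identifier first, then by value within the same identifier. */
static int cmp_tagged(const void *a, const void *b)
{
    const tagged_obs *pa = a;
    const tagged_obs *pb = b;
    if (pa->id != pb->id)
        return (pa->id > pb->id) - (pa->id < pb->id);
    return (pa->y > pb->y) - (pa->y < pb->y);
}

/* Fill out[0..n-1] with the tagged observations and sort them, so that
   every (group, sub-group) pair occupies one contiguous block. */
void tag_and_sort(const double *y, const int *z, const int *d,
                  int n, int k, int m, tagged_obs *out)
{
    int i;
    for (i = 0; i < n; i++) {
        assert(z[i] >= 0 && z[i] < k);
        assert(d[i] >= 0 && d[i] < m);
        out[i].id = z[i] * m + d[i];
        out[i].y = y[i];
    }
    qsort(out, (size_t)n, sizeof out[0], cmp_tagged);
}
```

Decoding an identifier back to its pair is id / m for the group and id % m for the sub-group. Because the comparison breaks ties on y, the first record of each block is already that block's minimum in this sketch.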
Then you just need to walk through the array in order and find the minimum element for each unique identifier. This takes linear time, plus extra space proportional to the number of identifiers.
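A minimal sketch of that walk, again assuming the identifier j*m + s and using flat arrays of length k*m for the results; the function name group_minima and the DBL_MAX sentinel for empty cells are my own choices, not something from the question. Since the running minimum is looked up by identifier, this particular pass does not rely on the observations being sorted, so it may also be worth trying directly on the unsorted y[], z[] and d[] at each iteration.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: njs and yjs_min are flat arrays of length k*m.
   Cell j*m + s holds the count and minimum for group j, sub-group s. */
void group_minima(const double *y, const int *z, const int *d,
                  int n, int k, int m, int *njs, double *yjs_min)
{
    int i, c;

    /* Reset every cell; DBL_MAX marks a cell with no observations. */
    for (c = 0; c < k * m; c++) {
        njs[c] = 0;
        yjs_min[c] = DBL_MAX;
    }

    /* One linear pass: update the count and running minimum of the
       cell that observation i is currently allocated to. */
    for (i = 0; i < n; i++) {
        assert(z[i] >= 0 && z[i] < k);
        assert(d[i] >= 0 && d[i] < m);
        c = z[i] * m + d[i];
        njs[c]++;
        if (y[i] < yjs_min[c])
            yjs_min[c] = y[i];
    }
}
```

After the call, njs[j*m + s] and yjs_min[j*m + s] play the role of njs[j][s] and yjs_min[j][s] from the question; a cell with njs equal to 0 still holds the sentinel and should be treated as empty.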
Hope that helps.