For the purpose of conducting a psychological experiment I have to divide a set of pictures (240) described by 4 features (real n开发者_运维知识库umbers) into 3 subsets with equal number of elements in each subset (240/3 = 80) in such a way that all subsets are approximately balanced with respect to these features (in terms of mean and standard deviation).
Can anybody suggest an algorithm to automate that? Are there any packages/modules in Python or R that I could use to do that? Where should I start?
If I understand correctly your problem, you might use random.sample()
in python:
import random
pool = set(["foo", "bar", "baz", "123", "456", "789"]) # your 240 elements here
slen = len(pool) / 3 # we need 3 subsets
set1 = set(random.sample(pool, slen)) # 1st random subset
pool -= set1
set2 = set(random.sample(pool, slen)) # 2nd random subset
pool -= set2
set3 = pool # 3rd random subset
I would tackle this as follows:
- Divide into 3 equal subsets.
- Figure out the mean and variance of each subset. From them construct an "unevenness" measure.
- Compare each pair of elements, if swapping would reduce the "unevenness", swap them. Continue until there are either no more pairs to compare, or the total unevenness is below some arbitrary "good enough" threshold.
You can easily do this using the plyr
library in R. Here is the code.
require(plyr)
# CREATE DUMMY DATA
mydf = data.frame(feature = sample(LETTERS[1:4], 240, replace = TRUE))
# SPLIT BY FEATURE AND DIVIDE INTO THREE SUBSETS EQUALLY
ddply(mydf, .(feature), summarize, sub = sample(1:3, 60, replace = TRUE))
In case you are still interested in the exhaustive search question. You have 240 choose 80 possibilities to choose the first set and then another 160 choose 80 for the second set, at which point the third set is fixed. In total, this gives you:
120554865392512357302183080835497490140793598233424724482217950647 * 92045125813734238026462263037378063990076729140
Clearly, this is not an option :)
Order your items by their decreasing Mahalanobis distance from the mean; they will be ordered from most extraordinary to most boring, including the effects of whatever correlations exist amongst the measures.
Assign X[3*i] X[3*i+1] X[3*i+2] to the subsets A, B, C, choosing for each i the ordering of A/B/C that minimizes your mismatch measure.
Why decreasing order? The statistically heavy items will be assigned first, and the choice of permutation in the larger number of subsequent rounds will have a better chance of evening out initial imbalances.
The point of this procedure is to maximize the chance that whatever outliers exist in the data set will be assigned to separate subsets.
精彩评论