开发者

Aggregate or summarize to get ratios

开发者 https://www.devze.com 2023-01-28 07:23 出处:网络
The following is a toy problem that demonstrates my question. I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.

The following is a toy problem that demonstrates my question.

I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.

aggregate(salary ~ state)  # Returns the average salary per state
aggregate(salary ~ state + gender, data, FUN = mean)  # Avg salary per state/gender

What I actually need is a summary of the fraction of the total salary earned by women in each state.

aggregate(salary ~ state + gender, data, FUN = sum)  

returns the total salary earned by women (and men) in each state ,but what I really need is salary_w / salary_total on a p开发者_StackOverflow中文版er-state level. I can write a for-loop, etc -- but I am wondering if there is some way to use aggregate to do that.


Another option would be using plyr. ddply() expects a data.frame as an input and will return a data.frame as an output. The second argument is how you want to split the data frame. The third argument is what we want to apply to the chunks, here we are using summarise to create a new data.frame from the existing data.frame.

library(plyr)

#Using the sample data from kohske's answer above

> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Woman"]) / sum(salary))
  state     ratio
1     1 0.5789860
2     2 0.4530224


probably reshape or reshape2 would help your work.

Here is a sample script:

library(reshape2) # from CRAN

# sample data
d <- data.frame(expand.grid(state=gl(2,2),gender=gl(2,1, labels=c("Men","Wemon"))),
  salaly=runif(8))

d2 <- dcast(d, state~gender, sum)
d2$frac <- d2$Wemon/(d2$Men+d2$Wemon)


The ave function is good for problems like this.

Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN=sum) /
              ave(Data$salary, Data$state, FUN=sum)


Another solution is to use xtabs and prop.table:

prop.table(xtabs(salary ~ state + gender,data),margin=1)


It's generally not advisable to name your datasets, "data", so I will change the problem slightly to name the dataset "dat1".

       with( subset(dat1, gender="Female"), aggregate(salary, state, sum )/ 
 # should return a vector
       with( data=dat1,                   aggregate(salary,  state, sum ) 
             # using R's element-wise division

I think you are also using attach and there are good reasons to reconsider that decision, despite what you might read in Crawley.


Since you want the results on a per state basis the tapply might be what you want.

To illustrate let's generate some arbitrary data to play with:

set.seed(349)   # For replication
n <- 20000      # Sample size
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI',
           'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN',
           'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
           'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA',
           'WV','WI','WY')      # All US states
state <- sample(state, size = n, replace = TRUE)  # Random selection of the states

state_index <- tapply(state, state)     # Just for the data generatino part ...
gender_index <- tapply(gender, gender)

# Generate salaries
salary <- runif(length(unique(state)))[state_index]     # Make states different
salary <- salary + c(.02, -.02)[gender_index]           # Make gender different
salary <- salary + log(50) + rnorm(n)                   # Add mean and error term
salary <- exp(salary)                                   # The variable of interest

What you asked for, the sum of salaries for the women per state and the sum of total salaries per state:

salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum)
salary_total <- tapply(salary, state, sum)

Or if it is in a data-frame:

salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum))
salary_total <- with(myData, tapply(salary, state, sum))

Then the answer is:

> salary_w / salary_total
       AK        AL        AR        AZ        CA        CO        CT        DC 
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799 
       DE        FL        GA        HI        IA        ID        IL        IN 
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932 
       KS        KY        LA        MA        MD        ME        MI        MN 
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435 
       MO        MS        MT        NC        ND        NE        NH        NJ 
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822 
       NM        NV        NY        OH        OK        OR        PA        RI 
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672 
       SC        SD        TN        TX        UT        VA        VT        WA 
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393 
       WI        WV        WY 
0.4787588 0.5495602 0.5029816 
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号