Using reshape + cast to aggregate over multiple columns

In R, I have a data frame with columns for Seat (factor), Party (factor) and Votes (numeric). I want to create a summary data frame with columns for Seat, Winning party, and Vote share. For example, from the data frame

df <- data.frame(party=rep(c('Lab','C','LD'),times=4),
                 votes=c(1,12,2,11,3,10,4,9,5,8,6,15),
                 seat=rep(c('A','B','C','D'),each=3))

I want to get the output

  seat winner voteshare
1    A      C 0.8000000
2    B    Lab 0.4583333
3    C      C 0.5000000
4    D     LD 0.5172414

I can figure out how to achieve this. But I'm sure there must be a better way, probably a cunning one-liner using Hadley Wickham's reshape package. Any suggestions?

For what it's worth, my solution uses a function from my package djwutils_2.10.zip and is invoked as follows. But there are all sorts of special cases it doesn't deal with, so I'd rather rely on someone else's code.

aggregateList(df, by=list(seat=seat),
              FUN=list(winner=function(x) x$party[which.max(x$votes)],
                       voteshare=function(x) max(x$votes)/sum(x$votes)))

Hadley's plyr package may help you:

library(plyr)
ddply(df, .(seat), function(x) data.frame(winner=x[which.max(x$votes),]$party,
                                          voteshare=max(x$votes)/sum(x$votes)))
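For comparison, the same split-apply-combine step can be sketched in base R. This is only an illustration of what the ddply call does for each seat, not a replacement for it; the helper name summarise_seat is made up for this example.

```r
# Example data from the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Hypothetical helper: summarise the rows belonging to a single seat
summarise_seat <- function(x) {
  data.frame(seat      = as.character(x$seat[1]),
             winner    = as.character(x$party[which.max(x$votes)]),
             voteshare = max(x$votes) / sum(x$votes))
}

# Split by seat, summarise each piece, then stack the pieces back together
pieces <- lapply(split(df, df$seat), summarise_seat)
result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```

Like which.max in the ddply version, this quietly picks the first party listed when two parties tie for the top spot.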

You may be right that there's a cunning one-liner. I tend to favour the approach that understandable is better than clever, especially when you're first looking at something. Here's the more verbose alternative.

library(reshape)
votes_by_seat_and_party <- as.matrix(cast(df, seat ~ party, value="votes"))

   C Lab LD
A 12   1  2
B  3  11 10
C  9   4  5
D  6   8 15

seats <- rownames(votes_by_seat_and_party)
parties <- colnames(votes_by_seat_and_party)

winner_col <- apply(votes_by_seat_and_party, 1, which.max)
winners <- parties[winner_col]
voteshare_of_winner_by_seat <- apply(votes_by_seat_and_party, 1, function(x) max(x) / sum(x))

results <- data.frame(seat = seats, winner = winners, voteshare = voteshare_of_winner_by_seat)

  seat winner voteshare
1    A      C 0.8000000
2    B    Lab 0.4583333
3    C      C 0.5000000
4    D     LD 0.5172414

# Full voteshare matrix, if you're interested
total_votes_by_seat <- rowSums(votes_by_seat_and_party)
voteshare_by_seat_and_party <- votes_by_seat_and_party / total_votes_by_seat
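If the reshape package isn't to hand, a similar seat-by-party table and its row-wise shares can be sketched with base R's xtabs and prop.table. This is a swapped-in alternative to the cast call, not what the answer above uses; note that xtabs sums the votes if a seat/party pair appears more than once.

```r
# Example data from the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Seat-by-party table of votes
votes_tab <- xtabs(votes ~ seat + party, data = df)

# Row-wise proportions: each party's share of the vote within its seat
share_tab <- prop.table(votes_tab, margin = 1)

# Winning party and winning share for each seat
winners <- colnames(share_tab)[apply(share_tab, 1, which.max)]
shares  <- unname(apply(share_tab, 1, max))
results <- data.frame(seat = rownames(share_tab), winner = winners,
                      voteshare = shares)
results
```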

OK, so 3 solutions... here's another, more compact one using base R. It's 4 sparse lines of code. I'm assuming missing values are 0, or just missing, because it won't matter. My guess is that this would be your fastest code for a large set of data.

#get a sum for dividing
s <- aggregate(df$votes, list(seat = df$seat), sum)
#extract the winner and seat: keep the rows whose votes equal their own seat's maximum
temp <- aggregate(df$votes, list(seat = df$seat), max)
res <- df[df$votes == temp$x[match(df$seat, temp$seat)],]
res$votes <- res$votes / s$x

Rename the columns if you wish...

names(res) <- c('winner', 'voteshare', 'seat')

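(This breaks in the event of a tie: res then holds more than one row for the tied seat, so res$votes no longer lines up with s$x and the shares come out wrong, usually with a recycling warning. You'll be able to see it as extra rows in res.)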
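To make the tie case visible up front, one option is to count how many parties reach the maximum in each seat before dividing. The sketch below is just one way you might check for this (it reports ties, it doesn't resolve them), and df_tie is made-up data with a tie added to seat A.

```r
# Made-up data: seat A has a tie between Lab and C
df_tie <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 2),
                     votes = c(12, 12, 2, 11, 3, 10),
                     seat  = rep(c('A', 'B'), each = 3))

# Per-seat maximum, aligned to the rows of df_tie
seat_max <- ave(df_tie$votes, df_tie$seat, FUN = max)

# Number of parties in each seat that reach that maximum
n_top <- tapply(df_tie$votes == seat_max, df_tie$seat, sum)

# Seats where more than one party has the top vote count
tied_seats <- names(n_top)[n_top > 1]
tied_seats
```

If tied_seats comes back empty, each seat has exactly one winning row and the division by s$x lines up as intended.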