In R, I have a data frame with columns for Seat (factor), Party (factor) and Votes (numeric). I want to create a summary data frame with columns for Seat, Winning party, and Vote share. For example, from the data frame
df <- data.frame(party=rep(c('Lab','C','LD'),times=4),
votes=c(1,12,2,11,3,10,4,9,5,8,6,15),
seat=rep(c('A','B','C','D'),each=3))
I want to get the output
seat winner voteshare
1 A C 0.8000000
2 B Lab 0.4583333
3 C C 0.5000000
4 D LD 0.5172414
I can figure out how to achieve this. But I'm sure there must be a better way, probably a cunning one-liner using Hadley Wickham's reshape package. Any suggestions?
For what it's worth, my solution uses a function from my package djwutils_2.10.zip and is invoked as follows. But there are all sorts of special cases it doesn't deal with, so I'd rather rely on someone else's code.
aggregateList(df, by=list(seat=seat),
FUN=list(winner=function(x) x$party[which.max(x$votes)],
voteshare=function(x) max(x$votes)/sum(x$votes)))
Hadley's plyr package may help you:
library(plyr)
ddply(df, .(seat), function(x) data.frame(winner=x[which.max(x$votes),]$party, voteshare=max(x$votes)/sum(x$votes)))
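To see what that split-apply-combine call is doing, here is a minimal base R sketch of the same idea, with no extra packages. The helper name summarise_seat is hypothetical (it is not part of plyr), and the data frame is the example from the question.

```r
# Same example data as in the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Summarise one seat: the party with the most votes and its share of the total.
# summarise_seat is a hypothetical helper name, not a plyr function.
summarise_seat <- function(x) {
  data.frame(seat      = as.character(x$seat[1]),
             winner    = as.character(x$party[which.max(x$votes)]),
             voteshare = max(x$votes) / sum(x$votes))
}

# Split by seat, summarise each piece, then stack the pieces back together
pieces  <- lapply(split(df, df$seat), summarise_seat)
results <- do.call(rbind, pieces)
rownames(results) <- NULL
results
```

Note that which.max returns only the first maximum, so a tied seat would silently report just one of the tied parties.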
You may be right that there's a cunning one-liner. I tend to favour the view that understandable is better than clever, especially when you're first looking at something. Here's the more verbose alternative.
library(reshape)
votes_by_seat_and_party <- as.matrix(cast(df, seat ~ party, value="votes"))
C Lab LD
A 12 1 2
B 3 11 10
C 9 4 5
D 6 8 15
seats <- rownames(votes_by_seat_and_party)
parties <- colnames(votes_by_seat_and_party)
winner_col <- apply(votes_by_seat_and_party, 1, which.max)
winners <- parties[winner_col]
voteshare_of_winner_by_seat <- apply(votes_by_seat_and_party, 1, function(x) max(x) / sum(x))
results <- data.frame(seat = seats, winner = winners, voteshare = voteshare_of_winner_by_seat, row.names = NULL)
seat winner voteshare
1 A C 0.8000000
2 B Lab 0.4583333
3 C C 0.5000000
4 D LD 0.5172414
# Full voteshare matrix, if you're interested
total_votes_by_seat <- rowSums(votes_by_seat_and_party)
voteshare_by_seat_and_party <- votes_by_seat_and_party / total_votes_by_seat
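If you would rather not depend on reshape, the same seat-by-party matrix can probably be built in base R. The following is a sketch under that assumption, using xtabs and prop.table in place of cast and the manual division; it is an alternative route, not the code above. Seat/party combinations that are absent from the data come out as 0.

```r
# Same example data as in the question
df <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 4),
                 votes = c(1, 12, 2, 11, 3, 10, 4, 9, 5, 8, 6, 15),
                 seat  = rep(c('A', 'B', 'C', 'D'), each = 3))

# Cross-tabulate votes with seats as rows and parties as columns
votes_by_seat_and_party <- xtabs(votes ~ seat + party, data = df)

# Look up the winning party by column name, so column order does not matter
winner_col <- apply(votes_by_seat_and_party, 1, which.max)
winners    <- colnames(votes_by_seat_and_party)[winner_col]

# Row-wise proportions: each row of the result sums to 1
voteshare_by_seat_and_party <- prop.table(votes_by_seat_and_party, margin = 1)

results <- data.frame(seat      = rownames(votes_by_seat_and_party),
                      winner    = winners,
                      voteshare = apply(voteshare_by_seat_and_party, 1, max),
                      row.names = NULL)
results
```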
OK, so that's three solutions... here's another, more compact one using base R. It is 4 sparse code lines. I'm assuming missing values are either 0 or simply absent from the data, because then it won't matter. My guess is that this would be your fastest code for a large set of data.
#get a sum for dividing
s <- aggregate(df$votes, list(seat = df$seat), sum)
#extract the winner and seat, comparing each row against its own seat's maximum
temp <- aggregate(df$votes, list(seat = df$seat), max)
res <- df[df$votes == temp$x[match(df$seat, temp$seat)],]
res$votes <- res$votes / s$x[match(res$seat, s$seat)]
Rename the columns if you wish...
names(res) <- c('winner', 'voteshare', 'seat')
(In the event of a tie, this returns one row for each tied party; you'll be able to see the tied maximum in the temp data frame.)
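Since ties are the awkward case, it may help to detect them explicitly. Below is a minimal sketch on hypothetical data (the tied data frame and its vote counts are invented for illustration, not taken from the question) that keeps every party matching its own seat's maximum and then lists the seats with more than one winner.

```r
# Hypothetical data: seat A is a two-way tie, seat B has a clear winner
tied <- data.frame(party = rep(c('Lab', 'C', 'LD'), times = 2),
                   votes = c(10, 10, 5, 3, 7, 2),
                   seat  = rep(c('A', 'B'), each = 3))

# Maximum and total votes for the seat each row belongs to
seat_max   <- ave(tied$votes, tied$seat, FUN = max)
seat_total <- ave(tied$votes, tied$seat, FUN = sum)

# Keep every row whose votes equal its own seat's maximum
is_winner <- tied$votes == seat_max
winners   <- tied[is_winner, c('seat', 'party')]
winners$voteshare <- (tied$votes / seat_total)[is_winner]

# Seats that appear more than once among the winners are ties
tied_seats <- names(which(table(winners$seat) > 1))
tied_seats
```

How to resolve a tie (report all tied parties, pick one, or flag the seat) is a decision for your data; the sketch only surfaces the problem.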