Re-bucketing data in R

I've been using the "hist" function to bucket my data in R. What I would like to do now is have a hist function that takes not just a list of values to bucket, but each value together with its count. I've written one in R to do it for me, but it's 10-50x slower (very rough estimate) than the built-in hist.

Is there any way to do this 'natively'?

So, for example, maybe a list (or vector) of the form (1, 200), (2, 30), (3, 50)

Where the first value is the value, and the second is the number of instances of that data (I can move my data into other forms, this is just an example)

Thanks!

Update: I'm (basically) mapping a continuous domain to an arbitrary discrete domain. So say I have a hundred values between 0 and 10, and I want an output of how many are between 0 and 1, 1 and 2, etc. (or between 0 and 2, 2 and 4, or whatever). The hist function works fine for that (I tell it where to divide the 'buckets') and it outputs the discretized counts (I can pass in a flag not to draw the graph).

But what I have now is not just a set of values from 0 to 10, but a set of values AND how many instances there are of each. So instead of having 0.1, 0.1, 0.1, 0.1, 0.2, 0.2, 0.5 as 7 different values, I have it in the form (0.1, 4), (0.2, 2), (0.5, 1), which shows the values and the counts. And I want to be able to run the 'hist' function (or something like it) over the data and get the same output as if it were in the 'expanded' form.

So I've written a function to do that, but it runs A LOT slower than the original hist. "Unrolling" the data would make it too large in memory for what I need.

I am not sure what you mean by "bucketing data", but if I am right, you want to get the categories/breaks made by the hist function and store the results.

This could be done easily without calling graphics, e.g.:

> table(cut(data, 5))
(-0.000908,0.198]     (0.198,0.397]     (0.397,0.595]     (0.595,0.794] 
               19                20                17                21 
    (0.794,0.993] 
               23

The data were made up for demonstration purposes with data <- runif(100).

In the above command, cut does the main job: it cuts the continuous variable into the specified number of intervals (5 in this case). I called table to show the frequencies.
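The same cut-based idea can be applied to data that is already in (value, count) form, without expanding it. The following is only a sketch under assumptions: the vectors vals, counts and breaks are made up for illustration, and include.lowest=TRUE is passed to cut so that the bucket edges are treated the way hist treats them by default.

```r
# Hypothetical compact data: distinct values and how often each occurs
vals   <- c(0.1, 0.2, 0.5)
counts <- c(4, 2, 1)

# Bucket boundaries chosen for this sketch (one bucket stays empty)
breaks <- c(0, 0.15, 0.3, 0.4, 0.6)

# Assign each distinct value to a bucket, then sum the counts per bucket
bins     <- cut(vals, breaks = breaks, include.lowest = TRUE)
weighted <- tapply(counts, bins, sum)

# tapply returns NA for buckets that received no values; treat those as 0
weighted[is.na(weighted)] <- 0
weighted
```

Only the distinct values are passed through cut, so the cost depends on the number of (value, count) pairs rather than on the total count.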

I might be missing something, but I think this might help:

#Generate the data
x <- c(rep(1, 200), rep(2, 30), rep(3, 50))

#Since the midpoints of each bucket will be used and the desired bucket width
#is 1, start the bucket breaks at -0.5
buc <- seq(-0.5, 5, 1)

#Get a histogram using the above bucket breaks
res <- hist(x, breaks=buc)

#Build a data frame with the results
df <- data.frame(mids=res$mids, counts=res$counts)
df

  mids counts
1    0      0
2    1    200
3    2     30
4    3     50
5    4      0

Use names to see which components are available in the object returned by hist:

names(res)

[1] "breaks"      "counts"      "intensities" "density"     "mids"        "xname"       "equidist"
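If only the numbers are needed, hist can be asked not to draw anything by passing plot=FALSE. A minimal sketch reusing the same made-up data and breaks as above:

```r
# Same illustrative data and bucket breaks as in the example above
x   <- c(rep(1, 200), rep(2, 30), rep(3, 50))
buc <- seq(-0.5, 5, 1)

# plot = FALSE returns the histogram object without opening a graphics device
res <- hist(x, breaks = buc, plot = FALSE)
res$counts
```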

Along with the other responder, I am not exactly sure what you want, but am guessing that you want an expansion of a tabular description of a larger vector:

unlist( mapply("rep", x=c(1,2,3), times=c(200,30,50) ) )

  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [34] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [67] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[100] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[133] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[166] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[199] 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3
[232] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
[265] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
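Since rep accepts a vector for its times argument, the same expansion can probably be written without mapply. A small sketch, with vals and times as illustrative names:

```r
# Distinct values and the number of times each should be repeated
vals  <- c(1, 2, 3)
times <- c(200, 30, 50)

# rep repeats each element of vals by the matching element of times
expanded <- rep(vals, times = times)

length(expanded)
table(expanded)
```

Note that this still materialises the full vector, so it does not help if the expanded data would not fit in memory.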

Do you mean something like this?

barplot(height=c(200,30,50),names.arg=1:3,space=0,ylab="Count")

You could also do this by hacking your data into the format returned by hist and calling graphics:::plot.histogram, i.e.

## must specify counts, mids, breaks, and specify that the bars are equidistant
h <- list(counts=c(200,30,50),mids=1:3,breaks=seq(0.5,3.5,by=1),equidist=TRUE)
graphics:::plot.histogram(h,freq=TRUE)

edit: It depends what form your data are in and how flexible you want to be about re-bucketing.

A crude, simple version, if you want to take an existing set of breaks, midpoints, and counts, and lump together every group of agg adjacent bins (in your example agg=2), would be:

mids <- seq(0.1,0.6,by=0.1)
breaks <- seq(0.05,0.65,by=0.1)
counts <- c(3,7,6,9,6,7)

agg <- 2
bnames <- apply(matrix(mids,byrow=TRUE,ncol=agg),1,
                      function(x) paste(head(x,1),tail(x,1),sep="-"))
bmids <- rowMeans(matrix(mids,byrow=TRUE,ncol=agg))
bbreaks <- breaks[seq(1,length(breaks),by=agg)]
bcount <- rowSums(matrix(counts,byrow=TRUE,ncol=agg))

h <- list(counts=bcount,mids=bmids,breaks=bbreaks,equidist=TRUE)
graphics:::plot.histogram(h,freq=TRUE)
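The matrix-based version above needs the number of bins to be a multiple of agg. As a hedged alternative, the lumping can be sketched with a grouping index instead, which also copes with a leftover bin at the end. The names grp, bcount2 and bbreaks2 are introduced here for illustration only:

```r
# Same illustrative bins as above
breaks <- seq(0.05, 0.65, by = 0.1)
counts <- c(3, 7, 6, 9, 6, 7)
agg    <- 2

# Group index: bins 1..agg go to group 1, the next agg bins to group 2, ...
grp <- ceiling(seq_along(counts) / agg)

# Sum the counts within each group of adjacent bins
bcount2 <- as.numeric(tapply(counts, grp, sum))

# Keep every agg-th break, and always keep the last one
keep     <- unique(c(seq(1, length(breaks), by = agg), length(breaks)))
bbreaks2 <- breaks[keep]

bcount2
bbreaks2
```

If the number of bins is not a multiple of agg, the last group is narrower than the others, so the resulting bars would no longer be equidistant.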