I have a bunch of data in Excel that I need to get certain percentile information from. The problem is that instead of having the data set made up of each value, I instead have info on the number of or "bucket" data.
For example, imagine that my actual data set looks like this: 1,1,2,2,2,开发者_Python百科2,3,3,4,4,4
The data set that I have is this:
Value No. of occurrences
1 2
2 4
3 2
4 3
Is there an easy way for me to calculate percentile information (as well as the median) without having to explode the summary data out to full data set? (Once I did that, I know that I could just use the Percentile(A1:A5, p) function)
This is important because my data set is very large. If I exploded the data out, I would have hundreds of thousands of rows and I would have to do it for a couple of hundred data sets.
Help!
The median in your example is quite easy as you are showing an odd number of total No. of occurrences
. By observation, the median in 2. Relative to the last 2 (the sixth value in the sequence) there are five values less than or equal [1,1,2,2,2] and five values greater than or equal [3,3,4,4,4].
This can be calculated from your summary data with a formula such as =(1+SUM(No._of_occurrences))/2
where No._of_occurences
is the named range containing the array of your No. of occurrences [2,4,2,3].
A data set with an even number of datapoints does not have a median so any result from adding one datapoint (say 4) is suspect. The formula would return 6.5 in that case, with the half indicating an invalid result (there are two middle values). Though if taking a fairly conventional approach of averaging these two values, then the formula result can be interpreted as the mean of the sixth [2] and seventh [3] values – ie 2.5.
Individual values for your binned No. of occurrences multiplied by 100 and divided by the total No. of occurrences [11] would give the percentages each bin contributes to the total. A cumulative total of these gives the percentile for the upper limit of each bin. Taking say the lower 30th percentile, this arises in the second bin, hence in this case is 2. The lower 20th and the 50th percentile (median) are in the same bin so for them the answer is also 2.
This works because you chose one bin per data point value. Had these, as is more usual, been ranges (say 1-5, 6-10 etc) then the lower 20th and the 50th percentile may still have been in the same bin but would not necessarily have had the same value. However, to determine the value only the contents of that bin would require further examination to determine the exact value, rather than your entire dataset.
精彩评论