making binned boxplot in matplotlib with numpy and scipy in Python_问答_开发者

I have a 2-d array containing pairs of values and I'd like to make a boxplot of the y-values by different bins of the x-values. I.e. if the array is:

my_array = array([[1, 40.5], [4.5, 60], ...]])

then I'd like to bin my_array[:, 0] and then for each of the bins, produce a boxplot of the corresponding my_array[:, 1] values that fall into each box. So in the end I want the plot to contain number of bins-many box plots.

I tried the following:

min_x = min(my_array[:, 0])
max_x = max(my_array[:, 1])

num_bins = 3
bins = linspace(min_x, max_x, num_bins)
elts_to_bins = digitize(my_array[:, 0], bins)

However, this gives me values in elts_to_bins that range from 1 to 3. I thought I should get 0-based indices for the bins, and I only wanted 3 bins. I'm assuming this is due to some trickyness with how b开发者_运维百科ins are represented in linspace vs. digitize.

What is the easiest way to achieve this? I want num_bins-many equally spaced bins, with the first bin containing the lower half of the data and the upper bin containing the upper half... i.e., I want each data point to fall into some bin, so that I can make a boxplot.

thanks.

You're getting the 3rd bin for the maximum value in the array (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])"). You can avoid this by adding 1 (or any positive number) to the last bin.

Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.

Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length)

def bin_by(x, y, nbins=30):
    """
    Bin x by y.
    Returns the binned "x" values and the left edges of the bins
    """
    bins = np.linspace(y.min(), y.max(), nbins+1)
    # To avoid extra bin for the max value
    bins[-1] += 1 

    indicies = np.digitize(y, bins)

    output = []
    for i in xrange(1, len(bins)):
        output.append(x[indicies==i])

    # Just return the left edges of the bins
    bins = bins[:-1]

    return output, bins

As a quick example:

In [3]: x = np.random.random((100, 2))

In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)

In [5]: binned_values
Out[5]: 
[array([ 0.59649575,  0.07082605,  0.7191498 ,  0.4026375 ,  0.06611863,
        0.01473529,  0.45487203,  0.39942696,  0.02342408,  0.04669615,
        0.58294003,  0.59510434,  0.76255006,  0.76685052,  0.26108928,
        0.7640156 ,  0.01771553,  0.38212975,  0.74417014,  0.38217517,
        0.73909022,  0.21068663,  0.9103707 ,  0.83556636,  0.34277006,
        0.38007865,  0.18697416,  0.64370535,  0.68292336,  0.26142583,
        0.50457354,  0.63071319,  0.87525221,  0.86509534,  0.96382375,
        0.57556343,  0.55860405,  0.36392931,  0.93638048,  0.66889756,
        0.46140831,  0.01675165,  0.15401495,  0.10813141,  0.03876953,
        0.65967335,  0.86803192,  0.94835281,  0.44950182]),
 array([ 0.9249993 ,  0.02682873,  0.89439141,  0.26415792,  0.42771144,
        0.12292614,  0.44790357,  0.64692616,  0.14871052,  0.55611472,
        0.72340179,  0.55335053,  0.07967047,  0.95725514,  0.49737279,
        0.99213794,  0.7604765 ,  0.56719713,  0.77828727,  0.77046566,
        0.15060196,  0.39199123,  0.78904624,  0.59974575,  0.6965413 ,
        0.52664095,  0.28629324,  0.21838664,  0.47305751,  0.3544522 ,
        0.57704906,  0.1023201 ,  0.76861237,  0.88862359,  0.29310836,
        0.22079126,  0.84966201,  0.9376939 ,  0.95449215,  0.10856864,
        0.86655289,  0.57835533,  0.32831162,  0.1673871 ,  0.55742108,
        0.02436965,  0.45261232,  0.31552715,  0.56666458,  0.24757898,
        0.8674747 ])]

Hope that helps a bit!

Numpy has a dedicated function for creating histograms the way you need to:

histogram(a, bins=10, range=None, normed=False, weights=None, new=None)

which you can use like:

(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])

The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:

a = [0, 1]
weights = [10, 2]

describes 10 points at x = 0 and 2 points at x = 1.

You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).

The histogram can then be plotted with something like:

bar(bin_edges[:-1], hist_data)

If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:

hist(my_array[:,0], weights=my_array[:,1])

making binned boxplot in matplotlib with numpy and scipy in Python

精彩评论

关注公众号

热门标签

图文推荐

making binned boxplot in matplotlib with numpy and scipy in Python

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：