(be sure to check out the EDIT at the end of the post 开发者_C百科before reading too deeply into the source)
I'm plotting a histogram of a population that seems to be of log Laplacian distribution:
I'm trying to draw a line of best fit for it to verify my hypothesis, but I'm having problems getting meaningful results.
I'm using the Laplacian PDF definition from Wikipedia and taking 10 to the power of the PDF (to "reverse" the effects of the log histogram).
What am I doing wrong?
Here is my code. I pipe things through standard input (cat pop.txt | python hist.py
) -- here's a sample population.
from pylab import *
import numpy
def laplace(x, mu, b):
return 10**(1.0/(2*b) * numpy.exp(-abs(x - mu)/b))
def main():
import sys
num = map(int, sys.stdin.read().strip().split(' '))
nbins = max(num) - min(num)
n, bins, patches = hist(num, nbins, range=(min(num), max(num)), log=True, align='left')
loc, scale = 0., 1.
x = numpy.arange(bins[0], bins[-1], 1.)
pdf = laplace(x, 0., 1.)
plot(x, pdf)
width = max(-min(num), max(num))
xlim((-width, width))
ylim((1.0, 10**7))
show()
if __name__ == '__main__':
main()
EDIT
OK, here is the attempt to match it to a regular Laplacian distribution (as opposed to a log Laplacian). Differences from above attempt:
- histogram is normed
- histogram is linear (not log)
laplace
function defined exactly as specified in the Wikipedia article
Output:
As you can see, it isn't the best match, but the figures (the histogram and the Laplace PDF) are at least now in the same ballpark. I think the log Laplace will match a lot better. My approach (source above) didn't work. Can anybody suggest a way to do this what works?
Source:
from pylab import *
import numpy
def laplace(x, mu, b):
return 1.0/(2*b) * numpy.exp(-abs(x - mu)/b)
def main():
import sys
num = map(int, sys.stdin.read().strip().split(' '))
nbins = max(num) - min(num)
n, bins, patches = hist(num, nbins, range=(min(num), max(num)), log=False, align='left', normed=True)
loc, scale = 0., 0.54
x = numpy.arange(bins[0], bins[-1], 1.)
pdf = laplace(x, loc, scale)
plot(x, pdf)
width = max(-min(num), max(num))
xlim((-width, width))
show()
if __name__ == '__main__':
main()
Your laplace() function does not seem to be a Laplace distribution. Besides,
numpy.log()
is a natural logarithm (basee
), not decimal.Your histogram does not seem to be normalized, while the distribution is.
EDIT:
Don't use blanket imports
from pyplot import *
, it'll bite you.If you're checking consistency with Laplace distribution (or its log), use the fact that the latter is symmetric around
mu
: fixmu
at a maximum of your histogram, and you have a single-parameter problem. And you can only use either half of the histogram as well.Use
numpy
's histogram function -- this way you can get the histogram itself, which you can then fit with a Laplace distribution (and/or its log). The chi-square will tell you how good (or bad) the consistency is. For fitting you can use, e.g.scipy.optimize.leastsq
routine (http://www.scipy.org/Cookbook/FittingData)
I've found a work-around for the problem I was having. Instead of using matplotlib.hist
, I use numpy.histogram
in combination with matplotlib.bar
to calculate the histogram and draw it in two separate steps.
I'm not sure if there's a way to do this using matplotlib.hist
-- it would definitely be more convenient, though.
You can see that it's a much better match.
My problem now is I need to estimate the scale
parameter of the PDF.
Source:
from pylab import *
import numpy
def laplace(x, mu, b):
"""http://en.wikipedia.org/wiki/Laplace_distribution"""
return 1.0/(2*b) * numpy.exp(-abs(x - mu)/b)
def main():
import sys
num = map(int, sys.stdin.read().strip().split(' '))
nbins = max(num) - min(num)
count, bins = numpy.histogram(num, nbins)
bins = bins[:-1]
assert len(bins) == nbins
#
# FIRST we take the log of the histogram, THEN we normalize it.
# Clean up after divide by zero
#
count = numpy.log(count)
for i in range(nbins):
if count[i] == -numpy.inf:
count[i] = 0
count = count/max(count)
loc = 0.
scale = 4.
x = numpy.arange(bins[0], bins[-1], 1.)
pdf = laplace(x, loc, scale)
pdf = pdf/max(pdf)
width=1.0
bar(bins-width/2, count, width=width)
plot(x, pdf, color='r')
xlim(min(num), max(num))
show()
if __name__ == '__main__':
main()
精彩评论