I am trying to implement Hampel tanh estimators to normalize highly asymmetric data. To do this, I need to perform the following calculation:
Given x, a sorted list of numbers, and m, the median of x, I need to find a such that approximately 70% of the values in x fall into the range (m-a; m+a). We know nothing about the distribution of values in x. I am writing in Python using numpy, and the best idea I have had is to write some sort of stochastic iterative search (for example, as described by Solis and Wets), but I suspect that there is a better approach, either a better algorithm or a ready-made function. I searched the numpy and scipy documentation but couldn't find any useful hint.
EDIT
Seth suggested using scipy.stats.mstats.trimboth; however, in my test with a skewed distribution, this suggestion didn't work:
from scipy.stats.mstats import trimboth
import numpy as np
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)
trimmedList = trimboth(theList, proportiontocut=0.15)
a = (trimmedList.max() - trimmedList.min()) * 0.5
# check what fraction of the elements falls into the range (m-a; m+a)
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))
print(np.sum(sel) / float(len(theList)))
The output is 0.79 (~80% instead of 70%).
You first need to symmetrize your distribution by folding all values less than the median over to the right. Then you can use the standard scipy.stats functions on this one-sided distribution:
from scipy.stats import scoreatpercentile
import numpy as np
theList = np.log10(1+np.arange(.1, 100))
theMedian = np.median(theList)
oneSidedList = theList.copy() # copy the original array (theList[:] would only create a view)
# fold over to the right all values left of the median
oneSidedList[theList < theMedian] = 2*theMedian - theList[theList < theMedian]
# find the 70th centile of the one-sided distribution
a = scoreatpercentile(oneSidedList, 70) - theMedian
# check what fraction of the elements falls into the range (m-a; m+a)
sel = (theList > (theMedian - a)) & (theList < (theMedian + a))
print(np.sum(sel) / float(len(theList)))
This gives the result of 0.7, as required.
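Folding the values about the median is the same as taking absolute deviations from the median, so the whole calculation can also be written without the explicit fold. Here is a minimal sketch of that variant; the function name half_width and the use of np.percentile in place of scoreatpercentile are my own choices, not part of the answer above:

```python
import numpy as np

def half_width(values, fraction=0.70):
    """Return (median, a) such that roughly `fraction` of the values
    lie in the open interval (median - a, median + a).

    half_width is a hypothetical helper name used for illustration.
    """
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    # Absolute deviations are the "folded" one-sided distribution.
    abs_dev = np.abs(values - median)
    # The requested percentile of the deviations is the half-width a.
    a = np.percentile(abs_dev, 100.0 * fraction)
    return median, a

the_list = np.log10(1 + np.arange(0.1, 100))
m, a = half_width(the_list)
# Fraction of the values that fall strictly inside (m - a, m + a).
coverage = np.mean(np.abs(the_list - m) < a)
print(coverage)
```

This does not require x to be sorted, and it never modifies the input array.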
Restate the problem slightly. You know the length of the list, and what fraction of the numbers in the list to consider. Given that, you can determine the difference between the first and last indices in the list that give you the desired range. The goal then is to find the indices that will minimize a cost function corresponding to the desired symmetric values about the median.
Let the smaller index be n1 and the larger index be n2; these are not independent. The values from the list at the indices are x[n1] = m-b and x[n2] = m+c. You now want to choose n1 (and thus n2) so that b and c are as close as possible. This occurs when (b - c)**2 is minimal. That's pretty easy using numpy.argmin. Paralleling the example in the question, here's an interactive session illustrating the approach:
$ python
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> theList = np.log10(1+np.arange(.1, 100))
>>> theMedian = np.median(theList)
>>> listHead = theList[0:30]
>>> listTail = theList[-30:]
>>> b = np.abs(listHead - theMedian)
>>> c = np.abs(listTail - theMedian)
>>> squaredDiff = (b - c) ** 2
>>> np.argmin(squaredDiff)
25
>>> listHead[25] - theMedian, listTail[25] - theMedian
(-0.2874888056626983, 0.27859407466756614)
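The session above hard-codes a head and tail of 30 elements for a list of 100. The same idea can be sketched as a function for any length and fraction. This is only a sketch under my own assumptions: the name symmetric_window is made up, and I take the window to contain k = round(fraction * n) points, which differs by one element from the slicing used in the session:

```python
import numpy as np

def symmetric_window(sorted_values, fraction=0.70):
    """Slide a window of k consecutive points over the sorted data and
    return (n1, n2, b, c) for the window whose ends x[n1] = m - b and
    x[n2] = m + c are closest to symmetric about the median m.

    symmetric_window is a hypothetical helper name used for illustration.
    """
    x = np.asarray(sorted_values, dtype=float)
    n = x.size
    k = int(round(fraction * n))      # assumed number of points in the window
    m = np.median(x)
    starts = np.arange(n - k + 1)     # every possible n1
    ends = starts + k - 1             # n2 is fixed once n1 is chosen
    b = m - x[starts]                 # distance from the lower end to m
    c = x[ends] - m                   # distance from m to the upper end
    best = np.argmin((b - c) ** 2)
    return starts[best], ends[best], b[best], c[best]

the_list = np.log10(1 + np.arange(0.1, 100))
n1, n2, b, c = symmetric_window(the_list)
print(n1, n2, b, c)
```

Since b and c will rarely be exactly equal, something like (b + c) / 2 could serve as a; that choice is not specified in the answer and is only a suggestion.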
What you want is scipy.stats.mstats.trimboth. Set proportiontocut=0.15. After trimming, take (max-min)/2.