Compute how nice a set of values looks (how nice the distribution is)

Developer https://www.devze.com 2023-02-06 05:26 Source: Internet

This set of values: 1 2 3 3 4 1 looks pretty nice if you think of it on a bar chart:

*   *
* * * *
=======
1 2 3 4

while this one looks bad: 1 2 2 2 2 2 2 2 2 9 8

  *
  *
  * 
  * 
  *
  *
  *
* *           * *
=================
1 2 3 4 5 6 7 8 9

This is because there are a lot of 2s and a big gap between the 2 and the 8.

I need to find a formula that computes how nice a set of numbers looks. I think I'll need some kind of deviation function. Any ideas?

thanks

A chi-square analysis is probably what you're looking for. If used in the right way it will give you a number describing how close your distribution is to a discrete uniform distribution. A discrete uniform distribution will be flat (i.e. have approximately the same number of elements in each of the histogram buckets), which seems to fit your definition of 'nice'.
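
As a rough illustration of the chi-square idea, here is a minimal sketch that computes Pearson's chi-square statistic of the values against a flat histogram. The function name chi_square_uniform and the choice of buckets (every integer from the smallest to the largest value) are assumptions made for this example, not something the question specifies. With counts this small the number is best read as a descriptive score (0 means perfectly flat, larger means less flat) rather than as a formal significance test.

```python
from collections import Counter

def chi_square_uniform(nums):
    """Pearson chi-square statistic of nums against a discrete uniform
    distribution over min(nums)..max(nums) (hypothetical helper)."""
    # Assumption: one bucket per integer in the observed range.
    buckets = range(min(nums), max(nums) + 1)
    # Height every bar would have if the histogram were perfectly flat.
    expected = len(nums) / len(buckets)
    # Counter returns 0 for values that never occur, so gaps are penalized.
    counts = Counter(nums)
    return sum((counts[b] - expected) ** 2 / expected for b in buckets)

nice_stat = chi_square_uniform([1, 2, 3, 3, 4, 1])
bad_stat = chi_square_uniform([1, 2, 2, 2, 2, 2, 2, 2, 2, 9, 8])
print(nice_stat, bad_stat)  # the first value should be the smaller one
```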

Here is an approach that seems reasonable to me, though my knowledge of statistics is pretty limited:

from collections import Counter

def tonums(s):
    # split on whitespace so that multi-digit values parse too
    return [int(x) for x in s.split()]

def nice(nums):
    # how far do they spread
    used_range = range(min(nums), max(nums) + 1)

    # how often would each number occur if they were equally distributed
    expected = len(nums) / len(used_range)

    # how often do they actually occur
    counter = Counter(nums)

    # sum of squared deviations from the expected count (lower is nicer);
    # iterating over used_range also counts values missing from the data (gaps)
    return sum((counter[value] - expected) ** 2 for value in used_range)


# should be fst < snd
print(nice(tonums('1 2 3 3 4 1')))
print(nice(tonums('1 2 2 2 2 2 2 2 2 9 8')))

# these should be 0
print(nice(tonums('1')))
print(nice(tonums('1 1 1 1')))

# should be equal
print(nice(tonums('1 1 2 3')))
print(nice(tonums('1 2 2 3')))

Your definition of "nice" is somewhat broad. I'd suggest two approaches, based on my interpretation of what you mean by nice:

Compute (or estimate) how far your data is from being normally distributed. A stats textbook or stats package should discuss this.
Perform some kind of Fourier transform: a lot of high-frequency components probably aren't "nice".
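
One possible way to put a number on the first suggestion above is the Jarque-Bera statistic, which combines sample skewness and excess kurtosis (both are 0 for a normal distribution). This is only a sketch: the answer does not name a specific measure, so the choice of Jarque-Bera and the function name jarque_bera are assumptions, and with so few values the result is a rough score rather than a reliable test.

```python
def jarque_bera(nums):
    """Jarque-Bera statistic: n/6 * (S^2 + K^2/4), where S is the sample
    skewness and K the excess kurtosis. Larger means less normal-looking."""
    n = len(nums)
    mean = sum(nums) / n
    # Biased central moments of order 2, 3 and 4.
    m2 = sum((x - mean) ** 2 for x in nums) / n
    m3 = sum((x - mean) ** 3 for x in nums) / n
    m4 = sum((x - mean) ** 4 for x in nums) / n
    if m2 == 0:
        # Assumption: treat constant data as an error rather than pick a score.
        raise ValueError("all values are identical; skewness is undefined")
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)

jb_nice = jarque_bera([1, 2, 3, 3, 4, 1])
jb_bad = jarque_bera([1, 2, 2, 2, 2, 2, 2, 2, 2, 9, 8])
print(jb_nice, jb_bad)  # the first value should be the smaller one
```

Note that this scores closeness to a bell shape, not flatness, so it can disagree with a uniformity-based measure on other inputs.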