ANOVA over time in Python, what am I doing?

I really like statistics, but haven't taken a course in over 6 years. I'm having trouble figuring out what kind of test I need here, and the best numpy/scipy/R function to use for these kinds of issues.

I've got a table of visitors and their corresponding properties (e.g. "Browser = Mozilla, Referrer = Google"), as well as a variable value per visitor (e.g. $5), grouped into data points over time.

My goal is to:

A) Find the most significant property families, with a score for "how significant" the family is

Example of a conclusion I want to draw:

Referrer has 10x larger effect size upon value-per-visitor than Browser
=> PropertyFamily('browser').significance = 1
=> PropertyFamily('referrer').significance = 10

AND

B) Find the most significant properties within families, with significance scores.

Sample of a conclusion I'd like to draw:

GIVEN THAT Value:Baseline => $5/hit
5 hits from IE @ $5/hit (equal to baseline) => no significance
1 hit from Netscape @ $0 => little significance (not enough data)
10 hits from FF @ $10/hit => HIGH significance (hits and delta_value both high)

My questions are:

1) Are there numpy/scipy/R functions to make my life easy here?

2) Can anyone who knows a bit more about ANOVA (analysis of variance) and ANOVA-over-time please provide feedback? I'm not positive that I'm even doing this right, and I could be missing something simple. Confirmation or correction would both be appreciated.

Note that these are ARRAYS of (hits, values, days) over the last 30 days. For example, if there's a large peak (relative to baseline) in Value-Of-Mozilla on Monday, and a drop (below baseline) in Value-Of-Mozilla on Tuesday, I want Mozilla to show up as a "significant" property (rather than the peak/drop canceling each other out)

Example of my input data, before map/reducing:

data = {
'baseline': [(hits, value, day) for hits, value, day in last_thirty_days('baseline')],
'browser': {
  'mozilla': [(hits, value, day) for hits, value, day in last_thirty_days('browser', 'mozilla')],
  ... etc ...
  }
}
... etc ...

Here's my current code. It runs on Dumbo/Hadoop and produces a "significance" number from a formula I basically invented. The formula works and gives meaningful data, but the scores aren't well defined (a "significant" property usually scores >= 100, but that threshold shifts with the size of the dataset), and I suspect there's a "real formula" for this.

import math

# Runs after each (hits, value, date) tuple has been grouped
# into corresponding "plot points", as they would appear on a graph
pp = PlotPoint(property, date, hits, value)
# Convert before dividing so integer inputs don't truncate (Python 2)
pp.epc = float(pp.value) / pp.hits if pp.hits else 0.0

# Finds PlotPoint('baseline', date)
# if pp = PlotPoint('firefox', '1-1-10')
#  then pp.baseline == PlotPoint('baseline', '1-1-10')
baseline = pp.baseline()
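if baseline.hits == 0:
    volume_ratio = 0
else:
    # Percentage of baseline traffic; 100.0 forces float division
    volume_ratio = round(100.0 * pp.hits / baseline.hits)
# Despite the name, this is a difference (delta), not a ratio
value_ratio = baseline.epc - pp.epc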

# Make up a significance value --
# e.g. sqrt(10 (% of visitors) * ($1 delta from baseline)^2)
pp.significance = math.sqrt(volume_ratio * value_ratio ** 2)

# OK, we have values for each plotpoint, now sum them up
# to get values for the whole property (over a 30day period) 
pps = property.plotpoint_set.all()
property.hits = sum([p.hits for p in pps])
property.value = sum([p.value for p in pps])
property.epc = float(property.value) / property.hits if property.hits else 0.0
value_delta = baseline.epc - property.epc

# Make up a significance for the Property, based on each point's significance
property.significance = math.log(sum(
                [sss.significance**2 for sss in pps]
                )*abs(value_delta)+1)

Thanks in advance!

AFAIK, the statistical tests available in numpy/scipy are fairly basic. You might want to look into R, a language more or less dedicated to statistics, and with a lot of advanced functions available.

Also, I don't think a MANOVA is really what you want to do. MANOVA is for when you have several dependent variables that may be related to each other. You have a single dependent variable (value per visitor), so this is really just an ANOVA.

Examples of what you could do in R:

bybrowser = lm(value ~ browser, data=visitors)
anova(bybrowser)
byreferrer = lm(value ~ referrer, data=visitors)
anova(byreferrer)
byreferrerandbrowser = lm(value ~ browser * referrer, data=visitors)
anova(byreferrerandbrowser)
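
For a rough Python counterpart to the first two calls, here is a minimal sketch of a one-way ANOVA written with only the standard library, so the arithmetic is visible. scipy.stats.f_oneway computes the same F statistic and also returns a p-value. The function names (one_way_anova, group_values), the toy visitor records, and the choice of eta squared (the share of variance explained by the grouping) as a per-family "how significant" score are all assumptions for illustration, not taken from the question's data.

```python
from collections import defaultdict
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, eta squared) for a dict of group name -> values.

    Assumes at least two groups and some within-group variation.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    group_means = {name: mean(values) for name, values in groups.items()}

    # Variation of the group means around the grand mean
    ss_between = sum(
        len(values) * (group_means[name] - grand_mean) ** 2
        for name, values in groups.items()
    )
    # Variation of individual values around their own group mean
    ss_within = sum(
        (v - group_means[name]) ** 2
        for name, values in groups.items()
        for v in values
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared


def group_values(records, column):
    """Collect each record's 'value' under its level of the given column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record["value"])
    return dict(groups)


# Hypothetical per-visitor records, made up for this example
visitors = [
    {"browser": "mozilla", "referrer": "google", "value": 9.0},
    {"browser": "mozilla", "referrer": "direct", "value": 2.0},
    {"browser": "ie", "referrer": "google", "value": 8.0},
    {"browser": "ie", "referrer": "direct", "value": 1.5},
    {"browser": "mozilla", "referrer": "google", "value": 8.5},
    {"browser": "ie", "referrer": "direct", "value": 2.5},
    {"browser": "mozilla", "referrer": "direct", "value": 1.0},
    {"browser": "ie", "referrer": "google", "value": 7.5},
]

for family in ("browser", "referrer"):
    f_stat, eta_squared = one_way_anova(group_values(visitors, family))
    print(family, "F =", round(f_stat, 3), "eta^2 =", round(eta_squared, 3))
```

Comparing eta squared across families gives a relative score along the lines of "referrer matters more than browser". The sketch does not cover the third call (the browser * referrer interaction); for that, a formula-based package such as statsmodels (ols plus anova_lm) is probably the closer match to R's lm/anova.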

Note that this all assumes that the values within each group (strictly speaking, the model residuals) are roughly normally distributed. You should check this assumption; hist(visitors$value) is a good start. If they're not, either find a way to normalise them (try taking the log), or use an appropriate non-parametric test, such as Kruskal-Wallis.
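
If a transform doesn't help, one common non-parametric stand-in for a one-way ANOVA is the Kruskal-Wallis test, which compares ranks instead of raw values. Below is a minimal standard-library sketch of its H statistic. The function names (average_ranks, kruskal_wallis_h) and the sample numbers are made up for illustration, and the sketch leaves out the tie correction and the p-value; scipy.stats.kruskal provides both.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic for a list of samples (no tie correction applied)."""
    pooled = [v for values in groups for v in values]
    ranks = average_ranks(pooled)
    n = len(pooled)

    total = 0.0
    start = 0
    for values in groups:
        # Sum of the pooled ranks that belong to this group
        rank_sum = sum(ranks[start:start + len(values)])
        total += rank_sum ** 2 / len(values)
        start += len(values)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical value-per-visitor samples for three browsers
samples = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(kruskal_wallis_h(samples))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom, so larger values mean the groups differ more than chance would suggest.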

Oh, and finally, if you want advice on stats, there's a sister site dedicated to just that: https://stats.stackexchange.com/