How can I correlate pageviews with memory spikes?

I'm having some memory problems with an application, but it's a bit difficult to figure out exactly where it is. I have two sets of data:

Pageviews

  • The page that was requested
  • The time said page was requested

Memory use

  • The amount of memory being used
  • The time this memory use was recorded

I'd like to see exactly which pageviews are correlated with high memory usage. My guess is that I'll be doing a t-test of some kind to determine which pageviews are correlated with increased memory usage. However, I'm a bit uncertain as to what kind of t-test to go with. Can someone at least point me in the right direction?


I would suggest constructing a dataset with two columns. The first would be the proportion of each page's appearances during the highest-memory-usage portion of the distribution, and the second the proportion of those same pages during the rest of the memory distribution.
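A minimal sketch of constructing such a dataset in R; the views data frame, its column names, and the top-5% cutoff are all hypothetical and would need to be adapted to your logs:

    set.seed(1)

    # Hypothetical joined log: the page requested and the memory reading
    # closest to that request time (in practice, join your two logs on time)
    views <- data.frame(
      page   = sample(c("A", "B", "C", "D", "E"), 500, replace = TRUE),
      mem_mb = rgamma(500, shape = 4, scale = 50)
    )

    # Split requests into the "high" memory tail (top 5%) and the rest
    cutoff <- quantile(views$mem_mb, 0.95)
    views$group <- ifelse(views$mem_mb >= cutoff, "high", "rest")

    # Proportion of each page's appearances within each group
    props <- prop.table(table(views$page, views$group), margin = 2)

    # Two-column dataset: one row per page
    dataset <- data.frame(high = props[, "high"], rest = props[, "rest"])
    dataset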

Then you would perform a paired test of the null hypothesis (H0) that the median of the differences (high - rest) is less than or equal to zero, against the alternative hypothesis (H1) that the median of the differences is greater than zero. I would suggest the non-parametric Wilcoxon signed-rank test, which is the paired-sample counterpart of the Mann-Whitney test. It also takes into account the magnitude of the differences within each pair, something that other tests (e.g. the sign test) ignore.

Keep in mind that ties (zero differences) cause numerous problems in the derivation of nonparametric methods and should be avoided. The preferable way to deal with ties is to add a small amount of "noise" to the data: that is, run the test after perturbing the tied values by a random component small enough not to affect the ranking of the differences.

I hope the test's results, together with a plot of the distribution of the differences, will give you insight into where the problem is.

The Wilcoxon signed-rank test is available in base R as wilcox.test.
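A minimal sketch of that test, reusing the hypothetical dataset (high and rest columns, one row per page) built in the earlier sketch; the tiny jitter implements the tie-breaking trick described above:

    # Differences per page between the high-memory and remaining proportions
    d <- dataset$high - dataset$rest

    # Tie-breaking trick: perturb the differences with noise small enough
    # not to change the ranking of the non-zero differences
    d <- d + runif(length(d), min = -1e-9, max = 1e-9)

    # One-sample signed-rank test on the differences (equivalent to the paired test)
    # H0: median(high - rest) <= 0   vs   H1: median(high - rest) > 0
    wilcox.test(d, mu = 0, alternative = "greater")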


Jason,

You ask good statistical questions. Think about the amount of memory being used as a random variable. The first step is to look at the distribution of this r.v. It may not fit any known distribution, but don't let that stop us. One simple approach would be to take the highest memory usage (the top 5-10%) and see whether those pageviews (or the times when they were requested) are any different from the pageviews for the rest. I think you'll need some non-parametric test that compares the proportions of pageviews in the low-memory sample to those in the high-memory sample. Hope this helps.
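One possible way to run such a comparison in R: split at the top 10% of memory readings and test whether the mix of pages differs between the two groups. The joined reqs data frame and its columns are hypothetical, and the chi-squared test of homogeneity is one choice of test, not one the answer names:

    set.seed(2)

    # Hypothetical joined log: page requested and memory reading at that time
    reqs <- data.frame(
      page   = sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE),
      mem_mb = rgamma(1000, shape = 4, scale = 50)
    )

    # Label the top 10% of memory readings as "high", the rest as "rest"
    reqs$group <- ifelse(reqs$mem_mb >= quantile(reqs$mem_mb, 0.90), "high", "rest")

    # Does the mix of pages differ between the high and rest samples?
    tab <- table(reqs$page, reqs$group)
    test <- chisq.test(tab)
    test

    # Standardized residuals highlight pages over-represented in the high group
    test$stdres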


What you pose is certainly an interesting statistical problem, but might I suggest a graphical approach with a good ol' spreadsheet instead?

Assign each of your pages a unique number, and make a scatter plot of page # vs memory usage. You should get a bunch of vertical lines of markers. Hopefully the culprit will be obvious.

If there are so many data points that the lines turn solid, you can add a small amount of noise to the page numbers to broaden the lines. If the requests overlap, you may have to try tricks like dividing the memory by the number of concurrent requests, but your eyes should be able to pick out the offender even with a lot of noise.
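A quick version of that plot in R, with jitter added to the page numbers so that solid vertical lines of markers spread into visible bands; the reqs data frame and its columns are again hypothetical:

    set.seed(3)

    # Hypothetical joined log: page requested and memory reading at that time
    reqs <- data.frame(
      page   = sample(c("A", "B", "C", "D", "E"), 2000, replace = TRUE),
      mem_mb = rgamma(2000, shape = 4, scale = 50)
    )

    # Assign each page a unique number
    reqs$page_id <- as.integer(factor(reqs$page))

    # Page number vs memory, with a little noise on the page numbers so that
    # solid vertical lines of markers broaden into bands
    plot(jitter(reqs$page_id, amount = 0.2), reqs$mem_mb,
         pch = 16, col = rgb(0, 0, 0, 0.15),
         xaxt = "n", xlab = "Page", ylab = "Memory in use (MB)")
    axis(1, at = sort(unique(reqs$page_id)), labels = levels(factor(reqs$page)))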


Here is another idea: if you are able to join the pageview and memory-use data by their timestamp values, you could form a table like this:

Page A | Page B | Page C | Page D | Page E |....| Memory_use

The value in each of the page columns might be a binary indicator (0 or 1) showing whether the page was requested, or a count of requests, depending on your data. In the Memory_use column you could have the relevant memory load as a proportion, or as a count in MB. In this way, Memory_use can be thought of as the dependent variable and the pages as explanatory variables, so you could fit an appropriate generalized linear model (its family depending on the form of the dependent variable) to this dataset; see the sketch after the list below. The results of this analysis will give you insight into the following:

  • Which pages significantly affect the value of memory use
  • The extent to which each page contributes to the load (by its coefficient in the model)
  • The possibility that other factors, not measured, play a significant role in memory load (overdispersion), with the worst case that all the predictor variables may turn out to be unimportant.
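A minimal sketch of such a model in R, assuming per-time-bucket request counts per page and memory measured in MB (hence a Gaussian family); the data frame, its column names, the simulated values, and the choice of family are all assumptions to adapt to your data:

    set.seed(4)

    # Hypothetical joined table: per time bucket, how many times each page was
    # requested and the memory in use at that time
    n <- 300
    dat <- data.frame(
      page_a = rpois(n, 2),
      page_b = rpois(n, 1),
      page_c = rpois(n, 3),
      page_d = rpois(n, 1),
      page_e = rpois(n, 2)
    )
    # Simulated response, purely for illustration: page_c drives memory here
    dat$memory_mb <- 200 + 10 * dat$page_a + 60 * dat$page_c + rnorm(n, sd = 25)

    # Gaussian GLM of memory on the page columns; swap the family to match the
    # form of your dependent variable (e.g. binomial for proportions)
    fit <- glm(memory_mb ~ page_a + page_b + page_c + page_d + page_e,
               data = dat, family = gaussian())

    summary(fit)   # significant positive coefficients flag the offending pages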
