Java: How to determine programmatically that a dataset doesn't follow a normal distribution?

Developer https://www.devze.com 2022-12-21 16:56 Source: web
In a Java program, how can I determine whether a dataset follows a normal distribution?

Is it possible?

Is there an API or an algorithm that I can use that determines this?


There are two questions here: how to determine whether a distribution is normal, and how to do so in Java. As the first link will show you, the available methods range from formal to informal, depending on how certain you want to be that you are looking at normal data. The second link shows that there are no standard Java packages for statistical analysis, but there are many other ways to implement these methods.


This is a somewhat difficult statistical question that seems deceptively simple if you're not an expert in statistics. Your goal apparently is to determine whether the data could plausibly have come from any normal distribution, not one with a pre-specified mean and variance. Probably the best way to do this is with the D'Agostino test (also known as D'Agostino's K-squared test), which measures the sample skewness and kurtosis and compares them to the values expected under normality.

As for Java implementations, I'm not aware of any, although I don't regularly use Java. I would be slightly surprised if there were one, as it's a relatively obscure statistical function and Java isn't the most common language for statistics. However, my D language implementation (search in this file for dAgostinoK()) could probably be trivially translated to Java if you already have functions for computing skewness, kurtosis and the CDF of the chi-square distribution.
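For what it's worth, here is a rough Java sketch of the K-squared test described above. It is not a translation of the D implementation mentioned in this answer; it follows the commonly published D'Agostino (skewness) and Anscombe-Glynn (kurtosis) transformations. The class and method names are made up for illustration, and the minimum sample size of 20 is a conservative assumption on my part. Because the statistic is compared to a chi-square distribution with 2 degrees of freedom, the p-value reduces to exp(-K²/2), so no separate chi-square CDF routine is needed here.

```java
import java.util.Random;

/** Sketch of the D'Agostino K-squared normality test (hypothetical class name). */
final class DAgostinoK2 {

    /** Returns {skewness g1, kurtosis b2} from biased (divide-by-n) central moments. */
    static double[] skewnessAndKurtosis(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double m2 = 0.0, m3 = 0.0, m4 = 0.0;
        for (double v : x) {
            double d = v - mean;
            m2 += d * d;
            m3 += d * d * d;
            m4 += d * d * d * d;
        }
        m2 /= n; m3 /= n; m4 /= n;
        return new double[] { m3 / Math.pow(m2, 1.5), m4 / (m2 * m2) };
    }

    /** Approximately standard-normal score for the skewness (D'Agostino transformation). */
    static double skewnessZ(double g1, double n) {
        double y = g1 * Math.sqrt((n + 1) * (n + 3) / (6.0 * (n - 2)));
        double beta2 = 3.0 * (n * n + 27 * n - 70) * (n + 1) * (n + 3)
                / ((n - 2) * (n + 5) * (n + 7) * (n + 9));
        double w2 = -1.0 + Math.sqrt(2.0 * (beta2 - 1.0));
        double delta = 1.0 / Math.sqrt(0.5 * Math.log(w2));
        double alpha = Math.sqrt(2.0 / (w2 - 1.0));
        double t = y / alpha;
        return delta * Math.log(t + Math.sqrt(t * t + 1.0));
    }

    /** Approximately standard-normal score for the kurtosis (Anscombe-Glynn transformation). */
    static double kurtosisZ(double b2, double n) {
        double expected = 3.0 * (n - 1) / (n + 1);
        double variance = 24.0 * n * (n - 2) * (n - 3)
                / ((n + 1) * (n + 1) * (n + 3) * (n + 5));
        double x = (b2 - expected) / Math.sqrt(variance);
        double sqrtBeta1 = 6.0 * (n * n - 5 * n + 2) / ((n + 7) * (n + 9))
                * Math.sqrt(6.0 * (n + 3) * (n + 5) / (n * (n - 2) * (n - 3)));
        double a = 6.0 + 8.0 / sqrtBeta1
                * (2.0 / sqrtBeta1 + Math.sqrt(1.0 + 4.0 / (sqrtBeta1 * sqrtBeta1)));
        double term1 = 1.0 - 2.0 / (9.0 * a);
        double denom = 1.0 + x * Math.sqrt(2.0 / (a - 4.0));
        double term2 = Math.cbrt((1.0 - 2.0 / a) / denom); // cbrt keeps the sign of its argument
        return (term1 - term2) / Math.sqrt(2.0 / (9.0 * a));
    }

    /** K^2 = Zskew^2 + Zkurt^2, approximately chi-square with 2 degrees of freedom. */
    static double kSquared(double[] x) {
        if (x.length < 20) { // assumption: the approximations are poor for small samples
            throw new IllegalArgumentException("need at least 20 observations");
        }
        double[] gk = skewnessAndKurtosis(x);
        double zs = skewnessZ(gk[0], x.length);
        double zk = kurtosisZ(gk[1], x.length);
        return zs * zs + zk * zk;
    }

    /** Upper-tail probability of chi-square(2), which has the closed form exp(-k2 / 2). */
    static double pValue(double[] x) {
        return Math.exp(-kSquared(x) / 2.0);
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] sample = new double[500];
        for (int i = 0; i < sample.length; i++) sample[i] = rng.nextGaussian();
        System.out.println("p-value for a Gaussian sample: " + pValue(sample));
    }
}
```

A small p-value (say, below 0.05) is evidence against normality. A large p-value only means the test found no evidence of non-normality, not that the data are proven normal.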


I'm not sure whether an API is available for this, but you can use Pearson's chi-square test: http://en.wikipedia.org/wiki/Pearson%27s_chi-square_test. Assuming your dataset is large enough, you can use it to test the fit to a normal distribution.
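A minimal sketch of that idea in plain Java, under a few assumptions of mine: the data are standardized with the sample mean and standard deviation, binned at fixed z-score cut points (an arbitrary choice of 8 bins), and the observed counts are compared with the counts a normal distribution would give. The class name, the cut points, and the use of the Abramowitz-Stegun polynomial approximation for the normal CDF are all illustrative choices, not part of any standard API. The degrees of freedom are taken as bins − 3 (two estimated parameters), which is the usual textbook approximation.

```java
/** Sketch of a Pearson chi-square goodness-of-fit check against a fitted normal (hypothetical class). */
final class ChiSquareNormality {

    // z-score cut points defining 8 bins; an arbitrary illustrative choice
    private static final double[] CUTS = {-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5};

    // Upper 5% critical value of chi-square with 8 - 3 = 5 degrees of freedom
    private static final double CRITICAL_5_PERCENT_DF5 = 11.070;

    /** Standard normal CDF, Abramowitz-Stegun formula 26.2.17 (absolute error below 1e-7). */
    static double phi(double z) {
        if (z < 0) return 1.0 - phi(-z);
        double t = 1.0 / (1.0 + 0.2316419 * z);
        double poly = t * (0.319381530 + t * (-0.356563782 + t * (1.781477937
                + t * (-1.821255978 + t * 1.330274429))));
        return 1.0 - Math.exp(-0.5 * z * z) / Math.sqrt(2.0 * Math.PI) * poly;
    }

    /** Pearson statistic: sum over bins of (observed - expected)^2 / expected. */
    static double statistic(double[] x) {
        int n = x.length;
        double mean = 0.0;
        for (double v : x) mean += v;
        mean /= n;
        double ss = 0.0;
        for (double v : x) ss += (v - mean) * (v - mean);
        double sd = Math.sqrt(ss / (n - 1));

        // Count how many standardized observations fall into each bin
        int[] observed = new int[CUTS.length + 1];
        for (double v : x) {
            double z = (v - mean) / sd;
            int bin = 0;
            while (bin < CUTS.length && z > CUTS[bin]) bin++;
            observed[bin]++;
        }

        // Compare with the counts expected under a normal distribution
        double chi2 = 0.0;
        double lower = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double upper = (i < CUTS.length) ? phi(CUTS[i]) : 1.0;
            double expected = n * (upper - lower);
            double diff = observed[i] - expected;
            chi2 += diff * diff / expected;
            lower = upper;
        }
        return chi2;
    }

    /** True if normality is rejected at the 5% level under the df = 5 approximation. */
    static boolean rejectsNormality(double[] x) {
        return statistic(x) > CRITICAL_5_PERCENT_DF5;
    }
}
```

With these cut points the smallest expected bin holds about 6.7% of the data, so the usual rule of thumb of at least 5 expected observations per bin needs roughly 80 or more data points. The result also depends on how the bins are chosen, which is one reason this test is generally less powerful than tests designed specifically for normality.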


Easiest way is "If I have n > 30 data points, then it approximates a normal distribution via the central limit theorem." ;)

As others mention, determining if the data points came from a normal distribution is significantly more difficult.
