Implementation of the <a href="https://en.wikipedia.org/wiki/Kolmogorov-Smirnov_test">Kolmogorov-Smirnov (K-S) test</a> for equality of continuous distributions.
The K-S test uses a statistic based on the maximum deviation of the empirical distribution of sample data points from the distribution expected under the null hypothesis. For one-sample tests evaluating the null hypothesis that a set of sample data points follow a given distribution, the test statistic is \(D_n=\sup_x |F_n(x)-F(x)|\), where \(F\) is the expected distribution and \(F_n\) is the empirical distribution of the \(n\) sample data points. The distribution of \(D_n\) is estimated using a method based on [1] with certain quick decisions for extreme values given in [2].
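For illustration only, the one-sample statistic can be computed directly from its definition: after sorting the sample, the empirical CDF jumps by \(1/n\) at each data point, so the supremum is attained just before or just after one of those jumps. The following sketch is not this class's implementation; the class and method names are hypothetical:

```java
import java.util.Arrays;
import java.util.function.DoubleUnaryOperator;

public class KsStatistic {
    /** D_n = sup_x |F_n(x) - F(x)| for a sample against a hypothesized CDF. */
    static double dStatistic(double[] sample, DoubleUnaryOperator cdf) {
        double[] x = sample.clone();
        Arrays.sort(x);
        int n = x.length;
        double d = 0.0;
        for (int i = 0; i < n; i++) {
            double f = cdf.applyAsDouble(x[i]);
            // F_n jumps from i/n to (i+1)/n at x[i]; check both sides of the jump
            d = Math.max(d, Math.max((i + 1.0) / n - f, f - (double) i / n));
        }
        return d;
    }

    public static void main(String[] args) {
        double[] sample = {0.1, 0.2, 0.4, 0.8};
        // Null hypothesis: uniform on [0, 1], so F(x) = x
        System.out.println(dStatistic(sample, x -> x)); // a value near 0.35
    }
}
```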
Two-sample tests are also supported, evaluating the null hypothesis that the two samples {@code x} and {@code y} come from the same underlying distribution. In this case, the test statistic is \(D_{n,m}=\sup_t | F_n(t)-F_m(t)|\) where \(n\) is the length of {@code x}, \(m\) is the length of {@code y}, \(F_n\) is the empirical distribution that puts mass \(1/n\) at each of the values in {@code x} and \(F_m\) is the empirical distribution of the {@code y} values. The default 2-sample test method, {@link #kolmogorovSmirnovTest(double[],double[])}, works as follows:
- For very small samples (where the product of the sample sizes is less than {@value #SMALL_SAMPLE_PRODUCT}), the exact distribution is used to compute the p-value for the 2-sample test.
- For mid-size samples (product of sample sizes greater than or equal to {@value #SMALL_SAMPLE_PRODUCT} but less than {@value #LARGE_SAMPLE_PRODUCT}), Monte Carlo simulation is used to compute the p-value. The simulation randomly generates partitions of \(m + n\) into an \(m\)-set and an \(n\)-set and reports the proportion that give \(D\) values exceeding the observed value.
- When the product of the sample sizes exceeds {@value #LARGE_SAMPLE_PRODUCT}, the asymptotic distribution of \(D_{n,m}\) is used. See {@link #approximateP(double,int,int)} for details on the approximation.
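The two-sample statistic and the Monte Carlo step above can be sketched in plain Java. This is an illustrative helper, not this class's implementation (class and method names are hypothetical): \(D_{n,m}\) is computed by a merge over the two sorted samples, and the Monte Carlo p-value is the proportion of random repartitions of the pooled sample whose statistic is at least the observed value.

```java
import java.util.Arrays;
import java.util.Random;

public class KsTwoSample {
    /** Two-sample statistic D_{n,m} = sup_t |F_n(t) - F_m(t)|. */
    static double dStatistic(double[] x, double[] y) {
        double[] sx = x.clone(), sy = y.clone();
        Arrays.sort(sx);
        Arrays.sort(sy);
        int n = sx.length, m = sy.length, i = 0, j = 0;
        double d = 0.0;
        while (i < n && j < m) {
            double t = Math.min(sx[i], sy[j]);
            // Advance both empirical CDFs past all jumps at t, then compare
            while (i < n && sx[i] <= t) i++;
            while (j < m && sy[j] <= t) j++;
            d = Math.max(d, Math.abs((double) i / n - (double) j / m));
        }
        return d;
    }

    /**
     * Monte Carlo p-value: proportion of random partitions of the pooled
     * sample into an n-set and an m-set whose D statistic is at least the
     * observed value.
     */
    static double monteCarloP(double[] x, double[] y, int iterations, long seed) {
        double observed = dStatistic(x, y);
        double[] pooled = new double[x.length + y.length];
        System.arraycopy(x, 0, pooled, 0, x.length);
        System.arraycopy(y, 0, pooled, x.length, y.length);
        Random rng = new Random(seed);
        int exceed = 0;
        for (int it = 0; it < iterations; it++) {
            // Fisher-Yates shuffle, then split into an n-set and an m-set
            for (int k = pooled.length - 1; k > 0; k--) {
                int r = rng.nextInt(k + 1);
                double tmp = pooled[k];
                pooled[k] = pooled[r];
                pooled[r] = tmp;
            }
            double[] px = Arrays.copyOfRange(pooled, 0, x.length);
            double[] py = Arrays.copyOfRange(pooled, x.length, pooled.length);
            if (dStatistic(px, py) >= observed) exceed++;
        }
        return (double) exceed / iterations;
    }
}
```

Because the comparison used here is {@code >=}, this sketch corresponds to the non-strict variant; a strict variant would use {@code >} instead.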
In the two-sample case, \(D_{n,m}\) has a discrete distribution. This makes the p-value associated with the null hypothesis \(H_0 : D_{n,m} \ge d \) differ from \(H_0 : D_{n,m} > d \) by the mass of the observed value \(d\). To distinguish these, the two-sample tests use a boolean {@code strict} parameter. This parameter is ignored for large samples.
The methods used by the 2-sample default implementation are also exposed directly:
- {@link #exactP(double,int,int,boolean)} computes exact 2-sample p-values
- {@link #monteCarloP(double,int,int,boolean,int)} computes 2-sample p-values by Monte Carlo simulation
- {@link #approximateP(double,int,int)} uses the asymptotic distribution

The {@code boolean} arguments in the first two methods allow the probability used to estimate the p-value to be expressed using strict or non-strict inequality. See {@link #kolmogorovSmirnovTest(double[],double[],boolean)}.
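The asymptotic case can be sketched using the standard Kolmogorov limiting distribution, \(P(D_{n,m} > d) \approx 2 \sum_{k \ge 1} (-1)^{k-1} e^{-2 k^2 t^2}\) with \(t = d \sqrt{nm/(n+m)}\). This is an assumption about the approximation used, and the method below is a hypothetical sketch, not this class's implementation:

```java
public class KsAsymptotic {
    /**
     * Asymptotic two-sample p-value via the Kolmogorov distribution:
     * P(D_{n,m} > d) ~ 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 t^2),
     * with t = d * sqrt(n*m / (n + m)).
     */
    static double approximateP(double d, int n, int m) {
        double t = d * Math.sqrt((double) n * m / (n + m));
        double sum = 0.0;
        for (int k = 1; k <= 100; k++) {
            double term = Math.exp(-2.0 * k * k * t * t);
            sum += (k % 2 == 1) ? term : -term;
            if (term < 1e-12) break; // series converges quickly for moderate t
        }
        return Math.min(1.0, Math.max(0.0, 2.0 * sum));
    }
}
```

The alternating series converges rapidly for moderate \(t\); very small \(t\) (tiny observed \(d\)) would need many more terms, but that regime is not where the asymptotic branch applies.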
References:
Note that [1] contains an error in computing h; refer to MATH-437 for details.
@since 3.3