Overview

The objective of the application is to identify the parametric distribution, from a given list of standard distributions, that best matches the underlying sample data. The sample data is assumed to be stationary.

Implementation

The application is designed as follows.
  1. Inputs. There are two Excel applications for analyzing sample distributions. The first application studies a single sample and provides a detailed analysis of that sample. The second application studies multiple samples but provides a higher-level analysis of the sample distributions.

    1. Single sample analysis. The input data is a data sample of size n. The sample distribution is compared to a selected standard distribution.

    2. Multiple samples analysis. The input data is a k×n matrix X that represents k estimated samples (each of size n). The analysis is performed separately for each individual sample.

  2. Calculator. The distribution estimation is performed by implementing the steps below (a code sketch of the matching loop follows the Output list).

    1. Single sample analysis

      1. Construct the distribution object based on the input data sample and estimate the distribution parameters.

      2. For each distribution from the list of standard distributions, estimate the parameters of the distribution based on the input data sample.

      3. For each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic and the related p-value, which are used to test the sample distribution. The testing procedure is described in detail below.

      4. Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the sample.

      5. Construct the histogram, pdf, and cdf for the data sample based on the matched standard distribution.

      6. For the matched distribution object (or a selected alternative), generate a data sample of the same size as the input data sample.

      7. Construct the histogram, pdf, and cdf for the generated sample and compare them to the histogram, pdf, and cdf objects constructed for the input data sample.

    2. Multiple samples analysis

      1. For each sample from the list of k samples and each distribution from the list of standard distributions, estimate the parameters of the distribution.

      2. For each sample and each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic and the related p-value, which are used to test the sample distribution. The test procedure and the chi-squared statistic estimation are described below.

      3. Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the related sample.

      4. Estimate the histogram for each sample and compare it to the matched distribution.

  3. Output. The output is represented by the following objects.

    1. List of k histogram objects constructed for each of k samples.

    2. Table of chi-squared statistics constructed for each sample and each standard distribution. The distribution with the smallest chi-squared statistic (equivalently, the largest p-value when the compared distributions have the same number of estimated parameters) is matched to the related sample.

    3. The list of distributions matched to each related sample based on the estimated chi-squared statistics.

    4. The table with the distribution parameters estimated for each sample and each standard distribution.
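
A minimal sketch of the matching loop in Java, using the Apache Commons Math 3 library whose KolmogorovSmirnovTest class is linked in the testing section below. The candidate list is truncated to three distributions for brevity; the parameters are set by the moment formulas from the "Distribution parameters" section, and the class and constructor signatures follow the 3.4 javadocs.

    import org.apache.commons.math3.distribution.*;
    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;
    import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

    public class DistributionMatcher {

        /** Fits each candidate distribution by the moment formulas below and
         *  returns the candidate with the largest Kolmogorov-Smirnov p-value. */
        public static RealDistribution match(double[] sample) {
            DescriptiveStatistics stats = new DescriptiveStatistics(sample);
            double mean = stats.getMean();
            double sd   = stats.getStandardDeviation();
            double var  = stats.getVariance();

            // Candidate list truncated for brevity; the full application covers
            // the whole "List of standard distributions" section.
            RealDistribution[] candidates = {
                new NormalDistribution(mean, sd),                    // mu = E[x], sigma = stdev[x]
                new ExponentialDistribution(mean),                   // Commons Math takes the mean 1/lambda
                new GammaDistribution(mean * mean / var, var / mean) // shape alpha = E^2[x]/var[x], scale = 1/beta
            };

            KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
            RealDistribution best = null;
            double bestP = -1.0;
            for (RealDistribution d : candidates) {
                double p = ks.kolmogorovSmirnovTest(d, sample); // p-value of the KS test
                if (p > bestP) { bestP = p; best = d; }
            }
            return best;
        }
    }

Step 6 of the single-sample analysis (generating a comparison sample of the same size) then amounts to best.sample(sample.length); the multiple-samples analysis simply applies match to each of the k rows of X.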

List of standard distributions
  1. Normal distribution N(μ, σ) where μ is the mean and σ is the standard deviation of the distribution; (https://en.wikipedia.org/wiki/Normal_distribution)

  2. Log-Normal distribution LogN(μ, σ) where μ is the mean and σ is the standard deviation of the natural logarithm of the distribution. The log-normal distribution has support {x>0} so that the logarithm of the values is distributed normally with parameters μ and σ. (https://en.wikipedia.org/wiki/Log-normal_distribution)

  3. Beta distribution: p(x) = x^(α-1) (1-x)^(β-1) / B(α, β), where B(α, β) = Γ(α) Γ(β) / Γ(α+β). The support of the distribution is x ∈ [0, 1]. (https://en.wikipedia.org/wiki/Beta_distribution)

  4. Cauchy distribution: p(x) = 1 / [π γ (1 + ((x-x₀)/γ)^2)] (https://en.wikipedia.org/wiki/Cauchy_distribution)

  5. Chi-squared distribution: p(x) = x^(k/2-1) e^(-x/2) / [2^(k/2) Γ(k/2)] (https://en.wikipedia.org/wiki/Chi-squared_distribution)

  6. Exponential distribution: p(x) = λ e^(-λx), x ≥ 0 (https://en.wikipedia.org/wiki/Exponential_distribution)

  7. F (Fisher-Snedecor) distribution: p(x) = sqrt[(nx)^n m^m / (nx+m)^(n+m)] / [x B(n/2, m/2)] (https://en.wikipedia.org/wiki/F-distribution)

  8. Gamma distribution: p(x) = β^α x^(α-1) e^(-βx) / Γ(α) (https://en.wikipedia.org/wiki/Gamma_distribution)

  9. Geometric distribution: p(k) = p (1-p)^k, k = 0, 1, 2, … (https://en.wikipedia.org/wiki/Geometric_distribution)

  10. Laplace distribution: p(x) = e^(-|x-μ|/b) / (2b) (https://en.wikipedia.org/wiki/Laplace_distribution)

  11. Logistic distribution: p(x) = q(x) / [s (1+q(x))^2], where q(x) = e^(-(x-μ)/s) (https://en.wikipedia.org/wiki/Logistic_distribution)

  12. Poisson distribution: p(k) = λ^k e^(-λ) / k! (https://en.wikipedia.org/wiki/Poisson_distribution)

  13. T (Student) distribution: p(x) = (1+x^2/ν)^(-(ν+1)/2) Γ((ν+1)/2) / [Γ(ν/2) sqrt(νπ)] (https://en.wikipedia.org/wiki/Student%27s_t-distribution)

  14. Triangular distribution: p(x) = 2(x-a) / [(b-a)(c-a)] for a ≤ x < c and p(x) = 2(b-x) / [(b-a)(b-c)] for c ≤ x ≤ b; (https://en.wikipedia.org/wiki/Triangular_distribution)

  15. Uniform distribution: p(x) = 1 / (b-a) for a ≤ x ≤ b (https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))

  16. Weibull distribution: p(x) = (k/λ) (x/λ)^(k-1) exp(-(x/λ)^k), x ≥ 0 (https://en.wikipedia.org/wiki/Weibull_distribution)
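
Most of the distributions above are available directly in Apache Commons Math 3 (the library referenced in the testing section below). A sketch of constructing a few of them, with parameter order per the 3.4 javadocs and illustrative parameter values:

    import org.apache.commons.math3.distribution.*;

    public class StandardDistributions {
        public static void main(String[] args) {
            RealDistribution normal  = new NormalDistribution(0.0, 1.0);          // item 1: N(mu, sigma)
            RealDistribution logNorm = new LogNormalDistribution(0.0, 1.0);       // item 2: mu, sigma of ln x
            RealDistribution beta    = new BetaDistribution(2.0, 3.0);            // item 3: alpha, beta
            RealDistribution gamma   = new GammaDistribution(2.0, 0.5);           // item 8: shape alpha, scale 1/beta
            RealDistribution tri     = new TriangularDistribution(0.0, 0.3, 1.0); // item 14: a, c (mode), b
            RealDistribution weibull = new WeibullDistribution(1.5, 2.0);         // item 16: shape k, scale lambda
            IntegerDistribution pois = new PoissonDistribution(4.0);              // item 12: lambda

            System.out.println(normal.density(0.0));                // pdf at a point
            System.out.println(weibull.cumulativeProbability(2.0)); // cdf at a point
        }
    }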

Distribution parameters
  1. Normal: μ = E[x], σ = stdev[x];

  2. Log-Normal: μ = E[ln x], σ = stdev[ln x];

  3. Beta: α / (α + β) = E[x] ≡ μ; (α-1)/(α+β-2) = Mode     →     α = (1 - 2 Mode) / (1 - Mode/μ); β = α (1/μ - 1)

  4. Cauchy: x₀ = Median (= Mode); x₀ + γ tan[π(F - 0.5)] = Quantile(F);

  5. Chi-square: k = E[x];

  6. Exponential: λ = 1/E[x];

  7. F: df2/(df2-2) = E[x]; [(df1-2)/df1] · [df2/(df2+2)] = Mode[x]     →     df2 = 2 E[x] / (E[x]-1); df1 = 2 / (1 - Mode[x] (df2+2)/df2), where df1 and df2 are the numerator and denominator degrees of freedom;

  8. Gamma: α / β = E[x]; α / β^2 = var[x]     →     β = E[x] / var[x]; α = E^2[x] / var[x]

  9. Geometric: p = 1 / (1 + E[x]) (for the support k = 0, 1, 2, … used in the pdf above);

  10. Laplace: μ = E[x]; 2b^2 = var[x]     →     b = stdev[x] / √2

  11. Logistic: μ = E[x]; s^2 π^2 / 3 = var[x]     →     s = √3 stdev[x] / π

  12. Poisson: λ = E[x];

  13. T (Student): ν / (ν-2) = var[x]     →     ν = 2 var[x] / (var[x]-1);

  14. Triangular: a = min[x]; b = max[x]; (a+b+c)/3 = E[x]     →     c = 3 E[x] - a - b;

  15. Uniform: a = min[x]; b = max[x];

  16. Weibull: λ Γ(1+1/k) = E[x]; λ (ln 2)^(1/k) = Median, where λ is the scale parameter and k is the shape parameter.
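
As an illustration, closed-form estimates such as items 10 and 11 above translate directly into code; a sketch using Commons Math (the same pattern applies to the other closed-form cases in this list):

    import org.apache.commons.math3.distribution.LaplaceDistribution;
    import org.apache.commons.math3.distribution.LogisticDistribution;
    import org.apache.commons.math3.stat.descriptive.DescriptiveStatistics;

    public class MomentFits {
        public static LaplaceDistribution fitLaplace(double[] x) {
            DescriptiveStatistics s = new DescriptiveStatistics(x);
            return new LaplaceDistribution(s.getMean(),
                    s.getStandardDeviation() / Math.sqrt(2.0));           // b = stdev[x] / sqrt(2)
        }

        public static LogisticDistribution fitLogistic(double[] x) {
            DescriptiveStatistics s = new DescriptiveStatistics(x);
            return new LogisticDistribution(s.getMean(),
                    Math.sqrt(3.0) * s.getStandardDeviation() / Math.PI); // s = sqrt(3) stdev[x] / pi
        }
    }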


Application of different distributions

Testing for the distribution

  1. Kolmogorov-Smirnov (KS) test. (http://commons.apache.org/proper/commons-math/javadocs/api-3.4/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html)
    The test is based on the distance between the empirical and tested cdf functions:
    KS = max_x |F_n(x) - F(x)|

    where F_n is the empirical cdf function and F(x) is the tested cdf function. In the KS test, the general-form cdf is transformed into the uniform cdf: if x_n is a sample with underlying cdf F(x), then y_n = F(x_n) is uniformly distributed on [0, 1]. The KS statistic is then constructed as
    KS = max_y |U_n(y) - y|

    where U_n is the empirical cdf function constructed from the y_n sample.
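
    A sketch of this step using the KolmogorovSmirnovTest class linked above (the synthetic sample and the fitted parameters are illustrative placeholders). Note that when the parameters of F are estimated from the same sample, the standard KS p-value is only approximate.

        import org.apache.commons.math3.distribution.NormalDistribution;
        import org.apache.commons.math3.stat.inference.KolmogorovSmirnovTest;

        public class KsExample {
            public static void main(String[] args) {
                double[] sample = new NormalDistribution(0.0, 1.0).sample(200); // synthetic data for illustration
                NormalDistribution fitted = new NormalDistribution(0.0, 1.0);   // would come from the estimation step

                KolmogorovSmirnovTest ks = new KolmogorovSmirnovTest();
                double d = ks.kolmogorovSmirnovStatistic(fitted, sample); // max_x |F_n(x) - F(x)|
                double p = ks.kolmogorovSmirnovTest(fitted, sample);      // p-value used for matching
                System.out.println("D = " + d + ", p = " + p);
            }
        }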

  2. Pearson Chi-squared test. (https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
    Suppose that the number of unknown distribution parameters is s and the number of intervals used to construct the chi-squared statistic is m (m > p = s + 1). The chi-squared statistic is calculated as
    χ^2 = ∑_i (O_i - E_i)^2 / E_i = N ∑_i (O_i/N - p_i)^2 / p_i

    where O_i is the number of sample observations in interval i, E_i = N p_i is the expected number of observations under the tested cdf, N is the sample size, and p_i is the probability of interval i calculated using the tested cdf. Under the null hypothesis that the tested cdf is true, the calculated statistic is asymptotically distributed as χ^2(m-p).

    The asymptotic χ^2(m-p) distribution does not depend on how the distribution support is partitioned into intervals (provided each interval has positive probability p_i > 0). A natural choice is to partition the support into intervals of equal probability p_i = 1/m, which is what the sketch below implements.
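
    A sketch of the equiprobable-interval version using the Commons Math ChiSquareTest class: the transformation y = F(x) is uniform under the null, so bin indices follow from F(x), and every one of the m bins has expected count N/m.

        import java.util.Arrays;
        import org.apache.commons.math3.distribution.RealDistribution;
        import org.apache.commons.math3.stat.inference.ChiSquareTest;

        public class ChiSquaredFit {
            /** Goodness-of-fit p-value with m equiprobable intervals (p_i = 1/m). */
            public static double pValue(double[] sample, RealDistribution dist, int m) {
                long[] observed = new long[m];
                for (double x : sample) {
                    int bin = (int) (dist.cumulativeProbability(x) * m); // y = F(x) in [0, 1]
                    if (bin == m) bin = m - 1;                           // guard for F(x) == 1
                    observed[bin]++;
                }
                double[] expected = new double[m];
                Arrays.fill(expected, (double) sample.length / m);       // E_i = N p_i = N/m
                // Caveat: ChiSquareTest uses m - 1 degrees of freedom and does not
                // subtract the s estimated parameters, so the p-value is approximate.
                return new ChiSquareTest().chiSquareTest(expected, observed);
            }
        }
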
Histogram estimation