Author: Konstantin Rybakov
Publication date: First publication March 2018; last update March 2018
Notes: pdf: distribution_analysis.pdf; docx: distribution_analysis.docx
Excel tools: distrib-single.xlsm; pdf screen: distrib-single.pdf
Objective: The objective of the tool is to estimate the empirical distribution of a simple sample and test the empirical distribution against a collection of standard parametric distributions.
Overview
The objective of the application is to identify the parametric distribution (from a given list of standard distributions) that produces the best match for the underlying sample data. The sample data is assumed to be stationary.
Implementation
The application is designed as follows.
- Inputs. There are two Excel applications for analyzing sample distributions. The first application studies a single sample and provides a detailed analysis of that sample. The second application studies multiple samples but provides a more high-level analysis of the samples' distributions.
- Single sample analysis. The input data is a data sample of size n. The sample distribution is compared to a selected standard distribution.
- Multiple samples analysis. The input data is a k×n matrix X that represents k estimated samples (each of size n). The analysis is performed separately for each individual sample.
- Calculator. The distribution estimation is performed by implementing the steps below.
- Single sample analysis
- Construct the distribution object based on the input data sample and estimate the distribution parameters;
- For each distribution from the list of standard distributions, estimate the parameters of the distribution based on the input data sample.
- For each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic (and related p-value), which is used to test the sample distribution. The testing procedure details are described below.
- Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the related sample.
- Construct the histogram, pdf, and cdf for the data sample based on the matched standard distribution.
- For the matched (or a selected alternative) distribution object, generate a data sample of the same size as the input data sample.
- Construct the histogram, pdf, and cdf for the generated sample and compare them to the histogram, pdf, and cdf objects constructed for the input data sample.
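The single-sample loop above can be sketched in Python, with scipy.stats standing in for the Excel calculator; the candidate list and the use of the KS p-value for ranking are illustrative assumptions, not the tool's exact code.

```python
# Sketch of the single-sample analysis: fit each candidate distribution,
# test it against the sample, and match the one with the largest p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=500)  # input data sample

# Illustrative subset of the standard distribution list
candidates = ["norm", "lognorm", "gamma", "expon", "laplace"]

results = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(sample)                      # estimate parameters
    ks_stat, p_value = stats.kstest(sample, name, args=params)
    results[name] = (params, p_value)

# Match the distribution with the largest KS p-value to the sample
best = max(results, key=lambda n: results[n][1])
print(best, results[best][1])
```

A histogram of the sample and of a same-size sample generated from `getattr(stats, best)(*results[best][0]).rvs(len(sample))` can then be compared visually, as the steps above describe.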
- Multiple samples analysis
- For each sample from the list of k samples and each distribution from the list of standard distributions, estimate the parameters of the distribution.
- For each sample and each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic (and related p-value), which is used to test the sample distribution. The test details and chi-squared statistic estimation details are described below.
- Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the related sample.
- Estimate the histogram for each sample and compare it to the matched distribution.
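The multiple-samples variant is the same fit-and-test loop applied row by row to the k×n matrix; a minimal sketch, assuming NumPy input and an illustrative candidate list:

```python
# Sketch of the multiple-samples analysis: each row of X is one sample;
# the best-matching distribution is identified per row.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n = 3, 400
X = np.vstack([
    rng.normal(0.0, 1.0, n),       # sample 0: normal
    rng.exponential(2.0, n),       # sample 1: exponential
    rng.uniform(-1.0, 1.0, n),     # sample 2: uniform
])

candidates = ["norm", "expon", "uniform"]  # illustrative list
matched = []
for row in X:
    pvals = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(row)
        pvals[name] = stats.kstest(row, name, args=params)[1]
    matched.append(max(pvals, key=pvals.get))

print(matched)
```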
- Output. The output is represented by the following objects.
- List of k histogram objects, one constructed for each of the k samples.
- Table of chi-squared statistics constructed for each sample and each standard distribution. The distribution with the smallest chi-squared statistic is matched to the related sample.
- The list of distributions matched to each sample based on the estimated chi-squared statistics.
- The table with the distribution parameters estimated for each sample and each standard distribution.
List of standard distributions
- Normal distribution N(μ, σ) where μ is the mean and σ is the standard deviation of the distribution; (https://en.wikipedia.org/wiki/Normal_distribution)
- Log-Normal distribution LogN(μ, σ) where μ is the mean and σ is the standard deviation of the natural logarithm of the distribution. The log-normal distribution has support {x>0} so that the logarithm of the values is distributed normally with parameters μ and σ. (https://en.wikipedia.org/wiki/Log-normal_distribution)
- Beta distribution: p(x) = x^(α-1) (1-x)^(β-1) / B(α, β) where B(α, β) = Γ(α) Γ(β) / Γ(α+β). The support of the distribution is x ∈ [0, 1]. (https://en.wikipedia.org/wiki/Beta_distribution)
- Cauchy distribution: p(x) = 1 / [π γ (1 + ((x-x0)/γ)^2)] (https://en.wikipedia.org/wiki/Cauchy_distribution)
- Chi-squared distribution: p(x) = x^(k/2-1) e^(-x/2) / [2^(k/2) Γ(k/2)] (https://en.wikipedia.org/wiki/Chi-squared_distribution)
- Exponential distribution: p(x) = λ e^(-λx), x ≥ 0 (https://en.wikipedia.org/wiki/Exponential_distribution)
- F (Fisher-Snedecor) distribution: p(x) = sqrt[(nx)^n m^m / (nx+m)^(n+m)] / [x B(n/2, m/2)] (https://en.wikipedia.org/wiki/F-distribution)
- Gamma distribution: p(x) = β^α x^(α-1) e^(-βx) / Γ(α) (https://en.wikipedia.org/wiki/Gamma_distribution)
- Geometric distribution: p(k) = p (1-p)^k, k = 0, 1, 2, … (https://en.wikipedia.org/wiki/Geometric_distribution)
- Laplace distribution: p(x) = e^(-|x-μ|/b) / (2b) (https://en.wikipedia.org/wiki/Laplace_distribution)
- Logistic distribution: p(x) = q(x) / [s (1+q(x))^2], where q(x) = e^(-(x-μ)/s) (https://en.wikipedia.org/wiki/Logistic_distribution)
- Poisson distribution: p(k) = λ^k e^(-λ) / k! (https://en.wikipedia.org/wiki/Poisson_distribution)
- T (Student) distribution: p(x) = (1 + x^2/ν)^(-(ν+1)/2) Γ((ν+1)/2) / [Γ(ν/2) sqrt(νπ)] (https://en.wikipedia.org/wiki/Student%27s_t-distribution)
- Triangular distribution: p(x) = 2(x-a)/[(b-a)(c-a)] for a ≤ x < c and p(x) = 2(b-x)/[(b-a)(b-c)] for c ≤ x ≤ b (https://en.wikipedia.org/wiki/Triangular_distribution)
- Uniform distribution: p(x) = 1 / (b-a) for a ≤ x ≤ b (https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))
- Weibull distribution: p(x) = (k/λ) (x/λ)^(k-1) e^(-(x/λ)^k), x ≥ 0 (https://en.wikipedia.org/wiki/Weibull_distribution)
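The densities above can be checked numerically against a reference implementation; a quick sketch for two of them using scipy.stats (note that scipy's gamma parameterization uses scale = 1/β):

```python
# Numerical sanity check of the listed Weibull and gamma densities
# against scipy.stats implementations.
import math
from scipy import stats

# Weibull: p(x) = (k/lam) (x/lam)^(k-1) exp(-(x/lam)^k)
k, lam, x = 1.7, 2.0, 1.3
p_weibull = (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))
assert abs(p_weibull - stats.weibull_min.pdf(x, k, scale=lam)) < 1e-9

# Gamma: p(x) = beta^alpha x^(alpha-1) e^(-beta x) / Gamma(alpha)
a, b, x = 2.5, 1.5, 0.8
p_manual = b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)
p_scipy = stats.gamma.pdf(x, a, scale=1 / b)  # scipy uses scale = 1/beta
assert abs(p_manual - p_scipy) < 1e-9
```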
Distribution parameters
- Normal: μ = E[x], σ = stdev[x];
- Log-Normal: μ = E[ln x], σ = stdev[ln x];
- Beta: α / (α + β) = E[x]; (α-1)/(α+β-2) = Mode → α = (1 - 2 Mode) / (1 - Mode/μ); β = α (1/μ - 1), where μ = E[x]
- Cauchy: x0 = Median (= Mode); x0 + γ tan[π(F-0.5)] = Quantile;
- Chi-square: k = E[x];
- Exponential: λ = 1/E[x];
- F: df2/(df2-2) = E[x]; [(df1-2)/df1] [df2/(df2+2)] = Mode[x]; → df2 = 2 E[x] / (E[x]-1); df1 = 2 / (1 - Mode[x] (df2+2)/df2), where df1 and df2 are the numerator and denominator degrees of freedom;
- Gamma: α / β = E[x]; α / β^2 = var[x]; → β = E[x] / var[x]; α = E[x]^2 / var[x]
- Geometric: p = 1 / E[x];
- Laplace: μ = E[x]; 2b^2 = var[x]; → b = stdev[x] / √2
- Logistic: μ = E[x]; s^2 π^2 / 3 = var[x]; → s = √3 stdev[x] / π
- Poisson: λ = E[x];
- T (Student): ν / (ν-2) = var[x]; → ν = 2 var[x] / (var[x]-1);
- Triangular: a = min[x]; b = max[x]; (a+b+c)/3 = E[x]; → c = 3 E[x] - a - b;
- Uniform: a = min[x]; b = max[x];
- Weibull: λ Γ(1+1/k) = E[x]; λ (ln 2)^(1/k) = Median[x], where λ is the scale parameter and k is the shape parameter.
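The closed-form moment-matching formulas above translate directly into code; a minimal sketch for a few of them (gamma, exponential, Laplace), using a simulated gamma sample with known parameters as a check:

```python
# Method-of-moments estimates from the formulas above; illustrative only.
import numpy as np

rng = np.random.default_rng(42)
x = rng.gamma(shape=3.0, scale=0.5, size=20_000)  # true alpha=3, beta=2

mean, var = x.mean(), x.var()

# Gamma: beta = E[x] / var[x]; alpha = E[x]^2 / var[x]
beta_hat = mean / var
alpha_hat = mean ** 2 / var

# Exponential: lambda = 1 / E[x]
lam_hat = 1.0 / mean

# Laplace: mu = E[x]; b = stdev[x] / sqrt(2)
mu_hat, b_hat = mean, np.sqrt(var / 2.0)

print(alpha_hat, beta_hat)
```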
Application of different distributions
Testing for the distribution
- Kolmogorov-Smirnov (KS) test.
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html)
The test is based on the distance between the empirical and tested cdf functions:
KS = max_x |F_n(x) - F(x)|
where F_n is the empirical cdf function and F(x) is the tested cdf function. In the KS test, the general-form cdf function is transformed into the uniform cdf function: if x_n is a sample with underlying cdf F(x), then y_n = F(x_n) has the uniform distribution. The KS statistic is then constructed as
KS = max_y |U_n(y) - y|
where U_n is the empirical cdf function constructed using the y_n sample.
- Pearson Chi-squared test.
(https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
Suppose that the number of unknown distribution parameters is s and the number of intervals used to construct the chi-squared statistic is m (m > p = s + 1). The chi-squared statistic is calculated as
χ^2 = ∑_i (O_i - E_i)^2 / E_i = N ∑_i (O_i/N - p_i)^2 / p_i
where O_i is the number of sample observations in interval i, E_i is the expected number of sample observations under the tested cdf function, N is the sample size, and p_i is the probability of interval i calculated using the tested cdf function. Under the null hypothesis that the tested cdf is true, the calculated statistic is asymptotically distributed as χ^2(m-p).
The asymptotic distribution of the chi-squared statistic does not depend on how the distribution support is partitioned into intervals (provided each interval has positive probability p_i > 0). A natural choice is to partition the support into intervals of equal probability.
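The equal-probability partition can be sketched as follows; this is an illustrative implementation in Python, assuming a fitted normal distribution with s = 2 estimated parameters and m = 10 intervals:

```python
# Chi-squared goodness-of-fit test with equal-probability intervals,
# following the formulas above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=1000)

mu, sigma = x.mean(), x.std(ddof=0)    # s = 2 fitted parameters
m = 10                                  # number of intervals

# Interior edges chosen so each interval has probability 1/m under N(mu, sigma)
interior = stats.norm.ppf(np.arange(1, m) / m, loc=mu, scale=sigma)
observed = np.bincount(np.searchsorted(interior, x), minlength=m)
expected = np.full(m, len(x) / m)       # E_i = N * p_i = N/m

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = m - (2 + 1)                       # m - p, with p = s + 1
p_value = stats.chi2.sf(chi2, dof)
print(chi2, p_value)
```

Because the sample really is normal and the partition has equal probabilities, the resulting p-value should typically be well above conventional rejection thresholds.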