Author: Konstantin Rybakov
Publication date: First publication March 2018; last update March 2018
Notes: pdf: distribution_analysis.pdf; docx: distribution_analysis.docx
Excel tools: distrib-single.xlsm; pdf screen: distrib-single.pdf
Objective: The objective of the tool is to estimate the empirical distribution of a simple sample and test the empirical distribution against a collection of standard parametric distributions.
Overview
The objective of the application is to identify the parametric distribution (from a given list of standard distributions) that produces the best match for the underlying sample data. The sample data is assumed to be stationary.
Implementation
The application is designed as follows.
- Inputs. There are two Excel applications for analyzing sample distributions. The first application studies a single sample and provides a detailed analysis of that sample. The second application studies multiple samples but provides a more high-level analysis of the samples' distributions.
- Single sample analysis. The input data is a data sample of size n. The sample distribution is compared to a selected standard distribution.
- Multiple samples analysis. The input data is a k×n matrix X that represents k estimated samples (each of size n). The analysis is performed separately for each individual sample.
- Calculator. The distribution estimation is performed by implementing the steps below.
- Single sample analysis
- Construct the distribution object based on the input data sample and estimate the distribution parameters;
- For each distribution from the list of standard distributions, estimate the parameters of the distribution based on the input data sample.
- For each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic (and related p-value), which is used to test the sample distribution. The testing procedure details are described below.
- Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the related sample.
- Construct the histogram, pdf, and cdf for the data sample based on the matched standard distribution.
- For the matched (or a selected alternative) distribution object, generate a data sample of the same size as the input data sample.
- Construct the histogram, pdf, and cdf for the generated sample and compare them to the histogram, pdf, and cdf objects constructed for the input data sample.
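The single-sample loop above can be sketched in Python, with scipy.stats standing in for the Excel calculator; the candidate list and the use of the KS p-value for ranking are illustrative assumptions, not the tool's exact code.

```python
# Sketch of the single-sample analysis: fit each candidate distribution,
# test it against the sample, and match the one with the largest p-value.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=500)  # input data sample

# Illustrative subset of the standard distribution list
candidates = ["norm", "lognorm", "gamma", "expon", "laplace"]

results = {}
for name in candidates:
    dist = getattr(stats, name)
    params = dist.fit(sample)                      # estimate parameters
    ks_stat, p_value = stats.kstest(sample, name, args=params)
    results[name] = (params, p_value)

# Match the distribution with the largest KS p-value to the sample
best = max(results, key=lambda n: results[n][1])
print(best, results[best][1])
```

A histogram of the sample and of a same-size sample generated from `getattr(stats, best)(*results[best][0]).rvs(len(sample))` can then be compared visually, as the steps above describe.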
- Multiple samples analysis
- For each sample from the list of k samples and each distribution from the list of standard distributions, estimate the parameters of the distribution.
- For each sample and each estimated parametric distribution, construct the chi-squared (or Kolmogorov-Smirnov) statistic (and related p-value), which is used to test the sample distribution. The test details and chi-squared statistic estimation details are described below.
- Identify the distribution with the largest chi-squared (or Kolmogorov-Smirnov) p-value and match it to the related sample.
- Estimate the histogram for each sample and compare it to the matched distribution.
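The multiple-samples variant is the same fit-and-test loop applied row by row to the k×n matrix; a minimal sketch, assuming NumPy input and an illustrative candidate list:

```python
# Sketch of the multiple-samples analysis: each row of X is one sample;
# the best-matching distribution is identified per row.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n = 3, 400
X = np.vstack([
    rng.normal(0.0, 1.0, n),       # sample 0: normal
    rng.exponential(2.0, n),       # sample 1: exponential
    rng.uniform(-1.0, 1.0, n),     # sample 2: uniform
])

candidates = ["norm", "expon", "uniform"]  # illustrative list
matched = []
for row in X:
    pvals = {}
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(row)
        pvals[name] = stats.kstest(row, name, args=params)[1]
    matched.append(max(pvals, key=pvals.get))

print(matched)
```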
- Output. The output is represented by the following objects.
- List of k histogram objects, one constructed for each of the k samples.
- Table of chi-squared statistics constructed for each sample and each standard distribution. The distribution with the smallest chi-squared statistic is matched to the related sample.
- The list of distributions matched to each sample based on the estimated chi-squared statistics.
- The table with the distribution parameters estimated for each sample and each standard distribution.
List of standard distributions
- Normal distribution N(μ, σ) where μ is the mean and σ is the standard deviation of the distribution; (https://en.wikipedia.org/wiki/Normal_distribution)
- Log-Normal distribution LogN(μ, σ) where μ is the mean and σ is the standard deviation of the natural logarithm of the distribution. The log-normal distribution has support {x>0} so that the logarithm of the values is distributed normally with parameters μ and σ. (https://en.wikipedia.org/wiki/Log-normal_distribution)
- Beta distribution: p(x) = x^(α-1) (1-x)^(β-1) / B(α, β) where B(α, β) = Γ(α) Γ(β) / Γ(α+β). The support of the distribution is x ∈ [0, 1]. (https://en.wikipedia.org/wiki/Beta_distribution)
- Cauchy distribution: p(x) = 1 / [π γ (1 + ((x-x0)/γ)^2)] (https://en.wikipedia.org/wiki/Cauchy_distribution)
- Chi-squared distribution: p(x) = x^(k/2-1) e^(-x/2) / [2^(k/2) Γ(k/2)] (https://en.wikipedia.org/wiki/Chi-squared_distribution)
- Exponential distribution: p(x) = λ e^(-λx), x ≥ 0 (https://en.wikipedia.org/wiki/Exponential_distribution)
- F (Fisher-Snedecor) distribution: p(x) = sqrt[(nx)^n m^m / (nx+m)^(n+m)] / [x B(n/2, m/2)] (https://en.wikipedia.org/wiki/F-distribution)
- Gamma distribution: p(x) = β^α x^(α-1) e^(-βx) / Γ(α) (https://en.wikipedia.org/wiki/Gamma_distribution)
- Geometric distribution: p(k) = p (1-p)^k, k = 0, 1, 2, … (https://en.wikipedia.org/wiki/Geometric_distribution)
- Laplace distribution: p(x) = e^(-|x-μ|/b) / (2b) (https://en.wikipedia.org/wiki/Laplace_distribution)
- Logistic distribution: p(x) = q(x) / [s (1+q(x))^2], where q(x) = e^(-(x-μ)/s) (https://en.wikipedia.org/wiki/Logistic_distribution)
- Poisson distribution: p(k) = λ^k e^(-λ) / k! (https://en.wikipedia.org/wiki/Poisson_distribution)
- T (Student) distribution: p(x) = (1 + x^2/ν)^(-(ν+1)/2) Γ((ν+1)/2) / [Γ(ν/2) sqrt(νπ)] (https://en.wikipedia.org/wiki/Student%27s_t-distribution)
- Triangular distribution: p(x) = 2(x-a)/[(b-a)(c-a)] for a ≤ x < c and p(x) = 2(b-x)/[(b-a)(b-c)] for c ≤ x ≤ b (https://en.wikipedia.org/wiki/Triangular_distribution)
- Uniform distribution: p(x) = 1 / (b-a) for a ≤ x ≤ b (https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))
- Weibull distribution: p(x) = (k/λ) (x/λ)^(k-1) e^(-(x/λ)^k), x ≥ 0 (https://en.wikipedia.org/wiki/Weibull_distribution)
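The densities above can be checked numerically against a reference implementation; a quick sketch for two of them using scipy.stats (note that scipy's gamma parameterization uses scale = 1/β):

```python
# Numerical sanity check of the listed Weibull and gamma densities
# against scipy.stats implementations.
import math
from scipy import stats

# Weibull: p(x) = (k/lam) (x/lam)^(k-1) exp(-(x/lam)^k)
k, lam, x = 1.7, 2.0, 1.3
p_weibull = (k / lam) * (x / lam) ** (k - 1) * math.exp(-((x / lam) ** k))
assert abs(p_weibull - stats.weibull_min.pdf(x, k, scale=lam)) < 1e-9

# Gamma: p(x) = beta^alpha x^(alpha-1) e^(-beta x) / Gamma(alpha)
a, b, x = 2.5, 1.5, 0.8
p_manual = b ** a * x ** (a - 1) * math.exp(-b * x) / math.gamma(a)
p_scipy = stats.gamma.pdf(x, a, scale=1 / b)  # scipy uses scale = 1/beta
assert abs(p_manual - p_scipy) < 1e-9
```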
Distribution parameters
- Normal: μ = E[x], σ = stdev[x];
- Log-Normal: μ = E[ln x], σ = stdev[ln x];
- Beta: α / (α + β) = E[x]; (α-1)/(α+β-2) = Mode → α = (1 - 2 Mode) / (1 - Mode/μ); β = α (1/μ - 1), where μ = E[x]
- Cauchy: x0 = Median (= Mode); x0 + γ tan[π(F-0.5)] = Quantile;
- Chi-square: k = E[x];
- Exponential: λ = 1/E[x];
- F: df2/(df2-2) = E[x]; [(df1-2)/df1] [df2/(df2+2)] = Mode[x]; → df2 = 2 E[x] / (E[x]-1); df1 = 2 / (1 - Mode[x] (df2+2)/df2), where df1 and df2 are the numerator and denominator degrees of freedom;
- Gamma: α / β = E[x]; α / β^2 = var[x]; → β = E[x] / var[x]; α = E[x]^2 / var[x]
- Geometric: p = 1 / E[x];
- Laplace: μ = E[x]; 2b^2 = var[x]; → b = stdev[x] / √2
- Logistic: μ = E[x]; s^2 π^2 / 3 = var[x]; → s = √3 stdev[x] / π
- Poisson: λ = E[x];
- T (Student): ν / (ν-2) = var[x]; → ν = 2 var[x] / (var[x]-1);
- Triangular: a = min[x]; b = max[x]; (a+b+c)/3 = E[x]; → c = 3 E[x] - a - b;
- Uniform: a = min[x]; b = max[x];
- Weibull: λ Γ(1+1/k) = E[x]; λ (ln 2)^(1/k) = Median[x], where λ is the scale parameter and k is the shape parameter.
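The closed-form moment-matching formulas above translate directly into code; a minimal sketch for a few of them (gamma, exponential, Laplace), using a simulated gamma sample with known parameters as a check:

```python
# Method-of-moments estimates from the formulas above; illustrative only.
import numpy as np

rng = np.random.default_rng(42)
x = rng.gamma(shape=3.0, scale=0.5, size=20_000)  # true alpha=3, beta=2

mean, var = x.mean(), x.var()

# Gamma: beta = E[x] / var[x]; alpha = E[x]^2 / var[x]
beta_hat = mean / var
alpha_hat = mean ** 2 / var

# Exponential: lambda = 1 / E[x]
lam_hat = 1.0 / mean

# Laplace: mu = E[x]; b = stdev[x] / sqrt(2)
mu_hat, b_hat = mean, np.sqrt(var / 2.0)

print(alpha_hat, beta_hat)
```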
Application of different distributions
Testing for the distribution
- Kolmogorov-Smirnov (KS) test.
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4/org/apache/commons/math3/stat/inference/KolmogorovSmirnovTest.html)
The test is based on the distance between the empirical and tested cdf functions:
KS = max_x |F_n(x) - F(x)|
where F_n is the empirical cdf function and F(x) is the tested cdf function. In the KS test, the general-form cdf function is transformed into the uniform cdf function: if x_n is a sample with underlying cdf F(x), then y_n = F(x_n) has the uniform distribution. The KS statistic is then constructed as
KS = max_y |U_n(y) - y|
where U_n is the empirical cdf function constructed using the y_n sample.
- Pearson Chi-squared test.
(https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)
Suppose that the number of unknown distribution parameters is s and the number of intervals used to construct the chi-squared statistic is m (m > p = s + 1). The chi-squared statistic is calculated as
χ^2 = ∑_i (O_i - E_i)^2 / E_i = N ∑_i (O_i/N - p_i)^2 / p_i
where O_i is the number of sample observations in interval i, E_i is the expected number of sample observations under the tested cdf function, N is the sample size, and p_i is the probability of interval i calculated using the tested cdf function. Under the null hypothesis that the tested cdf is true, the calculated statistic is asymptotically distributed as χ^2(m-p).
The asymptotic distribution of the chi-squared statistic does not depend on how the distribution support is partitioned into intervals (provided each interval has positive probability p_i > 0). A natural choice is to partition the support into intervals of equal probability.
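The equal-probability partition can be sketched as follows; this is an illustrative implementation in Python, assuming a fitted normal distribution with s = 2 estimated parameters and m = 10 intervals:

```python
# Chi-squared goodness-of-fit test with equal-probability intervals,
# following the formulas above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=1000)

mu, sigma = x.mean(), x.std(ddof=0)    # s = 2 fitted parameters
m = 10                                  # number of intervals

# Interior edges chosen so each interval has probability 1/m under N(mu, sigma)
interior = stats.norm.ppf(np.arange(1, m) / m, loc=mu, scale=sigma)
observed = np.bincount(np.searchsorted(interior, x), minlength=m)
expected = np.full(m, len(x) / m)       # E_i = N * p_i = N/m

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = m - (2 + 1)                       # m - p, with p = s + 1
p_value = stats.chi2.sf(chi2, dof)
print(chi2, p_value)
```

Because the sample really is normal and the partition has equal probabilities, the resulting p-value should typically be well above conventional rejection thresholds.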