Title: Application of statistical methods for the comparison of data distributions
1. Application of statistical methods for the comparison of data distributions
- Susanna Guatelli, Barbara Mascialino, Andreas Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo Viarengo
2. Outline
- The comparison of two data distributions is fundamental in experimental practice
- Many algorithms are available for the comparison of two data distributions (the two-sample problem)
- Aim of this study: compare the algorithms available in the statistics literature to select the most appropriate one in each specific case
Detector monitoring (current versus reference data)
Simulation validation (experiment versus simulation)
Reconstruction versus expectation
Regression testing (two versions of the same software)
Physics analysis (measurement versus theory, experiment A versus experiment B)
Parametric statistics
Non-parametric statistics (Goodness-of-Fit testing)
3. The two-sample problem
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
X-ray fluorescence spectrum
Dosimetric distribution from a medical LINAC
Which is the most suitable goodness-of-fit test?
4. Chi-squared test
- Applies to binned distributions
- It can also be useful for unbinned distributions, but the data must first be grouped into classes
- Cannot be applied if the theoretical frequency in any class is less than 5
- When a class falls below this minimum, one could try to merge contiguous classes until the minimum theoretical frequency is reached
- Otherwise one could use Yates' formula
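The merging rule for sparsely populated classes can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function names, the left-to-right merging order, and the example frequencies are all assumptions made for the sketch. The optional Yates term is the usual continuity correction, which subtracts 0.5 from each absolute difference.

```python
def merge_bins(observed, expected, min_expected=5.0):
    """Merge contiguous classes, left to right, until every merged class
    has a theoretical (expected) frequency of at least min_expected."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # Fold any leftover classes at the right edge into the last merged class.
    if acc_exp > 0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi2_statistic(observed, expected, yates=False):
    """Pearson chi-squared statistic, optionally with Yates' correction."""
    shift = 0.5 if yates else 0.0
    return sum((abs(o - e) - shift) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical binned data: the outer classes have expected frequencies below 5.
observed = [2, 4, 9, 14, 11, 6, 3, 1]
expected = [1.5, 4.5, 10.0, 13.0, 11.0, 6.0, 3.0, 1.0]

obs_m, exp_m = merge_bins(observed, expected)
chi2 = chi2_statistic(obs_m, exp_m)
print(obs_m, exp_m, round(chi2, 4))
```

Merging reduces the number of classes, so the number of degrees of freedom used to convert the statistic into a p-value has to be reduced accordingly.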
5. Tests based on the supremum statistics
unbinned distributions
- Kolmogorov-Smirnov test
- Goodman approximation of KS test
- Kuiper test
Test statistic: $D_{mn}$ (supremum statistic)
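A supremum statistic can be computed directly from the two empirical distribution functions. The sketch below assumes the textbook definitions, $D_{mn} = \sup_x |F_m(x) - G_n(x)|$ for Kolmogorov-Smirnov and $V = D^+ + D^-$ for Kuiper. The function names are hypothetical and the code is not taken from the GoF Toolkit.

```python
from bisect import bisect_right


def ecdf_differences(sample_a, sample_b):
    """F_a(x) - F_b(x) evaluated at every observed value x of either sample."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return [bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)
            for x in points]


def ks_statistic(sample_a, sample_b):
    """Kolmogorov-Smirnov distance: largest absolute ECDF difference."""
    return max(abs(d) for d in ecdf_differences(sample_a, sample_b))


def kuiper_statistic(sample_a, sample_b):
    """Kuiper distance: largest positive plus largest negative deviation."""
    diffs = ecdf_differences(sample_a, sample_b)
    d_plus = max(max(diffs), 0.0)
    d_minus = max(max(-d for d in diffs), 0.0)
    return d_plus + d_minus


# Hypothetical samples: b sits in the middle of a, so the ECDFs cross.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.5, 2.6]
print(ks_statistic(a, b), kuiper_statistic(a, b))
```

Because the Kuiper statistic adds the deviations on both sides, it reacts to differences in the tails as well as around the median, whereas the Kolmogorov-Smirnov distance keeps only the single largest deviation.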
6. binned/unbinned distributions
7. G.A.P. Cirrone, S. Donadio, S. Guatelli, A. Mantero, B. Mascialino, S. Parlati, M.G. Pia, A. Pfeiffer, A. Ribon, P. Viarengo, "A Goodness-of-Fit Statistical Toolkit", IEEE Transactions on Nuclear Science (2004), 51 (5), October issue.
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
8. The power of a test is the probability of correctly rejecting the null hypothesis
Power evaluation
Parent distribution 1
Parent distribution 2
Sample 1: size n
Sample 2: size m
Pseudoexperiment: a random drawing of two samples from two parent distributions
N = 1000 Monte Carlo replications
GoF test
Confidence Level = 0.05
Power = (# pseudoexperiments with p-value < (1 - CL)) / (# pseudoexperiments)
For each test, the p-value computed by the GoF Toolkit is derived from an analytical calculation of the asymptotic distribution, which often depends on the sample sizes.
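The power evaluation procedure can be sketched as a short Monte Carlo loop. This is an illustration under stated assumptions, not the study's code: it uses only the Kolmogorov-Smirnov test, the p-value comes from the standard asymptotic Kolmogorov series, the null hypothesis is rejected when the p-value is below 0.05 (one reading of the level quoted above), and the function names, seed, and Gaussian parents are hypothetical.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def ks_pvalue_asymptotic(d, n, m, terms=100):
    """Asymptotic p-value: Q(lam) = 2 * sum_j (-1)^(j-1) * exp(-2 j^2 lam^2)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def estimate_power(draw_1, draw_2, n, m, alpha=0.05, replications=1000, seed=12345):
    """Fraction of pseudoexperiments in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(replications):
        sample_1 = [draw_1(rng) for _ in range(n)]
        sample_2 = [draw_2(rng) for _ in range(m)]
        d = ks_statistic(sample_1, sample_2)
        if ks_pvalue_asymptotic(d, n, m) < alpha:
            rejections += 1
    return rejections / replications


def gaussian(rng):
    return rng.gauss(0.0, 1.0)


def shifted_gaussian(rng):
    return rng.gauss(1.0, 1.0)


# Identical parents give the false-rejection rate; shifted parents give the power.
rate_same = estimate_power(gaussian, gaussian, n=50, m=50)
power_shifted = estimate_power(gaussian, shifted_gaussian, n=50, m=50)
print(rate_same, power_shifted)
```

Running the same loop with identical parents is a useful sanity check: the rejection fraction should then stay close to, or below, the chosen level.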
9. Parent distributions
10. Skewness and tailweight
Skewness (S) and tailweight (T) of the parent distributions:
Parent                           S        T
f1(x) Uniform                    1        1.267
f2(x) Gaussian                   1        1.704
f3(x) Double exponential         1        2.161
f4(x) Cauchy                     1        5.263
f5(x) Exponential                4.486    1.883
f6(x) Contaminated normal 1      1        1.991
f7(x) Contaminated normal 2      1.769    1.693
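The definitions of S and T are not legible in the slide text. The sketch below assumes quantile-based definitions, $S = (x_{0.975} - x_{0.5}) / (x_{0.5} - x_{0.025})$ and $T = (x_{0.975} - x_{0.025}) / (x_{0.875} - x_{0.125})$, chosen because they should reproduce the tabulated values for the five standard parents. The contaminated normals are left out because their mixture parameters are not given.

```python
import math
from statistics import NormalDist


def skewness(q):
    """Ratio of upper to lower spread around the median (assumed definition)."""
    return (q(0.975) - q(0.5)) / (q(0.5) - q(0.025))


def tailweight(q):
    """Ratio of the 95% to the 75% central range (assumed definition)."""
    return (q(0.975) - q(0.025)) / (q(0.875) - q(0.125))


# Quantile functions (inverse CDFs) of the standard forms of each parent.
quantiles = {
    "Uniform": lambda p: p,
    "Gaussian": NormalDist().inv_cdf,
    "Double exponential": lambda p: math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p)),
    "Cauchy": lambda p: math.tan(math.pi * (p - 0.5)),
    "Exponential": lambda p: -math.log(1 - p),
}

for name, q in quantiles.items():
    print(f"{name}: S = {skewness(q):.3f}, T = {tailweight(q):.3f}")
```

Both quantities are ratios of quantile differences, so they do not change under a shift or rescaling of the distribution. This is why one pair (S, T) characterises a whole location-scale family.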
11. Case Parent 1 = Parent 2
The location-scale problem
Kolmogorov-Smirnov test, CL = 0.05
Power increases as a function of the sample size
(analytical calculation of the asymptotic
distribution)
[Plot: power versus sample size N, for small-sized and moderate-sized samples]
12. Case Parent 1 ≠ Parent 2
The general shape problem
A) Symmetric versus symmetric (S1 = S2 = 1)
Distribution 1: Double exponential (T1 = 2.161)
[Plot axis: T2]
B) Skewed versus symmetric
Distribution 1 - Distribution 2    KS            CVM           AD
CN2 - Normal                       55.6 ± 1.8    15.2 ± 1.1    86.1 ± 1.1
CN2 - CN1                          24.9 ± 1.4    25.2 ± 1.1    44.8 ± 1.6
CN2 - Double exponential           37.6 ± 1.5    40.2 ± 1.6    51.6 ± 1.6
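The table compares the Kolmogorov-Smirnov test with the Cramer-von Mises (CVM) and Anderson-Darling (AD) tests. For reference, the two-sample CVM and AD statistics can be sketched from their standard rank-based forms. The sketch assumes no tied observations, and the function names are hypothetical rather than the toolkit's API. With $M_i$ the number of first-sample values among the $i$ smallest pooled values and $N = n + m$, AD is $\frac{1}{nm} \sum_{i=1}^{N-1} (M_i N - n i)^2 / (i (N - i))$, and CVM is $\frac{nm}{N^2}$ times the sum of squared ECDF differences over the pooled sample.

```python
def pooled_counts(sample_a, sample_b):
    """For the i-th smallest pooled value, count the sample_a values <= it."""
    labelled = sorted([(x, 0) for x in sample_a] + [(y, 1) for y in sample_b])
    counts, seen_a = [], 0
    for _, label in labelled:
        if label == 0:
            seen_a += 1
        counts.append(seen_a)
    return counts


def cvm_statistic(sample_a, sample_b):
    """Two-sample Cramer-von Mises statistic (no ties assumed)."""
    n, m = len(sample_a), len(sample_b)
    total = 0.0
    for i, m_i in enumerate(pooled_counts(sample_a, sample_b), start=1):
        diff = m_i / n - (i - m_i) / m  # F_a - F_b at the i-th pooled value
        total += diff * diff
    return n * m / (n + m) ** 2 * total


def ad_statistic(sample_a, sample_b):
    """Two-sample Anderson-Darling statistic (no ties assumed)."""
    n, m = len(sample_a), len(sample_b)
    big_n = n + m
    counts = pooled_counts(sample_a, sample_b)
    total = 0.0
    # The last pooled point is skipped: both ECDFs equal 1 there.
    for i, m_i in enumerate(counts[:-1], start=1):
        total += (m_i * big_n - n * i) ** 2 / (i * (big_n - i))
    return total / (n * m)


# Tiny hypothetical samples that interleave perfectly.
a, b = [1.0, 3.0], [2.0, 4.0]
print(cvm_statistic(a, b), ad_statistic(a, b))
```

The denominator $i (N - i)$ in the AD sum is smallest near the ends of the pooled sample, so the AD statistic gives more weight to discrepancies in the tails than CVM does.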
13. Comparative evaluation of tests
                       Tailweight
Skewness     Short (T < 1.5)    Medium (1.5 < T < 2)    Long (T > 2)
S ≈ 1        KS                 KS - CVM                CVM - AD
S > 1.5      KS - AD            AD                      CVM - AD
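The comparative table can be encoded as a small lookup. The sketch below is one reading of the table, with several assumptions: the medium-tail entry for symmetric distributions is taken to be KS and CVM, any S up to 1.5 is treated as the symmetric row, and the boundary values T = 1.5 and T = 2, which the table leaves open, go to the medium and long classes respectively. The function name is hypothetical.

```python
# Candidate tests per (skewness class, tailweight class), as read from the table.
RECOMMENDED = {
    ("symmetric", "short"): ("KS",),
    ("symmetric", "medium"): ("KS", "CVM"),
    ("symmetric", "long"): ("CVM", "AD"),
    ("skewed", "short"): ("KS", "AD"),
    ("skewed", "medium"): ("AD",),
    ("skewed", "long"): ("CVM", "AD"),
}


def recommend_tests(skewness, tailweight):
    """Return the candidate tests for a distribution with the given S and T."""
    row = "skewed" if skewness > 1.5 else "symmetric"  # assumed cut on S
    if tailweight < 1.5:
        column = "short"
    elif tailweight < 2.0:
        column = "medium"
    else:
        column = "long"
    return RECOMMENDED[(row, column)]


print(recommend_tests(1.0, 1.267))    # uniform-like parent
print(recommend_tests(4.486, 1.883))  # exponential-like parent
```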
14. Results for the data examples
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
Extremely skewed, medium tail: ANDERSON-DARLING TEST, A² = 0.085, p > 0.05
Moderately skewed, medium tail: KOLMOGOROV-SMIRNOV TEST, D = 0.27, p > 0.05
15. Conclusions
- Studied several goodness-of-fit tests for location-scale alternatives and general alternatives
- There is no clear winner for all the considered distributions in general
- To select one test in practice:
  1. first classify the type of the distributions in terms of skewness S and tailweight T
  2. choose the most appropriate test for the classified type of distribution
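The classification step can be sketched for measured data by estimating S and T from empirical quantiles. This assumes the quantile-ratio definitions $S = (x_{0.975} - x_{0.5}) / (x_{0.5} - x_{0.025})$ and $T = (x_{0.975} - x_{0.025}) / (x_{0.875} - x_{0.125})$, which are consistent with the tabulated parent values but are not stated explicitly in the slides. The function names and the linear-interpolation quantile rule are also assumptions.

```python
def empirical_quantile(sorted_values, p):
    """Quantile by linear interpolation between neighbouring order statistics."""
    position = p * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def classify(sample):
    """Estimate (S, T) of a sample from its empirical quantiles."""
    values = sorted(sample)

    def q(p):
        return empirical_quantile(values, p)

    s = (q(0.975) - q(0.5)) / (q(0.5) - q(0.025))
    t = (q(0.975) - q(0.025)) / (q(0.875) - q(0.125))
    return s, t


# Hypothetical input: an evenly spaced grid, which mimics a uniform sample.
grid = [i / 1000 for i in range(1001)]
s_hat, t_hat = classify(grid)
print(round(s_hat, 3), round(t_hat, 3))
```

The estimates rely on the 2.5% and 97.5% quantiles, so they are noisy for small samples. A sample with many repeated values can also make a denominator zero, which this sketch does not guard against.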
Topic still subject to research activity in the
domain of statistics