Application of statistical methods for the comparison of data distributions presentation

1
Application of statistical methods for the
comparison of data distributions

Susanna Guatelli, Barbara Mascialino, Andreas
Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo
Viarengo

2
Outline

The comparison of two data distributions is
fundamental in experimental practice
Many algorithms are available for the comparison
of two data distributions (the two-sample
problem)
Aim of this study: compare the algorithms
available in the statistics literature to select the
most appropriate one in each specific case

Detector monitoring (current versus reference data)
Simulation validation (experiment versus simulation)
Reconstruction versus expectation
Regression testing (two versions of the same software)
Physics analysis (measurement versus theory, experiment A versus experiment B)
Parametric statistics
Non-parametric statistics (Goodness-of-Fit testing)
3
The two-sample problem
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
X-ray fluorescence spectrum
Dosimetric distribution from a medical LINAC
Which is the most suitable goodness-of-fit test?
4

Chi-squared test
Applies to binned distributions
It can also be used for unbinned distributions,
but the data must first be grouped into classes
Cannot be applied if the theoretical frequency
in any class is < 5
In that case, one can merge contiguous classes
until the minimum theoretical frequency is
reached
Otherwise one can use Yates' formula
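
The merging rule above can be sketched in a few lines of Python. This is only an illustration, not the toolkit's implementation: the function names, the left-to-right merging order, and the choice to fold any leftover sparse tail into the last merged class are assumptions. The optional Yates term uses the usual continuity correction (|O - E| - 0.5)^2 / E.

```python
def merge_sparse_bins(observed, expected, min_expected=5.0):
    """Merge contiguous classes left to right until each has expected >= min_expected."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for obs, exp in zip(observed, expected):
        acc_obs += obs
        acc_exp += exp
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    if acc_exp > 0:
        # Leftover sparse tail: fold it into the last merged class (assumption).
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi2_statistic(observed, expected, yates=False):
    """Pearson chi-squared statistic, optionally with Yates' continuity correction."""
    shift = 0.5 if yates else 0.0
    return sum((abs(o - e) - shift) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: the first two and last two classes are too sparse.
obs, exp = merge_sparse_bins([1, 2, 10, 12, 3, 1], [2, 3, 9, 11, 2, 2])
chi2 = chi2_statistic(obs, exp)
```

After merging, the number of degrees of freedom must be counted from the merged classes, not from the original binning.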

5
Tests based on the supremum statistics
unbinned distributions

Kolmogorov-Smirnov test
Goodman approximation of KS test
Kuiper test

Supremum statistic: D_mn
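
As an illustration of the supremum statistics named on this slide, the sketch below computes the two-sample Kolmogorov-Smirnov distance D_mn (the largest absolute difference between the two empirical distribution functions) and the Kuiper statistic (the sum of the largest positive and largest negative differences). The function names are hypothetical. The Goodman conversion is written in the commonly quoted form chi2 = 4 D^2 mn/(m+n) with 2 degrees of freedom; whether the toolkit uses exactly this form is an assumption.

```python
import math
from bisect import bisect_right


def ecdf_extremes(x, y):
    """Return (D+, D-): the largest values of F_m - G_n and of G_n - F_m."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    d_plus = d_minus = 0.0
    # Both empirical CDFs are step functions, so the extremes occur at data points.
    for z in xs + ys:
        diff = bisect_right(xs, z) / m - bisect_right(ys, z) / n
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return d_plus, d_minus


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov distance D_mn."""
    return max(ecdf_extremes(x, y))


def kuiper_statistic(x, y):
    """Two-sample Kuiper statistic V = D+ + D-."""
    return sum(ecdf_extremes(x, y))


def goodman_pvalue(d, m, n):
    """Goodman approximation (assumed form): chi2 = 4 d^2 mn/(m+n), 2 dof."""
    chi2 = 4.0 * d * d * m * n / (m + n)
    return math.exp(-chi2 / 2.0)  # exact survival function of chi2 with 2 dof
```

For example, with x = [2, 3] and y = [1, 4] the KS distance is 0.5 while the Kuiper statistic is 1.0, because the two empirical distribution functions differ by 0.5 in both directions.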
6
binned/unbinned distributions
7
G.A.P. Cirrone, S. Donadio, S. Guatelli, A.
Mantero, B. Mascialino, S. Parlati, M.G. Pia, A.
Pfeiffer, A. Ribon, P. Viarengo, "A
Goodness-of-Fit Statistical Toolkit", IEEE
Transactions on Nuclear Science (2004), 51 (5),
October issue.
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
8
The power of a test is the probability of
correctly rejecting the null hypothesis, i.e. of
rejecting it when it is false
Power evaluation
Parent distribution 1
Parent distribution 2
N = 1000 Monte Carlo replications
Pseudoexperiment: a random drawing of two
samples from two parent distributions
GoF test
Sample 1: size n
Sample 2: size m
Confidence Level: 0.05
Power = (number of pseudoexperiments with p-value < (1 - CL)) / (total number of pseudoexperiments)
For each test, the p-value computed by the GoF
Toolkit derives from analytical calculation of
the asymptotic distribution, often depending on
the sample sizes.
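
The power evaluation procedure can be sketched as a small Monte Carlo loop. This is a minimal illustration, not the GoF Toolkit code. It assumes that the 0.05 quoted on the slide is the threshold below which a p-value leads to rejection, uses the Kolmogorov-Smirnov test with the standard asymptotic Kolmogorov series for the p-value, and takes two Gaussian parents that differ in location as a hypothetical example. All function names are illustrative.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov distance between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    return max(abs(bisect_right(xs, z) / m - bisect_right(ys, z) / n)
               for z in xs + ys)


def ks_asymptotic_pvalue(d, m, n, terms=100):
    """Asymptotic p-value: Q(lam) = 2 * sum_k (-1)^(k-1) exp(-2 k^2 lam^2)."""
    lam = d * math.sqrt(m * n / (m + n))
    if lam < 0.2:
        return 1.0  # Q is indistinguishable from 1 here; the series converges slowly
    series = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                 for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * series))


def estimate_power(draw1, draw2, n, m, n_pseudo=1000, threshold=0.05, seed=12345):
    """Fraction of pseudoexperiments in which the test rejects the null hypothesis."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_pseudo):
        sample1 = [draw1(rng) for _ in range(n)]
        sample2 = [draw2(rng) for _ in range(m)]
        p_value = ks_asymptotic_pvalue(ks_statistic(sample1, sample2), n, m)
        if p_value < threshold:
            rejected += 1
    return rejected / n_pseudo


if __name__ == "__main__":
    standard = lambda rng: rng.gauss(0.0, 1.0)
    shifted = lambda rng: rng.gauss(1.0, 1.0)  # hypothetical location shift
    for size in (10, 20, 50):
        print(size, estimate_power(standard, shifted, size, size))
```

When both samples come from the same parent, the same function estimates the false rejection rate, which should stay close to or below the chosen threshold.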
9
Parent distributions
10
Skewness and tailweight
Skewness
Tailweight
Parent                          S       T
f1(x) Uniform                   1       1.267
f2(x) Gaussian                  1       1.704
f3(x) Double exponential        1       2.161
f4(x) Cauchy                    1       5.263
f5(x) Exponential               4.486   1.883
f6(x) Contaminated normal 1     1       1.991
f7(x) Contaminated normal 2     1.769   1.693
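
The formulas for S and T did not survive in this transcript. The sketch below assumes quantile-based definitions, S = (x_0.975 - x_0.5) / (x_0.5 - x_0.025) and T = (x_0.975 - x_0.025) / (x_0.875 - x_0.125), which reproduce the tabulated values for the five parents whose quantile functions are known in closed form. The contaminated normals are left out because their mixture parameters are not given here.

```python
import math
from statistics import NormalDist


def skewness(quantile):
    """Assumed definition: upper half-range over lower half-range of the central 95%."""
    return (quantile(0.975) - quantile(0.5)) / (quantile(0.5) - quantile(0.025))


def tailweight(quantile):
    """Assumed definition: central 95% range over central 75% range."""
    return (quantile(0.975) - quantile(0.025)) / (quantile(0.875) - quantile(0.125))


# Quantile (inverse CDF) functions of the standard form of each parent.
QUANTILES = {
    "Uniform": lambda p: p,
    "Gaussian": NormalDist().inv_cdf,
    "Double exponential": lambda p: math.log(2 * p) if p < 0.5 else -math.log(2 * (1 - p)),
    "Cauchy": lambda p: math.tan(math.pi * (p - 0.5)),
    "Exponential": lambda p: -math.log(1 - p),
}

if __name__ == "__main__":
    for name, q in QUANTILES.items():
        print(f"{name:20s} S = {skewness(q):.3f}  T = {tailweight(q):.3f}")
```

Both measures are ratios of quantile ranges, so they do not change under shifts or rescalings of the distribution.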
11
Case: Parent 1 = Parent 2
The location-scale problem
Kolmogorov-Smirnov test, CL = 0.05
Power increases as a function of the sample size
(analytical calculation of the asymptotic
distribution)
[Plot: power versus sample size N, for small sized and moderate sized samples]
12
Case: Parent 1 ≠ Parent 2
The general shape problem
A) Symmetric versus symmetric
(S1 = S2 = 1)
Distribution 1: Double exponential (T1 = 2.161)
B) Skewed versus symmetric
[Plot axis: T2]
Distribution 1 - Distribution 2    KS            CVM           AD
CN2 - Normal                       55.6 ± 1.8    15.2 ± 1.1    86.1 ± 1.1
CN2 - CN1                          24.9 ± 1.4    25.2 ± 1.1    44.8 ± 1.6
CN2 - Double Exponential           37.6 ± 1.5    40.2 ± 1.6    51.6 ± 1.6
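
The table compares KS with the Cramer-von Mises (CVM) and Anderson-Darling (AD) tests, whose definitions are not in this transcript. As a hedged illustration, the sketch below uses the textbook two-sample forms of both statistics for samples without ties. Whether the toolkit uses exactly these variants is an assumption, and the function names are hypothetical.

```python
from bisect import bisect_right


def cvm_statistic(x, y):
    """Two-sample Cramer-von Mises statistic:
    mn/(m+n)^2 times the sum of squared ECDF differences over the pooled sample."""
    xs, ys = sorted(x), sorted(y)
    m, n = len(xs), len(ys)
    total = sum((bisect_right(xs, z) / m - bisect_right(ys, z) / n) ** 2
                for z in xs + ys)
    return m * n / (m + n) ** 2 * total


def ad_statistic(x, y):
    """Two-sample Anderson-Darling statistic (rank form, assumes no ties):
    (1/(mn)) * sum_{i=1}^{N-1} (N*M_i - m*i)^2 / (i*(N-i)),
    where M_i counts x-values among the i smallest pooled observations."""
    xs = sorted(x)
    m, n = len(x), len(y)
    big_n = m + n
    pooled = sorted(list(x) + list(y))
    total = 0.0
    for i, z in enumerate(pooled[:-1], start=1):
        m_i = bisect_right(xs, z)
        total += (big_n * m_i - m * i) ** 2 / (i * (big_n - i))
    return total / (m * n)
```

The AD form divides each squared difference by i(N - i), which gives more weight to the tails of the pooled sample than the CVM sum does.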
13
Comparative evaluation of tests
                 Tailweight
Skewness     Short (T < 1.5)    Medium (1.5 < T < 2)    Long (T > 2)
S = 1        KS                 KS - CVM                CVM - AD
S > 1.5      KS - AD            AD                      CVM - AD
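
The comparative table can be turned into a small lookup, sketched below. The handling of boundary values (T exactly 1.5 or 2) and of skewness between 1 and 1.5 is not specified on the slide. This sketch assigns T = 1.5 and T = 2 to the medium class and treats anything with S not above 1.5 as belonging to the S = 1 row. Both choices are assumptions, as is the function name.

```python
# Rows: skewness class; columns: tailweight class; entries: suggested tests.
RECOMMENDED = {
    "symmetric": {"short": ["KS"], "medium": ["KS", "CVM"], "long": ["CVM", "AD"]},
    "skewed": {"short": ["KS", "AD"], "medium": ["AD"], "long": ["CVM", "AD"]},
}


def recommend_tests(skewness, tailweight):
    """Return the tests suggested by the table for a given (S, T) classification."""
    if tailweight < 1.5:
        tail = "short"
    elif tailweight <= 2.0:
        tail = "medium"
    else:
        tail = "long"
    shape = "skewed" if skewness > 1.5 else "symmetric"  # assumption for 1 < S <= 1.5
    return RECOMMENDED[shape][tail]


# Example: an exponential parent (S = 4.486, T = 1.883) maps to the AD test.
tests_for_exponential = recommend_tests(4.486, 1.883)
```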
14
Results for the data examples
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
Extremely skewed, medium tail: ANDERSON-DARLING TEST, A² = 0.085, p > 0.05
Moderately skewed, medium tail: KOLMOGOROV-SMIRNOV TEST, D = 0.27, p > 0.05
15
Conclusions

Studied several goodness-of-fit tests for
location-scale alternatives and general
alternatives
There is no clear winner for all the considered
distributions in general
To select one test in practice
1. first classify the type of the distributions
in terms of skewness S and tailweight T
2. choose the most appropriate test for the
classified type of distribution

Topic still subject to research activity in the
domain of statistics
