Application of statistical methods for the comparison of data distributions



1
Application of statistical methods for the
comparison of data distributions
  • Susanna Guatelli, Barbara Mascialino, Andreas
    Pfeiffer, Maria Grazia Pia, Alberto Ribon, Paolo
    Viarengo

2
Outline
  • The comparison of two data distributions is
    fundamental in experimental practice
  • Many algorithms are available for the comparison
    of two data distributions (the two-sample
    problem)
  • Aim of this study: compare the algorithms
    available in the statistics literature to select
    the most appropriate one in each specific case

  • Detector monitoring (current versus reference
    data)
  • Simulation validation (experiment versus
    simulation)
  • Reconstruction versus expectation
  • Regression testing (two versions of the same
    software)
  • Physics analysis (measurement versus theory,
    experiment A versus experiment B)

  • Parametric statistics
  • Non-parametric statistics (Goodness-of-Fit
    testing)
3
The two-sample problem
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
X-ray fluorescence spectrum
Dosimetric distribution from a medical LINAC
Which is the most suitable goodness-of-fit test?
4
Chi-squared test
  • Applies to binned distributions
  • It can also be useful for unbinned
    distributions, but the data must first be
    grouped into classes
  • Cannot be applied if the theoretical (expected)
    frequency in any class is < 5
  • When this is the case, one could try to merge
    contiguous classes until the minimum theoretical
    frequency is reached
  • Otherwise one could use Yates' formula
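
The merging of contiguous classes mentioned above can be sketched as follows. This is a minimal illustration, not the toolkit's implementation: the function name `merge_low_bins`, the left-to-right greedy merging order, and the example counts are all assumptions made here for illustration.

```python
import numpy as np
from scipy import stats


def merge_low_bins(observed, expected, min_expected=5.0):
    """Greedily merge contiguous classes (left to right) until every
    merged class has an expected frequency >= min_expected.
    Hypothetical helper; the merging order is an assumption."""
    merged_obs, merged_exp = [], []
    acc_obs, acc_exp, pending = 0.0, 0.0, False
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        pending = True
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp, pending = 0.0, 0.0, False
    if pending:
        # Leftover classes at the right edge: fold them into the last class.
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return np.array(merged_obs), np.array(merged_exp)


# Made-up counts, for illustration only.
observed = [2, 3, 10, 12, 4, 1]
expected = [1.5, 3.5, 11.0, 10.0, 4.0, 2.0]

obs_m, exp_m = merge_low_bins(observed, expected)
chi2_stat = float(np.sum((obs_m - exp_m) ** 2 / exp_m))
# Degrees of freedom: number of merged classes minus one.
p_value = float(stats.chi2.sf(chi2_stat, df=len(obs_m) - 1))
print(obs_m, exp_m, chi2_stat, p_value)
```

Merging reduces the number of classes and therefore the degrees of freedom, so the test loses some sensitivity to fine structure in the merged region.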

5
Tests based on the supremum statistics
unbinned distributions
  • Kolmogorov-Smirnov test
  • Goodman approximation of KS test
  • Kuiper test

Supremum statistic: D_mn
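
The formula for the supremum statistic did not survive in this transcript. As a sketch, assuming the standard textbook definitions, the Kolmogorov-Smirnov distance is the largest absolute difference between the two empirical distribution functions, and the Kuiper statistic is the sum of the largest positive and largest negative differences. The function name `supremum_statistics` is hypothetical.

```python
import numpy as np
from scipy import stats


def supremum_statistics(sample_1, sample_2):
    """Return (D, V): the two-sample Kolmogorov-Smirnov distance and
    the Kuiper statistic, both computed from the empirical distribution
    functions (EDFs) of the two samples."""
    x = np.sort(np.asarray(sample_1, dtype=float))
    y = np.sort(np.asarray(sample_2, dtype=float))
    pooled = np.concatenate([x, y])
    # EDF of each sample evaluated at every pooled observation.
    f = np.searchsorted(x, pooled, side="right") / x.size
    g = np.searchsorted(y, pooled, side="right") / y.size
    diff = f - g
    d_ks = np.max(np.abs(diff))                # sup |F - G|
    v_kuiper = np.max(diff) + np.max(-diff)    # sup (F - G) + sup (G - F)
    return float(d_ks), float(v_kuiper)


rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 40)
b = rng.normal(0.5, 1.0, 60)
d, v = supremum_statistics(a, b)
print(d, v, stats.ks_2samp(a, b).statistic)
```

Because the Kuiper statistic adds the deviations in both directions, it is never smaller than the Kolmogorov-Smirnov distance.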
6
binned/unbinned distributions
7
G.A.P. Cirrone, S. Donadio, S. Guatelli, A.
Mantero, B. Mascialino, S. Parlati, M.G. Pia, A.
Pfeiffer, A. Ribon, P. Viarengo, "A
Goodness-of-Fit Statistical Toolkit", IEEE
Transactions on Nuclear Science (2004), 51 (5),
October issue.
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
8
Power evaluation
The power of a test is the probability of
correctly rejecting the null hypothesis, i.e. of
rejecting it when it is false
Pseudoexperiment: a random drawing of two
samples from two parent distributions
  • Sample 1 (size n) drawn from parent
    distribution 1
  • Sample 2 (size m) drawn from parent
    distribution 2
  • GoF test applied to the two samples,
    Confidence Level = 0.05
  • N = 1000 Monte Carlo replications
Power = (number of pseudoexperiments with
p-value < (1-CL)) / (number of pseudoexperiments)
For each test, the p-value computed by the GoF
Toolkit derives from analytical calculation of
the asymptotic distribution, often depending on
the sample sizes.
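
The power evaluation procedure can be sketched as a short Monte Carlo loop. This is only an illustration under stated assumptions: it uses SciPy's `ks_2samp` in place of the GoF Toolkit, it reads the 0.05 on the slide as the rejection threshold on the p-value (reject when p < 0.05), and the names `estimate_power` and `gaussian` as well as the Gaussian parents are hypothetical choices.

```python
import numpy as np
from scipy import stats


def estimate_power(draw_1, draw_2, n, m, n_rep=1000, alpha=0.05, seed=0):
    """Fraction of pseudoexperiments in which the two-sample KS test
    rejects the null hypothesis (p-value < alpha)."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        sample_1 = draw_1(rng, n)   # sample of size n from parent 1
        sample_2 = draw_2(rng, m)   # sample of size m from parent 2
        if stats.ks_2samp(sample_1, sample_2).pvalue < alpha:
            rejections += 1
    return rejections / n_rep


def gaussian(loc=0.0, scale=1.0):
    """Hypothetical parent: returns a sampler for a Gaussian."""
    return lambda rng, size: rng.normal(loc, scale, size)


# Parents differing in location: the estimate is the power.
power_shift = estimate_power(gaussian(0.0, 1.0), gaussian(1.0, 1.0), n=50, m=50)
# Identical parents: the estimate is the false rejection rate.
false_rate = estimate_power(gaussian(), gaussian(), n=50, m=50)
print(power_shift, false_rate)
```

Running the same loop with identical parents is a useful sanity check, since the rejection rate should then stay close to or below the chosen threshold.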
9
Parent distributions
10
Skewness and tailweight
Skewness
Tailweight
Parent                          S       T
f1(x) Uniform                   1       1.267
f2(x) Gaussian                  1       1.704
f3(x) Double exponential        1       2.161
f4(x) Cauchy                    1       5.263
f5(x) Exponential               4.486   1.883
f6(x) Contaminated normal 1     1       1.991
f7(x) Contaminated normal 2     1.769   1.693
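
The defining formulas for S and T did not survive in this transcript. As a sketch, assuming quantile-based definitions, S = (x_0.975 - x_0.5) / (x_0.5 - x_0.025) and T = (x_0.975 - x_0.025) / (x_0.875 - x_0.125), the tabulated values for the first five parents can be recomputed from SciPy's standard distributions. These definitions are an assumption that happens to reproduce the table; the contaminated normals are omitted because their mixture parameters are not given here.

```python
from scipy import stats


def skewness_tailweight(dist):
    """Quantile-based skewness S and tailweight T of a frozen SciPy
    distribution (assumed definitions, see the text above)."""
    q = dist.ppf  # quantile function (inverse CDF)
    s = (q(0.975) - q(0.5)) / (q(0.5) - q(0.025))
    t = (q(0.975) - q(0.025)) / (q(0.875) - q(0.125))
    return float(s), float(t)


parents = {
    "Uniform": stats.uniform(),
    "Gaussian": stats.norm(),
    "Double exponential": stats.laplace(),
    "Cauchy": stats.cauchy(),
    "Exponential": stats.expon(),
}

results = {name: skewness_tailweight(dist) for name, dist in parents.items()}
for name, (s, t) in results.items():
    print(f"{name:20s} S = {s:.3f}  T = {t:.3f}")
```

Both measures depend only on quantiles, so they remain defined for the Cauchy distribution, whose moments do not exist.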
11
Case: Parent 1 = Parent 2
The location-scale problem
Kolmogorov-Smirnov test, CL = 0.05
Power increases as a function of the sample size
(analytical calculation of the asymptotic
distribution)
[Plot: power versus sample size N, for small
sized and moderate sized samples]
12
Case: Parent 1 ≠ Parent 2
The general shape problem
A) Symmetric versus symmetric
(S1 = S2 = 1)
Distribution 1: Double exponential (T1 = 2.161)
[Plot: axis T2]
B) Skewed versus symmetric
Distribution 1 - Distribution 2    KS           CVM          AD
CN2 - Normal                       55.6 ± 1.8   15.2 ± 1.1   86.1 ± 1.1
CN2 - CN1                          24.9 ± 1.4   25.2 ± 1.1   44.8 ± 1.6
CN2 - Double exponential           37.6 ± 1.5   40.2 ± 1.6   51.6 ± 1.6
13
Comparative evaluation of tests
Tailweight:    Short (T < 1.5)   Medium (1.5 < T < 2)   Long (T > 2)
Skewness:
S = 1          KS                KS - CVM               CVM - AD
S > 1.5        KS - AD           AD                     CVM - AD
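
The comparative table can be turned into a small lookup, sketched here under several assumptions: the cells are read as (KS | KS, CVM | CVM, AD) for the symmetric row and (KS, AD | AD | CVM, AD) for the skewed row; any S up to 1.5 is assigned to the first row; boundary values of T go to the heavier-tailed class. None of these conventions is stated on the slide, and the names `recommend_tests` and `classify_tailweight` are hypothetical.

```python
def classify_tailweight(t):
    """Map tailweight T to the three classes of the table
    (boundary handling is an assumption)."""
    if t < 1.5:
        return "short"
    if t < 2.0:
        return "medium"
    return "long"


# Candidate tests per (skewness class, tailweight class), as read from
# the table; the cell boundaries are a reconstruction.
RECOMMENDED = {
    ("symmetric", "short"): ("KS",),
    ("symmetric", "medium"): ("KS", "CVM"),
    ("symmetric", "long"): ("CVM", "AD"),
    ("skewed", "short"): ("KS", "AD"),
    ("skewed", "medium"): ("AD",),
    ("skewed", "long"): ("CVM", "AD"),
}


def recommend_tests(s, t, skew_threshold=1.5):
    """Return the candidate tests for skewness s and tailweight t."""
    skew = "skewed" if s > skew_threshold else "symmetric"
    return RECOMMENDED[(skew, classify_tailweight(t))]


# Example with the tabulated S and T of two parent distributions.
print(recommend_tests(1.0, 1.704))    # Gaussian parent
print(recommend_tests(4.486, 1.883))  # Exponential parent
```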
14
Results for the data examples
EXAMPLE 1: binned data
EXAMPLE 2: unbinned data
Extremely skewed, medium tail: ANDERSON-DARLING
TEST, A² = 0.085, p > 0.05
Moderately skewed, medium tail: KOLMOGOROV-SMIRNOV
TEST, D = 0.27, p > 0.05
15
Conclusions
  • Studied several goodness-of-fit tests for
    location-scale alternatives and general
    alternatives
  • There is no clear winner for all the considered
    distributions in general
  • To select one test in practice:
      1. first classify the type of the distributions
         in terms of skewness S and tailweight T
      2. choose the most appropriate test for the
         classified type of distribution

Topic still subject to research activity in the
domain of statistics