PHYSTAT05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology

Müge Karagöz Ünel (MKU), Oxford University
Seminar at University College London, 03/11/2006
2. Outline
- Conference Information and History
- Introduction to statistics
- Selection of hot topics
- Available tools
- Astrophysics and cosmology
- Conclusions
3. PHYSTAT History
4. Poster
[conference poster image]
5. Chronology of PHYSTAT05

Where     When        Issues                  Physicists                      Statisticians
CERN      Jan 2000    Limits                  Particles                       3
Fermilab  March 2000  Limits                  Particles + 3 astrophysicists   3
Durham    March 2002  Wider range of topics   Particles + 3 astrophysicists   2
SLAC      Sept 2003   Wider range of topics   Particles, Astro, Cosmo         Many
Oxford    Sept 2005   Wider range of topics   Particles, Astro, Cosmo         Many
6. PHYSTAT05 Programme
- 7 invited talks by statisticians
- 9 invited talks by physicists
- 38 contributed talks
- 8 posters
- Panel discussion
- 3 conference summaries
- 90 participants
7. Invited Talks by Statisticians
- David Cox: Keynote Address (Bayesians, Frequentists, Physicists)
- Steffen Lauritzen: Goodness of Fit
- Jerry Friedman: Machine Learning
- Susan Holmes: Visualisation
- Peter Clifford: Time Series
- Mike Titterington: Deconvolution
- Nancy Reid: Conference Summary (Statistics)
8. Invited Talks by (Astro)Physicists
- Bob Cousins: Nuisance Parameters for Limits
- Kyle Cranmer: LHC Discovery
- Alex Szalay: Astrophysics Terabytes
- Jean-Luc Starck: Multiscale Geometry
- Jim Linnemann: Statistical Software for Particle Physics
- Bob Nichol: Statistical Software for Astrophysics
- Stephen Johnson: Historical Transits of Venus
- Andrew Jaffe: Conference Summary (Astrophysics)
- Gary Feldman: Conference Summary (Particles)
9. Contents of the Proceedings
- Bayes/Frequentist: 5 talks
- Goodness of fit: 5
- Likelihood/parameter estimation: 6
- Nuisance parameters/limits/discovery: 10
- Machine learning: 7
- Software: 8
- Visualisation: 1
- Astrophysics: 5
- Time series: 1
- Deconvolution: 3
10. Statistics in (Astro/Particle) Physics
11. Statistics in (Particle) Physics
- An experiment goes through the following stages:
- Prepare conditions for taking data for a particle X (if theory-driven)
- Record events that might be X and reconstruct the measurables
- Select events that could have X by applying criteria (cuts)
- Generate histograms of variables and ask the questions: is there any evidence for new things, or is the null hypothesis unrefuted? If there is evidence, what are the estimates for the parameters of X? (Confrontation of theory with experiment, or vice versa.)
- The answers can come via your favorite statistical technique (it depends on how you ask the question)
12. (Yet Another) Chronology (from S. Andreon's web page)
- Homo apriorius establishes the probability of a hypothesis, no matter what the data tell.
- Homo pragmaticus establishes that it is interested in the data only.
- Homo frequentistus measures the probability of the data given the hypothesis.
- Homo sapiens measures the probability of the data and of the hypothesis.
- Homo bayesianis measures the probability of the hypothesis, given the data.
13. Bayesian vs Frequentist
We need to make a statement about parameters, given data.

Bayes: 1763. Frequentism: 1937. Both analyse the data x to make a statement about the parameters θ. Both use Prob(x; θ), e.g. an interval quoted at 90%, but with very different interpretations:

- Bayesian: probability of the parameter, given the data
- Frequentist: probability of the data, given the parameter

"Bayesians address the question everyone is interested in, by using assumptions no-one believes. Frequentists use impeccable logic to deal with an issue of no interest to anyone."
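As a toy illustration of "same numbers, very different interpretation" (not from the talk; the measurement values here are hypothetical), consider a Gaussian measurement with known resolution:

```python
import math

# Hypothetical measurement: x0 = 5.2 with known Gaussian resolution sigma = 1.0
x0, sigma = 5.2, 1.0
z90 = 1.645  # central 90% point of the standard Gaussian

# Frequentist: the random interval [x - z*sigma, x + z*sigma] covers the
# true theta in 90% of repeated experiments.
freq = (x0 - z90 * sigma, x0 + z90 * sigma)

# Bayesian with a flat prior on theta: the posterior is Gaussian around x0,
# so the 90% credible interval has the same endpoints, but now it is a
# probability statement about theta itself.
bayes = (x0 - z90 * sigma, x0 + z90 * sigma)

print(freq == bayes)  # True: identical numbers, different meaning
```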
14. Goodness of Fit
- Lauritzen: Invited talk on GoF
- Yabsley: GoF and sparse multi-D data
- Ianni: GoF and sparse multi-D data
- Raja: GoF and likelihood
- Gagunashvili: χ² and weighting
- Pia: Software toolkit for data analysis
- Block: Rejecting outliers
- Bruckman: Alignment
- Blobel: Tracking
15. Goodness of Fit
- We would like to know if a given distribution is of a specified type, to test the validity of a postulated model, etc.
- A few GoF tests are widely used in practice:
- χ² test: the most widely used; a typical application is 1D or 2D fits to data
- G² (the likelihood-ratio statistic) test: the general version of the χ² test (Lauritzen's personal choice)
- Kolmogorov-Smirnov test: robust but prone to mislead; can be used to confirm that, say, two distributions (histograms) are the same by calculating the p-value for the difference hypothesis
- Other new methods, such as Aslan and Zech's energy test, exist
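A minimal sketch of the first of these, the Pearson χ² statistic for binned data (the histogram counts below are made up for illustration):

```python
import math

def chi2_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical histogram: 5 bins, expected counts from a model
expected = [20.0, 30.0, 25.0, 15.0, 10.0]
observed = [18, 33, 26, 12, 11]

chi2 = chi2_statistic(observed, expected)
ndf = len(observed) - 1  # no fitted parameters in this sketch

# The 95% critical value of chi^2 with 4 degrees of freedom is about 9.49,
# so chi2 well below that means no evidence against the model.
print(chi2, ndf, chi2 < 9.49)
```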
16. An Example from ATLAS (Bruckman)
Direct least-squares solution to the silicon tracker alignment problem.

The method consists of minimizing the giant χ² resulting from a simultaneous fit of all particle trajectories and alignment parameters. Using the linear expansion (we assume all second-order derivatives are negligible), the track fit is solved in closed form, while the alignment parameters are given by the resulting linear system.

The systems are large, with inherent computational challenges. Equivalent to the Millepede approach of V. Blobel.
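The slide's formulae (the track-fit and alignment solutions) did not survive extraction. As a generic sketch of such a global least-squares fit, with symbols of my choosing rather than Bruckman's exact notation:

```latex
% Global chi^2: sum over tracks of residuals r, which depend on the
% alignment parameters a and the per-track parameters \pi;
% V is the covariance of the hit measurements.
\chi^2(a, \pi) = \sum_{\mathrm{tracks}} r^{T}(a, \pi)\, V^{-1}\, r(a, \pi)

% Linearising r in (a, \pi), setting the derivatives to zero, and
% eliminating the track parameters leaves one large linear system
% for the alignment corrections \delta a:
\Big( \sum_{\mathrm{tracks}} J^{T} W J \Big)\, \delta a
  = - \sum_{\mathrm{tracks}} J^{T} W\, r_{0}
% with J = \partial r / \partial a, W an effective weight matrix after
% the track fit, and r_0 the residuals at the expansion point.
```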
17. Nuisance Parameters/Limits/Discovery
- Cousins: Limits and nuisance parameters
- Reid: Respondent
- Punzi: Frequentist multi-dimensional ordering rule
- Tegenfeldt: Feldman-Cousins and Cousins-Highland
- Rolke: Limits
- Heinrich: Bayes limits
- Bityukov: Poisson situations
- Hill: Limits vs discovery (see Punzi @ PHYSTAT2003)
- Cranmer: LHC discovery and nuisance parameters
18. Systematics
Note: systematic errors (HEP) ↔ nuisance parameters (statisticians).

An example: the physics parameter is extracted from the observed data together with auxiliary quantities that we need to know, probably from other measurements (and/or theory). Their uncertainties propagate into the error on the physics parameter, just as for statistical errors; some of them are arguably statistical errors.
19. Nuisance Parameters
- Nuisance parameters are parameters with unknown true values. They may be:
- statistical, such as the number of background events in a sideband used for estimating the background under a peak
- systematic, such as the shape of the background under the peak, or the error caused by the uncertainty of the hadronic fragmentation model in the Monte Carlo
- Most experiments have a large number of systematic uncertainties.
- If the experimenter is blind to these uncertainties, they become a bigger nuisance!
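The sideband example above can be sketched numerically: profile out the background nuisance parameter b in a toy Poisson counting experiment (all numbers hypothetical; a grid scan stands in for a proper minimiser):

```python
import math

def nll(s, b, n, m, tau):
    """Negative log-likelihood (constants dropped) for a counting experiment
    with a sideband: n ~ Poisson(s + b), m ~ Poisson(tau * b)."""
    return (s + b) - n * math.log(s + b) + tau * b - m * math.log(tau * b)

def profile_nll(s, n, m, tau, bmax=30.0, steps=3000):
    """Profile out the nuisance parameter b by minimising over a grid."""
    return min(nll(s, b, n, m, tau)
               for b in (bmax * (i + 1) / steps for i in range(steps)))

# Hypothetical numbers: 15 events in the signal region, 20 in a sideband
# twice as wide (tau = 2), so the background estimate is b-hat ~ 10.
n, m, tau = 15, 20, 2.0
best_s = min((profile_nll(s, n, m, tau), s)
             for s in (0.1 * i for i in range(1, 151)))[1]
print(best_s)  # close to the analytic MLE, s-hat = n - m/tau = 5
```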
20. Issues with the LHC
- The LHC will collide bunches 40 million times/sec and collect petabytes of data. pp collisions at 14 TeV will generate events much more complicated than at LEP or the Tevatron.
- Kyle Cranmer has pointed out that systematic issues will be even more important at the LHC.
- If the statistical error is O(1) and the systematic error is O(0.1), it does not much matter how you treat the latter.
- However, at the LHC we may have processes with 100 background events and 10% systematic errors; this is not negligible.
- Even more critical, we want 5σ for a discovery level.
21. Why 5σ? (Feldman, Cranmer)
- LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, ...): 5 × 10^4 chances to find something.
- One experiment: false positive rate at 5σ ≈ (5 × 10^4)(3 × 10^-7) ≈ 0.015. OK.
- Two experiments:
- Assume an allowable false positive rate of 10.
- 2 (5 × 10^4)(1 × 10^-4) = 10 → 3.7σ required.
- Require verification by the other experiment, assume an allowable rate of 0.01: (1 × 10^-3)(10) = 0.01 → 3.1σ required.
- Caveats: Is the significance real? Are there common systematic errors?
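The arithmetic on this slide follows from the one-sided Gaussian tail probability, which can be checked directly:

```python
import math

def one_sided_p(z):
    """One-sided Gaussian tail probability of a fluctuation >= z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2))

trials = 500 * 100            # 500 searches x 100 resolution elements
p5 = one_sided_p(5.0)         # ~2.9e-7, the slide's 3e-7
print(trials * p5)            # ~0.014 expected false positives: OK

# With two experiments and 10 allowed false positives, the required
# per-element p-value is 10 / (2 * trials) = 1e-4, i.e. about 3.7 sigma:
print(one_sided_p(3.7))       # ~1.1e-4
```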
22. Confidence Intervals
- Various techniques were discussed during the conference. Most concerns were summarized by Feldman.
- Bayesian: a good method, but Heinrich showed that flat priors in multi-D may lead to undesirable results (undercoverage).
- Frequentist-Bayesian hybrids: Bayesian for the priors and frequentist to extract the range. Cranmer considered this for the LHC (it was also used in Higgs searches).
- Profile likelihood: shown by Punzi to have issues when the distribution is Poisson-like.
- Full Neyman construction: Cranmer and Punzi attempted this, but it is not feasible for a large number of nuisance parameters.
- The Banff workshop of this summer was found useful in comparing various methods. The real suggestions for the LHC will likely come from the 2007 workshop on LHC issues.
23. Event Classification
- The problem: given a measurement of an event X, find F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b), so as to optimize a figure of merit, say, s/√b for discovery and s/√(s+b) for an established signal.
- Theoretical solution: use MC to calculate the likelihood ratio L_s(X)/L_b(X) and derive F(X) from it. Unfortunately, this does not work: in a high-dimensional space, even the largest data set is sparse. (Feldman)
- In recent years, physicists have turned to machine learning: give the computer samples of s and b events and let the computer figure out what F(X) is.
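The two figures of merit mentioned above are simple to evaluate for any candidate cut (the event counts below are hypothetical):

```python
import math

def discovery_fom(s, b):
    """Approximate discovery figure of merit, s / sqrt(b)."""
    return s / math.sqrt(b)

def measurement_fom(s, b):
    """Figure of merit s / sqrt(s + b) for an established signal."""
    return s / math.sqrt(s + b)

# Hypothetical cut: 25 signal events over 100 background events
print(discovery_fom(25, 100))    # 2.5
print(measurement_fom(25, 100))  # ~2.24
```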
24. Multivariate Analysis
- Friedman: Machine learning
- Prosper: Respondent
- Narsky: Bagging
- Roe: Boosting (MiniBooNE)
- Gray: Bayes optimal classification
- Bhat: Bayesian networks
- Sarda: Signal enhancement
25. Multivariates and Machine Learning
- Various methods exist to classify, train and test events.
- Artificial neural networks (ANN): currently the most widely used (examples from Prosper, ...)
- Decision trees: a differentiating variable is used to separate the sample into branches until a leaf with a preset number of signal and background events is found.
- Trees with rules: combining a series of trees to increase the power of a single decision tree (Friedman)
- Bagging (Bootstrap AGGregatING) trees: build a collection of trees by selecting samples of the training data (Narsky)
- Boosted trees: a robust method that gives misclassified events in one tree a higher weight in the generation of a new tree
- Comparisons of significance were performed, but not all of them were controlled experiments, so conclusions may be deceptive until further tests.
26. Example: Boosted Decision Trees (Roe)
- A nice example from MiniBooNE:
- Create M trees and take the final score for signal and background as a weighted sum of the individual trees.
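The "weighted sum of individual trees" idea can be sketched with a minimal AdaBoost over one-variable decision stumps; this is an illustration of the general technique, not MiniBooNE's actual algorithm, and the toy data are made up:

```python
import math

def stump(threshold):
    """A one-variable decision stump: +1 (signal) above threshold, -1 below."""
    return lambda x: 1 if x > threshold else -1

def adaboost(data, labels, thresholds, rounds):
    """Minimal AdaBoost: the final score is the weighted sum of the
    individual trees (here, stumps), as on the slide."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, stump) pairs
    for _ in range(rounds):
        # pick the stump with the smallest weighted error
        best = min(thresholds,
                   key=lambda t: sum(wi for wi, x, y in zip(w, data, labels)
                                     if stump(t)(x) != y))
        h = stump(best)
        err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
        err = min(max(err, 1e-9), 1 - 1e-9)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # reweight: misclassified events get a higher weight in the next round
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, data, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: sum(a * h(x) for a, h in ensemble)

# Hypothetical 1-D toy: signal clusters at high x, background at low x
data   = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
score = adaboost(data, labels, thresholds=[0.25, 0.5, 0.75], rounds=3)
print(score(0.9) > 0, score(0.1) > 0)  # True False
```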
27. Punzi Effect (Getting the Likelihood Wrong)
- Giovanni Punzi @ PHYSTAT2003: comments on likelihood fits with variable resolution.
- Separating two close signals (A and B) when the resolution σ varies event by event and is different for the two signals.
- e.g. for a mass M, different numbers of tracks → different σ_M
- Avoiding the Punzi bias:
- Include p(σ_A) and p(σ_B) in the fit, OR
- Fit each range of σ_i separately, and add (N_A)_i → (N_A)_total, and similarly for B
- Beware of event-by-event variables and construct likelihoods accordingly
- (Talk by Catastini)
28. Blind Analyses
- Potential problem: experimenter's bias
- Original suggestion: Luis Alvarez
- Methods of blinding:
- Keep the signal-region box closed
- Add random numbers to the data
- Keep the Monte Carlo parameters blind
- Use part of the data to define the procedure
- A number of analyses in experiments are doing blind searches
- In general, don't modify the result after unblinding.
- Question: will the LHC experiments choose to be blind? In which analyses?
29. Astrophysics and Cosmology Highlights

30. Astro/Cosmo General Issues
"There is only one universe, and some experiments can never be rerun." A. Jaffe (concluding talk) → astro/cosmo tend to be more Bayesian, by nature.

- Virtual Observatories: all astro data available from the desktop
- Data volume growth doubling every year; most data are on the web (Szalay)
- Bad: computing and storage issues
- Good (?): systematic errors more significant than statistical errors
- Nichol discussed using grid techniques.
31. Astrophysics: Various Hot Points
- Flat priors have been used commonly, but are dangerous (Cox, Le Diberder, Cousins): would Ω be the best quantity to use, or is it Ωh²?
- Issues with the non-Gaussian distribution of noise being taken into account in the spectrum: a few methods discussed by Starck, Digel, ...
- Blind analyses are rare (not so good at a priori modeling!)
- Lots of good software in astrophysics, and repositories more advanced than in PP.
- Jaffe's talk has a nice usage of the CMB as a case study for statistical methods in astrophysics, starting from first principles of Bayesian inference.
32. Software and Available Tools
33. Talks Given on Software
- Linnemann: Software for particles
- Nichol: Software for astro (and Grid)
- Le Diberder: sPlot
- Paterno: R
- Kreschuk: ROOT
- Verkerke: RooFit
- Pia: Goodness of fit
- Buckley: CEDAR
- Narsky: StatPatternRecognition
34. Available Tools
- A number of good software packages have become more and more available (good news for the LHC!)
- PP and astro use somewhat different software (e.g. IDL and IRAF in astro)
- 2004 Phystat workshop at MSU on statistical software (mainly on R and ROOT) by Linnemann
- Statisticians have a repository of standard source codes (StatLib): http://lib.stat.cmu.edu/
- One good output of the conference was a recommendation of a Statistical Software Repository at FNAL
- Linnemann has a web page of collections: http://www.pa.msu.edu/people/linnemann/stat_resources.html
35. CDF Statistics Committee Resources
- Documentation about statistics and a repository: http://www-cdf.fnal.gov/physics/statistics/statistics_home.html
36. Sample Repository Page
37. CEDAR and CEPA
38. Summary and Conclusions
- Very useful physicist/statistician interaction:
- e.g. confidence intervals with nuisance parameters,
- multivariate techniques, etc.
- Lots of things learnt from:
- ourselves (by having to present our own stuff!)
- each other (various different approaches)
- statisticians (updates on techniques)
- A step towards common tools / software repositories: http://www.phystat.org (Linnemann)
- Programme, transparencies, papers, etc.: http://www.physics.ox.ac.uk/phystat05 (with useful links such as recommended readings)
- Proceedings published by Imperial College Press (Spring '06)
39. What is Next?
- A few workshops/schools have taken place since October 2005:
- e.g. Manchester (Nov 2005), SAMSI Duke (April 2006), Banff (July 2006), Spanish Summer School (July 2006)
- No PHYSTAT conference in summer 2007
- ATLAS Workshop on Statistical Methods, 18-19 Jan 2007
- PHYSTAT Workshop at CERN, 27-29 June 2007, on statistical issues for LHC physics analyses
- (Both workshops will likely aim at discovery significance. Please attend!)
- Suggestions/enquiries to l.lyons@physics.ox.ac.uk

- The LHC will take data soon. We do not wish to say: "The experiment was inconclusive, so we had to use statistics" (inside cover of the Good Book by L. Lyons)
- Rather, say: "We used statistics, and so we are sure that we've discovered X (well, with some confidence level!)"
40. Some Final Notes
- I tried to give you a collage of PHYSTAT05 topics.
- My deepest thanks to Louis for giving me the chance and introducing me to the PHYSTAT experience!
- Apologies for those talks I have not been able to cover.
- Thank you for the invitation!
41. Backup
- Bayes
- Frequentist
- Cousins-Highland
- Higgs Saga at CERN
42. Bayesian Approach
Bayes' Theorem: posterior = likelihood × prior (up to normalisation)

Problems:
- P(param): true or false? It is a degree of belief.
- Prior: what functional form? Flat? In which variable?
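The slide's formula did not survive extraction; the standard statement of Bayes' theorem for a parameter θ and data x, with the labels the slide uses, is:

```latex
\underbrace{p(\theta \mid x)}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\;
        \overbrace{\pi(\theta)}^{\text{prior}}}
       {\int p(x \mid \theta')\, \pi(\theta')\, \mathrm{d}\theta'}
```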
43. Frequentist Approach
Neyman construction (confidence-belt figure in the (x, µ) plane not reproduced):
- µ: theoretical parameter
- x: observation
- NO PRIOR
44. Frequentist Approach
An interval µ_l ≤ µ ≤ µ_u at 90% confidence:

- Frequentist: µ_l and µ_u are known but random; µ is unknown but fixed. A probability statement about µ_l and µ_u.
- Bayesian: µ_l and µ_u are known and fixed; µ is unknown and random. A probability (credible) statement about µ.
45. A Method
Method: mixed frequentist-Bayesian.

- The full frequentist method is hard to apply in several dimensions.
- Use Bayesian for the nuisance parameters and frequentist to extract the range.
- Philosophical/aesthetic problems? Cousins and Highland, NIM A320 (1992) 331.
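A numerical sketch of this hybrid idea: compute a frequentist Poisson tail probability, but average it over a Gaussian prior for the uncertain background (all numbers below are hypothetical, and the Monte Carlo average stands in for the integral):

```python
import math
import random

def poisson_tail(n_obs, mu):
    """P(N >= n_obs) for N ~ Poisson(mu); assumes n_obs >= 1."""
    term, cdf = math.exp(-mu), math.exp(-mu)
    for k in range(1, n_obs):
        term *= mu / k
        cdf += term          # cdf now holds P(N <= n_obs - 1)
    return 1.0 - cdf

def hybrid_p_value(n_obs, b0, sigma_b, draws=10000, seed=1):
    """Hybrid p-value in the Cousins-Highland spirit: average the
    frequentist Poisson tail over a (truncated) Gaussian prior for b."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(draws):
        b = rng.gauss(b0, sigma_b)
        if b <= 0:
            continue  # truncate unphysical background values
        total += poisson_tail(n_obs, b)
        used += 1
    return total / used

# Hypothetical counting experiment: 20 observed, background 10 +/- 2.
# The background uncertainty inflates the p-value relative to b = 10 exactly.
print(poisson_tail(20, 10.0), hybrid_p_value(20, 10.0, 2.0))
```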
46. Higgs Saga
P(Data | Theory) ≠ P(Theory | Data)

Is the data consistent with the Standard Model? Or with a Standard Model Higgs?

End of Sept 2000: data not very consistent with the S.M.: Prob(Data | S.M.) < 1%, a valid frequentist statement. Turned by the press into Prob(S.M. | Data) < 1%, and therefore Prob(Higgs | Data) > 99%, i.e. "It is almost certain that the Higgs has been seen."