PHYSTAT05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology

Müge Karagöz Ünel (MKU), Oxford University
Seminar at University College London, 03/11/2006
2. Outline
- Conference Information and History
- Introduction to statistics
- Selection of hot topics
- Available tools
- Astrophysics and cosmology
- Conclusions
3. PHYSTAT History
4. Poster
[conference poster image]
5. Chronology of PHYSTAT05

Where     When        Issues                  Physicists                      Statisticians
CERN      Jan 2000    Limits                  Particles                       3
Fermilab  March 2000  Limits                  Particles + 3 astrophysicists   3
Durham    March 2002  Wider range of topics   Particles + 3 astrophysicists   2
SLAC      Sept 2003   Wider range of topics   Particles, Astro, Cosmo         Many
Oxford    Sept 2005   Wider range of topics   Particles, Astro, Cosmo         Many
6. PHYSTAT05 Programme
- 7 invited talks by statisticians
- 9 invited talks by physicists
- 38 contributed talks
- 8 posters
- Panel discussion
- 3 conference summaries
- 90 participants
7. Invited Talks by Statisticians
- David Cox: Keynote Address (Bayesians, Frequentists, Physicists)
- Steffen Lauritzen: Goodness of Fit
- Jerry Friedman: Machine Learning
- Susan Holmes: Visualisation
- Peter Clifford: Time Series
- Mike Titterington: Deconvolution
- Nancy Reid: Conference Summary (Statistics)
8. Invited Talks by (Astro)Physicists
- Bob Cousins: Nuisance Parameters for Limits
- Kyle Cranmer: LHC Discovery
- Alex Szalay: Astrophysics Terabytes
- Jean-Luc Starck: Multiscale Geometry
- Jim Linnemann: Statistical Software for Particle Physics
- Bob Nichol: Statistical Software for Astrophysics
- Stephen Johnson: Historical Transits of Venus
- Andrew Jaffe: Conference Summary (Astrophysics)
- Gary Feldman: Conference Summary (Particles)
9. Contents of the Proceedings
- Bayes/Frequentist: 5 talks
- Goodness of fit: 5
- Likelihood/parameter estimation: 6
- Nuisance parameters/limits/discovery: 10
- Machine learning: 7
- Software: 8
- Visualisation: 1
- Astrophysics: 5
- Time series: 1
- Deconvolution: 3
10. Statistics in (Astro/Particle) Physics
11. Statistics in (Particle) Physics
- An experiment goes through the following stages:
- Prepare conditions for taking data for a particle X (if theory-driven)
- Record events that might be X and reconstruct the measurables
- Select events that could have X by applying criteria (cuts)
- Generate histograms of variables and ask the questions: is there any evidence for new things, or is the null hypothesis unrefuted? If there is evidence, what are the estimates for the parameters of X? (Confrontation of theory with experiment, or vice versa.)
- The answers can come via your favorite statistical technique (it depends on how you ask the question)
12. (Yet Another) Chronology (from S. Andreon's web page)
- Homo apriorius establishes the probability of a hypothesis, no matter what the data tell.
- Homo pragmaticus establishes that it is interested in the data only.
- Homo frequentistus measures the probability of the data given the hypothesis.
- Homo sapiens measures the probability of the data and of the hypothesis.
- Homo bayesianis measures the probability of the hypothesis, given the data.
13. Bayesian vs Frequentist
We need to make a statement about parameters, given data.

Bayes: 1763. Frequentism: 1937. Both analyse the data x to make a statement about the parameters θ. Both use Prob(x; θ), e.g. an interval quoted at 90%, but with very different interpretations:

- Bayesian: probability of the parameter, given the data
- Frequentist: probability of the data, given the parameter

"Bayesians address the question everyone is interested in, by using assumptions no-one believes. Frequentists use impeccable logic to deal with an issue of no interest to anyone."
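As a toy illustration of "same numbers, very different interpretation" (not from the talk; the measurement values here are hypothetical), consider a Gaussian measurement with known resolution:

```python
import math

# Hypothetical measurement: x0 = 5.2 with known Gaussian resolution sigma = 1.0
x0, sigma = 5.2, 1.0
z90 = 1.645  # central 90% point of the standard Gaussian

# Frequentist: the random interval [x - z*sigma, x + z*sigma] covers the
# true theta in 90% of repeated experiments.
freq = (x0 - z90 * sigma, x0 + z90 * sigma)

# Bayesian with a flat prior on theta: the posterior is Gaussian around x0,
# so the 90% credible interval has the same endpoints, but now it is a
# probability statement about theta itself.
bayes = (x0 - z90 * sigma, x0 + z90 * sigma)

print(freq == bayes)  # True: identical numbers, different meaning
```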
14. Goodness of Fit
- Lauritzen: Invited talk on GoF
- Yabsley: GoF and sparse multi-D data
- Ianni: GoF and sparse multi-D data
- Raja: GoF and likelihood
- Gagunashvili: χ² and weighting
- Pia: Software toolkit for data analysis
- Block: Rejecting outliers
- Bruckman: Alignment
- Blobel: Tracking
15. Goodness of Fit
- We would like to know if a given distribution is of a specified type, to test the validity of a postulated model, etc.
- A few GoF tests are widely used in practice:
- χ² test: the most widely used; a typical application is 1D or 2D fits to data
- G² (the likelihood-ratio statistic) test: the general version of the χ² test (Lauritzen's personal choice)
- Kolmogorov-Smirnov test: robust but prone to mislead; can be used to confirm that, say, two distributions (histograms) are the same by calculating the p-value for the difference hypothesis
- Other new methods, such as Aslan and Zech's energy test, exist
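A minimal sketch of the first of these, the Pearson χ² statistic for binned data (the histogram counts below are made up for illustration):

```python
import math

def chi2_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical histogram: 5 bins, expected counts from a model
expected = [20.0, 30.0, 25.0, 15.0, 10.0]
observed = [18, 33, 26, 12, 11]

chi2 = chi2_statistic(observed, expected)
ndf = len(observed) - 1  # no fitted parameters in this sketch

# The 95% critical value of chi^2 with 4 degrees of freedom is about 9.49,
# so chi2 well below that means no evidence against the model.
print(chi2, ndf, chi2 < 9.49)
```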
16. An Example from ATLAS (Bruckman)
Direct least-squares solution to the silicon tracker alignment problem.

The method consists of minimizing the giant χ² resulting from a simultaneous fit of all particle trajectories and alignment parameters. Using the linear expansion (we assume all second-order derivatives are negligible), the track fit is solved in closed form, while the alignment parameters are given by the resulting linear system.

The systems are large, with inherent computational challenges. Equivalent to the Millepede approach of V. Blobel.
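The slide's formulae (the track-fit and alignment solutions) did not survive extraction. As a generic sketch of such a global least-squares fit, with symbols of my choosing rather than Bruckman's exact notation:

```latex
% Global chi^2: sum over tracks of residuals r, which depend on the
% alignment parameters a and the per-track parameters \pi;
% V is the covariance of the hit measurements.
\chi^2(a, \pi) = \sum_{\mathrm{tracks}} r^{T}(a, \pi)\, V^{-1}\, r(a, \pi)

% Linearising r in (a, \pi), setting the derivatives to zero, and
% eliminating the track parameters leaves one large linear system
% for the alignment corrections \delta a:
\Big( \sum_{\mathrm{tracks}} J^{T} W J \Big)\, \delta a
  = - \sum_{\mathrm{tracks}} J^{T} W\, r_{0}
% with J = \partial r / \partial a, W an effective weight matrix after
% the track fit, and r_0 the residuals at the expansion point.
```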
17. Nuisance Parameters/Limits/Discovery
- Cousins: Limits and nuisance parameters
- Reid: Respondent
- Punzi: Frequentist multi-dimensional ordering rule
- Tegenfeldt: Feldman-Cousins and Cousins-Highland
- Rolke: Limits
- Heinrich: Bayes limits
- Bityukov: Poisson situations
- Hill: Limits vs discovery (see Punzi @ PHYSTAT2003)
- Cranmer: LHC discovery and nuisance parameters
18. Systematics
Note: systematic errors (HEP) ↔ nuisance parameters (statisticians).

An example: the physics parameter is extracted from the observed data together with auxiliary quantities that we need to know, probably from other measurements (and/or theory). Their uncertainties propagate into the error on the physics parameter, just as for statistical errors; some of them are arguably statistical errors.
19. Nuisance Parameters
- Nuisance parameters are parameters with unknown true values. They may be:
- statistical, such as the number of background events in a sideband used for estimating the background under a peak
- systematic, such as the shape of the background under the peak, or the error caused by the uncertainty of the hadronic fragmentation model in the Monte Carlo
- Most experiments have a large number of systematic uncertainties.
- If the experimenter is blind to these uncertainties, they become a bigger nuisance!
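The sideband example above can be sketched numerically: profile out the background nuisance parameter b in a toy Poisson counting experiment (all numbers hypothetical; a grid scan stands in for a proper minimiser):

```python
import math

def nll(s, b, n, m, tau):
    """Negative log-likelihood (constants dropped) for a counting experiment
    with a sideband: n ~ Poisson(s + b), m ~ Poisson(tau * b)."""
    return (s + b) - n * math.log(s + b) + tau * b - m * math.log(tau * b)

def profile_nll(s, n, m, tau, bmax=30.0, steps=3000):
    """Profile out the nuisance parameter b by minimising over a grid."""
    return min(nll(s, b, n, m, tau)
               for b in (bmax * (i + 1) / steps for i in range(steps)))

# Hypothetical numbers: 15 events in the signal region, 20 in a sideband
# twice as wide (tau = 2), so the background estimate is b-hat ~ 10.
n, m, tau = 15, 20, 2.0
best_s = min((profile_nll(s, n, m, tau), s)
             for s in (0.1 * i for i in range(1, 151)))[1]
print(best_s)  # close to the analytic MLE, s-hat = n - m/tau = 5
```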
20. Issues with the LHC
- The LHC will collide bunches 40 million times/sec and collect petabytes of data. pp collisions at 14 TeV will generate events much more complicated than at LEP or the Tevatron.
- Kyle Cranmer has pointed out that systematic issues will be even more important at the LHC.
- If the statistical error is O(1) and the systematic error is O(0.1), it does not much matter how you treat the latter.
- However, at the LHC we may have processes with 100 background events and 10% systematic errors; this is not negligible.
- Even more critical, we want 5σ for a discovery level.
21. Why 5σ? (Feldman, Cranmer)
- LHC searches: 500 searches, each of which has 100 resolution elements (mass, angle bins, ...): 5 × 10^4 chances to find something.
- One experiment: false positive rate at 5σ ≈ (5 × 10^4)(3 × 10^-7) ≈ 0.015. OK.
- Two experiments:
- Assume an allowable false positive rate of 10.
- 2 (5 × 10^4)(1 × 10^-4) = 10 → 3.7σ required.
- Require verification by the other experiment, assume an allowable rate of 0.01: (1 × 10^-3)(10) = 0.01 → 3.1σ required.
- Caveats: Is the significance real? Are there common systematic errors?
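The arithmetic on this slide follows from the one-sided Gaussian tail probability, which can be checked directly:

```python
import math

def one_sided_p(z):
    """One-sided Gaussian tail probability of a fluctuation >= z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2))

trials = 500 * 100            # 500 searches x 100 resolution elements
p5 = one_sided_p(5.0)         # ~2.9e-7, the slide's 3e-7
print(trials * p5)            # ~0.014 expected false positives: OK

# With two experiments and 10 allowed false positives, the required
# per-element p-value is 10 / (2 * trials) = 1e-4, i.e. about 3.7 sigma:
print(one_sided_p(3.7))       # ~1.1e-4
```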
22. Confidence Intervals
- Various techniques were discussed during the conference. Most concerns were summarized by Feldman.
- Bayesian: a good method, but Heinrich showed that flat priors in multi-D may lead to undesirable results (undercoverage).
- Frequentist-Bayesian hybrids: Bayesian for the priors and frequentist to extract the range. Cranmer considered this for the LHC (it was also used in Higgs searches).
- Profile likelihood: shown by Punzi to have issues when the distribution is Poisson-like.
- Full Neyman construction: Cranmer and Punzi attempted this, but it is not feasible for a large number of nuisance parameters.
- The Banff workshop of this summer was found useful in comparing various methods. The real suggestions for the LHC will likely come from the 2007 workshop on LHC issues.
23. Event Classification
- The problem: given a measurement of an event X, find F(X) which returns 1 if the event is signal (s) and 0 if the event is background (b), so as to optimize a figure of merit, say, s/√b for discovery and s/√(s+b) for an established signal.
- Theoretical solution: use MC to calculate the likelihood ratio L_s(X)/L_b(X) and derive F(X) from it. Unfortunately, this does not work: in a high-dimensional space, even the largest data set is sparse. (Feldman)
- In recent years, physicists have turned to machine learning: give the computer samples of s and b events and let the computer figure out what F(X) is.
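The two figures of merit mentioned above are simple to evaluate for any candidate cut (the event counts below are hypothetical):

```python
import math

def discovery_fom(s, b):
    """Approximate discovery figure of merit, s / sqrt(b)."""
    return s / math.sqrt(b)

def measurement_fom(s, b):
    """Figure of merit s / sqrt(s + b) for an established signal."""
    return s / math.sqrt(s + b)

# Hypothetical cut: 25 signal events over 100 background events
print(discovery_fom(25, 100))    # 2.5
print(measurement_fom(25, 100))  # ~2.24
```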
24. Multivariate Analysis
- Friedman: Machine learning
- Prosper: Respondent
- Narsky: Bagging
- Roe: Boosting (MiniBooNE)
- Gray: Bayes optimal classification
- Bhat: Bayesian networks
- Sarda: Signal enhancement
25. Multivariates and Machine Learning
- Various methods exist to classify, train and test events.
- Artificial neural networks (ANN): currently the most widely used (examples from Prosper, ...)
- Decision trees: a differentiating variable is used to separate the sample into branches until a leaf with a preset number of signal and background events is found.
- Trees with rules: combining a series of trees to increase the power of a single decision tree (Friedman)
- Bagging (Bootstrap AGGregatING) trees: build a collection of trees by selecting samples of the training data (Narsky)
- Boosted trees: a robust method that gives misclassified events in one tree a higher weight in the generation of a new tree
- Comparisons of significance were performed, but not all of them were controlled experiments, so conclusions may be deceptive until further tests.
26. Example: Boosted Decision Trees (Roe)
- A nice example from MiniBooNE:
- Create M trees and take the final score for signal and background as a weighted sum of the individual trees.
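The "weighted sum of individual trees" idea can be sketched with a minimal AdaBoost over one-variable decision stumps; this is an illustration of the general technique, not MiniBooNE's actual algorithm, and the toy data are made up:

```python
import math

def stump(threshold):
    """A one-variable decision stump: +1 (signal) above threshold, -1 below."""
    return lambda x: 1 if x > threshold else -1

def adaboost(data, labels, thresholds, rounds):
    """Minimal AdaBoost: the final score is the weighted sum of the
    individual trees (here, stumps), as on the slide."""
    n = len(data)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, stump) pairs
    for _ in range(rounds):
        # pick the stump with the smallest weighted error
        best = min(thresholds,
                   key=lambda t: sum(wi for wi, x, y in zip(w, data, labels)
                                     if stump(t)(x) != y))
        h = stump(best)
        err = sum(wi for wi, x, y in zip(w, data, labels) if h(x) != y)
        err = min(max(err, 1e-9), 1 - 1e-9)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # reweight: misclassified events get a higher weight in the next round
        w = [wi * math.exp(-alpha * y * h(x))
             for wi, x, y in zip(w, data, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda x: sum(a * h(x) for a, h in ensemble)

# Hypothetical 1-D toy: signal clusters at high x, background at low x
data   = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]
score = adaboost(data, labels, thresholds=[0.25, 0.5, 0.75], rounds=3)
print(score(0.9) > 0, score(0.1) > 0)  # True False
```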
27. Punzi Effect (Getting the Likelihood Wrong)
- Giovanni Punzi @ PHYSTAT2003: comments on likelihood fits with variable resolution.
- Separating two close signals (A and B) when the resolution σ varies event by event and is different for the two signals.
- e.g. for a mass M, different numbers of tracks → different σ_M
- Avoiding the Punzi bias:
- Include p(σ_A) and p(σ_B) in the fit, OR
- Fit each range of σ_i separately, and add (N_A)_i → (N_A)_total, and similarly for B
- Beware of event-by-event variables and construct likelihoods accordingly
- (Talk by Catastini)
28. Blind Analyses
- Potential problem: experimenter's bias
- Original suggestion: Luis Alvarez
- Methods of blinding:
- Keep the signal-region box closed
- Add random numbers to the data
- Keep the Monte Carlo parameters blind
- Use part of the data to define the procedure
- A number of analyses in experiments are doing blind searches
- In general, don't modify the result after unblinding.
- Question: will the LHC experiments choose to be blind? In which analyses?
29. Astrophysics and Cosmology Highlights

30. Astro/Cosmo General Issues
"There is only one universe, and some experiments can never be rerun." A. Jaffe (concluding talk) → astro/cosmo tend to be more Bayesian, by nature.

- Virtual Observatories: all astro data available from the desktop
- Data volume growth doubling every year; most data are on the web (Szalay)
- Bad: computing and storage issues
- Good (?): systematic errors more significant than statistical errors
- Nichol discussed using grid techniques.
31. Astrophysics: Various Hot Points
- Flat priors have been used commonly, but are dangerous (Cox, Le Diberder, Cousins): would Ω be the best quantity to use, or is it Ωh²?
- Issues with the non-Gaussian distribution of noise being taken into account in the spectrum: a few methods discussed by Starck, Digel, ...
- Blind analyses are rare (not so good at a priori modeling!)
- Lots of good software in astrophysics, and repositories more advanced than in PP.
- Jaffe's talk has a nice usage of the CMB as a case study for statistical methods in astrophysics, starting from first principles of Bayesian inference.
32. Software and Available Tools
33. Talks Given on Software
- Linnemann: Software for particles
- Nichol: Software for astro (and Grid)
- Le Diberder: sPlot
- Paterno: R
- Kreschuk: ROOT
- Verkerke: RooFit
- Pia: Goodness of fit
- Buckley: CEDAR
- Narsky: StatPatternRecognition
34. Available Tools
- A number of good software packages have become more and more available (good news for the LHC!)
- PP and astro use somewhat different software (e.g. IDL and IRAF in astro)
- 2004 Phystat workshop at MSU on statistical software (mainly on R and ROOT) by Linnemann
- Statisticians have a repository of standard source codes (StatLib): http://lib.stat.cmu.edu/
- One good output of the conference was a recommendation of a Statistical Software Repository at FNAL
- Linnemann has a web page of collections: http://www.pa.msu.edu/people/linnemann/stat_resources.html
35. CDF Statistics Committee Resources
- Documentation about statistics and a repository: http://www-cdf.fnal.gov/physics/statistics/statistics_home.html
36. Sample Repository Page
37. CEDAR and CEPA
38. Summary and Conclusions
- Very useful physicist/statistician interaction:
- e.g. confidence intervals with nuisance parameters,
- multivariate techniques, etc.
- Lots of things learnt from:
- ourselves (by having to present our own stuff!)
- each other (various different approaches)
- statisticians (updates on techniques)
- A step towards common tools / software repositories: http://www.phystat.org (Linnemann)
- Programme, transparencies, papers, etc.: http://www.physics.ox.ac.uk/phystat05 (with useful links such as recommended readings)
- Proceedings published by Imperial College Press (Spring '06)
39. What is Next?
- A few workshops/schools have taken place since October 2005:
- e.g. Manchester (Nov 2005), SAMSI Duke (April 2006), Banff (July 2006), Spanish Summer School (July 2006)
- No PHYSTAT conference in summer 2007
- ATLAS Workshop on Statistical Methods, 18-19 Jan 2007
- PHYSTAT Workshop at CERN, 27-29 June 2007, on statistical issues for LHC physics analyses
- (Both workshops will likely aim at discovery significance. Please attend!)
- Suggestions/enquiries to l.lyons@physics.ox.ac.uk

- The LHC will take data soon. We do not wish to say: "The experiment was inconclusive, so we had to use statistics" (inside cover of the Good Book by L. Lyons)
- Rather, say: "We used statistics, and so we are sure that we've discovered X (well, with some confidence level!)"
40. Some Final Notes
- I tried to give you a collage of PHYSTAT05 topics.
- My deepest thanks to Louis for giving me the chance and introducing me to the PHYSTAT experience!
- Apologies for those talks I have not been able to cover.
- Thank you for the invitation!
41. Backup
- Bayes
- Frequentist
- Cousins-Highland
- Higgs Saga at CERN
42. Bayesian Approach
Bayes' Theorem: posterior = likelihood × prior (up to normalisation)

Problems:
- P(param): true or false? It is a degree of belief.
- Prior: what functional form? Flat? In which variable?
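The slide's formula did not survive extraction; the standard statement of Bayes' theorem for a parameter θ and data x, with the labels the slide uses, is:

```latex
\underbrace{p(\theta \mid x)}_{\text{posterior}}
  \;=\;
  \frac{\overbrace{p(x \mid \theta)}^{\text{likelihood}}\;
        \overbrace{\pi(\theta)}^{\text{prior}}}
       {\int p(x \mid \theta')\, \pi(\theta')\, \mathrm{d}\theta'}
```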
43. Frequentist Approach
Neyman construction (confidence-belt figure in the (x, µ) plane not reproduced):
- µ: theoretical parameter
- x: observation
- NO PRIOR
44. Frequentist Approach
An interval µ_l ≤ µ ≤ µ_u at 90% confidence:

- Frequentist: µ_l and µ_u are known but random; µ is unknown but fixed. A probability statement about µ_l and µ_u.
- Bayesian: µ_l and µ_u are known and fixed; µ is unknown and random. A probability (credible) statement about µ.
45. A Method
Method: mixed frequentist-Bayesian.

- The full frequentist method is hard to apply in several dimensions.
- Use Bayesian for the nuisance parameters and frequentist to extract the range.
- Philosophical/aesthetic problems? Cousins and Highland, NIM A320 (1992) 331.
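A numerical sketch of this hybrid idea: compute a frequentist Poisson tail probability, but average it over a Gaussian prior for the uncertain background (all numbers below are hypothetical, and the Monte Carlo average stands in for the integral):

```python
import math
import random

def poisson_tail(n_obs, mu):
    """P(N >= n_obs) for N ~ Poisson(mu); assumes n_obs >= 1."""
    term, cdf = math.exp(-mu), math.exp(-mu)
    for k in range(1, n_obs):
        term *= mu / k
        cdf += term          # cdf now holds P(N <= n_obs - 1)
    return 1.0 - cdf

def hybrid_p_value(n_obs, b0, sigma_b, draws=10000, seed=1):
    """Hybrid p-value in the Cousins-Highland spirit: average the
    frequentist Poisson tail over a (truncated) Gaussian prior for b."""
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(draws):
        b = rng.gauss(b0, sigma_b)
        if b <= 0:
            continue  # truncate unphysical background values
        total += poisson_tail(n_obs, b)
        used += 1
    return total / used

# Hypothetical counting experiment: 20 observed, background 10 +/- 2.
# The background uncertainty inflates the p-value relative to b = 10 exactly.
print(poisson_tail(20, 10.0), hybrid_p_value(20, 10.0, 2.0))
```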
46. Higgs Saga
P(Data | Theory) ≠ P(Theory | Data)

Is the data consistent with the Standard Model? Or with a Standard Model Higgs?

End of Sept 2000: data not very consistent with the S.M.: Prob(Data | S.M.) < 1%, a valid frequentist statement. Turned by the press into Prob(S.M. | Data) < 1%, and therefore Prob(Higgs | Data) > 99%, i.e. "It is almost certain that the Higgs has been seen."