PHYSTAT05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: PHYSTAT05 Highlights: Statistical Problems in Particle Physics, Astrophysics and Cosmology


1
PHYSTAT05 Highlights: Statistical Problems in
Particle Physics, Astrophysics and Cosmology
Müge Karagöz Ünel, Oxford University
  • University College London
  • 03/11/2006

2
Outline
  • Conference Information and History
  • Introduction to statistics
  • Selection of hot topics
  • Available tools
  • Astrophysics and cosmology
  • Conclusions

3
PHYSTAT History
4
Poster
5
Chronology of PHYSTAT05
Where:         CERN      | Fermilab   | Durham     | SLAC      | Oxford
When:          Jan 2000  | March 2000 | March 2002 | Sept 2003 | Sept 2005
Issues:        Limits    | Limits     | Wider range of topics | Wider range of topics | Wider range of topics
Physicists:    Particles | Particles + 3 astrophysicists | Particles + 3 astrophysicists | Particles, astro, cosmo | Particles, astro, cosmo
Statisticians: 3         | 3          | 2          | Many      | Many
6
PHYSTAT05 Programme
  • 7 invited talks by statisticians
  • 9 invited talks by physicists
  • 38 contributed talks
  • 8 posters
  • Panel discussion
  • 3 conference summaries
  • 90 participants
7
Invited Talks by Statisticians
  • David Cox: Keynote address (Bayesians, Frequentists and Physicists)
  • Steffen Lauritzen: Goodness of Fit
  • Jerry Friedman: Machine Learning
  • Susan Holmes: Visualisation
  • Peter Clifford: Time Series
  • Mike Titterington: Deconvolution
  • Nancy Reid: Conference Summary (Statistics)
8
Invited Talks by (Astro)Physicists
  • Bob Cousins: Nuisance Parameters for Limits
  • Kyle Cranmer: LHC Discovery
  • Alex Szalay: Astrophysics Terabytes
  • Jean-Luc Starck: Multiscale Geometry
  • Jim Linnemann: Statistical Software for Particle Physics
  • Bob Nichol: Statistical Software for Astrophysics
  • Stephen Johnson: Historical Transits of Venus
  • Andrew Jaffe: Conference Summary (Astrophysics)
  • Gary Feldman: Conference Summary (Particles)
9
Contents of the Proceedings
  • Bayes/Frequentist: 5 talks
  • Goodness of Fit: 5
  • Likelihood/parameter estimation: 6
  • Nuisance parameters/limits/discovery: 10
  • Machine learning: 7
  • Software: 8
  • Visualisation: 1
  • Astrophysics: 5
  • Time series: 1
  • Deconvolution: 3
10
Statistics in (Astro/Particle) Physics
11
Statistics in (Particle) Physics
  • An experiment goes through the following stages:
  • Prepare the conditions for taking data for a particle
    X (if theory-driven)
  • Record events that might be X and reconstruct the
    measurables
  • Select events that could be X by applying
    criteria (cuts)
  • Generate histograms of variables and ask the
    questions:
  • Is there any evidence for new things, or is the
    null hypothesis unrefuted? If there is evidence,
    what are the estimates for the parameters of X?
    (Confrontation of theory with experiment, or vice versa.)
  • The answers can come via your favourite
    statistical technique (it depends on how you ask the
    question).

12
(yet another) Chronology (from S. Andreon's web page)
  • Homo apriorius establishes the probability of a
    hypothesis, no matter what the data tell.
  • Homo pragmaticus establishes that it is
    interested in the data only.
  • Homo frequentistus measures the probability of the
    data given the hypothesis.
  • Homo sapiens measures the probability of the data and
    of the hypothesis.
  • Homo bayesianis measures the probability of the
    hypothesis, given the data.

13
Bayesian vs Frequentist
We need to make a statement about Parameters,
given Data
Bayes 1763; Frequentism 1937. Both analyse data (x) → a
statement about parameters (θ). Both use Prob(x; θ),
e.g. Prob(θ_l ≤ θ ≤ θ_u) = 90%, but with very different
interpretations:
Bayesian: probability (parameter, given data).
Frequentist: probability (data, given parameter).
Bayesians address the question everyone is
interested in, by using assumptions no-one
believes
Frequentists use impeccable logic to deal with
an issue of no interest to anyone
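To make the quips concrete: a minimal Python sketch (my own illustration, not from the talk) of a single Gaussian measurement with known σ. The 90% frequentist interval and the 90% flat-prior credible interval coincide numerically here; only the interpretation differs. All numbers are invented.

    from scipy import stats

    x_obs, sigma = 3.0, 1.0    # hypothetical single measurement
    z = stats.norm.ppf(0.95)   # 1.645 for a central 90% interval

    # Frequentist: the random interval [x - z*sigma, x + z*sigma]
    # covers the fixed true theta in 90% of repeated experiments.
    print(x_obs - z * sigma, x_obs + z * sigma)

    # Bayesian, flat prior on theta: the posterior is N(x_obs, sigma^2),
    # so the 90% credible interval happens to be numerically identical.
    print(stats.norm.interval(0.90, loc=x_obs, scale=sigma))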
14
Goodness of Fit
  • Lauritzen: Invited talk on GoF
  • Yabsley: GoF and sparse multi-D data
  • Ianni: GoF and sparse multi-D data
  • Raja: GoF and likelihood
  • Gagunashvili: χ² and weighting
  • Pia: Software toolkit for data analysis
  • Block: Rejecting outliers
  • Bruckman: Alignment
  • Blobel: Tracking
15
Goodness of Fit
  • We would like to know if a given distribution is
    of a specified type, to test the validity of a
    postulated model, etc.
  • A few GoF tests are widely used in practice:
  • χ² test: the most widely used; the typical
    application is 1- or 2-D fits to data
  • G² (the likelihood-ratio statistic) test: the
    general version of the χ² test (Lauritzen's personal
    choice)
  • Kolmogorov-Smirnov test: robust but prone to
    mislead; can be used to check that, say, two
    distributions (histograms) are compatible by
    calculating a p-value under the hypothesis that they
    are draws from the same distribution.
  • Other, newer methods exist, such as Aslan-Zech's
    energy test. (A toy sketch of the χ² and KS tests
    follows below.)

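A toy sketch of the two tests above, assuming SciPy is available; the data and binning are invented for illustration.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    data = rng.uniform(size=1000)   # toy data; H0: uniform on [0, 1]

    # Chi-square test: binned counts against equal expected occupancy
    observed, _ = np.histogram(data, bins=10, range=(0.0, 1.0))
    expected = np.full(10, len(data) / 10)
    chi2 = np.sum((observed - expected) ** 2 / expected)
    ndf = 10 - 1
    print("chi2/ndf =", chi2 / ndf, " p =", stats.chi2.sf(chi2, ndf))

    # Kolmogorov-Smirnov test on the unbinned data: robust, but weak in
    # the tails and strictly valid only for a fully specified hypothesis
    D, p = stats.kstest(data, "uniform")
    print("KS D =", D, " p =", p)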
16
An example from ATLAS (Bruckman)
Direct least-squares solution to the silicon tracker
alignment problem: the method consists of minimizing the
giant χ² resulting from a simultaneous fit of all particle
trajectories and alignment parameters.
Using the linear expansion (all second-order derivatives
assumed negligible), the track fit is solved by the usual
least-squares normal equations, while the alignment
parameters follow from the corresponding global normal
equations (the explicit matrix formulae on the slide are
not reproduced here; a toy sketch follows below).
The systems are inherently large: computational challenges.
Equivalent to Millepede approach from V. Blobel
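The matrix formulae themselves did not survive into this transcript, but the idea can be sketched with a toy linear model: the measurements depend linearly on local track parameters and on global alignment parameters, and one big least-squares system is solved for both at once. All names, sizes and noise levels below are invented.

    import numpy as np

    rng = np.random.default_rng(2)
    n_meas, n_track, n_align = 200, 4, 3
    T = rng.normal(size=(n_meas, n_track))   # local (track) derivatives
    A = rng.normal(size=(n_meas, n_align))   # global (alignment) derivatives
    t_true = np.array([1.0, -0.5, 0.3, 2.0])
    a_true = np.array([0.02, -0.01, 0.005])
    sigma = 0.1
    m = T @ t_true + A @ a_true + rng.normal(scale=sigma, size=n_meas)

    # One big chi^2 minimisation over all parameters at once; real
    # trackers exploit the block structure, a la Millepede.
    D = np.hstack([T, A])
    params, *_ = np.linalg.lstsq(D / sigma, m / sigma, rcond=None)
    print("track params:", params[:n_track])
    print("alignment   :", params[n_track:])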
17
Nuisance Parameters/Limits/Discovery
  • Cousins: Limits and nuisance parameters
  • Reid: Respondent
  • Punzi: Frequentist multi-dimensional ordering rule
  • Tegenfeldt: Feldman-Cousins with Cousins-Highland
  • Rolke: Limits
  • Heinrich: Bayes limits
  • Bityukov: Poisson situations
  • Hill: Limits vs discovery (see Punzi @ PHYSTAT2003)
  • Cranmer: LHC discovery and nuisance parameters

18
Systematics
Note: systematic errors (HEP) ↔ nuisance parameters (statisticians).
An example: a physics parameter is extracted from an
observed count together with auxiliary quantities
(efficiency, luminosity, background) that we need to know,
probably from other measurements (and/or theory). Their
uncertainties propagate into the error on the physics
parameter, on top of the statistical error on the observed
count; some of them are arguably statistical errors. (The
slide's explicit formula is not reproduced here; a standard
stand-in follows below.)
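A standard stand-in for the slide's example, assuming the usual counting formula σ = (N − b)/(ε·L); all numbers are invented, and the background uncertainty is ignored for brevity.

    import numpy as np

    N, b = 150.0, 50.0        # observed events, expected background
    eps, d_eps = 0.80, 0.04   # efficiency, e.g. from control samples
    L, d_L = 100.0, 5.0       # luminosity, from another measurement

    xsec = (N - b) / (eps * L)

    d_stat = np.sqrt(N) / (eps * L)                 # Poisson part
    d_syst = xsec * np.hypot(d_eps / eps, d_L / L)  # nuisance part
    print(f"xsec = {xsec:.3f} +/- {d_stat:.3f} (stat) "
          f"+/- {d_syst:.3f} (syst)")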
19
Nuisance Parameters
  • Nuisance parameters are parameters with unknown
    true values. They may be:
  • statistical, such as the number of background events
    in a sideband used for estimating the background
    under a peak;
  • systematic, such as the shape of the background
    under the peak, or the error caused by the
    uncertainty of the hadronic fragmentation model
    in the Monte Carlo.
  • Most experiments have a large number of
    systematic uncertainties.
  • If the experimenter is blind to these
    uncertainties, they become a bigger nuisance!

20
Issues with LHC
  • The LHC will collide bunches 40 million times per
    second and collect petabytes of data. pp collisions
    at 14 TeV will generate events much more complicated
    than at LEP or the Tevatron.
  • Kyle Cranmer has pointed out that systematic
    issues will be even more important at the LHC.
  • If the statistical error is O(1) and the systematic
    error is O(0.1), it does not much matter how you
    treat the latter.
  • However, at the LHC we may have processes with
    100 background events and 10% systematic errors;
    this is not negligible.
  • Even more critical: we want 5σ as the discovery
    level.

21
Why 5σ? (Feldman, Cranmer)
  • LHC searches: 500 searches, each of which has 100
    resolution elements (mass, angle bins, ...) → 5 × 10⁴
    chances to find something.
  • One experiment: false positive rate at 5σ ≈
    (5 × 10⁴)(3 × 10⁻⁷) = 0.015. OK.
  • Two experiments:
  • Assume an allowable rate of about 10 candidate
    fluctuations before verification:
    2 × (5 × 10⁴) × (1 × 10⁻⁴) = 10 → 3.7σ required.
  • Require verification by the other experiment;
    assume an allowable rate of 0.01:
    (1 × 10⁻³)(10) = 0.01 → 3.1σ required.
  • Caveats: Is the significance real? Are there
    common systematic errors?
    (The arithmetic is checked in the sketch below.)

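The back-of-envelope numbers can be checked with one-sided Gaussian tail probabilities (SciPy assumed):

    from scipy.stats import norm

    chances = 500 * 100                # ~5e4 places to find something
    print(chances * norm.sf(5.0))      # one experiment at 5 sigma: ~0.014
    print(2 * chances * norm.sf(3.7))  # ~10 candidate signals at 3.7 sigma
    print(10 * norm.sf(3.1))           # ~0.01 after 3.1 sigma verification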
22
Confidence Intervals
  • Various techniques were discussed during the
    conference. Most concerns were summarized by Feldman.
  • Bayesian: a good method, but Heinrich showed that
    flat priors in multi-D may lead to undesirable
    results (undercoverage).
  • Frequentist-Bayesian hybrids: Bayesian for the priors
    and frequentist to extract the range. Cranmer
    considered this for the LHC (it was also used in
    Higgs searches).
  • Profile likelihood: shown by Punzi to have issues
    when the distribution is Poisson-like. (A minimal
    sketch of profiling follows below.)
  • Full Neyman construction: Cranmer and Punzi
    attempted this, but it is not feasible for a large
    number of nuisance parameters.
  • The Banff workshop of this summer was found useful
    in comparing the various methods. The real suggestions
    for the LHC will likely come from the 2007 workshop
    on LHC issues.

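A minimal profile-likelihood sketch, assuming the common toy model n ~ Poisson(s + b) with a sideband m ~ Poisson(τb) constraining the background; invented numbers, not any speaker's actual code.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import poisson

    n, m, tau = 25, 40, 2.0   # main count, sideband count, sideband ratio

    def nll(s, b):
        return -(poisson.logpmf(n, s + b) + poisson.logpmf(m, tau * b))

    def profiled_nll(s):
        # minimise over the nuisance parameter b at fixed s
        res = minimize_scalar(lambda b: nll(s, b),
                              bounds=(1e-6, 100.0), method="bounded")
        return res.fun

    scan = np.linspace(0.01, 30.0, 300)
    curve = np.array([profiled_nll(s) for s in scan])
    curve -= curve.min()
    inside = scan[curve < 0.5]   # ~68% interval from delta(NLL) = 0.5
    print("s_hat ~", scan[curve.argmin()],
          " interval ~", inside[0], "to", inside[-1])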
23
Event Classification
  • The problem: given a measurement of an event X,
    find F(X) which returns 1 if the event is signal
    (s) and 0 if the event is background (b), so as
    to optimize a figure of merit, say s/√b for
    discovery and s/√(s+b) for an established signal.
  • Theoretical solution: use MC to calculate the
    likelihood ratio L_s(X)/L_b(X) and derive F(X) from
    it. Unfortunately, this does not work, as in a
    high-dimensional space even the largest data set
    is sparse. (Feldman)
  • In recent years, physicists have turned to
    machine learning: give the computer samples of s
    and b events and let the computer figure out what
    F(X) is. (A toy sketch follows below.)

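A toy sketch of the problem: with known 1-D densities the likelihood ratio is monotone in X here, so scanning a cut on X and maximizing s/√b is equivalent to cutting on the ratio. Yields and shapes are invented.

    import numpy as np
    from scipy.stats import norm

    S_tot, B_tot = 100.0, 10000.0   # expected yields before selection
    sig = norm(loc=2.0, scale=1.0)  # signal density in X
    bkg = norm(loc=0.0, scale=1.0)  # background density in X

    cuts = np.linspace(-2.0, 6.0, 400)   # candidate cuts X > c
    s = S_tot * sig.sf(cuts)
    b = B_tot * bkg.sf(cuts)
    fom = s / np.sqrt(b)
    best = fom.argmax()
    print("best cut: X >", cuts[best], " s =", s[best],
          " b =", b[best], " s/sqrt(b) =", fom[best])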
24
Multivariate Analysis
  • Friedman: Machine learning
  • Prosper: Respondent
  • Narsky: Bagging
  • Roe: Boosting (MiniBooNE)
  • Gray: Bayes-optimal classification
  • Bhat: Bayesian networks
  • Sarda: Signal enhancement
25
Multivariates and Machine Learning
  • Various methods exist to classify events and to
    train and test on them.
  • Artificial neural networks (ANN): currently the
    most widely used (examples from Prosper, ...)
  • Decision trees: the most discriminating variable is
    used to split the sample into branches until a leaf
    with a preset number of signal and background
    events is found.
  • Trees with rules: combining a series of trees to
    increase the power of a single decision tree
    (Friedman)
  • Bagging (Bootstrap AGGregatING) trees: build a
    collection of trees, each trained on a bootstrap
    sample of the training data (Narsky); see the
    sketch after this list.
  • Boosted trees: a robust method that gives events
    misclassified by one tree a higher weight in the
    generation of the next tree.
  • Comparisons of significance were performed, but
    not all of them were controlled experiments, so
    conclusions may be deceptive until further tests.

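A sketch of bagging in the spirit of the bullet above, using scikit-learn decision trees on invented Gaussian toy data (both the library and the data are stand-ins, not tools from the talk).

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(0.0, 1.0, (500, 2)),    # background
                   rng.normal(1.5, 1.0, (500, 2))])   # signal
    y = np.array([0] * 500 + [1] * 500)

    # Build each tree on a bootstrap replica of the training sample
    trees = []
    for _ in range(25):
        idx = rng.integers(0, len(X), len(X))
        trees.append(DecisionTreeClassifier(max_depth=4).fit(X[idx], y[idx]))

    def bagged_score(x):
        # average vote of the ensemble, in [0, 1]
        return np.mean([t.predict(x) for t in trees], axis=0)

    print(bagged_score(np.array([[1.5, 1.5], [0.0, 0.0]])))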
26
Ex Boosted Decision Trees (Roe)
  • A nice example from MiniBooNE
  • Create M trees and take the final score for
    signal vs background as a weighted sum over the
    individual trees. (See the sketch below.)

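The "weighted sum of individual trees" in code, with scikit-learn's AdaBoost as a stand-in for MiniBooNE's own implementation: its decision_function returns the (normalised) weighted vote Σ_m α_m T_m(x).

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(4)
    X = np.vstack([rng.normal(0.0, 1.0, (500, 2)),    # background
                   rng.normal(1.5, 1.0, (500, 2))])   # signal
    y = np.array([0] * 500 + [1] * 500)

    bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=3),
                             n_estimators=100)
    bdt.fit(X, y)
    # > 0 leans signal, < 0 leans background
    print(bdt.decision_function(X[:5]))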
27
Punzi effect (getting L wrong)
  • Giovanni Punzi @ PHYSTAT2003
  • Comments on likelihood fits with variable resolution
  • Separate two close signals (A and B) when the
    resolution σ varies event by event and is
    different for the two signals
  • e.g. a mass M: different numbers of tracks →
    different σ_M
  • Avoiding the Punzi bias:
  • include p(σ_A) and p(σ_B) in the fit, OR
  • fit each range of σ_i separately, and add (N_A)_i →
    (N_A)_total, and similarly for B
  • Beware of event-by-event variables and construct
    likelihoods accordingly. (A hedged rendering of the
    corrected likelihood follows below.)
  • (Talk by Catastini)

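A hedged rendering of the corrected per-event likelihood (notation mine): omitting the p(σ) terms when they differ between A and B is what biases the fit.

    \mathcal{L} = \prod_i \Big[ f_A \, G(m_i; m_A, \sigma_i) \, p_A(\sigma_i)
                  + (1 - f_A) \, G(m_i; m_B, \sigma_i) \, p_B(\sigma_i) \Big]

Here G is the Gaussian resolution model, f_A the signal-A fraction, and p_A, p_B the (different) densities of the per-event resolution σ_i.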
28
Blind Analyses
  • Potential problem: experimenter's bias
  • Original suggestion: Luis Alvarez
  • Methods of blinding:
  • keep the signal-region box closed
  • add random numbers to the data
  • keep the Monte Carlo parameters blind
  • use part of the data to define the procedure
  • A number of analyses in experiments are doing blind
    searches
  • Don't modify the result after unblinding, in
    general.
  • Question: will the LHC experiments choose to be
    blind? In which analyses?

29
Astrophysics & Cosmology Highlights
30
Astro/Cosmo General Issues
"There is only one universe, and some experiments can
never be rerun" (A. Jaffe, concluding talk) → astro/cosmo
tend to be more Bayesian, by nature.
  • Virtual Observatories: all astro data available
    from the desktop
  • Data volume growth: doubling every year; most data
    are on the web (Szalay)
  • Bad: computing and storage issues
  • Good (?): systematic errors are more significant
    than statistical errors
  • Nichol discussed using grid techniques.

31
Astrophysics Various Hot Points
  • Flat priors have been used commonly, but are
    dangerous (Cox, Le Diberder, Cousins): would Ω be
    the best quantity to use, or is it Ωh²?
  • Issues with the non-Gaussian distribution of noise
    taken into account in the spectrum: a few methods
    discussed by Starck, Digel, ...
  • Blind analyses are rare (not so good for a priori
    modeling!)
  • Lots of good software in astrophysics, and
    repositories more advanced than in PP.
  • Jaffe's talk has a nice usage of the CMB as a case
    study for statistical methods in astrophysics,
    starting from first principles of Bayesian inference.

32
Software and Available Tools
33
Talks Given on Software
  • Linnemann: Software for particle physics
  • Nichol: Software for astrophysics (and grid)
  • Le Diberder: sPlot
  • Paterno: R
  • Kreschuk: ROOT
  • Verkerke: RooFit
  • Pia: Goodness of Fit
  • Buckley: CEDAR
  • Narsky: StatPatternRecognition

34
Available Tools
  • A good deal of quality software has become more
    and more available (good news for the LHC!)
  • PP and astro use somewhat different software
    (e.g. IDL and IRAF in astro)
  • 2004 PHYSTAT workshop at MSU on statistical
    software (mainly on R and ROOT), by Linnemann
  • Statisticians have a repository of standard
    source code (StatLib): http://lib.stat.cmu.edu/
  • One good outcome of the conference was a
    recommendation for a Statistical Software Repository
    at FNAL
  • Linnemann has a web page of collections:
    http://www.pa.msu.edu/people/linnemann/stat_resources.html

35
CDF Statistics Committee resources
  • Documentation about statistics and a repository:
    http://www-cdf.fnal.gov/physics/statistics/statistics_home.html

36
Sample Repository Page
37
CEDAR & CEPA
38
Summary & Conclusions
  • Very useful physicist/statistician interaction
  • e.g. confidence intervals with nuisance
    parameters,
  • multivariate techniques, etc.
  • Lots of things learnt from:
  • ourselves (by having to present our own stuff!)
  • each other (various different approaches)
  • statisticians (updates on techniques)
  • A step towards common tools/software
    repositories: http://www.phystat.org (Linnemann)
  • Programme, transparencies, papers, etc.:
    http://www.physics.ox.ac.uk/phystat05 (with
    useful links such as recommended readings)
  • Proceedings published by Imperial College Press
    (Spring '06)

39
What is Next?
  • A few workshops/schools have taken place since
    October 2005:
  • e.g. Manchester (Nov 2005), SAMSI Duke (April
    2006), Banff (July 2006), Spanish Summer School
    (July 2006)
  • No PHYSTAT conference in summer 2007
  • ATLAS Workshop on Statistical Methods, 18-19 Jan
    2007
  • PHYSTAT Workshop at CERN, 27-29 June 2007, on
    statistical issues for LHC physics analyses
  • (Both workshops will likely aim at discovery
    significance. Please attend!)
  • Suggestions/enquiries to l.lyons@physics.ox.ac.uk

  • The LHC will take data soon. We do not wish to say
    "The experiment was inconclusive, so we had to
    use statistics" (inside cover of the Good Book
    by L. Lyons), but rather
    "We used statistics, and so we are sure that we've
    discovered X (well, with some confidence level!)"
40
Some Final Notes
  • I tried to give you a collage of PHYSTAT05 topics.
  • My deepest thanks to Louis for giving me the
    chance and introducing me to the PHYSTAT
    experience!
  • Apologies for the talks I have not been able to
    cover.
  • Thank you for the invitation!

41
Backup
  • Bayes
  • Frequentist
  • Cousins-Highland
  • Higgs Saga at CERN

42
Bayesian Approach
Bayes' theorem:

    p(θ | x) ∝ p(x | θ) × π(θ)
    posterior ∝ likelihood × prior

Problems: Is P(parameter) true or false? It is a degree
of belief.
Prior: what functional form? Flat? In which variable?
(See the sketch below.)
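A tiny demo of "flat in which variable?": a prior flat in θ is not flat in τ = θ², so flatness is not a statement of ignorance (my illustration, not from the talk).

    import numpy as np

    rng = np.random.default_rng(5)
    theta = rng.uniform(0.0, 1.0, 100_000)  # flat prior on theta in [0, 1]
    tau = theta ** 2                        # the same prior, seen in tau

    hist, _ = np.histogram(tau, bins=10, range=(0.0, 1.0), density=True)
    print(np.round(hist, 2))   # piles up near tau = 0: far from flat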
43
Frequentist Approach
Neyman construction: [figure on slide: confidence belt in
the (x, µ) plane, read off at the observed x₀]
  • µ: theoretical parameter
  • x: observation. NO PRIOR.

44
Frequentist Approach
Prob(µ_l ≤ µ ≤ µ_u) = 90% at 90% confidence:
Frequentist: µ_l and µ_u known, but random; µ unknown,
but fixed. Probability statement about µ_l and µ_u.
Bayesian: µ_l and µ_u known, and fixed; µ unknown, and
random. Probability/credible statement about µ.
45
A Method
Method: mixed frequentist-Bayesian. The full frequentist
method is hard to apply in several dimensions, so use a
Bayesian treatment for the nuisance parameters and a
frequentist construction to extract the range.
Philosophical/aesthetic problems? Highland and Cousins,
NIM A320 (1992) 331. (A toy sketch follows below.)
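A hedged sketch of the mixed approach for a counting experiment: average the Poisson probability over a Gaussian prior for the background, then extract a frequentist-style 95% upper limit. All numbers are invented.

    import numpy as np
    from scipy.stats import poisson

    n_obs = 3
    b0, db = 2.0, 0.5   # background estimate and its uncertainty
    b_samples = np.clip(np.random.default_rng(6).normal(b0, db, 2000),
                        0.0, None)

    def p_le_nobs(s):
        # P(n <= n_obs | s), averaged over the Gaussian prior for b
        return poisson.cdf(n_obs, s + b_samples).mean()

    scan = np.linspace(0.0, 15.0, 1501)
    pvals = np.array([p_le_nobs(s) for s in scan])
    s_up = scan[np.argmax(pvals <= 0.05)]  # first s where P drops to 5%
    print("95% CL upper limit on s ~", s_up)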
46
Higgs Saga
P(Data | Theory) ≠ P(Theory | Data)
Is data consistent with Standard Model?
or with Standard Model Higgs?
End of Sept 2000: the data were not very consistent with
the S.M.: Prob(Data | S.M.) < 1%, a valid frequentist
statement. Turned by the press into Prob(S.M. | Data) < 1%,
and therefore Prob(Higgs | Data) > 99%, i.e. "It is almost
certain that the Higgs has been seen."