Title: Measuring forecast skill: is it real skill or is it the varying climatology?

1. Measuring forecast skill: is it real skill or is it the varying climatology?

- Tom Hamill, NOAA / ESRL / PSD, Boulder, Colorado
  tom.hamill_at_noaa.gov, www.cdc.noaa.gov/people/tom.hamill
- Josip Juras, University of Zagreb, Croatia
2. Hypothesis

- If the climatological event probability varies among samples, then many verification metrics will credit a forecast with extra skill it doesn't deserve; the extra skill comes from the variations in the climatology.
3-6. Consider Equitable Threat Scores

- (1) ETS is location-dependent, related to the climatological probability.
- (2) Average of ETS at individual grid points: 0.28
- (3) ETS after data are lumped into one big table: 0.42
7. Considering three metrics

- (1) Brier Skill Score (1 = perfect, 0 = reference)
- (2) Relative Operating Characteristic
- (3) Equitable Threat Score

(Each will show this tendency: scores vary depending on how they're calculated.)
8. Equitable Threat Score: standard method of calculation

Assume we have a deterministic forecast. All m samples, at all locations and times, populate one 2x2 contingency table:

                        Event forecast?
                         YES      NO
  Event        YES        a        b
  observed?    NO         c        d

ETS = (a - a_ref) / (a + b + c - a_ref)

where a_ref = (a + b)(a + c) / m is the number of hits expected from a chance forecast with the same marginal frequencies.
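The standard calculation can be sketched in a few lines (a minimal illustration, not the authors' code):

```python
def ets(a, b, c, d):
    """Equitable Threat Score from one 2x2 contingency table.

    a: hits, b: false alarms, c: misses, d: correct rejections.
    a_ref is the number of hits expected from a chance forecast.
    """
    m = a + b + c + d
    a_ref = (a + b) * (a + c) / m
    return (a - a_ref) / (a + b + c - a_ref)

print(ets(50, 25, 25, 100))
```

A perfect forecast (b = c = 0) scores 1.0; a chance forecast scores near 0.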
9. Equitable Threat Score: alternative method of calculation

Consider the possibility of different regions with different climates. Assume nc contingency tables, each associated with samples with a distinct climatological event frequency; ns(k) of the m samples populate the kth table. ETS is calculated separately for each contingency table, and the alternative, weighted-average ETS is

ETS_alt = Σk [ns(k)/m] ETS(k)

(Analogous standard and alternative calculations for the BSS and ROC are given in the backup slides.)
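The two methods of combination can be contrasted directly. A sketch with hypothetical counts for two regions (the `ets` helper is restated so the block is self-contained):

```python
def ets(a, b, c, d):
    """Equitable Threat Score from one 2x2 contingency table."""
    m = a + b + c + d
    a_ref = (a + b) * (a + c) / m
    return (a - a_ref) / (a + b + c - a_ref)

def pooled_ets(tables):
    """Lump all samples into one big contingency table, then score."""
    a, b, c, d = (sum(col) for col in zip(*tables))
    return ets(a, b, c, d)

def weighted_ets(tables):
    """Score each table separately, then weight by its sample count."""
    m = sum(sum(t) for t in tables)
    return sum(sum(t) / m * ets(*t) for t in tables)

# Two hypothetical regions with very different event frequencies:
tables = [(5, 20, 20, 955), (600, 100, 100, 200)]
print(pooled_ets(tables), weighted_ets(tables))
```

With these counts the pooled score comes out well above the sample-weighted average, illustrating the inflation the slides describe.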
10. ETS calculated two ways
11. Example of unexpected skill: two islands, zero meteorologists

Imagine a planet with a global ocean and two isolated islands. Weather forecasting other than climatology for each island is impossible.

- Island 1: forecast and observed uncorrelated, each N(0, 1)
- Island 2: forecast and observed uncorrelated, each N(μ, 1), 0 ≤ μ ≤ 5

Event: observed > 0
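The two-island experiment can be simulated in a few lines. A sketch (not the authors' code), fixing μ = 2 within the stated 0 ≤ μ ≤ 5 range and using the event "observed > 0":

```python
import random

random.seed(42)

def ets(a, b, c, d):
    """Equitable Threat Score from one 2x2 contingency table."""
    m = a + b + c + d
    a_ref = (a + b) * (a + c) / m
    return (a - a_ref) / (a + b + c - a_ref)

def island(mu, n):
    """Forecast and observation: independent draws from N(mu, 1)."""
    return [(random.gauss(mu, 1), random.gauss(mu, 1)) for _ in range(n)]

def table(pairs):
    """2x2 counts for the event 'value > 0'."""
    a = b = c = d = 0
    for f, o in pairs:
        if f > 0 and o > 0:
            a += 1
        elif f > 0:
            b += 1
        elif o > 0:
            c += 1
        else:
            d += 1
    return a, b, c, d

n = 20000
t1 = table(island(0.0, n))   # Island 1
t2 = table(island(2.0, n))   # Island 2
pooled = tuple(x + y for x, y in zip(t1, t2))

print(ets(*t1), ets(*t2))   # each near zero: no real skill
print(ets(*pooled))         # clearly positive: an artifact of pooling
```

Each island's forecast is a random draw from its own climatology, so the per-island scores hover near zero, yet the pooled table shows apparent skill.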
12. Two islands

As μ increases, Island 2's climatology separates from Island 1's. But still, each island's forecast is no better than a random draw from its climatology. Expect no skill.
13. Skill with conventional methods of calculation

The reference climatology implicitly becomes the mixture of N(0, 1) and N(μ, 1), not N(0, 1) OR N(μ, 1) for the appropriate island.
14. The new reference climatology
15. Are standard methods wrong?

- Assertion: we've just re-defined climatology; these are the correct scores with reference to that climatology.
- Response: you can calculate them this way, but you shouldn't.
- You will draw improper inferences due to a lurking variable; i.e., the varying climatology should be a predictor.
- Discerning real skill, or a real skill difference, gets tougher.

"One method that is sometimes used is to combine all the data into a single 2x2 table; this procedure is legitimate only if the probability p of an occurrence (on the null hypothesis) can be assumed to be the same in all the individual 2x2 tables. Consequently, if p obviously varies from table to table, or we suspect that it may vary, this procedure should not be used."
- W. G. Cochran, 1954, discussing ANOVA tests
16. Related problem: when means are the same but climatological variances differ

- Event: v > 2.0
- Island 1: f ~ N(0, 1), v ~ N(0, 1), Corr(f, v) = 0.0
- Island 2: f ~ N(0, σ), v ~ N(0, σ), 1 ≤ σ ≤ 3, Corr(f, v) = 0.9
- Expectation: positive skill over the two islands, but not a function of σ
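This case can also be simulated. A sketch, treating σ as the standard deviation and constructing the Island 2 forecast to have correlation 0.9 with the observation (the helper names here are illustrative, not from the talk):

```python
import math
import random

random.seed(7)

def island(sigma, rho, n):
    """Pairs (f, v), each N(0, sigma^2), with Corr(f, v) = rho."""
    out = []
    for _ in range(n):
        z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
        v = sigma * z1
        f = sigma * (rho * z1 + math.sqrt(1 - rho * rho) * z2)
        out.append((f, v))
    return out

def hit_rate(pairs, thresh=2.0):
    """Fraction of observed events (v > thresh) that were also forecast."""
    hits = sum(1 for f, v in pairs if f > thresh and v > thresh)
    events = sum(1 for f, v in pairs if v > thresh)
    return hits / events

n = 50000
hr1 = hit_rate(island(1.0, 0.9, n))   # Island 2 with sigma = 1
hr3 = hit_rate(island(3.0, 0.9, n))   # Island 2 with sigma = 3
print(hr1, hr3)
```

The σ = 3 island produces far more events (the threshold 2.0 is well inside its climatology) and a higher hit rate, so it dominates any pooled calculation.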
17. Scores vary with σ
18. ...the island with the greater climatological uncertainty of the observed event ends up dominating the calculations.
19. Solutions?

- (1) Analyze events where climatological probabilities are the same at all locations, e.g., terciles.
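Solution (1) in code: define the event relative to each location's own climatology (here, exceedance of the upper tercile), so the climatological probability is the same everywhere. A sketch with synthetic data:

```python
import random

random.seed(1)

def upper_tercile(xs):
    """Threshold exceeded by roughly one third of the sample."""
    s = sorted(xs)
    return s[(2 * len(s)) // 3]

# Two hypothetical locations with very different climatologies:
locations = [[random.gauss(0, 1) for _ in range(9000)],
             [random.gauss(5, 3) for _ in range(9000)]]

freqs = []
for xs in locations:
    t = upper_tercile(xs)
    freqs.append(sum(x > t for x in xs) / len(xs))

print(freqs)  # near 1/3 at both locations, despite the different climates
```

Because the threshold adapts to each location, the event frequency no longer varies among samples, removing the lurking variable.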
20. Solutions, continued

- (2) Use sample-weighted averages, e.g., of the hit and false-alarm rates that make up the ROC.
21. Conclusions

- Many conventional verification metrics (BSS, RPSS, threat scores, ROC, potential economic value, etc.) can be overestimated if climatology varies among samples.
  - This results in false inferences: you think there's skill where there's none.
  - It complicates the evaluation of model improvements: Model A is better than Model B, but doesn't appear quite so, since both are inflated in skill.
- Fixes:
  - Consider events where climatology doesn't vary, such as the exceedance of a quantile of the climatological distribution.
  - Combine after calculating for distinct climatologies.
- Please document your method for calculating a score!

Acknowledgements: Matt Briggs, Dan Wilks, Craig Bishop, Beth Ebert, Steve Mullen, Simon Mason, Bob Glahn, Neill Bowler, Ken Mylne, Bill Gallus, Frederic Atger, Francois LaLaurette, Zoltan Toth, Jeff Whitaker.
22. Brier Skill Scores from raw ensembles

Event: whether the observed weather will be above a threshold T. Let Xe(j) = {X1(j), ..., Xn(j)} be the n-member ensemble forecast of the relevant scalar variable for the jth of m samples (taken over many case days and/or locations), with the ensemble sorted from lowest to highest. Convert the sorted ensemble to an n-member binary forecast Ie(j) = {I1(j), ..., In(j)}, where Ii(j) = 1 if Xi(j) > T and Ii(j) = 0 if Xi(j) ≤ T. The observed weather is also converted to binary, denoted Io(j).

Forecast probability: p(j) = (1/n) Σi Ii(j)

Brier score of the forecast: BSf = (1/m) Σj [p(j) - Io(j)]^2
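The construction above, sketched in code (the helper names are illustrative, not from the talk):

```python
def ensemble_prob(ensemble, T):
    """Forecast probability: fraction of members exceeding threshold T."""
    return sum(x > T for x in ensemble) / len(ensemble)

def brier_score(ensembles, obs, T):
    """BSf: mean squared difference between p(j) and the binary observation."""
    m = len(ensembles)
    return sum((ensemble_prob(e, T) - (o > T)) ** 2
               for e, o in zip(ensembles, obs)) / m

ensembles = [[0.1, 0.6, 0.9], [0.2, 0.3, 0.4]]
obs = [1.0, 0.1]
print(brier_score(ensembles, obs, T=0.5))
```

For the first sample p = 2/3 with the event observed; for the second p = 0 with no event, so the score averages the single squared error (2/3 - 1)^2 over the two samples.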
23. Brier Skill Score, continued

Standard method of calculation: a single climatological probability is calculated over all m samples.

Climatological event probability: pc = (1/m) Σj Io(j)

Brier score of climatology: BSc = (1/m) Σj [pc - Io(j)]^2

BSS = 1.0 - BSf / BSc   (definition of the Brier Skill Score)
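The standard BSS in code (one climatology over all samples; a minimal sketch):

```python
def bss_standard(probs, obs):
    """probs: forecast probabilities; obs: binary outcomes (0/1)."""
    m = len(obs)
    pc = sum(obs) / m                                # climatological probability
    bs_f = sum((p - o) ** 2 for p, o in zip(probs, obs)) / m
    bs_c = sum((pc - o) ** 2 for o in obs) / m       # Brier score of climatology
    return 1.0 - bs_f / bs_c

print(bss_standard([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0]))  # perfect: 1.0
```

Forecasting the climatological probability itself scores exactly zero, the reference point of the skill score.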
24. Brier Skill Score, continued

Alternative 1: multiple climatological probabilities are calculated for different regions, then combined.

Suppose the m samples are split into nc subsets, each with a distinct climatological event frequency. Let pc(k) be the climatological probability in the kth of the nc subsets, with ns(k) samples in this subset, and let rk = {r(1), ..., r(ns(k))} be the associated set of sample indices out of the m samples.

The Brier score of climatology is calculated separately for each subset:

BSc(k) = (1/ns(k)) Σ(j in rk) [pc(k) - Io(j)]^2

The overall Brier score of climatology is the sample-weighted sum of the scores for each subset:

BSc = Σk [ns(k)/m] BSc(k)
25. Brier Skill Score, continued

Alternative 2: the final BSS is the sample-weighted average of the BSS for each subset. The forecast Brier score BSf(k) is calculated separately for each distinct region, as was done for climatology, and ns(k)/m is the weight applied to that region's BSS:

BSS = Σk [ns(k)/m] [1.0 - BSf(k)/BSc(k)]
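The sample-weighted combination in code. A sketch, assuming `subsets` pairs each climate region's forecast probabilities with its binary observations, and that every region contains at least one event and one non-event (otherwise BSc(k) would be zero):

```python
def brier(ps, os):
    """Mean squared probability error."""
    return sum((p - o) ** 2 for p, o in zip(ps, os)) / len(os)

def bss_weighted(subsets):
    """Sample-weighted average of per-subset Brier Skill Scores."""
    m = sum(len(os) for _, os in subsets)
    total = 0.0
    for ps, os in subsets:
        pc = sum(os) / len(os)          # subset climatological probability
        bs_c = brier([pc] * len(os), os)
        total += len(os) / m * (1.0 - brier(ps, os) / bs_c)
    return total

# Forecasts that merely issue each region's climatology score zero:
subsets = [([0.2] * 5, [0, 0, 1, 0, 0]),
           ([0.8] * 5, [1, 1, 1, 0, 1])]
print(bss_weighted(subsets))
```

Because each subset is scored against its own climatology, constant regional climatology forecasts earn no credit, unlike the pooled calculation.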
26. Relative Operating Characteristic: standard method of calculation

Populate 2x2 contingency tables, a separate one for each sorted ensemble member. The contingency table for the ith sorted ensemble member is:

                        Event forecast by ith member?
                         YES      NO
  Event        YES        ai       bi
  observed?    NO         ci       di

(entries normalized so that ai + bi + ci + di = 1)

Hit rate: HRi = ai / (ai + ci)
False alarm rate: FARi = bi / (bi + di)

The ROC is a plot of hit rate (y) vs. false alarm rate (x), commonly summarized by the area under the curve (AUC): 1.0 for a perfect forecast, 0.5 for climatology.
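The standard ROC calculation in code: per-member hit and false-alarm rates, plus a trapezoid-rule AUC (a sketch with hypothetical, already-normalized tables):

```python
def roc_points(tables):
    """tables: one (a, b, c, d) per sorted ensemble member.
    Returns (FARi, HRi) pairs: FARi = b/(b+d), HRi = a/(a+c)."""
    pts = [(b / (b + d), a / (a + c)) for a, b, c, d in tables]
    # Close the curve at (0, 0) and (1, 1), then order by FAR:
    return sorted(pts + [(0.0, 0.0), (1.0, 1.0)])

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

tables = [(0.30, 0.10, 0.10, 0.50),   # low member threshold
          (0.20, 0.05, 0.20, 0.55)]   # higher member threshold
points = roc_points(tables)
print(points, auc(points))
```

The diagonal from (0, 0) to (1, 1) integrates to exactly 0.5, the no-skill (climatology) reference.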
27. Relative Operating Characteristic: alternative method of calculation

As with the BSS, suppose the samples can be partitioned into nc subsets, each associated with a distinct climatological event frequency. Using the ns(k) samples of the kth climatology, compute hit rates HRi(k) and false alarm rates FARi(k). Then calculate sample-weighted average hit rates and false alarm rates:

HRi = Σk [ns(k)/m] HRi(k)
FARi = Σk [ns(k)/m] FARi(k)
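The sample-weighted combination in code. A sketch, assuming each subset supplies its member-i contingency counts and its sample size ns(k):

```python
def rates(table):
    """(hit rate, false-alarm rate) from one 2x2 table (a, b, c, d)."""
    a, b, c, d = table
    return a / (a + c), b / (b + d)

def weighted_rates(subsets):
    """subsets: list of (ns_k, table_k) for one ensemble member i.
    Returns the sample-weighted average (HRi, FARi)."""
    m = sum(ns for ns, _ in subsets)
    hr = sum(ns / m * rates(t)[0] for ns, t in subsets)
    far = sum(ns / m * rates(t)[1] for ns, t in subsets)
    return hr, far

subsets = [(1000, (10, 40, 10, 940)),     # rare-event region
           (1000, (400, 100, 100, 400))]  # common-event region
print(weighted_rates(subsets))
```

Each region contributes in proportion to its sample count, so the rare-event region is no longer swamped by the common-event one.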
28. Scores for the unequal-variance islands

Island 1 (observed event frequency 0.0232): ETS = -0.0022, HR = 0.0172
Island 2, σ = 1 (observed event frequency 0.0288): ETS = 0.4195, HR = 0.5937
  Combined table: ETS = 0.193, HR = 0.336
Island 2, σ = 3 (observed event frequency 0.26): ETS = 0.5327, HR = 0.778
  Combined table: ETS = 0.499, HR = 0.715