Title: Stedinger, J.R., C. M. Crainiceanu, D. Ruppert, C.T. Behr, and R. McKay, Statistical Models of Crypt
1Stedinger, J.R., C. M. Crainiceanu, D. Ruppert,
C.T. Behr, and R. McKay, Statistical Models of
Cryptosporidium Concentrations in Natural Waters,
seminar presented to the New York City Dept. of
Environmental Protection, Valhalla, NY, March 14,
2002.
- The report describes statistical methods for the analysis of Cryptosporidium concentrations in natural waters, using the ICR data set as an example. Zero counts are part of the sampling variation of count data and are modeled as zero counts to allow correct inferences concerning environmental concentrations. The hierarchical structure, viewed as an empirical Bayesian model, allows prediction of the distribution of concentrations at different sites as a function of the site, season, and water matrix. This is what agencies need for risk characterization.
- Obtaining these results required extending the available regression models for discrete data with random effects and hierarchical structure. A Generalized Linear Mixed Model (GLMM) with a hierarchical structure that includes site, regional, hydrologic, and watershed effects has been developed. Markov Chain Monte Carlo (MCMC) simulation is employed to compute the posterior distributions of the parameters. WinBUGS, a powerful and flexible statistical software package, is used for the Bayesian computations.
- These models were applied separately to stream and reservoir source waters. Results indicate that streams have a much higher average concentration, that turbidity is a significant indicator of concentrations, and that seasonal variations are relevant in some cases. Models of variations in recovery rates are also considered and show a small dependence upon turbidity.
2Statistical Models of Cryptosporidium
Concentrations in Natural Waters
- Prof. Jery Stedinger (Civil Env. Engin.)
- Prof. David Ruppert (ORIE)
- Ciprian Crainiceanu (Statistics)
- Christopher Behr (Civil Env. Engin.)
- R. J. MacKay (ORIE)
- Cornell University
3Outline
- Background Cryptosporidium parvum
- Cryptosporidium Data as Counts
- Model Formulation and Computation
- Analyses of Cryptosporidium concentrations
- Recovery Rate Analysis
- Conclusions
4Concern about Cryptosporidium
- C. parvum causes mild to serious infections
- Outbreaks largest in Milwaukee (1994)
- Potentially high endemic levels
- 2,000-3,000 reported cases/yr between 1995-97
- Unreported cases estimated at between 0.2% and 2% of the population for industrialized countries
- i.e., roughly 0.5 to 5 million cases/year in U.S.
5ICR Datasets
6Oocyst Recovery IFA Method
[Figure: flow diagram of the IFA oocyst recovery method (Walker, 1995). Labels recoverable from the diagram: Raw Water (V), Fiber Filter, Stomacher, Suspended Solution, Centrifuge, Separated Liquid, Pellet of Solids, Fraction (F) of Pellet, Top Layer, Acetate Membrane, Dye added, Slides.]
- Oocysts are counted DISCRETELY!
7Issues in Cryptosporidium Data
- Cryptosporidium testing methods
- Immunofluorescence assay method (ICR)
- Immunomagnetic separation (1623)
- Difficult to detect
- Low mean recovery rates
- ICR ≈ 10%, 1623 ≈ 40%
- High variability (CV = σ/μ)
- ICR CV ≈ 100%, 1623 CV ≈ 50%
8Issues in Cryptosporidium Data
- Many zero counts (93% in ICR, 85% in ICRSS)
- Many sites have only zero counts
9Problems posed by Zeros
- Data source: NJ, Delaware River, 1997, LeChevallier et al.
- Volume analyzed was always 5 liters.
- Without adjustment for recovery:
- Average non-zero conc. = 0.48 oocysts/L
- Total_counts/total_volume = 0.19 oocysts/L
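The gap between those two averages comes from dropping the zero counts. A minimal sketch with hypothetical counts (not the NJ data), assuming equal 5-litre samples, shows how averaging only the non-zero samples inflates the estimate compared with pooling all counts over all volume:

```python
# Hypothetical oocyst counts from ten equal 5-litre samples (illustrative only)
counts = [0, 0, 0, 2, 0, 0, 3, 0, 0, 0]
volume_l = 5.0

# Average concentration over samples with at least one oocyst (zeros discarded)
nonzero_concs = [c / volume_l for c in counts if c > 0]
avg_nonzero_conc = sum(nonzero_concs) / len(nonzero_concs)

# Pooled estimate: total counts divided by total volume (zeros retained)
pooled_conc = sum(counts) / (volume_l * len(counts))

print(avg_nonzero_conc, pooled_conc)
```

Neither number is adjusted for recovery; the point is only that the zeros carry information about how low the concentration is.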
10NJ Data as Censored LN
- Detection limit = 1/(effective volume)
- Maximum likelihood estimation (MLE)
- Recovery rate R = 10% (fixed)
- Lognormal percentiles (oocysts/L)
  50th   90th   99th   Mean
  1.40   4.95   13.9   2.28
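A censored-lognormal fit of this kind can be sketched as follows. The counts are hypothetical, the fixed 10% recovery is folded into the effective volume (an assumption of this sketch), and a coarse grid search stands in for a proper optimizer. Non-detects contribute the probability of falling below the detection limit; detects contribute the lognormal density.

```python
import math

def norm_cdf(z):
    # erfc keeps precision in the lower tail, so log() below never sees 0
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def censored_ln_loglik(mu, sigma, detects, n_censored, limit):
    """Log-likelihood of lognormal concentrations left-censored at one limit."""
    ll = n_censored * math.log(norm_cdf((math.log(limit) - mu) / sigma))
    for c in detects:
        z = (math.log(c) - mu) / sigma
        ll += -math.log(c * sigma * math.sqrt(2.0 * math.pi)) - 0.5 * z * z
    return ll

recovery, volume_l = 0.10, 5.0                   # assumed fixed R and volume
detection_limit = 1.0 / (recovery * volume_l)    # one oocyst in the effective volume
raw_counts = [1, 1, 2, 3]                        # hypothetical non-zero counts
detects = [c / (recovery * volume_l) for c in raw_counts]
n_censored = 8                                   # hypothetical number of zero counts

# Coarse grid over (mu, sigma) on the log scale; a real fit would use an optimizer
candidates = [(i / 20, j / 20) for i in range(-60, 61) for j in range(4, 61)]
best_mu, best_sigma = max(
    candidates,
    key=lambda p: censored_ln_loglik(p[0], p[1], detects, n_censored, detection_limit),
)

median = math.exp(best_mu)
p90 = math.exp(best_mu + 1.2816 * best_sigma)    # 90th percentile of the lognormal
mean = math.exp(best_mu + 0.5 * best_sigma ** 2)
print(median, p90, mean)
```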
11Data as Poisson Counts
- $E[\text{Number of Counts}] = R \cdot V \cdot C$
- $R$ = recovery rate, $C$ = concentration
- $V$ = effective volume (known)
- Models
- Poisson counts, $R$ = 10%, $C$ = const
- Poisson counts, $R$ = 10%, $C \sim$ Gamma
- Poisson counts, $R \sim$ Beta, $C \sim$ Gamma
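To see why the choice among these models matters for zero counts, the sketch below compares the probability of a zero count under a constant concentration with that under a Gamma-distributed concentration of the same mean. All numeric values are illustrative assumptions; the Gamma-Poisson zero probability uses the standard negative binomial result.

```python
import math

recovery = 0.10     # R, fixed at 10%
volume_l = 5.0      # V, illustrative effective volume in litres
mean_conc = 1.9     # illustrative mean concentration C (oocysts/L)
cv_conc = 1.6       # illustrative coefficient of variation of C

expected_count = recovery * volume_l * mean_conc   # E[count] = R * V * C

# Constant concentration: counts are Poisson, so P(0) = exp(-R V C)
p_zero_poisson = math.exp(-expected_count)

# Gamma concentration with the same mean: counts are negative binomial,
# and P(0) = (1 + E[count] / k) ** (-k) with shape k = 1 / CV**2
shape = 1.0 / cv_conc ** 2
p_zero_gamma_poisson = (1.0 + expected_count / shape) ** (-shape)

print(p_zero_poisson, p_zero_gamma_poisson)
```

With the mean held fixed, variability in concentration always raises the share of zero counts, which is one reason a plain Poisson model fits such data poorly.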
12Results of Model Fitting, NJ
- Percentiles (oocysts/L) by fitted distribution
  Distribution   50th   90th   99th   Mean   CV
  Lognormal      1.4    5.0    14     2.3    1.3
  Poisson        1.9    -      -      1.9    -
  Gamma/P        0.7    5.4    15     1.9    1.6
  G/B(0.6)/P     1.0    5.0    12     1.9    1.3
  G/B(1.2)/P     1.4    3.5    6      1.8    0.8
13Model Fits West Virginia
- Percentiles by fitted distribution
  Distribution   50th   90th   99th   Mean   CV
  Lognormal      0.1    17     930    187    -
  Poisson        0.2    -      -      0.2    0
  Gamma/P        0.05   15     85     5.6    3.1
  G/B(0.6)/P     1.0    4.0    8.8    1.6    1.1
  G/B(1.2)/P     1.4    2.7    4.1    1.6    0.5
14West Virginia Data
15Outline
- Background Cryptosporidium parvum
- Cryptosporidium Data as Counts
- Model Formulation and Computation
- Analyses of Cryptosporidium concentrations
- Recovery Rate Analysis
- Conclusions
16National ICR Crypto Data Set
17Modeling Implications
- Linear regression is inappropriate
- Limited information in zero counts
- Counts not normally distributed
- Information at most sites insufficient to estimate mean concentration for site
- Need to combine information from different sites
- Sites can be viewed as residing in regions
- Together: Generalized Linear Mixed Model
Use a Poisson count model
Use hierarchical model w/ regional and site effects
18Bayesian Pathogen Concentration Model
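Model Elements
- $Y_{ij}$ = pathogen counts
- $C_{ij}$ = pathogen conc.
- $V_{ij}$ = volume of water
- $R_{ij}$ = recovery rate
- $X_{ij}$ = predictor matrix
Random effects
- $\tau_{ij}$ = time-site effects
- $s_j$ = site effects
- $r_{k(j)}$ = regional effects
- $k(j)$ = region for site $j$
Hierarchical Model
- $Y_{ij} \sim \mathrm{Poisson}(V_{ij} C_{ij} R_{ij})$
- $\log C_{ij} = X_{ij}^T \beta + \tau_{ij}$
- $R_{ij} \sim \mathrm{Beta}(a, b)$
- where $\tau_{ij} \sim N(s_j, \sigma_\tau^2)$, $s_j \sim N(r_{k(j)}, \sigma_s^2)$, $r_{k(j)} \sim N(\mu, \sigma_r^2)$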
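One way to read the hierarchy is to simulate from it top down: regional effects first, then site effects centred on their region, then time-site effects centred on their site, and finally Poisson counts with mean volume × concentration × recovery. The sketch below does this with the standard library only. Every parameter value, the single standardized covariate, and the function names are hypothetical choices for illustration, not fitted values.

```python
import math
import random

def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def simulate_counts(rng, n_regions=3, sites_per_region=4, n_obs=18,
                    mu=-1.0, sd_region=0.5, sd_site=0.5, sd_time=1.0,
                    beta=0.5, a=2.0, b=18.0, volume=10.0):
    """Return (region, site, obs, count) tuples drawn from the hierarchy."""
    rows = []
    for k in range(n_regions):
        r_k = rng.gauss(mu, sd_region)              # regional effect
        for j in range(sites_per_region):
            s_j = rng.gauss(r_k, sd_site)           # site effect within region
            for i in range(n_obs):
                x = rng.gauss(0.0, 1.0)             # hypothetical standardized covariate
                tau = rng.gauss(s_j, sd_time)       # time-site effect
                conc = math.exp(x * beta + tau)     # log link for concentration
                rec = rng.betavariate(a, b)         # recovery rate in (0, 1)
                rows.append((k, j, i, poisson_draw(rng, volume * conc * rec)))
    return rows

rows = simulate_counts(random.Random(0))
zero_share = sum(1 for r in rows if r[3] == 0) / len(rows)
print(len(rows), zero_share)
```

Even with non-trivial concentrations, low recovery and modest volumes produce a large share of zero counts in such simulations.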
19Bayesian Statistical Approach
- Frequentist Approach
- Maximizing the likelihood function $f(y \mid \theta)$ given data $y$ yields point estimates of $\theta$ (MLEs)
- Bayesian Approach
- Provide prior distribution $\pi(\theta)$
- Determine posterior distribution $p(\theta \mid y)$
- where $p(\theta \mid y) \propto f(y \mid \theta)\, \pi(\theta)$
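The proportionality between posterior and likelihood times prior can be checked numerically on a one-parameter case. The sketch below assumes a single Poisson count with a Gamma prior on the concentration (hypothetical numbers), normalizes likelihood × prior on a grid, and compares the resulting posterior mean with the known conjugate answer.

```python
import math

def poisson_lik(y, theta, exposure):
    """f(y | theta) for a Poisson count with mean exposure * theta."""
    lam = exposure * theta
    return math.exp(-lam) * lam ** y / math.factorial(y)

def gamma_prior(theta, shape, rate):
    """pi(theta): Gamma density with the given shape and rate."""
    return rate ** shape * theta ** (shape - 1) * math.exp(-rate * theta) / math.gamma(shape)

y_obs, exposure = 3, 0.5             # hypothetical count and R * V
prior_shape, prior_rate = 2.0, 1.0   # hypothetical prior

step = 0.01
grid = [step * (i + 0.5) for i in range(3000)]        # midpoints on (0, 30)
unnorm = [poisson_lik(y_obs, t, exposure) * gamma_prior(t, prior_shape, prior_rate)
          for t in grid]
total = sum(unnorm)
posterior = [u / total for u in unnorm]               # normalized grid weights
grid_mean = sum(t * p for t, p in zip(grid, posterior))

# Conjugate result: posterior is Gamma(shape + y, rate + exposure)
exact_mean = (prior_shape + y_obs) / (prior_rate + exposure)
print(grid_mean, exact_mean)
```

A grid works for one parameter; for the hierarchical model with many random effects it does not scale, which is why simulation methods are used instead.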
20Bayesian Computation
- Want posterior distribution of parameters $\beta$, $\sigma$, $\mu$
- Basic model has a Poisson residual, a recovery rate, and a random effect for each observation; in addition, site means are linked, as are ten regional effects.
- These must be integrated out to compute the likelihood.
- With 100 sites and 18 observations per site, such integration is analytically intractable
- Use Markov Chain Monte Carlo simulation
21Outline
- Background Cryptosporidium parvum
- Cryptosporidium Data as Counts
- Model Formulation and Computation
- Analyses of Cryptosporidium concentrations
- WQ Prediction and Health Risk Analysis
- Recovery Rate Analysis
- Conclusions
22Model Covariates
- Time-site Covariates
- Log-turbidity, Carbonate hardness,
- Total organic carbon,
- Hydrologic Variables (stream sites only)
- Seasonal Effects
- Spline Function
- Temperature Anomaly
- Site-Specific Covariates
- Urban land area, sediment export potential, pop.
- Log-Avg Residence Time (res./lake sites only)
23Modeling Objectives
- Water Quality Prediction (WQP)
- Focus: covariates that vary over time and place
- Includes all relevant covariates
- Model for Health Risk Analysis (HRA)
- Focus: covariates known over time at a given place
- Includes site characteristics and time function
24HRA for Reservoir/Lake Sites
- Parameters that are not significant are shown in italics.
- Other parameters in full model: temperature anomaly, log-population, log-urban land area, soil permeability, sediment export, log residence time, seasonal spline coefficients.
25WQP for Reservoir/Lake Sites
- Parameters that are not significant are shown in italics.
- Other parameters in full model: temperature anomaly, log-population, log-urban land area, soil permeability, sediment export, log residence time, seasonal spline coefficients.
26WQP for Stream Sites
- Parameters that are not significant are shown in italics.
- Other parameters in full model: total organic carbon, temperature anomaly, log-population, soil permeability, sediment export, hydrologic variables.
27Summary of Results
28Outline
- Background Cryptosporidium parvum
- Cryptosporidium Data as Counts
- Model Formulation
- Model Computation and Parameterization
- Analyses of Cryptosporidium concentrations
- Recovery Rate Analysis
- Conclusions
29Recovery rates
- For a given laboratory, recovery rate R equals the probability that a lab technician observes and counts a pathogen originally in the sample.
- Because recovery rates for Crypto and Giardia are small and highly variable, ignoring recovery rates would underestimate concentrations and exaggerate variability.
30EPA ICR-Spiking Study
31Recovery rate model
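- $N_{ij} \sim \mathrm{Gamma}(a, b)$
- $Z_{ij} \sim \mathrm{Poisson}(\lambda_{ij})$
- $\lambda_{ij} = V_{ij} R_{ij} N_{ij} / VT_{ij}$
- $\log[R_{ij}/(1 - R_{ij})] = X_{ij}^T \beta + \tau_{ij}$
- $\tau_{ij} \sim N(\mathrm{lab}_j, \sigma_\tau^2)$
- $\mathrm{lab}_j \sim N(\mu, \sigma_{\mathrm{lab}}^2)$
- $N_{ij}$ = number of pathogens spiked
- $VT_{ij}$ = total vol.
- $V_{ij}$ = effective vol. analyzed
- $R_{ij}$ = recovery rate
- $Z_{ij}$ = pathogens counted
- $X_{ij}$ = covariates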
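The logit link keeps the recovery rate between 0 and 1 whatever the covariates are. As a small sketch, the function below turns a linear predictor into a recovery rate and then into the expected spiked count (effective volume × recovery × number spiked / total volume). The coefficient values, covariate layout (intercept plus log-turbidity), and function names are hypothetical.

```python
import math

def inv_logit(eta):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def expected_spike_count(n_spiked, v_eff, v_total, x, beta, tau):
    """Return (expected count, recovery rate) for one spiked sample."""
    eta = sum(xi * bi for xi, bi in zip(x, beta)) + tau   # X^T beta + tau
    recovery = inv_logit(eta)
    return v_eff * recovery * n_spiked / v_total, recovery

beta = [-2.0, -0.1]   # hypothetical intercept and log-turbidity coefficient
low_turb = expected_spike_count(1000, 10.0, 100.0, [1.0, 0.0], beta, tau=0.0)
high_turb = expected_spike_count(1000, 10.0, 100.0, [1.0, 2.0], beta, tau=0.0)
print(low_turb, high_turb)
```

With a negative turbidity coefficient, as assumed here, the more turbid sample has the lower recovery rate and therefore the lower expected count.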
32Posterior means for parameters of interest ICR
spiking data
33Recovery rates conclusions
- Recovery rates are small and highly variable
- Turbidity is statistically significant for Cryptosporidium but not for Giardia
- Laboratory effects are appreciable for Giardia but not for Cryptosporidium
34Conclusions
- Cryptosporidium and Giardia data are discrete counts, with many zeros.
- Recovery rates for Cryptosporidium and Giardia are small and highly variable.
- Turbidity is statistically significant for Crypto; the lab effect is significant for Giardia.
- A Bayesian analysis of a hierarchical Generalized Linear Mixed Model (GLMM) is able to evaluate the 350-site ICR data
35Catskill Lower Effluent Chamber (CATLEFF)
  Date        Total Giardia   Total Crypto
              (cysts/50L)     (oocysts/50L)
  9-Mar-02    0               0
  6-Mar-02    0               0
  4-Mar-02    5               0
  3-Mar-02    0               0
  2-Mar-02    0               0
  1-Mar-02    1               1
  28-Feb-02   2               0
  27-Feb-02   2               0
  26-Feb-02   5               0
  25-Feb-02   5               1
  24-Feb-02   1               1
  19-Feb-02   6               2
  13-Feb-02   0               0
  11-Feb-02   4               1
  6-Feb-02    4               1
  4-Feb-02    2               2
  28-Jan-02   1               0
  22-Jan-02   1               0
  14-Jan-02   1               0
36Thoughts about NYC
- Long records for 3 key sites
- Used 2 analysis methods with different recovery rate characteristics (ICR, 1623)
- Interested in
- quality prediction
- long-term health risk assessment
37Thoughts about NYC
- Bayesian analysis can integrate
- count data,
- observed WQ characteristics,
- recovery rate distributions (ICR, 1623),
- and
- natural variability and
- persistence
- of environmental concentrations of pathogens
39References
Crainiceanu, C. M., D. Ruppert, J.R. Stedinger, and C.T. Behr, Improving MCMC Mixing for a GLMM Describing Pathogen Concentrations in Water Supplies, in Case Studies in Bayesian Analysis, Springer-Verlag Series, New York, 2002.
Crainiceanu, C. M., D. Ruppert, and J.R. Stedinger, Bayesian recovery rates modelling for waterborne pathogens, technical report, Cornell University, 2002.
Behr, C.T., Modeling Cryptosporidium Concentrations: A Bayesian GLMM of Regional Discrete Count Data, MS Thesis, Cornell Univ., August, 2001.
Stedinger, J. R., and R. J. MacKay, Interpretation of Cryptosporidium and Giardia Monitoring Data Generated by ICR Program, invited presentation, EPA Workshop on Statistics, Washington DC, Nov. 19, 1998.
We also published: Walker, F.R., Jr., and Jery R. Stedinger, Fate and Transport model of Cryptosporidium, J. of Environmental Engineering, 125(4), 325-333, 1999.
40Generalized Linear Mixed Model
- Hierarchical Poisson-lognormal structure
- Log link function
- Random effects captured by $\tau$ and $s$ (time-site and site effects)
41Gibbs Sampling Method
- Gibbs Sampling is one type of Markov Chain Monte Carlo method
- General idea, used to obtain $p(\theta \mid y)$:
- Start with initial values of $\theta$
- Iteratively sample values from $p(\theta \mid y)$
- Many iterations yield $p(\theta \mid y)$ empirically
42Gibbs Sampling Algorithm
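- At each iteration $i$, the Gibbs Sampler (GS) generates
- $\theta_1^{(i)} \sim p(\theta_1 \mid \theta_2^{(i-1)}, \ldots, \theta_d^{(i-1)}, y)$
- $\vdots$
- $\theta_d^{(i)} \sim p(\theta_d \mid \theta_1^{(i)}, \ldots, \theta_{d-1}^{(i)}, y)$
- Since $(\theta_1^{(i)}, \ldots, \theta_d^{(i)}) \to p(\theta \mid y)$ in distribution as $i \to \infty$,
- after $T$ iterations, the posterior mean is estimated by
- $\bar{\theta}_j \approx m_j = \frac{1}{T} \sum_{i=1}^{T} \theta_j^{(i)}$, with $E(m_j) = \bar{\theta}_j$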
- Obtain $p(\theta \mid y)$ and marginals such as $p(\theta_i, \theta_j \mid y)$ for $i \neq j$
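The sampling scheme can be illustrated on a much smaller model than the GLMM in this talk. The sketch below assumes Poisson counts with site concentrations drawn from a Gamma distribution whose rate has its own Gamma prior, a toy model chosen only because both full conditionals are available in closed form. It is not the WinBUGS model used for the ICR data, and all names and values are hypothetical.

```python
import random

def gibbs_poisson_gamma(y, exposure, shape=1.0, a0=1.0, b0=1.0,
                        n_iter=2000, burn_in=500, seed=1):
    """Posterior mean concentration per site for a toy Poisson-Gamma model.

    y[i] ~ Poisson(exposure * c[i]),  c[i] ~ Gamma(shape, rate),  rate ~ Gamma(a0, b0)
    """
    rng = random.Random(seed)
    n = len(y)
    rate = 1.0                         # initial value for the shared rate
    conc_sums = [0.0] * n
    kept = 0
    for it in range(n_iter):
        # c[i] | rate, y  ~ Gamma(shape + y[i], rate + exposure)
        # (random.gammavariate takes a scale, hence the reciprocal)
        conc = [rng.gammavariate(shape + y_i, 1.0 / (rate + exposure)) for y_i in y]
        # rate | c ~ Gamma(a0 + n * shape, b0 + sum(c))
        rate = rng.gammavariate(a0 + n * shape, 1.0 / (b0 + sum(conc)))
        if it >= burn_in:              # discard early draws before averaging
            kept += 1
            conc_sums = [s + c for s, c in zip(conc_sums, conc)]
    return [s / kept for s in conc_sums]

post_means = gibbs_poisson_gamma([0, 0, 1, 0, 4, 0], exposure=0.5)
print(post_means)
```

Sites with zero counts still receive a positive posterior mean because they borrow strength from the shared rate, which is the same pooling idea the hierarchical GLMM relies on.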