Title: Let It Rain: Modeling Multivariate Rain Time Series Using Hidden Markov Models
1. Let It Rain: Modeling Multivariate Rain Time Series Using Hidden Markov Models
- Sergey Kirshner
- Donald Bren School of
- Information and Computer Sciences
- UC Irvine
March 2, 2006
2. Acknowledgements
Padhraic Smyth (UCI)
Andy Robertson (IRI)
DOE (DE-FG02-02ER63413)
3. http://iri.columbia.edu/climate/forecast/net_asmt/2006/feb2006/MAM06_World_pcp.html
4. What to Do with Rainfall Data? Description
historical rainfall data + general circulation model (GCM) outputs → model
5. What to Do with Rainfall Data? Downscaling
historical rainfall data + general circulation model (GCM) outputs → model → predicted data
6. What to Do with Rainfall Data? Simulation
historical rainfall data + general circulation model (GCM) outputs → model → predicted data → crop modeling, water management
7. Snapshot of the Data
8. Modeling Precipitation Occurrence
Northeast Brazil, 1975-2002 (except 1976, 78, 84, and 86): 24 seasons (N), 90 days (T), 10 stations (M)
9. ... and Amounts
10. Annual Precipitation Probability
11. Spatial Correlation
12. Spell Run-Length Distributions
Dry spells are in blue; wet spells are in red.
13. Important Data Characteristics
- Correlation
- Spatial dependence
- Temporal structure
- Run-length distributions
- Persistence
- First-order dependence
- Variability of individual series
- Interannual variability important for climate studies
14. Missing Data
Missing data mask (black) for 41 stations (y-axis) in India for May 1 - Oct 31, 1973. 29% of the data is missing, with stations 13, 14, 16, 24, 26, 30, 36, 38, and 40 missing more than 45% of the data for that station.
15. A Bit of Notation
- Vector time series R = (R_1, R_2, ..., R_T)
- Vector observation of R at time t: R_t = (R_t^1, R_t^2, ..., R_t^M), one component per station
16. Weather Generator
- Does not take spatial correlation into account
17. Rain Generating Process
18. Hidden Markov Model (HMM)
- Discrete weather states S_t (K states)
- Evolution of the weather state: transition probability P(S_t | S_{t-1})
- Rainfall generation in weather state i: emission probability P(R_t | S_t = i) (see the sampling sketch below)
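To make the generative story concrete, here is a minimal sampling sketch in Python. The function name `sample_hmm` and the `emit` callback are illustrative, not part of the talk's software; any per-state rainfall sampler can be plugged in.

```python
import numpy as np

def sample_hmm(pi, A, emit, T, seed=0):
    """Sample a length-T sequence from the rain HMM: draw the hidden
    weather-state chain from pi and A, then rainfall from each
    state's emission distribution via emit(state, rng)."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    s = rng.choice(K, p=pi)           # initial weather state
    states, rain = [], []
    for _ in range(T):
        states.append(s)
        rain.append(emit(s, rng))     # R_t ~ P(R_t | S_t = s)
        s = rng.choice(K, p=A[s])     # S_{t+1} ~ P(. | S_t = s)
    return states, rain
```

For rainfall occurrence with conditionally independent stations, `emit` could return `rng.random(M) < p[s]`, where `p[s]` holds hypothetical per-station wet probabilities for state s.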
19. Hidden Markov Model (HMM)
(Graphical model: a Markov chain of hidden states S_1 → S_2 → ... → S_T, with each state S_t emitting the observation R_t.)
20. Basic Operations with HMMs
- Probability of weather states given observed data (inference): Forward-Backward (sketch below)
- Model parameter estimation given the data: Baum-Welch (EM)
- Most likely sequence of weather states given the data: Viterbi
[Rabiner 89]
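A minimal scaled Forward-Backward sketch in Python (NumPy), assuming the emission likelihoods have already been evaluated into a T x K table. All names are illustrative; this is not the talk's toolbox code.

```python
import numpy as np

def forward_backward(pi, A, B):
    """Posterior state probabilities for an HMM.

    pi : (K,)   initial state distribution
    A  : (K, K) transition matrix, A[i, j] = P(S_t = j | S_{t-1} = i)
    B  : (T, K) emission likelihoods, B[t, i] = P(R_t | S_t = i)
    Returns gamma (T, K) with gamma[t, i] = P(S_t = i | R_1..R_T).
    """
    T, K = B.shape
    alpha = np.zeros((T, K))          # scaled forward messages
    beta = np.zeros((T, K))           # scaled backward messages
    scale = np.zeros(T)

    alpha[0] = pi * B[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]

    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1]) / scale[t + 1]

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```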
21. States for 4-state HMM
[Robertson, Kirshner, and Smyth 04]
22. Weather State Evolution
[Robertson, Kirshner, and Smyth 04]
23. Generalizations to HMMs: Auto-regressive HMM (AR-HMM)
- Explicitly models temporal first-order dependence of rainfall: the emission for R_t also conditions on R_{t-1}
24. Generalizations to HMMs: Non-homogeneous HMM (NHMM)
- Incorporates atmospheric variables
- Allows non-stationary and oscillatory behavior
[Hughes and Guttorp 94; Bengio and Frasconi 95]
25. Parameter Estimation
- Find Θ maximizing P(r | Θ) (ML) or P(Θ | r) (MAP)
- Cannot be done in closed form
- EM (Baum-Welch for HMMs)
  - E-step: compute the posterior state probabilities P(S_t | r, Θ) with Forward-Backward; calculate the expected complete-data log-likelihood
  - M-step: maximize the expected complete-data log-likelihood over Θ
    - Can be split into maximization of emission and transition parameters (transition update sketched below)
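As a sketch of the M-step split mentioned above, here is the transition-parameter update from expected transition counts, assuming the Forward-Backward sketch earlier is modified to also return its scaled alpha and beta arrays (illustrative code, not the talk's implementation).

```python
import numpy as np

def update_transitions(alpha, beta, A, B):
    """One Baum-Welch M-step for the transition matrix: accumulate
    posterior transition probabilities xi_t(i, j) and renormalize.
    alpha, beta are the scaled messages; B[t, i] = P(R_t | S_t = i)."""
    T, K = B.shape
    counts = np.zeros((K, K))
    for t in range(T - 1):
        xi = alpha[t][:, None] * A * (B[t + 1] * beta[t + 1])[None, :]
        counts += xi / xi.sum()        # normalize: posterior over (i, j)
    return counts / counts.sum(axis=1, keepdims=True)
```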
26. Modeling Approaches
- Use HMMs
  - Transition probabilities for temporal dependence
  - Emissions (hidden state distributions) for spatial or multivariate dependence (and additional temporal dependence)
- Emphasis on categorical-valued data
- Transitions and emissions can be specified separately
  - Covers the cross-product of models
27. Modeling Approaches (contd)
- Use HMMs
- Possible emission distributions
  - Conditional independence
  - Chow-Liu trees [Chow and Liu 68], conditional Chow-Liu forests [Kirshner et al 04]
  - Markov Random Fields
    - Maximum entropy models [e.g., Jelinek 98], Boltzmann machines [e.g., Hinton and Sejnowski 86], thin junction trees [Bach and Jordan 02]
  - Belief Networks
    - Sigmoidal belief networks [Neal 92]
- Possible transition distributions
  - Non-homogeneous mixture (mixture of experts [Jordan and Jacobs 94])
  - Stationary transition matrix
  - Non-homogeneous transition matrix [Hughes and Guttorp 94; Meila and Jordan 96; Bengio and Frasconi 95]
28. HMM-CI
[e.g., Zucchini and Guttorp 91; Hughes and Guttorp 94]
29. Why Use HMM-CI?
- Simple and efficient
- O(TKM) for inference and for parameter estimation
- Small number of free parameters
- Can handle missing data
- Can be used to model amounts
30. HMM-CI for Amounts
- Types of mixture components (an example density follows below)
  - Gamma [Bellone 01]
  - Exponentials [Robertson et al 06]
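As an illustration of such a component, a mixture-of-exponentials density for positive rainfall amounts might look like the following sketch. The actual parameterization in [Robertson et al 06] may differ; in particular, a full amounts model also needs a point mass at zero for dry days.

```python
import numpy as np

def mixture_exp_pdf(r, weights, rates):
    """Density of a mixture of exponentials at amounts r > 0:
    sum_j weights[j] * rates[j] * exp(-rates[j] * r)."""
    r = np.asarray(r, dtype=float)[..., None]
    w, lam = np.asarray(weights), np.asarray(rates)
    return (w * lam * np.exp(-lam * r)).sum(axis=-1)
```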
31. Why Not HMM-CI?
- Does not match spatial correlations or persistence well
- Models spatial correlation implicitly through hidden states
- May require large K to model regions with a moderate number of stations
32. HMM-Autologistic
[Hughes, Guttorp, and Charles 99]
33. What about HMM-Autologistic?
- Sure!
  - Models spatial correlations very well
  - Can use sampling or approximate schemes to compute the normalization constant and to update parameters
- Not so sure
  - Complexity of exact computation is exponential in M
  - What about temporal dependence?
  - May have too many free parameters if not constrained
  - Does not handle missing values (or handles them very slowly)
34. Neither Here nor There
- HMM-CI: efficient but too simplistic
- HMM-Autologistic: more capable but computationally more cumbersome
- Want something in between
  - Computationally tractable
  - Emission spatial dependence
  - Additional temporal dependence
  - Missing values
35. Bayesian Networks and Trees
- Tree-structured distributions
  - Chow-Liu trees (spatial dependence) [Chow and Liu 68]
    - With HMMs [Kirshner et al 04]
  - Conditional Chow-Liu forests (spatial and temporal dependence) [Kirshner et al 04]
- Markov (undirected) and Bayesian (directed) networks
  - MaxEnt (logistic)
  - Conditional MaxEnt
  - Sigmoidal belief networks [Neal 92]
  - Would need to estimate both the parameters and the structure
36. Chow-Liu Trees
- Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]
- Maximizing the log-likelihood is equivalent to solving a maximum spanning tree (MST) problem
- Can find both the tree structure and the parameters in one swoop!
- Finding the MST is quadratic in the number of nodes [Kruskal 59]
- Edge weights are pairwise mutual information values, which quantify pairwise dependence (see the sketch below)
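A compact sketch of the construction for binary occurrence data: estimate pairwise mutual information from the data, then grow a maximum spanning tree over it (Prim-style here for brevity; function names are illustrative, not the paper's code).

```python
import numpy as np
from itertools import combinations

def chow_liu_tree(X):
    """Chow-Liu tree for binary data X (T samples x M variables).
    Returns (parent, child) edges of the maximum spanning tree
    under pairwise mutual information weights."""
    T, M = X.shape
    mi = np.zeros((M, M))
    # Pairwise MI from empirical 2x2 joint distributions.
    for a, b in combinations(range(M), 2):
        joint = np.zeros((2, 2))
        for u in (0, 1):
            for v in (0, 1):
                joint[u, v] = np.mean((X[:, a] == u) & (X[:, b] == v))
        pa, pb = joint.sum(axis=1), joint.sum(axis=0)
        nz = joint > 0
        mi[a, b] = mi[b, a] = np.sum(
            joint[nz] * np.log(joint[nz] / np.outer(pa, pb)[nz]))
    # Maximum spanning tree, grown greedily from node 0.
    in_tree, edges = {0}, []
    while len(in_tree) < M:
        best = max(((i, j) for i in in_tree for j in range(M)
                    if j not in in_tree), key=lambda e: mi[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges
```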
37. Learning Chow-Liu Trees
(Example pairwise mutual information values from the slide: 0.3126, 0.0229, 0.0172, 0.0230, 0.0183, 0.2603.)
38. Chow-Liu Trees
- Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]
- Properties
  - Efficient: O(TM²B²)
  - Optimal
  - Can handle missing data
- Mixture of trees [Meila and Jordan 00]
  - More expressive than trees yet with a simple estimation procedure
- HMMs with trees [Kirshner et al 04]
39. HMM-Chow-Liu
[Kirshner et al 04]
40. Tree-structured Emissions for Amounts
(Diagram: for hidden state S_t, a tree-structured emission over per-station variables O_t^1..O_t^4 with associated amounts R_t^1..R_t^4.)
41. Improving on Chow-Liu Trees
- Tree edges with low MI add little to the approximation.
- Observations from the previous time point can be more relevant than those from the current one.
- Idea: build the Chow-Liu tree allowing it to include variables from both the current and the previous time point.
42. Conditional Chow-Liu Forests
- Extension of Chow-Liu trees to conditional distributions
- Approximation of a conditional multivariate distribution with a tree-structured distribution
- Uses MI to build maximum spanning (directed) trees (a forest); see the sketch below
- Variables of two consecutive time points serve as nodes
  - All nodes corresponding to the earlier time point are considered connected before the tree construction
- Same asymptotic complexity as Chow-Liu trees
- Optimal (within the class of structures)
[Kirshner et al 04]
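Relative to the Chow-Liu sketch above, the main structural change is the initialization: all time-(t-1) nodes start out "in the tree", so the greedy pass only adds edges that reach time-t nodes. A minimal sketch under that assumption; the published algorithm also prunes to a forest and directs the edges.

```python
def conditional_chow_liu(mi, M):
    """Greedy maximum-spanning construction over 2M nodes, where
    indices 0..M-1 are the time t-1 variables (pre-connected) and
    M..2M-1 are the time t variables; mi is a (2M, 2M) MI matrix."""
    in_tree = set(range(M))            # earlier time point: connected
    edges = []
    while len(in_tree) < 2 * M:
        best = max(((i, j) for i in in_tree
                    for j in range(M, 2 * M) if j not in in_tree),
                   key=lambda e: mi[e])
        edges.append(best)
        in_tree.add(best[1])
    return edges
```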
43. Example of CCL-Forest Learning
(Example pairwise mutual information values from the slide: 0.3126, 0.0229, 0.0230, 0.1207, 0.1253, 0.0623, 0.1392, 0.1700, 0.0559, 0.0033, 0.0030, 0.0625.)
44. HMM-Conditional-Chow-Liu
(Diagram: HMM whose emissions are conditional Chow-Liu forests linking consecutive time points.)
[Kirshner et al 04]
45. Beyond Trees
- Can learn more complex structure
  - Optimality not guaranteed [Chickering 96; Srebro 03]
  - Structure and parameters may have to be learned in separate computations
  - Computationally expensive
- Independence model matches all univariate marginals
- Chow-Liu trees match all univariate and some bivariate marginals
- Unconstrained Bayesian or Markov networks
  - May have too few data points for the number of parameters
  - Even 3rd-order cliques may have zero probability mass
46. Log-linear or Logistic
(Diagram: example Markov network over variables a, b, c, d.)
47. Maximum Entropy Method
- Given
  - Target distribution (empirical)
  - Set of features and corresponding constraints
- Example: a feature is 1 when it rains at both station 1 and station 2
  - Corresponding constraint
  - Interpretation: the proportion of time it rains simultaneously at stations 1 and 2 is the same for the historical data and according to the learned distribution
- Want to satisfy all of the constraints
[e.g., Jelinek 98]
48. MaxEnt Method (contd)
- Maximize the entropy of p subject to the constraints corresponding to the features
- Exponential form (see the math block below)
- The distribution of this form satisfying all of the constraints for the features maximizes the log-likelihood of the data! [e.g., Della Pietra et al 97]
- Such a solution is unique (the likelihood is concave)
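In symbols, a standard reconstruction of the slide's elided formulas, with f_k the features and λ_k their weights:

```latex
\max_{p}\; H(p)
\quad \text{s.t.} \quad
\mathbb{E}_{p}[f_k] = \mathbb{E}_{\tilde p}[f_k],\; k = 1,\dots,K
\;\;\Longrightarrow\;\;
p_{\lambda}(x) = \frac{1}{Z(\lambda)}
\exp\!\Big(\sum_{k} \lambda_k f_k(x)\Big)
```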
49. HMM-Autologistic
[Hughes, Guttorp, and Charles 99]
50. Conditional Log-linear Distribution
(Diagram: example conditional Markov network over variables a, b, c, d, e.)
51. Conditional MaxEnt Method
- Extension of the MaxEnt distribution to conditional distributions
- Target distribution
- Set of features and corresponding constraints
- Maximize conditional entropy subject to constraints
[e.g., Lafferty et al 01]
52. Learning Parameters of MaxEnt Models
- Assume the set of features is given
- Only the free parameters (one per feature) need to be learned
- Cannot be done in closed form
- Iterative algorithms: IS, GIS, IIS, conjugate gradients [Brown 59; Darroch and Ratcliff 72; Berger et al 96; Della Pietra et al 97; Goodman 02]
- Require computation of model expectations E_p[f_k] (or similar) per iteration
  - Exact computation is exponential in the size of the largest clique in the Markov network and proportional to the size of the data
  - Needs computation of the junction tree and requires message passing [e.g., Bach and Jordan 02]
- Needs a potentially large number of iterations
- Want to reduce computation (see the sketch below)
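A minimal sketch of one such iterative scheme: plain gradient ascent on the log-likelihood for a small binary model (a simplification, not any of the cited algorithms verbatim). The exact expectation enumerates all 2^M states, which illustrates the exponential cost noted above.

```python
import numpy as np
from itertools import product

def fit_maxent(features, target, M, lr=0.5, iters=200):
    """Gradient ascent for a MaxEnt model over binary vectors of
    length M. features: list of f_k(x) -> {0, 1}; target: empirical
    expectations E_ptilde[f_k]. Gradient is target - E_p[f_k]."""
    states = [np.array(s) for s in product([0, 1], repeat=M)]
    lam = np.zeros(len(features))
    F = np.array([[f(x) for f in features] for x in states])  # |X| x K
    for _ in range(iters):
        logp = F @ lam
        p = np.exp(logp - logp.max())
        p /= p.sum()                       # exact normalization Z
        model = p @ F                      # E_p[f_k], exponential in M
        lam += lr * (np.asarray(target) - model)
    return lam
```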
53. Sigmoidal Belief Network
(Diagram: example sigmoidal belief network over variables a, b, c, d.)
[Neal 92]
54. Product of Univariate Conditional Maximum Entropy Models
- Approximate the target distribution as a product of univariate conditional MaxEnt distributions (PUC-MaxEnt); see the sketch below
- Parameters for each factor can be learned separately
- Requires summation over only a single modeled variable at a time, not over the largest clique
- No message passing required
- Intuition: a Bayesian network with factors modeled as conditional univariate MaxEnt distributions
  - Sigmoidal belief networks [Neal 92]
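A sketch of the factorized estimation: each variable gets its own univariate conditional MaxEnt (i.e., logistic) factor given its parents in the network, and each factor is fit independently. Function and parameter names are illustrative, not the talk's implementation.

```python
import numpy as np

def fit_puc_maxent(X, parents, lr=0.1, iters=500):
    """Fit a PUC-MaxEnt-style model: each binary variable m gets a
    univariate logistic factor P(x_m | parents(m)).

    X       : (T, M) binary data
    parents : dict mapping variable -> list of parent variables
    Each factor is fit separately by gradient ascent on its own
    conditional log-likelihood."""
    T, M = X.shape
    params = {}
    for m in range(M):
        pa = parents.get(m, [])
        Z = X[:, pa] if pa else np.zeros((T, 0))
        w, b = np.zeros(len(pa)), 0.0
        for _ in range(iters):
            p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))   # sigmoid
            g = X[:, m] - p                           # LL gradient
            w += lr * (Z.T @ g) / T
            b += lr * g.mean()
        params[m] = (w, b)
    return params
```

For example, `fit_puc_maxent(X, {1: [0], 2: [0, 1]})` would fit a small chain-like network over three stations.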
55. Structure Learning
- Number of possible structures is super-exponential in the number of variables
- Finding the optimal solution is NP-hard [Chickering 96]
- Need to search over possible structures
- Search (sketched below)
  - Structure modification in the outer loop
  - Parameter estimation in the inner loop
- Restricting to bivariate interactions
- Edge induction
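The outer/inner loop can be sketched generically: the `score` callback hides the inner-loop parameter estimation (e.g., refitting the PUC-MaxEnt factors), and the outer loop greedily inducts one edge at a time. Purely illustrative; the talk's actual search procedure may differ.

```python
def greedy_edge_search(candidates, score):
    """Greedy structure search: repeatedly add the candidate edge
    whose inclusion most improves score(structure); parameter
    estimation happens inside the score callback (the inner loop)."""
    structure = []
    best = score(structure)
    while True:
        gains = [(score(structure + [e]), e)
                 for e in candidates if e not in structure]
        if not gains:
            break
        top, edge = max(gains)
        if top <= best:
            break                      # no remaining edge helps
        best = top
        structure.append(edge)
    return structure
```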
56. HMM-PUC-MaxEnt
(Diagram: HMM where each hidden state S_t has a PUC-MaxEnt emission, a directed network over the station variables R_t^1..R_t^4, with a different network per state.)
57. AR-HMM-PUC-MaxEnt
(Diagram: as in slide 56, with additional edges from the previous day's variables R_{t-1}^1..R_{t-1}^4 into the current day's R_t^1..R_t^4.)
58. Experimental Setup
- Data
  - Australia
    - 15 seasons, 184 days each, 30 stations
  - Queensland
    - 40 seasons, 197 days each, 11 stations
- Measuring predictive performance
  - Choose K (number of states)
  - Leave-n-out cross-validation
- Evaluation metrics
  - Log-likelihood
  - Error for prediction of a single entry given the rest
  - Difference in spatial correlation
  - Difference in persistence
59. Southwestern Australia
1978-1992, May-October: 15 seasons, 184 days, 30 stations
60. Scaled out-of-sample log-likelihood (SW Australia)
61. Out-of-sample predictive error (SW Australia)
62. Examples of Weather States (HMM-CI)
63. Examples of Weather States (HMM-CL)
64. Examples of Weather States (HMM-PUC-MaxEnt)
65. Queensland (Northeastern Australia)
1958-1998, October-April: 40 seasons, 197 days, 11 stations
66. Correlation and Persistence of Queensland Data
67. Scaled out-of-sample log-likelihood (Queensland)
68. Out-of-sample correlation difference (Queensland)
69. Out-of-sample persistence difference (Queensland)
70. Summary
- Important and interesting application
  - Lots of data available, lots of problems to be solved
- Use tree-structured distributions
  - Can find parameters and structure at the same time
- If trees are not sufficient, prepare cycle servers
  - Learning complexity jumps once loops are introduced
71. Contributions
- New models for multi-site rainfall occurrence and amounts
  - Conditional Chow-Liu forest model for multivariate data [Kirshner, Smyth, and Robertson, UAI-2004]
  - HMM with Chow-Liu and conditional Chow-Liu trees for modeling multivariate time series [Kirshner, Smyth, and Robertson, UAI-2004]
  - HMM with Product-of-Univariate-Conditional MaxEnt distributions (PUC-MaxEnt) [Kirshner 2005]
  - HMM with mixtures of exponentials [Robertson et al, in press]
  - HMM with tree-structured mixtures
72. Software
- (M)ulti(V)ariate (N)onhomogeneous (H)idden (M)arkov (M)odels Toolbox
- Free software for multivariate time series modeling with an HMM as the backbone
- Large selection of implemented emission distributions
http://www.datalab.uci.edu/software/mvnhmm
73. Future Work
- Rainfall
  - Filling in missing data
  - Modeling large regions
    - Factorized state space
  - Using satellite data
    - OLR fields
  - Subseasonal predictions
    - Selecting good input variables
  - Other models for amounts
- Machine Learning
  - Learning the structure of the distribution from data
  - Modeling in the presence of missing data
  - Loops in HMM-Conditional-Chow-Liu and log-linear models
  - Factorized state-space models
  - Continuous hidden-state models
  - Modeling of multivariate real-valued non-Gaussian distributions
74. Correlation for 4-state HMM-CI
[Robertson et al 04]
75. Persistence for 4-state HMM-CI
[Robertson et al 04]
76. Inference for NHMMs
- Inference: calculating the posterior state probabilities given the observations and the atmospheric inputs
- Forward-Backward: recursively compute the forward and backward probabilities
77. Forecasting Precipitation
- Can we use this model for forecasting?
- Same predicted expected values, no variability
- Need additional information about the seasons to be forecasted
78. HMM-CI: Is It Sufficient?
- Simple yet effective
- Few parameters
- Implicit marginal spatial dependency through the hidden states
- Requires a large number of hidden states
- Points to exploration of dependency models
79. Limitations of Chow-Liu Structures