Title: Spatial Modelling 1 : Incorporating spatial modelling in a random effects structure
1Lecture 22
- Spatial Modelling 1 Incorporating spatial
modelling in a random effects structure
2Lecture Contents
- Introduction to spatial modelling
- Nested random effect levels
- House price dataset
- Including distance as a fixed effect
- Direction effects
- Focused clustering (Falkirk dataset)
3Spatial statistical modelling
- Here we require a statistical approach that
accounts for the spatial location at which a
response is collected. This means that the model
fitted to the data needs to account for spatial
effects.
- This may be to account for any effects due to
location in the model, or to predict values of the
response at other locations via some form of
interpolation that accounts for other predictor
variables and/or the spatial location.
4Types of spatial data
- There are many forms of spatial data but we can
broadly divide these into three types (Cressie
1993):
- Geostatistical data: here measurements are taken
at a fixed number of chosen locations in a
geographical area.
- Lattice data: here measurements are taken on a
regular lattice, with one measurement collected at
each point of the lattice.
- Point process data: here each observation is a
response together with the co-ordinates of the
location at which it occurred.
5Geostatistical data
- Such data are collected in various fields,
particularly mining and earth sciences.
- A measurement, e.g. percentage of coal ash, is
taken at each of a number of locations.
- Methods such as variograms and spatial Kriging
are used to analyse such data.
- Other application areas include weather maps and
agricultural field trials.
- Note that such data are not ideally suited to
standard random effect modelling.
6Disease mapping
- One particular type of spatial modelling that is
often linked with random effect modelling is
disease mapping.
- Here cases of a disease (either human or animal)
are observed over a chosen region, e.g. a country.
We then wish to infer the relative risk of the
disease for a particular individual at a
particular location based on the data collected.
- Both our practicals this afternoon will consider
disease mapping datasets. The other two types of
spatial data (lattice and point process) both
arise with disease data.
7Lattice Data
- Such data is common in many fields, for example
image analysis, where the pixels in an image are
found on a regular rectangular lattice.
- More importantly, we will consider disease count
data where counts of a disease are recorded for
contiguous regions on a map.
- Although a map is not regular, we can construct a
lattice from a map by identifying neighbouring
regions and linking them to form a lattice.
8Example
Here we see a map of 5 regions in the left hand
picture, and on the right it has been converted
to a lattice with connections between regions
that share boundaries.
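The conversion from map to lattice can be sketched in Python. This is a minimal illustration and not the map shown on the slide: the five region labels A to E and the small raster grid are hypothetical, and two regions are treated as neighbours when any of their cells share an edge.

```python
# Hypothetical 4x4 raster "map": each cell holds the label of its region.
REGION_MAP = [
    "AABB",
    "AACB",
    "DCCE",
    "DDEE",
]


def build_lattice(grid):
    """Return {region: set of neighbouring regions} for regions sharing an edge."""
    neighbours = {}
    n_rows, n_cols = len(grid), len(grid[0])
    for r in range(n_rows):
        for c in range(n_cols):
            here = grid[r][c]
            neighbours.setdefault(here, set())
            # Only look right and down, so each shared edge is visited once.
            for rr, cc in ((r, c + 1), (r + 1, c)):
                if rr < n_rows and cc < n_cols and grid[rr][cc] != here:
                    other = grid[rr][cc]
                    neighbours[here].add(other)
                    neighbours.setdefault(other, set()).add(here)
    return neighbours


lattice = build_lattice(REGION_MAP)
for region in sorted(lattice):
    print(region, "->", sorted(lattice[region]))
```

The resulting neighbour sets are symmetric by construction, which is the property a lattice-based spatial model relies on.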
9Point process data
- This data is also commonly found in disease
mapping, although it may arise in many applications
where cases of an event are seen at particular
locations.
- Each item of data consists of the location of an
event, the response (type of event) and
potentially predictor variables for the event.
- Note Rasmus has worked more extensively in this
area and will be happy to answer questions here.
10Disease point process modelling
- In disease mapping our data is typically binary,
i.e. people are infected by (or die from) a disease
or are not.
- The data occur in point process form but there
are 2 problems with analysing them as a point
process:
1. All our responses are 1 as we only observe the
infected/dead people!
2. Due to confidentiality and the sensitive nature
of medical data, the data often cannot be released
as individual records.
- To counter point 1 we could sample control cases
at random from the population; however, point 2
means that we typically total up cases for fixed
areas and use a Poisson model on the lattice data
that this creates.
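Totalling point cases into area counts can be sketched as follows. The case locations and the use of 1 km squares as "areas" are hypothetical, chosen only to show the aggregation step; real analyses would use administrative areas such as census districts.

```python
from collections import Counter

# Hypothetical case locations (x, y) in km; not real data.
cases = [(0.4, 0.7), (0.9, 0.2), (1.3, 0.5), (1.8, 1.6), (0.1, 1.2), (1.7, 1.9)]


def area_of(x, y, width=1.0):
    """Label the square area containing (x, y) by its column and row index."""
    return (int(x // width), int(y // width))


# One count per area: this is the lattice-type response for a Poisson model.
counts = Counter(area_of(x, y) for x, y in cases)
for area in sorted(counts):
    print(area, counts[area])
```

Areas with no cases do not appear in the counter, so in practice the full list of areas would be needed to record the zero counts as well.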
11Why might there be spatial effects?
- This depends on the response variable and
application area.
- It is possible that geography is itself a
predictor for our response or is a surrogate for
other factors.
- Many factors that might influence the response
can be linked to location, e.g. weather,
deprivation, altitude, pollution, wealth.
- So if our response is influenced by any of these
factors then accounting for spatial effects may
improve our model.
12Nested random effects/ levels of geography
- The simplest link to random effect models is to
consider nested random effects.
- We have considered pupils nested in schools and
cows nested in herds.
- In some sense the schools and herds are spatial
units, in that schools generally take children
from their locality and a herd is based on a
particular farm. However, we could also fit the
area where the pupils live as another
classification of the data, which is more spatial.
- On the next slide we consider a dataset with more
levels of geography.
13UK house prices dataset
- An MMath student of mine (David Goodacre) studied
a dataset of house prices in the UK. The data,
supplied by the Nationwide Building Society,
consist of average house prices in areas of the
UK over a 12-year period (1992-2003). The data are
for 753 towns in the UK and there are 3 levels of
geography (towns nested in counties nested in
regions).
- Note that if we had individual house sale
information then we could have considered point
process approaches, but here we consider random
effect modelling.
14A 4-level VC model for the house price dataset
- A 4-level variance components (VC) model was
fitted to the data, where i indexes year, j
indexes town, k indexes county and l indexes
region. The response, y, is the log of the average
price.
- This model can be fitted using frequentist
(likelihood-based) methods in packages that allow
four levels in the model.
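The model equation itself is not reproduced in this text. A form consistent with the index definitions above and with the parameters reported for this dataset (an intercept, year and year-squared terms, and four variance components) is sketched below. It is a reconstruction rather than the slide's exact notation, and the assignment of f, v, u and e to region, county, town and year-level residual is an assumption based on the usual MLwiN ordering.

```latex
y_{ijkl} = \beta_0 + \beta_1\,\mathrm{year}_{ijkl} + \beta_2\,\mathrm{year}^2_{ijkl}
           + f_l + v_{kl} + u_{jkl} + e_{ijkl},
\qquad
f_l \sim N(0, \sigma^2_f),\quad
v_{kl} \sim N(0, \sigma^2_v),\quad
u_{jkl} \sim N(0, \sigma^2_u),\quad
e_{ijkl} \sim N(0, \sigma^2_e).
```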
15Links with other topics
- It is worth noting that this house price dataset
is a repeated measures dataset of the kind you
considered yesterday.
- It also contains missing data, since any year in
which there were fewer than 50 sales in a postal
town gives a missing observation.
- However, we here assume the data are missing at
random (MAR) conditional on the model we are
fitting.
16Estimates for house price dataset
- Below are given IGLS estimates for the model:

Parameter   Estimate (SE)
β0          4.036 (0.067)
β1          -0.020 (0.002)
β2          0.009 (0.0001)
σ²f         0.045 (0.021)
σ²v         0.016 (0.004)
σ²u         0.045 (0.003)
σ²e         0.013 (0.0002)

Here we see that the model consists of parallel
curves with both year and year² very significant.
The variance is greatest between regions and
between postal towns.
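The statement about where the variance lies can be checked by expressing each variance estimate as a share of the total. The short sketch below uses the four variance estimates from the table; the level attached to each symbol (f = region, v = county, u = town, e = year within town) is an assumption based on the usual MLwiN ordering.

```python
# Variance estimates from the table, on the log-price scale.
# The level names attached to f, v, u, e are an assumed labelling.
variances = {
    "region (f)": 0.045,
    "county (v)": 0.016,
    "town (u)": 0.045,
    "year within town (e)": 0.013,
}

total = sum(variances.values())
shares = {level: var / total for level, var in variances.items()}

for level, share in shares.items():
    print(f"{level}: {share:.1%}")
```

Under this labelling, regions and towns each account for roughly 38% of the total variance, with counties and the residual sharing the remainder.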
17Region Level Effects
Here we see that the South East of the UK and
London are the most expensive, whilst Scotland,
the North and Wales are the cheapest.
18County level effects
After accounting for regions, the pattern of
county effects is more sporadic. We can however
pick out 2 counties, Cheshire in the North West
and Surrey in the South East, that are more
expensive than their neighbours.
19Region level predictions
- Here we see a graph of region level predictions
20Further Modelling
- In his project Dave looked at random slopes
models at the various levels of the model, so
that we could pick out whether the increase in
prices was different in different regions.
- He also looked at fitting models of a more
spatial nature! See next lecture.
21Why are spatial effects different?
- The main difference with spatial effects is that
we have additional information about each
(spatial) unit.
- For example, if we observe the average house
price of a town in Grampian, a town in Surrey and
2 towns in Berkshire, then we know something of
the spatial relation of these towns.
- We might expect the prices in the 2 towns in
Berkshire to be similar, and to be more similar to
those in Surrey, which is also in the South East,
than to those in Grampian, which is in Scotland.
- In our current models we will fit an effect for
Berkshire which will capture some of the
relationship between its 2 towns, and a South East
effect that will capture the link with the Surrey
town.
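How much similarity the nested effects imply can be worked out from the variance estimates reported for the house price model. The sketch below computes the implied correlation between log prices in two different towns in the same year, given the fixed effects. It assumes the variance estimates 0.045, 0.016, 0.045 and 0.013 belong to region, county, town and residual respectively, which is an assumed labelling.

```python
# Variance estimates from the fitted model (assumed labelling of levels).
SIGMA2 = {"region": 0.045, "county": 0.016, "town": 0.045, "residual": 0.013}
TOTAL = sum(SIGMA2.values())


def implied_correlation(same_region, same_county):
    """Correlation between log prices in two different towns in one year.

    Two towns share the region effect if they are in the same region, and
    additionally the county effect if they are in the same county.
    """
    shared = 0.0
    if same_region:
        shared += SIGMA2["region"]
        if same_county:
            shared += SIGMA2["county"]
    return shared / TOTAL


berkshire_pair = implied_correlation(same_region=True, same_county=True)
berkshire_surrey = implied_correlation(same_region=True, same_county=False)
berkshire_grampian = implied_correlation(same_region=False, same_county=False)
print(round(berkshire_pair, 3), round(berkshire_surrey, 3), berkshire_grampian)
```

The ordering matches the intuition in the bullets: the two Berkshire towns are the most strongly correlated, a Berkshire town and the Surrey town less so, and a town in Grampian is treated as unrelated to either.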
22Problems with the nested classification approach
- As we have seen, the nested classification
approach can capture much of the spatial
variability; however, we have to decide on the
geographic definitions of areas.
- We generally use easily available definitions,
e.g. county and region, but there is no guarantee
that these are the best classifications.
- We also have the problem of border effects: for
example, two towns on either side of a region
border will not share either region or county
effects but may have very similar prices.
- We will look at another approach here before
studying more complex spatial approaches in the
next lecture.
23Including location in fixed effects
- There may be a trend: for example, house prices
in the UK generally fall as we move North and
West. We could therefore add in two (fixed effect)
predictors giving the N/S and E/W co-ordinates of
each point.
- If the unit of observation is an area, e.g. a
postal town, we would generally use the
co-ordinates of the centroid of the unit.
- If a linear relationship is not sensible then we
could consider polynomial terms in each
direction. For example (excluding random effects)
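The example equation is not reproduced in this text. One possible form, given only as an illustration, is a quadratic in each direction, where E and N are hypothetical names for the E/W and N/S co-ordinates of unit i and the coefficients are generic:

```latex
y_i = \beta_0 + \beta_1 E_i + \beta_2 N_i + \beta_3 E_i^2 + \beta_4 N_i^2 + e_i
```

A cross-product term in E and N could be added if the trend is not aligned with the co-ordinate axes.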
24Distance effects
- Another possibility in terms of UK house prices
is to consider the distance from London. This
distance can be constructed from the co-ordinates
of each point. The graph to the left gives the
combined region and county effects and suggests a
distance from London effect might be appropriate.
25Distance and direction effects
- In some scenarios the direction as well as the
distance from a particular point is important.
- This is not the case with house prices. In
pollution data, however, direction can be very
important: a dominant wind direction will
suggest that particular directions away from the
source will experience more pollution than
others.
- We will next look at a dataset from Falkirk in
Scotland that is analysed in Lawson, Browne and
Vidal Rodeiro (2003).
26Focused Clustering
- One research area in public health looks at the
impact of sources of pollution on the health
status of communities. The detection of patterns
of health events associated with pollution
sources is known as focused clustering. The
statistical modelling involved usually relates to
the point process nature of such data.
- Lawson, Browne and Vidal Rodeiro (2003) devote a
whole chapter to focused clustering and include
some fairly complex models that can be considered
in WinBUGS. Here we will look at some simpler
models that can be fitted in MLwiN to a dataset
from Falkirk in Scotland.
27Respiratory cancer in Falkirk
- The figure to the right shows the census
geographies of 26 regions found around a foundry
(marked on the map) in Falkirk, Scotland. It is
thought conceivable that the foundry was an air
pollution hazard in the early 1970s, prior to the
study. This could have an impact on the
respiratory cancer experience of those living in
the areas close to the foundry.
28Falkirk dataset
- The data consist of observed and expected counts
of respiratory cancer cases in the time period
1978-1983.
- We first compare the standardized mortality
ratios (SMRs = observed/expected) against the
locations of the centroids of the 26 areas in
Falkirk (relative to the foundry) to look for
patterns.
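As a small worked example of the SMR calculation, the sketch below divides observed by expected counts area by area. The counts are hypothetical and are not the Falkirk data.

```python
# Hypothetical observed and expected counts for three areas
# (illustrative values only, not the Falkirk dataset).
observed = [4, 2, 7]
expected = [2.5, 3.1, 5.0]

# SMR = observed / expected; values above 1 indicate more cases than expected.
smr = [o / e for o, e in zip(observed, expected)]

for area, value in enumerate(smr, start=1):
    print(f"area {area}: SMR = {value:.2f}")
```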
29Position of the sites
- Note in the graphs to the right that the 3
highest SMRs are close to the source both in the
N/S and E/W directions.
- We can convert these locations to distance and
direction measures.
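The conversion from co-ordinates to distance and direction is a change to polar co-ordinates. A minimal sketch follows, assuming each centroid is given as an offset (east, north) from the source and that direction is expressed as a compass bearing; the function name and the bearing convention are illustrative choices, not taken from the original analysis.

```python
import math


def distance_and_bearing(east, north):
    """Convert an (east, north) offset from the source to polar form.

    Returns the straight-line distance and the compass bearing in degrees,
    measured clockwise from north and lying in [0, 360).
    """
    distance = math.hypot(east, north)
    # atan2(east, north) measures the angle clockwise from the north axis.
    bearing = math.degrees(math.atan2(east, north)) % 360.0
    return distance, bearing


# Hypothetical offsets in km: due north, due east, due west, and north-east-ish.
for offset in [(0.0, 2.0), (3.0, 0.0), (-1.0, 0.0), (3.0, 4.0)]:
    print(offset, distance_and_bearing(*offset))
```

Because a bearing is circular (359 degrees is next to 0 degrees), it is often entered into a model through its sine and cosine rather than as a raw angle.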
30Distance and direction
- Here we see that there appears to be a negative
relationship between distance and SMR but no
obvious pattern with regard to direction.
31(Extra) Poisson modelling
- We have modelled the effects of deprivation,
distance and direction in a Poisson model.
- Note that we have used 1st order MQL in MLwiN and
allowed extra-Poisson variation. The fit shows
less variation than a Poisson distribution would
give (under-dispersion), so we will also try
fitting SMR as a Normally distributed response.
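The fitted model is not reproduced in this text. A generic form for a Poisson model of this kind, with the expected count as an offset and a scale parameter for extra-Poisson variation, is sketched below. The symbols are hypothetical: O and E are the observed and expected counts in area i, dep is the deprivation score, d is the distance from the foundry and theta is the direction. Entering direction through sine and cosine terms is one common choice and is an assumption here, since the parameterisation actually used is not recoverable.

```latex
O_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \log E_i + \beta_0 + \beta_1\,\mathrm{dep}_i + \beta_2\, d_i
             + \beta_3 \cos\theta_i + \beta_4 \sin\theta_i,
\qquad
\mathrm{Var}(O_i \mid \mu_i) = \phi\,\mu_i .
```

Here a scale parameter below 1 corresponds to less variation than the Poisson distribution implies, and a value above 1 to over-dispersion.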
32Normal response model for SMR
- Here we see that none of the predictors has a
significant effect, which is probably because the
dataset is so small.
- We do see, however, that the risk reduces as
distance from the foundry increases and as the
deprivation score increases (suggesting higher
rates in less deprived areas, but not
significantly).
33Information for the practical
- In the practical we will return to using nested
random effects to account for spatial effects.
- Our data is from the European Community and
consists of male deaths from malignant melanoma
in 9 countries in the EU.
- The practical is a (modified) chapter from Browne
(2003) and looks at MCMC methods for this
dataset. It is also analysed using
quasi-likelihood methods in the MLwiN User's
Guide and you are welcome to also try these
methods.