Loading...

PPT – Between and beyond: Irregular series, interpolation, variograms, and smoothing PowerPoint presentation | free to download - id: 810430-ZDk0M

The Adobe Flash plugin is needed to view this content

Between and beyond Irregular series,

interpolation, variograms, and smoothing

- Nicholas J. Cox

Mind the gap!

- Repeated reminder, London Underground.

Executive summary

- A new program mipolate for several kinds of

interpolation is now available. - It can be downloaded from SSC (3 September 2015).

- Variograms are useful for examining dependence

structure in time and spatial series. - Work is in progress on a new program vgram for

variograms.

Irregular series

- Irregular series are series in which non-missing

values are not all equally spaced. - Special case Values would be equally spaced

(every day, every year, ), but there are some

gaps with missing values, for human or inhuman

reasons. - General case Values are just at known times or

points with no necessary rules about spacing. - Irregular series often seem to invite

interpolation.

Luke Howard (1772 1864)

- Best remembered for his nomenclature for clouds
- (cumulus, stratus, cirrus and so forth).
- Here we use as sandbox some of his temperature

data from Plaistow, near London, in 1807.

- Howard, Luke. 1818.
- The Climate of London, Deduced from

Meteorological Observations, Made at Different

Places in the Neighbourhood of the Metropolis. - Volume I.
- London W. Phillips, etc.

(No Transcript)

Series of events

- N.B. We are not talking here about series of

events, - or realisations of point processes.
- In such series occurrences are typically

irregularly spaced, but the gaps are inherent in

the process, - not a failing of our data.
- Examples range from eruptions to elections.

(No Transcript)

Interpolation

- Interpolation is the art of reading between the

lines. - Historically, it is a deterministic process,

often a matter of going beyond printed tables of

functions - (logarithmic, trigonometric, and so forth).
- In principle, we should worry about the

statistical properties of interpolation. It is

local prediction. - In practice, imputation now appears better known

among statistical researchers.

Interpolation in (official) Stata

- The ipolate command for linear interpolation

(and extrapolation) was added in Stata 3.1

(1993). - The Mata functions spline3() and spline3eval()

were added in Stata 9.0 (2005).

User-written programs on SSC

- Programs (NJC) have been available from SSC for
- cubic interpolation cipolate (2002)
- cubic spline interpolation csipolate (2009)
- piecewise cubic Hermite interpolation
- pchipolate (2012)
- nearest neighbour interpolation
- nnipolate (2012)
- A combined and extended program mipolate is now

available too.

Two dimensions too

- Note also bipolate (Joseph Canner, SSC) (2014).
- By default it uses quintic polynomials.
- Other available methods include thin plate

splines - and Shepards method.
- Note also twoway contour.

mipolate generalises ipolate

- Interpolation is of yvar with respect to

specified xvar. - Prior tsset or xtset is not assumed.
- Regular spacing is not assumed.
- Multiple values of yvar at the same xvar are

averaged first. - Groupwise operations using by are supported.

Linear and cubic

- Linear interpolation just uses previous and

following known values (only). - This is done by ipolate, and also mipolate by

default. - Cubic interpolation is another classic method,

using two previous and two following known values

(only). - This is done by mipolate, cubic.
- The default of mipolate with either method
- (as with ipolate) is not to extrapolate.

Un peu dhistoire

- Cubic interpolation, as a particular kind of

polynomial interpolation, is often attributed to

Joseph-Louis Lagrange (17361813) but was

proposed earlier by Edward Waring (1735?1798). - In fact there is a long history of work with

contributions by many outstanding mathematicians,

not least Isaac Newton (16431727) and Leonhard

Euler (17071783).

Lagrange Waring

Cubic splines

- As before, we are using cubic polynomials

locally, but they are constrained to join

smoothly. - The syntax is mipolate, spline.
- This is merely a wrapper for the official Mata

functions. - As before, the default of mipolate with this

option is not to extrapolate.

Linear extrapolation

- As with ipolate linear extrapolation is available

as an option in mipolate to fill in missings at

the end of series. - What your teachers told you is true
- extrapolation is dangerous.
- Dont point that straight line It can go off

anywhere. - (Allude here to Mark Twain on the Mississippi.)

Piecewise cubic Hermite interpolation

- This method also uses piecewise cubics joining

smoothly. The syntax is mipolate, pchip. - The interpolant is shape-preserving and cannot
- overshoot locally.
- Sections in which yvar is increasing, decreasing

or constant with xvar remain so after

interpolation. - Hence local maxima and minima also remain so.
- This interpolation method also extrapolates.

Charles Hermite (18221901)

Inverse distance weighting

- Interpolation can use a weighted average of known

values, the weights being inverse powers of

distance d from unknown value. - If I dont know the value at 42, 41 and 43 are

distance 1 away, 40 and 44 distance 2, and so on.

- For weights d-p, limiting case p 0 makes all

weights equal, and so the interpolant is the

overall mean, while p very large means that only

the very nearest values have effect.

Other methods

- mipolate adds forward, backward, nearest

neighbour and groupwise interpolation - Use the previous, next or the nearest known

value. Or extend the single non-missing value in

a group to all others. - Using the last known value is often dubious

statistically, - but it is a very common request in data

management. - The other methods are provided partly for

completeness. - There is small print (option choices) about how

to break ties when two values are equally near.

mipolate summary

- Nine methods
- linear
- cubic
- (cubic) spline
- pchip
- idw
- forward
- backward
- nearest
- groupwise

- Linear extrapolation?
- yes
- yes
- yes
- no
- no
- no
- no
- no
- no

(No Transcript)

(No Transcript)

Simple messages

- There are many interpolation methods to choose

from. - They will often disagree, even for simple-looking

instances. - Disagreement gives a handle on uncertainty.
- In a real problem, simulate missings and test how

well known values are estimated. - What makes most sense in your problem will

reflect its dependence structure.

- We turn from a project that is done to one that

is very much in progress.

Variograms

- Variograms (more properly semivariograms) are

plots of - (mean) half difference between values squared
- versus
- separation, distance or lag.
- By a tempting abuse of terminology, we often use

the same name for the underlying relationship as

a function.

First known use of term variogram

- Geoffrey H. Jowett (1922 ) in 1955
- The comparison of means of sets of observations

from sections of independent stochastic series. - Journal of the Royal Statistical Society. Series

B (Methodological) 17 208227.

Spatial and time series

- Variograms are central to one approach to spatial

statistics, in this context often known as

geostatistics. - Georges Matheron (19302000) is most often

mentioned here. - But variograms can be very useful for time series

too.

Time series too

- Variograms are prominent in these texts on time

series and longitudinal data - Diggle, P.J. 1990. Time Series A Biostatistical

Introduction. Oxford Oxford University Press. - Diggle, P.J., Heagerty, P.J., Liang, K-Y. and

Zeger, S.L. 2002. Analysis of Longitudinal Data.

Oxford Oxford University Press.

User-written programs

- Programs (NJC) are available from SSC for
- variograms in one dimension variog (2005)
- variograms in two dimensions variog2 (2005)
- A combined and extended program vgram is under

development.

Generality of variograms

- So, variograms are without undue strain

defined - for time series and for spatial series,
- whether regular or irregular,
- as they just depend on separation being measured.

- Plotting the mean for each distinct separation is

- a common, but not compulsory, convention.

A simple example webuse air2

Variograms

- vgram air, recast(connected) xla(0(12)72)

- vgram air

Comparison at different lags

- We are plotting mean squared differences between

values compared at lags 1, 2, 3, - In this example, we have monthly data, so are

comparing values 1, 2, 3, months apart. - Many readers may be familiar with the same idea

for calculating autocorrelation and

cross-correlation. - The variogram like the raw data plot hints at

a structure of trend plus seasonality.

Variograms of residuals, not data

- Here, as elsewhere, it is a good idea to work

with residuals, rather than the original data. - Time series modellers could have a happy time

arguing which model was best for the airline

data, but we just use a Poisson regression on

time and look at its residuals. - On the versatility and virtuosity of Poisson

regression, check out Gould, William. - http//blog.stata.com/2011/08/22/use-poisson-rathe

r-than-regress-tell-a-friend/

Sometimes, structure is this simple

- Poisson regression

- Residuals from Poisson

A little more formally

- The semivariogram ?(h) for response z is

given by - 2 ?(h) A z(i) - z(i h)2
- where A denotes averaging over pairs of values

at lag h. - As emphasised, using a mean is a convention. The

fuller picture (literally!) is a plot of z(i) -

z(i h)2 versus h. - This is often known as a variogram cloud.
- I borrow the notation A() from Whittle, P. 1970.

Probability. Harmondsworth Penguin.

Where does the 2 come from?

- The units of the semivariogram are those of the

response squared. - Adding the variance to the graph as a reference

line underlines the connection. - A non-standard formula for the variance is, for

any i, j, - (1/2) E (zi - zj)2 .

Back to vgram

- vgram (not yet public) is already quite general.
- We take possibilities one by one.
- With just one argument, the response, it checks

for a tsset or xtset time variable and uses it to

define separations if found. Note that panel

data are supported for free. - With just one argument otherwise, the order of

the observations is taken to define position in

time or space.

- With two arguments, the second variable is taken

to define position. A width() option is required

to specify the width of bins within which

differences squared are averaged. Equal and

unequal spacing can thus both be accommodated. - With three arguments, the second and third

variables are taken to define position. A width()

option is required to specify the width of bins

within which differences squared are averaged.

Distance is calculated from coordinates using

Pythagoras theorem.

Why not just use autocorrelation?

- Variograms are defined for a wider class of

processes. Autocorrelation functions require weak

stationarity variograms are defined for

processes with stationary increments. - Variograms are more flexible in the face of

irregular spacing. - The very wide use of autocorrelation reflects

custom and familiarity as well as intrinsic

merit.

A further example

- We look at rainfalls for 8 May 1986 (a single

day) for 467 stations in Switzerland.

(No Transcript)

(No Transcript)

How much information ?

- Optionally the semivariogram results can be saved

in vgram to new variables. - Keeping track of the number of pairs used at each

lag is important. - Here we exploit the feature that spikeplot can

show frequencies on a square root scale.

(No Transcript)

To do list

- variogram clouds
- robust estimators
- more flexible binning
- spherical distances too
- direction as well as lag

- model fitting
- (valid functional forms)
- use for interpolation
- (and smoothing)
- (kriging, Gaussian process regression)

Variogram virtues

- Defined for time and spatial series.
- Defined for regular and irregular series.
- Can help identify and check for structure.
- even if you have no interest in their most

mentioned use, as a means towards the end of

spatial interpolation.

This paper

- This paper fills a much needed gap in the

literature. - See Jackson, A. 1997.
- Chinese acrobatics, an old-time brewery,
- and the much needed gap
- The life of Mathematical Reviews.
- Notices of the American Mathematical Society
- 44 330337.

Acknowledgments

- Historical portraits Wikipedia.
- MATLAB code for pchip
- Moler, C. 2004. Numerical Computing with MATLAB.

Philadelphia SIAM. Chapter 3. http//www.mathwork

s.com/moler/interp.pdf) - The Swiss rainfall data can be found here
- http//www.ai-geostats.org/pub/AI_GEOSTATS/AI_GEOS

TATSData/sic97data_01.zip

Leo Breiman (19282005)

- The main thing to learn about statistics is what

is sensible and honest and possible. - Doubt and suspicion, as well as technical

knowledge, are indispensable tools in statistics. - 1973.
- Statistics With a view towards applications.
- Boston Houghton Mifflin, pp.1, 18.