Between and beyond: Irregular series,
interpolation, variograms, and smoothing
  • Nicholas J. Cox

Mind the gap!
  • Repeated reminder, London Underground.

Executive summary
  • A new program mipolate for several kinds of
    interpolation is now available.
  • It can be downloaded from SSC (3 September 2015).
  • Variograms are useful for examining dependence
    structure in time and spatial series.
  • Work is in progress on a new program vgram for
    variograms.

Irregular series
  • Irregular series are series in which non-missing
    values are not all equally spaced.
  • Special case: Values would be equally spaced
    (every day, every year, ...), but there are some
    gaps with missing values, for human or inhuman
    reasons.
  • General case: Values are just at known times or
    points with no necessary rules about spacing.
  • Irregular series often seem to invite
    interpolation.

Luke Howard (1772–1864)
  • Best remembered for his nomenclature for clouds
  • (cumulus, stratus, cirrus and so forth).
  • Here we use as sandbox some of his temperature
    data from Plaistow, near London, in 1807.

  • Howard, Luke. 1818.
  • The Climate of London, Deduced from
    Meteorological Observations, Made at Different
    Places in the Neighbourhood of the Metropolis.
  • Volume I.
  • London: W. Phillips, etc.

Series of events
  • N.B. We are not talking here about series of
    events or realisations of point processes.
  • In such series occurrences are typically
    irregularly spaced, but the gaps are inherent in
    the process,
  • not a failing of our data.
  • Examples range from eruptions to elections.

Interpolation
  • Interpolation is the art of reading between the
    lines.
  • Historically, it is a deterministic process,
    often a matter of going beyond printed tables of
    functions (logarithmic, trigonometric, and so
    forth).
  • In principle, we should worry about the
    statistical properties of interpolation. It is
    local prediction.
  • In practice, imputation now appears better known
    among statistical researchers.

Interpolation in (official) Stata
  • The ipolate command for linear interpolation
    (and extrapolation) was added in Stata 3.1
  • The Mata functions spline3() and spline3eval()
    were added in Stata 9.0 (2005).

User-written programs on SSC
  • Programs (NJC) have been available from SSC for
  • cubic interpolation: cipolate (2002)
  • cubic spline interpolation: csipolate (2009)
  • piecewise cubic Hermite interpolation:
    pchipolate (2012)
  • nearest neighbour interpolation:
    nnipolate (2012)
  • A combined and extended program mipolate is now
    available too.

Two dimensions too
  • Note also bipolate (Joseph Canner, SSC) (2014).
  • By default it uses quintic polynomials.
  • Other available methods include thin plate
    splines and Shepard's method.
  • Note also twoway contour.

mipolate generalises ipolate
  • Interpolation is of yvar with respect to
    specified xvar.
  • Prior tsset or xtset is not assumed.
  • Regular spacing is not assumed.
  • Multiple values of yvar at the same xvar are
    averaged first.
  • Groupwise operations using by are supported.

Linear and cubic
  • Linear interpolation just uses previous and
    following known values (only).
  • This is done by ipolate, and also mipolate by
    default.
  • Cubic interpolation is another classic method,
    using two previous and two following known values.
  • This is done by mipolate, cubic.
  • The default of mipolate with either method
  • (as with ipolate) is not to extrapolate.
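
As a rough illustration of the two ideas above (not the code inside ipolate or mipolate), the sketch below interpolates at a single point in plain Python. Linear uses the one previous and one following known value; cubic fits a Lagrange polynomial through the two previous and two following known values. The function names, the assumption of distinct x values, and the choice to return NaN when too few neighbours exist on either side are all assumptions made for this sketch.

```python
import math

def lagrange(xs, ys, x):
    """Evaluate at x the polynomial passing through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def interpolate(x_known, y_known, x_new, method="linear"):
    """Interpolate at x_new; spacing of x_known may be irregular.

    Assumes distinct x values. Returns NaN outside the range of known x
    (no extrapolation), and also, as an assumption of this sketch, when the
    cubic method lacks two known values on either side.
    """
    pairs = sorted(zip(x_known, y_known))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    if x_new < xs[0] or x_new > xs[-1]:
        return math.nan
    if x_new in xs:
        return ys[xs.index(x_new)]
    # Index of the first known x above x_new.
    above = next(i for i, xv in enumerate(xs) if xv > x_new)
    half = 1 if method == "linear" else 2
    lo, hi = above - half, above + half
    if lo < 0 or hi > len(xs):
        return math.nan
    return lagrange(xs[lo:hi], ys[lo:hi], x_new)
```

For example, interpolate([1, 2, 5], [10, 20, 50], 3) uses only the known values at 2 and 5, while method="cubic" needs four neighbours and so gives NaN near the ends of a short series.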

Un peu d'histoire
  • Cubic interpolation, as a particular kind of
    polynomial interpolation, is often attributed to
    Joseph-Louis Lagrange (1736–1813) but was
    proposed earlier by Edward Waring (1735?–1798).
  • In fact there is a long history of work with
    contributions by many outstanding mathematicians,
    not least Isaac Newton (1643–1727) and Leonhard
    Euler (1707–1783).

Lagrange, Waring

Cubic splines
  • As before, we are using cubic polynomials
    locally, but they are constrained to join
    smoothly.
  • The syntax is mipolate, spline.
  • This is merely a wrapper for the official Mata
    functions.
  • As before, the default of mipolate with this
    option is not to extrapolate.

Linear extrapolation
  • As with ipolate linear extrapolation is available
    as an option in mipolate to fill in missings at
    the end of series.
  • What your teachers told you is true:
    extrapolation is dangerous.
  • Don't point that straight line! It can go off.
  • (Allude here to Mark Twain on the Mississippi.)

Piecewise cubic Hermite interpolation
  • This method also uses piecewise cubics joining
    smoothly. The syntax is mipolate, pchip.
  • The interpolant is shape-preserving and cannot
    overshoot locally.
  • Sections in which yvar is increasing, decreasing
    or constant with xvar remain so after
    interpolation.
  • Hence local maxima and minima also remain so.
  • This interpolation method also extrapolates.

Charles Hermite (1822–1901)

Inverse distance weighting
  • Interpolation can use a weighted average of known
    values, the weights being inverse powers of
    distance d from unknown value.
  • If I don't know the value at 42, 41 and 43 are
    distance 1 away, 40 and 44 distance 2, and so on.
  • For weights $d^{-p}$, the limiting case $p = 0$
    makes all weights equal, and so the interpolant is
    the overall mean, while p very large means that
    only the very nearest values have effect.
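
The behaviour at the two extremes of p can be checked with a minimal sketch in plain Python. This is an illustration of the weighting idea only, not the mipolate implementation; the function name idw and the default power of 2 are assumptions for this sketch.

```python
def idw(x_known, y_known, x_new, power=2.0):
    """Inverse distance weighted average of known values at x_new.

    Each known value gets weight d ** -power, where d is its distance
    from x_new. A known value exactly at x_new is returned as is.
    """
    numerator = 0.0
    denominator = 0.0
    for x, y in zip(x_known, y_known):
        d = abs(x - x_new)
        if d == 0:
            return y
        weight = d ** -power
        numerator += weight * y
        denominator += weight
    return numerator / denominator
```

With known values 1, 2, 4, 9 at positions 40, 41, 43, 44 and an unknown value at 42, power=0 returns the overall mean of 4, while a very large power returns essentially the mean of the two nearest values, 3.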

Other methods
  • mipolate adds forward, backward, nearest
    neighbour and groupwise interpolation
  • Use the previous, next or the nearest known
    value. Or extend the single non-missing value in
    a group to all others.
  • Using the last known value is often dubious
  • but it is a very common request in data
    management.
  • The other methods are provided partly for
    completeness.
  • There is small print (option choices) about how
    to break ties when two values are equally near.
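
A small sketch of forward, backward and nearest neighbour filling, in plain Python with None standing for a missing value. It is not the mipolate code: the function name fill and the prefer argument for breaking ties are hypothetical, chosen only to show that some tie rule has to be stated.

```python
def fill(x, y, method="forward", prefer="previous"):
    """Fill missing y (None) from known values, by position x.

    method: "forward" (previous known value), "backward" (next known value)
    or "nearest". prefer ("previous" or "next") breaks ties for "nearest";
    it is a hypothetical argument for this sketch.
    """
    known = sorted((xi, yi) for xi, yi in zip(x, y) if yi is not None)
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None:
            filled.append(yi)
            continue
        before = [k for k in known if k[0] < xi]
        after = [k for k in known if k[0] > xi]
        prev_value = before[-1][1] if before else None
        next_value = after[0][1] if after else None
        if method == "forward":
            filled.append(prev_value)
        elif method == "backward":
            filled.append(next_value)
        elif not before or not after:
            # Nearest, but known values exist on one side only (or neither).
            filled.append(prev_value if before else next_value)
        else:
            d_prev = xi - before[-1][0]
            d_next = after[0][0] - xi
            if d_prev < d_next or (d_prev == d_next and prefer == "previous"):
                filled.append(prev_value)
            else:
                filled.append(next_value)
    return filled
```

Values that cannot be filled (for example, nothing earlier to carry forward) stay as None.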

mipolate summary
  • Nine methods, and whether linear extrapolation is
    available for each:
  • linear: yes
  • cubic: yes
  • (cubic) spline: yes
  • pchip: no
  • idw: no
  • forward: no
  • backward: no
  • nearest: no
  • groupwise: no

Simple messages
  • There are many interpolation methods to choose
    from.
  • They will often disagree, even for simple-looking
    data.
  • Disagreement gives a handle on uncertainty.
  • In a real problem, simulate missings and test how
    well known values are estimated.
  • What makes most sense in your problem will
    reflect its dependence structure.
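
One way to act on the advice to simulate missings is a leave-one-out check: hide each interior known value in turn, interpolate it from its neighbours, and summarise the errors. The sketch below does this for linear interpolation only, in plain Python. The function name and the use of root mean square error as the summary are assumptions for this sketch; real gaps are often longer than one value, so this is a lower bound on the difficulty.

```python
import math

def leave_one_out_rmse(x, y):
    """Root mean square error when each interior value is hidden in turn
    and re-estimated by linear interpolation between its two neighbours.

    Assumes x is sorted ascending with distinct values and at least
    three observations.
    """
    errors = []
    for i in range(1, len(x) - 1):
        x0, x1 = x[i - 1], x[i + 1]
        t = (x[i] - x0) / (x1 - x0)
        predicted = y[i - 1] + t * (y[i + 1] - y[i - 1])
        errors.append(predicted - y[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Running the same check with each candidate method gives a direct, problem-specific comparison.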

  • We turn from a project that is done to one that
    is very much in progress.

  • Variograms (more properly semivariograms) are
    plots of
  • (mean) half difference between values squared
  • versus
  • separation, distance or lag.
  • By a tempting abuse of terminology, we often use
    the same name for the underlying relationship as
    a function.

First known use of term variogram
  • Geoffrey H. Jowett (1922– ) in 1955:
  • The comparison of means of sets of observations
    from sections of independent stochastic series.
  • Journal of the Royal Statistical Society. Series
    B (Methodological) 17: 208–227.

Spatial and time series
  • Variograms are central to one approach to spatial
    statistics, in this context often known as
    geostatistics.
  • Georges Matheron (1930–2000) is most often
    mentioned here.
  • But variograms can be very useful for time series
    too.

Time series too
  • Variograms are prominent in these texts on time
    series and longitudinal data
  • Diggle, P.J. 1990. Time Series: A Biostatistical
    Introduction. Oxford: Oxford University Press.
  • Diggle, P.J., Heagerty, P.J., Liang, K-Y. and
    Zeger, S.L. 2002. Analysis of Longitudinal Data.
    Oxford: Oxford University Press.

User-written programs
  • Programs (NJC) are available from SSC for
  • variograms in one dimension: variog (2005)
  • variograms in two dimensions: variog2 (2005)
  • A combined and extended program vgram is under
    development.

Generality of variograms
  • So, variograms are defined without undue strain
    for time series and for spatial series, whether
    regular or irregular, as they just depend on
    separation being measured.
  • Plotting the mean for each distinct separation is
    a common, but not compulsory, convention.

A simple example: webuse air2
  • vgram air, recast(connected) xla(0(12)72)
  • vgram air

Comparison at different lags
  • We are plotting half the mean squared difference
    between values compared at lags 1, 2, 3, ...
  • In this example, we have monthly data, so are
    comparing values 1, 2, 3, ... months apart.
  • Many readers may be familiar with the same idea
    for calculating autocorrelation and
    autocovariance.
  • The variogram, like the raw data plot, hints at
    a structure of trend plus seasonality.

Variograms of residuals, not data
  • Here, as elsewhere, it is a good idea to work
    with residuals, rather than the original data.
  • Time series modellers could have a happy time
    arguing which model was best for the airline
    data, but we just use a Poisson regression on
    time and look at its residuals.
  • On the versatility and virtuosity of Poisson
    regression, check out Gould, William.
  • http//

Sometimes, structure is this simple
  • Poisson regression
  • Residuals from Poisson

A little more formally
  • The semivariogram $\gamma(h)$ for response $z$ is
    given by
  • $2\gamma(h) = A\{[z(i) - z(i+h)]^2\}$
  • where $A$ denotes averaging over pairs of values
    at lag $h$.
  • As emphasised, using a mean is a convention. The
    fuller picture (literally!) is a plot of
    $[z(i) - z(i+h)]^2$ versus $h$.
  • This is often known as a variogram cloud.
  • I borrow the notation $A()$ from Whittle, P. 1970.
    Probability. Harmondsworth: Penguin.
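
The definition translates directly into a few lines of plain Python for a series held in time order with equal spacing, with None marking gaps. This is a sketch of the formula only, not the vgram code; the function name and the choice to return the number of pairs alongside each estimate are assumptions for this sketch.

```python
def semivariogram(z, max_lag):
    """Empirical semivariogram of a regularly spaced series.

    z is a list in time (or position) order; None marks a missing value.
    Returns {lag: (gamma, number_of_pairs)}, where gamma is half the mean
    squared difference over all pairs that are both known at that lag.
    """
    result = {}
    for h in range(1, max_lag + 1):
        squared = [
            (z[i] - z[i + h]) ** 2
            for i in range(len(z) - h)
            if z[i] is not None and z[i + h] is not None
        ]
        if squared:
            result[h] = (0.5 * sum(squared) / len(squared), len(squared))
    return result
```

Pairs with a missing member are simply skipped, which is why gaps in the series cause no special difficulty here.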

Where does the 2 come from?
  • The units of the semivariogram are those of the
    response squared.
  • Adding the variance to the graph as a reference
    line underlines the connection.
  • A non-standard formula for the variance is, for
    any i, j,
  • $\tfrac{1}{2} E[(z_i - z_j)^2]$.
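
A short derivation shows where the factor of 2 arises. It assumes, beyond what the slide states, that $z_i$ and $z_j$ (with $i \ne j$) share a common mean $\mu$ and variance $\sigma^2$ and are uncorrelated:

$$
E[(z_i - z_j)^2] = E[\{(z_i - \mu) - (z_j - \mu)\}^2]
= \sigma^2 + \sigma^2 - 2\,\mathrm{Cov}(z_i, z_j) = 2\sigma^2 ,
$$

so that $\tfrac{1}{2} E[(z_i - z_j)^2] = \sigma^2$. If instead values at lag $h$ have correlation $\rho(h)$, as in a weakly stationary series, the same steps give

$$
\gamma(h) = \tfrac{1}{2} E[\{z(i) - z(i+h)\}^2] = \sigma^2 \{1 - \rho(h)\},
$$

which is why the semivariogram approaches the variance at lags where dependence has died away.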

Back to vgram
  • vgram (not yet public) is already quite general.
  • We take possibilities one by one.
  • With just one argument, the response, it checks
    for a tsset or xtset time variable and uses it to
    define separations if found. Note that panel
    data are supported for free.
  • With just one argument otherwise, the order of
    the observations is taken to define position in
    time or space.

  • With two arguments, the second variable is taken
    to define position. A width() option is required
    to specify the width of bins within which
    differences squared are averaged. Equal and
    unequal spacing can thus both be accommodated.
  • With three arguments, the second and third
    variables are taken to define position. A width()
    option is required to specify the width of bins
    within which differences squared are averaged.
    Distance is calculated from coordinates using
    Pythagoras' theorem.
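
The binned calculation can be sketched in plain Python as follows. This is an illustration of the idea, not the vgram code: the function name, the rule that bin k covers distances from k * width up to (but not including) (k + 1) * width, and the return format are all assumptions for this sketch. Positions are tuples, so one coordinate handles a time series and two handle a spatial series.

```python
import math
from itertools import combinations

def binned_semivariogram(positions, values, width):
    """Empirical semivariogram with distances grouped into bins.

    positions: list of coordinate tuples, e.g. (t,) or (x, y).
    values: responses at those positions.
    width: bin width; bin k covers [k * width, (k + 1) * width),
           an assumed binning rule for this sketch.
    Returns {bin_index: (gamma, number_of_pairs)}.
    """
    sums = {}
    counts = {}
    for i, j in combinations(range(len(values)), 2):
        # Euclidean distance (Pythagoras' theorem in two dimensions).
        distance = math.dist(positions[i], positions[j])
        k = int(distance // width)
        sums[k] = sums.get(k, 0.0) + (values[i] - values[j]) ** 2
        counts[k] = counts.get(k, 0) + 1
    return {k: (0.5 * sums[k] / counts[k], counts[k]) for k in sorted(sums)}
```

Every pair of observations contributes once, so the work grows with the square of the number of observations; for the 467 Swiss stations that is 108,811 pairs.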

Why not just use autocorrelation?
  • Variograms are defined for a wider class of
    processes. Autocorrelation functions require weak
    stationarity; variograms are defined for
    processes with stationary increments.
  • Variograms are more flexible in the face of
    irregular spacing.
  • The very wide use of autocorrelation reflects
    custom and familiarity as well as intrinsic
    merit.

A further example
  • We look at rainfalls for 8 May 1986 (a single
    day) for 467 stations in Switzerland.

How much information?
  • Optionally the semivariogram results can be saved
    in vgram to new variables.
  • Keeping track of the number of pairs used at each
    lag is important.
  • Here we exploit the feature that spikeplot can
    show frequencies on a square root scale.

To do list
  • variogram clouds
  • robust estimators
  • more flexible binning
  • spherical distances too
  • direction as well as lag
  • model fitting
  • (valid functional forms)
  • use for interpolation
  • (and smoothing)
  • (kriging, Gaussian process regression)

Variogram virtues
  • Defined for time and spatial series.
  • Defined for regular and irregular series.
  • Can help identify and check for structure.
  • even if you have no interest in their most
    mentioned use, as a means towards the end of
    spatial interpolation.

This paper
  • This paper fills a much needed gap in the
    literature.
  • See Jackson, A. 1997.
  • Chinese acrobatics, an old-time brewery,
  • and the "much needed gap":
  • The life of Mathematical Reviews.
  • Notices of the American Mathematical Society
  • 44: 330–337.

  • Historical portraits: Wikipedia.
  • MATLAB code for pchip:
  • Moler, C. 2004. Numerical Computing with MATLAB.
    Philadelphia: SIAM. Chapter 3. http//www.mathwork
  • The Swiss rainfall data can be found here
  • http//

Leo Breiman (1928–2005)
  • The main thing to learn about statistics is what
    is sensible and honest and possible.
  • Doubt and suspicion, as well as technical
    knowledge, are indispensable tools in statistics.
  • 1973.
  • Statistics: With a view towards applications.
  • Boston: Houghton Mifflin, pp.1, 18.