The identification of exceptional values in the ESPON database

Description:

The identification of exceptional values in the ESPON database. Paul Harris, Martin Charlton, National Centre for Geocomputation, NUIM, Maynooth, Ireland

Slides: 31
Provided by: Harr2171
Learn more at: http://www.espon.eu
Transcript and Presenter's Notes

Title: The identification of exceptional values in the ESPON database


1
  • The identification of exceptional values in the
    ESPON database

Paul Harris, Martin Charlton
National Centre for Geocomputation, NUIM, Maynooth, Ireland
Madrid seminar - 10/6/10
2
  • Outline
  1. ESPON DB data
  2. Identifying exceptional values
  3. Case study 1 (detecting logical input errors)
  4. Case study 2 (detecting statistical outliers)
  5. Next things to do

3
  • 1. ESPON DB data
  • Socio-economic, land cover, ...
  • Continuous, categorical, nominal, ordinal, ...
  • Spatial support
  • Area units: NUTS 0/1/2/23/3
  • (whose boundaries may also change over time)
  • Temporal support
  • Commonly, yearly units (with only a short time
    series)

4
  • 2. Identifying exceptional values
  • Define two types:
  • Logical input errors
  • (e.g. a negative unemployment rate)
  • Statistical outliers
  • (e.g. an unusually high unemployment rate)
  • Two-stage identification algorithm:
  • Stage 1: identify input errors via mechanical
    techniques
  • Stage 2: identify outliers via statistical
    techniques

5
  • Stage 1
  • Identify logical input errors

6
  • Logical input errors
  • Usually detected using some logical, mathematical
    approach
  • Statistical detection may also help
  • Typical input errors
  • Impossible values (e.g. negatives, fractions)
  • Repeated data for different variables
  • Data displaced between or within columns
  • Data swapped between or within columns
  • Wrong NUTS code or name
  • Wrong NUTS regions used (e.g. for 1999 instead of
    2006)
  • Missing value code (e.g. 9999 treated as a true
    value)
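
A few of these checks are mechanical enough to sketch in code. The following is a minimal illustration, not the authors' implementation: the function names, the toy table, the 0 to 100 range for a percentage rate and the sentinel -9999 are all assumptions (the slides mention 9999 only).

```python
# Assumed missing-value sentinels; only 9999 appears on the slides.
MISSING_CODES = {9999, -9999}

def check_rate(value):
    """Return the reasons why a percentage rate looks like an input error."""
    problems = []
    if value is None:
        problems.append("missing")
    elif value in MISSING_CODES:
        problems.append("missing-value code treated as data")
    elif value < 0 or value > 100:
        problems.append("impossible value")
    return problems

def repeated_columns(table):
    """Return pairs of column names whose values are identical row by row."""
    names = sorted(table)
    return [(a, b) for i, a in enumerate(names) for b in names[i + 1:]
            if table[a] == table[b]]

def unknown_codes(codes, valid_codes):
    """Return region codes that are absent from the reference NUTS list."""
    return [code for code in codes if code not in valid_codes]

# Toy data, invented for illustration only.
table = {
    "unemp_2000": [5.2, -1.0, 9999, 7.4],
    "unemp_2001": [5.0, 6.1, 8.3, 7.9],
    "unemp_2002": [5.0, 6.1, 8.3, 7.9],  # accidental copy of the 2001 column
}
flags = {name: [check_rate(v) for v in col] for name, col in table.items()}
```

Each check only flags a value. Whether a flagged value is corrected or referred to a data expert is a separate decision.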

7
  • Our approach
  • Detect input errors mathematically (and
    statistically)
  • Flag observations if they are likely input errors
  • If possible - correct them
  • More likely - consult an expert on the data
  • Once happy - go to stage 2 - assume data is
    error-free

8
  • Stage 2
  • Identify statistical outliers

9
  • Types of outliers.

10
  • Our approach
  • There is no single best outlier detection
    technique, so:
  • Apply a representative selection of outlier
    detection techniques (which are simple and robust)
  • Flag an observation if it is a likely outlier
    according to each technique
  • Build up a weight of evidence for the likelihood
    of a given observation being statistically
    outlying
  • Suggest what type of outlier it is likely to be
  • (aspatial, spatial, temporal, relationship, or
    some mixture)
  • Consult an expert on the data to decide on the
    appropriate course of action
  • Here's an example using nine techniques and three
    observations

11
Identification technique | Identification type | Flags across Obs. 1 to 3
1. Boxplot statistics | Aspatial univariate | Yes, Yes
2. Hawkins spatial test statistic | Spatial univariate | Yes
3. Time series statistics | Temporal univariate | Yes, Yes
4. Large residuals from multiple linear regression | Aspatial multivariate, linear relationships | Yes, Yes
5. Large residuals from locally weighted regression | Aspatial multivariate, nonlinear relationships | Yes
6. Large residuals from geographically weighted regression | Spatial multivariate, nonlinear relationships | Yes
7. Principal component analysis | Aspatial multivariate, linear relationships | Yes
8. Locally weighted principal component analysis | Aspatial multivariate, nonlinear relationships | Yes
9. Geographically weighted principal component analysis | Spatial multivariate, nonlinear relationships | Yes
Note: can have a spatial, univariate form if the coordinate data are used as variables
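
The weight-of-evidence idea can be sketched as a simple tally over the nine techniques. This is an assumed formulation, not one stated on the slides: the technique keys, the type labels and the example flags are hypothetical, and the score is just the share of techniques that flag an observation.

```python
from collections import Counter

# Outlier type targeted by each of the nine techniques (labels are assumed).
TECHNIQUE_TYPE = {
    "boxplot": "aspatial",
    "hawkins": "spatial",
    "time_series": "temporal",
    "mlr_residuals": "relationship (aspatial, linear)",
    "lwr_residuals": "relationship (aspatial, nonlinear)",
    "gwr_residuals": "relationship (spatial, nonlinear)",
    "pca": "relationship (aspatial, linear)",
    "lwpca": "relationship (aspatial, nonlinear)",
    "gwpca": "relationship (spatial, nonlinear)",
}

def weight_of_evidence(flags):
    """Summarise the flags raised for one observation.

    flags maps a technique name to True (flagged) or False (not flagged);
    techniques missing from flags count as not flagged.
    """
    hits = [name for name in TECHNIQUE_TYPE if flags.get(name, False)]
    types = Counter(TECHNIQUE_TYPE[name] for name in hits)
    return {
        "n_flags": len(hits),
        "share": len(hits) / len(TECHNIQUE_TYPE),
        "types": dict(types),
    }

# Invented flags for one observation (not the values from the slides).
example = {"boxplot": True, "time_series": True, "mlr_residuals": True}
summary = weight_of_evidence(example)
```

The tally of types gives a rough hint of what kind of outlier the observation is likely to be. The final call is still left to an expert on the data.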
12
3. Case study 1 (detecting logical input errors)
  • Data
  • Data at NUTS3 level (1351 observations/regions)
  • Variables
  • GDP evolution (2000 to 2005) (percentage)
  • Calculated using 4 other variables
  • 205 logical input errors deliberately introduced
    to:
  • NUTS codes and the 4 variables used to calculate
    GDP evolution only
  • 15% of data infected

13
Performance results
  • False negatives: 13.2% (e.g. in Italy)
  • False positives: 2.0% (e.g. in Spain)
  • Overall misclassification rate: 3.7%
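
A worked sketch of how such rates can be computed. The slides do not state the denominators, so the definitions below are assumptions: false negatives as a share of the injected errors, false positives as a share of the clean observations, and overall misclassification as a share of all observations. The counts 27 and 23 are illustrative only, chosen because with 205 injected errors among 1351 regions they reproduce the reported percentages.

```python
def misclassification_rates(n_total, n_errors, missed, false_alarms):
    """Return false-negative, false-positive and overall rates in percent.

    n_total:      all observations
    n_errors:     observations that really contain an injected error
    missed:       injected errors that were not flagged
    false_alarms: clean observations that were flagged
    """
    n_clean = n_total - n_errors
    return {
        "false_negative_pct": 100 * missed / n_errors,
        "false_positive_pct": 100 * false_alarms / n_clean,
        "overall_pct": 100 * (missed + false_alarms) / n_total,
    }

# 1351 regions and 205 injected errors come from the slides;
# 27 and 23 are hypothetical counts used for illustration.
rates = misclassification_rates(1351, 205, missed=27, false_alarms=23)
```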
14
Consequences if we had ignored input errors.
15
4. Case study 2 (detecting statistical outliers)
  • Data
  • Data at NUTS23 level for eight years (2000-2007)
  • For each year, unemployment rate calculated as
  • (Unemployment population)/(Active
    population)
  • 8 variables at each of 790 regions = 6320 obs.
  • Data checked for input errors - i.e. stage 1 done

16
(No Transcript)
17
Presentation of results
  • For brevity
  • Let's say we only need at least one of the 8
    time-specific unemployment values in a region to
    be outlying
  • (But we can identify outliers by year too)
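
This region-level rule can be sketched as follows, assuming the per-year flags for a region are held in a dictionary keyed by year. The names and the example flags are hypothetical.

```python
YEARS = list(range(2000, 2008))  # the eight years 2000 to 2007

def region_is_outlying(flags_by_year):
    """True if at least one yearly value in the region was flagged."""
    return any(flags_by_year.get(year, False) for year in YEARS)

def outlying_years(flags_by_year):
    """List the years in which the region's value was flagged."""
    return [year for year in YEARS if flags_by_year.get(year, False)]

# Invented flags for two regions.
region_a = {2003: True, 2004: True}
region_b = {}
```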

18
Results 1: boxplot statistics (aspatial
univariate)
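
A minimal sketch of a boxplot rule, assuming the conventional fences at 1.5 times the interquartile range beyond the quartiles. The slides do not give the multiplier or the quartile method, and the data are invented.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return the values lying outside the boxplot fences.

    k = 1.5 is the conventional multiplier and is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Toy unemployment rates (percent), invented for illustration.
rates = [4.1, 5.0, 5.3, 5.9, 6.2, 6.8, 7.1, 21.5]
flagged = boxplot_outliers(rates)
```

The rule ignores both location and time, which is why it sits in the aspatial univariate row.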
19
Results 2: Hawkins test (spatial univariate)
20
Results 3: time series statistics (temporal
univariate)
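
The slides do not say which time series statistic was used. As one possible illustration for a short yearly series, the sketch below applies a robust modified z-score (median and median absolute deviation) within a single region. The statistic, the 3.5 threshold and the data are assumptions, not the authors' method.

```python
from statistics import median

def temporal_outliers(series, threshold=3.5):
    """Return the positions of values that are unusual within one series.

    Uses the modified z-score 0.6745 * |x - median| / MAD.
    """
    med = median(series)
    mad = median(abs(x - med) for x in series)
    if mad == 0:
        return []  # no spread, so the score is undefined
    return [i for i, x in enumerate(series)
            if 0.6745 * abs(x - med) / mad > threshold]

# Toy unemployment rates for one region over eight years (invented).
region_series = [6.1, 6.3, 6.0, 6.2, 14.8, 6.4, 6.1, 6.3]
positions = temporal_outliers(region_series)
```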
21
Results 4: MLR residuals (aspatial, linear
relationships)
22
Results 5: LWR residuals (aspatial, nonlinear
relationships)
23
Results 6: GWR residuals (spatial, nonlinear
relationships)
24
Results 7: PCA residuals (aspatial, linear
relationships, model-free)
25
Results 8: LWPCA residuals (aspatial, nonlinear
relationships, model-free)
26
Results 9: GWPCA residuals (spatial, nonlinear
relationships, model-free)
27
Summary of results: weight of evidence
28
Preliminary performance results
  • Infected 5% of the data with outliers and
    repeated the analysis on this infected data
  • False negatives: 10.3%
  • False positives: 34.3%
  • Overall misclassification rate: 26.1%
  • Problems
  • Difficult to guarantee that our infections
    actually produce outliers
  • The data already contains outliers (as shown)

29
  • 5. Next things to do
  • 1. Other ways of performance testing our approach
  • Simulated data with known properties?
  • Statistical theory (or properties)?
  • 2. Refining each of our nine chosen techniques
  • Robust extensions

30
  • Thank You!