Mining for Spatial Patterns

- Shashi Shekhar
- Department of Computer Science
- University of Minnesota, http://www.cs.umn.edu/~shekhar
- Collaborators: V. Kumar, G. Karypis, C.T. Lu, W. Wu, Y. Huang, V. Raju, P. Zhang, P. Tan, M. Steinbach
- This work was partially funded by NASA and the Army High Performance Computing Center

Spatial Data Mining (SDM) - Examples

- Historical Examples
- London Asiatic Cholera, 1854 (Griffith)
- Dental health and fluoride in water, Colorado, early 1900s
- Current Examples
- Cancer clusters (CDC), spread of disease (e.g. Nile virus)
- Crime hotspots (NIJ CML, police patrol planning)
- Environmental justice (EPA), fair lending practices
- Upcoming Applications: location-aware services
- Defense: sensor networks, mobile ad-hoc networks
- Civilian: mortgage PMI determination based on location

Army Relevance of SDM

- Strategic
- Predicting global hot spots (FORMID)
- Army land: endangered species vs. training and war games
- Search for local trends in massive simulation data
- Critical infrastructure defense (threat assessment)
- Tactical
- Inferring enemy tactics (e.g. flank attack) from blobology
- Detection of lost ammunition dumps (Dr. Radhakrishnan)
- Operational
- Interpretation of maps: map matching (locating oneself on a map)
- Identifying terrain features, e.g. ravines, valleys, ridges, etc.
- Locating the enemy (e.g. sniper in a haystack, sensor networks)
- Avoiding friendly fire

Spatial Data Mining (SDM) - Definition

- Search for implicit, interesting patterns in geo-spatial data
- Ex.: reconnaissance, vector maps (NIMA, TEC), GPS, sensor networks
- Data Mining vs. Statistics
- Primary vs. secondary analysis
- Global vs. local trends
- Spatial Data Mining vs. Data Mining
- Spatial autocorrelation
- Continuous vs. discrete data types

Background

- Spatial Data Mining
- Spatial statistics in Geology, Regional Economics
- NSF workshop on GIS and DM (3/99)
- NSF workshop on spatial data analysis (5/02)
- Spatial patterns
- Spatial outliers
- Location prediction
- Associations, colocations
- Hotspots, clustering, trends

Framework

- Two Approaches to Mining Spatial Data
- 1. Pick spatial features; use classical DM methods
- 2. Use novel data mining techniques
- Our Approach
- Define the problem: capture special needs
- Explore data using maps, other visualization
- Try reusing classical DM methods
- If classical DM methods perform poorly, try new methods
- Evaluate chosen methods rigorously
- Performance tuning if needed

Spatial Association Rule

- Citation: Symp. on Spatial Databases 2001
- Problem: Given a set of boolean spatial features,
- find subsets of co-located features, e.g. (fire, drought, vegetation)
- Data: continuous space, partition not natural, no reference feature
- Classical data mining approach: association rules
- But, Look Ma! No Transactions!!! No support measure!
- Approach: Work with continuous data without transactionizing it!
- confidence = Pr[fire at s | drought in N(s) and vegetation in N(s)]
- support: cardinality of spatial join of instances of fire, drought, dry veg.
- participation: min. fraction of instances of a feature in join result
- new algorithm using spatial joins and apriori_gen filters
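
The apriori_gen filter named above can be sketched as follows. This is a minimal illustration of the classical Apriori candidate-generation step (join, then prune), not the authors' implementation; the function signature and the sample feature names are assumptions made for the example.

```python
from itertools import combinations

def apriori_gen(prevalent, k):
    """Generate size-k candidate feature sets from prevalent size-(k-1) sets."""
    prev = sorted(tuple(sorted(s)) for s in prevalent)
    prev_set = set(prev)
    candidates = []
    for a, b in combinations(prev, 2):
        # Join step: merge two sets that agree on their first k-2 features.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune step: every (k-1)-subset must itself be prevalent.
            if all(sub in prev_set for sub in combinations(cand, k - 1)):
                candidates.append(cand)
    return candidates

# Hypothetical prevalent size-2 co-locations.
size2 = [("drought", "fire"), ("drought", "vegetation"), ("fire", "vegetation")]
size3 = apriori_gen(size2, 3)
```

Only candidates that survive this filter would need a spatial join to measure their prevalence, which is why a monotonic prevalence measure matters.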

Event Definition

- Convert the time series into a sequence of events at each spatial location.
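
One way to turn a per-location time series into events is sketched below. The slide does not say how events are defined, so the z-score threshold rule, the `threshold` value, the `-Hi`/`-Lo` label suffixes, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def to_events(series, name, threshold=1.5):
    """Return (time index, event label) pairs for one location's time series."""
    mu, sigma = mean(series), pstdev(series)
    events = []
    for t, value in enumerate(series):
        # Assumed rule: an event is a value far from the location's own mean.
        z = (value - mu) / sigma if sigma > 0 else 0.0
        if z > threshold:
            events.append((t, f"{name}-Hi"))
        elif z < -threshold:
            events.append((t, f"{name}-Lo"))
    return events

# Hypothetical series keyed by grid location.
fpar_by_location = {
    (0, 0): [0.30, 0.31, 0.29, 0.30, 0.62, 0.30, 0.31, 0.29],
    (0, 1): [0.40, 0.41, 0.39, 0.40, 0.41, 0.40, 0.12, 0.40],
}
events = {loc: to_events(ts, "FPAR") for loc, ts in fpar_by_location.items()}
```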

Interesting Association Patterns

- Use domain knowledge to eliminate uninteresting patterns.
- A pattern is less interesting if it occurs at random locations.
- Approach
- Partition the land area into distinct groups (e.g., based on land-cover type).
- For each pattern, find the regions for which the pattern can be applied.
- If the pattern occurs mostly in a certain group of land areas, then it is potentially interesting.
- If the pattern occurs frequently in all groups of land areas, then it is less interesting.
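
The filtering approach above can be sketched as a per-group frequency check. The cut-offs `high` and `low`, the decision rule, and the toy land-cover map are assumptions for illustration; the slides give no numeric criteria.

```python
from collections import Counter

def group_support(pattern_cells, land_cover):
    """Fraction of each land-cover group's cells where the pattern occurs."""
    totals = Counter(land_cover.values())
    hits = Counter(land_cover[c] for c in pattern_cells)
    return {g: hits[g] / totals[g] for g in totals}

def is_potentially_interesting(support, high=0.5, low=0.2):
    """Assumed rule: frequent in exactly one group and rare in all others."""
    frequent = [g for g, s in support.items() if s >= high]
    rare = [g for g, s in support.items() if s <= low]
    return len(frequent) == 1 and len(rare) == len(support) - 1

# Hypothetical 8-cell region: cells 0-3 are grassland, cells 4-7 are forest.
land_cover = {c: ("grassland" if c < 4 else "forest") for c in range(8)}
localized = group_support([0, 1, 2], land_cover)
widespread = group_support([0, 1, 2, 4, 5, 6], land_cover)
```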

Association Rules

- Intra-zone non-sequential patterns
- Region corresponds to semi-arid grasslands, a type of vegetation that is able to take advantage of high precipitation more quickly than forests.
- Hypothesis: FPAR-Hi events could be related to unusual precipitation conditions.

Co-location

Can you find co-location patterns from the

following sample dataset?

Answers and

Co-location

Spatial Co-location: A set of features frequently co-located

Given:
- A set T of K boolean spatial feature types, $T = \{f_1, f_2, \ldots, f_K\}$
- A set P of N locations, $P = \{p_1, \ldots, p_N\}$, in a spatial framework S; each $p_i \in P$ is of some spatial feature in T
- A neighbor relation R over locations in S

Find:
- $T_c$ = subsets of T that are frequently co-located

Objective:
- Correctness
- Completeness
- Efficiency

Constraints:
- R is symmetric and reflexive
- Monotonic prevalence measure

Reference Feature Centric

Window Centric

Event Centric

Co-location

Comparison with association rules

                                   Association rules      Co-location rules
underlying space                   discrete sets          continuous space
item-types                         item-types             events / Boolean spatial features
collections                        transactions           neighborhoods
prevalence measure                 support                participation index
conditional probability measure    Pr[A in T | B in T]    Pr[A in N(L) | B at L]

Participation index

1. Participation ratio $pr(f_i, c)$ of feature $f_i$ in co-location $c = \{f_1, f_2, \ldots, f_k\}$: fraction of instances of $f_i$ with features $\{f_1, \ldots, f_{i-1}, f_{i+1}, \ldots, f_k\}$ nearby
2. Participation index $= \min_i \{pr(f_i, c)\}$

Algorithm: Hybrid Co-location Miner

Spatial Co-location Patterns

- Dataset

- Spatial features A, B, C and their instances
- Possible associations are (A, B), (B, C), etc.
- Neighbor relationship includes the following pairs:
- A1, B1
- A2, B1
- A2, B2
- B1, C1
- B2, C2

Spatial Co-location Patterns

- Partition approach [Yasuhiko, KDD 2001]
- Support not well defined, i.e. not independent of execution trace
- Has a fast heuristic which is hard to analyze for correctness/completeness

- Dataset

Spatial features A, B, C, and their instances

Support: (A,B) = 1, (B,C) = 2

Support: (A,B) = 2, (B,C) = 2

Spatial Co-location Patterns

- Dataset

- Reference feature approach [Han, SSD 95]
- C as reference feature to get transactions
- Transactions: (B1), (B2)
- Support (A,B): ? from Apriori algorithm
- Note: Neighbor relationship includes the following pairs:
- A1, B1
- A2, B1
- A2, B2
- B1, C1
- B2, C2

Spatial features A, B, C, and their instances

Spatial Co-location Patterns

- Our approach (Event Centric)
- Neighborhood instead of transactions
- Spatial join on neighbor relationship
- Support => Prevalence
- Participation index = min. p_ratio
- P_ratio(A, (A,B)) = fraction of instances of A participating in join(A, B, neighbor)
- Examples
- Support(A,B) = min(2/2, 3/3) = 1
- Support(B,C) = min(2/2, 2/2) = 1

- Dataset

Spatial features A, B, C, and their instances
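
The event-centric measure can be sketched on the neighbor pairs listed for this dataset. This is a minimal illustration for size-2 co-locations only. Because the figure is not reproduced here, the code assumes two instances per feature, so its intermediate ratios may differ from the slide's while the resulting index agrees. The helper name `participation_index` is hypothetical.

```python
# Neighbor pairs listed on the dataset slide.
NEIGHBOR_PAIRS = [("A1", "B1"), ("A2", "B1"), ("A2", "B2"),
                  ("B1", "C1"), ("B2", "C2")]
# Assumption: two instances per feature (the figure is not available).
INSTANCES = {"A": {"A1", "A2"}, "B": {"B1", "B2"}, "C": {"C1", "C2"}}

def participation_index(f, g, pairs=NEIGHBOR_PAIRS, instances=INSTANCES):
    """Participation index of the size-2 co-location (f, g)."""
    # Spatial join on the neighbor relation, restricted to features f and g.
    join = []
    for p, q in pairs:
        if (p[0], q[0]) == (f, g):
            join.append((p, q))
        elif (p[0], q[0]) == (g, f):
            join.append((q, p))
    # Participation ratio: fraction of a feature's instances in the join.
    ratio_f = len({p for p, _ in join}) / len(instances[f])
    ratio_g = len({q for _, q in join}) / len(instances[g])
    return min(ratio_f, ratio_g)

pi_ab = participation_index("A", "B")
pi_bc = participation_index("B", "C")
pi_ac = participation_index("A", "C")
```

Unlike the partition approach, the result does not depend on how the space is cut up, since it is computed from the join result alone.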

Spatial Co-location Patterns

- Dataset

Spatial features A, B, C, and their instances

- Partition approach

Support: (A,B) = 2, (B,C) = 2

Support: (A,B) = 1, (B,C) = 2

- Reference feature approach

C as reference feature
Transactions: (B1), (B2)
Support (A,B): ?

- Our approach

Support(A,B) = min(2/2, 3/3) = 1
Support(B,C) = min(2/2, 2/2) = 1

Spatial Outliers

- Spatial Outlier: A data point that is extreme relative to its neighbors
- Case Study: traffic stations different from neighbors [SIGKDD 2001, JIDA 2002]
- Data: space-time plot, distr. of f(x), S(x)
- Distribution of base attribute:
- spatially smooth
- frequency distribution over value domain: normal
- Classical test: Pr[item in population] is low
- Q? distribution of diff. [f(x), neighborhood agg{f(x)}]
- Insight: this statistic is distributed normally!
- Test: (z-score on the statistic) > 2
- Performance: spatial join, clustering methods

Spatial Outlier Detection

Given:
- A spatial graph $G = (V, E)$
- A neighbor relationship (K neighbors)
- An attribute function $f: V \to R$
- An aggregation function: $R^k \to R$
- A comparison function
- Confidence level threshold $\theta$
- Statistic test function $ST: R \to \{T, F\}$

Find:
- $O = \{v_i \mid v_i \in V,\ v_i \text{ is a spatial outlier}\}$

Objective:
- Correctness: the attribute value of $v_i$ is extreme compared with its neighbors
- Computational efficiency

Constraints:
- The aggregation function and ST are algebraic aggregate functions
- Computation cost dominated by I/O operations

Spatial Outlier Detection

Spatial Outlier Detection Test

1. Choice of spatial statistic: $S(x) = f(x) - E_{y \in N(x)}(f(y))$
   Theorem: S(x) is normally distributed if f(x) is normally distributed
2. Test for outlier detection: $(S(x) - \mu_s) / \sigma_s > \theta$

Hypothesis: I/O cost is determined by clustering efficiency

[Figure panels: f(x), S(x)]
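
The two-step test can be sketched as below, taking S(x) as the attribute value minus the mean over the neighbors and applying a z-score threshold. The use of the absolute z-score (so that unusually low values are flagged too), the chain-shaped neighbor graph, and the sample readings are assumptions for illustration.

```python
from statistics import mean, pstdev

def spatial_outliers(f, neighbors, theta=2.0):
    """Flag nodes whose S(x) = f(x) - mean(f over N(x)) has |z-score| > theta."""
    s = {x: f[x] - mean(f[y] for y in neighbors[x]) for x in f}
    mu, sigma = mean(s.values()), pstdev(s.values())
    if sigma == 0:
        return []
    return [x for x in f if abs((s[x] - mu) / sigma) > theta]

# Hypothetical readings from ten stations along one road; station 5 is unusual.
values = [50, 51, 49, 50, 52, 90, 51, 50, 49, 50]
f = dict(enumerate(values))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))}
outliers = spatial_outliers(f, neighbors)
```

Each node needs only its own value and its neighbors' values, which is why the computation reduces to a spatial self-join on the neighbor relation.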

Graphical Spatial Tests

[Figure panels: Moran Scatter Plot, Original Data, Variogram Cloud]
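
The coordinates behind a Moran scatter plot can be sketched as follows: each node is plotted at its normalized attribute value against the average normalized value of its neighbors, and nodes in the upper-left or lower-right quadrant disagree with their neighborhood. The equal-weight neighbor average, the helper names, and the sample data are assumptions for illustration.

```python
from statistics import mean, pstdev

def moran_scatter(f, neighbors):
    """Map each node to (z-score of f, mean z-score over its neighbors)."""
    mu, sigma = mean(f.values()), pstdev(f.values())
    z = {x: (f[x] - mu) / sigma for x in f}
    return {x: (z[x], mean(z[y] for y in neighbors[x])) for x in f}

def off_diagonal(points):
    """Nodes in the upper-left or lower-right quadrant of the plot."""
    return [x for x, (zx, lag) in points.items() if zx * lag < 0]

# Hypothetical readings from ten stations along one road.
values = [50, 51, 49, 50, 52, 90, 51, 50, 49, 50]
f = dict(enumerate(values))
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
             for i in range(len(values))}
suspects = off_diagonal(moran_scatter(f, neighbors))
```

In this toy case the plot flags the unusual station and also its two neighbors, since their neighborhood averages are pulled up by it. A graphical test therefore suggests candidates rather than giving a final answer.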

A Unified Approach: Spatial Outliers

- Tests: quantitative, graphical
- Results
- Computation = spatial self-join
- Tests: algebraic functions of join
- Join predicate: neighbor relations
- I/O cost = f(clustering efficiency)
- Our algorithm is I/O-efficient for algebraic tests

[Figure panels: Scatter Plot, Original Data, Our Approach]

Spatial Outlier Detection

Results

1. CCAM achieves higher clustering efficiency (CE)
2. CCAM has lower I/O cost
3. High CE => low I/O cost
4. Big page => high CE

[Figure panels: I/O cost and CE value for Z-order, CCAM, and Cell-Tree]

Location Prediction

- Citations: IEEE Trans. on Multimedia 2002, SIAM DM Conf. 2001, SIGKDD DMKD 2000
- Problem: predict nesting sites in marshes
- given vegetation, water depth, distance to edge, etc.
- Data: maps of nests and attributes
- spatially clustered nests, spatially smooth attributes
- Classical methods: logistic regression, decision trees, Bayesian classifier
- but the independence assumption is violated! Misses auto-correlation!
- Spatial auto-regression (SAR), Markov random field Bayesian classifier
- Open issue: spatial accuracy vs. classification accuracy
- Open issue: performance - SAR learning is slow!

Location Prediction

Given:
1. Spatial framework
2. Explanatory functions
3. A dependent class
4. A family of function mappings

Find: Classification model

Objective: maximize classification_accuracy

Constraints: Spatial autocorrelation exists

[Figure panels: Nest locations, Distance to open water, Water depth, Vegetation durability]

Motivation and Framework

Spatial AutoRegression (SAR)

- Spatial Autoregression Model (SAR)
- $y = \rho W y + X\beta + \epsilon$
- W models neighborhood relationships
- $\rho$ models strength of spatial dependencies
- $\epsilon$: error vector
- Solutions
- $\rho$ and $\beta$ can be estimated using ML or Bayesian statistics
- e.g., a spatial econometrics package uses a Bayesian approach based on a sampling-based Markov Chain Monte Carlo (MCMC) method.
- Likelihood-based estimation requires $O(n^3)$ operations.
- Other alternatives: divide and conquer, sparse matrix, LU decomposition, etc.
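
To make the model concrete, the sketch below generates y from the standard SAR form y = rho W y + X beta + eps by solving the linear system (I - rho W) y = X beta + eps with NumPy. It shows only the forward model, not parameter estimation. The chain-shaped neighborhood, the row normalization of W, and all parameter values are assumptions for illustration.

```python
import numpy as np

def row_normalized_chain(n):
    """Neighborhood matrix W for n sites on a line, each row summing to 1."""
    w = np.zeros((n, n))
    for i in range(n):
        for j in (i - 1, i + 1):
            if 0 <= j < n:
                w[i, j] = 1.0
    return w / w.sum(axis=1, keepdims=True)

def sar_generate(w, x, beta, rho, eps):
    """Solve (I - rho W) y = X beta + eps for y."""
    n = w.shape[0]
    return np.linalg.solve(np.eye(n) - rho * w, x @ beta + eps)

rng = np.random.default_rng(0)
n, rho = 6, 0.4
w = row_normalized_chain(n)
x = np.column_stack([np.ones(n), np.arange(n, dtype=float)])
beta = np.array([1.0, 0.5])
eps = rng.normal(scale=0.1, size=n)
y = sar_generate(w, x, beta, rho, eps)
# The solution satisfies the model equation, so this residual is numerically zero.
residual = y - (rho * (w @ y) + x @ beta + eps)
```

A dense solve like this costs on the order of n^3 operations, which is the scaling concern raised above. W is mostly zeros for real neighborhoods, which is what makes the sparse-matrix alternatives attractive.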

Evaluation

- Linear Regression
- Spatial Regression
- Spatial model is better

MRF Bayesian

- Markov Random Field based Bayesian Classifiers
- $\Pr(l_i \mid X, L_i) = \Pr(X \mid l_i, L_i) \Pr(l_i \mid L_i) / \Pr(X)$
- $\Pr(l_i \mid L_i)$ can be estimated from training data
- $L_i$ denotes the set of labels in the neighborhood of $s_i$, excluding the label at $s_i$
- $\Pr(X \mid l_i, L_i)$ can be estimated using kernel functions
- Solutions
- Stochastic relaxation [Geman]
- Iterated conditional modes [Besag]
- Graph cut [Boykov]
Experiment Design

Prediction Maps (Learning)

[Figure panels: MRF-P Prediction (ADNP = 3.36), Actual Nest Sites (Real Learning), MRF-GMM Prediction (ADNP = 5.88), SAR Prediction (ADNP = 9.80); panel annotations, in extraction order: NZ = 85, NZ = 138, NZ = 140, NZ = 130]

Prediction Maps (Testing)

[Figure panels: MRF-P Prediction (ADNP = 2.84), Actual Nest Sites (Real Testing), Actual Nest Sites (Real Learning), MRF-GMM Prediction (ADNP = 3.35), SAR Prediction (ADNP = 8.63); panel annotations, in extraction order: NZ = 30, NZ = 80, NZ = 76, NZ = 80]

Comparison (MRF-BC vs. SAR)

- SAR can be rewritten as $y = (QX)\beta + Q\epsilon$
- where $Q = (I - \rho W)^{-1}$, which can be viewed as a spatial smoothing operation.
- This transformation shows that SAR is similar to the linear logistic model and thus suffers from the same limitations, i.e., the SAR model assumes linear separability of classes in the transformed feature space.
- The SAR model also makes more restrictive assumptions about the distribution of features and class shapes than MRF.
- The relationship between SAR and MRF is analogous to the relationship between logistic regression and Bayesian classifiers.
- Our experimental results show that the MRF model yields better spatial and classification accuracies than SAR predictions.
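
The rewritten form follows from the SAR model in two steps, assuming $I - \rho W$ is invertible:

```latex
\begin{align*}
y &= \rho W y + X\beta + \epsilon \\
(I - \rho W)\, y &= X\beta + \epsilon \\
y &= (I - \rho W)^{-1} X\beta + (I - \rho W)^{-1} \epsilon
   = (QX)\beta + Q\epsilon, \qquad Q = (I - \rho W)^{-1}
\end{align*}
```

So SAR is an ordinary linear model in the transformed features $QX$, with the error term passed through the same operator.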

MRF vs. SAR

Confusion Matrix

Spatial Confusion Matrix

Conclusion and Future Directions

- Spatial domains may not satisfy assumptions of classical methods
- data: auto-correlation, continuous geographic space
- patterns: global vs. local, e.g. spatial outliers vs. outliers
- data exploration: maps and albums
- Open Issues
- patterns: hot-spots, blobology (shape), spatial trends
- metrics: spatial accuracy (predicted locations), spatial contiguity (clusters)
- spatio-temporal datasets
- scale and resolution sensitivity of patterns
- geo-statistical confidence measures for mined patterns

Army Relevance and Collaborations

- Relevance: "Maps are as important to soldiers as guns" - unknown
- Joint Projects
- High Performance GIS for Battlefield Simulation (ARL Adelphi)
- Spatial Querying for Battlefield Situation Assessment (ARL Adelphi)
- Joint Publications
- w/ G. Turner (ARL Adelphi, MD), D. Chubb (CECOM IEWD)
- IEEE Computer (December 1996)
- IEEE Transactions on Knowledge and Data Eng. (July-Aug. 1998)
- Three conference papers
- Visits, Other Collaborations
- GIS group, Waterways Experimentation Station (Army)
- Concept Analysis Agency, Topographic Eng. Center, ARL, Adelphi
- Workshop on Battlefield Visualization and Real Time GIS (4/2000)

Reference

- S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Lu, Spatial Databases: Accomplishments and Research Needs, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
- S. Shekhar and Y. Huang, Discovering Spatial Co-location Patterns: A Summary of Results, In Proc. of 7th International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
- S. Shekhar, C.T. Lu, P. Zhang, Detecting Graph-based Spatial Outliers: Algorithms and Applications, the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
- S. Shekhar, C.T. Lu, P. Zhang, Detecting Graph-based Spatial Outlier, Intelligent Data Analysis, to appear in Vol. 6(3), 2002.
- S. Shekhar, S. Chawla, Spatial Database: Concepts, Implementation and Trends (book), Prentice Hall, 2002.
- S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, Extending Data Mining for Spatial Applications: A Case Study in Predicting Nest Locations, Proc. Int. Conf. on 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
- S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, Modeling Spatial Dependencies for Mining Geospatial Data, First SIAM International Conference on Data Mining, 2001.
- S. Shekhar, P.R. Schrater, R.R. Vatsavai, W. Wu, and S. Chawla, Spatial Contextual Classification and Prediction Models for Mining Geospatial Data, to appear in IEEE Transactions on Multimedia, 2002.
- S. Shekhar, V. Kumar, P. Tan, M. Steinbach, Y. Huang, P. Zhang, C. Potter, S. Klooster, Mining Patterns in Earth Science Data, IEEE Computing in Science and Engineering (submitted).

Reference

- S. Shekhar, C.T. Lu, P. Zhang, A Unified Approach to Spatial Outliers Detection, IEEE Transactions on Knowledge and Data Engineering (submitted).
- S. Shekhar, C.T. Lu, X. Tan, S. Chawla, Map Cube: A Visualization Tool for Spatial Data Warehouses, chapter in Geographic Data Mining and Knowledge Discovery, Harvey J. Miller and Jiawei Han (eds.), Taylor and Francis, 2001, ISBN 0-415-23369-0.
- S. Shekhar, Y. Huang, W. Wu, C.T. Lu, What's Spatial about Spatial Data Mining: Three Case Studies, chapter in Data Mining for Scientific and Engineering Applications, V. Kumar, R. Grossman, C. Kamath, R. Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2.
- S. Shekhar and Y. Huang, Multi-resolution Co-location Miner: A New Algorithm to Find Co-location Patterns in Spatial Datasets, Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference), April 2002.