Discovering Interesting Regions in Spatial Data Sets using Supervised Clustering

Description:

PKDD Conference, Berlin, Sept. 21, 2006. Department of Computer Science ... for the Volcano dataset for b=1.01, =6, which can be attributed to the fact that ... – PowerPoint PPT presentation

Learn more at: https://www2.cs.uh.edu

1
Discovering Interesting Regions in Spatial Data
Sets using Supervised Clustering

Christoph F. Eick, Banafsheh Vaezian, Dan Jiang,
Jing Wang
PKDD Conference, Berlin, Sept. 21, 2006
Department of Computer Science
University of Houston, Texas, USA
Organization
1. Motivation: Examples of Region Discovery
2. Region Discovery Framework
3. A Family of Clustering Algorithms for Region Discovery
4. Experimental Evaluation
5. Related Work
6. Generalizability of the Region Discovery Framework
7. Conclusion

2
1. Motivation: Examples of Region Discovery

Application 1: Hot-spot Discovery [this paper]
Application 2: Regional Association Rule Mining [DEWY06]
    Find regions
    Mine regional association rules
Application 3: Find Interesting Regions with respect to a Continuous Variable
Application 4: Regional Co-location Mining
Application 5: Find Representative Regions (Sampling)

[Figure: Wells in Texas. Green: safe well with respect to arsenic; red: unsafe well. Panels: RD-Algorithm results for β = 1.01 and β = 1.04.]
3
2. Region Discovery Framework

We assume we have spatial or spatio-temporal datasets that have the following structure:
(x, y, z, t; <non-spatial attributes>)
e.g. (longitude, latitude, class_variable)
or (longitude, latitude, continuous_variable)
Clustering occurs in the (x,y,z,t)-space; regions are found in this space.
The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.

4
Region Discovery Framework Continued

The algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering $X = \{c_1, \ldots, c_k\}$ as follows:
$q(X) = \sum_{c \in X} \mathrm{reward}(c) \cdot \mathrm{size}(c)^{\beta}$ with $\beta > 1$
Objective:
Find $c_1, \ldots, c_k \subseteq O$ such that:
$c_i \cap c_j = \emptyset$ if $i \neq j$
$X = \{c_1, \ldots, c_k\}$ maximizes $q(X)$
All clusters $c_i \in X$ are contiguous (each pair of objects belonging to $c_i$ has to be Delaunay-connected with respect to $c_i$ and to d)
$c_1 \cup \ldots \cup c_k \subseteq O$
$c_1, \ldots, c_k$ are frequently ranked based on the reward each cluster receives, and low-reward clusters are not reported
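
As a rough illustration of the fitness function above, the following Python sketch evaluates q(X) for a clustering given as a list of clusters. The object layout `(x, y, label)`, the function names, and the `min_reward` cut-off used for reporting are assumptions made for this example; the slides do not prescribe any of them.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed object layout for this sketch: (x, y, class label).
SpatialObject = Tuple[float, float, str]
Cluster = Sequence[SpatialObject]


def fitness(clustering: Sequence[Cluster],
            reward: Callable[[Cluster], float],
            beta: float = 1.01) -> float:
    """q(X) = sum over clusters c of reward(c) * size(c)**beta, with beta > 1."""
    if beta <= 1:
        raise ValueError("beta must be greater than 1")
    return sum(reward(c) * len(c) ** beta for c in clustering)


def rank_clusters(clustering: Sequence[Cluster],
                  reward: Callable[[Cluster], float],
                  min_reward: float = 0.0) -> List[Tuple[float, Cluster]]:
    """Sort clusters by reward (highest first) and drop low-reward ones.

    The min_reward threshold is a hypothetical parameter of this sketch.
    """
    scored = [(reward(c), c) for c in clustering]
    kept = [(r, c) for r, c in scored if r > min_reward]
    return sorted(kept, key=lambda rc: rc[0], reverse=True)
```

Because β > 1, two rewarded clusters score higher merged than apart as long as the merged cluster keeps a similar reward, which is how β controls the preference for larger regions.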

5
Example of a Fitness Function for Hot Spot Discovery

Class of Interest: Unsafe_Well
Prior Probability: 20%
γ1 = 0.5, γ2 = 1.5
R+ = 1, R− = 1
β = 1.1, η = 1

[Figure: reward function plot; axis marks at 10% and 30%]

              Cluster 1     Cluster 2      Cluster 3     Cluster 4       Cluster 5
|c|           50            200            200           350             200
P(c, Unsafe)  20/50 = 40%   40/200 = 20%   10/200 = 5%   30/350 = 8.6%   100/200 = 50%
Reward
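
The example above can be explored with a small Python sketch. The cluster sizes, class counts, prior, and parameter values are taken from the slide. The reward formula itself is not shown in the transcript, so the interpolation used here (zero between prior·γ1 = 10% and prior·γ2 = 30%, growing towards R− below and towards R+ above, raised to η) is an assumption of this sketch, as are all function and variable names.

```python
def hotspot_reward(p, prior, gamma1, gamma2, r_plus, r_minus, eta):
    """ASSUMED reward shape; the slide transcript does not give the formula.

    Returns 0 inside [prior*gamma1, prior*gamma2] and an interpolated,
    eta-powered reward outside that band.
    """
    low, high = prior * gamma1, prior * gamma2
    if p < low:
        return ((low - p) / low) ** eta * r_minus
    if p > high:
        return ((p - high) / (1.0 - high)) ** eta * r_plus
    return 0.0


# (unsafe wells, cluster size) per cluster, as listed on the slide.
CLUSTERS = {
    "Cluster 1": (20, 50),
    "Cluster 2": (40, 200),
    "Cluster 3": (10, 200),
    "Cluster 4": (30, 350),
    "Cluster 5": (100, 200),
}
PARAMS = dict(prior=0.20, gamma1=0.5, gamma2=1.5,
              r_plus=1.0, r_minus=1.0, eta=1.0)
BETA = 1.1


def evaluate(clusters=CLUSTERS):
    """Return {name: (P(c, Unsafe), reward, reward * |c|**beta)}."""
    rows = {}
    for name, (unsafe, size) in clusters.items():
        p = unsafe / size
        r = hotspot_reward(p, **PARAMS)
        rows[name] = (p, r, r * size ** BETA)
    return rows
```

Under this assumed formula, Cluster 2 (20% unsafe, equal to the prior) receives no reward, while clusters with unusually high or unusually low proportions of unsafe wells do.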
6
Challenges for Region Discovery

1. Recall and precision with respect to the discovered regions should be high
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture what domain experts find interesting in spatial datasets
3. Detection of regions at different levels of granularity (from very local to almost global patterns)
4. Detection of regions of arbitrary shapes
5. Necessity to cope with very large datasets
6. Regions should be properly ranked by relevance (reward)
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5 and 6.

7
3. A Family of Clustering Algorithms for Region
Discovery

Supervised Partitioning Around Medoids (SPAM)
Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
Supervised Clustering using Evolutionary Computing (SCEC)
Agglomerative Hierarchical Supervised Clustering (SCAH)
Hierarchical Grid-based Supervised Clustering (SCHG)
Supervised Clustering using Multi-Resolution Grids (SCMRG)
Representative-based Clustering with Gabriel Graph Based Post-processing (SCECGGP / SRIDHCRGGP)
Supervised Clustering using Density Estimation Techniques (SCDE)

Remark: For more details about SCEC, SPAM, and SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCAH, SCHG, and SCMRG.
8
SCAH (Agglomerative Hierarchical)
Inputs: A dataset O = {o1,...,on}
        A distance matrix D = {d(oi,oj) | oi,oj ∈ O}
Output: Clustering X = {c1,...,ck}
Algorithm:
1) Initialize:
   Create single-object clusters ci = {oi}, 1 ≤ i ≤ n
   Compute merge candidates based on nearest clusters
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X = {c1,...,ck}
   c) Delete the two clusters ci and cj from X and add the cluster ci ∪ cj to X
   d) Update inter-cluster distances incrementally
   e) Update merge candidates based on inter-cluster distances
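
The SCAH loop above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: it assumes objects are `(x, y, label)` tuples, uses single-link Euclidean distance between clusters, treats each cluster's nearest cluster as its only merge candidate, and recomputes distances and q(X) from scratch instead of updating them incrementally. The toy `pure_reward` function is likewise hypothetical.

```python
import math


def q(clustering, reward, beta):
    """Fitness q(X) = sum of reward(c) * |c|**beta over all clusters."""
    return sum(reward(c) * len(c) ** beta for c in clustering)


def cluster_distance(a, b):
    """Single-link distance between two clusters (assumption of this sketch)."""
    return min(math.dist(p[:2], r[:2]) for p in a for r in b)


def merge_candidates(clustering):
    """Pair every cluster with its nearest other cluster."""
    pairs = set()
    for i, c in enumerate(clustering):
        others = [j for j in range(len(clustering)) if j != i]
        if others:
            j = min(others, key=lambda k: cluster_distance(c, clustering[k]))
            pairs.add((min(i, j), max(i, j)))
    return pairs


def scah(objects, reward, beta=1.01):
    """Greedy agglomerative merging that stops when no merge improves q(X)."""
    clustering = [[o] for o in objects]          # 1) single-object clusters
    while True:                                  # 2) DO FOREVER
        current = q(clustering, reward, beta)
        best_gain, best_pair = 0.0, None
        for i, j in merge_candidates(clustering):
            rest = [c for k, c in enumerate(clustering) if k not in (i, j)]
            gain = q(rest + [clustering[i] + clustering[j]], reward, beta) - current
            if gain > best_gain:                 # a) best improving pair
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:                    # b) no improving merge left
            return clustering
        i, j = best_pair                         # c) replace ci, cj by ci ∪ cj
        merged = clustering[i] + clustering[j]
        clustering = [c for k, c in enumerate(clustering) if k not in (i, j)]
        clustering.append(merged)


def pure_reward(cluster):
    """Toy reward: 1 if all objects share one class label, else 0."""
    return 1.0 if len({o[2] for o in cluster}) == 1 else 0.0
```

With `pure_reward`, merging stops as soon as the only remaining candidates would mix classes.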
9
SCHG (Hierarchical Grid-based)
Remark: Same as SCAH, but uses grid cells as initial clusters
Inputs: A dataset O = {o1,...,on}
        A grid structure G
Output: Clustering X = {c1,...,ck}
Algorithm:
1) Initialize:
   Create clusters making each single non-empty grid cell a cluster
   Compute merge candidates (all pairs of neighboring grid cells)
2) DO FOREVER
   a) Find the pair (ci, cj) of merge candidates that improves q(X) the most
   b) If no such pair exists, terminate, returning X = {c1,...,ck}
   c) Delete the two clusters ci and cj from X and add the cluster c' = ci ∪ cj to X
   d) Update merge candidates: ∀c ∈ X (MC(c', c) ⇔ MC(c, ci) ∨ MC(c, cj))
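
The grid initialization and the merge-candidate update of SCHG can be sketched as follows. The sketch assumes a uniform square grid with side `cell_size`, objects given as `(x, y, label)` tuples, and edge-sharing cells as neighbors; the slide does not specify the neighborhood, and all names are hypothetical.

```python
from collections import defaultdict


def build_grid_clusters(points, cell_size):
    """Group (x, y, label) points by grid cell; only non-empty cells appear."""
    cells = defaultdict(list)
    for x, y, label in points:
        key = (int(x // cell_size), int(y // cell_size))
        cells[key].append((x, y, label))
    return dict(cells)


def initial_merge_candidates(cells):
    """Map each cell id to the ids of its non-empty edge-sharing neighbors."""
    offsets = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (i, j): {(i + di, j + dj) for di, dj in offsets
                 if (i + di, j + dj) in cells}
        for (i, j) in cells
    }


def merge(clusters, candidates, a, b, new_id):
    """Replace clusters a and b by c' = a ∪ b.

    The new cluster is a merge candidate of c exactly when a or b was.
    """
    clusters[new_id] = clusters.pop(a) + clusters.pop(b)
    inherited = (candidates.pop(a) | candidates.pop(b)) - {a, b}
    candidates[new_id] = inherited
    for c in inherited:
        candidates[c] -= {a, b}
        candidates[c].add(new_id)
```

Since candidates are inherited by set union, no distance computations are needed after a merge, which is the main cost difference to SCAH.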
10
Ideas: SCMRG (Divisive, Multi-Resolution Grids)
Cell Processing Strategy:
1. If a cell receives a reward that is larger than the sum of the rewards of its ancestors, return that cell.
2. If a cell and its ancestor do not receive any reward, prune.
3. Otherwise, process the children of the cell (drill down).
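
One possible reading of the cell processing strategy is sketched below. The slide compares a cell with its "ancestors"; because the strategy drills down from coarse to fine cells, this sketch assumes the comparison is made against the cell's children (one level of look-ahead). That reading, the callback-based interface, and the names `reward`, `children`, and `results` are all assumptions of this example.

```python
def process_cell(cell, reward, children, results):
    """Apply the three SCMRG rules to one cell, recursing into children.

    reward(cell)   -> numeric reward of a cell (0 means no reward)
    children(cell) -> list of finer-resolution sub-cells (empty at max depth)
    results        -> list collecting the cells returned as clusters
    """
    kids = children(cell)
    own = reward(cell)
    kid_rewards = [reward(k) for k in kids]

    # Rule 1: the cell beats the combined reward of the compared cells.
    if own > 0 and own > sum(kid_rewards):
        results.append(cell)
        return
    # Rule 2: no reward here or one level down, so prune this branch.
    if own == 0 and all(r == 0 for r in kid_rewards):
        return
    # Rule 3: otherwise drill down.
    for kid in kids:
        process_cell(kid, reward, children, results)
```

A deeper look-ahead would compare against grandchildren as well; the sketch keeps a single level for brevity.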
11
4. Experimental Evaluation
#   Dataset Name      # of objects   # of classes
1   B-Complex9        3,031          2
2   Volcano           1,533          2
3   Earthquake-1      3,161          3
4   Earthquake-10     31,614         3
5   Earthquake-100    316,148        3
6   Wyoming-Poverty   493,781        2
[Figures: Volcano and Earthquake datasets]
12
Experimental Results
                            β = 1.01, η = 6              β = 3, η = 1
Dataset         Measure     SCAH    SCHG    SCMRG        SCAH    SCHG    SCMRG
B-Complex9      Purity      1       0.998   1            1       0.997   0.863
B-Complex9      Quality     0.974   0.974   0.957        0.008   0.044   0.002
B-Complex9      Clusters    17      15      132          17      9       22
Volcano         Purity      1       0.692   0.979        1       0.692   0.885
Volcano         Quality     0.940   0.091   0.822        1E-5    7E-4    1E-4
Volcano         Clusters    639     56      311          639     31      221
Earthquake-1    Purity      1       0.844   0.938        0.853   0.840   0.814
Earthquake-1    Quality     0.952   0.399   0.795        0.004   0.086   0.006
Earthquake-1    Clusters    479     33      380          161     10      93
Earthquake-10   Purity      DNF     0.840   0.912        DNF     0.834   0.807
Earthquake-10   Quality     DNF     0.398   0.658        DNF     0.077   0.006
Earthquake-10   Clusters    DNF     37      506          DNF     12      153
Earthquake-100  Purity      DNF     0.842   0.909        DNF     0.837   0.808
Earthquake-100  Quality     DNF     0.389   0.560        DNF     0.083   0.006
Earthquake-100  Clusters    DNF     38      780          DNF     9       191
Wyoming         Purity      DNF     0.772   0.721        DNF     0.769   0.661
Wyoming         Quality     DNF     0.027   0.227        DNF     0       0.001
Wyoming         Clusters    DNF     489     89           DNF     391     78
13
Experimental Evaluation

SCAH outperforms SCHG and SCMRG when the penalty for the number of clusters is very low (β = 1.01, η = 6). However, when SCAH runs out of pure clusters to merge, it tends to terminate prematurely; therefore, it does quite poorly when the objective is to obtain large clusters (β = 3, η = 1).
SCHG outperforms SCMRG and SCAH for β = 3, η = 1.
SCMRG obtains better clusters than SCHG for the Volcano dataset for β = 1.01, η = 6, which can be attributed to the fact that SCMRG uses grid cells with different sizes.
Avg. wall clocktime for smaller datasets
SCAHSCMRG/SCHG 131/521
SCAH is not suitable for datasets of 10,000 objects or more, mainly because of the large number of distance computations, the large number of clusters, and the number of merge steps needed.
The quality of the clusterings produced by SCMRG depends strongly on the initial cluster sizes and on the look-ahead depth.

14
Problems with SCAH
Too restrictive definition of merge candidates
XXX OOO OOO XXX
No look ahead
Non-contiguous clusters
15
5. Related Work

In contrast to most work in spatial data mining, our work centers on creating regional knowledge and not global knowledge.
A lot of work in spatial data mining centers on partitioning a spatial dataset into transactions so that apriori-style algorithms can be used. We claim that our work can contribute to finding such transactions [DEWY06].
Our work is similar to work in supervised clustering/semi-supervised clustering in that it uses class labels in evaluating clusters.
Moreover, the goals of the algorithms presented in this paper are similar to those of hotspot discovery algorithms, a task that does not receive a lot of attention in spatial data mining, but receives more attention from scientists in earth sciences and related disciplines.

16
6. Generalizability

Find regions whose density/entropy/purity with respect to a class of interest is low/high → this talk
Find regions whose variance with respect to a continuous variable is low → contour maps
Find regions whose variance with respect to a continuous variable is high ?
Find regions whose distribution is similar to the distribution of the whole dataset → spatial sampling
Find regions in which the density of 2 or more classes is elevated → regional co-location mining

17
7. Summary

A framework for region discovery that relies on
additive, reward-based fitness functions and
views region discovery as a clustering problem
has been introduced.
Evidence concerning the usefulness of the
framework for hot spot discovery problems has
been presented.
As a by-product some known and not so well known
flaws of hierarchical clustering algorithms have
been identified.
The ultimate vision of this research is the
development of region discovery engines that
assist earth scientists in finding interesting
regions in spatial datasets.

18
The Vision of the Presented Research
[Figure: Architecture of the Region Discovery Engine. Components: Domain Expert; Spatial Databases; Measure of Interestingness Acquisition Tool; Database Integration Tool; Fitness Function; Data Set; Family of Clustering Algorithms; Region Discovery Display; Ranked Set of Interesting Regions and their Properties; Visualization Tools]
19
Additional Transparencies
Not used for PKDD 2006 Talk
20
Code SCMRG
21
Why should people use Region Discovery Engines
(RDE)?

RDE finds sub-regions with special characteristics in large spatial datasets and presents findings in an understandable form. This is important for:
Focused summarization
Finding interesting subsets in spatial datasets for further studies
Identifying regions with unexpected patterns: because they are unexpected, they deviate from global patterns; therefore, their regional characteristics are frequently important for domain experts
Without powerful region discovery algorithms, finding regional patterns tends to be haphazard, and only leads to discoveries if ad-hoc region boundaries have enough resemblance with the true decision boundary
Exploratory data analysis for a mostly unknown dataset
Co-location statistics are frequently blurred when arbitrary region definitions are used, hiding the true relationship of two co-occurring phenomena, which becomes invisible when averages are taken over regions in which a strong relationship is watered down by including objects that do not contribute to the relationship (example: high crime rates along the major rivers in Texas)
Data set reduction; focused sampling

22
Experimental Results: Volcano for β = 1.01, η = 6
[Figures: clustering results for SCAH, SCHG, and SCMRG]
23
Example Result SCMRG
24
Datasets Used

Obtained from Geosciences Department in
University of Houston.
The Earthquake dataset contains all earthquake data worldwide collected by the United States Geological Survey (USGS) National Earthquake Information Center (NEIC).
The modified Earthquake dataset contains the longitude, latitude, and a class variable that indicates the depth of the earthquake: 0 (shallow), 1 (medium), and 2 (deep).

25
Datasets Used

Wyoming datasets were created from U.S. Census
2000 data.
The Wyoming Modified Poverty Status in 1999 is a
modified version of the original dataset, Wyoming
Poverty Status.
The Wyoming Poverty Datasets were created using
county statistics. For each county, random
population coordinates were generated using the
complete spatial randomness (CSR) functions in
S-PLUS.
Then, the background information was attached to each individual county based on the county's distribution for the class of interest. Finally, all counties were merged into a single dataset that describes the whole state.

26
Datasets Used

Obtained from Geosciences Department in
University of Houston.
The Volcano dataset contains basic geographic and
geologic information for volcanoes thought to be
active in the last 10,000 years
The original data include a unique volcano
number, volcano name, location, latitude and
longitude, summit elevation, volcano type, status
and the time range of the last recorded eruption.
The subset of the Volcano dataset used in this thesis contains longitude, latitude, and a class variable that indicates whether a volcano is non-violent (blue) or violent (red).

27
Another Example: Regional Co-location Mining
Regional Co-location
Task: Find co-location patterns for the following dataset.
Global Co-location and
28
A Co-Location Reward Framework

Task: Find regions in which the density of 2 or more classes is elevated.
One approach to measure class density elevation: In general, multipliers $\lambda_C$ can be computed for every class in a dataset, indicating how much the density of instances of class C is elevated in region r compared to their density in the whole space.
Example: Binary Co-Location Reward Framework
$\mathrm{increase}_C(r) = 0$ if $\lambda_C(r) \leq 1$; otherwise $\mathrm{increase}_C(r) = \left( \frac{\lambda_C(r) - 1}{(1/\mathrm{prior}(C)) - 1} \right)^{\delta}$
$\kappa_{C_1,C_2}(r) = \mathrm{increase}_{C_1}(r) \cdot \mathrm{increase}_{C_2}(r)$
$\mathrm{reward}(r) = \max_{C_1, C_2;\, C_1 \neq C_2} \kappa_{C_1,C_2}(r)$
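
The binary co-location reward can be tried out with the short Python sketch below. It takes the multipliers λ_C(r) and the class priors as given inputs. It assumes that the normalizing term is (1/prior(C)) − 1, which maps the increase into [0, 1] when λ_C(r) is a ratio of class proportions, and that the two increases are combined by multiplication; both readings, and all names, are assumptions of this example.

```python
from itertools import combinations


def increase(lam, prior, delta=1.0):
    """Normalized elevation of one class in a region.

    lam   -- multiplier lambda_C(r): class density in r relative to the whole space
    prior -- prior(C), the proportion of class C in the whole dataset
    """
    if lam <= 1:
        return 0.0
    return ((lam - 1) / (1 / prior - 1)) ** delta


def colocation_reward(multipliers, priors, delta=1.0):
    """Best pairwise product of increases over all pairs of distinct classes."""
    inc = {c: increase(multipliers[c], priors[c], delta) for c in multipliers}
    return max((inc[a] * inc[b] for a, b in combinations(inc, 2)), default=0.0)
```

A region is rewarded only if at least two classes are elevated at the same time; a single strongly elevated class yields a product of zero.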
