Spatial Data Mining and Spatial Data Warehousing Special Topics In Database

About This Presentation

Title:

Spatial Data Mining and Spatial Data Warehousing Special Topics In Database

Description:

Special Topics In Database Sadra Abedinzadeh Ashkan Zarnani Farzad Peyravi Outline Motivation and General Description Data Warehousing: Basic Concepts and Techniques ... – PowerPoint PPT presentation

Number of Views:932

Avg rating:3.0/5.0

Slides: 54

Provided by: Dora80

Category:

more less

Transcript and Presenter's Notes

Title: Spatial Data Mining and Spatial Data Warehousing Special Topics In Database

1
Spatial Data Mining and Spatial Data
WarehousingSpecial Topics In Database

Sadra Abedinzadeh
Ashkan Zarnani
Farzad Peyravi

2
Outline

Motivation and General Description
Data Warehousing Basic Concepts and Techniques
Spatial Data Warehousing and Spatial OLAP
Techniques
Spatial Data Warehouse Models and Construction
Spatial OLAP Implementation and Application
Data Mining Basic Concepts and Techniques
Spatial Data Mining
Mining Spatial Association Rules.
Spatial Classification and Prediction
Spatial Data Clustering Analysis
Conclusions and Future Research.

3
Motivation

Data warehousing Integrating data from multiple
sources into large warehouses and support on-line
analytical processing and business decision
making.
Data mining (knowledge discovery in databases)
Extraction of interesting knowledge
(rules, regularities, patterns, constraints)
from data in large databases.
Necessity Data explosion problem ---
computerized data collection tools and mature
database technology lead to tremendous amounts of
data stored in databases.
We are drowning in data, but starving for
knowledge!

4
Data Warehousing

A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of managements
decision-making process. --- W. H. Inmon
A data warehouse is
A decision support database that is maintained
separately from the organizations operational
databases.
It integrates data from multiple heterogeneous
sources to support the continuing need for
structured and /or ad-hoc queries, analytical
reporting, and decision support.

5
Modeling Data Warehouses

Modeling data warehouses dimensions
measurements
Star schema A single object (fact table) in the
middle connected to a number of objects
(dimension tables) radially.
Snowflake schema A refinement of star schema
where the dimensional hierarchy is represented
explicitly by normalizing the dimension tables.
Fact constellations Multiple fact tables share
dimension tables.
Storage of selected summary tables
Independent summary table storing pre-aggregated
data, e.g., total sales by product by year.
Encoding aggregated tuples in the same fact table
and the same dimension tables.

6
Example of Star Schema

Time Dimension Table
Sales Fact Table
Product Dimension Table
Many Time Attributes
Time_Key
Many Product Attributes
Product_Key
Store Dimension Table
Location Dimension Table
Store_Key
Many Location Attributes
Many Store Attributes
Location_Key
unit_sales
dollar_sales
Measurements
Yen_sales
7
Example of a Snowflake Schema
Supplier_Key

Sales Fact Table
Product Dimension Table
Time Dimension Table
Time_Key
Supplier_Key
Many Time Attributes
Product_Key
Product_Key
Store_Key
Store Dimension Table
Location Dimension Table
Location_Key
Many Store Attributes
Location_Key
unit_sales
Country
dollar_sales
Measurements
Location_Key
Yen_sales
Region
Location_Key
8
A Star-Net Query Model
Customer Orders
Shipping Method
Customer

CONTRACTS
AIR-EXPRESS
ORDER
TRUCK
PRODUCT LINE
Product
Time
DAILY
QTRLY
ANNUALY
PRODUCT ITEM
PRODUCT GROUP
DISTRICT
SALES PERSON
REGION
DISTRICT
COUNTRY
DIVISION
Geography
Organization
Promotion
9
Construction of Data Cubes
All Amount Comp_Method, B.C.
Amount
0-20K
20-40K
60K-
sum
40-60K
Province
B.C.
Comp_Method
Prairies
Ontario
sum
Database
Discipline
...
sum

Each dimension contains a hierarchy of values
for one attribute
A cube cell stores aggregate values, e.g., count,
sum, max, etc.
A sum cell stores dimension summation values.
Sparse-cube technology and MOLAP/ROLAP
integration.
Chunk-based multi-way aggregation and
single-pass computation.

10
Efficient Data Cube Computation Methods

Data cube can be viewed as a lattice of cuboids
The bottom-most cuboid is the base cube.
The top most cuboid contains only one cell.
Materialization of data cube
Materialize every (cuboid), none, or some.
Algorithms for selection of which cuboids to
materialize.
Based on size, sharing, and access frequency.
Efficient cube computation methods
ROLAP algorithms.
Array-based cubing algorithm.

ALL
A
B
C
AB
BC
AC
ABC
AC
11
OLAP On-Line Analytical Processing

A multidimensional, LOGICAL view of the data.
Interactive analysis of the data drill, pivot,
slice_dice, filter.
Summarization and aggregations at every dimension
intersection.
Retrieval and display of data in 2-D or 3-D
crosstabs, charts, and graphs, with easy pivoting
of the axes.
Analytical modeling deriving ratios, variance,
etc. and involving measurements or numerical data
across many dimensions.
Forecasting, trend analysis, and statistical
analysis.
Requirement Quick response to OLAP queries.

12
OLAP Architecture

Logical architecture
OLAP view multidimensional and logic
presentation of the data in the data
warehouse/mart to the business user.
Data store technology The technology options of
how and where the data is stored.
Three services components
data store services
OLAP services, and
user presentation services.
Two data store architectures
Multidimensional data store (MOLAP).
Relational data store Relational OLAP (ROLAP).

13
Spatial Data Warehouse and Spatial OLAP

Spatial Data Warehouse Integrated,
subject-oriented, time-variant, and nonvolatile
spatial data repository for data analysis and
decision making.
Spatial Data Integration A big issue.
Spatial data cube Multidimensional spatial
database.
Non-spatial dimensions time, product,
organization hierarchies.
Spatial dimensions formed by geo-spatial
hierarchies.
Non-spatial (numerical) measurements
Distributive, algebraic, holistic.
Spatial Measurements
Collection of spatial object pointers which may
require spatial merge, overlay, or other
operations.

14
Example Weather Pattern Analysis

Input
a map with about 3,000 weather probes scattered
in B.C.
daily data for temperature, precipitation, wind
velocity, etc.
concept hierarchies for all attributes
Output
a map that reveals patterns merged (similar)
regions!
Goals
interactive analysis (drill-down, slice, dice,
pivot, roll-up)
fast response time
minimizing storage space used
Challenge a merged region may contain hundreds
of primitive regions (polygons).

15
A Model of Spatial Data Warehouses

Dimensions
nonspatial
(e.g. 25-30 degrees generalizes to hot)
spatial-to-nonspatial
(e.g. region B.C. generalizes to description
western provinces)
spatial-to-spatial
(e.g. region Burnaby generalizes to region
Lower Mainland)

Measurements
numerical
distributive (e.g. count, sum)
algebraic (e.g. average)
holistic (e.g. median, rank)
spatial
collection of spatial pointers (e.g. pointers to
all regions with 25-30 degrees in July)

16
Star Model of a Spatial Data Warehouse

Dimensions
region_name
time
temperature
precipitation
Measurements
region_map
area
count

Fact table
Dimension table
17
Spatial Merge Pre- vs On-line Computation
Precomputing all too much storage space

On-line merge very expensive
18
Spatial Measurements Selective Materialization

Methods for computation of spatial measurements
in spatial data cube.
Collect and store pointers to spatial objects in
a spatial data cubeComputing on the fly ---
expensive and slow.
Saving all the possible combinations --- huge
space overhead.
Precompute and store rough approximations in a
spatial data cube --- accuracy trade-off.
Selective computation only materialize those
which will be accessed frequently --- a
reasonable choice.
Cube lattice and granularity of merge-able
spatial objects.
Cuboid-level vs. cube cell level granularity.

19
Computing Spatial Measurements

Apply HRU96 greedy algorithm to select cuboids
HRU96 algorithm has granularity on a cuboid
level

Finer granularity, on a cell level
Only selected cells are materialized (not the
whole cuboid)
Factors in selections of cells
access frequency
size of a cell (number of merged objects)
It could be better to save 1,3,4,7 than 1,3
benefit for on-the-fly computationIf 1,3 is
saved, it can be used for 1,3,6.
Only neighboring objects are merged.

20
Integration of Data Mining and Data Warehousing

Data warehouse provides clean, integrated data
for fruitful mining.
Data mining provides powerful tools for analysis
of data stored in data warehouses.
OLAP can be viewed as data summarization and
simple data mining.
Data mining provides more analysis tools, e.g.,
association, classification, clustering,
pattern-directed, and trend analysis.
Mining multi-level knowledge by integration with
OLAP facilities mining in multiple data cubes.

21
Mining Different Kinds of Knowledge

Characterization Generalize, summarize, and
possibly contrast data characteristics, e.g.,
dry vs. wet regions.
Association Rules like inside(x, city) à
near(x, highway).
Classification Classify data based on the
values in a classifying attribute, e.g., classify
countries based on climate.
Clustering Cluster data to form new classes,
e.g., cluster houses to find distribution
patterns.
Trend and deviation analysis Find and
characterize evolution trend, sequential
patterns, similar sequences, and deviation data,
e.g., housing market analysis.
Pattern-directed analysis Find and characterize
user-specified patterns in large databases, e.g.,
volcanos on Mars.

22
Different Mining Tasks in Spatial DBs

Spatial data mining tasks
Spatial data characterization and comparison
Spatial clustering analysis
Spatial classification
Spatial association
Spatial pattern analysis
Spatial concept hierarchies thematic vs.
spatial.
Thematic hierarchy e.g., agriculture (food
(grain (corn, rice, ...), vegetable, fruit),
others(...)).
Spatial hierarchy, based on
Spatial data structures (MBR, quad-tree
R-tree).
Spatial related semantics (geo-region
classification).
Clustering analysis (e.g., neighborhood or
adjacent_to).

23
A Geo-Spatial Data Mining Query Language
GMQL

Extension to Spatial SQL Egenhofer94.
Support ad-hoc data mining queries.

mine characteristic rules type of
rule (characteristic, discriminant,
association, clustering, classification)for
Description of states along I 80 highway
from us_hiway, states_census SQL like
from, where clauses
where states_census.obj intersects us_hiway.obj
high level concepts and and highway "I 80
spatial joins may be usedwith respect to
states_census.obj, state_name, pop90,
capita_income list of relevant attributes
set attribute threshold 51 for state_name
thresholds for rules filtration

24
Background Knowledge for Data Mining

Conceptual "hierarchies" and generalization
operators.
Instance-based freshman, ..., senior Ì
undergraduate.
Schema-based address(city, province, country).
Rule-based good(x) undergraduate(x) Ù
gpa(x) ³ 3.5.
Operation-based aggregation, approximation,
clustering, etc.
Where to get such background knowledge?
Implicitly stored in databases, such as address.
Explicitly defined by experts, such as "physics
Ì science".
Formed with different attribute combinations,
food(category, brand, content _spec, package
_size, price).
Generated automatically by data distribution
analysis.
May need dynamic adjustment for a particular set
of data.
Choose from multiple hierarchies or try them in
parallel.

25
Automatic Generation of Numeric Hierarchies

Count
Amount
2000-97000
2000-16000
16000-97000
2000-12000
12000-16000
16000-23000
23000-97000
26
Spatial OLAP (Characterization)

Viewing data from different angles
Summarization on multiple concept levels

27
Mining Discriminant Rules

Discrimination Comparison of two or more classes
Strategy
Collect the relevant data respectively into
the target class and the contrasting class
Generalize both classes to the same high level
concepts,
Compare tuples with the same high level
descriptions,
Present for every tuple its description and two
numbers
support - distribution within single class
comparison - distribution between classes
Highlight the tuples with strong discriminant
features
Interestingness
Different measures of interestingness,e.g.
consider also the sizes of different classes

28
Spatial OLAP (Comparison)

Comparing different classes of data

Population increases faster in the western
part. Drill down, and look at different
dimensions to get explanation!!
29
Mining Association Rules

Association Finding association among a set of
attributes and their values.
Applications pattern association, market
analysis, etc.
Examples.
milk bread 5, 60
tire Ù auto_accessories auto_services 2,
80
Methods for mining associations
Apriori ( Agrawal Srikant94)
Partition technique (Savasere, Omiecinski,
Navathe95)
Sampling (Toivonen96)

30
Spatial Associations
FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf
Course" FROM Washington_Golf_courses,
Washington WHERE CLOSE_TO(Washington_Golf_courses.
Obj, Washington.Obj, "3 km") AND
Washington.CFCC ltgt "D81" IN RELEVANCE TO
Washington_Golf_courses.Obj, Washington.Obj, CFCC
SET SUPPORT THRESHOLD 0.5
31
Spatial Associations Hierarchy of Spatial
Relationships

Spatial association Association relationship
containing spatial predicates, e.g., close_to,
intersect, contains, etc.
Topological relations
intersects, overlaps, disjoint, etc.
Spatial orientations
left_of, west_of, under, etc.
Distance information
close_to, within_distance, etc.
Hierarchy of spatial relationship
g_close_to near_by, touch, intersect, contain,
etc.
First search for rough relationship and then
refine it.

32
Efficient Mining of Spatial Associations

Two-step computation of spatial associations
Step 1 rough spatial computation as a filter
MBR or R-tree rough estimation.
Step2 Detailed spatial algorithm as refinement
apply only to those pairs which have passed the
rough spatial association testing (no less than
min_support).
Multi-dimensional mining
explore association relationships at any selected
granularity level
perform drill-down and roll-up on any dimension.

33
Example Spatial Association Rule Mining

What kinds of spatial objects are close to each
other in B.C.?
Kinds of objects cities, water, forests,
usa_boundary, mines, etc.
Rules mined
is_a(x, large_town) intersect(x, highway)
adjacent_to(x, water). 7, 85
is_a(x, large_town) adjacent_to(x,
georgia_strait) close_to(x, u.s.a.). 1, 78
Mining method Ariori multi-level association
geo- spatial algorithms (from rough to high
precision).

34
Data Classification

Data categorization based on a set of training
objects.
Applications credit approval, target marketing,
medical diagnosis, treatment effectiveness
analysis, etc.
Example classify a set of diseases and provide
the symptoms which describe each class or
subclass.
The classification task Based on the features
present in the class_labeled training data,
develop a description or model for each class.
It is used for
classification of future test data,
better understanding of each class, and
prediction of certain properties and behaviors.
Data classification methods Decision-trees
(e.g., ID3, C4.5), statistics, neural networks,
rough sets, etc.

35
A Decision-Tree Based Classification Method

A decision tree
ID-3 and C4.5 (Quinlan93) A top-down decision
tree generation algorithm.
At start, all the training examples are at the
root.
Partition examples recursively based on selected
attributes.
Attribute selection Maximizing an information
gain measure, i.e., favoring the partitioning
which makes the majority of examples belong to a
single class.

outlook
sunny
rain
overcast
windy
humidity
P
N
P
N
P
36
Scalable Classification Methods

Scalability of decision-tree classification
algorithms.
Previous approaches
Incremental tree construction (Quinlan86)
total cost is high.
Data sampling and discretizing continuous
attributes
(Cattlet91) still in main memory.
Data partition and merge of parallel partition
(Chan and Stolfo91) reduced classification
accuracy.
SLIQ SPRINT (Mehta et al.96, Shafer et
al.96) disk-based
Decision-tree construction algorithms.
Techniques Pre-sorting, breadth_first
tree-growing, and tree-pruning.

37
Generalization-Based Decision-Tree Induction

Integration of generalization with decision-tree
induction.
Classification at primitive concept levels, e.g.,
precise
temperature, humidity, outlook, etc.
Weakness low-level concepts, scattered classes,
bushy
classification-trees, semantic
interpretation problems.
Classification at high or medium concept levels
may lead to imprecise classification.
Medium level generalization adjustment
Generalize to intermediate concept level(s).
Merge and split concept levels for better class
representation and classification accuracy.
Efficiency Analysis performed in compressed,
generalized relations.

38
Mining Classification Rules

Classification Based on the features present in
the class_labeled training data, develop a
description or model for each class.
Applications credit approval, target marketing,
medical diagnosis, treatment effectiveness
analysis, etc.
Example classify a set of diseases and provide
the symptoms which describe each class or
subclass.

39
Spatial Classification

Generalization-based induction
Interactive classification

40
Predictive Modeling in Databases

Predictive modeling Predict data values or
construct generalized linear models based on
the database data.
One can only predict value ranges or category
distributions.
Method outline
Minimal generalization
Attribute relevance analysis
Generalized linear model construction
Prediction.
Determine the major factors which influence the
prediction.
Data relevance analysis uncertainty measurement,
entropy analysis, expert judgement, etc.
Multi-level prediction drill-down and roll-up
analysis.

41
Spatial Prediction and Trend Analysis

Spatial trend predictive modeling (Ester et
al97)
Discover centers local maximal of some
non-spatial attribute.
Determine the (theoretical) trend of some
non-spatial attribute, when moving away from the
centers.
Discover deviations (from the theoretical trend).
Explain the deviations.
Example Trend of unemployment rate change
according to the distance to Munich.
Similar modeling can be used to study trend of
temperature with the altitude, degree of
pollution in relevance to the regions of
population density, etc.

42
Data Clustering Analysis

Data clustering (unsupervised learning)
Cluster objects
into classes, based on their features, which
maximize intraclass similarity and minimize
interclass similarity.
Probability-based vs. distance-based clustering
analysis.
Typical probability-based clustering analysis
algorithms
COBWEB (Fisher87) Incremental concept
formation.
Category utility measurement (probability of each
concepts occurrence)
Top-down, incremental, hierarchical organization
of concepts.
CLASSIT (Gennari89) extend it to real-valued
data.
Typical distance-based clustering analysis
algorithms
Statistics-based, k-means, k-medoids, nearest
neighbors.

43
Distance-Based Spatial Clustering Analysis

Statistical approaches scan data frequently,
iterative
optimization, hierarchical clustering, etc.
CLARANS (Ng Han94) randomized search
(sampling)
PAM (a distance-based clustering
algorithm).
DASCAN (Ester et al.96) density-based
clustering using spatial data structures
(R-tree).
BIRCH (Zhang et al.96) Balanced iterative
reducing and
clustering using hierarchies.
Focus on densely occupied portions of the data
space.
Measurement reflects the natural closeness of
points.
A height-balanced tree (CF-tree) is used for
clustering.
Describe aggregate proximity relationships (Knorr
Ng96).

44
Spatial Clustering

How can we cluster points?
What are the distinct features of the clusters?

There are more customers with university degrees
in clusters located in the West. Thus, we can
use different marketing strategies!
45
Data and Knowledge Visualization

Visualization of characteristic and discriminant
rules
tables cubes bar/pie charts, curves,
surfaces, etc.
Visualization of association rules
Association rule graph Nodes for large
1-itemset, lines for large 2-items sets, arrows
for implication strength.
Association matrix support/confidence
size/color in cells.
Cluster analysis viewing clusters and their
characteristics.
Classification colored decision trees.
Prediction curves, pie charts, and relevance
analysis results.
Deviation analysis boxplots (quartiles, median)
and outliers.
Visual impression of large data mining results
arrange and color data items as pixels (Keim et
al.94)

46
Visual Data Mining (ref. D. Keim SIGMOD96
Tutorial)

Data visualization and exploratory analysis
Interactive, usually undirected search for
structures, trends, etc.
Typical data visualization techniques
Geometric techniques, icon-based techniques,
pixel-oriented techniques, hierarchical
techniques, graph-based techniques,
3D-techniques, dynamic techniques, and hybrid
techniques.
Database visualization systems
Statistics-oriented systems, visualization-oriente
d systems, database-oriented systems and special
purpose systems.
Visual database exploration is another powerful
approach to data mining, especially spatial data
mining.

47
Data Mining Interfaces

Interactive mining versus a data mining
language.
Specification of data mining tasks.
Data sets any sets of data in databases
Mining task specification kinds of knowledge or
forms of rules to be mined.
Background knowledge (e.g., concept
hierarchies) specification and manipulation.
Interestingness measurement significance,
confidence, thresholds, concept levels, etc.
Transformation and manipulation of output
results.
Roll-up vs. drill-down.
Multiple output forms generalized relations,
crosstabs, charts, curves, and other visual
outputs.

48
GeoMiner Graphical User Interface
49
Systems for Data Warehousing and Data Mining

Systems for Data Warehousing
Arbor Software Essbase
Oracle (IRI) Express
Cognos PowerPlay
Redbrick Systems Redbrick Warehouse
Microstrategy DSS/Server
Systems or Research Prototypes for Data Mining
IBM QUEST (Intelligent Miner)
Silicon Graphics MineSet
Integral Solutions Ltd. Clementine
Information Discovery Inc. Data Mining Suite
SFU (DBTech) DBMiner, GeoMiner
Rutger DataMine, GMD Explora, U Munich VisDB

50
Conclusions

Data warehousing and data mining
A rich, promising, young field with broad
applications and many challenging research
issues.
Imminent task spatial database analysis --- from
spatial data manipulation to on-line spatial
analytical processing (Spatial OLAP) and spatial
data mining.
Spatial data cube construction fine granularity
analysis.
Multiple spatial data mining tasks
Characterization, association, classification,
clustering, sequence and pattern analysis,
prediction, etc.
Integration of data mining with OLAP OLAP-based
spatial data mining.
Integration of spatial analysis methods, spatial
query processing methods, and spatial indexing
techniques.

51
Future Research

Foundation of spatial data warehousing and data
mining.
Implementation methods
Efficient construction of spatial data cubes.
A set of well-tuned spatial data mining
operators.
Spatial data and knowledge visualization tools.
Integration of multiple mining tasks with OLAP
functions.
New spatial indexing techniques for spatial data
warehousing and spatial mining.
New spatial data mining methodologies
Statistical tools, neural nets, and ad-hoc
query-based mining, etc.
Mining spatiotemporal data, raster data, and
integration with existing spatial analysis
techniques.

52
References

1 Floris Geerts, Sofie Haesevoets and Bart
Kuijpers.
A Theory of Spatio-Temporal Database. Computer
Science Dept., North Dakota State University
(2000)
2 Martin Ester, Hans-Peter Kriegel, Jörg
Sander.Algorithms and Applications for Spatial
Data Mining , Geographic Data Mining and
Knowledge Discovery, 2001.
3 Martin Ester, Alexander Frommelt, Hans-Peter
Kriegel, Jörg Sander. Algorithms for
Characterization and Trend Detection in Spatial
Databases, International Conference on Knowledge
Discovery and Data Mining (KDD-98)
4 Jan Paredaens, Bart Kuijpers. Data Models and
Query Languages for Spatial Databases. ACM SIGKDD
Explorations (1999)
5 Hans-Peter Kriegel, Thomas Brinkhoff, Ralf
Schneider. Efficient Spatial Query Processing in
Geographic Database Systems. VLDB (2001)
6 Usama Fayyad, Gregory Piatetsky-Shapiro, and
Padhraic Smyth. From Data Mining to Knowledge
Discovery in Databases. AI MAGAZINE (1999)
7 Ramakrishnan Srikant, Rakesh Agrawal. Mining
Quantitative Association Rules in Large
Relational Tables. VLDB (1996)
8 Krzysztof Koperski, A Progressive
Refinement Approach to Spatial Data Mining. SFU
PhD Thesis (1999)

Thank you !!!

Write a Comment

User Comments (0)

About PowerShow.com

Spatial Data Mining and Spatial Data Warehousing Special Topics In Database - PowerPoint PPT Presentation

Spatial Data Mining and Spatial Data Warehousing Special Topics In Database

Special Topics In Database Sadra Abedinzadeh Ashkan Zarnani Farzad Peyravi Outline Motivation and General Description Data Warehousing: Basic Concepts and Techniques ... – PowerPoint PPT presentation