Title: Chapter 8: Trends in DBMS 8.1 Database Support for Field Entities 8.2 Content-based Retrieval 8.3 Introduction to Spatial Data Warehouses 8.4 Summary
1Chapter 8 Trends in DBMS8.1 Database Support
for Field Entities8.2 Content-based
Retrieval8.3 Introduction to Spatial Data
Warehouses8.4 Summary
2Why Learn about Field Data-sets?
- Field data is timely and abundant
- Sensors (e.g. satellite based ones) provide
periodic snapshot of Earth - Most up-to-date data about current events (e.g.
fires, flood) - Field data are useful
- In creating, revising and evaluating vector data
sets - Digital archival of fragile historical paper maps
- To manually get details not captured in vector
interpretations - Example Location selection for a facility (e.g.
a grocery store) - Consider a set of Aerial photographs of different
locations - Vector interpretation includes roads, water
bodies, elevation - What other information can aerial imagery reveal
for construction planning? - trees (types and location), buildings,
3What are Field Data-sets?
- Field data set examples
- Satellite images, aerial photographs
- Digitized paper maps
- Earth Science data-sets, e.g. rainfall,
temperature maps - Data types of Spatial field data sets
- Images
- satellite based, e.g. www.terradata.com
- aerial photographs
- measurements from a Geo-registered sensor
networks, e.g. weather - Video, i.e. time series of images
- Audio data
- Focus Primarily images though some discussion
will apply to other data types
4Fields and Rasters a Sampling of Field Values
- Definitions
- Field a mapping from a spatial domain to a
value domain - Image a mapping from a rectangular grid to a
value domain - A rectangular grid is a collection of cells
called pixels - Raster is geo-registered image, i.e. grid axis
have absolute spatial locations - Fields are often approximated as rasters
- Example Figure 8.1
- Identify spatial domain, field, rectangular grid,
raster approximation - Fields can be approximated as images if relative
spatial locations are adequate
Figure 8.1
5Computing with Field Data
- Field data manipulated using operations of
- map algebra
- image algebra
- An Algebra is a mathematical structure consisting
of - Operands and Operations
- Map Algebra
- Operand rasters
- Operations Can be classified into four groups
- Local, Focal, Zonal and Global
- Image Algebra
- Operand images
- Operations crop, zoom, rotate
6Local Operation
- A local operation maps a raster into another
raster such that the value of a cell in the new
raster depends only on the value of that cell in
the original raster - Examples unary operation thresholding
- binary operation point wise addition
Figure 8.2
7Focal Operation
- In a focal operation, the value of a cell in the
new raster is dependent on the values of the cell
and its neighboring cells in the original raster - Examples unary operations focal sum, gradient,
Neighborhoods Rook, Bishop and Queen
Figure 8.3
8Zonal Operation
- In a global operation, the value of a cell in the
new raster is a function of the location or
values of all cells in the original or another
raster - Examples zonal sum, zonal average, ...
Figure 8.4
9Global Operation
- In a zonal operation, the value of a cell in the
new raster is a function of the value of that
cell in the original layer and the values of
other cells which appear in the same zone
specified in another raster - Example distance from nearest facility
Figure 8.5
10Image Operations Trim
- Image Operations
- Ignore the absolute locations of pixels.
- Come from image processing literature
- Ex. smoothing, low pass filter, high pass filter
- Example A trim operation extracts an
axis-aligned subset of the original raster
Figure 8.6
11Storage and Retrieval of Raster Data - 1
- Traditional Approach
- Store raster data in a file system
- Use custom software to retrieve data-items of
interest - Example personal photographs stored on MS
Windows - Q? What attributes can one attach to digital
photographs ? - Q? Is there an easy way to retrieve all pictures
taken in San Francisco? - Limitations
- Rigid schema
- limited ability to add and manage additional
attributes - Canned Queries only
- limited ability to support ad-hoc queries
- Data quality
- limited ability to identify duplicates or similar
data-items
12Storage and Retrieval of Raster Data in a SDBMS
- A database approach
- Database tables store
- raster data items
- attributes (i.e. meta-data), e.g. creation date,
geo-location, subject, ... - Use SQL like query language to retrieve desired
data-items - retrieve all raster data-items overlapping with
city of San Francisco (Q1) - retrieve latest raster data-item within city of
Paris (Q2) - retrieve raster data-items similar to a given
image (Q3) - Pros
- table schema definition allows user defined
attributes - improve ability to pose ad-hoc queries (Ex. Q1,
Q2) - improve data reliability and quality
- example Query Q3 may be used for duplicate
reduction
13Storage and Retrieval of Raster Data - Challenges
- Challenges in database based approach
- Storage size( raster data item) gt size (disk
blocks) - Retrieval raster has rich content
- a picture is worth a thousand word!
- Approaches to storage challenge
- Delegate storage to DBMS
- Use Binary Large Object (BLOB) data-type
- create table my_picture(
- image BLOB
- creation_date date
- place point
- )
- Do-it-yourself
- divide a raster data-item into smaller slices
- Q? Which way of slicing reduce disk I/Os for
common queries?
14How is Raster Data Stored on Secondary Storage?
- Slicing approaches
- Linear, e.g. one row per disk block (see Figure
8.8(b)) - Tiling - see Figure 8.8(c )
- Tiling is preferred
- For queries extracting rectangular sub-images
- Example - terraserver.com
Figure 8.8
15How is Raster Data Queried?
- Retrieval challenge of rich content
- Meta-data approach
- Content based retrieval
- Meta-data approach
- Select a set of descriptive attributes
- simpler SQL data types, e.g. numeric, string,
date, ... - example source, location, time stamp, subject,
resolution, ... - Store values of descriptive attributes for each
raster data-item - Allow SQL queries on the descriptive attributes
- Limitation of meta-data approach
- Restricts queries to content captured by
descriptive attributes - Does not support Similarity based queries
- example Find all raster data-items similar to a
given raster data item
16Content Based Retrieval (CBR)
- Examples
- Q1. Find all raster data-items similar to a given
raster data item - Q2. Locate a photograph of a river in Minnesota
with trees nearby - Q3. Find all images of state parks which have a
lake within them, are within a radius of one
hundred miles from Chicago, and are southwest of
Chicago - State of the Art
- However, few robust implementations of CBR are
available as of 2002 - several research prototypes address similarity
query Q1 - Result quality is similar to those of web
searches (e.g. www.google.com) - some of the retrieved raster data-item are useful
- many similar data item are not retrieved in the
result - usable in application domains such as publishing
- Our goal is to understand a current approach to
similarity queries - involving spatial similarities
17Content Based Retrieval (CBR)
- Spatial Similarity
- Consider a pair of raster images with common
objects (e.g. parks, lakes) - Spatial similarity between raster images can be
defined based on - similarity of spatial relationships (e.g.
topological, directional) - Q? Which pairs exhibit higher similarity?
- P1 (inside, disjoint) or P2 (inside, covered
by) - P3 (disjoint, touch) or P4 (disjoint, inside)
- P5 (north west, north) or P6 (west, east)
- A graph framework for comparing spatial
relationships - Nodes spatial relationships
- Edges connect most similar nodes
- Similarity metric number of edge on shortest
path between 2 nodes - See Figures 8.9 and 8.10
18Topological Relationship Similarity
- Study Figure 8.9, pp.234
- Nodes topological relationships
- Edges most similar
- Similarity measure path length
- Inference from Model
- P2 (inside, covered by) more similar than P1
(inside, disjoint) - Do you agree?
- Review Figure 2.3 (pp.30)
Figure 8.9
19Direction Relationship Similarity
- Study Figure 8.10, pp.235
- Nodes topological relationships Edges most
similar - Similarity measure path length
- Inference P5 (north-west, north) more similar
than P6 (west, east)
Figure 8.10
20Distance Similarity
- Distance similarity is based on
- Euclidean distance between the centroids of the
objects. - Example Image R is more similar to P than Q in
Figure 8.11 (pp.235)
Figure 8.11
21A Computational Approach to CBR
- Attribute Relation Graph (ARG)
- Node objects in a raster
- Edges relationships
- Example raster of Figure 8.12(a)
- ARG in Figure 8.12(b)
- point object O3
- rectangles O1, O2
- edge (O1, O2) shows that they are disjoint, at 61
degree direction and 5.2 units distant - Vector representation of ARG
- Lists objects and edge properties
- Example in Figure 8.12
Figure 8.12
22A Computational Approach to CBR
- Steps
- Represent each raster data item by its ARG vector
- Map query raster data item by its ARG vector
- Find most similar raster data-items in the
database by comparing ARG vector representations - use a distance metric
- use a multi-dimensional index
- Comment Result quality is similar to those of
web searches. Some of the retrieved raster
data-item are useful
Figure 8.13
23Why are Data Warehouses Interesting?
- Data Warehouse facilitate group decision making
- Consider a dataset
- 1 measure (i.e. Sales)
- 3 dimensions (e.g. Company, Year, Region)
- Analysis questions
- Q1. Rank Regions by total sales.
- Q2. Rank years by total sales.
- Q3. Where are sales consistently growing?
- Cross tabulates summaries reports used to analyze
the trends - Example
24Generating Cross-tabulation Summaries
- Traditional Approach
- Use custom software pulling data out of a DBMS
- Limitations redundant of work, inefficient use
of resources - Data Warehouse approach
- Cross-tab. Can be generated using a set of simple
report - each report is generated from a SQL Select ...
group by statement - Example Fig. 8.19 (pp. 244) and Table 8.3 (pp.
245) - cross-tab example in last slide is a union of
- SALES-L0-A, SALES-L1-A, SALES-L1-B and SALES-L2
- table 8.3 shows SQL queries to compute each part
- Advantage
- rest of SQL is available for pre/post processing
of data - performance gains by eliminating unnecessary
copying of data
25Example Data Warehouse
Figure 8.19
26Cross-tabulation vs. Report Hierarchy
- Spreadsheet view of a report
- Views a report a N-dim. Spreadsheet
- N number of dimension attributes
- Each cell contains value of measure
- Cross-tabulation view of a Report hierarchy
- Example report hierarchy for SALES-L0-A,
SALES-L2-A, SALES-L1-B, SALES-L2, Figure 8.19
(pp.244)
27What is a Data Warehouse?
- Data Warehouse is a special purpose database
- Primarily used for specialized data analysis
purposes - Facilitates generation and navigation of a
hierarchy of reports - Special purpose data-sets and queries
- Data consists of
- a few measure attributes
- a set of dimension attributes
- The measure attribute depends on dimension
attributes - Queries generate reports
- report measure for selected values of dimensions
- aggregate measure for given subset of dimensions
- What is a spatial data warehouse?
- Data warehouses with spatial measures or
dimensions - Example census data - census tract is a spatial
dimension - Example logistics data - route is a spatial
dimension
28Data Warehouse Operations
- Operations on a data warehouse
- Roll-up, Drill-down
- Slice, Dice
- Pivot
- Roll-up
- Inputs A report R, A subset S of dimensions in R
- Output A sequence of reports summarizing R
- Example 1 RSALES-Base, S(Year, Region) in
Figure 8.19 (pp. 244) - Output consists of reports SALES-L0-A,
SALES-L1-B, SALES-L2 - Example 2 RSALES-Base, S(Region, Year)
- Output consists of reports SALES-L0-A,
SALES-L1-A, SALES-L2 - Drill-down
- Inputs A report R, A dimension D not in R
- Output A reports detailing R on D
- Example R SALES-L1-B, D Region in Figure
8.19 (pp.244) - Output report SALES-L0-A
29Data Warehouse Operations
- Slice, Dice
- Reduce dimensions in a table (Figure 8.7, pp.232)
- Inputs A report R, A value V for a dimension D
in R - Output A subset of R where DV
- Example RSALES-L0-A, DYear, V1994 in Figure
8.19 (pp.244) - output Table 8.5 (pp.246)
- includes tuple (ALL, 1994, America, 35)
Figure 8.7
30Data Warehouse Operations
- Pivot
- For a spreadsheet view of reports
- Transposes a spreadsheet
- Example
- Inputs A spreadsheet view of a report R
- Output A transposed spreadsheet
- Example R SALES-L0-A, Figure 8.19 (pp.244)
31Logical Data Model of a DWH
- Purpose of a logical data model
- Specify a framework to specify computational
structure - Allow extension of SQL to model new needs
- Cube operation
- Input A fact table
- Output A set of summary reports covering all
subsets of dimension columns - equivalent to union of all tables and reports in
Figure 8.19 (pp.244) - Example Figure 8.18, pp.243
- SELECT Company, Year, Region, Sum(Sales) AS Sales
- FROM SALES
- GROUP BY CUBE Company, Year, Region
32Figure 8.18
33Physical Data Model of a DWH
- Purpose Computationally efficient implementation
- Ideas
- Pre-computation -
- pre-compute some of reports and use those to
compute other reports - New indexing methods, e.g. bit-map index
- Query Processing Strategies
- strategies for aggregate functions
- new strategies for multi-table joins
- Let us look at strategies for aggregate functions
34DWH Physical Model Aggregate Function Strategies
- Aggregate Functions
- Compute summary statistics for a given set of
values - Examples sum, average, centroid (Table 8.1,
pp.238) - Strategies for efficient computation
- Characterize easy to compute aggregate functions
- 3 categories
- distributive
- algebraic
- holistic
- First 2 categories can be computed easily in one
scan of the dataset
35Definitions of Aggregate Function Categories
- Notation
- F,G,G1,G2,,Gn are aggregate functions where n is
small - S is a set of values, e.g. S(1,2,3,4)
- P(S1,S2,,Sp) is a partition of S, e.g.
P(S1,S2), S1(1,2), S2(3,4) - Distributive (F) if there exists a G such that
- F(S) G(F(S1),F(S2),,F(Sn))
- Example sum is distributive
- Illustration sum(1,2,3,4) sum(sum(1,2),sum(3,4)
) - Algebraic (F) if there exists G1,,Gn, (where n
is small) and - F(S) G(G1(S1),,Gn(S1),G2(S1),,Gn(S2),,G1(Sp),
, Gn(Sp)) - Example average is distributive
- Illustration average(1,2,3,4)count(1,2)averag
e(1,2)count(3,4)average / count(1,2)count(3,
4)
36Example Distributive Aggregate Function
- Examples in cross-tabulation scenario (Figure
8.14, pp.238) - Example 1 Min is distributive
- Example 2 Count is distributive
Figure 8.14
37Examples Algebraic Aggregate Functions
- Examples in cross-tabulation scenario (Figure
8.15, pp.239) - Average and Variance are algebraic
Figure 8.15
38Discussion - Spatial Data Warehouse
- Example
- Consider the example in Figure 8.16, pp.241
- A map interpretation may be attached to each
report - each row has a spatial footprint, which can be
aggregated by geometric-union - The collection of maps may be called a mapcube
- Issues
- What is needed in OGIS standard to support
map-cube operation? - Hierarchical collection of maps in mapcube
- what is an appropriate cartography to convey the
relationship among maps?
39Spatial Data Warehouses and Mapcube
Figure 8.16
40Figure 8.17
41Summary
- Field data
- Useful in many applications due to rich content
- Represented as raster or image
- Operations can be categorized into local, focal,
zonal, and global - Field data storage and retrieval
- Tiling is a preferred way to divide raster data
into disk blocks - Meta-data based query is often used for retrieval
- Content based retrieval may be used for
similarity searches - Data warehouses support analysis e.g.
cross-tabulation reports - SQL CUBE operator support generation of DWH
reports - Distributive and Algebraic aggregate functions
can be computed easily