Christian B - PowerPoint PPT Presentation


Transcript and Presenter's Notes

1
Christian Böhm, University for Health Informatics
and Technology, Innsbruck
Similarity Search and Data Mining: Database Techniques
Supporting Next Decade's Applications
Keynote at iiWAS 2002
2
1
Similarity Search
3
Feature Based Similarity
4
Simple Similarity Queries
  • Specify a query object and
  • Find similar objects: range query
  • Find the k most similar objects: nearest
    neighbor query
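
The two query types can be illustrated without any index by a linear scan. This is a minimal sketch, assuming objects are feature vectors compared by Euclidean distance; the names `range_query`, `knn_query` and the sample data are illustrative and not taken from the slides.

```python
import heapq
import math

def dist(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def range_query(db, query, eps):
    """Return all objects whose distance to the query is at most eps."""
    return [p for p in db if dist(p, query) <= eps]

def knn_query(db, query, k):
    """Return the k objects closest to the query, nearest first."""
    return heapq.nsmallest(k, db, key=lambda p: dist(p, query))

# Illustrative data: four 2-dimensional feature vectors.
db = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
q = (0.0, 0.5)
in_range = range_query(db, q, eps=1.2)
nearest = knn_query(db, q, k=2)
```

The linear scan is also the baseline that an index structure has to beat in high-dimensional spaces.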

5
Multidimensional Index Structure (R-tree)
6
Range Query with Depth-First Traversal
7
Nearest Neighbor Priority Algorithm
Hjaltason, Samet: Ranking in Spatial Databases,
SSD 1995
4 page accesses
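
A best-first search in the spirit of the priority algorithm can be sketched as follows. The `Node` class is a hypothetical stand-in for an index page (a bounding box with either child pages or data points), not an actual R-tree implementation, and the counter of expanded pages is only meant to make the notion of page accesses tangible.

```python
import heapq
import itertools
import math

class Node:
    """Hypothetical index page: a bounding box plus child pages or data points."""
    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        pts = self.points or [c for ch in self.children for c in (ch.low, ch.high)]
        dims = range(len(pts[0]))
        self.low = tuple(min(p[i] for p in pts) for i in dims)
        self.high = tuple(max(p[i] for p in pts) for i in dims)

def mindist(q, node):
    """Smallest possible Euclidean distance from q to any point in the node's box."""
    return math.sqrt(sum(max(lo - x, 0.0, x - hi) ** 2
                         for x, lo, hi in zip(q, node.low, node.high)))

def nearest_neighbor(root, q):
    """Always expand the pending page with the smallest mindist; stop as soon
    as no pending page can contain a point closer than the best one found."""
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = [(mindist(q, root), next(counter), root)]
    best, best_dist, accesses = None, math.inf, 0
    while heap and heap[0][0] < best_dist:
        _, _, node = heapq.heappop(heap)
        accesses += 1            # one page access per expanded page
        for p in node.points:
            d = math.dist(p, q)
            if d < best_dist:
                best, best_dist = p, d
        for child in node.children:
            heapq.heappush(heap, (mindist(q, child), next(counter), child))
    return best, best_dist, accesses

# Illustrative two-level tree with three data pages.
root = Node(children=[Node(points=[(0, 0), (1, 1)]),
                      Node(points=[(5, 5), (6, 6)]),
                      Node(points=[(9, 0), (10, 1)])])
best, best_dist, accesses = nearest_neighbor(root, (5.5, 5.0))
```

In this toy tree only the root and one data page are expanded; the other two pages are pruned because their minimum distance exceeds the distance of the point already found.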
8
Problems of High-Dim. Index Structures
  • Curse of dimensionality
  • Search performance of index deteriorates in high
    dim.
  • Outperformed by sequential scan
  • Solution
  • Optimize various parameters of index structures
  • Needed: cost model for queries. How many pages are
    expected to be accessed for
  • Range queries (with given ε)
  • Nearest neighbor queries (with given k)

9
Cost Estimation (Uniformity/Independence)
  • Minkowski sum: estimation of the access
    probability of a page
    Böhm: A Cost Model for Query Processing in
    High-Dimensional Data Spaces, TODS 25(2), 2000

Nearest neighbor: estimate distance by point
density
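
The idea behind both estimates can be sketched under strong simplifying assumptions: a unit data space, uniformly and independently distributed data and queries, the maximum metric, and no boundary effects beyond clipping at 1. These assumptions and all function names are illustrative; the cited cost model is considerably more detailed.

```python
import math

def access_probability(side_lengths, eps):
    """Probability that a range query with radius eps touches a page.
    The Minkowski sum of the page box and the query cube has side
    a_i + 2*eps in dimension i; its volume is the access probability."""
    return math.prod(min(1.0, a + 2.0 * eps) for a in side_lengths)

def expected_page_accesses(pages, eps):
    """Expected number of accessed pages: sum of the access probabilities."""
    return sum(access_probability(sides, eps) for sides in pages)

def estimated_knn_distance(n_points, dim, k=1):
    """Radius r of the query cube expected to contain k of n uniform points:
    n * (2r)^dim = k  =>  r = 0.5 * (k / n)^(1/dim)."""
    return 0.5 * (k / n_points) ** (1.0 / dim)

# Four pages, each covering a quarter of a 2-dimensional unit space.
pages = [(0.5, 0.5)] * 4
cost = expected_page_accesses(pages, eps=0.1)

# Same number of points, increasing dimensionality.
r_2d = estimated_knn_distance(10**6, dim=2)
r_16d = estimated_knn_distance(10**6, dim=16)
```

With one million points the estimated nearest neighbor radius grows from 0.0005 in 2 dimensions to more than 0.2 in 16 dimensions, which is one way to see why page access probabilities approach 1 in high-dimensional spaces.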
10
Cost Estimation
  • Boundary and saturation effects in high-dim.
    space (considered by our model extension)
  • Correlation between attributes (considered by the
    concept of fractal dimension)
  • Cluster structure also has an impact on performance
  • Currently neglected by our model
  • Histograms and similar data descriptions
    difficult in high-dimensional space (number of
    histo-bins exponential in dimensionality)
  • Other descriptions of cluster structure
    (dendrograms)
  • Subject to future work

11
Optimization of Index Structures
  • To prevent the sequential scan from outperforming
    index-based query processing
  • Optimize various parameters such as
  • Logical block size of the index pages
  • Indexed dimension
  • I/O schedule optimization (fast index scan)
  • Data quantization
  • Observe the balance! (Master Confucius)

12
Page Size Optimization
13
Page Size Optimization
14
Optimized Dimension Assignment
Hi-dim. Index
Inverted List
Matching
R-tree
B-tree
Problem in hi-dim: too few splits in each
dimension
Problem in hi-dim: too many results in each
dimension
15
Optimized Dimension Assignment
Hi-dim. Index
Inverted List
Matching
R-tree
B-tree
Compromise: a moderate number of R-trees, each
indexing a few dimensions
OPTIMIZE!
16
Schedule Optimization (Fast Index Scan)
Range query: required pages are known from the
directory
17
Schedule Optimization (NN Queries)
  • Current expenses are traded for possible later
    savings
  • Start at 100 page and extend forward and
    backward
  • Optimize the cumulated cost balance (CCB)

18
Quantization
  • Approximate the points by a quantization grid
    based on quantiles
  • Benefit: fewer bits for representation
  • Cost: grid cell partially intersected ⇒ access
    the original point data
  • How to choose the grid resolution?

Weber, Schek, Blott: A Quantitative Analysis and
Performance Study..., VLDB 1998
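
The trade-off can be made concrete with a small sketch of a quantile-based grid. It assumes a data space of [0, 1] per dimension, a simple equi-depth split, and a rectangular query box; the function names are illustrative and this is not the data layout of the cited work.

```python
import bisect

def quantile_boundaries(values, bits):
    """Inner cell boundaries so that each of the 2**bits cells holds about
    the same number of values (simple equi-depth split, an assumption)."""
    ordered = sorted(values)
    cells = 2 ** bits
    return [ordered[(i * len(ordered)) // cells] for i in range(1, cells)]

def quantize(point, boundaries):
    """Replace each coordinate by the index of its grid cell (bits per dimension)."""
    return tuple(bisect.bisect_right(b, x) for x, b in zip(point, boundaries))

def cell_interval(index, bounds, lo=0.0, hi=1.0):
    """Coordinate interval covered by a cell in one dimension."""
    low = lo if index == 0 else bounds[index - 1]
    high = hi if index == len(bounds) else bounds[index]
    return low, high

def classify(cell, boundaries, q_low, q_high):
    """Compare a grid cell with the query box [q_low, q_high]:
    'inside'  -> the point certainly qualifies,
    'outside' -> it certainly does not,
    'partial' -> the original point data must be accessed."""
    inside = True
    for idx, b, ql, qh in zip(cell, boundaries, q_low, q_high):
        low, high = cell_interval(idx, b)
        if high < ql or low > qh:
            return "outside"
        if low < ql or high > qh:
            inside = False
    return "inside" if inside else "partial"

points = [(0.1, 0.2), (0.2, 0.8), (0.3, 0.4), (0.4, 0.6),
          (0.6, 0.1), (0.7, 0.9), (0.8, 0.3), (0.9, 0.7)]
boundaries = [quantile_boundaries([p[i] for p in points], bits=1) for i in range(2)]
codes = [quantize(p, boundaries) for p in points]
```

A finer grid (more bits) makes partially intersected cells rarer but enlarges the approximations, which is the balance the grid resolution has to strike.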
19
Independent Quantization (IQ tree)
Combines index, scan, and quantization
Berchtold, Böhm, Jagadish, Kriegel, Sander: Independent
Quantization..., ICDE 2000
Grid resolution optimized by cost model
20
Open Research Problems in Optimization
  • Multi-Parameter Optimization
  • How can parameters be optimized simultaneously?
  • Are there conflicts between optimization goals?
  • Example

Uniform data ⇒ Quantization
Correlated data ⇒ Tree Striping
21
Open Research Problems in Optimization
  • Consider Insert/Delete/Update
  • If the data set is updated heavily, the index
    should be constructed differently than for
    more static data sets
  • Update-bound: keep the index construction rather simple
  • Query-bound: spend more effort on organizing the data
  • Can be considered as an optimization problem

22
2
Data Mining
23
KDD Algorithms Based on Similarity Queries
24
Join Applications
  • Catalogue conversion (catalogue matching)
  • e.g. astronomy catalogues

25
Clustering
  • Clustering (e.g. DBSCAN)
    Ester, Kriegel, Sander, Xu: A Density Based
    Algorithm for Discovering Clusters, KDD 1996

26
Cache Behavior
27
Clustering and Similarity Join
  • DBSCAN uses the similarity join as its basic
    operation
    Böhm, Braunmüller, Breunig, Kriegel: High Perf.
    Clustering based on the Sim. Join, CIKM 2000

28
k-Nearest Neighbor Classification
  • Example

Objects with known class
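
A minimal sketch of k-nearest neighbor classification, assuming Euclidean distance and a simple majority vote among the k nearest objects with known class; the function name and the sample data are illustrative.

```python
import heapq
import math
from collections import Counter

def knn_classify(labeled, query, k):
    """Assign the class that is most frequent among the k nearest labeled objects."""
    nearest = heapq.nsmallest(k, labeled, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Objects with known class: (feature vector, class label).
labeled = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
           ((5.0, 5.0), "B"), ((5.0, 6.0), "B")]
class_near_origin = knn_classify(labeled, (0.5, 0.5), k=3)
class_far_away = knn_classify(labeled, (5.0, 5.4), k=3)
```

Each classification is one k-nearest neighbor query, so classifying a whole set of new objects amounts to a join between the new objects and the labeled ones.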
29
Distance Range Join (ε-Join)
  • Most widespread and best evaluated join
  • Often also called the similarity join
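
As a reference point for the index-based algorithms, the distance range join can be written as a nested loop. This is a sketch assuming Euclidean distance; `distance_range_join` is an illustrative name.

```python
import math

def distance_range_join(R, S, eps):
    """Return all pairs (r, s) whose Euclidean distance is at most eps."""
    return [(r, s) for r in R for s in S if math.dist(r, s) <= eps]

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.0), (5.0, 5.0)]
pairs = distance_range_join(R, S, eps=1.0)
```

The size of the result is not known in advance: it depends entirely on the chosen range.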

30
k-Closest Pair Query
SELECT * FROM R, S
ORDER BY ||R.obj - S.obj||
STOP AFTER k
  • In SQL notation
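
The semantics of the k-closest pair query can be sketched by brute force: order all pairs of the cross product by distance and keep the first k. Euclidean distance and the function name are assumptions for illustration.

```python
import heapq
import itertools
import math

def k_closest_pairs(R, S, k):
    """Return the k pairs (r, s) from R x S with the smallest distance."""
    return heapq.nsmallest(k, itertools.product(R, S),
                           key=lambda pair: math.dist(pair[0], pair[1]))

R = [(0.0, 0.0), (2.0, 2.0)]
S = [(0.0, 1.0), (2.0, 3.5), (5.0, 5.0)]
closest = k_closest_pairs(R, S, k=2)
```

The result always contains k pairs in total (if the cross product has that many), regardless of how they are distributed over the objects of R.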

31
k-Nearest Neighbor Join
SELECT * FROM R, S
GROUP BY R.obj
ORDER BY ||R.obj - S.obj||
STOP AFTER K (≠ k)
  • In SQL notation
  • (limited to k 1)
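
In contrast to the k-closest pair query, the k-nearest neighbor join returns the k nearest neighbors in S for each object of R. A brute-force sketch, assuming Euclidean distance and an illustrative function name:

```python
import heapq
import math

def knn_join(R, S, k):
    """Map every r in R to its k nearest neighbors in S, nearest first."""
    return {r: heapq.nsmallest(k, S, key=lambda s: math.dist(r, s)) for r in R}

R = [(0.0, 0.0), (4.0, 4.0)]
S = [(0.0, 1.0), (2.0, 3.5), (5.0, 5.0)]
neighbors = knn_join(R, S, k=2)
```

The result contains k partners per object of R, so every object of R appears in the output, which is what makes the operation useful for classification and clustering.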

32
R-tree Spatial Join (RSJ)
procedure r_tree_sim_join (R, S, ε)
  if IsDirpg (R) ∧ IsDirpg (S) then
    foreach r ∈ R.children do
      foreach s ∈ S.children do
        if mindist (r,s) ≤ ε then
          CacheLoad(r); CacheLoad(s);
          r_tree_sim_join (r,s,ε);
  else (* assume R,S both DataPg *)
    foreach p ∈ R.points do
      foreach q ∈ S.points do
        if ||p - q|| ≤ ε then report (p,q);
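
A runnable sketch of the same recursion is given below. The `Page` class is a hypothetical in-memory stand-in for index pages, both trees are assumed to have the same height, and the cache loading of the pseudocode is omitted because everything already resides in memory.

```python
import math

class Page:
    """Hypothetical index page: directory pages hold children, data pages hold points."""
    def __init__(self, children=None, points=None):
        self.children = children or []
        self.points = points or []
        corners = self.points or [c for ch in self.children for c in (ch.low, ch.high)]
        dims = range(len(corners[0]))
        self.low = tuple(min(c[i] for c in corners) for i in dims)
        self.high = tuple(max(c[i] for c in corners) for i in dims)

    def is_directory(self):
        return bool(self.children)

def mindist(a, b):
    """Smallest Euclidean distance between the bounding boxes of two pages."""
    return math.sqrt(sum(max(al - bh, 0.0, bl - ah) ** 2
                         for al, ah, bl, bh in zip(a.low, a.high, b.low, b.high)))

def r_tree_sim_join(R, S, eps, report):
    if R.is_directory() and S.is_directory():
        for r in R.children:
            for s in S.children:
                if mindist(r, s) <= eps:      # only page pairs that can mate
                    r_tree_sim_join(r, s, eps, report)
    else:  # assume R and S are both data pages
        for p in R.points:
            for q in S.points:
                if math.dist(p, q) <= eps:
                    report((p, q))

tree_r = Page(children=[Page(points=[(0, 0), (1, 1)]), Page(points=[(8, 8), (9, 9)])])
tree_s = Page(children=[Page(points=[(1, 2), (2, 2)]), Page(points=[(9, 8), (20, 20)])])
result = []
r_tree_sim_join(tree_r, tree_s, 1.0, result.append)
```

In this example only two of the four pairs of data pages are compared point by point; the other two are pruned by the `mindist` test.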
33
Modeling and Optimization
  • Böhm, Kriegel: A Cost Model and Index
    Architecture for the Similarity Join, Wednesday,
    16:30
  • Mating probability of index pages
  • Probability that distance between two pages ≤ ε
  • Two-fold application of Minkowski sum
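
One way to read the two-fold Minkowski sum is sketched below, under assumptions that are not stated on the slide: a unit data space, uniformly and independently placed pages, the maximum metric, and no boundary effects beyond clipping at 1. Two boxes with side lengths a_i and b_i are then within distance eps exactly if their centers differ by at most (a_i + b_i)/2 + eps in every dimension, so page R is enlarged once by page S and once by the join range.

```python
import math

def mating_probability(sides_r, sides_s, eps):
    """Probability that two pages are within distance eps of each other.
    Per dimension, the admissible interval for the relative position of the
    page centers has length a_i + b_i + 2*eps."""
    return math.prod(min(1.0, a + b + 2.0 * eps)
                     for a, b in zip(sides_r, sides_s))

def expected_mating_pairs(pages_r, pages_s, eps):
    """Expected number of page pairs the join has to process."""
    return sum(mating_probability(r, s, eps) for r in pages_r for s in pages_s)

p = mating_probability((0.1, 0.1), (0.1, 0.1), eps=0.05)
pairs = expected_mating_pairs([(0.1, 0.1)] * 10, [(0.1, 0.1)] * 10, eps=0.05)
```

Larger pages reduce the number of pages but raise the mating probability of each pair, which is why the page capacity appears as an optimization parameter.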

34
Modeling and Optimization
  • I/O cost
  • High const. cost per page
  • Large capacity optimum
  • CPU cost
  • Low const. cost per page
  • Low capacity optimum
  • CPU performance like a CPU-optimized index
  • I/O performance like an I/O-optimized index

35
Open Problems for Research (Sim. Join)
  • Modeling and Optimization
  • Dimension
  • Quantization
  • Page scheduling
  • Caching strategies
  • Nearest Neighbor Join
  • Applications
  • Algorithms
  • General
  • Integration into object-relational DBMS

36
3
New Challenges
37
New Challenges
  • Uncertain Features
  • Application
  • Biometric Identification
  • Particularities
  • Features individually associated with uncertainty
    (e.g. as Gaussian distributions)
  • Queries
  • Probability of match
  • Find objects with highest probability of match
  • Find objects with probability of match > ε

Relative probability
Feature a1
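
One possible formalization of these queries is sketched below. It assumes that every stored feature is an independent Gaussian given by (mean, standard deviation), that the query is an exactly observed feature vector, and that the probability of match is the likelihood normalized over the whole database (uniform prior). The slides do not define the model, so all of this, including the function names and sample data, is an assumption.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(query, obj):
    """Density of observing the query features given an object;
    obj is a list of (mu, sigma) pairs, one per feature."""
    return math.prod(gaussian_pdf(x, mu, s) for x, (mu, s) in zip(query, obj))

def match_probabilities(query, db):
    """Relative probability of match for each object, normalized over db."""
    weights = {name: likelihood(query, obj) for name, obj in db.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def most_probable(query, db, k):
    """The k objects with the highest probability of match."""
    probs = match_probabilities(query, db)
    return sorted(probs, key=probs.get, reverse=True)[:k]

def above_threshold(query, db, eps):
    """All objects whose probability of match exceeds eps."""
    probs = match_probabilities(query, db)
    return [name for name, p in probs.items() if p > eps]

db = {"alice": [(1.0, 0.1), (2.0, 0.5)],
      "bob":   [(1.2, 0.5), (2.0, 0.5)],
      "carol": [(1.5, 0.3), (2.5, 0.3)]}
probs = match_probabilities((1.0, 2.0), db)
```

Note that a plain nearest neighbor query on the means would ignore the individual uncertainties, which is what distinguishes these queries from conventional similarity search.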
38
New Challenges
  • Support of e-commerce in all phases
  • Marketing → customer segmentation
  • Sales and booking → advanced similarity search
  • Add-on products → sales transaction analysis
  • Advanced Similarity
  • Adaptable
  • Multimodal models
  • Relevance-feedback
  • Convex hull

39
New Challenges
  • Stock quotes: technical chart analysis
  • Known: database techniques for similarity search
    in time sequences (DFT, etc.)
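
The DFT-based approach can be sketched as follows: each time sequence is represented by its first few Fourier coefficients, and the distance between these short feature vectors never exceeds the true Euclidean distance (Parseval's theorem for the unitary DFT), so filtering on the features produces no false dismissals. The function names and sample sequences are illustrative.

```python
import cmath
import math

def dft_features(series, m):
    """First m coefficients of the unitary DFT of a time sequence."""
    n = len(series)
    return [sum(x * cmath.exp(-2j * math.pi * f * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
            for f in range(m)]

def feature_distance(fa, fb):
    """Euclidean distance between two complex feature vectors."""
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(fa, fb)))

a = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0]
b = [1.0, 2.0, 3.0, 5.0, 3.0, 2.0, 1.0, 1.0]
true_distance = math.dist(a, b)
lower_bound = feature_distance(dft_features(a, 3), dft_features(b, 3))
full_distance = feature_distance(dft_features(a, 8), dft_features(b, 8))
```

The short feature vectors can be stored in a multidimensional index, and candidates returned by the index are then checked against the full sequences.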

40
New Challenges
  • Professional analyst tools use
  • Trading signals generated by indicators (e.g.
    MACD)
  • Formations indicating trends in charts
  • Relationships to the market and to derivatives

41
Conclusion
  • Database primitives: abstraction from
    application
    Similarity Search ⇒ Clustering, Classification
    ⇒ Similarity Join, Outlier Detection
  • Advantages
  • General solution, reuse
  • Separately optimizable

Range Queries, Nearest Neighbor Queries