Substructure Similarity Search in Graph Databases

Slides: 33
Provided by: mehme9

1
Substructure Similarity Search in Graph Databases
  • By X. Yan, P. Yu, and J. Han
  • Ömer Can KOLÇAK

2
Outline
  • Introduction
  • Preliminary Concepts
  • Structural Filtering
  • Feature Set Selection
  • Algorithm Implementation

3
Introduction
  • Database research is facing a new challenge
    raised by the emergence of complex structural
    data.
  • Graphs have broad applications and are used in
    datasets, especially in cheminformatics and
    bioinformatics (e.g., ChemIDplus, PDB).
  • In computer vision and pattern recognition,
    graphs are also used to represent complex
    structures such as hand-drawn symbols, 3D objects,
    and medical images.

4
Introduction
  • All these applications indicate the importance
    and broad usage of graph databases and their
    similarity search systems.
  • While discovery in graph datasets has been
    studied, a systematic examination of graph query
    systems is becoming equally important.

5
Structure Search Queries
  • Full Structure Search
  • Substructure Search
  • Full Structure Similarity Search

6
Question
  • What if no matches occur for a given query graph?

  • Answer: a query refinement process

7
Query Refinement Process
  • Refining the query manually is time-consuming.
  • Alternative: define the portion of the query for
    exact matching
  • and let the system change the rest of the query
    slightly.
  • The amount of change is measured by the
    relaxation ratio.

8
Example

9
Introduction
  • Existing tools such as ChemIDplus provide only
    full structure similarity search and exact
    substructure search.
  • Pairwise substructure similarity computation is
    very expensive.
  • If the query misses only one edge, exact
    substructure search may still work.
  • What if more than one edge must be deleted?

10
Graph Similarity Filtering
  • Grafil
  • A feature-based structural filtering algorithm
  • No pairwise similarity computation
  • Instead, it uses two data structures:
  • the feature-graph matrix
  • the edge-feature matrix
  • It filters the dataset by using these matrices;
    the filtering runs on the matrices, not on the
    database itself.

11
Contribution of Grafil
  • A significant contribution of this study is an
    examination of an increasingly important search
    problem in graph databases and the proposal of a
    feature-based filtering algorithm for efficient
    substructure similarity search.
  • The concept presented in Grafil can also be
    applied to searching approximate, non-consecutive
    sequences, trees, and other complicated
    structures.

12
Preliminary Concepts
  • DEFINITION (SUBSTRUCTURE SIMILARITY SEARCH)
  • Given a graph database D = {G1, G2, ..., Gn} and a
    query graph Q, similarity search is to discover
    all the graphs that approximately contain this
    query graph.
  • Target graphs: the graphs in the dataset

13
Preliminary Concepts
  • DEFINITION (RELAXATION RATIO)
  • Given two graphs G and Q, if P is the maximum
    common subgraph of G and Q, then the substructure
    similarity between G and Q is defined by
    |E(P)| / |E(Q)|, where |E(X)| is the number of
    edges of X, and 1 - |E(P)| / |E(Q)| is called the
    relaxation ratio.
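
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the presentation: the function names are hypothetical, and the edge counts of Q and of the maximum common subgraph P are assumed to be known already (finding P itself is the expensive part and is not shown).

```python
from fractions import Fraction


def substructure_similarity(mcs_edges: int, query_edges: int) -> Fraction:
    """Return |E(P)| / |E(Q)|, given the two edge counts."""
    if query_edges <= 0:
        raise ValueError("the query graph must have at least one edge")
    if not 0 <= mcs_edges <= query_edges:
        raise ValueError("|E(P)| must lie between 0 and |E(Q)|")
    return Fraction(mcs_edges, query_edges)


def relaxation_ratio(mcs_edges: int, query_edges: int) -> Fraction:
    """Return 1 - |E(P)| / |E(Q)|."""
    return 1 - substructure_similarity(mcs_edges, query_edges)


# Illustrative counts: a 12-edge query sharing an 11-edge subgraph with G.
print(substructure_similarity(11, 12))  # 11/12
print(relaxation_ratio(11, 12))         # 1/12
```

Exact fractions are used here so that the two quantities always sum to 1 without rounding error.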

14
Example
Substructure similarity: 11/12 ≈ 92%
Maximum common subgraph P: |E(P)| = 11
Relaxation ratio: 1 - 11/12 ≈ 8%

15
Structural Filtering
  • Given a query graph, the main goal of the
    algorithm is to filter out as many graphs as
    possible using a feature-based approach.
  • Possible features:
  • Paths
  • Discriminative frequent structures
  • Elementary structures
  • etc.

16
Example
This query graph contains seven occurrences of
these features: one fa, two fb's, and four fc's.

17
Feature-Graph Matrix
  • Easily maintainable

18
Framework
  • Given a graph database and a query graph, the
    substructure similarity search can be performed
    in the following four steps:
  • Step 1, Index Construction: Select small
    structures as features in the graph database, and
    build the feature-graph matrix between the
    features and the graphs in the database.
  • Step 2, Feature Miss Estimation: Select a feature
    set, calculate the number of selected features
    contained in the query graph, then compute the
    upper bound of feature misses (dmax) if the query
    graph is relaxed with one edge deletion.

19
Framework
  • Step 3, Query Processing: Use the feature-graph
    matrix to calculate the difference in the number
    of features between each graph G in the database
    and query Q. If the difference is greater than
    dmax, eliminate graph G. The remaining graphs
    constitute the candidate answer set, written as
    CQ.
  • Step 4, Query Relaxation: Relax the query further
    if the user needs more matches than those
    returned from the previous step; iterate Steps 2
    to 4.
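
The query processing step can be sketched as follows. This is only an illustration under stated assumptions: the feature-graph matrix, the query counts, and the value of dmax are made-up inputs, the names `FEATURE_GRAPH`, `QUERY_COUNTS`, `feature_misses`, and `candidate_set` are hypothetical, and a feature is assumed to count as missed only when the graph has fewer occurrences than the query.

```python
# Hypothetical feature-graph matrix: feature -> {graph id -> occurrences}.
FEATURE_GRAPH = {
    "fa": {"G1": 0, "G2": 1, "G3": 0, "G4": 0},
    "fb": {"G1": 0, "G2": 0, "G3": 1, "G4": 0},
    "fc": {"G1": 2, "G2": 3, "G3": 4, "G4": 4},
}

# Hypothetical occurrence counts of the same features in the query Q.
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}

GRAPHS = ["G1", "G2", "G3", "G4"]


def feature_misses(graph_id):
    """Count query feature occurrences that the graph cannot supply."""
    return sum(
        max(0, needed - FEATURE_GRAPH[feature][graph_id])
        for feature, needed in QUERY_COUNTS.items()
    )


def candidate_set(graph_ids, d_max):
    """Keep the graphs whose feature misses do not exceed d_max."""
    return [g for g in graph_ids if feature_misses(g) <= d_max]


print({g: feature_misses(g) for g in GRAPHS})
# {'G1': 5, 'G2': 3, 'G3': 2, 'G4': 3}
print(candidate_set(GRAPHS, d_max=4))
# ['G2', 'G3', 'G4']
```

Note that the filter only reads the matrix; no graph in the database is compared with the query structurally at this stage.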

20
Feature Miss Estimation
Construct a feature set: all features
For k = 1 (one edge deletion): dmax = 4

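One way to picture the estimate is a brute-force search over the edge-feature matrix. The sketch below is an assumption-laden illustration: the slides do not show the matrix or say how Grafil computes the bound, so the three-edge matrix `EDGE_FEATURE`, the occurrence labels such as `fb#1`, and the function `max_feature_misses` are all hypothetical, chosen so that deleting one edge loses at most four feature occurrences.

```python
from itertools import combinations

# Hypothetical edge-feature matrix for a three-edge query: each edge maps
# to the feature occurrences that would be lost if that edge were deleted.
EDGE_FEATURE = {
    "e1": {"fb#1", "fb#2", "fc#1"},
    "e2": {"fa#1", "fb#1", "fc#2", "fc#4"},
    "e3": {"fa#1", "fb#2", "fc#3", "fc#4"},
}


def max_feature_misses(edge_feature, k, features=None):
    """Brute-force upper bound on feature misses after deleting k edges.

    If `features` is given, only occurrences of those features count.
    """
    k = min(k, len(edge_feature))
    best = 0
    for edges in combinations(edge_feature, k):
        lost = set().union(*(edge_feature[e] for e in edges))
        if features is not None:
            lost = {occ for occ in lost if occ.split("#")[0] in features}
        best = max(best, len(lost))
    return best


print(max_feature_misses(EDGE_FEATURE, k=1))                          # 4
print(max_feature_misses(EDGE_FEATURE, k=1, features={"fa", "fb"}))   # 2
```

Trying every combination of k edges is exponential in k, so this form is only practical for small queries and small relaxation levels.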
21
Framework on Example
  • Given a graph database and a query graph
  • Index Construction
  • Build the feature-graph matrix
  • Feature Miss Estimation
  • Calculate dmax = 4

22
Framework on Example
  • Query Processing
  • Calculate the difference in the number of
    features between each graph G and query Q.

Total number of occurrences: 7
dmax = 4
Misses: 5, 3, 2, 3
CQ = {G2, G3, G4}

23
Question
  • Should we use all the features together in a
    single filter?
  • Does a filter achieve good filtering performance
    if all the features are used together?
  • Intuitively, such a strategy would improve the
    performance since all the available information
    is used.
  • However, this is not true.

24
Question: Feature Miss Estimation
For k = 1 (one edge deletion): dmax = 2

25
Question: Query Processing
  • Query Processing

Total number of occurrences: 3
dmax = 2
Misses: 3, 3, 2, 2
CQ = {G2, G3}

26
Answer
  • By putting all features into one feature set, we
    may fail to filter out some graphs that do not
    satisfy the query requirement.
  • To improve the accuracy of the filtering, we
    should select feature sets by grouping the
    features.
  • The example implies that the filtering power may
    be weakened if we deploy all the features in one
    filter.
  • To measure the filtering power of a feature, we
    use its selectivity.
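
The effect can be reproduced with a small sketch that runs two filters over the same index. Everything here is an assumption made for illustration: the matrix `FEATURE_GRAPH` is a hypothetical one chosen to be consistent with the candidate sets quoted in the slides, the function names are invented, and the two dmax values (4 for all features, 2 for the subset {fa, fb}) are simply taken as given.

```python
# Hypothetical feature-graph matrix: feature -> {graph id -> occurrences}.
FEATURE_GRAPH = {
    "fa": {"G1": 0, "G2": 1, "G3": 0, "G4": 0},
    "fb": {"G1": 0, "G2": 0, "G3": 1, "G4": 0},
    "fc": {"G1": 2, "G2": 3, "G3": 4, "G4": 4},
}
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}
GRAPHS = ["G1", "G2", "G3", "G4"]


def misses(graph_id, features):
    """Feature misses of one graph, restricted to a feature set."""
    return sum(
        max(0, QUERY_COUNTS[f] - FEATURE_GRAPH[f][graph_id]) for f in features
    )


def apply_filters(graph_ids, filters):
    """Keep graphs that pass every (feature set, dmax) filter."""
    return [
        g for g in graph_ids
        if all(misses(g, fs) <= d_max for fs, d_max in filters)
    ]


all_in_one = [({"fa", "fb", "fc"}, 4)]
subset_only = [({"fa", "fb"}, 2)]

print(apply_filters(GRAPHS, all_in_one))                # ['G2', 'G3', 'G4']
print(apply_filters(GRAPHS, subset_only))               # ['G2', 'G3']
print(apply_filters(GRAPHS, all_in_one + subset_only))  # ['G2', 'G3']
```

In this toy setting G4 survives the single all-features filter but is removed by the smaller one, and chaining several filters can only shrink the candidate set.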

27
Selectivity
  • DEFINITION (SELECTIVITY)
  • Given a graph database D, a query Q, and a
    feature f, the selectivity of f is defined as the
    average difference between its frequency in Q
    and its frequency in each graph of D.

Occurrences of fa in the query graph: 1
Occurrences of fb in the query graph: 2
Occurrences of fc in the query graph: 4
Selectivity of fa: 3/4
Selectivity of fb: 7/4
Selectivity of fc: 3/4

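The selectivity values can be checked with a short computation. The sketch assumes a hypothetical four-graph matrix (`FEATURE_GRAPH`) chosen to be consistent with the figures listed above, and it assumes that a graph with more occurrences than the query contributes a difference of zero; neither detail is stated in the transcript.

```python
from fractions import Fraction

# Hypothetical feature-graph matrix: feature -> {graph id -> occurrences}.
FEATURE_GRAPH = {
    "fa": {"G1": 0, "G2": 1, "G3": 0, "G4": 0},
    "fb": {"G1": 0, "G2": 0, "G3": 1, "G4": 0},
    "fc": {"G1": 2, "G2": 3, "G3": 4, "G4": 4},
}
QUERY_COUNTS = {"fa": 1, "fb": 2, "fc": 4}


def selectivity(feature):
    """Average shortfall of a feature's frequency across the database."""
    per_graph = FEATURE_GRAPH[feature]
    needed = QUERY_COUNTS[feature]
    total = sum(max(0, needed - count) for count in per_graph.values())
    return Fraction(total, len(per_graph))


for name in QUERY_COUNTS:
    print(name, selectivity(name))
# fa 3/4
# fb 7/4
# fc 3/4
```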
28
Feature Set Selection
  • Rule 1. Select a large number of features
  • Rule 2. Make sure features cover the query graph
    uniformly.
  • Rule 3. Separate features with different
    selectivity.

29
Feature Set Selection of Grafil
  • Grafil has two components for feature set
    selection:
  • Base component (Grafil-base): combines features
    of the same size
  • Clustering component

30
Clustering Component
  • Grafil first combines the features whose sizes
    differ by at most 1 and sorts them by
    selectivity.
  • Using hierarchical clustering, Grafil then
    divides them into three groups with high
    selectivity, medium selectivity, and low
    selectivity.
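
As a rough illustration of the grouping idea, the sketch below sorts features by selectivity and cuts the sorted list into three contiguous groups. It is a deliberately simplified stand-in, not hierarchical clustering and not Grafil's actual procedure; the function name `split_by_selectivity` and the selectivity values are hypothetical.

```python
def split_by_selectivity(selectivities, groups=3):
    """Sort features by selectivity (high first) and cut into equal parts."""
    ranked = sorted(selectivities, key=selectivities.get, reverse=True)
    size, extra = divmod(len(ranked), groups)
    result, start = [], 0
    for i in range(groups):
        # The first `extra` groups take one additional feature each.
        end = start + size + (1 if i < extra else 0)
        result.append(ranked[start:end])
        start = end
    return result


# Hypothetical selectivities for six features of similar size.
example = {"f1": 2.5, "f2": 0.2, "f3": 1.1, "f4": 1.9, "f5": 0.4, "f6": 0.9}
high, medium, low = split_by_selectivity(example)
print(high)    # ['f1', 'f4']
print(medium)  # ['f3', 'f6']
print(low)     # ['f5', 'f2']
```

Each resulting group could then serve as the feature set of its own filter, in line with Rule 3 (separate features with different selectivity).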

31
Grafil Algorithm
  • Base component
  • Clustering component
  • Pipeline model

32
Conclusion
  • We discuss the problem of substructure similarity
    search in large-scale graph databases, a problem
    raised by the emergence of massive, complex
    structural data.
  • Unlike previous work, our solution explores a
    filtering algorithm that uses indexed structural
    patterns, without doing costly structure
    comparisons.
  • The successful transformation of the
    structure-based similarity measure to the
    feature-based measure renders our method
    attractive in terms of accuracy and efficiency.