Correlation Search in Graph Databases - PowerPoint PPT Presentation

Slides: 23
Provided by: Yarla
Transcript and Presenter's Notes


1
Correlation Search in Graph Databases
  • Yiping Ke, James Cheng, Wilfred Ng

  • Presented by Phani Yarlagadda

2
Outline
  • Motivation
  • Challenges
  • Problem Definition
  • Solution
  • Performance Evaluation
  • Related Works

3
Motivation
  • Graph Databases and their importance
  • Correlation mining from graph databases
  • Structural similarity and statistical similarity

4
Challenges
  • Candidate key
  • High complexity graph operations
  • Vast search space

5
Problem Definition
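  • Pearson's Correlation Coefficient
  • A widely used correlation measure
  • Definition
  • Given two graphs g1 and g2, Pearson's
    Correlation Coefficient of g1 and g2, denoted
    φ(g1, g2), is defined as
    φ(g1, g2) = (supp(g1, g2) - supp(g1)·supp(g2)) /
    sqrt(supp(g1)·supp(g2)·(1 - supp(g1))·(1 - supp(g2))),
    where supp(g) is the fraction of graphs in the
    database that contain g and supp(g1, g2) is the
    fraction that contain both g1 and g2.
  • When supp(g1) or supp(g2) is equal to 0 or
    1, φ(g1, g2) is defined to be 0. The range of
    φ(g1, g2) is [-1, 1].
  • In this paper we are concerned with
    positively correlated graphs only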
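
The definition above can be sketched in a few lines of Python. The formula image did not survive in the transcript, so this sketch assumes the standard phi-coefficient form of Pearson's correlation for two binary variables, computed from supports. The function name `pearson_phi` and its parameter names are illustrative, not taken from the paper.

```python
from math import sqrt

def pearson_phi(supp_g1, supp_g2, supp_joint):
    """Pearson's correlation coefficient of two graphs, given their supports.

    supp_g1, supp_g2: fraction of database graphs containing g1 (resp. g2).
    supp_joint: fraction of database graphs containing both g1 and g2.
    Assumption: the standard phi-coefficient formula for binary variables.
    """
    # Degenerate supports: the coefficient is defined to be 0.
    if supp_g1 in (0.0, 1.0) or supp_g2 in (0.0, 1.0):
        return 0.0
    numerator = supp_joint - supp_g1 * supp_g2
    denominator = sqrt(supp_g1 * supp_g2 * (1 - supp_g1) * (1 - supp_g2))
    return numerator / denominator

# Two graphs that always occur together are perfectly correlated.
perfect = pearson_phi(0.5, 0.5, 0.5)
# Two graphs whose joint support equals the product are uncorrelated.
independent = pearson_phi(0.5, 0.5, 0.25)
```

Working from supports rather than raw graphs keeps the expensive step (deciding which database graphs contain a pattern) separate from the arithmetic.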

6
Problem Definition
  • Correlated Graphs
  • Two graphs g1 and g2 are correlated if and
    only if φ(g1, g2) ≥ θ,
  • where θ (0 < θ ≤ 1) is a user-specified
    minimum correlation threshold.

7
Problem Definition
  • Correlated Graph Search
  • Given a graph database D, a correlation query
    graph q and a minimum correlation threshold θ,
    the problem of Correlated Graph Search (CGS) is
    to find the set of all graphs that are correlated
    with q. The answer set of the CGS problem is
    defined as Aq = {(g, Dg) : φ(q, g) ≥ θ}.
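
As a point of reference, the problem statement can be written as a brute-force search. This is a minimal sketch, not the paper's method: it assumes the set of database graph IDs containing each pattern is already known, so no subgraph-isomorphism test appears. The names `naive_cgs`, `occ_q` and `occurrences` are hypothetical, and the coefficient uses the standard phi-coefficient form.

```python
from math import sqrt

def phi(n, occ_a, occ_b):
    """Correlation of two patterns from the ID sets of the graphs containing them."""
    na, nb = len(occ_a), len(occ_b)
    if na in (0, n) or nb in (0, n):
        return 0.0  # degenerate supports are defined to give 0
    sa, sb = na / n, nb / n
    joint = len(occ_a & occ_b) / n
    return (joint - sa * sb) / sqrt(sa * sb * (1 - sa) * (1 - sb))

def naive_cgs(n, occ_q, occurrences, theta):
    """Return {pattern: occurrence set} for every pattern correlated with q."""
    return {g: occ for g, occ in occurrences.items()
            if phi(n, occ_q, occ) >= theta}

# Toy database of 8 graphs (IDs 0..7); q occurs in graphs 0-3.
occ_q = {0, 1, 2, 3}
occurrences = {"b": {0, 1, 2}, "c": {2, 3, 4, 7}, "d": {5, 6, 7}}
answers = naive_cgs(8, occ_q, occurrences, theta=0.6)
```

The brute-force version evaluates the coefficient for every pattern, which is exactly the cost the candidate generation and heuristic rules on the following slides aim to avoid.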

8
Solution-Candidate Set Generation
  • Mine the set of frequent graphs (FGs) from D
    using the thresholds
  • Drawbacks
  • All existing FG mining algorithms generate graphs
    with higher support before those with lower
    support.
  • Neither efficient nor scalable, especially when D
    is large or the lower bound is low.

9
Solution-Candidate Set Generation
  • Mine the set of FGs using the threshold
  • Advantages
  • Efficient candidate generation.
  • Significant reduction in search space.

10
Solution-Framework
  • The framework of the solution consists of the
    following four steps:
  • 1. Obtain the projected database Dq of q.
  • 2. Mine the set of candidate graphs C from Dq,
    using lower(q,g)/supp(q) as the minimum support
    threshold.
  • 3. Refine C by three heuristic rules.
  • 4. For each candidate graph g ∈ C:
  •    (a) Obtain Dg.
  •    (b) Add (g,Dg) to Aq if φ(q, g) ≥ θ.
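
The division by supp(q) in the mining step presumably reflects a simple identity: the support of g measured inside the projected database Dq equals the joint support of q and g in D divided by supp(q). The sketch below checks that identity on a toy database. It makes a loud simplification: each graph is modelled as a set of edge labels and "contains" is a subset test, which stands in for a real subgraph-isomorphism test. All function names are hypothetical.

```python
from fractions import Fraction

def contains(graph, pattern):
    # Stand-in for subgraph isomorphism: graphs are sets of edge labels.
    return pattern <= graph

def project(db, q):
    """Projected database Dq: the graphs of db that contain q."""
    return [g for g in db if contains(g, q)]

def supp(db, *patterns):
    """Fraction of graphs in db that contain every given pattern."""
    hits = sum(1 for g in db if all(contains(g, p) for p in patterns))
    return Fraction(hits, len(db))

# Toy database of 8 "graphs", each written as a string of edge labels.
db = [frozenset(s) for s in ("ab", "ab", "abc", "ac", "c", "d", "d", "cd")]
q, g = frozenset("a"), frozenset("b")
dq = project(db, q)

support_in_dq = supp(dq, g)                    # support of g inside Dq
rescaled_joint = supp(db, q, g) / supp(db, q)  # joint support in D, rescaled
```

Because the two quantities coincide, a bound on the joint support in D can be turned into a minimum support threshold for mining directly in the smaller database Dq.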

11
Solution-Heuristic Rules
  • Heuristic Rule 1
  • Given a graph g, if g ∈ C and g ⊇ q, then
  • g ∈ base(Aq)
  • Identifies graphs that are guaranteed to be
    answers

12
Solution-Heuristic Rules
  • Heuristic Rule 2
  • Given two graphs g1 and g2,
  • where g1 ⊇ g2 and
  • supp(g1, q) = supp(g2, q),
  • if g1 ∉ base(Aq), then g2 ∉ base(Aq)
  • Reduces the search space, saving the
    unrewarding query costs for false positives.

13
Solution-Heuristic Rules
  • Heuristic Rule 3
  • Given two graphs g1 and g2,
  • where g1 ⊇ g2,
  • if supp(g2, q) < f(supp(g1)),
  • then g2 ∉ base(Aq)
  • Reduces the search space, saving the
    unrewarding query costs for false positives.

14
Solution-Algorithm
  • Input: A graph database D, a query graph q, and a
    correlation threshold θ.
  • Output: The answer set Aq.
  • Obtain Dq
  • Mine FGs from Dq using lower(q,g)/supp(q) as the
    minimum support threshold and add the FGs to C
  • for each graph g ∈ C in size-descending order do
  •   if (g ⊇ q)
  •     Add (g,Dg) to Aq
  •   else
  •     Obtain Dg
  •     if (φ(q, g) ≥ θ)
  •       Add (g,Dg) to Aq
  •     else
  •       H2 ← {g' ∈ C : g' ⊆ g, supp(g'; Dq) = supp(g; Dq)}
  •       C ← C - H2
  •       H3 ← {g' ∈ C : g' ⊆ g, supp(g'; Dq) <
            f(supp(g))/supp(q)}
  •       C ← C - H3
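
The main loop can be illustrated with a simplified sketch. Several assumptions are made and should be read as such: graphs are modelled as sets of edge labels with subset containment in place of subgraph isomorphism; the candidate set is supplied by the caller instead of being mined; every surviving candidate is verified with the coefficient, so the shortcut for supergraphs of q is not used; and only the H2 step is implemented, read here as removing sub-patterns of a rejected g whose support in Dq equals that of g. The H3 step is omitted because the function f is not defined in the transcript. All names are hypothetical.

```python
from math import sqrt

def count(db, pattern):
    """Number of database graphs that contain the pattern (subset test)."""
    return sum(1 for g in db if pattern <= g)

def phi(db, q, g):
    """Phi-coefficient form of Pearson's correlation between q and g."""
    n, nq, ng = len(db), count(db, q), count(db, g)
    if nq in (0, n) or ng in (0, n):
        return 0.0
    sq, sg = nq / n, ng / n
    joint = count(db, q | g) / n  # graphs containing both q and g
    return (joint - sq * sg) / sqrt(sq * sg * (1 - sq) * (1 - sg))

def search(db, q, theta, candidates):
    """Return (answers, number of candidates actually verified)."""
    dq = [g for g in db if q <= g]  # projected database of q
    answers, pruned, verified = [], set(), 0
    for g in sorted(candidates, key=len, reverse=True):  # size-descending
        if g in pruned:
            continue
        verified += 1
        if phi(db, q, g) >= theta:
            answers.append(g)
        else:
            # H2 step: sub-patterns with the same support in Dq cannot be answers.
            pruned.update(h for h in candidates
                          if h < g and count(dq, h) == count(dq, g))
    return answers, verified

db = [frozenset(s) for s in ("ab", "ab", "abc", "ac", "c", "d", "d", "cd")]
q = frozenset("a")
candidates = [frozenset(s) for s in ("b", "c", "ab", "ac", "bc")]
answers, verified = search(db, q, 0.6, candidates)
```

On this toy input the candidate "c" is never verified: it is a sub-pattern of the rejected "ac" with the same support in Dq, so the pruning step removes it before its turn comes. Processing candidates in size-descending order is what makes this possible, since larger patterns are examined before their sub-patterns.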

15
Solution-Example
  • Consider the graph database below

16
Solution-Example
  • Query q
  • Candidate set

17
Performance Evaluation
  • The dataset contains the compound structures of
    cancer and AIDS data from the NCI open database
    compounds.
  • The dataset contains about 249k graphs.
  • On average, each graph in the dataset has 21 nodes
    and 23 edges. The number of distinct labels for
    nodes and edges is 88.
  • We randomly generate four sets of queries, F1,
    F2, F3 and F4, each of which contains 100 queries.
    The support ranges for the queries in F1 to F4
    are [0.02, 0.05], (0.05, 0.07], (0.07, 0.1] and
    (0.1, 1.0].

18
Performance Evaluation
  • Effect of candidate generation

19
Performance Evaluation
  • Effect of

20
Performance Evaluation
  • Effect of Heuristic Rules

21
Performance Evaluation
  • Effect of Graph Size

22
Related Works
  • Raymond proposes an efficient algorithm MCES for
    similarity search.
  • Williams proposes an indexing technique that
    adopts graph decomposition method for similarity
    search.
  • Zhang and Feigenbaum adopt the φ correlation
    coefficient to measure correlated pairs in
    transaction databases.