Correlation Search in Graph Databases

Slides: 23

Provided by: Yarla

Learn more at: http://protocols.netlab.uky.edu

1
Correlation Search in Graph Databases

Yiping Ke, James Cheng, Wilfred Ng
Presented By Phani Yarlagadda

2
Outline

Motivation
Challenges
Problem Definition
Solution
Performance Evaluation
Related Works

3
Motivation

Graph Databases and their importance
Correlation mining from graph databases
Structural similarity and statistical similarity

4
Challenges

Candidate key
High complexity graph operations
Vast search space

5
Problem Definition

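Pearson's Correlation Coefficient
A widely used correlation measure
Definition
Given two graphs g1 and g2, the Pearson's
Correlation Coefficient of g1 and g2, denoted as
φ(g1, g2), is defined as follows:
φ(g1, g2) = (supp(g1, g2) - supp(g1) supp(g2)) / sqrt(supp(g1) supp(g2) (1 - supp(g1)) (1 - supp(g2)))
Here supp(g) is the fraction of graphs in the database
that contain g, and supp(g1, g2) is the fraction that
contain both g1 and g2.
When supp(g1) or supp(g2) is equal to 0 or
1, φ(g1, g2) is defined to be 0. The range of
φ(g1, g2) is [-1, 1].
This paper is concerned with
positively correlated graphs only.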
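
As a rough illustration of the definition above, the following Python sketch computes the coefficient from three support values. The function name `phi`, its parameter names, and the toy numbers are assumptions made for this example; the formula used is the standard Pearson coefficient for two binary variables, with the joint support being the fraction of database graphs that contain both graphs.

```python
import math


def phi(supp_g1: float, supp_g2: float, supp_joint: float) -> float:
    """Pearson's correlation coefficient of two graphs, from their supports."""
    # A support of 0 or 1 makes the denominator zero; the definition
    # sets the coefficient to 0 in that case.
    if supp_g1 in (0.0, 1.0) or supp_g2 in (0.0, 1.0):
        return 0.0
    numerator = supp_joint - supp_g1 * supp_g2
    denominator = math.sqrt(
        supp_g1 * supp_g2 * (1 - supp_g1) * (1 - supp_g2)
    )
    return numerator / denominator


# Made-up supports: g1 occurs in 40% of the graphs, g2 in 50%,
# and both occur together in 30%.
value = phi(0.4, 0.5, 0.3)
```

With these numbers the coefficient is positive because the two graphs co-occur more often than independence would predict (0.3 against 0.4 * 0.5 = 0.2).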

6
Problem Definition

Correlated Graphs
Two graphs g1 and g2 are correlated if and
only if φ(g1, g2) ≥ θ,
where θ (0 < θ ≤ 1) is a user-specified
minimum correlation threshold.

7
Problem Definition

Correlated Graph Search
Given a graph database D, a correlation query
graph q and a minimum correlation threshold θ,
the problem of Correlated Graph Search (CGS) is
to find the set of all graphs that are correlated
with q. The answer set of the CGS problem is
defined as Aq = {(g, Dg) : φ(q, g) ≥ θ}.
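
The problem statement can be sketched as a brute-force search, which is useful mainly as a baseline for the solution that follows. This is a minimal sketch under stated assumptions: each pattern's projected database is assumed to be already known as a set of graph ids, the joint support is taken as the size of the intersection of the two projected databases divided by the database size, and the names `correlation`, `naive_cgs` and the toy data are hypothetical.

```python
import math


def correlation(d_q: set, d_g: set, n: int) -> float:
    """Correlation of q and g from their projected databases (sets of ids)."""
    supp_q, supp_g = len(d_q) / n, len(d_g) / n
    if supp_q in (0.0, 1.0) or supp_g in (0.0, 1.0):
        return 0.0
    supp_joint = len(d_q & d_g) / n  # graphs containing both q and g
    return (supp_joint - supp_q * supp_g) / math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g)
    )


def naive_cgs(projected: dict, d_q: set, n: int, theta: float) -> dict:
    """Return every (g, Dg) pair whose correlation with q reaches theta."""
    return {
        g: d_g
        for g, d_g in projected.items()
        if correlation(d_q, d_g, n) >= theta
    }


# Toy database of 10 graphs; q is contained in graphs 0..3.
d_q = {0, 1, 2, 3}
projected = {
    "a": {0, 1, 2},                   # mostly co-occurs with q
    "b": {4, 5, 6, 7, 8},             # never co-occurs with q
    "c": {0, 1, 2, 3, 4, 5, 6, 7},    # frequent everywhere
}
answers = naive_cgs(projected, d_q, n=10, theta=0.6)
```

The baseline has to evaluate every pattern in the database, which is exactly the vast search space listed under Challenges.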

8
Solution-Candidate Set Generation

Mine the set of frequent graphs (FGs) from D,
using the bounds on supp(g) as the thresholds
Drawbacks
All existing FG mining algorithms generate graphs
with higher support before those with lower
support.
Not efficient or scalable, especially when D is
large or the lower bound is low.
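
The transcript does not preserve the threshold formulas themselves. As a hedged sketch, bounds of this kind can be derived from the definition of the correlation coefficient alone, assuming only that the joint support can never exceed either individual support. The closed forms below are that derivation, not a quotation from the slides, and the function name `support_bounds` is hypothetical.

```python
def support_bounds(supp_q: float, theta: float) -> tuple:
    """Range of supp(g) outside which g cannot reach correlation theta with q.

    Derivation (an assumption of this sketch): replace the joint support
    by its largest possible value min(supp(q), supp(g)) and solve
    correlation >= theta for supp(g).
    """
    lower = supp_q / ((1 - supp_q) / theta ** 2 + supp_q)
    upper = supp_q / (theta ** 2 * (1 - supp_q) + supp_q)
    return lower, upper


# Made-up query support and threshold.
lower, upper = support_bounds(supp_q=0.4, theta=0.8)
```

For a query with support 0.4 and a threshold of 0.8, only graphs with support between roughly 0.30 and 0.51 can qualify. Lowering the threshold widens the range, which is why mining the whole database becomes expensive when the lower bound is low.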

9
Solution-Candidate Set Generation

Mine the set of FGs from the projected database Dq, using lower(q,g)/supp(q) as the minimum support threshold
Advantages
Efficient candidate generation.
Significant reduction in search space.

10
Solution-Framework

The framework of the solution consists of the
following four steps.
Obtain the projected database Dq of q.
Mine the set of candidate graphs C from Dq,
using lower(q,g)/supp(q) as the minimum support
threshold.
Refine C by three heuristic rules.
For each candidate graph g ∈ C,
obtain Dg and
add (g, Dg) to Aq if φ(q, g) ≥ θ.
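
The first two steps can be sketched on a toy database. This is a simplified illustration, not the presented implementation: graphs are modelled as sets of labelled edges and containment is plain edge-set inclusion, standing in for a real subgraph-isomorphism test. The helper names and the value passed as `lower_joint` are made up for the example.

```python
def contains(graph: frozenset, pattern: frozenset) -> bool:
    # Stand-in for subgraph isomorphism: every edge of the pattern
    # must appear in the graph.
    return pattern <= graph


def projected_database(database: dict, pattern: frozenset) -> set:
    """Ids of the database graphs that contain the pattern."""
    return {gid for gid, graph in database.items() if contains(graph, pattern)}


def relative_min_support(supp_q: float, lower_joint: float) -> float:
    # A joint support measured over D becomes a support measured over Dq
    # after dividing by supp(q), because |Dq| = supp(q) * |D|.
    return lower_joint / supp_q


database = {
    0: frozenset({("C", "C"), ("C", "O")}),
    1: frozenset({("C", "C"), ("C", "N")}),
    2: frozenset({("C", "O")}),
    3: frozenset({("C", "C"), ("C", "O"), ("C", "N")}),
}
q = frozenset({("C", "C")})

d_q = projected_database(database, q)       # graphs 0, 1 and 3
supp_q = len(d_q) / len(database)           # 3 of 4 graphs
min_sup_in_dq = relative_min_support(supp_q, lower_joint=0.5)
```

Mining inside Dq rather than D means the miner only ever sees graphs that contain q, and the threshold it works with is larger than the one that would apply to the full database.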

11
Solution-Heuristic Rules

Heuristic Rule 1
Given a graph g, if g ∈ C and g ⊇ q, then
g ∈ base(Aq)
Identifies graphs that are guaranteed to be
answers
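
A numeric check can make the guarantee plausible. The sketch assumes that for a supergraph g of q every graph containing g also contains q, so the joint support equals supp(g), and that a candidate's support is at least a lower bound derived from the correlation definition. That lower-bound formula is this example's own derivation, and all names and numbers are hypothetical.

```python
import math


def phi(supp_q: float, supp_g: float, supp_joint: float) -> float:
    if supp_q in (0.0, 1.0) or supp_g in (0.0, 1.0):
        return 0.0
    return (supp_joint - supp_q * supp_g) / math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g)
    )


def lower_bound(supp_q: float, theta: float) -> float:
    # Assumed bound: smallest supp(g) that can still reach theta.
    return supp_q / ((1 - supp_q) / theta ** 2 + supp_q)


supp_q, theta = 0.4, 0.8
lower = lower_bound(supp_q, theta)

# Sweep supp(g) from the lower bound up to supp(q); a supergraph of q
# can never be more frequent than q itself.
steps = 50
values = []
for i in range(steps + 1):
    supp_g = lower + (supp_q - lower) * i / steps
    values.append(phi(supp_q, supp_g, supp_g))  # joint support = supp(g)

# Small tolerance for floating-point rounding at the boundary.
all_pass = all(v >= theta - 1e-9 for v in values)
```

Under these assumptions every sampled supergraph candidate reaches the threshold, so its correlation never needs to be computed.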

12
Solution-Heuristic Rules

Heuristic Rule 2
Given two graphs g1 and g2,
where g1 ⊇ g2 and
supp(g1, q) = supp(g2, q),
if g1 ∉ base(Aq), then g2 ∉ base(Aq)
Helps reduce the search space, saving the
unrewarding query costs for false positives.
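
The intuition behind this rule is that, with the joint support held fixed, the correlation with q drops as a graph becomes more frequent, and a subgraph is always at least as frequent as its supergraph. The sketch below illustrates this with made-up supports; `phi` and the variable names are assumptions of the example.

```python
import math


def phi(supp_q: float, supp_g: float, supp_joint: float) -> float:
    if supp_q in (0.0, 1.0) or supp_g in (0.0, 1.0):
        return 0.0
    return (supp_joint - supp_q * supp_g) / math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g)
    )


supp_q, supp_joint = 0.4, 0.3

# g1 is the supergraph, so it is the rarer of the two graphs;
# g2 is its subgraph and occurs in more database graphs.
phi_g1 = phi(supp_q, 0.35, supp_joint)
phi_g2 = phi(supp_q, 0.50, supp_joint)
```

Here `phi_g2` is smaller than `phi_g1`, so if g1 already fails the threshold, g2 fails it as well and can be dropped without obtaining its projected database.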

13
Solution-Heuristic Rules

Heuristic Rule 3
Given two graphs g1 and g2,
where g1 ⊇ g2,
if supp(g2, q) < f(supp(g1)),
then g2 ∉ base(Aq)
Helps reduce the search space, saving the
unrewarding query costs for false positives.

14
Solution-Algorithm

Input: A graph database D, a query graph q, and a
correlation threshold θ.
Output: The answer set Aq.
Obtain Dq
Mine FGs from Dq using lower(q,g)/supp(q) as the
minimum support threshold and add the FGs to C
for each graph g ∈ C in size-descending order do
  if (g ⊇ q)
    Add (g, Dg) to Aq
  else
    Obtain Dg
    if (φ(q, g) ≥ θ)
      Add (g, Dg) to Aq
    else
      H2 ← {g' ∈ C : g' ⊆ g, supp(g'; Dq) = supp(g; Dq)}
      C ← C - H2
      H3 ← {g' ∈ C : g' ⊆ g, supp(g'; Dq) <
      f(supp(g))/supp(q)}
      C ← C - H3
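
The loop above can be sketched in Python on a toy database. Several simplifications are assumptions of this sketch rather than part of the presentation: graphs are sets of labelled edges, containment is edge-set inclusion instead of subgraph isomorphism, the candidate set is passed in rather than mined, and the function `f` used by the third pruning step is an optional caller-supplied callable because the transcript does not define it.

```python
import math


def phi(supp_q, supp_g, supp_joint):
    if supp_q in (0.0, 1.0) or supp_g in (0.0, 1.0):
        return 0.0
    return (supp_joint - supp_q * supp_g) / math.sqrt(
        supp_q * supp_g * (1 - supp_q) * (1 - supp_g)
    )


def project(database, pattern):
    """Ids of the graphs containing the pattern (edge-set inclusion)."""
    return {gid for gid, graph in database.items() if pattern <= graph}


def correlated_graph_search(database, q, candidates, theta, f=None):
    n = len(database)
    d_q = project(database, q)
    supp_q = len(d_q) / n

    def count_in_dq(pattern):
        return len(project(database, pattern) & d_q)

    remaining = sorted(candidates, key=len, reverse=True)  # size-descending
    answers, checked = {}, []
    while remaining:
        g = remaining.pop(0)
        d_g = project(database, g)
        if q <= g:                      # supergraph of q: accepted directly
            answers[g] = d_g
            continue
        checked.append(g)               # correlation actually evaluated
        supp_g = len(d_g) / n
        if phi(supp_q, supp_g, len(d_g & d_q) / n) >= theta:
            answers[g] = d_g
            continue
        # H2: subgraphs of the failed g with the same support in Dq.
        same = count_in_dq(g)
        remaining = [h for h in remaining
                     if not (h <= g and count_in_dq(h) == same)]
        # H3: only applied when the caller supplies the function f.
        if f is not None:
            bound = f(supp_g) / supp_q
            remaining = [h for h in remaining
                         if not (h <= g and count_in_dq(h) / len(d_q) < bound)]
    return answers, checked


A, B, X = ("C", "C"), ("C", "O"), ("C", "N")
database = {
    0: frozenset({A, B, X}), 1: frozenset({A, B, X}), 2: frozenset({A, B, X}),
    3: frozenset({A}),
    4: frozenset({B, X}), 5: frozenset({B, X}), 6: frozenset({B, X}),
    7: frozenset({B}),
}
q = frozenset({A})
candidates = [frozenset(s) for s in
              ({A, B, X}, {A, B}, {A, X}, {B, X}, {B}, {X})]
answers, checked = correlated_graph_search(database, q, candidates, theta=0.5)
```

In this toy run the three supergraphs of q are accepted without a correlation check, the candidate with edges B and X is evaluated and rejected, and its two single-edge subgraphs are pruned because they have the same support in Dq, so only one correlation is computed for six candidates.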

15
Solution-Example

Consider the graph database below

16
Solution-Example

Query q
Candidate set

17
Performance Evaluation

The dataset contains the compound structures of
cancer and AIDS data from the NCI Open Database
compounds.
The dataset contains about 249k graphs.
On average, each graph in the dataset has 21 nodes and
23 edges. The number of distinct labels for nodes
and edges is 88.
We randomly generate four sets of queries, F1,
F2, F3 and F4, each of which contains 100 queries.
The support ranges for the queries in F1 to F4
are [0.02, 0.05], (0.05, 0.07], (0.07, 0.1] and
(0.1, 1.0], respectively.
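
For illustration, assigning a query to one of the four sets by its support could look like the sketch below. The function name `query_set` is hypothetical, and which interval ends are closed is an assumption, since the brackets are only partly preserved in the transcript.

```python
def query_set(support: float):
    """Map a query's support to its query set, or None if out of range."""
    if 0.02 <= support <= 0.05:
        return "F1"
    if 0.05 < support <= 0.07:
        return "F2"
    if 0.07 < support <= 0.1:
        return "F3"
    if 0.1 < support <= 1.0:
        return "F4"
    return None  # support below 0.02: not used as a query


labels = [query_set(s) for s in (0.03, 0.06, 0.1, 0.5)]
```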

18
Performance Evaluation

Effect of candidate generation

19
Performance Evaluation

Effect of θ

20
Performance Evaluation

Effect of Heuristic Rules

21
Performance Evaluation

Effect of Graph Size

22
Related Works

Raymond proposes an efficient algorithm, MCES, for
similarity search.
Williams proposes an indexing technique that
adopts a graph decomposition method for similarity
search.
Zhang and Feigenbaum adopted the φ correlation
coefficient to measure correlated pairs in
transaction databases.
