1
Mining Tree-Query Associations in a Graph
  • Bart Goethals
  • University of Antwerp, Belgium
  • Eveline Hoekx
  • Jan Van den Bussche
  • Hasselt University, Belgium

2
Graph Data
  • A (directed) graph over a set of nodes N is a set
    G of edges: ordered pairs (i, j) with i, j ∈ N.

Snapshot of a graph representing the complete
metabolic pathway of a human.
3
Graph Mining
  • Transactional category
  • dataset: set of many small graphs (transactions)
  • frequency: number of transactions in which the
    pattern occurs (at least once)
  • ILP: Warmr
  • AGM, FSG, TreeMiner, gSpan, FFSM
  • Single graph category
  • dataset: single large graph
  • frequency: number of copies of the pattern in the
    large graph
  • Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM,
    Jeh-Widom

Focus so far is on pattern mining; little work on
association rule mining!
4
Our work
  • Single graph category
  • Pattern mining and association rule mining
  • Patterns with
  • Existential nodes
  • Parameters
  • An occurrence of the pattern in G is any
    homomorphism from the pattern into G.
  • So far only considered in the ILP (transactional)
    setting

5
Example of a pattern
frequency = #{ x | ∃z: (5, z) ∈ G ∧ (z, 8) ∈ G ∧
(z, x) ∈ G }
6
Patterns are conjunctive queries.
  • select distinct G3.to as x
  • from G G1, G G2, G G3
  • where G1.from = 5 and G1.to = G2.from
  • and G1.to = G3.from and G2.to = 8

frequency = #{ x | ∃z: (5, z) ∈ G ∧ (z, 8) ∈ G ∧
(z, x) ∈ G }
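The correspondence between the pattern and the SQL query can be tried out on a small graph. The following is a minimal sketch, assuming an in-memory SQLite database and a made-up edge set (neither comes from the slides); the columns are named src and dst here because "from" and "to" are SQL keywords.

```python
import sqlite3

# Toy edge set: an assumption for illustration, not the graph from the slides.
EDGES = [(5, 1), (5, 2), (1, 8), (1, 3), (1, 4), (2, 7), (2, 8), (2, 4)]


def pattern_answers(edges):
    """Return all x such that, for some z, (5,z), (z,8) and (z,x) are edges."""
    con = sqlite3.connect(":memory:")
    # Hypothetical column names src/dst stand in for from/to.
    con.execute("CREATE TABLE G (src INTEGER, dst INTEGER)")
    con.executemany("INSERT INTO G VALUES (?, ?)", edges)
    rows = con.execute(
        """
        SELECT DISTINCT G3.dst AS x
        FROM G G1, G G2, G G3
        WHERE G1.src = 5 AND G1.dst = G2.src
          AND G1.dst = G3.src AND G2.dst = 8
        """
    ).fetchall()
    con.close()
    return sorted(x for (x,) in rows)


answers = pattern_answers(EDGES)
frequency = len(answers)  # the frequency is the number of distinct answers
print(answers, frequency)
```

On this toy graph both z = 1 and z = 2 qualify, and the distinct values of x are 3, 4, 7 and 8.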
7
Example of an Association Rule
8
Features of the presented algorithms
  • Pattern mining phase + association mining phase
  • Restriction to trees => efficient algorithms
  • Equivalence checking
  • Apply theory of conjunctive database queries
  • Database oriented implementation

9
Outline: rest of talk
  • Formal problem definition
  • Algorithms
  • Pattern Mining
  • Overall approach
  • Outer loop: incremental
  • Inner loop: levelwise
  • Equivalence checking
  • Association Rule Mining
  • Result management
  • Experimental results
  • Future work

10
Formal definition of a tree pattern.
  • A tree pattern is a tree P whose nodes are called
    variables, and
  • some variables are marked as existential (∃)
  • some variables are parameters (labeled with a
    constant)
  • remaining variables are called distinguished

11
Formal definition of a tree query.
  • A tree query Q is a pair (H,P) where
  • P is a tree pattern, the body of Q
  • H is a tuple of distinguished variables and
    parameters of P. All distinguished variables of P
    must appear at least once in H, the head of Q

12
Formal definition of a matching
  • A matching of a pattern P in a graph G is a
    homomorphism
  • h: P → G, with h(z) = a for each parameter z
    labeled a.

19
Example Matching
      z1   y  z2   x
h1:    0   1   8   4
h2:    0   1   8   8
h3:    0   2   8   4
h4:    0   2   8   5
h5:    0   2   8   8
20
Formal definition of frequency
We define the answer set of Q in G as follows:
Q(G) := { f(H) | f is a matching of P in G }
  • The frequency of Q in G is the number of answers
    in the answer set.
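The definitions of matching, answer set and frequency can be made concrete with a brute-force sketch. It assumes a pattern given as a list of (parent, child) variable pairs and a dict of parameter labels; the variable names, the toy graph and the exhaustive enumeration are illustrative assumptions, not the database-oriented method of the talk.

```python
from itertools import product


def matchings(pattern_edges, params, graph_edges):
    """Yield every homomorphism h from the pattern into the graph.

    pattern_edges: list of (parent, child) variable pairs
    params: dict variable -> constant; h must map that variable to the constant
    graph_edges: set of (i, j) node pairs
    """
    variables = sorted({v for edge in pattern_edges for v in edge})
    nodes = sorted({n for edge in graph_edges for n in edge})
    for values in product(nodes, repeat=len(variables)):
        h = dict(zip(variables, values))
        if any(h[v] != c for v, c in params.items()):
            continue  # a parameter is not mapped to its label
        if all((h[u], h[v]) in graph_edges for u, v in pattern_edges):
            yield h


def answer_set(head, pattern_edges, params, graph_edges):
    """Q(G): the set of tuples f(H) over all matchings f."""
    return {
        tuple(h[v] for v in head)
        for h in matchings(pattern_edges, params, graph_edges)
    }


# Toy graph and pattern (assumptions): root r labeled 5, existential z,
# leaf t labeled 8, distinguished leaf x.
GRAPH = {(5, 1), (5, 2), (1, 8), (1, 3), (1, 4), (2, 7), (2, 8), (2, 4)}
PATTERN = [("r", "z"), ("z", "t"), ("z", "x")]
PARAMS = {"r": 5, "t": 8}

all_matchings = list(matchings(PATTERN, PARAMS, GRAPH))
answers = answer_set(("x",), PATTERN, PARAMS, GRAPH)
frequency = len(answers)
print(len(all_matchings), sorted(answers), frequency)
```

Several matchings can yield the same answer: here z is not in the head, so matchings that differ only in z collapse into one answer tuple.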

21
Example Matching
      z1   y  z2   x
h1:    0   1   8   4
h2:    0   1   8   8
h3:    0   2   8   4
h4:    0   2   8   5
h5:    0   2   8   8
?
?
frequency ???
22
Problem statement 1: Tree query mining
  • Given a graph G and a threshold k, find all tree
    queries that have frequency at least k in G.
  • Those queries are called frequent.

23
Formal definition of an association rule
An association rule (AR) is of the form Q1 ⇒ Q2
with Q1 and Q2 tree queries. The AR is legal if
Q2 ⊆ Q1. The confidence of the AR in a graph G
is defined as the frequency of Q2 divided by the
frequency of Q1.
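As a small worked example of confidence, the sketch below compares two queries on a made-up graph. The left-hand query asks for all x reachable in two steps from node 5; the right-hand query adds the requirement that the intermediate node also has an edge to 8, so every answer of the right-hand query is also an answer of the left-hand one. The graph and the helper names are assumptions for illustration.

```python
from fractions import Fraction

# Toy edge set (an assumption, not data from the talk).
EDGES = {(5, 1), (5, 2), (1, 8), (1, 3), (2, 4), (2, 7)}


def q1_answers(g):
    """Answers of Q1: all x such that, for some z, (5,z) and (z,x) are edges."""
    return {x for (a, z) in g if a == 5 for (z2, x) in g if z2 == z}


def q2_answers(g):
    """Answers of Q2: as Q1, but z must also have an edge to 8."""
    return {
        x
        for (a, z) in g
        if a == 5 and (z, 8) in g
        for (z2, x) in g
        if z2 == z
    }


freq_q1 = len(q1_answers(EDGES))
freq_q2 = len(q2_answers(EDGES))
confidence = Fraction(freq_q2, freq_q1)  # frequency of Q2 over frequency of Q1
print(freq_q1, freq_q2, confidence)
```

Here Q1 has the four answers 3, 4, 7, 8 and Q2 keeps only 3 and 8, so the rule has confidence 1/2.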
24
Problem statement 2: Association rule mining
  • Input: a graph G, minsup, a tree query Qleft
    frequent in G, minconf
  • Output: all tree queries Q such that Qleft ⇒ Q is
    a legal and confident association rule in G.

25
Outline: rest of talk
  • Formal problem definition
  • Algorithms
  • Pattern Mining
  • Overall approach
  • Outer loop: incremental
  • Inner loop: levelwise
  • Equivalence checking
  • Association Rule Mining
  • Result management
  • Experimental results
  • Future work

26
Pattern Mining Algorithm
  • Outer loop
  • Generate, incrementally, all possible trees of
    increasing sizes. Avoid generation of isomorphic
    trees.

Inner loop: For each newly generated tree,
generate all queries based on that tree, and test
their frequency.
...
27
Outer loop
  • It is well known how to efficiently generate all
    trees uniquely up to isomorphism
  • Based on canonical form of trees.
  • Scoins, Li-Ruskey, Zaki, Chi-Young-Muntz
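To illustrate what "uniquely up to isomorphism" means, here is a brute-force sketch: it enumerates parent arrays and keeps one canonical form per rooted unordered tree. This is deliberately naive and is not one of the efficient algorithms cited above; the canonical form used (sorted nested tuples) and the function names are assumptions made for the example.

```python
from itertools import product


def canonical(parent):
    """Canonical form of the rooted tree given by a parent array.

    Node 0 is the root (parent[0] is ignored). The form of a node is the
    sorted tuple of the forms of its children, so isomorphic trees agree.
    """
    children = [[] for _ in parent]
    for child in range(1, len(parent)):
        children[parent[child]].append(child)

    def encode(v):
        return tuple(sorted(encode(c) for c in children[v]))

    return encode(0)


def rooted_trees(n):
    """Return one canonical form per rooted unordered tree on n nodes."""
    seen = set()
    # Node i > 0 picks any earlier node as its parent; every rooted tree
    # has at least one such numbering, so no shape is missed.
    for choice in product(*(range(i) for i in range(1, n))):
        seen.add(canonical((None,) + choice))
    return seen


counts = [len(rooted_trees(n)) for n in range(1, 7)]
print(counts)
```

The counts for 1 to 6 nodes are 1, 1, 2, 4, 9, 20. The brute force inspects (n-1)! parent arrays, which is exactly the duplication that canonical-form-based generation avoids.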

28
Inner loop: Levelwise approach
  • A query Q is characterized by:
  • Π_Q: set of existential nodes
  • Σ_Q: set of parameters
  • Labeling λ_Q of the parameters by constants.
  • Q2 = (Π2, Σ2, λ2) specializes Q1 = (Π1, Σ1, λ1) if
    Π1 ⊆ Π2, Σ1 ⊆ Σ2 and λ2 agrees with λ1 on Σ1.
  • If Q2 specializes Q1 then freq(Q2) ≤ freq(Q1).
  • Most general query: T = (∅, ∅, ∅)

29
Inner loop: Candidate generation
  • CanTab_{Π,Σ} := { λ | (Π, Σ, λ) is a candidate
    query }
  • FreqTab_{Π,Σ} := { λ | (Π, Σ, λ) is a frequent
    query }
  • (Π', Σ') is a parent of (Π, Σ) if either
  • Σ' = Σ and Π has precisely one more node than Π',
    or
  • Π' = Π and Σ has precisely one more node than Σ'
  • Join Lemma
  • Each candidacy table can be computed by taking
    the natural join of its parent frequency tables.
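The Join Lemma can be sketched in a few lines, assuming each frequency table is held as a list of rows and each row is a dict from parameter name to constant. The table contents and the helper name natural_join are made up for illustration; the talk computes this join in SQL.

```python
from functools import reduce


def natural_join(left, right):
    """Natural join of two tables given as lists of dicts (column -> value)."""
    joined = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if all(a[c] == b[c] for c in shared):
                row = {**a, **b}
                if row not in joined:  # keep set semantics
                    joined.append(row)
    return joined


# Hypothetical parent frequency tables for a candidate with parameters
# x1 and x3: two parents fix a single parameter, one parent fixes both.
freq_x1 = [{"x1": 0}, {"x1": 5}]
freq_x3 = [{"x3": 8}, {"x3": 4}]
freq_x1_x3 = [{"x1": 0, "x3": 8}, {"x1": 5, "x3": 8}, {"x1": 7, "x3": 4}]

can_tab = reduce(natural_join, [freq_x1, freq_x3, freq_x1_x3])
print(can_tab)
```

A labeling survives only if each of its restrictions appears in the corresponding parent table: (7, 4) is dropped because 7 is not a frequent value for x1 on its own.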

30
Inner loop: Frequency counting
  • Each candidacy table can be computed by a single
    SQL query (ref. Join Lemma).
  • Suppose G(from, to) is a table in the database;
    then each frequency table can be computed with a
    single SQL query.
  • Case Σ = ∅:
  • formulate the query in SQL and count
  • Case Σ ≠ ∅:
  • formulate (Π, ∅, ∅) in SQL, giving E
  • natural join of E with CanTab_{Π,Σ}
  • group by Σ
  • count each group

31
Inner loop Example
Example: Π = {x2}, Σ = {x1, x3}
32
Inner loop Example
Example: Π = {x2}, Σ = {x1, x3}
  • Join expression
  • CanTab_{{x2},{x1,x3}} = FreqTab_{{x2},{x1}} ⋈
    FreqTab_{{x2},{x3}} ⋈ FreqTab_{∅,{x1,x3}}

33
Inner loop Example
Example: Π = {x2}, Σ = {x1, x3}
  • SQL expression E for ({x2}, ∅, ∅)

select distinct G1.from as x1, G2.to as x3,
G3.to as x4 from G G1, G G2, G G3 where G1.to =
G2.from and G3.from = G2.from
34
Inner loop Example
Example: Π = {x2}, Σ = {x1, x3}
  • SQL expression for filling the frequency table

select distinct E.x1, E.x3, count(E.x4) from E,
CanTab_{{x2},{x1,x3}} as CT where E.x1 = CT.x1 and
E.x3 = CT.x3 group by E.x1, E.x3 having
count(E.x4) >= k
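The two SQL expressions of this example can be run end to end on a toy graph. The sketch below assumes an in-memory SQLite database rather than the system used in the talk, renames the columns to src and dst (because "from" and "to" are SQL keywords), stores the candidacy table under the hypothetical name CT, and uses made-up edges and candidates.

```python
import sqlite3


def frequency_table(edges, candidates, k):
    """Return (x1, x3, count) for each candidate labeling with count >= k."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE G (src INTEGER, dst INTEGER)")
    con.executemany("INSERT INTO G VALUES (?, ?)", edges)
    # CT plays the role of the candidacy table for parameters x1 and x3.
    con.execute("CREATE TABLE CT (x1 INTEGER, x3 INTEGER)")
    con.executemany("INSERT INTO CT VALUES (?, ?)", candidates)
    # E: the tree x1 -> x2 -> {x3, x4} with x2 projected out.
    con.execute(
        """
        CREATE VIEW E AS
        SELECT DISTINCT G1.src AS x1, G2.dst AS x3, G3.dst AS x4
        FROM G G1, G G2, G G3
        WHERE G1.dst = G2.src AND G3.src = G2.src
        """
    )
    rows = con.execute(
        """
        SELECT E.x1, E.x3, COUNT(E.x4)
        FROM E, CT
        WHERE E.x1 = CT.x1 AND E.x3 = CT.x3
        GROUP BY E.x1, E.x3
        HAVING COUNT(E.x4) >= ?
        ORDER BY E.x1, E.x3
        """,
        (k,),
    ).fetchall()
    con.close()
    return rows


# Toy data (assumptions for illustration).
EDGES = [(0, 1), (0, 2), (1, 8), (1, 4), (2, 8), (2, 5)]
CANDIDATES = [(0, 8), (0, 4)]

table_k3 = frequency_table(EDGES, CANDIDATES, 3)
table_k2 = frequency_table(EDGES, CANDIDATES, 2)
print(table_k3, table_k2)
```

Because E is defined with select distinct, counting x4 within each (x1, x3) group counts distinct answers, which is the frequency of the query with that labeling.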
35
Equivalent queries
  • Queries Q1 and Q2 are equivalent if they have the
    same answer sets on all graphs G (up to renaming
    of the distinguished variables)
  • 2 cases of equivalent queries:
  • Q1 has fewer nodes than Q2
  • Q1 and Q2 have the same number of nodes

36
Equivalence theorem
Two queries are equivalent if and only if there
are containment mappings between them in both
directions.
  • A containment mapping from Q1 to Q2 is a
    homomorphism h: Q1 → Q2 that maps distinguished
    variables of Q1 one-to-one to distinguished
    variables of Q2, and maps parameters of Q1 to
    parameters of Q2, preserving labels.

37
Case 1: Q1 has fewer nodes than Q2
  • Redundancy lemma
  • Let Q be a tree query without selected nodes.
    Then Q has a redundancy if and only if it
    contains a subtree C in the form of a linear
    chain of ∃ nodes (possibly just a single node),
    such that the parent of C has another subtree
    that is at least as deep as C.

Redundant subtree
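The condition of the redundancy lemma translates fairly directly into a check on the tree. The following is a sketch under stated assumptions: the tree is a dict from node to list of children, existential nodes are given as a set, there are no parameters, and "depth" is read as the number of nodes on the longest downward path. The function names are hypothetical.

```python
def height(children, v):
    """Number of nodes on the longest downward path starting at v."""
    return 1 + max((height(children, c) for c in children.get(v, [])), default=0)


def is_existential_chain(children, existential, v):
    """True if the subtree rooted at v is a single path of existential nodes."""
    while True:
        if v not in existential:
            return False
        kids = children.get(v, [])
        if len(kids) > 1:
            return False
        if not kids:
            return True
        v = kids[0]


def has_redundancy(children, existential):
    """Look for a chain C of existential nodes whose parent has another
    subtree at least as deep as C."""
    for kids in children.values():
        for c in kids:
            if not is_existential_chain(children, existential, c):
                continue
            depth = height(children, c)
            if any(height(children, s) >= depth for s in kids if s != c):
                return True
    return False


# r has an existential leaf a and a distinguished leaf b: a can be folded onto b.
redundant = has_redundancy({"r": ["a", "b"]}, {"a"})
# The chain a -> a2 has depth 2 but the sibling subtree b only has depth 1.
not_redundant = has_redundancy({"r": ["a", "b"], "a": ["a2"]}, {"a", "a2"})
print(redundant, not_redundant)
```

In the first example the existential leaf adds no constraint beyond what the sibling already enforces, so dropping it gives an equivalent query with fewer nodes.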
38
Case 2: Q1 and Q2 have the same number of nodes
  • Q1 and Q2 must be isomorphic.
  • Canonical form of queries: refine the canonical
    ordering of the underlying unlabeled tree, taking
    into account node labels.

39
Association Mining Algorithm
  • Input: a graph G, minsup, a tree query Qleft
    frequent in G, minconf
  • Output: all tree queries Q such that Qleft ⇒ Q is
    a legal and confident association rule in G.

40
Containment mappings
  • For each tree query, generate all containment
    mappings from Qleft to Q, ignoring parameter
    assignments.

41
Instantiations
  • For each containment mapping, generate all
    parameter assignments such that Qleft ⇒ Q is
    frequent and confident.

42
Equivalent Association rules
  • Equivalence checking of association rules is as
    hard as general graph isomorphism testing.

43
Outline: rest of talk
  • Result management
  • Experimental results
  • Future work

44
Result management
  • Output: frequency tables stored in a relational
    database.
  • Browser

46
Experimental results Real-life datasets
  • Food web ??nodes????? ?edges?????

frequency 176
47
Experimental results Real-life datasets
  • Food web ??nodes????? ?edges?????

confidence 11%
48
Experimental results: Performance
  • Fully implemented on top of IBM DB2
  • Preliminary performance results
  • pattern mining algorithm:
  • adequate performance
  • huge number of patterns
  • constant overhead per discovered pattern
  • association mining algorithm:
  • very fast
  • constant overhead per discovered rule

49
Future work
  • Applications: scientific data mining
  • Loosen restriction to trees

50
References
  • Bart Goethals, Eveline Hoekx and Jan Van den
    Bussche, Mining Tree Queries in a Graph, in
    Proceedings of the eleventh ACM SIGKDD
    International conference on Knowledge Discovery
    and Data Mining, p 61-69, ACM Press 2005
  • Eveline Hoekx and Jan Van den Bussche, Mining for
    Tree-Query Associations in a Graph, to appear in
    Proceedings of the 2006 IEEE International
    Conference on Data Mining (ICDM 2006)