Relational Data Mining with Inductive Logic Programming for Link Discovery

Ray Mooney, Prem Melville, Rupert Tang, University of Texas at Austin

1
Relational Data Mining with Inductive Logic
Programming for Link Discovery
  • Ray Mooney, Prem Melville, Rupert Tang
  • University of Texas at Austin
  • Jude Shavlik, Inês de Castro Dutra, David Page,
    Vítor Santos Costa
  • University of Wisconsin at Madison

2
EELD Program
  • Evidence Extraction
  • Link Discovery
  • Pattern Learning

3
Link Discovery Task (from Jim Antonisse, GITI)
[Figure: link discovery task diagram. Labels: Domain Patterns; Pattern(s) of Interest; Link Discovery Core Pattern Matching; Evidence request(s); Alerts based on Hypothesized cases; Vetted hyp cases. Legend: pre-run-time processing, run-time processing.]
4
Link Discovery
  • Data is multi-relational with many people,
    places, objects and actions and numerous types of
    relations between them.
  • Link analysis in intelligence and criminology
    explores and visualizes such data as a graph
    with many nodes and edges of various types.
  • Link discovery entails finding new links and
    recognizing threatening patterns in such
    highly-relational data.

5
EELD Program
  • Evidence Extraction
  • Link Discovery
  • Pattern Learning

6
Pattern Learning for Link Discovery
  • Automated discovery of patterns of interest
    that indicate potentially threatening activities
    in large amounts of heterogeneous,
    multi-relational data.
  • Requires inducing multi-relational patterns that
    characterize multiple entities and multiple links
    between them.

7
Limitations of Traditional Data Mining
  • Traditional KDD methods assume the data to be
    mined is in a single relational table and that
    examples are flat tuples of attribute values.
  • This assumption stems from:
  • 1) Properties of typical data mining tasks
    such as market basket analysis.
  • 2) The focus in machine learning and statistics on
    classification or regression using feature
    vectors as inputs.

8
Relational Data Mining
  • Data contains multiple relations.
  • Patterns to be discovered contain multiple
    relations.
  • Knowledge to be discovered may be the definition
    of another relation rather than a classification
    or regression function.

9
Relational Data Mining Example
[Figure: family graph over Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred and Sue. Legend: Male and Female nodes; Married and Parent edges.]
Uncle(X,Y) :- Parent(Z,X), Parent(Z,W),
              Parent(W,Y), Male(X), not(X=W).
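
As a rough illustration of what the Uncle rule says, the sketch below evaluates it by brute force over a handful of ground facts. The Parent and Male facts are hypothetical (the edges of the original figure are not recoverable from the transcript), and the function name `uncles` is invented for this example.

```python
# Hypothetical facts: invented for illustration, NOT the edges of the slide's figure.
parent = {("Alice", "Bob"), ("Alice", "Mary"), ("Mary", "Tom")}  # Parent(P, Child)
male = {"Bob", "Tom"}


def uncles(parent, male):
    """Return every (X, Y) satisfying
    Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W)."""
    result = set()
    for z, x in parent:                  # Parent(Z, X)
        if x not in male:                # Male(X)
            continue
        for z2, w in parent:             # Parent(Z, W): same parent Z
            if z2 != z or w == x:        # not(X=W): W is a sibling, not X itself
                continue
            for w2, y in parent:         # Parent(W, Y)
                if w2 == w:
                    result.add((x, y))
    return result


print(uncles(parent, male))
```

The pattern needs three Parent tuples joined on shared variables, which is why it cannot be stated over a single flat table of attribute values.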
10
Relational Data Mining Example (cont)
[Figure: family graph over Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred and Sue. Legend: Male and Female nodes; Married and Parent edges.]
…, Male(X), not(Z=V).
11
Most KDD Ignores RDM
  • KDD textbooks barely mention RDM:
  • Han & Kamber, 2001
  • Hand, Mannila & Smyth, 2001
  • Witten & Frank, 1999
  • But there is a recent edited collection on RDM:
  • S. Džeroski & N. Lavrač, eds. Relational Data
    Mining, Springer-Verlag, 2001.

12
Inductive Logic Programming (ILP)
  • Standard formal language for representing
    relational knowledge is first-order predicate
    logic.
  • ILP studies the induction of hypotheses in
    first-order predicate logic.
  • Logic programs (e.g. Prolog) and function-free
    logic programs (e.g. Datalog) are a useful,
    reasonably tractable subset of first-order
    predicate logic.
  • ILP is the best-studied approach to
    relational data mining.

13
ILP Problem Definition
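  • Given:
  • Positive Example Set P
  • Negative Example Set N
  • Background Knowledge B
  • Find:
  • Hypothesis, H, such that
    $B \cup H \models p$ for every $p \in P$ (completeness), and
    $B \cup H \not\models n$ for every $n \in N$ (consistency).

P, N, B and H are all sets of rules in
first-order logic (i.e. Horn clauses, logic
programs).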
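
The Given/Find statement above can be checked mechanically for a candidate hypothesis: every positive example should follow from B together with H, and no negative example should. The sketch below is a minimal version of that check, assuming ground facts stored as tuples and a single non-recursive rule written as a Python function, so that "entailed" reduces to set membership. All names and facts here are hypothetical.

```python
def grandparent_hypothesis(facts):
    """Candidate H: grandparent(X,Z) :- parent(X,Y), parent(Y,Z)."""
    parents = [(a, b) for (rel, a, b) in facts if rel == "parent"]
    return {("grandparent", x, z)
            for (x, y) in parents
            for (y2, z) in parents
            if y == y2}


def check_hypothesis(hypothesis, background, positives, negatives):
    """Return (complete, consistent) for a non-recursive hypothesis.

    complete:   every positive example is derived from B and H
    consistent: no negative example is derived from B and H
    """
    derived = hypothesis(background)
    complete = all(p in derived for p in positives)
    consistent = not any(n in derived for n in negatives)
    return complete, consistent


# Hypothetical background knowledge B and example sets P and N.
background = {("parent", "ann", "bea"),
              ("parent", "bea", "cal"),
              ("parent", "bea", "dee")}
positives = {("grandparent", "ann", "cal"), ("grandparent", "ann", "dee")}
negatives = {("grandparent", "bea", "cal"), ("grandparent", "cal", "ann")}

print(check_hypothesis(grandparent_hypothesis, background, positives, negatives))
```

Real ILP systems test coverage with a theorem prover or Prolog engine rather than by materializing all derived facts; the set-based check here only works because the toy rule is non-recursive and the data is tiny.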
14
ILP Algorithms
  • We have used two ILP systems for EELD
    problems in link discovery:
  • Aleph (Srinivasan, 2001): a variant of the
    popular Progol algorithm (Muggleton, 1995)
  • mFoil (Tang and Mooney, 2002): a variant of the
    popular Foil algorithm (Quinlan, 1990)

15
EELD Russian Nuclear Smuggling Data
  • Data manually extracted from news sources about
    events related to nuclear smuggling (developed by
    Veridian Inc.)
  • Size of data set:
  • 40 relational tables
  • 2 to 800 tuples per relation
  • Translated Access database to Prolog, mapping
    each relational table to a predicate.
  • Used Aleph to learn rules for the relation
    Linked(A,B), which determines whether or not two
    events are part of the same incident.
  • 143 positive examples
  • 517 negative examples
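
The table-to-predicate mapping mentioned above can be sketched as follows: each row of a relational table becomes one ground Prolog fact whose predicate name is the table name. This is an illustrative sketch only, not the conversion actually used for the data set; the table name `event_person`, its columns, and the helper names are hypothetical.

```python
def prolog_term(value):
    """Render a Python value as a Prolog term: numbers bare, everything else a quoted atom."""
    if isinstance(value, (int, float)):
        return str(value)
    # Inside a quoted atom, double any backslash and any single quote.
    escaped = str(value).replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def table_to_facts(table_name, rows):
    """Map each row of a relational table to one Prolog fact string."""
    facts = []
    for row in rows:
        args = ", ".join(prolog_term(v) for v in row)
        facts.append(f"{table_name}({args}).")
    return facts


# Hypothetical table with columns (id, event, person).
rows = [(1, "event7", "person3"), (2, "event9", "O'Brien")]
for fact in table_to_facts("event_person", rows):
    print(fact)
```

Once every table is available as a predicate, the positive and negative Linked(A,B) examples can be written in the same form, and the learner can join across tables through shared variables.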

16
Illustration of Linked Relation
[Figure: New Event; Partial Incident N; Partial Incident M.]
17
Find Correct Incident for New Event
[Figure: Partial Incident M; Expanded Incident N.]
18
Sample Rule
linked(EventA,EventB) :-
    lk_event_material(_,EventA,_,_,_,ConcealmentG,DescH),
    lk_event_person(_,EventB,PersonD,_,C,C,_),
    lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
    lk_event_material(_,EvE,MatF,I,_,ConcealmentG,DescH),
    l_relations(I,_,"Stolen").
If A is linked to a specific type of material
<G,H>, and B is linked to a person linked to the
same specific type of material, through an event
in which that material was stolen, then events A
and B are linked.
19
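Linked(A,B)
[Figure: nodes A and B. Legend: Event, Material, Person.]
20
Linked(A,B)
[Figure: nodes A and B; Material Type <G,H>. Legend: Event, Material, Person.]
21
Linked(A,B)
[Figure: nodes A, B, D and E; Material Type <G,H> shown twice. Legend: Event, Material, Person.]
22
Linked(A,B)
[Figure: nodes A, B, D and E; edge labeled Stolen; Material Type <G,H> shown twice. Legend: Event, Material, Person.]
23
Linked(A,B)
[Figure: nodes A, B, D and E; edge labeled Stolen; Material Type <G,H> shown twice. Legend: Event, Material, Person.]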
24
Accuracy Results for Learning Linked for Nuclear
Smuggling Data
  • Experimental method: 5-fold cross-validation.
  • Also tried bagging Aleph to produce an ensemble
    of 25 hypotheses.

               Majority Class (not Linked) | Aleph | Bagged Aleph
Accuracy (%)   78                          | 83    | 86
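
Bagging, as used above to build an ensemble of 25 hypotheses, trains each hypothesis on a bootstrap resample of the training set and classifies by majority vote. The sketch below shows that procedure with a trivial stand-in learner; it does not call Aleph, and `majority_learner`, `bag`, `vote` and the toy training pairs are all hypothetical.

```python
import random
from collections import Counter


def majority_learner(sample):
    """Stand-in for an ILP learner: always predicts the most common label in its sample."""
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda example: label


def bag(train, learn, n_models=25, seed=0):
    """Train n_models hypotheses, each on a bootstrap resample of the training set."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(train) for _ in range(len(train))]  # sample with replacement
        models.append(learn(bootstrap))
    return models


def vote(models, example):
    """Classify an example by majority vote over the ensemble."""
    return Counter(m(example) for m in models).most_common(1)[0][0]


# Toy training set of (event pair, linked?) examples; values are invented.
train = [(("e1", "e2"), False)] * 8 + [(("e3", "e4"), True)] * 2
ensemble = bag(train, majority_learner)
print(len(ensemble), vote(ensemble, ("e5", "e6")))
```

With a real learner in place of `majority_learner`, the resampled training sets yield different rule sets, and voting tends to smooth out the errors of any single hypothesis.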
25
Synthetic Contract Killing Data
  • Data generated by a plan-based simulator that
    generates evidence emulating contract killings
    and other types of murders (developed by IET
    Inc.).
  • Simulator used to generate evidence from 200
    murder events of three types:
  • Murder for Hire (71 exs)
  • First Degree (75 exs)
  • Second Degree (54 exs)
  • Use mFoil to classify events into one of these
    three categories.

26
Sample Rules
  • MurderForHire(A) :-
      groupMemberMaleficiary(A, B),
      subEvents(A, C), crimeMotive(C, economic).
  • FirstDegreeMurder(A) :-
      subEvents(A, B), performedBy(B, C),
      loves(C, D).
  • SecondDegreeMurder(A) :-
      subEvents(A, B),
      eventOccursAtLocationType(B, publicProperty),
      crimeMotive(B, rival),
      occurrentSubeventType(B, stealing_Generic).

27
Results on Synthetic Contract Killing Data
                MurderForHire | FirstDegree | SecondDegree
PRECISION (%)   85.52         | 91.17       | 95.83
RECALL (%)      91.07         | 88.48       | 59.45

                Majority Class | mFOIL
ACCURACY (%)    37.50          | 76.67
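
For reference, per-class precision and recall of the kind reported above are computed from true and predicted labels as sketched below. The label lists are made up for illustration and are unrelated to the reported figures; the function name is hypothetical.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class, treating `target` as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented labels for five events.
true_labels = ["hire", "hire", "first", "second", "second"]
predicted = ["hire", "first", "first", "second", "first"]

for cls in ("hire", "first", "second"):
    print(cls, precision_recall(true_labels, predicted, cls))
```

A class can have high precision and low recall at the same time, as SecondDegree does in the table: the rules that fire for it are usually right, but they miss many true cases.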
28
Recent Result from EELD Challenge Problem
  • murder_for_hire(A) :-
      eventOccursAt(A,B), perpetrator(A,C),
      agentPhoneNumber(C,D), callerNumber(E,D),
      accountHolder(F,C), to_Generic(G,F),
      from_Generic(G,H), to_Generic(I,H).
  • Says an event is a murder for hire if it has a
    recorded location and perpetrator, we have a
    recorded phone call to the perpetrator, and there
    was a chain of bank transfers resulting in money
    reaching the perpetrator's account.
  • 100% accuracy on a held-out test set.
  • Similar pattern found manually by LD researchers
    working on this challenge problem.

29
Future Research
  • Scaling to larger datasets
  • Stochastic search
  • Logic program optimization
  • Integration with relational and deductive
    database technology.
  • Integrating probabilistic reasoning
  • Logic programs with Bayes-net constraints
  • Active Learning
  • Theory Refinement

30
Related Research
  • Graph-based Relational Data Mining
  • Subdue (Cook & Holder, UT Arlington)
  • Probabilistic Relational Models
  • PRMs (Koller, Stanford)
  • Relational Feature Construction
  • PROXIMITY (Jensen, UMass)

31
Record Linkage
  • Identify and merge duplicate field values and
    duplicate records in a database.
  • Applications
  • Duplicates in mailing lists
  • Merging multiple databases of stores,
    restaurants, etc.
  • Matching bibliographic references in research
    papers (Cora/ResearchIndex)
  • Identifying individuals who are trying to hide
    their identity by providing slightly erroneous
    personal information.

32
Record Linkage Examples
Author | Title | Venue | Address | Year
Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby | Information, prediction, and query by committee | Advances in Neural Information Processing System | San Mateo, CA | 1993
Freund, Y., Seung, H.S., Shamir, E. & Tishby, N. | Information, prediction, and query by committee | Advances in Neural Information Processing Systems | San Mateo, CA. |

Name | Address | City | Cuisine
Second Avenue Deli | 156 2nd Ave. at 10th | New York | Delicatessen
Second Avenue Deli | 156 Second Ave. | New York City | Delis
33
Trainable Record Linkage
  • MARLIN (Multiply Adaptive Record Linkage using
    INduction)
  • Learn parameterized similarity metrics for
    comparing each field.
  • Trainable edit-distance
  • Use EM to set edit-operation costs
  • Learn to combine multiple similarity metrics for
    each field to determine equivalence.
  • Use SVM to decide on duplicates
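
To make "trainable edit-distance" concrete, the sketch below computes an edit distance whose insertion, deletion and substitution costs are parameters. It is a plain weighted Levenshtein distance with costs set by hand; it is not MARLIN's actual metric, and the EM step that would learn the costs from labeled duplicate pairs is not shown. The function and parameter names are hypothetical.

```python
def weighted_edit_distance(a, b, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance from string a to string b with per-operation costs."""
    # prev[j] holds the cost of turning the first i-1 characters of a into b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]                      # delete all of a[:i] to reach ""
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else sub_cost)
            delete = prev[j] + del_cost            # drop ca
            insert = curr[j - 1] + ins_cost        # add cb
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


print(weighted_edit_distance("kitten", "sitting"))
print(weighted_edit_distance("156 2nd Ave.", "156 Second Ave.", ins_cost=0.5))
```

Lowering the cost of operations that commonly separate true duplicates (for example, inserting the letters that expand an abbreviation) pulls such pairs closer together, which is the effect that learning the costs is meant to achieve per field.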

34
MARLIN Record Linkage Framework
[Figure: MARLIN framework diagram. Labels: Trainable duplicate detector; Trainable similarity metrics.]

35
Conclusions
  • Pattern Learning for Link Discovery is an
    important application of data mining for
    counter-terrorism.
  • Learning for Link Discovery requires Relational
    Data Mining (RDM).
  • Other problem domains require RDM:
  • Bioinformatics
  • Web
  • Natural Language Understanding
  • RDM is an important next-generation KDD
    capability.