Relational Data Mining with Inductive Logic Programming for Link Discovery


1
Relational Data Mining with Inductive Logic
Programming for Link Discovery

Ray Mooney, Prem Melville, Rupert Tang
University of Texas at Austin
Jude Shavlik, Inês de Castro Dutra, David Page,
Vítor Santos Costa
University of Wisconsin at Madison

2
EELD Program

Evidence Extraction
Link Discovery
Pattern Learning

3
Link Discovery Task (from Jim Antonisse, GITI)
[Diagram. Labels: Vetted hypothesized cases; Evidence request(s); Link Discovery Core Pattern Matching; Alerts based on hypothesized cases; Pattern(s) of Interest; Domain Patterns. Legend: pre-run-time processing; run-time processing.]
4
Link Discovery

Data is multi-relational with many people,
places, objects and actions and numerous types of
relations between them.
Link analysis in intelligence and criminology
involves exploring and visualizing such data
as a graph with many nodes and edges of various
types.
Link discovery entails finding new links and
recognizing threatening patterns in such
highly-relational data.

5
EELD Program

Evidence Extraction
Link Discovery
Pattern Learning

6
Pattern Learning for Link Discovery

Automated discovery of patterns of interest
that indicate potentially threatening activities
in large amounts of heterogeneous,
multi-relational data.
Requires inducing multi-relational patterns that
characterize multiple entities and multiple links
between them.

7
Limitations of Traditional Data Mining

Traditional KDD methods assume the data to be
mined is in a single relational table and that
examples are flat tuples of attribute values.
This assumption stems from:
1) Properties of the typical data mining tasks
like market basket analysis.
2) Focus in machine learning and statistics on
classification or regression using feature
vectors as inputs.

8
Relational Data Mining

Data contains multiple relations.
Patterns to be discovered contain multiple
relations.
Knowledge to be discovered may be the definition
of another relation rather than a classification
or regression function.

9
Relational Data Mining Example
[Figure: family graph over Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred and Sue. Legend: Male and Female nodes; Married and Parent edges.]
Uncle(X,Y) :- Parent(Z,X), Parent(Z,W),
    Parent(W,Y), Male(X), not(X=W).
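A rule such as the Uncle definition on this slide can be read as a conjunctive query over the Parent and Male relations. The following is a minimal Python sketch of that reading. The names are borrowed from the slide, but the parent edges are assumptions made up for illustration, since the edges of the slide's graph are not recoverable from the transcript.

```python
# Hypothetical facts: names come from the slide, the edges are assumed.
male = {"Bob", "Jack", "Tom", "John", "Fred"}
parent = {                      # (parent, child)
    ("Alice", "Mary"), ("Bob", "Mary"),
    ("Alice", "Tom"), ("Bob", "Tom"),
    ("Mary", "Fred"), ("Jack", "Fred"),
}

def uncles(parent, male):
    """Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W)."""
    result = set()
    for z, x in parent:                 # Parent(Z,X)
        if x not in male:               # Male(X)
            continue
        for z2, w in parent:            # Parent(Z,W): W shares parent Z with X
            if z2 != z or w == x:       # not(X=W)
                continue
            for w2, y in parent:        # Parent(W,Y)
                if w2 == w:
                    result.add((x, y))
    return result

print(sorted(uncles(parent, male)))     # [('Tom', 'Fred')]
```

The nested loops mirror the literals of the rule body one by one; an ILP system searches for such a body rather than being handed it.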
10
Relational Data Mining Example (cont)
[Figure: family graph over Alice, Bob, Mary, Jack, Jane, Tom, Carol, John, Fred and Sue. Legend: Male and Female nodes; Married and Parent edges.]
(rule truncated) ..., Male(X), not(Z=V).
11
Most KDD Ignores RDM

KDD textbooks barely mention RDM:
Han & Kamber, 2001
Hand, Mannila, Smyth, 2001
Witten & Frank, 1999
But there is a recent edited collection on RDM:
S. Džeroski & N. Lavrač, eds. Relational Data
Mining, Springer-Verlag, 2001.

12
Inductive Logic Programming(ILP)

The standard formal language for representing
relational knowledge is first-order predicate
logic.
ILP studies the induction of hypotheses in
first-order predicate logic.
Logic programs (e.g. Prolog) and function-free
logic programs (e.g. Datalog) are useful,
reasonably tractable subsets of first-order
predicate logic.
ILP is the best-studied approach to
relational data mining.

13
ILP Problem Definition

Given
Positive Example Set P
Negative Example Set N
Background Knowledge B
Find
Hypothesis, H, such that
$\forall p \in P:\ B \cup H \models p$
$\forall n \in N:\ B \cup H \not\models n$
P, N, B and H are all sets of rules in
first-order logic (i.e. Horn clauses, logic
programs)
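In the usual ILP reading, a hypothesis H is acceptable when, together with the background knowledge B, it covers every positive example and no negative example. A minimal sketch of that check follows. It assumes a hypothesis is represented as a Python predicate over an example and the background facts; the grandparent task and all names are illustrative only.

```python
def is_solution(hypothesis, positives, negatives, background):
    """True if B and H together cover every positive and no negative example."""
    complete = all(hypothesis(e, background) for e in positives)
    consistent = not any(hypothesis(e, background) for e in negatives)
    return complete and consistent

# Illustrative background knowledge B: parent(X, Y) facts.
B = {("ann", "bea"), ("bea", "cal"), ("bea", "dan")}

def grandparent(example, parent):
    """H: grandparent(X,Y) :- parent(X,Z), parent(Z,Y)."""
    x, y = example
    return any((x, z) in parent and (z, y) in parent
               for _, z in parent)

P = [("ann", "cal"), ("ann", "dan")]   # positive examples
N = [("ann", "bea"), ("bea", "cal")]   # negative examples

print(is_solution(grandparent, P, N, B))   # True
```

An overly general hypothesis (one that accepts everything) passes the first condition and fails the second, which is why both example sets are needed.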
14
ILP Algorithms

We have utilized two ILP systems for EELD
problems in link discovery.
Aleph (Srinivasan, 2001): a variant of the
popular Progol algorithm (Muggleton, 1995).
mFoil (Tang and Mooney, 2002): a variant of the
popular Foil algorithm (Quinlan, 1990).

15
EELD Russian Nuclear Smuggling Data

Data manually extracted from news sources about
events related to nuclear smuggling (developed by
Veridian Inc.).
Size of data set:
40 relational tables
2 to 800 tuples per relation
Translated Access database to Prolog, mapping
each relational table to a predicate.
Used Aleph to learn rules for the relation
Linked(A,B), which determines whether or not two
events are part of the same incident.
143 positive examples
517 negative examples
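Mapping each relational table to a predicate amounts to emitting one ground fact per row. The sketch below illustrates that translation. It is not the project's converter: the table name `event_site`, its columns and its rows are hypothetical, and the quoting is deliberately simplified.

```python
def prolog_atom(value):
    """Render a Python value as a Prolog term: numbers bare, text single-quoted."""
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"

def table_to_facts(table_name, rows):
    """One fact per tuple: table_name(v1, v2, ...)."""
    facts = []
    for row in rows:
        args = ", ".join(prolog_atom(v) for v in row)
        facts.append(f"{table_name}({args}).")
    return facts

# Hypothetical table contents, for illustration only.
rows = [(1, "event_17", "Kiev"), (2, "event_42", "Minsk")]
for fact in table_to_facts("event_site", rows):
    print(fact)
# event_site(1, 'event_17', 'Kiev').
# event_site(2, 'event_42', 'Minsk').
```

With every table loaded this way, a learned rule body is simply a join across the resulting predicates.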

16
Illustration of Linked Relation
[Figure: a New Event shown with Partial Incident N and Partial Incident M.]
17
Find Correct Incident for New Event
[Figure: Partial Incident M and Expanded Incident N.]
18
Sample Rule
linked(EventA,EventB) :-
    lk_event_material(_,EventA,_,_,_,ConcealmentG,DescH),
    lk_event_person(_,EventB,PersonD,_,C,C,_),
    lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
    lk_event_material(_,EvE,MatF,I,_,ConcealmentG,DescH),
    l_relations(I,_,"Stolen").
If A is linked to a specific type of material
<G,H>, and B is linked to a person linked to the
same specific type of material, through an event
in which that material was stolen, then events A
and B are linked.
19
Linked(A,B)
[Figure, build step 1. Nodes: A, B. Legend: Event, Material, Person.]
20
Linked(A,B)
[Figure, build step 2. Nodes: A, B, Material Type GH. Legend: Event, Material, Person.]
21
Linked(A,B)
[Figure, build step 3. Nodes: A, B, E, D, two Material Type GH nodes. Legend: Event, Material, Person.]
22
Linked(A,B)
[Figure, build step 4. Nodes: A, B, E, D, two Material Type GH nodes; label: Stolen. Legend: Event, Material, Person.]
23
Linked(A,B)
[Figure, build step 5. Same labels as the previous step.]
24
Accuracy Results for Learning Linked for Nuclear
Smuggling Data

Experimental method: 5-fold cross-validation.
Also tried bagging Aleph to produce an ensemble
of 25 hypotheses.

Accuracy (%):
Majority Class (not Linked)   Aleph   Bagged Aleph
78                            83      86
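The majority-class figure follows from the class counts reported for this data set: 517 negative examples out of 143 + 517 = 660 is roughly 78%. The sketch below recomputes it and illustrates the bagging idea (bootstrap resamples, one hypothesis per resample, majority vote). The base learner is a stand-in threshold rule on a single number, not Aleph, and all function names and the toy training data are hypothetical.

```python
import random
from collections import Counter

# Majority-class baseline from the reported example counts.
positives, negatives = 143, 517
baseline = negatives / (positives + negatives)
print(f"majority-class accuracy: {baseline:.1%}")   # 78.3%

def learn_stump(sample):
    """Stand-in base learner: threshold halfway between the class means."""
    pos = [x for x, label in sample if label]
    neg = [x for x, label in sample if not label]
    if not pos or not neg:                  # resample contains a single class
        majority = len(pos) >= len(neg)
        return lambda x: majority
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x > threshold

def bagged(train, n_models=25, seed=0):
    """Train n_models hypotheses on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = [learn_stump(rng.choices(train, k=len(train)))
              for _ in range(n_models)]
    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]
    return predict

# Toy training data: (feature value, is_positive).
train = [(0.1, False), (0.3, False), (0.4, False), (0.7, True), (0.9, True)]
predict = bagged(train)
print(predict(0.95), predict(0.05))
```

An odd ensemble size, such as the 25 hypotheses mentioned on the slide, avoids tied votes in a two-class problem.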
25
Synthetic Contract Killing Data

Data generated by a plan-based simulator that
generates evidence emulating contract killings
and other types of murders (developed by IET
Inc.).
Simulator used to generate evidence from 200
murder events of three types:
Murder for Hire (71 exs)
First Degree (75 exs)
Second Degree (54 exs)
Use mFoil to classify events into one of these
three categories.

26
Sample Rules

Murder For Hire(A) :-
    groupMemberMaleficiary(A, B),
    subEvents(A, C), crimeMotive(C, economic).
First Degree Murder(A) :-
    subEvents(A, B), performedBy(B, C),
    loves(C,D).
Second Degree Murder(A) :-
    subEvents(A, B),
    eventOccursAtLocationType(B, publicProperty),
    crimeMotive(B, rival),
    occurrentSubeventType(B, stealing_Generic).

27
Results on Synthetic Contract Killing Data
                 MurderForHire   FirstDegree   SecondDegree
PRECISION (%)    85.52           91.17         95.83
RECALL (%)       91.07           88.48         59.45

                 Majority Class   mFOIL
ACCURACY (%)     37.50            76.67
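For reference, precision and recall are computed per class from true-positive, false-positive and false-negative counts, and the majority-class accuracy is the share of the largest class. With 75 First Degree events among 200, that share is 37.5%, matching the reported figure. A minimal sketch follows; the confusion counts in the last line are made up for illustration and are not the experiment's.

```python
def precision(tp, fp):
    """Fraction of events predicted as a class that truly belong to it."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of events of a class that were predicted as that class."""
    return tp / (tp + fn) if tp + fn else 0.0

def majority_class_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Class sizes reported for the simulated data.
counts = {"MurderForHire": 71, "FirstDegree": 75, "SecondDegree": 54}
print(f"{majority_class_accuracy(counts):.2%}")    # 37.50%

# Hypothetical confusion counts for one class, for illustration only.
print(precision(tp=9, fp=1), recall(tp=9, fn=3))   # 0.9 0.75
```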
28
Recent Result from EELD Challenge Problem

murder_for_hire(A) :-
    eventOccursAt(A,B), perpetrator(A,C),
    agentPhoneNumber(C,D), callerNumber(E,D),
    accountHolder(F,C), to_Generic(G,F),
    from_Generic(G,H), to_Generic(I,H).
Says an event is a murder for hire if it has a
recorded location and perpetrator, we have a
recorded phone call to the perpetrator, and there
was a chain of bank transfers resulting in money
reaching the perpetrator's account.
100% accuracy on a held-out test set.
Similar pattern found manually by LD researchers
working on this challenge problem.

29
Future Research

Scaling to larger datasets
Stochastic search
Logic program optimization
Integration with relational and deductive
database technology.
Integrating probabilistic reasoning
Logic programs with Bayes-net constraints
Active Learning
Theory Refinement

30
Related Research

Graph-based Relational Data Mining
Subdue (Cook & Holder, UT Arlington)
Probabilistic Relational Models
PRMs (Koller, Stanford)
Relational Feature Construction
PROXIMITY (Jensen, UMass)

31
Record Linkage

Identify and merge duplicate field values and
duplicate records in a database.
Applications
Duplicates in mailing lists
Merging multiple databases of stores,
restaurants, etc.
Matching bibliographic references in research
papers (Cora/ResearchIndex)
Identifying individuals who are trying to hide
their identity by providing slightly erroneous
personal information.

32
Record Linkage Examples
Bibliographic references (fields: Author | Title | Venue | Address | Year)
Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby | Information, prediction, and query by committee | Advances in Neural Information Processing System | San Mateo, CA | 1993
Freund, Y., Seung, H.S., Shamir, E. & Tishby, N. | Information, prediction, and query by committee | Advances in Neural Information Processing Systems | San Mateo, CA. | (no year)

Restaurants (fields: Name | Address | City | Cuisine)
Second Avenue Deli | 156 2nd Ave. at 10th | New York | Delicatessen
Second Avenue Deli | 156 Second Ave. | New York City | Delis
33
Trainable Record Linkage

MARLIN (Multiply Adaptive Record Linkage using
INduction)
Learn parameterized similarity metrics for
comparing each field.
Trainable edit-distance
Use EM to set edit-operation costs
Learn to combine multiple similarity metrics for
each field to determine equivalence.
Use SVM to decide on duplicates
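To make the trainable edit-distance idea concrete, here is a minimal sketch of an edit distance whose operation costs are parameters rather than fixed at 1. According to the slide, MARLIN sets such costs with EM and feeds the per-field similarities to an SVM. Neither training step is shown here, and the function name and cost values below are hypothetical.

```python
def weighted_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Levenshtein distance with tunable insertion, deletion and substitution costs."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + dele,                            # delete ca
                curr[j - 1] + ins,                         # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(weighted_edit_distance("kitten", "sitting"))   # 3.0

# Unit costs versus hypothetical costs that make insertions cheap,
# as one might want when one field value abbreviates the other.
print(weighted_edit_distance("2nd Ave.", "Second Ave."))
print(weighted_edit_distance("2nd Ave.", "Second Ave.", ins=0.2))
```

Lowering the insertion cost shrinks the distance between an abbreviated value and its expansion, which is the kind of domain-specific adjustment that learned costs are meant to capture.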

34
MARLIN Record Linkage Framework
[Diagram. Components: trainable duplicate detector; trainable similarity metrics.]

35
Conclusions

Pattern Learning for Link Discovery is an
important application of data mining for
counter-terrorism.
Learning for Link Discovery requires Relational
Data Mining (RDM).
Other problem domains also require RDM:
Bioinformatics
Web
Natural Language Understanding
RDM is an important next-generation KDD
capability.