Title: Relational Data Mining with Inductive Logic Programming for Link Discovery
1Relational Data Mining with Inductive Logic
Programming for Link Discovery
- Ray Mooney, Prem Melville, Rupert Tang
- University of Texas at Austin
- Jude Shavlik, Inês de Castro Dutra, David Page,
Vítor Santos Costa - University of Wisconsin at Madison
2EELD Program
- Evidence Extraction
- Link Discovery
- Pattern Learning
3Link Discovery Task(from Jim Antonisse, GITI)
Vetted hyp cases
Evidence request(s)
Link Discovery Core Pattern Matching
Alerts based on Hypothesized cases
Pattern(s) of Interest
Domain Patterns
Legend pre-run-time processing
run-time processing
4Link Discovery
- Data is multi-relational with many people,
places, objects and actions and numerous types of
relations between them. - Link analysis in intelligence and criminology
investigates exploring and visualizing such data
as a graph with many nodes and edges of various
types. - Link discovery entails finding new links and
recognizing threatening patterns in such
highly-relational data.
5EELD Program
- Evidence Extraction
- Link Discovery
- Pattern Learning
6Pattern Learning for Link Discovery
- Automated discovery of patterns of interest
that indicate potentially threatening activities
in large amounts of heterogeneous,
multi-relational data. - Requires inducing multi-relational patterns that
characterize multiple entities and multiple links
between them.
7Limitations of Traditional Data Mining
- Traditional KDD methods assume the data to be
mined is in a single relational table and that
examples are flat tuples of attribute values. - This assumption stems from
- 1) Properties of the typical data mining tasks
like market basket analysis. - 2) Focus in machine learning and statistics on
classification or regression using feature
vectors as inputs.
8Relational Data Mining
- Data contains multiple relations.
- Patterns to be discovered contain multiple
relations. - Knowledge to be discovered may be the definition
of another relation rather than a classification
or regression function.
9Relational Data Mining Example
Male Female
Alice
Bob
Married
Mary
Jack
Jane
Tom
Parent
Carol
John
Fred
Sue
, Male(X), not(XW).
Uncle(X,Y) - Parent(Z,X), Parent(Z,W),
Parent(W,Y)
10Relational Data Mining Example (cont)
Male Female
Alice
Bob
Married
Mary
Jack
Jane
Tom
Parent
Carol
John
Fred
Sue
, Male(X), not(ZV).
11Most KDD Ignores RDM
- KDD textbooks barely mention RDM
- Han Kamber, 2001
- Hand, Mannila, Smyth, 2001
- Witten Frank, 1999
- But there is a recent edited collection on RDM
- S. Džeroski N. Lavrac, eds. Relational Data
Mining, Springer Verlag, 2001.
12Inductive Logic Programming(ILP)
- Standard formal language for representing
relational knowledge is first-order predicate
logic. - ILP studies the induction of hypotheses in
first-order predicate logic. - Logic programs (e.g. Prolog) or function-free
logic programs (e.g. Datalog), are a useful,
reasonably-tractable subset of first-order
predicate logic. - ILP is the most well-studied approach to
relational data mining.
13ILP Problem Definition
- Given
- Positive Example Set P
- Negative Example Set N
- Background Knowledge B
- Find
- Hypothesis, H, such that
-
-
P, N, B and H are all sets of rules in
first-order logic (i.e. Horn clauses, logic
programs)
14ILP Algorithms
- We have utilized two ILP systems for EELD
problems in link discovery. - Aleph (Srinivasan, 2001) A variant of the
popular Progol algorithm (Muggleton, 1995) - mFoil (Tang and Mooney, 2002) A variant of the
popular Foil algorithm (Quinlan, 1990)
15EELD Russian Nuclear Smuggling Data
- Data manually extracted from new sources about
events related to nuclear smuggling (developed by
Veridian Inc.) - Size of data set
- 40 relational tables
- 2 to 800 tuples per relation
- Translated Access database to Prolog, mapping
each relational table to a predicate. - Used Aleph to learn rules for the relation
Linked(A,B)which determines whether or not two
events are part of the same incident. - 143 positive examples
- 517 negative examples
16 Illustration of Linked Relation
New Event
Partial Incident N
Partial Incident M
17 Find Correct Incident for New Event
Partial Incident M
Expanded Incident N
18Sample Rule
linked(EventA,EventB) - lk_event_material(_,Eve
ntA,_,_,_, ConcealmentG,DescH),
lk_event_person(_,EventB,PersonD,_,C,C,_),
lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
lk_event_material(_,EvE,MatF,I,_,
ConcealmentG,DescH), l_relations(I,_,"Stolen").
If A is linked to a specific type of material
ltG,Hgt, and B is linked to a person linked to the
same specific type of material, through an event
in which that material was stolen, then events A
and B are linked.
19Linked(A,B)
B
A
Event Material Person
20Linked(A,B)
B
A
Material Type GH
Event Material Person
21Linked(A,B)
B
A
E
D
Material Type GH
Material Type GH
Event Material Person
22Linked(A,B)
B
A
E
D
Stolen
Material Type GH
Material Type GH
Event Material Person
23Linked(A,B)
B
A
E
D
Stolen
Material Type GH
Material Type GH
Event Material Person
24Accuracy Results for Learning Linkedfor Nuclear
Smuggling Data
- Experimental Method 5-fold cross validation.
- Also tried bagging Aleph to produce an ensemble
of 25 hypotheses.
Majority Class (not Linked) Aleph Bagged Aleph
78 83 86
25Synthetic Contract Killing Data
- Data generated by a plan-based simulator that
generates evidence emulating contract killings
and other types of murders (developed by IET
Inc.). - Simulator used to generate evidence from 200
murder events of three types - Murder for Hire (71 exs)
- First Degree (75 exs)
- Second Degree (54 exs)
- Use mFoil to classify events into one of these
three categories.
26Sample Rules
- Murder For Hire(A)-
- groupMemberMaleficiary(A, B),
- subEvents(A, C), crimeMotive(C, economic).
- First Degree Murder(A)-
- subEvents(A, B), performedBy(B, C),
loves(C,D). - Second Degree Murder(A)-
- subEvents(A, B), eventOccursAtLocationType
(B,publicProperty), crimeMotive(B, rival),
occurrentSubeventType(B, stealing_Generic).
27Results on Synthetic Contract Killing Data
MurderForHire FirstDegree SecondDegree
PRECISION 85.52 91.17 95.83
RECALL 91.07 88.48 59.45
Majority Class mFOIL
ACCURACY 37.50 76.67
28Recent Result from EELD Challenge Problem
- murder_for_hire(A) -
- eventOccursAt(A,B), perpetrator(A,C),
- agentPhoneNumber(C,D),callerNumber(E,D),
- accountHolder(F,C), to_Generic(G,F),
- from_Generic(G,H), to_Generic(I,H).
- Says an event is a murder for hire if it has a
recorded location and perpetrator, we have a
recorded phone call to the perpetrator, and there
was a chain of bank transfers resulting in money
reaching the perpetrators account. - 100 accuracy on a held-out test set.
- Similar pattern found manually by LD researchers
working on this challenge problem.
29Future Research
- Scaling to larger datasets
- Stochastic search
- Logic program optimization
- Integration with relational and deductive
database technology. - Integrating probabilistic reasoning
- Logic programs with Bayes-net constraints
- Active Learning
- Theory Refinement
30Related Research
- Graph-based Relational Data Mining
- Subdue (Cook Holder, UT Arlington)
- Probabilistic Relational Models
- PRMs (Koller, Stanford)
- Relational Feature Construction
- PROXIMITY (Jensen, UMass)
31Record Linkage
- Identify and merge duplicate field values and
duplicate records in a database. - Applications
- Duplicates in mailing lists
- Merging multiple databases of stores,
restaurants, etc. - Matching bibliographic references in research
papers (Cora/ResearchIndex) - Identifying individuals who are trying to hide
their identity by providing slightly erroneous
personal information.
32Record Linkage Examples
Author Title
Venue Address Year
Yoav Freund, H. Sebastian Seung, Eli Shamir, and Naftali Tishby Information, prediction, and query by committee Advances in Neural Information Processing System San Mateo, CA 1993
Freund, Y., Seung, H.S., Shamir, E. Tishby, N. Information, prediction, and query by committee Advances in Neural Information Processing Systems San Mateo, CA.
Name Address City
Cusine
Second Avenue Deli 156 2nd Ave. at 10th New York Delicatessen
Second Avenue Deli 156 Second Ave. New York City Delis
33Trainable Record Linkage
- MARLIN (Multiply Adaptive Record Linkage using
INduction) - Learn parameterized similarity metrics for
comparing each field. - Trainable edit-distance
- Use EM to set edit-operation costs
- Learn to combine multiple similarity metrics for
each field to determine equivalence. - Use SVM to decide on duplicates
34MARLIN Record Linkage Framework
Trainable duplicate detector
Trainable similarity metrics
35Conclusions
- Pattern Learning for Link Discovery is an
important application of data mining for
counter-terrorism. - Learning for Link Discovery requires Relational
Data Mining (RDM). - Other problem domains require RDM
- Bioinformatics
- Web
- Natural Language Understanding
- RDM is an important next-generation KDD
capability.