Statistical Learning from Relational Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Statistical Learning from Relational Data


1
Statistical Learning from Relational Data
  • Daphne Koller
  • Stanford University
  • Joint work with many many people

2
Relational Data is Everywhere
  • The web
  • Webpages (& the entities they represent),
    hyperlinks
  • Social networks
  • People, institutions, friendship links
  • Biological data
  • Genes, proteins, interactions, regulation
  • Bibliometrics
  • Papers, authors, journals, citations
  • Corporate databases
  • Customers, products, transactions

3
Relational Data is Different
  • Data instances not independent
  • Topics of linked webpages are correlated
  • Data instances are not identically distributed
  • Heterogeneous instances (papers, authors)

No IID assumption!
This is a good thing!
4
New Learning Tasks
  • Collective classification of related instances
  • Labeling an entire website of related webpages
  • Relational clustering
  • Finding coherent clusters in the genome
  • Link prediction & classification
  • Predicting when two people are likely to be
    friends
  • Pattern detection in network of related objects
  • Finding groups (research groups, terrorist groups)

5
Probabilistic Models
  • Uncertainty model:
  • a space of possible worlds
  • a probability distribution over this space
  • Worlds often defined via a set of state
    variables
  • medical diagnosis: diseases, symptoms, findings, ...
  • each world: an assignment of values to variables
  • Number of worlds is exponential in # of vars
  • 2^n if we have n binary variables
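A minimal Python sketch of that blow-up (toy binary variables assumed):

```python
from itertools import product

# The state space blows up exponentially: n binary variables define 2**n
# complete assignments ("worlds").
n = 20
print(2 ** n)  # 1,048,576 worlds for just 20 binary variables

# Enumerating them is already slow here and hopeless for realistic n:
worlds = product([0, 1], repeat=n)
print(next(worlds))  # one world: a full assignment of values to variables
```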

6
Outline
  • Relational Bayesian networks
  • Relational Markov networks
  • Collective Classification
  • Relational clustering

with Avi Pfeffer, Nir Friedman, Lise Getoor
7
Bayesian Networks
[BN diagram over Difficulty, Intelligence, Grade, SAT, Letter, Job]
nodes = variables, edges = direct influence
Graph structure encodes independence
assumptions: Letter is conditionally independent
of Intelligence given Grade
8
BN semantics
conditional independencies in BN structure
+ local probability models
⇒ full joint distribution over domain:
P(D, I, G, S, L) = P(D) P(I) P(G | D, I) P(S | I) P(L | G)

  • Compact & natural representation
  • nodes have ≤ k parents ⇒ 2^k·n vs. 2^n params
  • parameters natural and easy to elicit

9
Reasoning using BNs
[BN diagram over Difficulty, Intelligence, Grade, SAT, Letter, with evidence nodes shown]
Full joint distribution specifies answer to any
query: P(variable | evidence about others)
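To make this concrete, a minimal sketch of brute-force query answering in the toy student network; the network shape follows the slides, but all CPD numbers are invented:

```python
from itertools import product

# Toy student BN: D -> G <- I, I -> S, G -> L.
# All CPD numbers are invented for illustration.
P_D = {0: 0.6, 1: 0.4}                      # P(Difficulty = d)
P_I = {0: 0.7, 1: 0.3}                      # P(Intelligence = i)
P_G1 = {(0, 0): 0.3, (0, 1): 0.9,           # P(Grade = 1 | d, i)
        (1, 0): 0.05, (1, 1): 0.5}
P_S1 = {0: 0.05, 1: 0.8}                    # P(SAT = 1 | i)
P_L1 = {0: 0.1, 1: 0.9}                     # P(Letter = 1 | g)

def joint(d, i, g, s, l):
    """Chain rule: product of the local CPDs."""
    pg = P_G1[d, i] if g else 1 - P_G1[d, i]
    ps = P_S1[i] if s else 1 - P_S1[i]
    pl = P_L1[g] if l else 1 - P_L1[g]
    return P_D[d] * P_I[i] * pg * ps * pl

# P(Intelligence = 1 | Letter = 1): sum the joint over the other variables.
num = sum(joint(d, 1, g, s, 1) for d, g, s in product([0, 1], repeat=3))
den = sum(joint(d, i, g, s, 1) for d, i, g, s in product([0, 1], repeat=4))
print(num / den)
```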
10
Bayesian Networks Problem
  • Bayesian nets use propositional representation
  • Real world has objects, related to each other

[Diagram: multiple registrations, each with its own Difficulty and Grade (A, C), all sharing the student's Intelligence]
These instances are not independent
11
Relational Schema
  • Specifies types of objects in domain, attributes
    of each type of object & types of relations
    between objects

[Schema diagram. Classes: Student (attribute Intelligence), Professor (attribute Teaching-Ability), Course (attribute Difficulty). Relations: Teach, Take, In]
12
St. Nordaf University
World ω
[Diagram: Prof. Smith and Prof. Jones teach courses (e.g. CS101); George and Jane are registered in courses; each registration carries Grade and Satisfaction attributes]
13
Relational Bayesian Networks
  • Universals: probabilistic patterns hold for all
    objects in class
  • Locality: represent direct probabilistic
    dependencies
  • Links define potential interactions

K. & Pfeffer; Poole; Ngo & Haddawy
14
RBN Semantics
  • Ground model:
  • variables = attributes of all objects
  • dependencies determined by
    relational links + template model

[Diagram: ground network over Prof. Smith, Prof. Jones, George, Jane, and course CS101]
15
The Web of Influence
[Diagram: evidence flows through shared variables; course difficulty (easy/hard), student intelligence (low/high)]
16
Likelihood Function
  • Likelihood of a BN with shared parameters
  • Joint likelihood is a product of likelihood terms
  • One for each attribute X.A and its family
  • For each X.A, the likelihood function aggregates
    counts from all occurrences x.A in world ω

Friedman, Getoor, K., Pfeffer, 1999
17
Likelihood Function: Multinomials
Log-likelihood:
ℓ(θ : D) = Σ_{X.A} Σ_{u,v} C_{X.A}[u, v] · log θ_{X.A = v | u}
Sufficient statistics: C_{X.A}[u, v] = number of objects x of class X with x.A = v and parent assignment u, aggregated over the whole world
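A minimal sketch of how these aggregated sufficient statistics and the MLE look in code (toy registrations and attribute values are assumptions):

```python
from collections import Counter

# Aggregated sufficient statistics for a shared (template) multinomial CPD:
# every registration contributes a count to the same Grade CPD.
registrations = [("easy", "high", "A"), ("easy", "high", "A"),
                 ("easy", "low", "B"), ("hard", "high", "B"),
                 ("hard", "low", "C"), ("hard", "low", "B")]

counts, totals = Counter(), Counter()
for difficulty, intelligence, grade in registrations:
    u = (difficulty, intelligence)          # parent assignment
    counts[u, grade] += 1                   # C[u, v]
    totals[u] += 1                          # C[u]

# MLE for the template parameters: theta[v | u] = C[u, v] / C[u]
theta = {(u, v): c / totals[u] for (u, v), c in counts.items()}
print(theta[("easy", "high"), "A"])         # 1.0 on this toy data
```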
18
RBN Parameter Estimation
  • MLE parameters
  • Bayesian estimation
  • Prior for each attribute X.A
  • Posterior uses aggregated sufficient statistics

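And a hedged sketch of the Bayesian variant, assuming a symmetric Dirichlet prior over the K grade values:

```python
# Bayesian estimate (sketch): with a symmetric Dirichlet(alpha) prior, the
# posterior mean adds pseudo-counts to the aggregated statistics:
#   theta[v | u] = (alpha + C[u, v]) / (K * alpha + C[u])
alpha, K = 1.0, 3                 # K = number of grade values (assumed)

def posterior_mean(c_uv, c_u):
    return (alpha + c_uv) / (K * alpha + c_u)

print(posterior_mean(2, 2))       # 0.6: softer than the MLE's 1.0
```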
19
Learning RBN Structure
  • Define set of legal RBN structures
  • Ones with legal class dependency graphs
  • Define scoring function: Bayesian score
  • Product of family scores
  • One for each X.A
  • Uses aggregated sufficient statistics
  • Search for high-scoring legal structure

Friedman, Getoor, K., Pfeffer, 1999
20
Learning RBN Structure
  • All operations done at class level
  • Dependency structure: parents for X.A
  • Acyclicity checked using class dependency graph
  • Score computed at class level
  • Individual objects only contribute to sufficient
    statistics
  • Can be obtained efficiently using standard DB
    queries
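For instance, one GROUP BY suffices to collect a family's counts; a sketch with assumed table and column names:

```python
import sqlite3

# Sketch: aggregated sufficient statistics via a standard DB query.
# One GROUP BY over the registration table yields all counts needed to
# score the Grade family.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE reg (difficulty TEXT, intelligence TEXT, grade TEXT)")
con.executemany("INSERT INTO reg VALUES (?, ?, ?)",
                [("easy", "high", "A"), ("easy", "high", "A"),
                 ("hard", "low", "C")])
query = ("SELECT difficulty, intelligence, grade, COUNT(*) "
         "FROM reg GROUP BY difficulty, intelligence, grade")
for row in con.execute(query):
    print(row)   # e.g. ('easy', 'high', 'A', 2)
```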

21
Outline
  • Relational Bayesian networks
  • Relational Markov networks
  • Collective Classification
  • Relational clustering

with Avi Pfeffer, Nir Friedman, Lise Getoor
with Ben Taskar, Pieter Abbeel
22
Why Undirected Models?
  • Symmetric, non-causal interactions
  • E.g., web: categories of linked pages are
    correlated
  • Cannot introduce direct edges because of cycles
  • Patterns involving multiple entities
  • E.g., web: triangle patterns
  • Directed edges not appropriate
  • "Solution": impose arbitrary direction
  • Not clear how to parameterize CPD for variables
    involved in multiple interactions
  • Very difficult within a class-based
    parameterization

Taskar, Abbeel, K. 2001
23
Markov Networks
  • A Markov network is an undirected graph over some
    set of variables V
  • Graph associated with a set of potentials φ_i
  • Each potential is a factor over a subset V_i
  • Variables in V_i must be a (sub)clique in the network
  • Joint distribution: P(V) = (1/Z) Π_i φ_i(V_i)
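A tiny numeric sketch of these definitions, with invented potential tables:

```python
from itertools import product

# Toy Markov network over binary A, B, C with two clique potentials.
phi_AB = {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 5.0}  # favors A == B
phi_BC = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}  # favors B == C

def p_tilde(a, b, c):
    return phi_AB[a, b] * phi_BC[b, c]      # unnormalized measure

Z = sum(p_tilde(a, b, c) for a, b, c in product([0, 1], repeat=3))
print(p_tilde(1, 1, 1) / Z)                 # P(A=1, B=1, C=1)
```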

24
Markov Networks
[Diagram: Markov network over people James, Mary, Kyle, Noah, Laura]
25
Relational Markov Networks
  • Universals: probabilistic patterns hold for all
    groups of objects
  • Locality: represent local probabilistic
    dependencies
  • Sets of links give us possible interactions

26
RMN Semantics
[Diagram: ground RMN. George, Jane, and Jill each have Intelligence and Grade variables; the Geo Study Group and CS Study Group tie together the grades of their members in CS101]
27
Outline
  • Relational Bayesian Networks
  • Relational Markov Networks
  • Collective Classification
  • Discriminative training
  • Web page classification
  • Link prediction
  • Relational clustering

with Ben Taskar, Carlos Guestrin, Ming Fai
Wong, Pieter Abbeel
28
Collective Classification
[Pipeline: Training Data (features ω.x, labels ω.y) + Model Structure → Learning → Probabilistic Relational Model; New Data (features ω.x) → Inference → Conclusions (labels ω.y)]
Example:
  • Train on one year of student intelligence, course
    difficulty, and grades
  • Given only grades in following year, predict all
    students' intelligence

29
Learning RMN Parameters
Parameterize potentials as log-linear model:
template potential φ(V_i) = exp{Σ_j w_j · f_j(V_i)}, with a single weight vector w shared across all groundings
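A minimal sketch of such a template potential; the feature names and weights are assumptions, not the paper's:

```python
import math

# Log-linear template potential: phi(v) = exp(sum_j w_j * f_j(v)).
# A single shared weight vector parameterizes every grounding.
w = {"same_category": 1.5, "anchor_match": 0.8}

def features(label_a, label_b, anchor_matches):
    return {"same_category": float(label_a == label_b),
            "anchor_match": float(anchor_matches)}

def phi(label_a, label_b, anchor_matches):
    f = features(label_a, label_b, anchor_matches)
    return math.exp(sum(w[k] * f[k] for k in w))

print(phi("faculty", "faculty", True))   # exp(1.5 + 0.8) ~ 9.97
```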
30
Max Likelihood Estimation
We don't care about the joint distribution
P(ω.x, ω.y)
Estimation: maximize_w log P(ω.x, ω.y : w)
Classification: argmax_y P(y | ω.x)
31
Web → KB
Craven et al.
32
Web Classification Experiments
  • WebKB dataset
  • Four CS department websites
  • Bag of words on each page
  • Links between pages
  • Anchor text for links
  • Experimental setup
  • Trained on three universities
  • Tested on fourth
  • Repeated for all four combinations

33
Standard Classification
Categories: faculty, course, project, student, other
[Page diagram with words: professor, department, extract, information, computer, science, machine, learning]
34
Standard Classification
[Page diagram: "working with Tom Mitchell ..."]
Discriminatively trained naïve Markov = Logistic Regression
[Chart: test set error; 4-fold CV, trained on 3 universities, tested on 4th]
35
Power of Context
Professor?
Student?
Post-doc?
37
Collective Classification
Classify all pages collectively, maximizing the
joint label probability
[Chart: test set error]
Taskar, Abbeel, K., 2002
38
More Complex Structure
40
Collective Classification Results
35.4% error reduction over logistic
[Chart: test set error for Logistic, Links, Section, and Link+Section models]
Taskar, Abbeel, K., 2002
41
Max Conditional Likelihood
We don't care about the conditional
distribution P(ω.y | ω.x)
Estimation: maximize_w log P(ω.y | ω.x, w)
Classification: argmax_y P(y | ω.x)
42
Max Margin Estimation
What we really want: correct class labels
Quadratic program: maximize the margin, requiring it to scale with the number of labeling mistakes in y
Exponentially many constraints!
Taskar, Guestrin, K., 2003 (see also
Collins, 2002; Hofmann, 2003)
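A hedged reconstruction of the QP this slide alludes to, roughly in M3N notation (symbols assumed):

```latex
\min_{\mathbf{w}} \; \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{s.t.} \quad
\mathbf{w}^{\top}\mathbf{f}(\mathbf{x},\mathbf{y}^{*}) \;\ge\;
\mathbf{w}^{\top}\mathbf{f}(\mathbf{x},\mathbf{y}) + \Delta(\mathbf{y}^{*},\mathbf{y})
\qquad \forall \mathbf{y}
```

where Δ(y*, y) counts the labeling mistakes in y, giving one constraint per joint labeling (exponentially many).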
43
Max Margin Markov Networks
  • We use structure of Markov network to provide
    equivalent formulation of QP
  • Exponential only in treewidth of network
  • Complexity ≈ that of max-likelihood classification
  • Can solve approximately in networks where induced
    width is too large
  • Analogous to loopy belief propagation
  • Can use kernel-based features!
  • SVMs meet graphical models

Taskar, Guestrin, K., 2003
44
WebKB Revisited
16.1% relative reduction in error relative to
cond. likelihood RMNs
45
Predicting Relationships
[Diagram: Tom Mitchell (Professor), Sean Slattery (Student), and the WebKB Project, connected by relationship links]
  • Even more interesting: relationships between
    objects

46
Predicting Relations
  • Introduce exists/type attribute for each
    potential link
  • Learn discriminative model for this attribute
  • Collectively predict its value in new world
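A minimal sketch of the setup, with invented pages and links: every candidate pair becomes a training example whose label is the Exists attribute:

```python
# Link prediction as attribute prediction: introduce a binary Exists
# variable for every candidate (from_page, to_page) pair; a model is then
# trained on pair features to predict it.
pages = {"p1": "faculty", "p2": "student", "p3": "course"}
links = {("p2", "p1")}                       # observed link

candidates = [(a, b) for a in pages for b in pages if a != b]
examples = [{"from_cat": pages[a], "to_cat": pages[b],
             "exists": int((a, b) in links)} for a, b in candidates]
print(examples[0])
```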

72.9% error reduction over flat
[Model diagram: From-Page and To-Page, each with Category and words Word1 ... WordN; a Relation variable with Exists/Type attribute and link words LinkWord1 ... LinkWordN]
Taskar, Wong, Abbeel, K., 2003
47
Outline
  • Relational Bayesian Networks
  • Relational Markov Networks
  • Collective Classification
  • Relational clustering
  • Movie data
  • Biological data

with Ben Taskar, Eran Segal
with Eran Segal, Nir Friedman, Aviv Regev, Dana
Pe'er, Haidong Wang, Micha Shapira, David
Botstein
48
Relational Clustering
[Pipeline: Unlabeled Relational Data + Model Structure → Learning → Probabilistic Relational Model + Clustering of instances]
Example:
  • Given only students' grades, cluster similar
    students

49
Learning w. Missing Data: EM
  • EM Algorithm applies essentially unchanged
  • E-step computes expected sufficient statistics,
    aggregated over all objects in class
  • M-step uses ML (or MAP) parameter estimation
  • Key difference
  • In general, the hidden variables are not
    independent
  • Computation of expected sufficient statistics
    requires inference over entire network
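A deliberately decoupled EM sketch for intuition (toy data; in a real RBN the E-step requires inference over the whole ground network, as the slide notes):

```python
import random

# Minimal EM sketch for clustering with a shared template CPD: each
# student has a hidden cluster; grades follow a per-cluster distribution.
random.seed(0)
grades = ["A", "B", "C"]
data = [random.choice(grades) for _ in range(100)]   # invented observations
K = 2
pi = [0.5, 0.5]
theta = [{"A": 0.5, "B": 0.3, "C": 0.2}, {"A": 0.2, "B": 0.3, "C": 0.5}]

for _ in range(20):
    # E-step: expected sufficient statistics, aggregated over all objects
    counts = [dict.fromkeys(grades, 1e-9) for _ in range(K)]
    totals = [1e-9] * K
    for g in data:
        post = [pi[k] * theta[k][g] for k in range(K)]
        z = sum(post)
        for k in range(K):
            counts[k][g] += post[k] / z
            totals[k] += post[k] / z
    # M-step: ML estimates from the aggregated statistics
    pi = [totals[k] / len(data) for k in range(K)]
    theta = [{g: counts[k][g] / totals[k] for g in grades} for k in range(K)]

print([round(p, 3) for p in pi])
```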

50
Learning w. Missing Data: EM
Dempster et al., 1977
[Diagram: EM on the ground network; intelligence low/high, difficulty easy/hard]
51
Movie Data
Internet Movie Database: http://www.imdb.com
52
Discovering Hidden Types
Learn model using EM
[Diagram: hidden Type variable added to each object class]
Taskar, Segal, K., 2001
53
Discovering Hidden Types
Taskar, Segal, K., 2001
54
Biology 101: Gene Expression
[Diagram: transcription factor Swi5 binding DNA]
Cells express different subsets of their
genes in different tissues and under different
conditions
55
Gene Expression Microarrays
  • Measure mRNA level for all genes in one condition
  • Hundreds of experiments
  • Highly noisy

[Heatmap: Genes × Experiments matrix; entry (i, j) = expression of gene i in experiment j; induced vs. repressed]
56
Standard Analysis
  • Cluster genes by similarity of expression
    profiles
  • Manually examine clusters to understand what's
    common to genes in cluster

57
General Approach
  • Expression level is a function of gene properties
    and experiment properties
  • Learn model that best explains the data
  • Observed properties: gene sequence, array
    condition, ...
  • Hidden properties: gene cluster
  • Assignment to hidden variables (e.g., module
    assignment)
  • Expression level as function of properties

58
Clustering as a PRM
[PRM diagram: Gene.Cluster and Experiment.ID jointly determine Expression.Level]
59
Modular Regulation
  • Learn functional modules
  • Clusters of genes that are similarly controlled
  • Learn control program for modules
  • Expression as function of control genes

60
Module Network PRM
[PRM diagram: Gene.Cluster and the experiment's control-gene activity levels (Control1, Control2, ..., Controlk: activity level of control gene in experiment) determine Expression.Level]
Segal, Regev, Peer, Koller, Friedman, 2003
61
Experimental Results
  • Yeast Stress Data (Gasch et al.)
  • 2355 genes that showed activity
  • 173 experiments (microarrays)
  • Diverse environmental stress conditions (e.g.
    heat shock)
  • Learned module network with 50 modules
  • Cluster assignments are hidden variables
  • Structure of dependency trees unknown
  • Learned model using structural EM algorithm

Segal et al., Nature Genetics, 2003
62
Biological Evaluation
  • Find sets of co-regulated genes (regulatory
    module)
  • Find the regulators of each module

[Results: 46/50 (modules), 30/50 (regulators)]
Segal et al., Nature Genetics, 2003
63
Experimental Results
  • Hypothesis: Regulator X regulates process Y
  • Experiment: knock out X and rerun the experiment
[Diagram: knockout of regulator X]
Segal et al., Nature Genetics, 2003
64
Differentially Expressed Genes
Segal et al., Nature Genetics, 2003
65
Biological Experiments: Validation
  • Were the differentially expressed genes predicted
    as targets?
  • Rank modules by enrichment for diff. expressed
    genes

Segal et al., Nature Genetics, 2003
66
Biology 102: Pathways
  • Pathways are sets of genes that act together to
    achieve a common function

69
Finding Pathways: Attempt I
  • Use protein-protein interaction data
  • Problems
  • Data is very noisy
  • Structure is lost
  • Large connected component in interaction graph
    (3527/3589 genes)

70
Finding Pathways: Attempt II
  • Use expression microarray clusters
  • Problems
  • Expression is only a weak indicator of
    interaction
  • Interacting pathways are not separable
[Diagram: Pathway I, Pathway II]
71
Finding Pathways: Our Approach
  • Use both types of data to find pathways
  • Find active interactions using gene expression
  • Find pathway-related co-expression using
    interactions
[Diagram: Pathway I, Pathway II, Pathway III, Pathway IV]
Segal, Wang, K., 2003
72
Probabilistic Model
[Model diagram: each Gene has a Pathway variable and expression levels Exp1 ... ExpN in N arrays; an Interacts relation links genes whose protein products interact, with a compatibility potential on each interacting pair]
Cluster all genes collectively, maximizing the
joint model likelihood
Segal, Wang, K., 2003
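A tiny sketch of how the two signals combine in the objective; the scores and compatibility weight are invented:

```python
from itertools import product

# Assign each gene to a pathway so that expression fit plus interaction
# compatibility is maximal.
genes = ["g1", "g2", "g3"]
interactions = [("g1", "g2")]                 # protein product interaction
expr_score = {("g1", 0): 1.0, ("g1", 1): 0.9, # log-score of gene in pathway
              ("g2", 0): 0.2, ("g2", 1): 0.8,
              ("g3", 0): 0.1, ("g3", 1): 1.5}

def score(assign):
    s = sum(expr_score[g, assign[g]] for g in genes)
    # compatibility potential: reward interacting genes in the same pathway
    s += sum(2.0 for a, b in interactions if assign[a] == assign[b])
    return s

best = max((dict(zip(genes, cs)) for cs in product([0, 1], repeat=len(genes))),
           key=score)
print(best)   # the interaction pulls g1 into g2's pathway
```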
73
Capturing Protein Complexes
  • Independent data set of interacting proteins

[Chart: Num Complexes (0-400) vs. Complex Coverage (%), comparing our method to standard expression clustering]
  • 124 complexes covered at 50% coverage for our method
  • 46 complexes covered at 50% coverage for clustering
Segal, Wang, K., 2003
74
RNAse Complex Pathway
YHR081W, RRP40, RRP42, MTR3, RRP45, RRP4, RRP43,
DIS3, TRM7, SKI6, RRP46, CSL4
  • Includes all 10 known pathway genes
  • Only 5 genes found by clustering
[Diagram: predicted pathway structure over the 12 genes above]
Segal, Wang, K., 2003
75
Interaction Clustering
  • RNAse complex found by interaction clustering as
    part of a cluster with 138 genes

Segal, Wang, K., 2003
76
Truth in Advertising
  • Huge graphical models
  • 3000-50,000 hidden variables
  • Hundreds of thousands of observed nodes
  • Very densely connected
  • Learning
  • Multiple iterations of model updates
  • Each requires running inference on the model
  • Inference
  • Exact inference is intractable
  • Use belief propagation
  • Single inference iteration: 1-6 hours
  • Algorithmic ideas key to scaling

77
Relational Data: A New Challenge & Opportunity
  • Data consists of different types of instances
  • Instances are related in complex networks
  • Instances are not independent
  • New tasks for machine learning
  • Collective classification
  • Relational clustering
  • Link prediction
  • Group detection

78
Thank You!
http://robotics.stanford.edu/~koller/