Magic Moments: Moment-based Approaches to Structured Output Prediction
1
Magic Moments: Moment-based Approaches to Structured Output Prediction
The Analysis of Patterns
  • Elisa Ricci
  • joint work with Nobuhisa Ueda, Tijl De Bie, Nello
    Cristianini

Thursday, October 25th
2
Outline
  • Learning in structured output spaces
  • New algorithms based on Z-score
  • Experimental results and computational issues
  • Conclusions

3
Structured data everywhere!!!
  • Many problems involve highly structured data
    which can be represented by sequences, trees and
    graphs.
  • Temporal, spatial and structural dependencies
    between objects are modeled.
  • This phenomenon is observed in several fields
    such as computational biology, computer vision,
    natural language processing or web data analysis.

4
Learning with structured data
  • Machine learning and data mining algorithms must be able to analyze vast amounts of complex, structured data efficiently and automatically.
  • The goal of structured learning algorithms is to
    predict complex structures, such as sequences,
    trees, or graphs.
  • Using traditional algorithms to cope with
    problems involving structured data often implies
    a loss of information about the structure.

5
Supervised learning
  • Data are available in the form of examples and their associated correct answers.

Training set: pairs (x1, y1), ..., (xl, yl), with xi ∈ X, yi ∈ Y.
Hypothesis space: H.
Learning: find f ∈ H s.t. f(xi) ≈ yi.
Prediction: y = f(x) for a new input x.
6
Classification
  • A typical supervised learning task is
    classification.

Named entity recognition (NER): locate named entities in text. Entities of interest are person names, location names, organization names, and miscellaneous (dates, times, ...).

x: observed variable, a word in a sentence.
y: label, its entity tag.
Multiclass classification.

Example (Spanish news wire; roughly "PP already studying the regional TV bill sent by the Junta"):
PP ESTUDIA YA PROYECTO LEY TV REGIONAL REMITIDO POR LA JUNTA. Merida.
Each word is assigned an entity tag (e.g. location for "Merida", non-name for function words).
7
Sequence labeling
  • Can we consider the interactions between adjacent words?
  • Goal: produce a joint labeling of all the words in the sentence.

Sequence labeling: given an input sequence x, reconstruct the associated label sequence y of equal length.

x = (x1, ..., xn): observed sequence, the words in a sentence.
y = (y1, ..., yn): label sequence, the entity tags.
8
Sequence alignment
Biological sequence alignment is used to
determine the similarity between biological
sequences.
ACTGATTACGTGAACTGGATCCA
ACTC--TAGGTGAAGTG-ATCCA

Given two sequences S1, S2 ∈ S*, a global alignment is an assignment of gaps so as to line up each letter in one sequence with either a gap or a letter in the other sequence.

S = {A, T, G, C}; S1 = ATGCTTTC, S2 = CTGTCGCC

ATGCTTTC---
---CTGTCGCC
9
Sequence alignment
Sequence alignment: given a pair of sequences x, predict the correct sequence y of alignment operations (e.g. matches, mismatches, gaps). Alignments can be represented as paths from the upper-left to the lower-right corner of the alignment graph.
10
RNA secondary structure prediction
RNA secondary structure prediction: given an RNA sequence, predict the most likely secondary structure. The study of RNA structure is important for understanding its functions.

AUGAGUAUAAGUUAAUGGUUAAAGUAAAUGUCUUCCACACAUUCCAUCUGAUUUCGAUUCUCACUACUCAU
11
Sequence parsing
Sequence parsing: given an input sequence x, determine the associated parse tree y under a given context-free grammar.

Example: context-free grammar G = (V, A, R, S):
V = {S}: set of non-terminal symbols
A = {G, A, U, C}: set of terminal symbols
R: S → SS | GSC | CSG | ASU | USA | ε
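For instance, one derivation under this grammar is S → GSC → G(ASU)C → GA(ε)UC, yielding the string GAUC; the parse tree y records these rule applications, pairing G with C and A with U as in RNA base pairing.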
12
Generative models
Sequence labeling
  • Traditionally, HMMs have been used for sequence labeling.
  • Two main drawbacks:
  • The conditional independence assumptions are often too restrictive: HMMs cannot represent multiple interacting features or long-range dependencies between the observations.
  • They are typically trained by maximum likelihood (ML) estimation.

13
Discriminative models
  • Specify the probability of a possible output y given an observation x (consider the conditional probability P(y|x) rather than the joint probability P(y,x)).
  • Do not require the strict independence assumptions of generative models.
  • Arbitrary features of the observations are considered.
  • Conditional Random Fields (CRFs)
  • Lafferty et al., 01

14
Learning in structured output spaces
  • Several discriminative algorithms have emerged
    recently in order to predict complex structures,
    such as sequences, trees, or graphs.
  • New discriminative approaches.
  • Problems analyzed:
  • Given a training set of correct pairs of sentences and their associated entity tags, learn to extract entities from a new sentence.
  • Given a training set of correct biological alignments, learn to align two unknown sequences.
  • Given a training set of correct RNA secondary structures associated with a set of sequences, learn to determine the secondary structure of a new sequence.
  • This is not an exhaustive list of possible
    applications.

15
Learning in structured output spaces
  • Multilabel supervised classification (output y = (y1, ..., yn)).

Training set: pairs (x1, y1), ..., (xl, yl) with structured outputs yi.
Hypothesis space: H.
Learning: find h ∈ H s.t. h(xi) ≈ yi.
Prediction: y = h(x) for a new input x.
16
Learning in structured output spaces
  • Three main phases:
  • Encoding
  • define a suitable feature map f(x,y).
  • Compression
  • characterize the output space in a synthetic and
    compact way.
  • Optimization
  • define a suitable objective function and use it
    for learning.

17
Learning in structured output spaces
  • Encoding
  • define a suitable feature map f(x,y).
  • Compression
  • characterize the output space in a synthetic and
    compact way.
  • Optimization
  • define a suitable objective function and use it
    for learning.

18
Encoding
S1 = ATGCTTTC, S2 = CTGTCGCC
  • Features must be defined in a way such that prediction can be computed efficiently.
  • The feature vector f(x,y) decomposes as a sum of elementary features on parts.
  • Parts are typically edges or nodes in graphs.

19
Encoding
Sequence labeling
Example: CRF with HMM features.
In general, features reflect long-range interactions (when labeling xi, past and future observations are taken into account). Arbitrary features of the observations are considered (e.g. spelling properties in NER).
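To make the encoding concrete, here is a minimal Python sketch (an illustration, not the authors' code; function and variable names are hypothetical) of a joint feature map with HMM features, i.e. counts of label transitions (edge parts) and symbol emissions (node parts). A real CRF feature map would add further observation features such as the spelling properties mentioned above.

```python
import numpy as np

def hmm_features(x, y, n_labels, n_symbols):
    """Joint feature map f(x, y) with HMM features: counts of
    label transitions (edges) and symbol emissions (nodes)."""
    trans = np.zeros((n_labels, n_labels))  # transition counts y[i-1] -> y[i]
    emit = np.zeros((n_labels, n_symbols))  # emission counts y[i] -> x[i]
    for i in range(len(x)):
        if i > 0:
            trans[y[i - 1], y[i]] += 1
        emit[y[i], x[i]] += 1
    # f(x, y) decomposes as a sum of elementary features on parts.
    return np.concatenate([trans.ravel(), emit.ravel()])

# Example: 3 labels, 4 symbols, a length-4 sequence.
f = hmm_features(x=[0, 2, 1, 2], y=[0, 1, 1, 2], n_labels=3, n_symbols=4)
```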
20
Encoding
Sequence alignment
  • 3-parameter model: one weight each for matches, mismatches and gaps.
  • In practice more complex models are used:
  • 4-parameter model: affine function for gap penalties, i.e. different costs if the gap starts in a given position (gap opening penalty) or if it continues (gap extension penalty).
  • 211/212-parameter model: f(x,y) contains the statistics associated with the gap penalties and all the possible pairs of amino acids.
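As a concrete instance of the 3-parameter model, a minimal Python sketch (hypothetical names; illustrative only) that counts matches, mismatches and gaps in a given alignment, so that the alignment score is the dot product of these counts with the three learned weights:

```python
def alignment_features(s1_aligned, s2_aligned):
    """3-parameter feature map for a global alignment: counts of
    matches, mismatches and gaps.  The alignment score is then
    w . f(x, y) with w = (w_match, w_mismatch, w_gap)."""
    matches = mismatches = gaps = 0
    for a, b in zip(s1_aligned, s2_aligned):
        if a == '-' or b == '-':
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return (matches, mismatches, gaps)

# The alignment from the earlier slide:
f = alignment_features("ATGCTTTC---", "---CTGTCGCC")  # -> (4, 1, 6)
```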

21
Encoding
Sequence parsing
The feature vector contains the statistics associated with the occurrences of the grammar rules in the parse tree y.
22
Encoding
  • Having defined these features, predictions can be computed efficiently with dynamic programming (DP), filling a DP table:
  • Sequence labeling: Viterbi algorithm.
  • Sequence alignment: Needleman-Wunsch algorithm.
  • Sequence parsing: Cocke-Younger-Kasami (CYK) algorithm.
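For sequence labeling, a minimal Python sketch of Viterbi decoding for the linear model (illustrative, with hypothetical names; the score matrices are assumed precomputed by applying w to the emission and transition features):

```python
import numpy as np

def viterbi(emit_scores, trans_scores):
    """Predict argmax_y w . f(x, y) for a chain model with HMM
    features.  emit_scores[i, p] is the score of labeling position i
    with label p; trans_scores[p, q] is the score of transition p -> q.
    Fills the DP table of best partial scores, then backtracks."""
    n, k = emit_scores.shape
    table = np.zeros((n, k))            # best score ending at (position, label)
    back = np.zeros((n, k), dtype=int)  # backpointers
    table[0] = emit_scores[0]
    for i in range(1, n):
        cand = table[i - 1][:, None] + trans_scores  # cand[p, q]: via previous p
        back[i] = cand.argmax(axis=0)
        table[i] = cand.max(axis=0) + emit_scores[i]
    y = [int(table[-1].argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]

# Example: 4 positions, 3 labels, random illustrative scores.
rng = np.random.default_rng(0)
y_hat = viterbi(rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
```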
23
Learning in structured output spaces
  • Encoding
  • define a suitable feature map f(x,y).
  • Compression
  • characterize the output space in a synthetic and
    compact way.
  • Optimization
  • define a suitable objective function and use it
    for learning.

24
Computing moments
  • The number N of possible output vectors yk given an observation x is typically huge.
  • To characterize the distribution of the scores over all outputs, its mean and its variance are considered.
  • The mean m and covariance C can be computed efficiently with DP techniques.

25
Computing moments
Sequence labeling
The number N of possible label sequences yk given an observation sequence x is exponential in the length of the sequence. An algorithm similar to the forward algorithm is used to compute m and C.

Recursive formula: mean value associated with the feature which represents the emission of a symbol q at state p.
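What m and C are can be checked by brute force on tiny instances: a Python sketch (reusing the hmm_features sketch from the encoding slide; feasible only for very short sequences) that enumerates all label sequences. This is exactly the quantity the forward-like DP computes without the exponential enumeration.

```python
import itertools
import numpy as np

def moments_brute_force(x, n_labels, n_symbols):
    """Mean m and covariance C of f(x, y) over all N = n_labels**len(x)
    label sequences, by exhaustive enumeration.  The forward-like DP
    returns the same m and C without enumerating the outputs."""
    feats = [hmm_features(x, list(y), n_labels, n_symbols)
             for y in itertools.product(range(n_labels), repeat=len(x))]
    F = np.array(feats)
    m = F.mean(axis=0)
    C = np.cov(F, rowvar=False, bias=True)  # centered second-order moments
    return m, C

m, C = moments_brute_force(x=[0, 2, 1], n_labels=2, n_symbols=3)
```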
26
Computing moments
  • Basic idea behind the recursive formulas:
  • Mean values are computed as expectations of the features over all outputs, m = E[f(x,y)].
  • Variances are computed by centering the second-order moments, C = E[f(x,y) f(x,y)'] − m m'.
27
Computing moments
  • Problem: high computational cost for large feature spaces.
  • 1st solution: exploit the structure and the sparseness of the covariance matrix C.
  • In sequence labeling for a CRF with HMM features, the number of distinct values in C is linear in the size of the observation alphabet.
  • 2nd solution: a sampling strategy.
28
Learning in structured output spaces
  • Encoding
  • define a suitable feature map f(x,y).
  • Compression
  • characterize the output space in a synthetic and
    compact way.
  • Optimization
  • define a suitable objective function and use it
    for learning.

29
Z-score
  • A new optimization criterion, particularly suited for non-separable cases.
  • Minimize the number of output vectors with score higher than the score of the correct pair.
  • Maximize the Z-score.
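Written out (a plain-text rendering, assuming the standard definition of a z-score; bi = f(xi, yi) − mi is the centered feature vector of the correct pair and Ci the covariance of the features over all candidate outputs for xi, the quantities from the compression step):

    zi(w) = w·bi / sqrt(w'Ci w),

i.e. the score of the correct output measured in standard deviations above the mean score over all candidate outputs.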

30
Z-score
  • The Z-score can be expressed as a function of the parameters w.
  • Since the Z-score is invariant to rescaling of w, maximizing it yields two equivalent optimization problems: minimize w'Cw subject to w·b = 1, or maximize w·b subject to w'Cw = 1.

31
Z-score
  • Ranking loss.
  • An upper bound on the ranking loss is minimized.
  • The number of output vectors with score higher than the score of the correct pair is thereby minimized.
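One way to obtain such a bound (a sketch via Cantelli's one-sided Chebyshev inequality; the slides' exact derivation may differ): if a candidate output y is drawn uniformly at random, its score w·f(x,y) has mean w·m and variance w'Cw, so the fraction of outputs scoring at least as high as the correct pair is at most (w'Cw) / (w'Cw + (w·b)²) = 1 / (1 + z(w)²). Maximizing the Z-score therefore minimizes this upper bound on the normalized ranking loss.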

32
Previous approaches
  • Minimize the number of incorrect macrolabels y:
  • CRFs Lafferty et al., 01, HMSVM Altun et al., 03, averaged perceptron Collins 02.
  • Minimize the number of incorrect microlabels yi:
  • M3Ns Taskar et al., 03, SVMISO Tsochantaridis et al., 04.

33
SODA
  • Given a training set T, the empirical risk associated with the upper bound on the ranking loss is minimized.
  • An equivalent formulation in terms of C and b is considered to solve it.

SODA (Structured Output Discriminant Analysis)
34
SODA
  • Convex optimization.
  • If C is not PSD, regularization can be introduced.
  • Solution: a simple matrix inversion.
  • Fast conjugate gradient methods are available.
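A minimal numerical sketch of this step (names and the regularization constant lam are illustrative; it assumes the optimum has the form w ∝ C⁻¹b implied by the Z-score criterion):

```python
import numpy as np
from scipy.sparse.linalg import cg

def soda_weights(C, b, lam=1e-3):
    """Regularized matrix-inversion step: solve (C + lam*I) w = b.
    A small lam > 0 also copes with a C that is not PSD."""
    return np.linalg.solve(C + lam * np.eye(C.shape[0]), b)

def soda_weights_cg(C, b, lam=1e-3):
    """Same solution via conjugate gradient, preferable for large
    feature dimensions (exploits the structure/sparseness of C)."""
    w, _ = cg(C + lam * np.eye(C.shape[0]), b)
    return w
```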

35
Rademacher bound
  • The bound shows that learning based on the upper bound on the ranking loss is effectively achieved.
  • The bound also holds in the case where b and C are estimated by sampling.
  • Two directions of sampling:
  • For each training pair, only a limited number n of incorrect outputs is considered to estimate b and C.
  • Only a finite number l of input-output pairs is given in the training set.
  • The empirical expectation of the estimated loss (with b and C computed by random sampling) is a good approximate upper bound for the expected loss.
  • The latter is an upper bound for the ranking loss, so the Rademacher bound is also a bound on the expectation of the ranking loss.
36
Rademacher bound
  • Theorem (Rademacher bound for SODA). With probability at least 1 − δ over the joint draw of the random sample T and the random samples from the output space for each training pair that are taken to approximate b and C, the following bound holds for any w with squared norm smaller than c, whereby M is a constant and the number of random samples for each training pair is assumed equal to n.
  • The Rademacher complexity terms decrease with n and l respectively, such that the bound becomes tight for increasing n and l, as long as n grows faster than log(l).

37
Z-score approach
  • How should the Z-score of a whole training set be defined?
  • Another possible approach (independence assumption).
  • A convex optimization problem which can again be solved by a simple matrix inversion.
  • By maximizing the Z-score, most of the linear constraints are satisfied.
38
Iterative approach
  • One may want to impose the violated constraints explicitly.
  • This is again a convex optimization problem that can be solved with an iterative algorithm similar to previous approaches (HMSVM Altun et al., 03, averaged perceptron Collins 02); see the sketch below.
  • Optionally, relax constraints (e.g. add slack variables for non-separable problems).
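A hedged sketch of such an iterative scheme, with every problem-specific piece passed in as a callable (solve_qp, predict_dp and score are hypothetical placeholders for the convex solver, the DP decoder and the linear score w·f(x,y)):

```python
def iterative_training(examples, solve_qp, predict_dp, score, n_iters=50):
    """Sketch of a constraint-adding loop: re-solve the convex problem,
    then add the currently violated outputs as explicit constraints.
    solve_qp, predict_dp and score are hypothetical callables."""
    constraints = []
    w = solve_qp(constraints)
    for _ in range(n_iters):
        violated = []
        for x, y in examples:
            y_hat = predict_dp(w, x)               # Viterbi / NW / CYK decode
            if score(w, x, y_hat) > score(w, x, y):
                violated.append((x, y, y_hat))     # constraint to impose
        if not violated:
            break                                  # all constraints satisfied
        constraints.extend(violated)
        w = solve_qp(constraints)                  # re-solve with constraints
    return w
```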

39
Iterative approach
40
Experimental results
Sequence labeling: artificial data.
  • Chain CRF with HMM features.
  • Sequence length: 50. Training set size: 20 pairs. Test set size: 100 pairs.
  • Comparison with SVMISO Tsochantaridis et al., 04, perceptron Collins 02, and CRFs Lafferty et al., 01.
  • Average number of incorrect labels, varying the level of noise p.

41
Experimental results
Sequence labeling: artificial data.
  • HMM features.
  • Noise level p = 0.2.
  • Average number of incorrect labels and computational time as functions of the training set size.

42
Experimental results
Sequence labeling: artificial data.
Chain CRF with HMM features. Sequence length: 10. Training set size: 50 pairs. Test set size: 100 pairs. Level of noise: p = 0.2. Comparison with SVMISO Tsochantaridis et al., 04. Labeling error on the test set and average training time as functions of the observation alphabet size.
43
Experimental results
Sequence labeling: artificial data.
  • Chain CRF with HMM features.
  • Adding constraints is not very useful when data are noisy and not linearly separable.

44
Experimental results
Sequence labeling
NER: Spanish news wire articles (Special Session of CoNLL-02). 300 sentences with an average length of 30 words. 9 labels: non-name, beginning and continuation of persons, organizations, locations and miscellaneous names. Two sets of binary features: S1 (HMM features) and S2 (S1 plus HMM features for the previous and the next word).
Labeling error on the test set (5-fold cross-validation).
45
Experimental results
Sequence alignment: artificial sequences.
Test error (number of incorrectly aligned pairs) as a function of the training set size.
Original and reconstructed substitution matrices.
46
Experimental results
  • Sequence parsing
  • G6 grammar in Dowell and Eddy, 2004.
  • RNA sequences of five families extracted from the Rfam database Griffiths-Jones et al., 2003.

Prediction results on five-fold cross-validation.
47
Conclusions
  • New methods for learning in structured output spaces.
  • Accuracy comparable with state-of-the-art techniques.
  • Easy to implement (DP for the matrix computations and a simple optimization problem).
  • Fast for large training sets and a reasonable number of features.
  • Mean and variance computations are parallelizable over large training sets.
  • Conjugate gradient techniques are used in the optimization phase.
  • Three applications analyzed: sequence labeling, sequence parsing and sequence alignment.
  • Future work:
  • Test the scalability of this approach using approximate techniques.
  • Develop a dual version with kernels.

48
Thank you