A Review of Sequential Supervised Learning
1
A Review of Sequential Supervised Learning
  • Guohua Hao
  • School of Electrical Engineering and Computer
    Science
  • Oregon State University

2
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

3
Traditional supervised learning
  • Given training examples (x, y)
  • x: observation feature vector
  • y: class label
  • Independent and identically distributed (i.i.d.)
    assumption

4
Sequential learning tasks
  • Part-of-speech tagging
  • Assign a part-of-speech tag to each word in a
    sentence
  • Example
  • I am taking my examination
  • <pron verb verb pron noun>

5
Sequential supervised learning
  • Given training examples (x, y)
  • x = (x1, ..., xT): observation sequence
  • y = (y1, ..., yT): label sequence
  • Goal: learn a classifier h that maps a new
    observation sequence x to its label sequence y

6
Applications
  • Protein secondary structure prediction
  • Named entity recognition
  • FAQ document segmentation
  • Optical character recognition for words
  • NetTalk task

7
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

8
Modeling accuracy
  • Two kinds of relations
  • Horizontal relations (among the labels yt)
  • Directed or undirected graphical model
  • Vertical relations (between xt and yt)
  • Generative or discriminative model
  • Feature representation
[Figure: chain-structured graphical model with label nodes yt-1, yt, yt+1 and observation nodes xt-1, xt, xt+1]
9
Feature representation
  • Arbitrary non-independent features of observation
    sequence
  • Overlapping features
  • Global features
  • Very high or infinite-dimensional feature spaces:
    kernel methods

10
Computational efficiency
  • Training efficiency
  • Testing efficiency
  • Scalability to large numbers of features and labels

11
Generalization bound
  • Margin-maximizing property

12
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

13
Sliding window/recurrent sliding window
  • Sliding window
  • Vertical relations: ok
  • Horizontal relations: only those that depend on
    nearby x's
  • Recurrent sliding window
  • Both vertical and horizontal relations: ok
  • Horizontal relations captured in only one
    direction (see the sketch below)

[Figures: a sliding window over xt-1, xt, xt+1 predicting each yt independently, and a recurrent sliding window that also feeds the previous prediction yt-1 back as an input]
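A minimal sketch of the two reductions, assuming a generic trained classifier object clf with a predict method; all names here are illustrative, not from the presentation:

    def make_window(xs, t, half_width, pad=None):
        # Concatenate the observations in a window centered at position t,
        # padding past the sequence boundaries.
        T = len(xs)
        return [xs[i] if 0 <= i < T else pad
                for i in range(t - half_width, t + half_width + 1)]

    def sliding_window_examples(xs, ys, half_width=1):
        # Plain sliding window: each position becomes an independent
        # (window, label) training example.
        return [(make_window(xs, t, half_width), ys[t]) for t in range(len(xs))]

    def recurrent_predict(clf, xs, half_width=1, start_label="<s>"):
        # Recurrent variant: predict left to right, feeding the previous
        # predicted label back in as a feature (horizontal context in one
        # direction only).
        ys = []
        for t in range(len(xs)):
            features = make_window(xs, t, half_width)
            features.append(ys[t - 1] if t > 0 else start_label)
            ys.append(clf.predict([features])[0])
        return ys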
14
Evaluation
  • Feature representation
  • Arbitrary non-independent observation features: ok
  • Computational efficiency
  • Depends on the classical supervised learning
    algorithm used

15
Generative model Hidden Markov Models (HMM)
  • Representation of the joint distribution P(x, y)
  • Extension of naïve Bayes networks
  • Two kinds of distributions
  • State transition probability P(yt | yt-1)
  • State-specific observation distribution P(xt | yt)

[Figure: HMM as a directed chain yt-1 → yt → yt+1, with each state yt generating its observation xt]
16
Evaluation
  • Computational efficiency
  • Maximum likelihood (ML) training: efficient
  • Prediction
  • Per-sequence loss: Viterbi, O(TN²) (sketched below)
  • Per-label loss: forward-backward, O(TN²)
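A minimal log-space Viterbi sketch for a first-order HMM with N states; the parameter names (start, trans, emit) are assumptions for illustration, not the presentation's notation:

    def viterbi(xs, states, start, trans, emit):
        # start[j]: log P(y1 = j); trans[i][j]: log P(yt = j | yt-1 = i);
        # emit[j](x): log P(x | y = j).  O(T N^2) time overall.
        delta = {j: start[j] + emit[j](xs[0]) for j in states}
        backptrs = []
        for x in xs[1:]:
            prev, delta, ptr = delta, {}, {}
            for j in states:
                best = max(states, key=lambda i: prev[i] + trans[i][j])
                delta[j] = prev[best] + trans[best][j] + emit[j](x)
                ptr[j] = best
            backptrs.append(ptr)
        # Trace back from the best final state.
        path = [max(states, key=lambda j: delta[j])]
        for ptr in reversed(backptrs):
            path.append(ptr[path[-1]])
        return list(reversed(path))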

17
Evaluation
  • Modeling accuracy
  • Vertical relations modeled in a generative way
  • ML training can lead to suboptimal classification
    accuracy
  • Feature representation
  • Conditional independence assumption
  • Arbitrary non-independent observation features
    not allowed

18
Discriminative graphical model
  • Model the conditional distribution P(y | x)
  • Extension of logistic regression
  • Arbitrary non-independent observation features
    allowed
  • Typical methods
  • Maximum entropy Markov model (MEMM)
  • Conditional random fields (CRF)

19
Maximum entropy Markov models
  • Per-state transition distribution P(yt | yt-1, xt)
  • Maximum entropy formulation (see below)

[Figure: MEMM as a directed chain yt-1 → yt → yt+1, with each observation xt feeding into its yt]
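For reference, the per-state maximum entropy distribution in its standard exponential form (reconstructed from the MEMM literature; the slide's own equation did not survive extraction):

\[
P(y_t \mid y_{t-1}, x_t) \;=\; \frac{1}{Z(x_t, y_{t-1})} \exp\Big( \sum_k \lambda_k f_k(x_t, y_t) \Big)
\]

Note that Z(x_t, y_{t-1}) normalizes over y_t separately for each source state, which is exactly the per-source-state normalization criticized on the next slide.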
20
Evaluation
  • Training: generalized iterative scaling;
    prediction: Viterbi or forward-backward
  • Label bias problem
  • Drawback of directed conditional Markov models
  • Per-source-state normalization
  • Low-entropy transition distributions pay little
    attention to the observation
  • Favors the more frequent label sequences

21
Conditional Random Fields
  • Undirected graphical model
  • Markov random field globally conditioned on x
  • Two kinds of features
  • Dependence between neighboring labels, given x
  • Dependence of the current label on x

[Figure: CRF as an undirected chain over yt-1, yt, yt+1, globally conditioned on the whole observation sequence x]
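The resulting conditional distribution has the standard chain-CRF form (reconstructed here, since the slide's equation did not survive extraction):

\[
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)
\]

The single global normalizer Z(x) is what lets CRFs avoid the label bias problem.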
22
Training CRF
  • Loss function
  • Per-sequence vs. per-label
  • Optimization methods
  • Improved iterative scaling: slow
  • General-purpose convex optimization: an improvement
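Whichever optimizer is used works from the gradient of the conditional log-likelihood, which per feature weight takes the standard observed-minus-expected form (a known result, not shown on the slides):

\[
\frac{\partial \mathcal{L}}{\partial \lambda_k}
\;=\; \sum_{i} \Big( F_k(y^{(i)}, x^{(i)}) \;-\; \mathbb{E}_{p(y \mid x^{(i)})}\big[ F_k(y, x^{(i)}) \big] \Big),
\qquad
F_k(y, x) = \sum_t f_k(y_{t-1}, y_t, x, t)
\]

The expectation is computed with the forward-backward algorithm.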

23
Problems with rich feature representation
  • Large number of features
  • Slows down parameter estimation
  • Eliminate redundancy
  • More expressive features
  • Improve prediction accuracy
  • Combinatorial explosion
  • Incorporate only the necessary combinations

24
Feature induction
  • Iteratively construct feature conjunctions
  • Candidate features
  • Atomic features
  • Conjunctions of atomic and already-incorporated
    features
  • Select candidates by the maximum increase in
    conditional log-likelihood

25
Feature induction (contd)
  • Gradient tree boosting
  • Potential function: a sum of regression trees
  • Path through a tree: a feature combination
  • Value at a leaf: the weight of that combination
  • Significant improvement in training speed and
    prediction accuracy

26
More evaluation
  • Does not scale to a large number of classes
  • Forward-backward
  • Viterbi
  • No generalization bound

27
Traditional discriminant function
  • Directly measure compatibility between label and
    observation
  • Simple linear discriminant function
  • Classifier
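In the binary case, a minimal reconstruction of the two missing equations (standard notation, assumed here):

\[
f(x) = \langle w, x \rangle + b,
\qquad
h(x) = \operatorname{sign}\big(f(x)\big)
\]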

28
Support vector machines
  • Maximize the margin, i.e., the classification
    confidence
  • L1-norm soft-margin SVMs formulation
  • Functional margin
  • Slack variable
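The standard L1-norm soft-margin primal, reconstructed for reference:

\[
\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i \big( \langle w, x_i \rangle + b \big) \ge 1 - \xi_i,
\quad \xi_i \ge 0
\]

Here y_i(⟨w, x_i⟩ + b) is the functional margin and ξ_i the slack variable that absorbs violations.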

29
Dual formulation
  • Lagrange multiplier
  • Dual optimization problem
  • Dual discriminant function
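Introducing a Lagrange multiplier α_i for each margin constraint gives the standard dual (reconstructed):

\[
\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j \langle x_i, x_j \rangle
\quad \text{s.t.} \quad
0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0
\]

with dual discriminant function f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b, which touches the data only through inner products.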

30
Kernel trick
  • Kernel function
  • Avoid explicit feature representation
  • Feature space of very high or infinite dimension
  • Non-linear discriminant function/classifier

31
Voted perceptron
  • Perceptron: an online algorithm
  • Voted perceptron: converts it to a batch
    algorithm
  • Deterministic leave-one-out method
  • Predicts by majority voting over the intermediate
    perceptrons (sketched below)
  • Admits the kernel trick; computationally
    efficient
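A minimal kernelized voted-perceptron sketch, assuming binary labels in {-1, +1} and a kernel function k; the names are illustrative:

    def sign(v):
        return 1 if v > 0 else -1

    def train_voted_perceptron(X, Y, k, epochs=1):
        # Each hypothesis is represented by its mistake list (kernel form);
        # weights[c] counts how many examples hypothesis c survived.
        mistakes, weights = [[]], [0]
        for _ in range(epochs):
            for x, y in zip(X, Y):
                score = sum(yi * k(xi, x) for xi, yi in mistakes[-1])
                if y * score <= 0:                 # mistake: new hypothesis
                    mistakes.append(mistakes[-1] + [(x, y)])
                    weights.append(1)
                else:                              # survived one more example
                    weights[-1] += 1
        return mistakes, weights

    def predict_voted(mistakes, weights, k, x):
        # Weighted majority vote over all intermediate perceptrons.
        vote = sum(w * sign(sum(yi * k(xi, x) for xi, yi in m))
                   for m, w in zip(mistakes, weights))
        return sign(vote)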

32
Extension to multi-class
  • Discriminant function
  • Classifier
  • Functional margin
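In the usual multi-class notation, with one weight vector w_y per class, the three items above read:

\[
F(x, y) = \langle w_y, x \rangle,
\qquad
h(x) = \arg\max_{y} F(x, y),
\qquad
\gamma_i = F(x_i, y_i) - \max_{y \ne y_i} F(x_i, y)
\]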

33
Multi-class SVMs
  • L1 norm soft-margin SVMs
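The corresponding program in the Crammer-Singer style, reconstructed since the slide's formula did not survive extraction:

\[
\min_{w, \xi}\ \frac{1}{2} \sum_{y} \|w_y\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
\langle w_{y_i}, x_i \rangle - \langle w_y, x_i \rangle \ge 1 - \xi_i
\ \ \forall y \ne y_i,
\qquad \xi_i \ge 0
\]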

34
Discriminant function in sequential supervised
learning
  • Treat it as a multi-class problem
  • Exponentially many classes (one per label
    sequence)
  • Learning the discriminant function
  • Voted perceptron
  • Support vector machines

35
Feature representation
  • Arbitrary non-independent observation feature
  • Feature space of high/infinite dimension
  • Feature formulation assumptions
  • Chain-structured
  • Additive (see below)
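Concretely, the chain-structured, additive assumption means the joint feature vector decomposes over adjacent label pairs, in the standard structured-prediction form:

\[
\Phi(x, y) = \sum_{t=1}^{T} \phi(y_{t-1}, y_t, x, t)
\]

This decomposition is what makes computing argmax_y ⟨w, Φ(x, y)⟩ tractable with the Viterbi algorithm.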

36
Voted perceptron
  • Update step: add the feature difference between
    the true and the predicted label sequence
  • Average the perceptrons
  • Prediction: Viterbi algorithm
  • Computationally efficient (sketched below)
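A minimal averaged structured-perceptron sketch in the Collins style; phi and viterbi_decode are assumed helpers, where phi(x, y) returns a numpy feature vector and viterbi_decode(w, x) returns the highest-scoring label sequence under w:

    import numpy as np

    def train_averaged_perceptron(data, phi, viterbi_decode, dim, epochs=5):
        w = np.zeros(dim)
        w_sum = np.zeros(dim)                  # running sum for averaging
        for _ in range(epochs):
            for x, y in data:
                y_hat = viterbi_decode(w, x)   # best sequence under current w
                if y_hat != y:
                    # Update toward the true sequence's features.
                    w += phi(x, y) - phi(x, y_hat)
                w_sum += w
        return w_sum / (epochs * len(data))    # averaged weight vector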

37
SVMs with loss function
  • Re-scale the slack variables (slack re-scaling)
  • Re-scale the margin (margin re-scaling)
  • A higher-loss output requires a larger margin
    (i.e., more confidence); see below
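In standard structured-SVM notation with loss Δ(y_i, y), the two variants change the margin constraints as follows (reconstructed):

\[
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i
\quad \text{(margin re-scaling)}
\]
\[
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge 1 - \frac{\xi_i}{\Delta(y_i, y)}
\quad \text{(slack re-scaling)}
\]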

38
Hidden Markov Support Vector Machines
  • Sparseness assumption on support vectors
  • Small number of non-zero dual variables
  • Small number of active constraints
  • Iteratively add new SVs
  • Working set contains the current SVs
  • Add the candidate SV that violates its margin
    constraint the most
  • Guarantees a strict increase of the dual
    objective function

39
  • Add a candidate only when its margin violation
    exceeds a tolerance ε
  • The dual objective is upper bounded
  • Polynomial number of SVs at convergence
  • Dual optimization remains tractable
  • The solution is close to the optimum

40
Max Margin Markov Networks
  • Exploits the structure of the output space
  • Converts the exponential number of constraints
    into a polynomial number
  • Re-scales the margin with the Hamming loss
  • Per-label zero-one loss (see below)
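The Hamming loss simply counts per-label errors:

\[
\Delta(y, \hat{y}) = \sum_{t=1}^{T} \mathbf{1}\big[\, y_t \ne \hat{y}_t \,\big]
\]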

41
Dual formulation

42
Structure Decomposition
  • Loss function: decomposes over individual labels
  • Features: decompose over the edges of the chain

43
Factorization

44
Factored dual
  • Objective function
  • Constraints and consistency checks
  • Polynomial number of variables and constraints

45
Evaluation
  • Arbitrary non-independent observation features
  • Kernel trick
  • Feature spaces of very high or infinite dimension
  • Complex non-linear discriminant functions
  • Margin-maximizing generalization bound
  • Scalability unclear

46
Open problems
  • Faster CRF training, to make it more practical
  • Effect of inference during training
  • Scalability of discriminant function methods
  • Other algorithms for learning the discriminant
    function
  • Dealing with missing values
  • Novel sequential supervised learning algorithms

47
Thank You