A Review of Sequential Supervised Learning
1
A Review of Sequential Supervised Learning
  • Guohua Hao
  • School of Electrical Engineering and Computer
    Science
  • Oregon State University

2
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

3
Traditional supervised learning
  • Given training examples (x, y)
  • x: observation feature vector
  • y: class label
  • Independent and identically distributed (i.i.d.)
    assumption

4
Sequential learning tasks
  • Part-of-speech tagging
  • Assign a part-of-speech tag to each word in a
    sentence
  • Example
  • I am taking my examination
  • <pron verb verb pron noun>

5
Sequential supervised learning
  • Given training examples (x, y)
  • x = (x1, ..., xT): observation sequence
  • y = (y1, ..., yT): label sequence
  • Goal: learn a classifier h that maps a new
    observation sequence x to its label sequence y

6
Applications
  • Protein secondary structure prediction
  • Named entity recognition
  • FAQ document segmentation
  • Optical character recognition for words
  • NetTalk task

7
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

8
Modeling accuracy
  • Two kinds of relations
  • Horizontal relations (among the labels yt)
  • Directed or undirected graphical model
  • Vertical relations (between xt and yt)
  • Generative or discriminative model
  • Feature representation
[Figure: chain-structured graphical model with label nodes yt-1, yt, yt+1 and observation nodes xt-1, xt, xt+1]
9
Feature representation
  • Arbitrary non-independent features of observation
    sequence
  • Overlapping features
  • Global features
  • Very high or infinite-dimensional feature spaces:
    kernel methods

10
Computational efficiency
  • Training efficiency
  • Testing efficiency
  • Scalability to large numbers of features and labels

11
Generalization bound
  • Margin-maximizing property

12
Outline
  • Sequential supervised learning
  • Evaluation criteria
  • Methods for sequential supervised learning
  • Open problems

13
Sliding window/recurrent sliding window
  • Sliding window
  • Vertical relations: ok
  • Horizontal relations: only those that depend on
    nearby x's
  • Recurrent sliding window
  • Both vertical and horizontal relations: ok
  • Horizontal relations captured in only one
    direction (see the sketch below)

[Figures: a sliding window over xt-1, xt, xt+1 predicting each yt independently, and a recurrent sliding window that also feeds the previous prediction yt-1 back as an input]
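A minimal sketch of the two reductions, assuming a generic trained classifier object clf with a predict method; all names here are illustrative, not from the presentation:

    def make_window(xs, t, half_width, pad=None):
        # Concatenate the observations in a window centered at position t,
        # padding past the sequence boundaries.
        T = len(xs)
        return [xs[i] if 0 <= i < T else pad
                for i in range(t - half_width, t + half_width + 1)]

    def sliding_window_examples(xs, ys, half_width=1):
        # Plain sliding window: each position becomes an independent
        # (window, label) training example.
        return [(make_window(xs, t, half_width), ys[t]) for t in range(len(xs))]

    def recurrent_predict(clf, xs, half_width=1, start_label="<s>"):
        # Recurrent variant: predict left to right, feeding the previous
        # predicted label back in as a feature (horizontal context in one
        # direction only).
        ys = []
        for t in range(len(xs)):
            features = make_window(xs, t, half_width)
            features.append(ys[t - 1] if t > 0 else start_label)
            ys.append(clf.predict([features])[0])
        return ys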
14
Evaluation
  • Feature representation
  • Arbitrary non-independent observation features: ok
  • Computational efficiency
  • Depends on the classical supervised learning
    algorithm used

15
Generative model Hidden Markov Models (HMM)
  • Representation of the joint distribution P(x, y)
  • Extension of naïve Bayes networks
  • Two kinds of distributions
  • State transition probability P(yt | yt-1)
  • State-specific observation distribution P(xt | yt)

[Figure: HMM as a directed chain yt-1 → yt → yt+1, with each state yt generating its observation xt]
16
Evaluation
  • Computational efficiency
  • Maximum likelihood (ML) training: efficient
  • Prediction
  • Per-sequence loss: Viterbi, O(TN²) (sketched below)
  • Per-label loss: forward-backward, O(TN²)
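A minimal log-space Viterbi sketch for a first-order HMM with N states; the parameter names (start, trans, emit) are assumptions for illustration, not the presentation's notation:

    def viterbi(xs, states, start, trans, emit):
        # start[j]: log P(y1 = j); trans[i][j]: log P(yt = j | yt-1 = i);
        # emit[j](x): log P(x | y = j).  O(T N^2) time overall.
        delta = {j: start[j] + emit[j](xs[0]) for j in states}
        backptrs = []
        for x in xs[1:]:
            prev, delta, ptr = delta, {}, {}
            for j in states:
                best = max(states, key=lambda i: prev[i] + trans[i][j])
                delta[j] = prev[best] + trans[best][j] + emit[j](x)
                ptr[j] = best
            backptrs.append(ptr)
        # Trace back from the best final state.
        path = [max(states, key=lambda j: delta[j])]
        for ptr in reversed(backptrs):
            path.append(ptr[path[-1]])
        return list(reversed(path))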

17
Evaluation
  • Modeling accuracy
  • Vertical relations modeled in a generative way
  • ML training can lead to suboptimal classification
    accuracy
  • Feature representation
  • Conditional independence assumption
  • Arbitrary non-independent observation features
    not allowed

18
Discriminative graphical model
  • Model the conditional distribution P(y | x)
  • Extension of logistic regression
  • Arbitrary non-independent observation features
    allowed
  • Typical methods
  • Maximum entropy Markov model (MEMM)
  • Conditional random fields (CRF)

19
Maximum entropy Markov models
  • Per-state transition distribution P(yt | yt-1, xt)
  • Maximum entropy formulation (see below)

[Figure: MEMM as a directed chain yt-1 → yt → yt+1, with each observation xt feeding into its yt]
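For reference, the per-state maximum entropy distribution in its standard exponential form (reconstructed from the MEMM literature; the slide's own equation did not survive extraction):

\[
P(y_t \mid y_{t-1}, x_t) \;=\; \frac{1}{Z(x_t, y_{t-1})} \exp\Big( \sum_k \lambda_k f_k(x_t, y_t) \Big)
\]

Note that Z(x_t, y_{t-1}) normalizes over y_t separately for each source state, which is exactly the per-source-state normalization criticized on the next slide.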
20
Evaluation
  • Training: generalized iterative scaling;
    prediction: Viterbi or forward-backward
  • Label bias problem
  • Drawback of directed conditional Markov models
  • Per-source-state normalization
  • Low-entropy transition distributions pay little
    attention to the observation
  • Favors the more frequent label sequences

21
Conditional Random Fields
  • Undirected graphical model
  • Markov random field globally conditioned on x
  • Two kinds of features
  • Dependence between neighboring labels, given x
  • Dependence of the current label on x

[Figure: CRF as an undirected chain over yt-1, yt, yt+1, globally conditioned on the whole observation sequence x]
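The resulting conditional distribution has the standard chain-CRF form (reconstructed here, since the slide's equation did not survive extraction):

\[
p(y \mid x) \;=\; \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\Big( \sum_{t} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)
\]

The single global normalizer Z(x) is what lets CRFs avoid the label bias problem.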
22
Training CRF
  • Loss function
  • Per-sequence vs. per-label
  • Optimization methods
  • Improved iterative scaling: slow
  • General-purpose convex optimization: an improvement
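Whichever optimizer is used works from the gradient of the conditional log-likelihood, which per feature weight takes the standard observed-minus-expected form (a known result, not shown on the slides):

\[
\frac{\partial \mathcal{L}}{\partial \lambda_k}
\;=\; \sum_{i} \Big( F_k(y^{(i)}, x^{(i)}) \;-\; \mathbb{E}_{p(y \mid x^{(i)})}\big[ F_k(y, x^{(i)}) \big] \Big),
\qquad
F_k(y, x) = \sum_t f_k(y_{t-1}, y_t, x, t)
\]

The expectation is computed with the forward-backward algorithm.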

23
Problems with rich feature representation
  • Large number of features
  • Slows down parameter estimation
  • Eliminate redundancy
  • More expressive features
  • Improve prediction accuracy
  • Combinatorial explosion
  • Incorporate only the necessary combinations

24
Feature induction
  • Iteratively construct feature conjunctions
  • Candidate features
  • Atomic features
  • Conjunctions of atomic and already-incorporated
    features
  • Select candidates by the maximum increase in
    conditional log-likelihood

25
Feature induction (contd)
  • Gradient tree boosting
  • Potential function: a sum of regression trees
  • Path through a tree: a feature combination
  • Value at a leaf: the weight of that combination
  • Significant improvement in training speed and
    prediction accuracy

26
More evaluation
  • Does not scale to a large number of classes
  • Forward-backward
  • Viterbi
  • No generalization bound

27
Traditional discriminant function
  • Directly measure compatibility between label and
    observation
  • Simple linear discriminant function
  • Classifier
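In the binary case, a minimal reconstruction of the two missing equations (standard notation, assumed here):

\[
f(x) = \langle w, x \rangle + b,
\qquad
h(x) = \operatorname{sign}\big(f(x)\big)
\]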

28
Support vector machines
  • Maximize the margin, i.e., the classification
    confidence
  • L1-norm soft-margin SVMs formulation
  • Functional margin
  • Slack variable
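The standard L1-norm soft-margin primal, reconstructed for reference:

\[
\min_{w, b, \xi}\ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad
y_i \big( \langle w, x_i \rangle + b \big) \ge 1 - \xi_i,
\quad \xi_i \ge 0
\]

Here y_i(⟨w, x_i⟩ + b) is the functional margin and ξ_i the slack variable that absorbs violations.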

29
Dual formulation
  • Lagrange multiplier
  • Dual optimization problem
  • Dual discriminant function
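Introducing a Lagrange multiplier α_i for each margin constraint gives the standard dual (reconstructed):

\[
\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j \langle x_i, x_j \rangle
\quad \text{s.t.} \quad
0 \le \alpha_i \le C, \quad \sum_i \alpha_i y_i = 0
\]

with dual discriminant function f(x) = Σ_i α_i y_i ⟨x_i, x⟩ + b, which touches the data only through inner products.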

30
Kernel trick
  • Kernel function
  • Avoid explicit feature representation
  • Feature space of very high or infinite dimension
  • Non-linear discriminant function/classifier

31
Voted perceptron
  • Perceptron: an online algorithm
  • Voted perceptron: converts it to a batch
    algorithm
  • Deterministic leave-one-out method
  • Predicts by majority voting over the intermediate
    perceptrons (sketched below)
  • Admits the kernel trick; computationally
    efficient
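A minimal kernelized voted-perceptron sketch, assuming binary labels in {-1, +1} and a kernel function k; the names are illustrative:

    def sign(v):
        return 1 if v > 0 else -1

    def train_voted_perceptron(X, Y, k, epochs=1):
        # Each hypothesis is represented by its mistake list (kernel form);
        # weights[c] counts how many examples hypothesis c survived.
        mistakes, weights = [[]], [0]
        for _ in range(epochs):
            for x, y in zip(X, Y):
                score = sum(yi * k(xi, x) for xi, yi in mistakes[-1])
                if y * score <= 0:                 # mistake: new hypothesis
                    mistakes.append(mistakes[-1] + [(x, y)])
                    weights.append(1)
                else:                              # survived one more example
                    weights[-1] += 1
        return mistakes, weights

    def predict_voted(mistakes, weights, k, x):
        # Weighted majority vote over all intermediate perceptrons.
        vote = sum(w * sign(sum(yi * k(xi, x) for xi, yi in m))
                   for m, w in zip(mistakes, weights))
        return sign(vote)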

32
Extension to multi-class
  • Discriminant function
  • Classifier
  • Functional margin
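In the usual multi-class notation, with one weight vector w_y per class, the three items above read:

\[
F(x, y) = \langle w_y, x \rangle,
\qquad
h(x) = \arg\max_{y} F(x, y),
\qquad
\gamma_i = F(x_i, y_i) - \max_{y \ne y_i} F(x_i, y)
\]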

33
Multi-class SVMs
  • L1 norm soft-margin SVMs
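The corresponding program in the Crammer-Singer style, reconstructed since the slide's formula did not survive extraction:

\[
\min_{w, \xi}\ \frac{1}{2} \sum_{y} \|w_y\|^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad
\langle w_{y_i}, x_i \rangle - \langle w_y, x_i \rangle \ge 1 - \xi_i
\ \ \forall y \ne y_i,
\qquad \xi_i \ge 0
\]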

34
Discriminant function in sequential supervised
learning
  • Treat it as a multi-class problem
  • Exponentially many classes (one per label
    sequence)
  • Learning the discriminant function
  • Voted perceptron
  • Support vector machines

35
Feature representation
  • Arbitrary non-independent observation feature
  • Feature space of high/infinite dimension
  • Feature formulation assumptions
  • Chain-structured
  • Additive (see below)
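Concretely, the chain-structured, additive assumption means the joint feature vector decomposes over adjacent label pairs, in the standard structured-prediction form:

\[
\Phi(x, y) = \sum_{t=1}^{T} \phi(y_{t-1}, y_t, x, t)
\]

This decomposition is what makes computing argmax_y ⟨w, Φ(x, y)⟩ tractable with the Viterbi algorithm.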

36
Voted perceptron
  • Update step: add the feature difference between
    the true and the predicted label sequence
  • Average the perceptrons
  • Prediction: Viterbi algorithm
  • Computationally efficient (sketched below)
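A minimal averaged structured-perceptron sketch in the Collins style; phi and viterbi_decode are assumed helpers, where phi(x, y) returns a numpy feature vector and viterbi_decode(w, x) returns the highest-scoring label sequence under w:

    import numpy as np

    def train_averaged_perceptron(data, phi, viterbi_decode, dim, epochs=5):
        w = np.zeros(dim)
        w_sum = np.zeros(dim)                  # running sum for averaging
        for _ in range(epochs):
            for x, y in data:
                y_hat = viterbi_decode(w, x)   # best sequence under current w
                if y_hat != y:
                    # Update toward the true sequence's features.
                    w += phi(x, y) - phi(x, y_hat)
                w_sum += w
        return w_sum / (epochs * len(data))    # averaged weight vector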

37
SVMs with loss function
  • Re-scale the slack variables (slack re-scaling)
  • Re-scale the margin (margin re-scaling)
  • A higher-loss output requires a larger margin
    (i.e., more confidence); see below
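In standard structured-SVM notation with loss Δ(y_i, y), the two variants change the margin constraints as follows (reconstructed):

\[
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge \Delta(y_i, y) - \xi_i
\quad \text{(margin re-scaling)}
\]
\[
\langle w, \Phi(x_i, y_i) - \Phi(x_i, y) \rangle \ge 1 - \frac{\xi_i}{\Delta(y_i, y)}
\quad \text{(slack re-scaling)}
\]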

38
Hidden Markov Support Vector Machines
  • Sparseness assumption on support vectors
  • Small number of non-zero dual variables
  • Small number of active constraints
  • Iteratively add new SVs
  • Working set contains the current SVs
  • Add the candidate SV that violates its margin
    constraint the most
  • Guarantees a strict increase of the dual
    objective function

39
  • Add a candidate only when its margin violation
    exceeds a tolerance ε
  • The dual objective is upper bounded
  • Polynomial number of SVs at convergence
  • Dual optimization remains tractable
  • The solution is close to the optimum

40
Max Margin Markov Networks
  • Exploits the structure of the output space
  • Converts the exponential number of constraints
    into a polynomial number
  • Re-scales the margin with the Hamming loss
  • Per-label zero-one loss (see below)
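The Hamming loss simply counts per-label errors:

\[
\Delta(y, \hat{y}) = \sum_{t=1}^{T} \mathbf{1}\big[\, y_t \ne \hat{y}_t \,\big]
\]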

41
Dual formulation

42
Structure Decomposition
  • Loss function: decomposes over individual labels
  • Features: decompose over the edges of the chain

43
Factorization

44
Factored dual
  • Objective function
  • Constraints and consistency checks
  • Polynomial number of variables and constraints

45
Evaluation
  • Arbitrary non-independent observation features
  • Kernel trick
  • Feature spaces of very high or infinite dimension
  • Complex non-linear discriminant functions
  • Margin-maximizing generalization bound
  • Scalability unclear

46
Open problems
  • Faster CRF training, to make it more practical
  • Effect of inference during training
  • Scalability of discriminant function methods
  • Other algorithms for learning the discriminant
    function
  • Dealing with missing values
  • Novel sequential supervised learning algorithms

47
Thank You