Subsequent String Kernel by Han Cheng Liang - PowerPoint PPT Presentation

1
Subsequent String Kernel by Han Cheng Liang
  • Advanced Machine Learning
  • Prof. Tony Jebara

2
Subsequent String Kernel (SSK)
  • SSK function
  • Measures how similar two strings are by how many
    subsequences they share in common.
  • The subsequences do not have to be contiguous

s = "science is organized knowledge"
t = "wisdom is organized life"
Shared subsequence: "sie"

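The notion of a shared, non-contiguous subsequence can be checked directly. A minimal sketch in Python; the helper name `is_subsequence` is an illustrative assumption and does not come from the slides:

```python
def is_subsequence(u, s):
    """Return True if u can be obtained from s by deleting characters."""
    it = iter(s)
    # `ch in it` advances the iterator until ch is found, so order is preserved.
    return all(ch in it for ch in u)


s = "science is organized knowledge"
t = "wisdom is organized life"

# "sie" occurs in both strings as a subsequence, although in neither as a
# contiguous substring.
print(is_subsequence("sie", s), is_subsequence("sie", t))
print("sie" in s, "sie" in t)
```

The second line of output shows why a plain substring test would miss this match.
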
3
SSK Continued
  • But the further apart the first and the last
    characters of a subsequence occurrence are, the more
    that occurrence is penalized. Define a decay factor
    $\lambda$, with $0 < \lambda < 1$.
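
As a small worked illustration (the string "cat" and the value 0.5 are illustrative choices, not from the slides), each occurrence of a subsequence can be weighted by the decay factor raised to the length of the span it covers:

```latex
\begin{align*}
\text{``ca'' in ``cat''} &: \text{span } 2 \;\Rightarrow\; \lambda^{2} \\
\text{``ct'' in ``cat''} &: \text{span } 3 \;\Rightarrow\; \lambda^{3} \\
\lambda = 0.5 &: \lambda^{2} = 0.25, \quad \lambda^{3} = 0.125
\end{align*}
```

The gapped occurrence "ct" therefore counts half as much as the contiguous occurrence "ca".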

4
SSK Formally Defined
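  • Alphabet set: $\Sigma$
  • Set of all sequences of length $n$ over the
    alphabet set: $\Sigma^n$
  • String $s = s_1 \ldots s_{|s|}$
  • String $t = t_1 \ldots t_{|t|}$
  • $u$ is a subsequence of $s$ if there is a set of
    indices $\mathbf{i} = (i_1, \ldots, i_{|u|})$ with
    $1 \le i_1 < \cdots < i_{|u|} \le |s|$ such that
    $u_j = s_{i_j}$ for $j = 1, \ldots, |u|$, written $u = s[\mathbf{i}]$
  • Length of the subsequence $u$ in $s$:
    $l(\mathbf{i}) = i_{|u|} - i_1 + 1$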
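
These definitions can be turned into a direct computation of how strongly a subsequence $u$ is present in a string $s$: sum $\lambda^{l(\mathbf{i})}$ over every index tuple $\mathbf{i}$ that spells out $u$. A minimal sketch, assuming this weighting; the function name `feature_value` is hypothetical:

```python
from itertools import combinations


def feature_value(u, s, lam):
    """Sum lam**l(i) over every increasing index tuple i with s[i] == u."""
    total = 0.0
    for idx in combinations(range(len(s)), len(u)):
        if all(s[i] == c for i, c in zip(idx, u)):
            span = idx[-1] - idx[0] + 1  # l(i) = i_|u| - i_1 + 1
            total += lam ** span
    return total


print(feature_value("ca", "cat", 0.5))  # 0.25
print(feature_value("ct", "cat", 0.5))  # 0.125
```

The sketch assumes a non-empty `u`; it enumerates all index tuples and is only meant for short strings.
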

5
SSK Formally Defined
  • Run time of a direct computation: $O(|\Sigma|^n)$
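
The kernel formula itself did not survive in the transcript. The definition usually given for the string subsequence kernel (Lodhi et al.), stated here as an assumption about what the slide showed, is $K_n(s, t) = \sum_{u \in \Sigma^n} \sum_{\mathbf{i}: u = s[\mathbf{i}]} \sum_{\mathbf{j}: u = t[\mathbf{j}]} \lambda^{l(\mathbf{i}) + l(\mathbf{j})}$. A brute-force sketch of that formula follows; the names `feature_map` and `ssk_naive` are hypothetical, and the code enumerates index tuples rather than all of $\Sigma^n$, which is still exponential in $n$:

```python
from collections import defaultdict
from itertools import combinations


def feature_map(s, n, lam):
    """Map each length-n subsequence u of s to the sum of lam**l(i)."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[i] for i in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def ssk_naive(s, t, n, lam):
    """Unnormalized K_n(s, t) as an inner product of the two feature maps."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    return sum(v * phi_t[u] for u, v in phi_s.items() if u in phi_t)


print(ssk_naive("cat", "car", 2, 0.5))  # lam**4 = 0.0625
print(ssk_naive("cat", "cat", 2, 0.5))  # 2*lam**4 + lam**6 = 0.140625
```

"cat" and "car" share only the subsequence "ca", which has weight $\lambda^2$ in each string, hence $\lambda^4$.
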

6
Improve Performance Using DP
  • The runtime can be improved to $O(n|s||t|)$
  • Define an auxiliary function $K'_i(s, t)$
  • Three base cases
  • Recursive step
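
The formulas on this slide did not survive in the transcript. The following is the commonly cited recursion for the string subsequence kernel (Lodhi et al.), given as an assumption about what the slide showed; $x$ denotes a single character appended to $s$, and $t[1:j-1]$ is the prefix of $t$ before position $j$:

```latex
\begin{align*}
K'_0(s, t) &= 1 \quad \text{for all } s, t \\
K'_i(s, t) &= 0 \quad \text{if } \min(|s|, |t|) < i \\
K_i(s, t)  &= 0 \quad \text{if } \min(|s|, |t|) < i \\
K'_i(sx, t) &= \lambda K'_i(s, t)
  + \sum_{j : t_j = x} K'_{i-1}(s, t[1:j-1]) \, \lambda^{|t| - j + 2},
  \quad i = 1, \ldots, n-1 \\
K_n(sx, t) &= K_n(s, t)
  + \sum_{j : t_j = x} K'_{n-1}(s, t[1:j-1]) \, \lambda^{2}
\end{align*}
```

Evaluated as written, the inner sum over $j$ makes this $O(n|s||t|^2)$; reaching $O(n|s||t|)$ requires computing that sum incrementally.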

7
DP Continued
  • Define a second auxiliary function $K''_i(s, t)$
  • Two cases
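
The slide's formulas are missing from the transcript. In the usual formulation (an assumption here), the second auxiliary function accumulates the inner sum of the recursion step by step: $K''_i(sx, ty) = \lambda K''_i(sx, t)$ if $y \ne x$, $K''_i(sx, tx) = \lambda \, (K''_i(sx, t) + \lambda K'_{i-1}(s, t))$, and then $K'_i(sx, t) = \lambda K'_i(s, t) + K''_i(sx, t)$. A minimal sketch of the resulting $O(n|s||t|)$ dynamic program; the name `ssk_dp` and the table layout are hypothetical choices:

```python
def ssk_dp(s, t, n, lam):
    """Unnormalized K_n(s, t) via dynamic programming (assumes n >= 1)."""
    ls, lt = len(s), len(t)
    if min(ls, lt) < n:
        return 0.0
    # kp[i][a][b] holds K'_i(s[:a], t[:b]); K'_0 is 1 everywhere.
    kp = [[[0.0] * (lt + 1) for _ in range(ls + 1)] for _ in range(n)]
    for a in range(ls + 1):
        for b in range(lt + 1):
            kp[0][a][b] = 1.0
    for i in range(1, n):
        for a in range(i, ls + 1):
            kpp = 0.0  # running K''_i(s[:a], t[:b]) as b grows
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    kpp = lam * (kpp + lam * kp[i - 1][a - 1][b - 1])
                else:
                    kpp = lam * kpp
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # K_n sums lam**2 * K'_{n-1} over every pair of matching end positions.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[n - 1][a - 1][b - 1]
    return k


print(ssk_dp("cat", "car", 2, 0.5))  # lam**4 = 0.0625
print(ssk_dp("cat", "cat", 2, 0.5))  # 2*lam**4 + lam**6 = 0.140625
```

Because the code only compares symbols for equality, `s` and `t` can be any indexable sequences, not just character strings. The three nested loops give the $O(n|s||t|)$ cost; the table could be reduced to two layers, but the full table keeps the correspondence with the recursion easier to see.
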

8
Experiments Performed
  • SSK vs. n-gram kernel (NGK) vs. word kernel (WK)
  • Varying sequence lengths and decay factors
  • Combining kernels of different lengths: shows
    potential
  • Combining SSK and NGK: no improvement
  • Combining SSK with different decay factors: no
    improvement

9
Subsequent Word Kernel
  • Instead of having individual letters and the
    space character in $\Sigma$, have whole English words.
  • The size of the alphabet set is much larger, but
    using the DP technique, the runtime is still
    $O(n|s||t|)$, with $|s|$ and $|t|$ now counted in words.
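
Switching to words only changes what counts as a symbol. A minimal sketch of the idea, assuming whitespace tokenization (the slides do not say how words were extracted); `word_subsequences` is a hypothetical helper that lists the length-$n$ word subsequences of a text:

```python
from itertools import combinations


def word_subsequences(text, n):
    """Set of all length-n word subsequences (as tuples) of a text."""
    words = text.split()
    # combinations() keeps the original word order within each tuple.
    return set(combinations(words, n))


s = "science is organized knowledge"
t = "wisdom is organized life"

# Length-2 word subsequences shared by both sentences.
print(word_subsequences(s, 2) & word_subsequences(t, 2))
# {('is', 'organized')}
```

Each sentence is now a sequence of four symbols rather than roughly thirty characters, which is why the larger alphabet does not hurt a DP whose cost depends on sequence lengths only.
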

10
Experiments
  • Data: Yahoo! News articles from AP, Reuters,
    etc.
  • Four categories: business, politics,
    entertainment, sports
  • 60 articles in each category, 50 of them used for
    training and 10 used for testing
  • Performance comparable to SSK ($n = 3$): accuracy
    rates both around 90%. Outperformed SSK in some
    categories and underperformed in others.
  • Combining SSK and the word subsequence kernel did
    not yield improvements.

11
Kernel Estimation
  • SSK: used the most frequent contiguous subsequences
    found in some data set
  • This work: used the most frequently used English words
  • Results
  • Top 2000 words: poor
  • Top 3000 words: poor
  • Top 4000 words: 80% accuracy
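
The slides do not spell out the estimation procedure. One common form of kernel approximation, stated here as an assumption, replaces the full feature space by a small basis set $\tilde{S}$ of strings and uses $K(x, z) \approx \sum_{b \in \tilde{S}} K(x, b) \, K(b, z)$. A minimal sketch of that idea; `approx_kernel`, `toy_kernel`, and the three basis words are all hypothetical, and the toy kernel merely stands in for the real subsequence kernel:

```python
def approx_kernel(x, z, basis, base_kernel):
    """Approximate K(x, z) by summing K(x, b) * K(b, z) over a basis set."""
    return sum(base_kernel(x, b) * base_kernel(b, z) for b in basis)


def toy_kernel(a, b):
    """Hypothetical stand-in kernel: number of distinct shared characters."""
    return float(len(set(a) & set(b)))


basis = ["the", "of", "and"]  # placeholder for a list of frequent words
print(approx_kernel("cat", "hat", basis, toy_kernel))  # 3.0
```

Under this reading, "top 2000", "top 3000", and "top 4000" would correspond to the size of the basis set.
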

12
Future Work
  • Kernels with different lengths
  • Upper/lower bounds for the kernel estimation.