Subsequent String Kernel by Han Cheng Liang

About This Presentation

Description:

Subsequent String Kernel by Han Cheng Liang Advanced Machine Learning Prof. Tony Jebara Subsequent String Kernel (SSK) SSK function Measures how similar two strings ... – PowerPoint PPT presentation

Slides: 13

Provided by: LabU409

Learn more at: http://www.cs.columbia.edu

Transcript and Presenter's Notes

1
Subsequent String Kernel by Han Cheng Liang

Advanced Machine Learning
Prof. Tony Jebara

2
Subsequent String Kernel (SSK)

SSK function
Measures how similar two strings are by how many
subsequences they share.
The subsequences do not have to be contiguous.

s = "science is organized knowledge"
t = "wisdom is organized life"
Shared subsequence: "sie"

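The claim that "sie" is a (non-contiguous) subsequence of both strings can be checked with a few lines of Python. This is only an illustrative sketch; the helper name `is_subsequence` is an assumption and does not come from the slides.

```python
def is_subsequence(u: str, s: str) -> bool:
    """Return True if the characters of u appear in s in order (gaps allowed)."""
    it = iter(s)
    # `ch in it` consumes the iterator up to the first match,
    # so later characters of u can only match later positions of s.
    return all(ch in it for ch in u)


s = "science is organized knowledge"
t = "wisdom is organized life"
print(is_subsequence("sie", s), is_subsequence("sie", t))
```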
3
SSK Continued

The further apart the first and the last
characters of a subsequence occurrence are, the more it is
penalized. Define a decay factor $\lambda$, with $0 < \lambda \le 1$.
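A small worked example may help here. It assumes the usual SSK weighting, in which an occurrence spanning $l$ characters contributes $\lambda^{l}$; the slide's own formula did not survive, so both this weighting and the value $\lambda = 0.5$ are assumptions. In "science", the subsequence "sie" can be matched at positions (1, 3, 4), spanning 4 characters, or at positions (1, 3, 7), spanning 7 characters:

```latex
\begin{align*}
\text{positions } (1,3,4):&\quad \lambda^{4} = 0.5^{4} = 0.0625\\
\text{positions } (1,3,7):&\quad \lambda^{7} = 0.5^{7} \approx 0.0078
\end{align*}
```

The more spread-out occurrence counts for roughly one eighth as much as the compact one.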

4
SSK Formally Defined

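$\Sigma$: alphabet set

$\Sigma^n$: set of all strings of length n over the
alphabet set (the candidate subsequences)

$s$: string

$t$: string

u is a subsequence of s if there is a set of
indices $\mathbf{i} = (i_1, \dots, i_{|u|})$, with $1 \le i_1 < \dots < i_{|u|} \le |s|$, such that $u_j = s_{i_j}$ for $j = 1, \dots, |u|$, written $u = s[\mathbf{i}]$

$l(\mathbf{i}) = i_{|u|} - i_1 + 1$: length of the subsequence u in s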
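To make the index-set definition concrete, the sketch below enumerates every way a subsequence u occurs in s and reports the length each occurrence spans (last index minus first index plus one). The function names `occurrences` and `span` are hypothetical, and indices are 0-based here, as is usual in Python.

```python
from itertools import combinations


def occurrences(u, s):
    """Yield every increasing index tuple idx (0-based) with s[idx] == u."""
    for idx in combinations(range(len(s)), len(u)):
        if all(s[j] == c for j, c in zip(idx, u)):
            yield idx


def span(idx):
    """Length covered in s by one occurrence: last index - first index + 1."""
    return idx[-1] - idx[0] + 1


for idx in occurrences("sie", "science"):
    print(idx, span(idx))
```

This brute-force enumeration is only meant to illustrate the definition; it is exponential in the length of u and not how the kernel would be computed in practice.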

5
SSK Formally Defined

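Run time of computing the features directly: $O(|\Sigma|^n)$, where $|\Sigma|$ is the size of the alphabet set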
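The definition itself did not survive extraction. Assuming the standard form from Lodhi et al. (2002), each feature is $\phi_u(s) = \sum_{\mathbf{i}: u = s[\mathbf{i}]} \lambda^{l(\mathbf{i})}$ and the kernel is $K_n(s,t) = \sum_{u \in \Sigma^n} \phi_u(s)\,\phi_u(t)$. A direct computation can be sketched as follows; the names `feature_map` and `ssk_naive` are illustrative, not the presenter's.

```python
from collections import defaultdict
from itertools import combinations


def feature_map(s, n, lam):
    """phi_u(s): sum of lam ** span over all occurrences of u in s, |u| = n."""
    phi = defaultdict(float)
    for idx in combinations(range(len(s)), n):
        u = "".join(s[j] for j in idx)
        phi[u] += lam ** (idx[-1] - idx[0] + 1)
    return phi


def ssk_naive(s, t, n, lam):
    """K_n(s, t) as the inner product of the two explicit feature maps."""
    phi_s = feature_map(s, n, lam)
    phi_t = feature_map(t, n, lam)
    # Only subsequences present in both strings contribute.
    return sum(w * phi_t[u] for u, w in phi_s.items() if u in phi_t)


print(ssk_naive("cat", "car", n=2, lam=0.5))
```

Note that this sketch enumerates index tuples of each string rather than all of $\Sigma^n$, so it is only practical for short strings and small n.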

6
Improve Performance Using DP

Can improve the runtime to $O(n|s||t|)$

Define an auxiliary function $K'_i(s, t)$, for $i = 0, \dots, n-1$

3 Basic Cases

Recursive Step
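The formulas for this slide were lost. The outline (three base cases and a recursive step) matches the standard recursion from Lodhi et al. (2002), so the following reconstruction assumes that form. Here $K'_i(s,t)$ is an auxiliary function that weights each common subsequence of length $i$ by its distance from its first character to the end of each string, $sx$ denotes the string $s$ extended by one character $x$, and $t[1:j-1]$ is the prefix of $t$ before position $j$.

```latex
\begin{align*}
K'_0(s,t) &= 1 && \text{for all } s, t,\\
K'_i(s,t) &= 0 && \text{if } \min(|s|,|t|) < i,\\
K_i(s,t)  &= 0 && \text{if } \min(|s|,|t|) < i,\\
K'_i(sx,t) &= \lambda\, K'_i(s,t)
  + \sum_{j:\, t_j = x} K'_{i-1}\bigl(s,\, t[1:j-1]\bigr)\, \lambda^{|t|-j+2},
  && i = 1, \dots, n-1,\\
K_n(sx,t) &= K_n(s,t)
  + \sum_{j:\, t_j = x} K'_{n-1}\bigl(s,\, t[1:j-1]\bigr)\, \lambda^{2}.
\end{align*}
```

Evaluated naively, the inner sum over $j$ costs $O(|t|)$ per cell, giving $O(n|s||t|^2)$ overall; removing that extra factor requires one more auxiliary quantity.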

7
DP Continued

Define a second auxiliary function $K''_i(sx, t)$, the sum over $j$ in the recursive step for $K'_i$

Two Cases
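The slide's formulas are missing, so the sketch below assumes the standard construction: $K''_i(sx, t) = \sum_{j: t_j = x} K'_{i-1}(s, t[1:j-1])\,\lambda^{|t|-j+2}$, which can be updated in constant time when $t$ grows by one character $y$. The two cases are $K''_i(sx, ty) = \lambda K''_i(sx, t)$ when $y \ne x$, and $K''_i(sx, tx) = \lambda\,(K''_i(sx, t) + \lambda K'_{i-1}(s, t))$. The function name `ssk` and the table layout are illustrative choices.

```python
def ssk(s, t, n, lam):
    """Subsequence kernel K_n(s, t) by dynamic programming, O(n * |s| * |t|)."""
    ls, lt = len(s), len(t)
    if min(ls, lt) < n:
        return 0.0  # base case: K_n = 0 when a string is shorter than n

    # kp[a][b] holds K'_{i-1}(s[:a], t[:b]); base case K'_0 = 1 everywhere.
    kp = [[1.0] * (lt + 1) for _ in range(ls + 1)]

    for i in range(1, n):
        # Entries with a < i or b < i stay 0 (base case K'_i = 0).
        new = [[0.0] * (lt + 1) for _ in range(ls + 1)]
        for a in range(i, ls + 1):
            kpp = 0.0  # K''_i(s[:a], t[:b]), built left to right over b
            for b in range(i, lt + 1):
                if s[a - 1] == t[b - 1]:
                    # Case 2: the new character of t matches s[a-1].
                    kpp = lam * (kpp + lam * kp[a - 1][b - 1])
                else:
                    # Case 1: no match, existing terms just decay once more.
                    kpp = lam * kpp
                # Recursive step: K'_i(sx, t) = lam * K'_i(s, t) + K''_i(sx, t)
                new[a][b] = lam * new[a - 1][b] + kpp
        kp = new

    # Final step: every matching pair of positions closes a subsequence
    # of length n and contributes lam^2 * K'_{n-1} of the two prefixes.
    k = 0.0
    for a in range(n, ls + 1):
        for b in range(n, lt + 1):
            if s[a - 1] == t[b - 1]:
                k += lam * lam * kp[a - 1][b - 1]
    return k


print(ssk("cat", "car", n=2, lam=0.5))
```

Only two levels of the table are kept at a time, so memory is $O(|s||t|)$ regardless of n.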

8
Experiments Performed

SSK vs. NGK (n-gram kernel) vs. WK (word kernel)
Varying sequence lengths and decay factors
Combining kernels of different lengths: has potential
Combining SSK and NGK: did not help
Combining SSK with different decay factors: did not help

9
Subsequent Word Kernel

Instead of having individual letters and the
space character in the alphabet set $\Sigma$,
have whole English words.

The alphabet set is much larger, but
using the DP technique, the runtime is still $O(n|s||t|)$,
where $|s|$ and $|t|$ now count words rather than characters.
10
Experiments

Data: Yahoo! News articles from AP,
Reuters, etc.
Four categories: business, politics,
entertainment, sports.
60 articles in each category: 50 used for training
and 10 used for testing.
Comparable performance to SSK (n = 3): accuracy
for both was around 90%. Outperformed SSK in some
categories and underperformed in others.
Combining SSK and the word subsequence kernel did not
yield improvements.

11
Kernel Estimation