Transcript and Presenter's Notes

Title: Ad Hoc Data and the Token Ambiguity Problem


1
Ad Hoc Data and the Token Ambiguity Problem
  • Qian Xi, Kathleen Fisher, David Walker, Kenny Zhu
  • 2009/1/19

Princeton University, AT&T Labs Research
2
Ad Hoc Data
  • Standardized data formats: HTML, XML
  • Data processing tools: visualizers (HTML browsers), XQuery
  • Ad hoc data: non-standard, semi-structured
  • Not many data processing tools
  • Examples: web server logs (CLF), phone call provisioning data

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
244.133.108.200 - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/ddorg/confirm HTTP/1.0" 200 941
91522729152272128136400922813640092281364009
22813640092no_ii152272EDTF_60MARVINS1UNO10
1000295291 915227291522721281364009228136400
9228136400922813640092no_ii15222EDTF_60MARV
INS1UNO101000295291201000295291171001649600
191001
3
learnPADS Goal
  • Automatically generates a description of the
    format
  • Automatically generates a suite of data
    processing tools

Example records: 0,24   bar,end   foo,16
Declarative Description:
  Punion payload {
      Pint32 i;
      PstringFW(3) s2;
  };
  Pstruct source {
      '\"'; payload p1; ','; payload p2; '\"';
  };
Generated tools: XML converter, Grapher, etc.
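Read as a parser, the description says each record is a payload, a comma, and another payload, where a payload is either a 32-bit integer or a fixed-width 3-character string. A tiny Python sketch of that reading (the function names are mine, and the quote delimiters are simplified away):

# Hypothetical reading of the inferred description:
#   record  = payload ',' payload
#   payload = Pint32 | PstringFW(3)
# Quote delimiters from the Pstruct are omitted in this sketch.
def parse_payload(field):
    try:
        return ("Pint32", int(field))          # first branch of the union
    except ValueError:
        if len(field) == 3:
            return ("PstringFW(3)", field)     # second branch of the union
        raise ValueError("field matches neither branch: %r" % field)

def parse_source(record):
    p1, p2 = record.split(",", 1)
    return (parse_payload(p1), parse_payload(p2))

for line in ["0,24", "bar,end", "foo,16"]:
    print(parse_source(line))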
4
learnPADS Architecture
[Architecture diagram: Raw Data flows into the Format Inference Engine
(Chunking & Tokenization, then Structure Discovery, then Format Refinement),
which produces a Data Description; the PADS Compiler turns that description
into tools such as the XML converter and the Profiler.]
5
learnPADS framework
Chunking & Tokenization
The records 0,24  bar,end  foo,bag  0,56  cat,name become the token sequences
(int , int)  (str , str)  (str , str)  (int , int)  (str , str).
Structure Discovery / Format Refinement
[Diagram: structure discovery and format refinement turn these token sequences
into a description built from structs and unions over the INT, STR, and ','
tokens, e.g. a struct whose first and last fields are each a union of INT and
STR, separated by a comma.]
6
Token Ambiguity Problem (TAP)
Given a string, there are multiple ways to tokenize it; the same line can be
read, for example, as:
  • Message
  • Word White Word White Word White ... White URL
  • Word White Quote Filepath Quote White Word White ...
  • Old learnPADS:
  • the user defines a set of base tokens with a fixed order
  • take the first, longest match
  • New solution: probabilistic tokenization
  • use probabilistic models to find the most likely token sequences
    (a small illustration follows below)
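To make the ambiguity concrete, here is a small Python illustration: a handful of regular-expression token definitions (URL, FILEPATH, QUOTE, WORD, PUNCT, WHITE are invented for this example, not the actual learnPADS base-token set) already yield several legal tokenizations of one short string.

# Enumerate every way a small token set can cover a string.
# The token definitions are hypothetical, for illustration only.
import re

TOKENS = [
    ("URL",      r"[a-z]+://[^\s]+"),
    ("FILEPATH", r"/[^\s\"]+"),
    ("QUOTE",    r"\""),
    ("WORD",     r"[A-Za-z][A-Za-z0-9_]*"),
    ("PUNCT",    r"[^\sA-Za-z0-9]"),
    ("WHITE",    r"\s+"),
]

def tokenizations(s, pos=0):
    """Yield every token sequence that covers s[pos:] exactly."""
    if pos == len(s):
        yield []
        return
    for name, pattern in TOKENS:
        m = re.match(pattern, s[pos:])
        if m:
            for rest in tokenizations(s, pos + m.end()):
                yield [(name, m.group())] + rest

for seq in tokenizations('GET "/tk/p.txt"'):
    print([name for name, _ in seq])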

7
Probabilistic Graphical Models
[Example graphical model: earthquake and burglar are parents of alarm, which
in turn is a parent of "parent comes home".]
node = random variable; edge = probabilistic relationship
8
Hidden Markov Model (HMM)
  • Observation/Character: C_i
  • Character features: upper/lower case, digit, punctuation, ...
  • Hidden state/Pseudo-token: T_i
  • Maximize the probability P(token sequence | character sequence)

[Diagram for the quoted input "foo,16":
  input characters   "  f  o  o  ,  1  6  "
  pseudo-tokens      Quote Word Word Word Comma Int Int Quote   (one per character)
  tokens             Quote Word Comma Int Quote   (adjacent identical pseudo-tokens merged)]
transition probability P(T_i | T_{i-1})
emission probability P(C_i | T_i)
(A minimal decoding sketch follows below.)
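A minimal Viterbi-decoding sketch for such a character-level HMM, in Python. The states, hand-made probabilities, and character-class emissions below are illustrative assumptions, not the trained parameters used in the system.

# Viterbi decoding for a toy character-level HMM tokenizer.
# States are pseudo-tokens; observations are input characters.
import math

states = ["Word", "Int", "Punct"]
start = {"Word": 0.5, "Int": 0.3, "Punct": 0.2}
trans = {
    "Word":  {"Word": 0.8, "Int": 0.1, "Punct": 0.1},
    "Int":   {"Word": 0.1, "Int": 0.8, "Punct": 0.1},
    "Punct": {"Word": 0.4, "Int": 0.4, "Punct": 0.2},
}

def emit(state, ch):
    """P(character | pseudo-token), from simple character classes."""
    if state == "Word":
        return 0.9 if ch.isalpha() else 0.05
    if state == "Int":
        return 0.9 if ch.isdigit() else 0.05
    return 0.9 if not ch.isalnum() else 0.05   # Punct

def viterbi(chars):
    # best[i][s] = (log-prob of the best path ending in state s at i, back-pointer)
    best = [{s: (math.log(start[s]) + math.log(emit(s, chars[0])), None)
             for s in states}]
    for i in range(1, len(chars)):
        row = {}
        for s in states:
            prev, lp = max(((p, best[i - 1][p][0] + math.log(trans[p][s]))
                            for p in states), key=lambda x: x[1])
            row[s] = (lp + math.log(emit(s, chars[i])), prev)
        best.append(row)
    # Follow the back-pointers to recover the most likely pseudo-token sequence.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for i in range(len(chars) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(list(zip("foo,16", viterbi("foo,16"))))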
9
Hidden Markov Model Formula
P(T_1 ... T_n | C_1 ... C_n)  ∝  P(T_1) × ∏_{i=2..n} P(T_i | T_{i-1}) × ∏_{i=1..n} P(C_i | T_i)
  • P(T_1 ... T_n | C_1 ... C_n): the probability of the token sequence given the character sequence
  • P(T_1): the probability that token T_1 comes first
  • P(T_i | T_{i-1}): the probability that token T_i follows T_{i-1}, for all i (transition probability)
  • P(C_i | T_i): the probability that we see character C_i given token T_i, for all i (emission probability)
10
Hidden Markov Model Parameters
transition probability P(T_i | T_{i-1})
emission probability P(C_i | T_i)
(An illustrative estimation sketch follows below.)
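The estimation formulas on this slide did not survive as text. A common choice, assumed here for illustration only, is maximum-likelihood counting over labeled training sequences:

# Estimate HMM parameters by counting over labeled training data.
# A standard maximum-likelihood sketch; smoothing and character
# features are omitted for brevity.
from collections import Counter, defaultdict

def estimate(training):
    """training: list of [(character, pseudo_token), ...] sequences."""
    trans_counts = defaultdict(Counter)   # T_{i-1} -> Counter of T_i
    emit_counts = defaultdict(Counter)    # T_i -> Counter of C_i
    start_counts = Counter()
    for seq in training:
        start_counts[seq[0][1]] += 1
        for ch, tok in seq:
            emit_counts[tok][ch] += 1
        for (_, prev), (_, cur) in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1
    normalize = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (normalize(start_counts),
            {t: normalize(c) for t, c in trans_counts.items()},
            {t: normalize(c) for t, c in emit_counts.items()})

train = [[("f", "Word"), ("o", "Word"), ("o", "Word"),
          (",", "Punct"), ("1", "Int"), ("6", "Int")]]
start_p, trans_p, emit_p = estimate(train)
print(trans_p["Word"])   # e.g. {'Word': 0.66..., 'Punct': 0.33...}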
11
Hierarchical Models
[Diagram: for a quoted input such as "foo,16", the upper level models the
transitions between the tokens Quote, Word, Comma, Int, Quote; the lower level
scores the characters inside each individual token with a Maximum Entropy or
Support Vector Machine model.]
12
Three Probabilistic Tokenizers
  • Character-by-character Hidden Markov Model (HMM)
  • One pseudo-token depends only on the previous one.
  • Hierarchical Maximum Entropy Model (HMEM)
  • The upper level models the transition probabilities.
  • The lower level constructs Maximum Entropy models for individual tokens.
  • Hierarchical Support Vector Machines (HSVM)
  • Same as HMEM, except that the lower level constructs Support Vector
    Machine models for individual tokens.
    (A rough scoring sketch for the hierarchical models follows below.)
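A rough sketch of the hierarchical idea: the upper level scores token-to-token transitions, while a per-token character-level model scores the inside of each token. The toy transition table and the trivial character scorer below are stand-ins for the paper's Maximum Entropy / SVM models, not the actual implementation.

# Hierarchical scoring sketch: upper level = token transitions,
# lower level = per-token character model (a trivial stand-in here).
import math

trans = {(None, "Word"): 0.6, (None, "Int"): 0.3,
         ("Word", "Word"): 0.5, ("Word", "Punct"): 0.4,
         ("Punct", "Int"): 0.5}

def char_model_score(token, text):
    """Stand-in for the lower-level model: P(text | token)."""
    if token == "Word":
        hits = sum(c.isalpha() for c in text)
    elif token == "Int":
        hits = sum(c.isdigit() for c in text)
    else:                                   # Punct
        hits = sum(not c.isalnum() for c in text)
    return max(hits / len(text), 1e-6)

def score(candidate):
    """candidate: [(token, substring), ...] covering the input in order."""
    total, prev = 0.0, None
    for token, text in candidate:
        total += math.log(trans.get((prev, token), 1e-6))     # upper level
        total += math.log(char_model_score(token, text))      # lower level
        prev = token
    return total

a = [("Word", "foo"), ("Punct", ","), ("Int", "16")]
b = [("Word", "foo"), ("Word", ",16")]
print(score(a) > score(b))   # the better-typed segmentation wins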

13
Tokenization by the Old learnPADS, HMM, and HMEM
Input: Sat Jun 24 06:38:46 2006 crashreporterd[120]: mach_msg() reply failed:
(ipc/send) invalid destination port

Old learnPADS:
date[Sat Jun 24] white time[06:38:46] white int[2006] white
string[crashreporterd] char[[] int[120] char[]] char[:] white string[mach_msg]
char[(] char[)] white string[reply] white string[failed] char[:] white
char[(] string[ipc] char[/] string[send] char[)] white string[invalid] white
string[destination] white string[port]

HMM:
word[Sat] white word[Jun] white int[24] white time[06:38:46] white int[2006]
white word[crashreporterd] punctuation[[] int[120] punctuation[]]
punctuation[:] message[mach_msg() reply failed] punctuation[:]
message[(ipc/send) invalid destination port]

HMEM:
date[Sat Jun 24] white time[06:38:46] white int[2006] white
word[crashreporterd] punctuation[[] int[120] punctuation[]] punctuation[:]
message[mach_msg() reply failed] punctuation[:]
message[(ipc/send) invalid destination port]
14
Test Data Sources
15
Evaluation 1: Tokenization Accuracy
Token error rate: proportion of misidentified tokens
Token boundary error rate: proportion of misidentified token boundaries

Example:
input string:              qian Jan/19/09
ideal token sequence:      [id] [white] [date]
inferred token sequence:   [id] [white] [filepath]
token error rate:          1/3
token boundary error rate: 0/3
(A small computation sketch follows below.)
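For the example above, the two rates can be computed by comparing the sequences position by position. The Python sketch below assumes the ideal and inferred sequences are already aligned one-to-one, which sidesteps the harder general alignment step.

# Token error rate vs. token boundary error rate for the slide's example,
# assuming the ideal and inferred sequences are aligned one-to-one.
def error_rates(ideal, inferred):
    """Each sequence is a list of (token_name, lexeme) pairs."""
    assert len(ideal) == len(inferred)
    n = len(ideal)
    token_errors = sum(a[0] != b[0] for a, b in zip(ideal, inferred))
    boundary_errors = sum(a[1] != b[1] for a, b in zip(ideal, inferred))
    return token_errors / n, boundary_errors / n

ideal    = [("id", "qian"), ("white", " "), ("date",     "Jan/19/09")]
inferred = [("id", "qian"), ("white", " "), ("filepath", "Jan/19/09")]
print(error_rates(ideal, inferred))   # (0.333..., 0.0), i.e. 1/3 and 0/3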
16
Evaluation 1: Tokenization Accuracy
PT = probabilistic tokenization; testing data sources: 20
17
Evaluation 2: Type and Data Costs
PT = probabilistic tokenization; testing data sources: 20
Type cost: cost in bits of transmitting the description
Data cost: cost in bits of transmitting the data given the description
(A combined-cost formula sketch follows below.)
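Presumably the two costs are combined into a single minimum-description-length style score, as in the learnPADS format-refinement work; the notation below is assumed for illustration and is not taken from the slide.

% MDL-style combined score of a description T for a data set d
% (notation assumed; the slide only names the two components):
%   typecost(T)      = bits to transmit the description itself
%   datacost(d | T)  = bits to transmit the data given the description
\[
  \mathrm{cost}(T, d) \;=\; \mathrm{typecost}(T) \;+\; \mathrm{datacost}(d \mid T)
\]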
18
Evaluation 3: Execution Time
  • The old learnPADS system takes 10 seconds to 25 minutes.
  • The new system using probabilistic tokenization takes a few seconds to
    several hours.
  • It requires extra time to find all possible token sequences.
  • It requires extra time to find the most likely token sequences.
  • Fastest: Hidden Markov Model
  • Most time-consuming: Hierarchical Support Vector Machines

19
Related Work
  • Grammar induction and structure discovery, without the token ambiguity
    problem
  • Arasu and Garcia-Molina '03: extracting structure from web pages
  • Garofalakis et al. '00: XTRACT, for inferring DTDs
  • Kushmerick et al. '97: wrapper induction
  • Detect row and table components with Hidden Markov Models and
    Conditional Random Fields
  • Pinto et al. '03
  • Extract certain fields in records from text
  • Borkar et al. '01
  • Predict exons and introns in DNA sequences using generalized HMMs
  • Kulp '96
  • Part-of-speech tagging in natural language processing
  • Heeman '99 (Decision Tree)
  • Speech recognition
  • Rabiner '89

20
Contributions
  • Identify the Token Ambiguity Problem and take initial steps towards
    solving it with statistical models
  • Use all possible token sequences.
  • Integrate 3 statistical approaches into the learnPADS framework:
  • Hidden Markov Model
  • Hierarchical Maximum Entropy Model
  • Hierarchical Support Vector Machines Model
  • Evaluate correctness and performance with a number of measures
  • Results show that multiple token sequences and statistical methods
    achieve partial success.

21
End
22
Future Work
  • How to make use of vertical information
  • one record is not independent of the others
  • key alignment
  • Conditional Random Fields
  • Online learning
  • old description + new data → new description

23
Evaluation 3: Qualitative Comparison
[Rating scale from -2 to 2, with 0 labeled optimal; at one extreme the
description is too general and loses much useful information, at the other it
is too verbose and its structure is unclear.]