Title: Inducing Structure for Perception
1. Inducing Structure for Perception
a.k.a. Slav's Split-Merge Hammer
- Slav Petrov
- Advisors: Dan Klein, Jitendra Malik
- Collaborators: L. Barrett, R. Thibaux, A. Faria, A. Pauls, P. Liang, A. Berg
2. The Main Idea
[Diagram: a complex underlying process produces the observation "He was right."; candidate explanations: the true structure, a manually specified structure, and the MLE structure]
3. The Main Idea
[Diagram: EM refines the manually specified structure into an automatically refined structure that better explains the observation "He was right."]
4. Why Structure?
- Scrambled words: the the the food cat dog ate and
- Scrambled letters: t e c a e h t g f a o d o o d n h e t d a
5. Structure is important
6. Syntactic Ambiguity
- Last night I shot an elephant in my pajamas.
7. Visual Ambiguity
Old or young?
8. Three Peaks?
9. No, One Mountain!
10. Three Domains
11. Timeline
12. Syntax
- Split-Merge Learning
- Coarse-to-Fine Inference
- Syntactic Machine Translation
- Nonparametric Bayesian Learning
- Language Modeling
- Generative vs. Conditional Learning
13. Learning Accurate, Compact and Interpretable Tree Annotation
- Slav Petrov, Leon Barrett, Romain Thibaux, Dan Klein
14. Motivation (Syntax)
He was right.
- Why?
- Information Extraction
- Syntactic Machine Translation
15. Treebank Parsing
16. Non-Independence
- Independence assumptions are often too strong.
[Chart: rule expansion distribution over all NPs]
17. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
- Parent annotation [Johnson 98]
18. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
- Parent annotation [Johnson 98]
- Head lexicalization [Collins 99, Charniak 00]
19. The Game of Designing a Grammar
- Annotation refines base treebank symbols to improve the statistical fit of the grammar
- Parent annotation [Johnson 98]
- Head lexicalization [Collins 99, Charniak 00]
- Automatic clustering?
20. Learning Latent Annotations
- Brackets are known
- Base categories are known
- Only induce subcategories
Just like Forward-Backward for HMMs.
21. Inside/Outside Scores
- Inside score: $P_{\mathrm{IN}}(A_x, i, j) = P(w_i \dots w_j \mid A_x)$, the probability that the annotated symbol $A_x$ generates the span
- Outside score: $P_{\mathrm{OUT}}(A_x, i, j) = P(w_1 \dots w_{i-1} \; A_x \; w_{j+1} \dots w_n)$, the probability of everything outside the span
22. Learning Latent Annotations (Details)
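The equations on this slide did not survive extraction. As a hedged reconstruction from the standard inside/outside recursions for binary rules with latent subcategories (the rule-weight symbol $\beta$ is this sketch's notation, not necessarily the slide's):

\[
P_{\mathrm{IN}}(A_x, i, j) = \sum_{A_x \to B_y C_z} \sum_{i < k < j} \beta(A_x \to B_y C_z)\, P_{\mathrm{IN}}(B_y, i, k)\, P_{\mathrm{IN}}(C_z, k, j)
\]

\[
P_{\mathrm{OUT}}(B_y, i, k) = \sum_{A_x \to B_y C_z,\; j > k} \beta(\cdot)\, P_{\mathrm{OUT}}(A_x, i, j)\, P_{\mathrm{IN}}(C_z, k, j) \;+\; \sum_{A_x \to C_z B_y,\; j < i} \beta(\cdot)\, P_{\mathrm{OUT}}(A_x, j, k)\, P_{\mathrm{IN}}(C_z, j, i)
\]

The E-step then scores each anchored rule by its posterior,
\[
P(A_x \to B_y C_z \text{ at } (i,k,j) \mid w) \propto P_{\mathrm{OUT}}(A_x, i, j)\, \beta(A_x \to B_y C_z)\, P_{\mathrm{IN}}(B_y, i, k)\, P_{\mathrm{IN}}(C_z, k, j),
\]
and the M-step renormalizes the accumulated counts. Because the brackets and base categories are fixed by the treebank, these sums range only over the observed tree nodes, exactly as forward-backward ranges over time steps.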
23. Overview
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing
24. Refinement of the DT tag
[Tree diagram: the DT tag split into subcategories]
25. Refinement of the DT tag
[Tree diagram: further splits of the DT subcategories]
26. Hierarchical refinement of the DT tag
[Tree diagram: the binary split hierarchy over DT subcategories]
27. Hierarchical Estimation Results

Model                  F1
Baseline               87.3
Hierarchical Training  88.4
28. Refinement of the , tag
- Splitting all categories the same amount is wasteful
29. The DT tag revisited
30. Adaptive Splitting
- Want to split complex categories more
- Idea: split everything, then roll back the splits that were least useful
32. Adaptive Splitting
- Evaluate the loss in likelihood from removing each split:
- Data likelihood with the split reversed
- Data likelihood with the split
- No loss in accuracy when 50% of the splits are reversed.
33. Adaptive Splitting (Details)
- True data likelihood
- Approximate likelihood with the split at node n reversed
- Approximate loss in likelihood
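The formulas for these three quantities were lost in extraction. As a hedged sketch of the split-merge approximation (the notation below is this sketch's, not necessarily the slide's): with subcategories $A_1, A_2$ of $A$ occurring at node $n$ with relative frequencies $p_1, p_2$, merge the scores at $n$,

\[
P'_{\mathrm{IN}}(n) = p_1 P_{\mathrm{IN}}(n, A_1) + p_2 P_{\mathrm{IN}}(n, A_2), \qquad
P'_{\mathrm{OUT}}(n) = P_{\mathrm{OUT}}(n, A_1) + P_{\mathrm{OUT}}(n, A_2),
\]

so the data likelihood with the split reversed at $n$ alone is approximately
\[
P_n(w) \approx P(w) - \sum_{x \in \{1,2\}} P_{\mathrm{IN}}(n, A_x)\, P_{\mathrm{OUT}}(n, A_x) + P'_{\mathrm{IN}}(n)\, P'_{\mathrm{OUT}}(n),
\]
and the approximate loss for a split is the product of $P_n(w)/P(w)$ over all occurrences $n$ of the split symbol.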
34. Adaptive Splitting Results

Model             F1
Previous          88.4
With 50% Merging  89.5
35. Number of Phrasal Subcategories
[Bar chart: learned subcategory counts for all phrasal categories]
36. Number of Phrasal Subcategories
[Chart highlight: NP, VP and PP receive the most subcategories]
37. Number of Phrasal Subcategories
[Chart highlight: rare categories such as NAC and X receive the fewest]
38. Number of Lexical Subcategories
[Chart highlight: POS, TO and , receive few subcategories]
39. Number of Lexical Subcategories
[Chart highlight: RB, VBx, IN and DT]
40. Number of Lexical Subcategories
[Chart highlight: NNP, JJ, NNS and NN receive the most subcategories]
41. Smoothing
- Heavy splitting can lead to overfitting
- Idea: smoothing allows us to pool statistics
42. Linear Smoothing
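The smoothing formula on this slide was lost in extraction. As a hedged reconstruction of linear interpolation toward the mean of the subcategories (the smoothing weight $\alpha$ and the symbols are this sketch's assumptions):

\[
\beta'(A_x \to B_y C_z) = (1 - \alpha)\, \beta(A_x \to B_y C_z) + \alpha\, \bar{\beta}, \qquad
\bar{\beta} = \frac{1}{n} \sum_{x'} \beta(A_{x'} \to B_y C_z),
\]

so all subcategories $A_{x'}$ of a category $A$ share statistics instead of overfitting independently.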
43. Result Overview

Model           F1
Previous        89.5
With Smoothing  90.7
44. Linguistic Candy
- Proper Nouns (NNP):

NNP-14  Oct.  Nov.       Sept.
NNP-12  John  Robert     James
NNP-2   J.    E.         L.
NNP-1   Bush  Noriega    Peters
NNP-15  New   San        Wall
NNP-3   York  Francisco  Street

- Personal pronouns (PRP):

PRP-0  It  He    I
PRP-1  it  he    they
PRP-2  it  them  him
45. Linguistic Candy
- Relative adverbs (RBR):

RBR-0  further  lower    higher
RBR-1  more     less     More
RBR-2  earlier  Earlier  later

- Cardinal Numbers (CD):

CD-7   one      two      Three
CD-4   1989     1990     1988
CD-11  million  billion  trillion
CD-0   1        50       100
CD-3   1        30       31
CD-9   78       58       34
46. Nonparametric PCFGs using Dirichlet Processes
- Percy Liang, Slav Petrov, Dan Klein and Michael Jordan
47. Improved Inference for Unlexicalized Parsing
- Slav Petrov and Dan Klein
48. (No Transcript)
49. Coarse-to-Fine Parsing
[Goodman 97, Charniak & Johnson 05]
50. Prune?
- For each chart item X[i,j], compute the posterior probability; prune the item if it falls below a threshold
- E.g., consider the span 5 to 12:
[Diagram: coarse chart items QP, NP, VP over the span; only the surviving items are refined]
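As a minimal Python sketch of this pruning test, assuming dense inside/outside charts from the coarse pass (the array layout, function name, and default threshold are this sketch's assumptions, not the talk's implementation):

```python
import numpy as np

def pruning_mask(inside, outside, sentence_prob, threshold=1e-4):
    """inside[X, i, j] / outside[X, i, j]: coarse-grammar scores for
    symbol X over span (i, j).  Returns True where the item survives.

    Posterior of a chart item given the sentence w:
        P(X, i, j | w) = P_OUT(X, i, j) * P_IN(X, i, j) / P(w)
    """
    posterior = inside * outside / sentence_prob
    # Items below the threshold are pruned; only the survivors are
    # split into refined subcategories (e.g. QP1, QP2) next pass.
    return posterior >= threshold
```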
51.
- 1621 min
- 111 min
- (no search error)
52. Hierarchical Pruning
- Consider again the span 5 to 12:
coarse:          QP NP VP
split in two:    QP1 QP2 NP1 NP2 VP1 VP2
split in four:   QP1 QP2 QP3 QP4 NP1 NP2 NP3 NP4 VP1 VP2 VP3 VP4
split in eight:  ...
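A skeletal Python sketch of the hierarchical loop, under assumed interfaces (the per-grammar pass functions and the dict-of-posteriors protocol are hypothetical, not the actual parser API):

```python
def coarse_to_fine(sentence, passes, threshold=1e-4):
    """passes: one chart-parsing function per grammar G0, G1, ..., G.
    Each takes (sentence, allowed) and returns a dict mapping chart
    items (symbol, start, end) to posterior probabilities."""
    allowed = None            # the coarsest pass runs unconstrained
    posteriors = {}
    for run_pass in passes:
        posteriors = run_pass(sentence, allowed)
        # Keep only items whose posterior clears the threshold; the
        # next grammar considers just the splits of the survivors.
        allowed = {item for item, p in posteriors.items()
                   if p >= threshold}
    return posteriors         # final chart from the most refined pass
```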
53. Intermediate Grammars
[Diagram: the grammar hierarchy from the X-Bar grammar G0 through intermediate grammars to the final grammar G]
54.
- 1621 min
- 111 min
- 35 min
- (no search error)
55. State Drift (DT tag)
56. Projected Grammars
[Diagram: the refined grammar G projected back onto coarser grammars, down to the X-Bar grammar G0]
57. Estimating Projected Grammars
[Diagram: nonterminals S0, S1, NP0, NP1, VP0, VP1 in G map to the corresponding nonterminals in the projection π(G)]
58. Estimating Projected Grammars

S → NP VP

S1 → NP1 VP1  0.20
S1 → NP1 VP2  0.12
S1 → NP2 VP1  0.02
S1 → NP2 VP2  0.03
S2 → NP1 VP1  0.11
S2 → NP1 VP2  0.05
S2 → NP2 VP1  0.08
S2 → NP2 VP2  0.12
59. Estimating Projected Grammars
[Corazza & Satta 06]
[Diagram: the split rule probabilities combine into the projected rule S → NP VP with probability 0.56]
60. Calculating Expectations
- Nonterminals: c_k(X), expected counts up to depth k
- Converges within 25 iterations (a few seconds)
- Rules
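The slide's formulas were lost in extraction. As a hedged Python sketch of the fixed-point computation (the variable names and binary-rule representation are this sketch's assumptions; expected rule counts would then follow as `totals[parent] * prob` per rule):

```python
import numpy as np

def expected_counts(rules, n_symbols, root=0, iters=25):
    """rules: list of (parent, left, right, prob) for a binary PCFG.
    Returns the expected number of occurrences of each nonterminal,
    accumulated depth by depth (c_k converges as k grows)."""
    depth_counts = np.zeros(n_symbols)
    depth_counts[root] = 1.0      # the root occurs once, at depth 0
    totals = depth_counts.copy()
    for _ in range(iters):        # advance one depth level per pass
        next_counts = np.zeros(n_symbols)
        for parent, left, right, prob in rules:
            # each expected parent occurrence spawns both children
            # with the rule's probability
            next_counts[left] += depth_counts[parent] * prob
            next_counts[right] += depth_counts[parent] * prob
        depth_counts = next_counts
        totals += depth_counts
    return totals
```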
61.
- 1621 min
- 111 min
- 35 min
- 15 min
- (no search error)
62. Parsing times
[Chart: parsing time per grammar in the hierarchy, from X-Bar (G0) to G]
63. Bracket Posteriors (after G0)
64. Bracket Posteriors (after G1)
65. Bracket Posteriors (Movie) (Final Chart)
66. Bracket Posteriors (Best Tree)
67. Parse Selection
- Computing the most likely unsplit tree is NP-hard. Options:
- Settle for the best derivation.
- Rerank an n-best list.
- Use an alternative objective function.
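As a hedged note on the last option (the exact objective on the slide is not recoverable from this transcript): one such alternative is to select the unsplit tree that maximizes the product of posterior rule probabilities, marginalizing the subcategories per rule rather than per tree,

\[
T^* = \arg\max_T \prod_{r \in T} P(r \mid w), \qquad
P(r \mid w) = \sum_{\text{splits of } r} P(r_{\text{split}} \mid w),
\]

which is tractable with a single Viterbi-style pass over the marginalized rule scores.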
68. Final Results (Efficiency)
- Berkeley Parser: 15 min, 91.2 F-score, implemented in Java
- Charniak & Johnson 05 Parser: 19 min, 90.7 F-score, implemented in C
69. Final Results (Accuracy)

                                         ≤40 words F1   all F1
ENG  Charniak & Johnson 05 (generative)  90.1           89.6
ENG  This Work                           90.6           90.1
GER  Dubey 05                            76.3           -
GER  This Work                           80.8           80.1
CHN  Chiang et al. 02                    80.0           76.6
CHN  This Work                           86.3           83.4
70. Conclusions (Syntax)
- Split-Merge Learning
- Hierarchical Training
- Adaptive Splitting
- Parameter Smoothing
- Hierarchical Coarse-to-Fine Inference
- Projections
- Marginalization
- Multi-lingual Unlexicalized Parsing
71. Generative vs. Discriminative
- Conditional Estimation
- L-BFGS
- Iterative Scaling
- Conditional Structure
- Alternative Merging Criterion
72. How much supervision?
73. Syntactic Machine Translation
- Collaboration with ISI/USC
- Use parse trees
- Use annotated parse trees
- Learn split synchronous grammars
74. Speech
- Split-Merge Learning
- Coarse-to-Fine Decoding
- Speech Synthesis
- Combined Generative & Conditional Learning
75. Learning Structured Models for Phone Recognition
- Slav Petrov, Adam Pauls, Dan Klein
76. Motivation (Speech)
77. Traditional Models
[Diagram: an HMM for the phone sequence d-a-d from Start to End, with a Begin-Middle-End structure for each phone]
78. Model Overview
[Diagram: the traditional model vs. our model]
79. Differences from Grammars
80. (No Transcript)
81. Refinement of the ih-phone
82. Inference
- Coarse-to-Fine
- Variational Approximation
83. Phone Classification Results

Method                                    Error Rate (%)
GMM Baseline (Sha and Saul, 2006)         26.0
HMM Baseline (Gunawardana et al., 2005)   25.1
SVM (Clarkson and Moreno, 1999)           22.4
Hidden CRF (Gunawardana et al., 2005)     21.7
This Paper                                21.4
Large Margin GMM (Sha and Saul, 2006)     21.1
84. Phone Recognition Results

Method                                                     Error Rate (%)
State-Tied Triphone HMM (HTK) (Young and Woodland, 1994)   27.1
Gender Dependent Triphone HMM (Lamel and Gauvain, 1993)    27.1
This Paper                                                 26.1
Bayesian Triphone HMM (Ming and Smith, 1998)               25.6
Heterogeneous classifiers (Halberstadt and Glass, 1998)    24.4
85. Confusion Matrix
86. How much supervision?
- Hand-aligned: exact phone boundaries are known
- Automatically aligned: only the sequence of phones is known
87. Generative & Conditional Learning
- Learn structure generatively
- Estimate Gaussians conditionally
- Collaboration with Fei Sha
88. Speech Synthesis
- Acoustic phone model:
- Generative
- Accurate
- Models phone-internal structure well
- Use it for speech synthesis!
89. Large Vocabulary ASR
- ASR System = Acoustic Model + Decoder
- Coarse-to-Fine Decoder:
- Subphone → Phone
- Phone → Syllable → Word → Bigram → ?
90. Scenes
- Split-Merge Learning
- Decoding
91. Motivation (Scenes)
Seascape
92. Motivation (Scenes)
93. Learning
- Oversegment the image
- Extract vertical stripes
- Extract features
- Train HMMs
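A minimal Python sketch of this pipeline, assuming per-segment feature vectors are already extracted; the hmmlearn library and all names here are this sketch's assumptions, not the talk's implementation:

```python
import numpy as np
from hmmlearn import hmm  # assumed third-party dependency

def train_scene_hmm(stripe_features, n_regions=4, seed=0):
    """Fit an HMM over vertical image stripes: each stripe is a
    top-to-bottom sequence of segment feature vectors, of shape
    (n_segments, n_features), so hidden states can correspond to
    scene regions such as sky / mountain / sea."""
    X = np.vstack(stripe_features)
    lengths = [f.shape[0] for f in stripe_features]
    model = hmm.GaussianHMM(n_components=n_regions, random_state=seed)
    model.fit(X, lengths)  # Baum-Welch over all training stripes
    return model

# Decoding a new stripe: the Viterbi state sequence labels each
# segment with a region; horizontal consistency across neighboring
# stripes would be enforced in a separate pass.
# labels = model.predict(new_stripe_features)
```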
94. Inference
- Decode stripes
- Enforce horizontal consistency
95. Alternative Approach
- Conditional Random Fields
- Pro:
- Vertical and horizontal dependencies learnt
- Inference more natural
- Contra:
- Computationally more expensive
96. Timeline
97. Results so far
- State-of-the-art parser for different languages
- Automatically learnt
- Simple & Compact
- Fast & Accurate
- Available for download
- Phone recognizer
- Automatically learnt
- Competitive performance
- Good foundation for a speech recognizer
98. Proposed Deliverables
- Syntax Parser
- Speech Recognizer
- Speech Synthesizer
- Syntactic Translation Machine
- Scene Recognizer