1. A COMPARISON OF HAND-CRAFTED SEMANTIC GRAMMARS VERSUS STATISTICAL NATURAL LANGUAGE PARSING IN DOMAIN-SPECIFIC VOICE TRANSCRIPTION
- Curry Guinn
- Dave Crist
- Haley Werth
2. Outline
- Probabilistic language models
- N-grams
- The EPA project
- Experiments
3. Probabilistic Language Processing: What is it?
- Assume a note is given to a bank teller, which the teller reads as "I have a gub." (cf. Woody Allen)
- NLP to the rescue:
- "gub" is not a word
- "gun", "gum", "Gus", and "gull" are words, but "gun" has a higher probability in the context of a bank
4. Real-Word Spelling Errors
- They are leaving in about fifteen minuets to go to her house.
- The study was conducted mainly be John Black.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of.
- He is trying to fine out.
5–20. Letter-based Language Models (incremental build across slides 5–20)
- Shannon's Game
- Guess the next letter, revealed one character per slide: W, Wh, Wha, What, What d, What do, ... "What do you think the next letter is?"
- Guess the next word, revealed one word per slide: What, What do, What do you, ... "What do you think the next word is?"
21. Word-based Language Models
- A model that enables one to compute the probability, or likelihood, of a sentence S: P(S).
- Simple: every word follows every other word with equal probability (0-gram)
- Assume V is the size of the vocabulary
- Likelihood of a sentence S of length n is 1/V × 1/V × ... × 1/V = (1/V)^n
- If English has 100,000 words, the probability of each next word is 1/100,000 = 0.00001
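The uniform 0-gram calculation above can be sketched in a few lines of Python; the vocabulary size and example sentence are taken from the slide, everything else is illustrative.

```python
# Sketch: sentence likelihood under the uniform "0-gram" model,
# where every next word is equally likely (probability 1/V).
V = 100_000  # assumed vocabulary size, as on the slide

def zero_gram_prob(sentence):
    """P(S) = (1/V)^n for a sentence of n words."""
    n = len(sentence.split())
    return (1 / V) ** n

print(zero_gram_prob("I have a gun"))  # (1/100000)^4, about 1e-20
```

Note that every 4-word sentence gets exactly the same score, which is why this model predicts nothing.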
22. Word Prediction: Simple vs. Smart
- Smarter: the probability of each next word is related to word frequency (unigram)
- Likelihood of sentence S = P(w1) × P(w2) × ... × P(wn)
- Assumes the probability of each word is independent of the probabilities of the other words.
- Even smarter: look at the probability given previous words (N-gram)
- Likelihood of sentence S = P(w1) × P(w2|w1) × ... × P(wn|wn-1)
- Assumes the probability of each word is dependent on the probabilities of the preceding words.
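The unigram and bigram likelihoods above can be estimated directly from counts; a minimal sketch, where the toy corpus and the test sentence are assumptions for illustration:

```python
from collections import Counter

# Toy training corpus (illustrative, not from the slides)
corpus = "i am in the kitchen i am cooking spaghetti in the kitchen".split()

unigram = Counter(corpus)                    # word frequencies
bigram = Counter(zip(corpus, corpus[1:]))    # adjacent word pairs
N = len(corpus)

def p_unigram(sentence):
    # P(S) = P(w1) * P(w2) * ... * P(wn): words treated as independent
    p = 1.0
    for w in sentence.split():
        p *= unigram[w] / N
    return p

def p_bigram(sentence):
    # P(S) = P(w1) * P(w2|w1) * ... * P(wn|wn-1)
    words = sentence.split()
    p = unigram[words[0]] / N
    for prev, w in zip(words, words[1:]):
        p *= bigram[(prev, w)] / unigram[prev]
    return p

print(p_unigram("in the kitchen"))  # (2/12)^3, about 0.0046
print(p_bigram("in the kitchen"))   # 2/12, about 0.167
```

The bigram model scores the phrase much higher because in this corpus "the" always follows "in" and "kitchen" always follows "the".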
23. Training and Testing
- Probabilities come from a training corpus, which is used to design the model.
- Overly narrow corpus: probabilities don't generalize
- Overly general corpus: probabilities don't reflect the task or domain
- A separate test corpus is used to evaluate the model, typically using standard metrics
- Held-out test set
24. Simple N-Grams
- An N-gram model uses the previous N-1 words to predict the next one:
- P(wn | wn-N+1, wn-N+2, ..., wn-1)
- unigrams: P(dog)
- bigrams: P(dog | big)
- trigrams: P(dog | the big)
- quadrigrams: P(dog | chasing the big)
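Extracting the N-word windows that such a model conditions on is a one-liner; a small sketch using the "the big dog" example from the slide:

```python
# Sketch: enumerate the n-word windows of a token sequence.
# In an N-gram model, the first n-1 words of each window predict the last.
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "the big dog".split()
print(ngrams(words, 2))  # [('the', 'big'), ('big', 'dog')]
print(ngrams(words, 3))  # [('the', 'big', 'dog')]
```

Counting these windows over a training corpus yields exactly the bigram and trigram probabilities listed above.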
25. The EPA Task
- Detailed diary of a single individual's daily activity and location
- Methods of collecting the data:
- External observer
- Camera
- Self-reporting
- Paper diary
- Handheld menu-driven diary
- Spoken diary
26. Spoken Diary
- From an utterance like "I am in the kitchen cooking spaghetti," map that utterance into:
- Activity(cooking)
- Location(kitchen)
- Text abstraction
- Technique:
- Build a grammar
- Example
27. Sample Semantic Grammar
- ACTIVITY_LOCATION -> ACTIVITY' LOCATION' CHAD(ACTIVITY', LOCATION').
- ACTIVITY_LOCATION -> LOCATION' ACTIVITY' CHAD(ACTIVITY', LOCATION').
- ACTIVITY_LOCATION -> ACTIVITY' CHAD(ACTIVITY', null).
- ACTIVITY_LOCATION -> LOCATION' CHAD(null, LOCATION').
- LOCATION -> IAM LOCx' LOCx'.
- LOCATION -> LOCx' LOCx'.
- IAM -> IAM1.
- IAM -> IAM1 just.
- IAM -> IAM1 going to.
- IAM -> IAM1 getting ready to.
- IAM -> IAM1 still.
- LOC2 -> HOUSE_LOC' HOUSE_LOC'.
- LOC2 -> OUTSIDE_LOC' OUTSIDE_LOC'.
- LOC2 -> WORK_LOC' WORK_LOC'.
- LOC2 -> OTHER_LOC' OTHER_LOC'.
- HOUSE_LOC -> kitchen kitchen_code.
- HOUSE_LOC -> bedroom bedroom_code.
- HOUSE_LOC -> living room living_room_code.
- HOUSE_LOC -> house house_code.
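What the grammar computes can be approximated by a much-simplified keyword-lookup sketch; this is not the actual parser, and the lexicons and `*_code` values below are illustrative stand-ins for the real grammar's rules.

```python
# Simplified sketch of the semantic grammar's job: map an utterance
# to an (activity, location) pair, with None where no rule matches.
# Lexicons and codes are illustrative, not the project's real ones.
HOUSE_LOC = {"kitchen": "kitchen_code", "bedroom": "bedroom_code",
             "living room": "living_room_code", "house": "house_code"}
ACTIVITY = {"cooking": "cooking_code", "sleeping": "sleeping_code"}

def parse(utterance):
    text = utterance.lower()
    location = next((code for word, code in HOUSE_LOC.items() if word in text), None)
    activity = next((code for word, code in ACTIVITY.items() if word in text), None)
    return activity, location

print(parse("I am in the kitchen cooking spaghetti"))
# ('cooking_code', 'kitchen_code')
```

The real grammar instead matches whole phrase patterns (IAM variants, LOC2 rules, etc.), which is what makes it robust to fillers like "just" or "getting ready to".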
28. Statistical Natural Language Parsing
- Use unigram, bigram, and trigram probabilities
- Use Bayes' rule to obtain these probabilities: P(A|B) = P(B|A) × P(A) / P(B)
- The formula P(kitchen | 30121 Kitchen) is computed by determining the percentage of times the word "kitchen" appears in diary entries that have been transcribed in the category 30121 Kitchen.
- P(30121 Kitchen) is the probability that a diary entry is of the semantic category 30121 Kitchen.
- P(kitchen) is the probability that "kitchen" appears in any diary entry.
- Bayes' rule can be extended to take into account each word in the input string.
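Extending Bayes' rule over every word in the entry gives a naive-Bayes categorizer; a minimal sketch, where the three training entries, the second category label, and the add-one smoothing are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

# Toy labeled diary entries (illustrative, not the project's data)
training = [
    ("i am in the kitchen cooking", "30121 Kitchen"),
    ("cooking dinner in the kitchen", "30121 Kitchen"),
    ("watching tv in the living room", "30131 Living room"),
]

word_counts = defaultdict(Counter)  # per-category word counts -> P(word | category)
cat_counts = Counter()              # category counts -> P(category)
for entry, cat in training:
    cat_counts[cat] += 1
    word_counts[cat].update(entry.split())

def classify(entry):
    """argmax over categories of P(c) * prod_w P(w | c), in log space,
    with add-one smoothing so unseen words don't zero out a category."""
    vocab = {w for c in word_counts for w in word_counts[c]}
    best, best_score = None, -math.inf
    for cat in cat_counts:
        total = sum(word_counts[cat].values())
        score = math.log(cat_counts[cat] / len(training))
        for w in entry.split():
            score += math.log((word_counts[cat][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = cat, score
    return best

print(classify("cooking in the kitchen"))  # 30121 Kitchen
```

Note P(kitchen), the denominator in Bayes' rule, is the same for every category and so can be dropped from the argmax.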
29. The Experiment
- Digital voice recorder + heart rate monitor
- The heart rate monitor will beep if the rate changes by more than 15 beats per minute between measurements (every 2 minutes)
30. Subjects
31. Recordings Per Day
32. Heart Rate Change Indicator Tones and Subject Compliance
33. Per-Word Speech Recognition
34. Semantic Grammar Location/Activity Encoding: Precision and Recall
35. Word Recognition Accuracy's Effect on Semantic Grammar Precision and Recall
36. Statistical Processing Accuracy
37. Word Recognition Affects Statistical Semantic Categorization
38. Per-Word Recognition Rate Versus Statistical Semantic Encoding Accuracy
39. Time, Activity, Location, Exertion Data-Gathering Platform
40. Research Topics
- Currently, guesses for the current activity and location are computed independently of each other.
- They are not independent!
- Currently, guesses are based on the current utterance alone.
- However, the current activity/location is not independent of previous activities/locations.
- How do we fuse data from other sources (GPS, beacons, heart rate monitor, etc.)?