Title: Hidden Markov Models Applied to Information Extraction
1. Hidden Markov Models Applied to Information Extraction
- Part I: Concept
  - HMM Tutorial
- Part II: Sample Application
  - AutoBib web information extraction
Larry Reeve, INFO629 Artificial Intelligence
Dr. Weber, Fall 2004
2. Part I Concept: HMM Motivation
- The real world has structures and processes which have (or produce) observable outputs
  - Usually sequential (the process unfolds over time)
  - The events producing the output cannot be seen
  - Example: speech signals
- Problem: how to construct a model of the structure or process given only observations
3. HMM Background
- Basic theory developed and published in the 1960s and 70s
- No widespread understanding and application until the late 80s
- Why?
  - Theory was published in mathematics journals not widely read by practicing engineers
  - Insufficient tutorial material existed for readers to understand and apply the concepts
4. HMM Uses
- Speech recognition
  - Recognizing spoken words and phrases
- Text processing
  - Parsing raw records into structured records
- Bioinformatics
  - Protein sequence prediction
- Finance
  - Stock market forecasts (price pattern prediction)
- Comparison shopping services
5. HMM Overview
- Machine learning method
- Makes use of state machines
- Based on probabilistic models
- Useful in problems having sequential steps
- Can only observe the output from states, not the states themselves
- Example: speech recognition
  - Observed: acoustic signals
  - Hidden states: phonemes (the distinctive sounds of a language)
[State machine diagram]
6. Observable Markov Model Example
- Weather
  - Once each day the weather is observed
  - State 1: rainy
  - State 2: cloudy
  - State 3: sunny
- Each state corresponds to a physically observable event
- What is the probability that the weather for the next 7 days will be sun, sun, rain, rain, sun, cloudy, sun? (computed in the sketch below)

State transition matrix (row = today, column = tomorrow):

         Rainy  Cloudy  Sunny
Rainy     0.4    0.3     0.3
Cloudy    0.2    0.6     0.2
Sunny     0.1    0.1     0.8
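A minimal sketch in Python of that computation: the probability of an observed state sequence is the product of the corresponding transition probabilities. The starting state (today is sunny) is an assumption for illustration; the slide does not state it.

A = {
    "rainy":  {"rainy": 0.4, "cloudy": 0.3, "sunny": 0.3},
    "cloudy": {"rainy": 0.2, "cloudy": 0.6, "sunny": 0.2},
    "sunny":  {"rainy": 0.1, "cloudy": 0.1, "sunny": 0.8},
}

def sequence_probability(states, start):
    # P(sequence) = product of one-step transition probabilities
    prob, prev = 1.0, start
    for s in states:
        prob *= A[prev][s]
        prev = s
    return prob

# sun, sun, rain, rain, sun, cloudy, sun, given today is sunny
seq = ["sunny", "sunny", "rainy", "rainy", "sunny", "cloudy", "sunny"]
print(sequence_probability(seq, start="sunny"))  # ~1.54e-04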
7. Observable Markov Model
8. Hidden Markov Model Example
- Coin toss: a heads/tails sequence produced with 2 coins
- You are in a room with a wall
- A person behind the wall flips a coin and tells you the result
- Coin selection and tossing are hidden
- You cannot observe the events, only their output (heads, tails)
- The problem is then to build a model that explains the observed sequence of heads and tails (simulated in the sketch below)
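A minimal generative sketch of this setup in Python: two biased coins are the hidden states, and only the heads/tails outputs are visible. The coin biases and the switching probability are made-up values for illustration.

import random

coins = {"coin1": 0.5, "coin2": 0.8}   # assumed P(heads) per hidden coin
switch = 0.3                           # assumed P(switching coins)

def generate(n, state="coin1"):
    observed = []
    for _ in range(n):
        observed.append("H" if random.random() < coins[state] else "T")
        if random.random() < switch:   # hidden coin selection
            state = "coin2" if state == "coin1" else "coin1"
    return observed                    # only this sequence is visible

print("".join(generate(20)))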
9. HMM Components
- A set of states (x's)
- A set of possible output symbols (y's)
- A state transition matrix (a's)
  - probability of making a transition from one state to the next
- An output emission matrix (b's)
  - probability of emitting/observing a symbol at a particular state
- An initial probability vector
  - probability of starting at a particular state
  - Not always shown; sometimes assumed to be 1 for a designated start state
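These components map directly onto simple data structures. A minimal sketch with illustrative values (the numbers are not from the slides):

states  = ["x1", "x2"]        # hidden states (x's)
symbols = ["y1", "y2", "y3"]  # output symbols (y's)

A  = [[0.7, 0.3],             # transition matrix (a's):
      [0.4, 0.6]]             #   A[i][j] = P(next state j | state i)

B  = [[0.5, 0.4, 0.1],        # emission matrix (b's):
      [0.1, 0.3, 0.6]]        #   B[i][k] = P(symbol k | state i)

pi = [1.0, 0.0]               # initial vector: always start in x1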
10. HMM Components
11. Common HMM Types
- Ergodic (fully connected)
  - Every state of the model can be reached in a single step from every other state (see the sketch below)
- Bakis (left-right)
  - As time increases, states proceed from left to right
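The two types differ only in which transition-matrix entries may be non-zero. An illustrative pair (the values are made up):

ergodic = [[0.3, 0.3, 0.4],   # no zero entries: every state reachable
           [0.2, 0.5, 0.3],   # in one step from every other state
           [0.4, 0.2, 0.4]]

bakis   = [[0.5, 0.3, 0.2],   # upper-triangular: stay or move right,
           [0.0, 0.6, 0.4],   # never back to an earlier state
           [0.0, 0.0, 1.0]]   # final state loops on itself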
12. HMM Core Problems
- Three problems must be solved for HMMs to be useful in real-world applications:
  - 1) Evaluation
  - 2) Decoding
  - 3) Learning
13. HMM Evaluation Problem
- Purpose: score how well a given model matches a given observation sequence
- Example (speech recognition):
  - Assume HMMs (models) have been built for the words home and work
  - Given a speech signal, evaluation can determine the probability that each model represents the utterance (see the sketch below)
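Evaluation is classically solved with the forward algorithm, which sums over all hidden-state paths in O(N^2 T) time. A minimal sketch reusing the A/B/pi layout from the components sketch; observations are symbol indices, and a speech system would run this once per word model and pick the higher score.

def forward(obs, A, B, pi):
    # alpha[i] = P(observations so far, current state = i)
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
    return sum(alpha)            # P(observation sequence | model)

A  = [[0.7, 0.3], [0.4, 0.6]]
B  = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
pi = [1.0, 0.0]
print(forward([0, 2, 1], A, B, pi))  # ~0.0436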
14. HMM Decoding Problem
- Given a model and a set of observations, what hidden states are most likely to have generated the observations? (see the sketch below)
- Useful for learning about internal model structure, determining state statistics, and so forth
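Decoding is classically solved with the Viterbi algorithm: the same recursion as the forward algorithm, with the sum replaced by a max and back-pointers kept to recover the best path. A minimal sketch:

def viterbi(obs, A, B, pi):
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]   # best path scores
    back = []                                          # back-pointers
    for t in range(1, len(obs)):
        prev = [max(range(n), key=lambda i: delta[i] * A[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * A[prev[j]][j] * B[j][obs[t]]
                 for j in range(n)]
        back.append(prev)
    best = max(range(n), key=lambda i: delta[i])
    path = [best]
    for prev in reversed(back):                        # walk pointers back
        path.append(prev[path[-1]])
    return list(reversed(path))                        # most likely states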
15. HMM Learning Problem
- Goal is to learn the HMM parameters (training):
  - State transition probabilities
  - Observation probabilities at each state
- Training is crucial
  - it allows optimal adaptation of model parameters to observed training data from real-world phenomena
- No known method obtains optimal parameters from data; only approximations exist (see below)
- Can be a bottleneck in HMM usage
16. HMM Concept Summary
- Build models representing the hidden states of a process or structure using only observations
- Use the models to evaluate the probability that a model represents a particular observation sequence
- Use the evaluation information in an application to recognize speech, parse addresses, and perform many other tasks
17. Part II Application: AutoBib System
- Provides a uniform view of several computer science bibliographic web data sources
- An automated web information extraction system that requires little human input
  - Web pages are designed differently from site to site
  - IE requires training samples
- HMMs are used to parse unstructured bibliographic records into a structured format (an NLP task)
18. Web Information Extraction: Converting Raw Records
19. Approach
- 1) Provide a seed database of structured records
- 2) Extract raw records from relevant Web pages
- 3) Match structured records to raw records
  - to build training samples
- 4) Train the HMM-based parser
- 5) Parse unmatched raw records into structured records
- 6) Merge the new structured records into the database
20. AutoBib Architecture
21. Step 1 - Seeding
- Provide a seed database of structured records
  - Take a small collection of BibTeX-format records and insert them into the database
- A cleaning step normalizes record fields (sketched below)
  - Examples:
    - Proc. → Proceedings
    - Jan → January
- Manual step, executed once only
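A minimal sketch of such a normalizer; only the two substitutions above come from the slide, the rest of the design is an assumption:

ABBREVIATIONS = {"Proc.": "Proceedings", "Jan": "January"}

def normalize(field):
    # replace known abbreviations, leave all other tokens unchanged
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in field.split())

print(normalize("Proc. of the 8th Symposium, Jan 2004"))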
22. Step 2 - Extract Raw Records
- Extract raw records from relevant Web pages
- The user specifies:
  - Web pages to extract from
  - How to follow "next page" links across multiple pages
- Raw records are extracted using record-boundary discovery techniques
  - Subtree of Interest: the largest subtree of HTML tags
  - Record separators: frequent HTML tags
23. Tokenized Records
- Replace all HTML tags with placeholder tokens (sketched below)
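A minimal sketch of the idea: substitute a single placeholder for every HTML tag so tag positions can still act as field delimiters. The <tag> token name is an assumption; the slide does not show the exact replacement.

import re

def tokenize(raw_record):
    # replace each HTML tag with a placeholder token, then split
    return re.sub(r"<[^>]+>", " <tag> ", raw_record).split()

print(tokenize("<li><b>J. Geng</b>, Automatic Extraction, 2004</li>"))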
24. Step 3 - Matching
- Match raw records R to structured records S
- Apply 4 heuristic tests (sketched below):
  - At least one author in R must match an author in S
  - S.year must appear in R
  - If S.pages exists, R must contain it
  - S.title must be approximately contained in R
    - Levenshtein edit distance for approximate string matching
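A minimal sketch of the four tests. The Levenshtein routine is the standard dynamic program; the approximate-containment threshold (10% of the title length) is an assumption, not stated on the slide.

def levenshtein(a, b):
    # classic edit-distance DP, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matches(r_text, s):
    # s: structured record with authors, year, pages, title fields
    if not any(a in r_text for a in s["authors"]):
        return False                      # test 1: some author appears
    if s["year"] not in r_text:
        return False                      # test 2: year appears
    if s.get("pages") and s["pages"] not in r_text:
        return False                      # test 3: pages appear if known
    t = s["title"]                        # test 4: approximate containment
    best = min(levenshtein(t, r_text[i:i + len(t)])
               for i in range(max(1, len(r_text) - len(t) + 1)))
    return best <= len(t) // 10           # assumed tolerance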
25. Step 4 - Parser Training
- Train the HMM-based parser
- For each matching pair of R and S, annotate the tokens in the raw record with field names
- Annotated raw records are fed into the HMM parser in order to learn:
  - State transition probabilities
  - Symbol probabilities at each state
26. Parser Training, continued
- A key consideration is the HMM structure for navigating record fields (fields, delimiters)
- Special states: start, end
- Normal states: author, title, year, etc.
- Best structure found: multiple delimiter and tag states, one of each per normal state (enumerated in the sketch below)
  - Example: author-delimiter, author-tag
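A minimal sketch of how that state inventory could be enumerated; the field list here is illustrative, not the full AutoBib set:

FIELDS = ["author", "title", "year"]

states = ["start", "end"]
for f in FIELDS:
    # each normal state gets companion delimiter and tag states
    states += [f, f + "-delimiter", f + "-tag"]

print(states)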
27. Sample HMM (Method 3)
Source: http://www.cs.duke.edu/geng/autobib/web/hmm.jpg
28. Step 5 - Conversion
- Parse unmatched raw records into structured records using the HMM parser
- Matched raw records can be converted directly, without parsing, because they were annotated in the matching step
29. Step 6 - Merging
- Merge the new structured records into the database
- The initial seed database has now grown
- The new records will be used for improved matching on the next run
30. Evaluation
- Success rate = (# of tokens labeled by the HMM) / (# of tokens labeled by a person)
- DBLP (Computer Science Bibliography): 98.9%
- CSWD (CompuScience WWW-Database): 93.4%
31. HMM Advantages / Disadvantages
- Advantages
  - Effective
  - Can handle variations in record structure
    - Optional fields
    - Varying field ordering
- Disadvantages
  - Requires training on annotated data
    - Not completely automatic
    - May require manual markup
  - Size of the training data may be an issue
32. Other Methods
- Wrappers
  - Specification of the areas of interest on a Web page
  - Hand-crafted
  - Wrapper induction
    - Requires manual training
  - Not always accommodating to changing page structure
  - Syntax-based: no semantic labeling
33. Application to Other Domains
- E-commerce: comparison shopping sites
  - Extract product/pricing information from many sites
  - Convert the information into a structured format and store it
  - Provide an interface to look up product information and then display pricing information gathered from many sites
- Saves users time
  - Rather than navigating to and searching many sites, users can consult a single site
34. References
- Concept
  - Rabiner, L. R. (1989). A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, 77(2), 257-285.
- Application
  - Geng, J. and Yang, J. (2004). Automatic Extraction of Bibliographic Information on the Web. Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS '04), 193-204.