Hidden Markov Models Applied to Information Extraction - PowerPoint PPT Presentation

1
Hidden Markov Models Applied to Information Extraction
  • Part I: Concept
    • HMM Tutorial
  • Part II: Sample Application
    • AutoBib web information extraction

Larry Reeve, INFO629 Artificial Intelligence
Dr. Weber, Fall 2004
2
Part I Concept: HMM Motivation
  • The real world has structures and processes which
    have (or produce) observable outputs
  • Usually sequential (the process unfolds over time)
  • The events producing the output cannot be observed
  • Example: speech signals
  • Problem: how to construct a model of the
    structure or process given only the observations

3
HMM Background
  • Basic theory developed and published in the 1960s and
    70s
  • No widespread understanding and application until the
    late 80s
  • Why?
    • Theory was published in mathematics journals, which
      were not widely read by practicing engineers
    • There was insufficient tutorial material for readers to
      understand and apply the concepts

4
HMM Uses
  • Speech recognition
    • Recognizing spoken words and phrases
  • Text processing
    • Parsing raw records into structured records
  • Bioinformatics
    • Protein sequence prediction
  • Financial
    • Stock market forecasts (price pattern prediction)
  • Comparison shopping services

5
HMM Overview
  • Machine learning method
  • Makes use of state machines
  • Based on probabilistic models
  • Useful in problems having sequential steps
  • Can only observe output from the states, not the
    states themselves
  • Example: speech recognition
    • Observed: acoustic signals
    • Hidden states: phonemes
      (the distinctive sounds of a language)

[Figure: state machine]
6
Observable Markov Model Example
  • Weather
    • Once each day the weather is observed
    • State 1: rain
    • State 2: cloudy
    • State 3: sunny
  • What is the probability that the weather for the next
    7 days will be: sun, sun, rain, rain, sun, cloudy, sun?
  • Each state corresponds to a physically observable
    event

State transition matrix:

          Rainy  Cloudy  Sunny
  Rainy    0.4    0.3    0.3
  Cloudy   0.2    0.6    0.2
  Sunny    0.1    0.1    0.8
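A minimal sketch of this computation in Python, assuming (following Rabiner's treatment of this example) that today's weather is known to be sunny, so the sequence probability is just a product of one-step transition probabilities:

```python
import numpy as np

states = ["rain", "cloudy", "sunny"]
# Transition matrix from the slide: A[i, j] = P(tomorrow is j | today is i)
A = np.array([[0.4, 0.3, 0.3],   # rain   -> rain, cloudy, sunny
              [0.2, 0.6, 0.2],   # cloudy -> ...
              [0.1, 0.1, 0.8]])  # sunny  -> ...

seq = ["sunny", "sunny", "rain", "rain", "sunny", "cloudy", "sunny"]
idx = [states.index(s) for s in seq]

# Assumption: today is sunny with probability 1
prob = A[states.index("sunny"), idx[0]]
for prev, cur in zip(idx, idx[1:]):
    prob *= A[prev, cur]   # chain rule over one-step transitions

print(prob)  # 1.92e-04
```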
7
Observable Markov Model
8
Hidden Markov Model Example
  • Coin toss
  • A heads/tails sequence is produced with 2 coins
  • You are in a room divided by a wall
  • A person behind the wall flips a coin and tells you the result
  • Coin selection and tossing are hidden
  • You cannot observe the events, only the output (heads,
    tails) of the events
  • The problem is then to build a model to explain the
    observed sequence of heads and tails

9
HMM Components
  • A set of states (x's)
  • A set of possible output symbols (y's)
  • A state transition matrix (a's)
    • probability of making a transition from one state
      to the next
  • An output emission matrix (b's)
    • probability of emitting/observing a symbol at a
      particular state
  • An initial probability vector
    • probability of starting at a particular state
    • Not shown; sometimes assumed to be 1
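For concreteness, a minimal sketch of these components in NumPy, using the two-coin example from the previous slide; all probability values here are invented for illustration:

```python
import numpy as np

states = ["coin1", "coin2"]        # hidden states (x's)
symbols = ["heads", "tails"]       # output symbols (y's)

# State transition matrix (a's): A[i, j] = P(next state j | current state i)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# Output emission matrix (b's): B[i, k] = P(symbol k | state i)
B = np.array([[0.5, 0.5],          # coin1 is fair
              [0.2, 0.8]])         # coin2 is biased toward tails

# Initial probability vector: pi[i] = P(starting in state i)
pi = np.array([0.6, 0.4])

# Every row must be a probability distribution
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```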

10
HMM Components
11
Common HMM Types
  • Ergodic (fully connected)
    • Every state of the model can be reached in a single
      step from every other state of the model
  • Bakis (left-right)
    • As time increases, states proceed from left to
      right
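The difference shows up directly in the shape of the transition matrix; a small illustrative sketch (values invented):

```python
import numpy as np

# Ergodic: every entry is positive, so every state can reach
# every other state in a single step
A_ergodic = np.array([[0.3, 0.3, 0.4],
                      [0.2, 0.5, 0.3],
                      [0.4, 0.4, 0.2]])

# Bakis (left-right): upper-triangular, so states can only stay
# put or move rightward as time increases
A_bakis = np.array([[0.6, 0.3, 0.1],
                    [0.0, 0.7, 0.3],
                    [0.0, 0.0, 1.0]])
```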

12
HMM Core Problems
  • Three problems must be solved for HMMs to be
    useful in real-world applications
  • 1) Evaluation
  • 2) Decoding
  • 3) Learning

13
HMM Evaluation Problem
  • Purpose: score how well a given model matches a
    given observation sequence
  • Example (speech recognition):
    • Assume HMMs (models) have been built for the words
      'home' and 'work'
    • Given a speech signal, evaluation can determine
      the probability that each model represents the
      utterance
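The standard solution to the evaluation problem in Rabiner's tutorial is the forward algorithm; a minimal sketch, reusing the invented two-coin parameters from the components slide:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Return P(observation sequence | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]           # initialization with first symbol
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # induction: advance one time step
    return alpha.sum()                  # termination: sum over final states

A = np.array([[0.7, 0.3], [0.4, 0.6]])  # invented transition matrix
B = np.array([[0.5, 0.5], [0.2, 0.8]])  # invented emission matrix
pi = np.array([0.6, 0.4])               # invented initial distribution

# Score an observation sequence (heads=0, tails=1) against the model
print(forward(A, B, pi, [0, 1, 1, 0]))
```

The same scoring, run against each word model ('home', 'work'), would identify the model most likely to have produced the utterance.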

14
HMM Decoding Problem
  • Given a model and a sequence of observations, what are
    the hidden states most likely to have generated
    the observations?
  • Useful for learning about internal model structure,
    determining state statistics, and so forth
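Decoding is classically done with the Viterbi algorithm, also covered in Rabiner's tutorial; a minimal sketch with the same invented parameters:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely hidden state sequence for obs."""
    N, T = A.shape[0], len(obs)
    delta = pi * B[:, obs[0]]            # best path probability so far
    psi = np.zeros((T, N), dtype=int)    # back-pointers
    for t in range(1, T):
        scores = delta[:, None] * A      # scores[i, j]: best path ending i -> j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta.argmax())]         # best final state
    for t in range(T - 1, 0, -1):        # follow back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 1, 0]))   # most likely state index per step
```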

15
HMM Learning Problem
  • Goal is to learn the HMM parameters (training):
    • State transition probabilities
    • Observation probabilities at each state
  • Training is crucial:
    • it allows optimal adaptation of model parameters
      to observed training data from real-world
      phenomena
  • No known method obtains globally optimal parameters
    from data; only approximations exist
  • Can be a bottleneck in HMM usage
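The usual approximation, per Rabiner, is the iterative Baum-Welch (expectation-maximization) procedure, which converges to a locally optimal model; a compact, unoptimized sketch of one re-estimation pass over a single observation sequence:

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch re-estimation pass (single observation sequence)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))                 # forward probabilities
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta = np.ones((T, N))                   # backward probabilities
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    # gamma[t, i] = P(state i at t | obs); xi[t, i, j] = P(i at t, j at t+1 | obs)
    gamma = alpha * beta / (alpha * beta).sum(axis=1, keepdims=True)
    xi = (alpha[:-1, :, None] * A[None] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :])
    xi /= xi.sum(axis=(1, 2), keepdims=True)
    # Re-estimate parameters from the expected counts
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi

A = np.array([[0.7, 0.3], [0.4, 0.6]])       # invented starting guesses
B = np.array([[0.5, 0.5], [0.2, 0.8]])
pi = np.array([0.6, 0.4])
obs = [0, 1, 1, 0, 1]                        # heads=0, tails=1
for _ in range(10):                          # each pass raises the likelihood
    A, B, pi = baum_welch_step(A, B, pi, obs)
```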

16
HMM Concept Summary
  • Build models representing the hidden states of a
    process or structure using only observations
  • Use the models to evaluate the probability that a
    model represents a particular observation
    sequence
  • Use the evaluation information in applications such
    as recognizing speech, parsing addresses, and many
    others

17
Part II Application: AutoBib System
  • Provides a uniform view of several computer
    science bibliographic web data sources
  • An automated web information extraction system
    that requires little human input
    • Web pages are designed differently from site to site
    • IE requires training samples
  • HMMs are used to parse unstructured bibliographic
    records into a structured format (NLP)

18
Web Information Extraction: Converting Raw Records
19
Approach
  • 1) Provide a seed database of structured records
  • 2) Extract raw records from relevant Web pages
  • 3) Match structured records to raw records
    • to build training samples
  • 4) Train the HMM-based parser
  • 5) Parse unmatched raw records into structured records
  • 6) Merge the new structured records into the database

20
AutoBib Architecture
21
Step 1 - Seeding
  • Provide a seed database of structured records
  • Take a small collection of BibTeX-format records
    and insert them into the database
  • A cleaning step normalizes record fields
  • Examples:
    • Proc. → Proceedings
    • Jan → January
  • Manual step, executed once only

22
Step 2 - Extract Raw Records
  • Extract raw records from relevant Web pages
  • The user specifies:
    • Web pages to extract from
    • How to follow "next page" links for multi-page
      listings
  • Raw records are extracted
    • Uses record-boundary discovery techniques
    • Subtree of Interest: the largest subtree of HTML
      tags
    • Record separators: frequent HTML tags

23
Tokenized Records
(Replace all HTML tags with a generic tag token)
24
Step 3 - Matching
  • Match raw records R to structured records S
  • Apply 4 heuristic-based tests:
    • Match at least one author in R to an author in S
    • S.year must appear in R
    • If S.pages exists, R must contain it
    • S.title must be approximately contained in R
      • Levenshtein edit distance gives the approximate
        string match (see the sketch below)
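A minimal sketch of the title test, assuming a plain dynamic-programming Levenshtein distance and an invented relative-distance threshold (the paper's exact criteria may differ):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def title_approx_contained(title: str, raw: str, threshold: float = 0.2) -> bool:
    """Accept if some window of the raw record is within `threshold`
    relative edit distance of the title."""
    title, raw = title.lower(), raw.lower()
    n = len(title)
    best = min(levenshtein(title, raw[i:i + n])
               for i in range(max(1, len(raw) - n + 1)))
    return best / n <= threshold

raw = "J. Smith, A survey of hiden Markov models, 1999, pp. 1-20"
print(title_approx_contained("A Survey of Hidden Markov Models", raw))  # True
```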

25
Step 4 Parser Training
  • Train the HMM-based parser
  • For each pair of R and S that match, annotate the
    tokens in the raw record with field names
  • The annotated raw records are fed into the HMM parser
    in order to learn:
    • State transition probabilities
    • Symbol probabilities at each state
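Because the training data is annotated, these parameters can be estimated by simple counting and normalization; a minimal sketch over one hypothetical annotated record (tokens and field labels invented for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical annotated raw record: (token, field-label) pairs
annotated = [("J.", "author"), ("Smith", "author"), (",", "author-delimiter"),
             ("Some", "title"), ("Paper", "title"), (",", "title-delimiter"),
             ("1999", "year")]

trans = defaultdict(Counter)    # state -> next-state counts
emit = defaultdict(Counter)     # state -> emitted-token counts
for (_, state), (_, nxt) in zip(annotated, annotated[1:]):
    trans[state][nxt] += 1
for tok, state in annotated:
    emit[state][tok] += 1

# Normalize counts into transition and emission probabilities
A = {s: {t: c / sum(cnt.values()) for t, c in cnt.items()} for s, cnt in trans.items()}
B = {s: {t: c / sum(cnt.values()) for t, c in cnt.items()} for s, cnt in emit.items()}
print(A["author"])   # {'author': 0.5, 'author-delimiter': 0.5}
```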

26
Parser Training, continued
  • A key consideration is the HMM structure for navigating
    record fields (fields, delimiters)
  • Special states:
    • start, end
  • Normal states:
    • author, title, year, etc.
  • Best structure found:
    • have multiple delimiter and tag states,
      one for each normal state
    • Example: author-delimiter, author-tag

27
Sample HMM (Method 3)
Source: http://www.cs.duke.edu/geng/autobib/web/hmm.jpg
28
Step 5 - Conversion
  • Parse unmatched raw records into structured records
    using the HMM parser
  • Matched raw records can be converted directly
    without parsing because they were annotated in the
    matching step

29
Step 6 - Merging
  • Merge new structured records into database
  • Initial seed database has now grown
  • New records will be used for improved matching on
    the next run

30
Evaluation
  • Success rate =
      (# of tokens labeled by HMM)
      ---------------------------------
      (# of tokens labeled by a person)
  • DBLP: 98.9%
    • (Computer Science Bibliography)
  • CSWD: 93.4%
    • (CompuScience WWW-Database)
31
HMM Advantages / Disadvantages
  • Advantages
    • Effective
    • Can handle variations in record structure
      • Optional fields
      • Varying field ordering
  • Disadvantages
    • Requires training using annotated data
      • Not completely automatic
      • May require manual markup
    • Size of training data may be an issue

32
Other Methods
  • Wrappers
    • Specification of areas of interest on a Web page
    • Hand-crafted
  • Wrapper induction
    • Requires manual training
    • Not always accommodating of changing structure
    • Syntax-based: no semantic labeling

33
Application to Other Domains
  • E-Commerce
    • Comparison shopping sites
    • Extract product/pricing information from many
      sites
    • Convert the information into a structured format
      and store it
    • Provide an interface to look up product information
      and then display pricing information gathered
      from many sites
  • Saves users time
    • Rather than navigating to and searching many
      sites, users can consult a single site

34
References
  • Concept
  • Rabiner, L. R. (1989). A Tutorial on Hidden
    Markov Models and Selected Applications in Speech
    Recognition. Proceedings of the IEEE, 77(2),
    257-285.
  • Application
  • Geng, J. and Yang, J. (2004). Automatic
    Extraction of Bibliographic Information on the
    Web. Proceedings of the 8th International
    Database Engineering and Applications Symposium
    (IDEAS04), 193-204.