Chapter 5 Probabilistic models of pronunciation and spelling

1
Chapter 5 Probabilistic models of pronunciation
and spelling
  • Xiaomeng Su
  • 6 November

2
Main points
  • This chapter discusses the problem of detecting
    and correcting spelling errors.
  • It first introduces the problems of detecting and
    correcting spelling errors and summarizes
    typical human spelling error patterns.
  • It then introduces ways to solve the spelling
    problem: Bayes' rule and the noisy channel model.

3
Outline
  • 5.1 Dealing with spelling errors.
  • 5.2 Spelling error patterns.
  • 5.3 Detecting non-word errors.
  • 5.4 Probabilistic models.
  • 5.5 Applying the Bayesian method to spelling.
  • 5.6 Minimum edit distance.
  • 5.11 Summary.

4
5.1 Dealing with spelling errors
  • Application areas:
  • Typed text (word processors).
  • Optical character recognition, OCR (optical
    scanners).
  • On-line handwriting recognition (Palm, Chinese).
  • Classification of spelling correction (Kukich, 1992):
  • Non-word error detection: detecting spelling
    errors that result in non-words (graffe for
    giraffe).
  • Isolated-word error correction: correcting
    spelling errors that result in non-words
    (correcting graffe to giraffe, but looking only
    at the word in isolation).
  • Context-dependent error detection and correction:
    using the context to help detect and correct
    real-word errors (dessert for desert, or there
    for three).

5
5.2 Spelling error patterns
  • The number and nature of spelling errors in
    human-typed text differ from those caused by
    pattern-recognition devices like OCR and
    handwriting recognizers.
  • Number:
  • 1-3% in human-typed text.
  • Varies for OCR: 0.2-20%. Special input script for
    Palm.
  • Nature.

6
Nature of spelling errors
  • Human typing errors:
  • Insertion: the as ther
  • Deletion: the as th
  • Substitution: the as thw
  • Transposition: the as teh
  • Another dimension of classification:
  • Typographic errors: keyboard related (spell as
    spwll).
  • Cognitive errors: the writer doesn't know how to
    spell the word (separate as seperate).
  • OCR errors:
  • Substitution
  • Multisubstitution
  • Space deletion
  • Insertion
  • Failure
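
The four human typing error types listed above can be told apart mechanically when the typed word is at most one edit away from the intended word. The following is a minimal sketch under that assumption; the function name `classify_single_error` is hypothetical and not from the slides.

```python
def classify_single_error(correct: str, typed: str) -> str:
    """Label how `typed` differs from `correct`, assuming at most one edit."""
    if correct == typed:
        return "none"
    if len(typed) == len(correct) + 1:
        # One extra letter: dropping some position of `typed` gives `correct`.
        if any(typed[:i] + typed[i + 1:] == correct for i in range(len(typed))):
            return "insertion"
    elif len(typed) == len(correct) - 1:
        # One missing letter: dropping some position of `correct` gives `typed`.
        if any(correct[:i] + correct[i + 1:] == typed for i in range(len(correct))):
            return "deletion"
    elif len(typed) == len(correct):
        diffs = [i for i in range(len(correct)) if correct[i] != typed[i]]
        if len(diffs) == 1:
            return "substitution"
        # Two adjacent positions whose letters are swapped.
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typed[diffs[1]]
                and correct[diffs[1]] == typed[diffs[0]]):
            return "transposition"
    return "other"


for typed in ["ther", "th", "thw", "teh"]:
    print(typed, classify_single_error("the", typed))
```

Anything that needs more than one edit falls into "other", which is where the minimum edit distance of section 5.6 becomes relevant.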

7
An example for OCR errors
  • Correct: The quick brown fox jumps over the lazy
    dog.
  • Recognized: lhe qick brown foxjurnps ovcr tb l
    azy dog.
  • Errors: substitution (e → c) and
    multisubstitutions (T → l, m → rn, he → b) are
    caused by visual similarity rather than keyboard
    distance. Failures (the u dropped from quick) are
    cases where OCR does not select any letter with
    sufficient accuracy.

8
5.3 Detecting non-word errors
  • Detecting non-word errors in text, whether typed
    by humans or scanned, is commonly done by using
    a dictionary.
  • Small or large dictionary?
  • Small: a large dictionary contains rare words that
    resemble misspellings of other words (wont for
    won't).
  • Large: an empirical study found that large
    dictionaries are more helpful than harmful.
  • Use a model of morphology to deal with
    inflection.
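
Dictionary-based detection as described above amounts to a membership test per token. The sketch below assumes a tiny hand-made word list (`DICTIONARY` is hypothetical; a real system would load a large lexicon) and ignores morphology, so inflected forms missing from the list would be flagged as well.

```python
import re

# Hypothetical toy word list, for illustration only.
DICTIONARY = {"the", "giraffe", "is", "a", "tall", "animal"}


def find_non_words(text: str, dictionary: set[str]) -> list[str]:
    """Return the tokens of `text` that are not in `dictionary`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in dictionary]


print(find_non_words("The graffe is a tall animal", DICTIONARY))
# prints ['graffe']
```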

9
5.4 Probabilistic models
  • The noisy channel model.

10
Equation for picking the best word

11
Using Bayes' rule to make the equation
computable
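
The equations themselves did not survive in this transcript. As a sketch in the notation the later slides use (t for the typed word, c for a candidate correction), and assuming a candidate set written here as C, the standard noisy channel derivation reads:

```latex
\begin{align*}
\hat{c} &= \operatorname*{argmax}_{c \in C} \; p(c \mid t) \\
        &= \operatorname*{argmax}_{c \in C} \; \frac{p(t \mid c)\, p(c)}{p(t)} \\
        &= \operatorname*{argmax}_{c \in C} \; p(t \mid c)\, p(c)
\end{align*}
```

The denominator p(t) can be dropped in the last step because it is the same for every candidate c. What remains is the likelihood p(t|c) times the prior p(c).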

12
5.5 Applying the Bayesian method
  • Bayesian algorithm:
  • Proposing candidate corrections.
  • Scoring the candidates.
  • Proposing candidates:
  • Simplifying assumption: a single spelling error.
  • Example misspelling: acress
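
Under the single-error assumption, candidates can be proposed by applying every possible one-letter edit to the misspelling and keeping the results that are real words. The following is a minimal sketch; the function name and the toy dictionary are hypothetical, chosen only to illustrate the acress example.

```python
import string


def single_edit_candidates(typo: str, dictionary: set[str]) -> set[str]:
    """Dictionary words reachable from `typo` by exactly one edit."""
    letters = string.ascii_lowercase
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    # Removing a letter from the typo undoes an insertion error by the typist.
    deletes = {a + b[1:] for a, b in splits if b}
    # Swapping two adjacent letters undoes a transposition error.
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replacing a letter undoes a substitution error.
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in letters}
    # Adding a letter to the typo undoes a deletion error.
    inserts = {a + ch + b for a, b in splits for ch in letters}
    edits = deletes | transposes | substitutes | inserts
    return (edits & dictionary) - {typo}


# Hypothetical toy dictionary.
words = {"actress", "cress", "caress", "access", "across", "acres",
         "acre", "stress"}
print(sorted(single_edit_candidates("acress", words)))
# prints ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Words such as acre, which need two edits, are not proposed. That is the limitation section 5.6 addresses.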

13
Example

14
Scoring the correction
  • p(c) can be estimated by counting how often the
    word c occurs in some corpus.
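
Counting word occurrences as described above gives a relative-frequency estimate of p(c). This sketch assumes a made-up ten-word corpus and applies no smoothing, so a word that never occurs gets probability zero; a practical system would usually smooth the counts.

```python
from collections import Counter


def estimate_priors(corpus_tokens: list[str]) -> dict[str, float]:
    """Relative-frequency estimate of p(c) for every word in the corpus."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Made-up toy corpus, for illustration only.
toy_corpus = "the actress crossed the field across the acres of cress".split()
priors = estimate_priors(toy_corpus)
print(priors["the"], priors["actress"])
```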

15
Calculating p(t|c)
  • Still a research question.
  • Can be estimated.
  • There are some simple ways, for example:
  • Confusion matrix
  • A square 26×26 table which represents how many
    times one letter was incorrectly used instead of
    another.
  • For example, the cell [o,e] in a substitution
    confusion matrix would give the count of times
    that e was substituted for o.
  • Usually, there are four confusion matrices:
    deletion, insertion, substitution and
    transposition.
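
A substitution confusion matrix can be sketched as a table of counts keyed by (intended letter, typed letter), matching the [o,e] convention above. The training pairs below are invented for illustration, and dividing by how often the intended letter occurs is one simple way to turn a count into an estimate of p(t|c); the slides do not specify the exact normalization.

```python
from collections import Counter


def substitution_matrix(pairs: list[tuple[str, str]]) -> Counter:
    """Count (intended, typed) letter substitutions in equal-length pairs."""
    sub = Counter()
    for intended, typed in pairs:
        for c, t in zip(intended, typed):
            if c != t:
                sub[(c, t)] += 1
    return sub


def substitution_likelihood(sub: Counter, letter_counts: Counter,
                            intended: str, typed: str) -> float:
    """Estimate the probability of typing `typed` when `intended` was meant."""
    total = letter_counts[intended]
    return sub[(intended, typed)] / total if total else 0.0


# Invented (intended word, typed word) pairs; not real error data.
pairs = [("spell", "spwll"), ("the", "thw"),
         ("separate", "seperate"), ("across", "acress")]
sub = substitution_matrix(pairs)
letter_counts = Counter("".join(intended for intended, _ in pairs))

print(sub[("o", "e")])                                       # prints 1
print(substitution_likelihood(sub, letter_counts, "e", "w"))  # prints 0.5
```

The deletion, insertion and transposition matrices would be built the same way, each keyed by the letters involved in that kind of edit.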

16
Result
  • ...was called a stellar and versatile acress
    whose combination of sass and glamour has defined
    her...
  • Chapter 6 will show how to augment the prior
    probability by using surrounding words.

17
5.6 Minimum edit distance
  • Previous sections relied on the simplifying
    assumption of a single spelling error.
  • We need a more powerful algorithm to handle
    multiple errors.
  • Minimum edit distance algorithm
  • String distance is a metric of how alike two
    strings are to each other.
  • The minimum edit distance between two strings is
    the minimum number of editing operations
    (insertion, deletion, substitution) needed to
    transform one string into the other.

18
Three methods for representing distances

19
Minimum edit distance algorithm
  • It is an application of dynamic programming, which
    solves problems by combining solutions to
    subproblems.
  • The edit-distance matrix.
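
The dynamic programming idea can be sketched as follows: cell [i][j] of the edit-distance matrix holds the distance between the first i characters of one string and the first j characters of the other, and each cell is filled from its three neighbours. This is a minimal sketch assuming unit cost for insertion and deletion; the `sub_cost` parameter is an assumption added here so that substitution can be charged 1 or 2, and transposition is not treated as a single operation.

```python
def min_edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit distance via a dynamic programming matrix."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j          # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[n][m]


print(min_edit_distance("graffe", "giraffe"))                   # prints 1
print(min_edit_distance("intention", "execution"))              # prints 5
print(min_edit_distance("intention", "execution", sub_cost=2))  # prints 8
```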

20
Minimum edit distance algorithm

21
5.11 Summary
  • We can present many language problems as if a
    clean string of symbols had been corrupted by
    passing through a noisy channel, and it is our job
    to recover the original string.
  • One way to do this is to consider all possible
    original strings and rank them by their
    probability.
  • We use Bayes' rule to break down the probability
    into prior and likelihood.
  • The prior is computed from word frequencies.
    The likelihood is computed by training a simple
    probabilistic model (a confusion matrix, a decision
    tree or a hand-written rule) on a database.
  • The minimum edit distance is introduced to handle
    multiple spelling errors. The minimum edit distance
    algorithm can be used to compute the distance
    between two strings.