Chapter 5 Probabilistic models for pronunciation and spelling

1
Chapter 5 Probabilistic models for pronunciation
and spelling

Xiaomeng Su
6 November

2
Main points

This chapter discusses the problem of detecting
and correcting spelling errors.
It first introduces the problems of detecting and
correcting spelling errors and summarizes
typical human spelling error patterns.
It then introduces ways to solve the spelling problem:
Bayes' Rule and the noisy channel model.

3
Outline

5.1 Dealing with spelling errors.
5.2 Spelling error patterns.
5.3 Detecting non-word errors.
5.4 Probabilistic models.
5.5 Applying the Bayesian method to spelling.
5.6 Minimum edit distance.
5.11 Summary.

4
5.1 Dealing with spelling errors

Application areas
Typed text (word processors).
Optical character recognition, OCR (optical
scanners).
On-line handwriting recognition (Palm, Chinese).
Classification of spelling correction (Kukich, 1992)
Non-word error detection: detecting spelling
errors that result in non-words (graffe for
giraffe).
Isolated-word error correction: correcting
spelling errors that result in non-words
(correcting graffe to giraffe, but looking only
at the word in isolation).
Context-dependent error detection and correction:
using the context to help detect and correct
real-word errors (dessert for desert, or there
for three).

5
5.2 Spelling error patterns

The number and nature of spelling errors in
human-typed text differ from those caused by
pattern-recognition devices like OCR and
handwriting recognizers.
Number.
1-3% in human-typed text.
Varies for devices: 0.2-20% for OCR. Special input
script for Palm.
Nature.

6
Nature of spelling errors

Human typing errors
Insertion: the as ther
Deletion: the as th
Substitution: the as thw
Transposition: the as teh
Another dimension of classification
Typographic errors: keyboard-related (spell as
spwll).
Cognitive errors: the writer doesn't know how to
spell the word (separate as seperate).
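
The four single-edit typing errors above can be told apart mechanically. The following is a minimal sketch, not part of the original slides; the function name classify_single_edit and the labels it returns are assumptions made for illustration.

```python
def classify_single_edit(intended: str, typed: str) -> str:
    """Name the single edit that turns `intended` into `typed`."""
    if intended == typed:
        return "none"
    n, m = len(intended), len(typed)
    # Skip the common prefix; i is the first position where the strings differ.
    i = 0
    while i < min(n, m) and intended[i] == typed[i]:
        i += 1
    if m == n + 1 and intended[i:] == typed[i + 1:]:
        return "insertion"      # an extra letter was typed
    if m == n - 1 and intended[i + 1:] == typed[i:]:
        return "deletion"       # a letter was left out
    if m == n:
        if intended[i + 1:] == typed[i + 1:]:
            return "substitution"   # one letter was replaced
        if (i + 1 < n
                and intended[i] == typed[i + 1]
                and intended[i + 1] == typed[i]
                and intended[i + 2:] == typed[i + 2:]):
            return "transposition"  # two adjacent letters were swapped
    return "other"              # more than one edit


for typo in ("ther", "th", "thw", "teh"):
    print(typo, classify_single_edit("the", typo))
```

Anything that needs more than one edit falls into "other", which is where the minimum edit distance of section 5.6 becomes relevant.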

OCR errors.
Substitution
Multisubstitution
Space deletion
Insertion
Failure.

7
An example for OCR errors

Correct: The quick brown fox jumps over the lazy
dog.
Recognized: lhe qick brown foxjurnps ovcr tb l
azy dog.
Errors: substitution (e → c) and
multisubstitutions (T → l, m → rn, he → b) are
caused by visual similarity rather than keyboard
distance; failures (the missing u in qick) are
cases where OCR does not select any letter with
sufficient accuracy.

8
5.3 Detecting non-word errors

Detecting non-word errors in text, whether typed
by humans or scanned, is commonly done by using
a dictionary.
Small or big dictionary?
Small: a large dictionary contains rare words that
resemble misspellings of other words (the rare
word wont looks like a misspelling of won't).
Large: empirical studies found that large
dictionaries are more helpful than harmful.
Use a model of morphology to deal with
inflection.
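
Dictionary-based detection can be sketched in a few lines. This is an illustrative sketch only: the function find_non_words, the toy dictionary, and the crude suffix stripping (standing in for a real model of morphology) are all assumptions, not something the slides specify.

```python
import re


def find_non_words(text, dictionary, suffixes=("s", "ed", "ing")):
    """Return the tokens of `text` that the dictionary does not license."""
    non_words = []
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in dictionary:
            continue
        # Crude stand-in for a morphological model: accept a token if
        # stripping a known suffix leaves a dictionary word.
        if any(token.endswith(s) and token[:-len(s)] in dictionary
               for s in suffixes):
            continue
        non_words.append(token)
    return non_words


toy_dictionary = {"the", "giraffe", "jump", "over", "a", "fence"}
print(find_non_words("The graffe jumps over a fence", toy_dictionary))
```

Here jumps is accepted through the suffix rule even though only jump is listed, which is the job the slide assigns to morphology.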

9
5.4 Probabilistic models

The noisy channel model.

10
Equation for picking the best word
11
Using Bayes' rule to make the equation
computable
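
The equations on these two slides did not survive in the transcript. In the notation the later slides use (t for the observed typo, c for a candidate correction), the standard noisy channel formulation would presumably read as below; the symbols \hat{c} and C (the set of candidate corrections) are assumptions introduced here.

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid t)
        = \operatorname*{argmax}_{c \in C} \frac{P(t \mid c)\, P(c)}{P(t)}
        = \operatorname*{argmax}_{c \in C} P(t \mid c)\, P(c)
```

The denominator P(t) is the same for every candidate, so dropping it does not change which candidate wins. What remains is the likelihood P(t | c) times the prior P(c).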
12
5.5 Applying Bayesian method

Bayesian algorithm
Proposing candidate corrections.
Scoring the candidates.
Proposing candidates
Simplifying assumption: a single spelling error.
Example: the misspelling acress
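
Under the single-error assumption, candidates are the dictionary words that lie one insertion, deletion, substitution, or transposition away from the typo. The sketch below illustrates this; the function names and the toy dictionary are assumptions, and a real system would use a full word list.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one edit (delete, transpose, substitute, insert) from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    inserts = {a + ch + b for a, b in splits for ch in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}


def propose_candidates(typo, dictionary):
    """Keep only the single-edit strings that are real words."""
    return single_edits(typo) & set(dictionary)


# Toy dictionary (an assumption for illustration).
toy_dictionary = {"actress", "cress", "caress", "access",
                  "across", "acres", "address", "acre"}
print(sorted(propose_candidates("acress", toy_dictionary)))
```

Words such as address and acre are filtered out because they need more than one edit.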

13
Example
14
Scoring the correction

p(c) can be estimated by counting how often the
word c occurs in some corpus.
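
As a rough sketch of this counting estimate (the function name and the tiny corpus are assumptions for illustration), p(c) is simply the relative frequency of c:

```python
from collections import Counter


def estimate_prior(corpus_tokens):
    """Relative-frequency estimate of p(c) from a list of word tokens."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Tiny made-up corpus, only to show the computation.
tokens = "the actress saw the acres across the field".split()
prior = estimate_prior(tokens)
print(prior["the"], prior["actress"])
```

A word that never occurs in the corpus gets probability zero under this estimate, so in practice some form of smoothing is usually added.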

15
Calculating p(t|c)

Still a research question.
Can be estimated.
Some simple ways exist, for example:
Confusion matrix
A square 26×26 table which represents how many
times one letter was incorrectly used instead of
another.
For example, the cell o,e in a substitution
confusion matrix would give the count of times
that e was substituted for o.
Usually, there are four confusion matrices:
deletion, insertion, substitution and
transposition.
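
A substitution confusion matrix can be built by counting over a list of (intended, typed) pairs. The sketch below follows the slide's convention that cell o,e counts e typed in place of o (row = intended letter, column = typed letter); the function name and the training pairs are assumptions, and other sources index the matrix the other way round.

```python
import string

LETTERS = string.ascii_lowercase
INDEX = {ch: i for i, ch in enumerate(LETTERS)}


def substitution_matrix(pairs):
    """Count single-letter substitutions in (intended, typed) word pairs."""
    matrix = [[0] * 26 for _ in range(26)]
    for intended, typed in pairs:
        if len(intended) != len(typed):
            continue  # not a pure substitution
        diffs = [(a, b) for a, b in zip(intended, typed) if a != b]
        if len(diffs) != 1:
            continue  # zero or several differing letters
        a, b = diffs[0]
        if a in INDEX and b in INDEX:
            matrix[INDEX[a]][INDEX[b]] += 1  # row: intended, column: typed
    return matrix


# Made-up training pairs, for illustration only.
pairs = [("the", "thw"), ("spell", "spwll"), ("across", "acress"), ("the", "teh")]
sub = substitution_matrix(pairs)
print(sub[INDEX["e"]][INDEX["w"]], sub[INDEX["o"]][INDEX["e"]])
```

Dividing a cell by how often the intended letter occurs in the training text gives one simple estimate of p(t|c) for a substitution. The deletion, insertion and transposition matrices would be counted in the same way.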

16
Result

...was called a stellar and versatile acress
whose combination of sass and glamour has defined
here...
Chapter 6 will show how to augment the prior
probability by using surrounding words.
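
Putting the two factors together, each candidate c is scored by p(t|c) * p(c) and the candidates are ranked. The sketch below only shows the mechanics: the function name is an assumption, and every probability in it is an invented placeholder, not a value from the slides or from any corpus.

```python
def rank_candidates(typo, candidates, prior, likelihood):
    """Rank candidates by likelihood(typo, c) * prior[c], best first."""
    scored = [(c, likelihood.get((typo, c), 0.0) * prior.get(c, 0.0))
              for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Placeholder numbers, invented purely to exercise the function.
toy_prior = {"actress": 0.4, "across": 0.5, "acres": 0.1}
toy_likelihood = {("acress", "actress"): 0.5,
                  ("acress", "across"): 0.1,
                  ("acress", "acres"): 0.3}

for word, score in rank_candidates("acress", toy_prior, toy_prior, toy_likelihood):
    print(word, score)
```

With real counts the ranking can differ from what the sentence context suggests, which is why the surrounding words matter.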

17
5.6 Minimum edit distance

Previous sections relied on the simplifying
assumption of a single spelling error.
We need a more powerful algorithm to handle
multiple errors.
Minimum edit distance algorithm
String distance is a metric of how alike two
strings are to each other.
The minimum edit distance between two strings is
the minimum number of editing operations
(insertion, deletion, substitution) needed to
transform one string into the other.

18
Three methods for representing distances
19
Minimum edit distance algorithm

It is an application of dynamic programming, which
solves problems by combining solutions to
subproblems.
The edit-distance matrix.
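
The edit-distance matrix can be filled in row by row, each cell taking the cheapest of a deletion, an insertion, or a substitution from its neighbours. The following is a minimal sketch of that dynamic program; the function name and the configurable costs are assumptions (unit costs give plain Levenshtein distance, and some textbook versions charge 2 for a substitution).

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of edits turning `source` into `target`."""
    n, m = len(source), len(target)
    # dist[i][j] = cost of turning source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost   # delete everything
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost   # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # deletion
                dist[i][j - 1] + ins_cost,                       # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match or substitution
            )
    return dist[n][m]


print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

Each cell depends only on three already-computed neighbours, so the whole table costs time proportional to the product of the two string lengths.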

20
Minimum edit distance algorithm
21
5.11 Summary

We can present many language problems as if a
clean string of symbols had been corrupted by
passing through a noisy channel and it is our job
to recover the original string.
One way to do it is to consider all possible
original strings and rank them by their
probability.
We use Bayes' Rule to break down the probability
into prior and likelihood.
The prior is computed from word frequencies.
The likelihood is computed by training a simple
probabilistic model (a confusion matrix, a decision
tree, or hand-written rules) on a database.
Minimum edit distance is introduced to handle
multiple spelling errors. The minimum edit distance
algorithm computes the distance between two
strings.
