Title: Chapter 5 Probabilistic models of pronunciation and spelling
Chapter 5 Probabilistic models of pronunciation and spelling
Main points
- This chapter discusses the problem of detecting and correcting spelling errors.
- It first introduces the problems of detecting and correcting spelling errors and summarizes typical human spelling error patterns.
- It then introduces ways to solve the spelling problem: Bayes' rule and the noisy channel model.
Outline
- 5.1 Dealing with spelling errors.
- 5.2 Spelling error patterns.
- 5.3 Detecting non-word errors.
- 5.4 Probabilistic models.
- 5.5 Applying the Bayesian method to spelling.
- 5.6 Minimum edit distance.
- 5.11 Summary.
5.1 Dealing with spelling errors
- Application areas
  - Typed text (word processors).
  - Optical character recognition, OCR (optical scanners).
  - On-line handwriting recognition (Palm, Chinese).
- Classification of spelling correction (Kukich 1992)
  - Non-word error detection: detecting spelling errors that result in non-words (graffe for giraffe).
  - Isolated-word error correction: correcting spelling errors that result in non-words (correcting graffe to giraffe, but looking only at the word in isolation).
  - Context-dependent error detection and correction: using the context to help detect and correct real-word errors (dessert for desert, or there for three).
5.2 Spelling error patterns
- The number and nature of spelling errors in human-typed text differ from those caused by pattern-recognition devices like OCR and handwriting recognizers.
- Number.
  - 1-3% in human-typed text.
  - Varies for OCR: 0.2-20%. Special input script for Palm.
- Nature.
Nature of spelling errors
- Human typing errors
  - Insertion: the as ther
  - Deletion: the as th
  - Substitution: the as thw
  - Transposition: the as teh
- Other dimension of classification
  - Typographic errors: keyboard related (spell as spwll).
  - Cognitive errors: the writer doesn't know how to spell the word (separate as seperate).
- OCR errors
  - Substitution
  - Multisubstitution
  - Space deletion
  - Insertion
  - Failure
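The four human typing error types listed above can be told apart mechanically when a word contains exactly one error. The following is a minimal sketch, not part of the original slides; the function name `classify_single_error` is hypothetical, and it assumes at most one edit separates the intended word from the typed one.

```python
def classify_single_error(correct: str, typed: str) -> str:
    """Name the single edit that turns `correct` into `typed`, if there is one."""
    if correct == typed:
        return "none"
    # Skip the longest common prefix; i is the first position that differs.
    i = 0
    while i < min(len(correct), len(typed)) and correct[i] == typed[i]:
        i += 1
    # One extra letter in the typed word: insertion.
    if len(typed) == len(correct) + 1 and correct[i:] == typed[i + 1:]:
        return "insertion"
    # One letter missing from the typed word: deletion.
    if len(typed) == len(correct) - 1 and correct[i + 1:] == typed[i:]:
        return "deletion"
    if len(typed) == len(correct):
        # Same length, one letter replaced: substitution.
        if correct[i + 1:] == typed[i + 1:]:
            return "substitution"
        # Same length, two adjacent letters swapped: transposition.
        if (i + 1 < len(correct)
                and correct[i] == typed[i + 1]
                and correct[i + 1] == typed[i]
                and correct[i + 2:] == typed[i + 2:]):
            return "transposition"
    # Anything else involves more than one edit.
    return "other"


for correct, typed in [("the", "ther"), ("the", "th"), ("the", "thw"), ("the", "teh")]:
    print(correct, "as", typed, "->", classify_single_error(correct, typed))
```

The sketch only looks at letter positions, so it cannot separate typographic from cognitive errors; that distinction depends on why the error was made, not on its shape.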
An example of OCR errors
- Correct: The quick brown fox jumps over the lazy dog.
- Recognized: lhe qick brown foxjurnps ovcr tb l azy dog.
- Errors: substitution (e → c) and multisubstitutions (T → l, m → rn, he → b) are caused by visual similarity rather than keyboard distance. Failures (the u of quick) are cases where OCR does not select any letter with sufficient accuracy.
5.3 Detecting non-word errors
- Detecting non-word errors in text, whether typed by humans or scanned, is commonly done by using a dictionary.
- Small or big dictionary?
  - Small: a large dictionary contains rare words that resemble misspellings of other words (won't typed as wont, where wont is itself a rare word).
  - Large: empirical studies found that large dictionaries are more helpful than harmful.
- Use a model of morphology to deal with inflection.
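Dictionary-based non-word detection can be sketched in a few lines. The example below is illustrative only: the lexicon is a toy set, and the suffix stripping is a crude stand-in (an assumption) for the morphological model the slide mentions. The names `LEXICON`, `SUFFIXES`, `is_known`, and `find_non_words` are hypothetical.

```python
import re
from typing import List

# Toy lexicon of base forms; a real system would load a large word list.
LEXICON = {"the", "quick", "brown", "fox", "jump", "over", "lazy", "dog", "giraffe"}

# Crude inflection handling (an assumption, not a real morphological model).
SUFFIXES = ("s", "ed", "ing")


def is_known(word: str) -> bool:
    """Return True if the word, or the word minus one known suffix, is in the lexicon."""
    word = word.lower()
    if word in LEXICON:
        return True
    return any(word.endswith(s) and word[: -len(s)] in LEXICON for s in SUFFIXES)


def find_non_words(text: str) -> List[str]:
    """Return the tokens of `text` that the lexicon does not recognize."""
    tokens = re.findall(r"[A-Za-z']+", text)
    return [t for t in tokens if not is_known(t)]


print(find_non_words("The quick brown fox jumps over the lazy graffe"))
```

Here jumps is accepted because stripping the suffix s leaves the base form jump, while graffe is reported as a non-word.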
5.4 Probabilistic models
Equation for picking the best word
Using Bayes' rule to make the equation computable
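The equations on these two slides did not survive extraction. In the standard noisy channel formulation, written with the p(c) and p(t|c) notation used in the rest of these slides, they would read roughly as follows. Here t is the observed typo, c is a candidate correction, and C (the set of candidates) is a symbol assumed for this sketch.

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} \; p(c \mid t)
        = \operatorname*{argmax}_{c \in C} \; \frac{p(t \mid c)\, p(c)}{p(t)}
        = \operatorname*{argmax}_{c \in C} \; p(t \mid c)\, p(c)
```

The denominator p(t) can be dropped because it is the same for every candidate c, so it does not change which candidate scores highest. What remains is the likelihood p(t|c) times the prior p(c).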
5.5 Applying the Bayesian method
- Bayesian algorithm
  - Proposing candidate corrections.
  - Scoring the candidates.
- Proposing candidates
  - Simplifying assumption: a single spelling error.
  - Example misspelling: acress
Example
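Under the single-error assumption, candidates can be proposed by applying every possible single edit to the misspelling and keeping the results that are real words. The sketch below illustrates this for acress; the lexicon is a small hypothetical word list chosen for the example, and `single_edit_candidates` is a hypothetical name.

```python
import string


def single_edit_candidates(typo, lexicon):
    """Return lexicon words reachable from `typo` by exactly one edit."""
    letters = string.ascii_lowercase
    # Every way to cut the typo into a left part and a right part.
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in letters}
    inserts = {a + ch + b for a, b in splits for ch in letters}
    edits = deletes | transposes | substitutes | inserts
    # Keep only real words, excluding the typo itself.
    return sorted(w for w in edits if w in lexicon and w != typo)


# Hypothetical toy lexicon for illustration.
lexicon = {"actress", "cress", "caress", "access", "across", "acres", "actor", "stress"}
print(single_edit_candidates("acress", lexicon))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Each edit is applied to the typo to undo the presumed error: inserting t gives actress, deleting a gives cress, swapping a and c gives caress, and so on. Words that need two or more edits, such as stress, are not proposed.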
Scoring the correction
- The prior p(c) can be estimated by counting how often the word c occurs in some corpus.
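A count-based estimate of the prior can be sketched as below. The corpus is a hypothetical toy sentence, and the additive smoothing (adding 0.5 to every count so that unseen candidates do not receive zero probability) is an assumption of this sketch rather than something stated on the slide.

```python
from collections import Counter


def estimate_priors(corpus_tokens, candidates, smoothing=0.5):
    """Estimate p(c) for each candidate from smoothed corpus frequencies."""
    counts = Counter(corpus_tokens)
    n = len(corpus_tokens)
    # Vocabulary size: every word seen in the corpus plus any unseen candidate.
    v = len(set(corpus_tokens) | set(candidates))
    return {c: (counts[c] + smoothing) / (n + smoothing * v) for c in candidates}


# Hypothetical toy corpus for illustration.
corpus = "the actress walked across the stage as the actress bowed".split()
priors = estimate_priors(corpus, ["actress", "across", "acres"])
for word, p in sorted(priors.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.4f}")
```

With this toy corpus, actress gets the highest prior because it occurs twice, and acres still gets a small nonzero prior even though it never occurs.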
Calculating p(t|c)
- Still a research question.
- Can be estimated.
- Some simple ways exist. For example:
- Confusion matrix
  - A square 26×26 table which represents how many times one letter was incorrectly used instead of another.
  - For example, the cell [o,e] in a substitution confusion matrix would give the count of times that e was substituted for o.
  - Usually, there are four confusion matrices: deletion, insertion, substitution, and transposition.
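As a rough illustration of how a substitution confusion matrix could feed the likelihood, the sketch below counts (intended, typed) letter pairs and divides a cell by how often the intended letter occurs overall. The indexing follows the slide's convention (row = intended letter, column = typed letter). The error pairs and letter counts are invented for the example, and the normalization is one simple choice, not necessarily the one used in the chapter.

```python
from collections import defaultdict


def build_substitution_matrix(error_pairs):
    """Count substitutions; matrix[intended][typed] is the number of times
    `typed` was written where `intended` was meant."""
    matrix = defaultdict(lambda: defaultdict(int))
    for intended, typed in error_pairs:
        matrix[intended][typed] += 1
    return matrix


def substitution_likelihood(matrix, letter_counts, intended, typed):
    """Estimate the probability of typing `typed` when `intended` was meant."""
    total = letter_counts.get(intended, 0)
    if total == 0:
        return 0.0
    return matrix[intended][typed] / total


# Hypothetical data: observed substitution errors and overall letter counts.
observed = [("o", "e"), ("o", "e"), ("o", "a"), ("e", "i")]
letter_counts = {"o": 100, "e": 200}

sub = build_substitution_matrix(observed)
print(sub["o"]["e"])                                         # cell [o,e]
print(substitution_likelihood(sub, letter_counts, "o", "e"))
```

The same pattern would be repeated for the deletion, insertion, and transposition matrices, each with its own counts.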
Result
- "...was called a stellar and versatile acress whose combination of sass and glamour has defined her..."
- Chapter 6 will show how to augment the prior probability by using surrounding words.
5.6 Minimum edit distance
- Previous sections relied on the simplifying assumption of a single spelling error.
- We need a more powerful algorithm to handle multiple errors.
- Minimum edit distance algorithm
  - String distance is some metric of how alike two strings are to each other.
  - The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one string into the other.
Three methods for representing distances
Minimum edit distance algorithm
- It is an application of dynamic programming, which solves problems by combining solutions to subproblems.
- The edit-distance matrix.
Minimum edit distance algorithm
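The algorithm shown on this slide was lost in extraction. A minimal dynamic-programming sketch of minimum edit distance is given below. The function name and the cost parameters are assumptions of the sketch; unit costs give the plain Levenshtein distance, and a substitution cost of 2 is a common variant.

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Return the minimum total cost of edits that turn `source` into `target`."""
    n, m = len(source), len(target)
    # dist[i][j] is the distance between source[:i] and target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost   # delete every source letter
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost   # insert every target letter
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                         # deletion
                dist[i][j - 1] + ins_cost,                         # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost),    # match or substitution
            )
    return dist[n][m]


print(min_edit_distance("acress", "actress"))
print(min_edit_distance("intention", "execution"))
print(min_edit_distance("intention", "execution", sub_cost=2))
```

Each cell of the edit-distance matrix is filled from its three neighbours (above, left, and diagonal), which is the sense in which the solution combines solutions to subproblems. Unlike the single-error candidate generation, this handles any number of errors.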
5.11 Summary
- We can present many language problems as if a clean string of symbols had been corrupted by passing through a noisy channel, and it is our job to recover the original string.
- One way to do it is to consider all possible original strings and rank them by their probability.
- We use Bayes' rule to break down the probability into prior and likelihood.
- The prior is computed from word frequencies. The likelihood is computed by training a simple probabilistic model (a confusion matrix, a decision tree, or a hand-written rule) on a database.
- The minimum edit distance is introduced to handle multiple spelling errors. The minimum edit distance algorithm can be used to compute the distance between two strings.