Spell Checker - PowerPoint PPT Presentation

Spell Checker
By Myles Maxfield and Jonathan Reed
Methods
To quantitatively test the quality of the spell
checker, the program was executed on predefined
test beds of words for numerous trials, ranging
from thousands to millions of runs. The test
beds varied widely in content because they were
freshly constructed from the latest statistical
data of the most commonly searched words on two
products of our sponsor company. To ensure that
the time efficiency results were accurate, the
data from each trial was recorded and averaged.
In addition, each corrected result and the
frequency of said result were recorded to show
the accuracy and consistency of the program.
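The measurement procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names run_trials and toy_correct and the three-word test bed are hypothetical stand-ins, not the project's actual harness or data.

```python
import time
from collections import Counter

def run_trials(correct, test_bed, trials=1000):
    """Run `correct` on every test word `trials` times.

    Returns the mean time per call in seconds and, for each input word,
    a Counter of how often each output was returned.
    """
    timings = []
    outputs = {word: Counter() for word in test_bed}
    for _ in range(trials):
        for word in test_bed:
            start = time.perf_counter()
            result = correct(word)
            timings.append(time.perf_counter() - start)
            outputs[word][result] += 1
    mean_seconds = sum(timings) / len(timings)
    return mean_seconds, outputs

def toy_correct(word):
    # Hypothetical stand-in for the real spell checker.
    return {"teh": "the", "recieve": "receive"}.get(word, word)

mean_seconds, outputs = run_trials(toy_correct, ["teh", "recieve", "tree"], trials=100)
```

Averaging over many runs smooths out timer noise, and the per-word tallies show whether the same input always yields the same correction.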
Results
94.2% Correctness
Analysis
The spell checker works relatively well. The
theoretical run time in Big-O notation is
O(mn^2), where m is a small constant (0 < m < 1) and
n is the length of the inputted word. The scalar
m is used because the Kd tree filters out many
impossible matches and therefore cuts down on the
data size of the algorithms. The n^2 term comes from
the use of the Damerau-Levenshtein edit distance
algorithm. The runtime of the algorithm on our
sponsor company's servers is even less, and is
therefore within the acceptable range. The
correctness of the algorithm is relatively high.
The algorithm gets the correct answer for all but
four cases of the 70-word test bed that the
algorithm was run on. In two of those four cases,
the algorithm did not find an answer at all.
Because the spell checker output is not displayed
when it does not return an answer, the spell
checker will only suggest an incorrect word 2.9%
of the time (2 of the 70 cases).
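The filtering idea behind the scalar m can be illustrated with a small sketch. The presentation does not say how words were mapped into the Kd tree, so the letter-count key used here is an assumption, and the plain linear scan below is a stand-in for a real Kd tree query. The bound itself is safe: one insertion or deletion changes the letter-count vector by at most 1, one substitution by at most 2, and one transposition by 0, so a word within k edits must have letter counts within 2k.

```python
from collections import Counter

def could_match(query_counts, word_counts, max_edits):
    """Necessary (not sufficient) condition for being within max_edits edits."""
    # Counter subtraction keeps only positive counts, so the two differences
    # together give the L1 distance between the letter-count vectors.
    surplus = (query_counts - word_counts) + (word_counts - query_counts)
    return sum(surplus.values()) <= 2 * max_edits

def filter_candidates(query, index, max_edits=1):
    """Return dictionary words that survive the letter-count filter."""
    query_counts = Counter(query)
    return [word for word, counts in index.items()
            if could_match(query_counts, counts, max_edits)]

# Hypothetical six-word dictionary; the real one held 100,000+ words.
dictionary = ["search", "engine", "spelling", "checker", "speling", "sapling"]
index = {word: Counter(word) for word in dictionary}

survivors = filter_candidates("speling", index, max_edits=1)
m = len(survivors) / len(dictionary)  # fraction left for the edit distance step
```

Here only "spelling", "speling", and "sapling" survive, so the quadratic edit distance runs on half of this toy dictionary. The filter can let through words that are too far away, but it never discards a true match.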
Discussion
Purpose
The purpose of the project was to create a
spell checker that could search a corpus-specific
dictionary at least 100,000 words long and
correct a search query from a user in
500 milliseconds or less. The program was
required to identify incorrectly spelled phrases
and to use the words in a phrase to guess
the correct spelling of the rest of its words.
The accuracy and time efficiency results shown
for the final spell checker program exceeded the
performance of the spell checker previously used
by our sponsor company, as well as that of
versions using the Landau-Vishkin algorithm and a
modified Landau-Vishkin algorithm. The previous
spell checker was an out-of-the-box spell
checker built into
Lucene, an open source software package that was
used to store and access webpage data. The Lucene
spell checker works by ranking words using the
Levenshtein edit distance. The Levenshtein edit
distance algorithm runs in O(MN) time
and requires O(MN) memory, where M and N are
the lengths of the two strings being compared.
The algorithm is very simple and is extremely
fast when comparing short strings, but
inefficient for long strings (Black, 2006). The
Damerau-Levenshtein edit distance algorithm
implemented in the final spell checker relies on
the same method and is very similar; however, the
Damerau-Levenshtein algorithm counts a pair of
transposed adjacent letters as one mistake, whereas
the Levenshtein algorithm counts it as two. The
Lucene spell checker provides quick results for
small test cases and is easy to use, but was
unsatisfactory because of the low accuracy in
returned results and poor time efficiency for
long test cases (Black, 2006). The spell checker
produced by the project far exceeds the Lucene
spell checker in speed for longer test cases.
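The difference between the two edit distances can be shown with a short dynamic-programming sketch. This is not the project's code: the function name edit_distance is hypothetical, and the transposition case implements the restricted "optimal string alignment" variant of Damerau-Levenshtein, which may differ from the variant the authors used. The table has (M+1) by (N+1) cells, which is where the O(MN) time and memory come from.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance between a and b.

    With transpositions=True, swapping two adjacent letters costs 1
    (optimal string alignment variant of Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]

print(edit_distance("teh", "the"))                       # 2
print(edit_distance("teh", "the", transpositions=True))  # 1
```

For a typo such as "teh", the plain Levenshtein distance of 2 ranks "the" no better than words two substitutions away, while the transposition-aware distance of 1 ranks it first.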
Background
The spelling mistakes of a person can be divided
into two categories: typos and guesses. Typos
are random mistakes made by accident. Guesses
are attempts at spelling a word by sounding it
out. With typos, the correct word can be found
by using string matching algorithms, such as the
Damerau-Levenshtein edit distance algorithm, to
search a dictionary and find the closest word
that matches. Correcting guesses requires the
words to be translated into their phonetic
equivalents, which can be done using phonetic
algorithms such as Soundex and Metaphone. The
closest match of the misspelled word can be found
by comparing the phonetic equivalents using the
exact same string matching algorithms used for
typos. However, checking through an entire
dictionary for each misspelled word is
inefficient and time-consuming. Therefore, the
process can be optimized by filtering out words
in the dictionary that cannot be matches. A
Kd tree is a special data structure that
intelligently organizes the dictionary so only
words that could possibly match with the
misspelled word are found. This acts like a
filter for the core algorithm and eliminates
excess calculations. In addition to checking
individual words for mismatches, words that are
commonly used together as phrases can be utilized
to hint at what the correct spelling of the
rest of the phrase is. This can be used to
reduce excess work even more and increase
efficiency.
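The phonetic translation used for guesses can be illustrated with a hand-written version of the classic American Soundex code. This is a sketch for illustration only: the presentation names Soundex and Metaphone but does not show which implementation was used, and the function name soundex and the example words are assumptions.

```python
# Map each coded consonant to its Soundex digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(word):
    """Return the four-character American Soundex code of word."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels and y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad with zeros, keep four characters

print(soundex("recieve"), soundex("receive"))        # R210 R210
print(soundex("definately"), soundex("definitely"))  # D153 D153
```

Because a sounded-out guess and its correct spelling often share a code, comparing codes instead of raw letters brings the intended word much closer. Soundex always keeps the first letter, so guesses that get the first letter wrong need a different encoding such as Metaphone.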
Conclusion
This study was designed to create a spell
checker for an Internet-based search engine.
Different algorithms were created and unified
under a common spell checker program. The
resulting program returned the correct spelling
of an inputted word most of the time, and the
program ran fast enough that the user would not
notice a delay. Users now can search the
Internet using our sponsor company's search
engine and have their spelling corrected.