Natural Language Processing
- By Tim
- Adrian Gareau
- Edward Dantsiguer
Agenda
- 1.0 Definitions
- 1.1 Characteristics of Successful Machines
- 1.2 Practical Applications
  - 1.2.1 Machine Translation
  - 1.2.2 Database Access
  - 1.2.3 Text Interpretation
    - 1.2.3.1 Information Retrieval
    - 1.2.3.2 Text Categorization
    - 1.2.3.3 Extracting Data from Text
- 2.0 Efficient Parsing
- 3.0 Scaling Up the Lexicon
- 4.0 List of References
1.0 Definitions
- Natural languages are languages that living creatures use for communication
- Artificial languages are mathematically defined classes of signals that can be used for communication with machines
- A language is a set of sentences that may be used as signals to convey semantic information
- The meaning of a sentence is the semantic information it conveys
1.1 Characteristics of Successful Natural Language Systems
- Successful systems share two properties
  - they are focused on a particular domain rather than allowing discussion of any topic
  - they are focused on a particular task rather than attempting to understand language completely
- In other words, a natural language system is more likely to work correctly if the set of possible inputs is restricted: the smaller the space of possible inputs, the greater the likelihood of success
1.2 Practical Applications
- We are going to look at five practical applications of natural language processing
  - machine translation (1.2.1)
  - database access (1.2.2)
  - text interpretation (1.2.3)
    - information retrieval (1.2.3.1)
    - text categorization (1.2.3.2)
    - extracting data from text (1.2.3.3)
1.2.1 Machine Translation
- First suggestions were made in the 1930s by the Russian Smirnov-Troyansky and the Frenchman Georges Artsrouni
- First serious discussions were begun in 1946 by the mathematician Warren Weaver
- There was great hope that computers would be able to translate from one natural language to another (inspired by the success of the Allied code-breaking efforts using the British Colossus computer)
  - Turing's project translated coded messages into intelligible German
- By 1954 there was a machine translation (MT) project at Georgetown University
  - it succeeded in correctly translating several sentences from Russian into English
- After the Georgetown project, MT projects were started at MIT, Harvard and the University of Pennsylvania
1.2.1 Machine Translation (Cont)
- It soon (1966) became apparent that translation is a very complicated task and that it would be practically impossible to account for all the intricacies and nuances of natural languages
  - correct translation requires an in-depth understanding of both languages, since the structure of expressions varies from one natural language to another
- Yehoshua Bar-Hillel declared that MT was impossible (the Bar-Hillel paradox)
  - human analysis of messages relies to some extent on information that is not present in the words that make up the message
    - "The pen is in the box"
      - i.e. the writing instrument is in the container
    - "The box is in the pen"
      - i.e. the container is in the playpen or the pigpen
1.2.1 Machine Translation (Cont)
- There have been no fundamental breakthroughs in machine translation in the last 34 years
- Progress has been made on restricted domains
  - there are dozens of systems that are able to take a subset of one language and translate it fairly accurately into another language
  - these systems operate well enough to save significant sums of money over fully manual techniques (see examples two pages down)
  - among these systems, the ones operating on more restricted input produce the more impressive results
- Machine translation is NOT automatic speech recognition
1.2.1 Machine Translation (Cont)
- Examples of poor machine translations include
  - "the spirit is strong, but the body is weak" translated literally as "the vodka is strong but the meat is rotten"
  - "out of sight, out of mind" translated as "invisible, insane"
  - "hydraulic ram" translated as "male water sheep"
- These do not imply that machine translation is a waste of time
  - some mistakes are inevitable regardless of the quality and sophistication of the system
  - one has to realize that human translators also make mistakes
1.2.1 Machine Translation (Cont)
- Example machine translation systems include
  - TAUM-METEO system
    - translates weather reports from English to French
    - works very well, since the language in government weather reports is highly stylized and regular
  - SPANAM system
    - translates Spanish into English
    - worked on a more open domain
    - results were reasonably good, although the resulting English text was not always grammatical and very rarely fluent
  - AVENTINUS system
    - advanced information system for multilingual drug enforcement
    - allows law enforcement officials to know what a foreign document is about
    - sorts, classifies and analyzes drug-related information
1.2.1 Machine Translation (Cont)
- There are three basic types of machine translation
  - Machine-assisted (aided) human translation (MAHT)
    - the translation is performed by a human translator, who uses a computer as a tool to improve or speed up the translation process
  - Human-assisted (aided) machine translation (HAMT)
    - the source language text is modified by a human translator before, during or after it is translated by the computer
  - Fully automatic machine translation (FAMT)
    - the source language text is fed into the computer as a file, and the computer produces a translation automatically, without any human intervention
1.2.1 Machine Translation (Cont)
- Standing on its own, unrestricted machine translation (FAMT) is still inadequate
- Human-assisted machine translation (HAMT) can be used to improve the quality of translation
  - one possibility is to have a human reader go over the text after the translation, correcting grammar errors (post-processing)
    - the human reader can save a lot of time, since some of the text will already be translated correctly
    - sometimes a monolingual human can edit the output without reading the original
  - another possibility is to have a human editor revise the document before translation (pre-processing)
    - make the original conform to a restricted subset of a language
    - this usually allows the system to translate the resulting text without any need for post-editing
1.2.1 Machine Translation (Cont)
- Restricted languages are sometimes called "Caterpillar English"
  - Caterpillar was the first company to try writing its manuals using pre-processing
  - Xerox was the first company to use the pre-processing approach really successfully (SYSTRAN system)
    - the language defined for its manuals was highly restricted, so translation into other languages worked quite well
- There is a substantial start-up cost to any machine translation effort
  - to achieve broad coverage, translation systems need lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism)
1.2.1 Machine Translation (Cont)
- There are several basic theoretical approaches to machine translation
  - Direct MT strategy
    - based on good glossaries and morphological analysis
    - always between a specific pair of languages
  - Transfer MT strategy
    - first, the source language is parsed into an abstract internal representation
    - a transfer is then made into the corresponding structures of the target language
  - Interlingua MT strategy
    - the idea is to create an artificial intermediate language
    - it shares all the features, and makes all the distinctions, of all languages
  - Knowledge-based strategy
    - similar to the above
    - the intermediate form is semantic in nature rather than syntactic
1.2.2 Database Access
- The first major success of natural language processing
- There was hope that databases could be controlled by natural language instead of complicated data retrieval commands
  - this was a major problem in the early 1970s, since the staff in charge of data retrieval could not keep up with users' demand for data
- The LUNAR system was the first such interface
  - built by William Woods in 1973 for the NASA Manned Spacecraft Center
  - the system was able to correctly answer 78% of questions such as "What is the average modal plagioclase concentration for lunar samples that contain rubidium?"
1.2.2 Database Access (Cont)
- Other examples of data retrieval systems include
  - CHAT system
    - developed by Fernando Pereira in 1983
    - similar level of complexity to the LUNAR system
    - worked on geographical databases
    - was restricted
    - question wording was very important
  - TEAM system
    - could handle a wider set of problems than CHAT
    - was still restricted and unable to handle all types of input
1.2.2 Database Access (Cont)
- Companies such as Natural Language Inc. and Symantec are still selling database tools that use natural language
- The ability to control databases with natural language is not as big a concern as it was in the 1970s
  - graphical user interfaces and the integration of spreadsheets, word processors, graphing utilities, report-generating utilities, etc. are of greater concern to database buyers today
  - mathematical or set notation seems to be a more natural way of communicating with a database than plain English
  - with the advent of SQL, the problem of data retrieval is not as severe as it was in the past
1.2.3 Text Interpretation
- In the early 1980s, most online information was stored in databases and spreadsheets
- Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
  - there is a need to sort this information to reduce it to some comprehensible amount
- Text interpretation has become a major field in natural language processing
  - becoming more and more important with the expansion of the Internet
  - consists of
    - information retrieval
    - text categorization
    - data extraction
1.2.3.1 Information Retrieval
- Information retrieval (IR) is also known as information extraction (IE)
- Information retrieval systems analyze unrestricted text in order to extract specific types of information
- IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
  - relevance is determined by pre-defined domain guidelines, which must specify, as accurately as possible, exactly what types of information the system is expected to find
  - a query is a good example of such a pre-defined domain
  - documents that contain relevant information are retrieved while the others are ignored
1.2.3.1 Information Retrieval (Cont)
- Sometimes documents can be represented by a surrogate, such as the title plus a list of keywords and/or an abstract
- It is more common to use the full text, possibly subdivided into sections that each serve as a separate document for retrieval purposes
- The query is normally a list of words typed by the user
- Earlier systems constructed queries from Boolean combinations of words
  - users found it difficult to get good results from Boolean queries
  - it was hard to find a combination of ANDs and ORs that would produce appropriate results
1.2.3.1 Information Retrieval (Cont)
- The Boolean model has been replaced by the vector-space model in modern IR systems
  - in the vector-space model, every list of words (both the documents and the query) is treated as a vector in n-dimensional vector space (where n is the number of distinct tokens in the document collection)
  - one can use a 1 in a vector position if that word appears and a 0 if it does not
  - vectors are then compared to determine which ones are close
  - the vector model is more flexible than the Boolean model
    - documents can be ranked, and the closest matches can be reported first
1.2.3.1 Information Retrieval (Cont)
- There are many variations on the vector-space model
  - some allow stating that two words must appear near each other
  - some use a thesaurus to automatically augment the words in the query with their synonyms
- A good discriminator must be chosen in order for the system to be effective
  - common words like "a" and "the" don't tell us much, since they occur in just about every document
  - a good way to set up the retrieval is to give a term a larger weight if it appears in a small number of documents (see the sketch below)
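- A minimal sketch of this weighting idea (inverse document frequency); the toy three-document corpus and the name idf are our own assumptions, not from the slides:

```python
import math

# Toy corpus: each document is represented by its set of distinct tokens.
docs = [
    {"the", "pen", "is", "in", "box"},
    {"the", "box", "is", "in", "pen"},
    {"the", "crude", "oil", "market"},
]

def idf(term):
    """Inverse document frequency: terms that occur in few documents
    get a large weight; words that occur everywhere score near zero."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

print(idf("the"))    # 0.0  -- appears in every document, poor discriminator
print(idf("crude"))  # ~1.1 -- appears in one document, good discriminator
```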
1.2.3.1 Information Retrieval (Cont)
- Another way to think about IR is in terms of
databases. An IR system attempts to convert
unstructured text documents into codified
database entries. Database entries might be
drawn from a set of fixed values, or they can be
actual sub-strings pulled from the original
source text. - From a language processing perspective, IR
systems must operate at many levels, from word
recognition to sentence analysis, and from
understanding at the sentence level on up to
discourse analysis at the level of full text
document. - Dictionary coverage is an especially challenging
problem since open-ended documents can be filled
with all manner of jargon, abbreviations, and
proper names, not to mention typos and
telegraphic writing styles.
1.2.3.1 Information Retrieval (Cont)
- Example (vector-space model): assume that we have one very short document containing the single sentence "CPSC 533 is the best Computer Science course at UofC", and that our query is "UofC"
  - we need to set up our n-dimensional vector space: we have 10 distinct tokens (one for every word in the sentence)
  - we set up the following vector to represent the sentence: (1,1,1,1,1,1,1,1,1,1), indicating that all ten words are present
  - we set the following vector for the query: (0,0,0,0,0,0,0,0,0,1), indicating that "UofC" is the only word present in the query
  - by ANDing the two vectors together, we get (0,0,0,0,0,0,0,0,0,1), meaning that our document contains "UofC", as expected (see the sketch below)
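- The example above can be reproduced in a few lines; the naive whitespace tokenization is our own simplification for illustration:

```python
# Binary vector-space sketch of the slide's example.
document = "CPSC 533 is the best Computer Science course at UofC"
query = "UofC"

vocab = document.split()                              # the 10 distinct tokens
doc_tokens, qry_tokens = set(document.split()), set(query.split())

doc_vec = [1 if w in doc_tokens else 0 for w in vocab]   # (1,1,...,1)
qry_vec = [1 if w in qry_tokens else 0 for w in vocab]   # (0,...,0,1)

# Component-wise AND of the two 0/1 vectors.
print([d & q for d, q in zip(doc_vec, qry_vec)])   # [0,...,0,1]: "UofC" matches
```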
1.2.3.1 Information Retrieval (Cont)
- Example commercial system (HIGHLIGHT)
  - helps users find relevant information in large volumes of text and presents it in a structured fashion
  - it can extract information from newswire reports for a specific topic area (such as global banking or the oil industry) as well as current and historical financial and other data
  - although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most trained professional would have time to look for
  - see the demo at http://www-cgi.cam.sri.com/highlight/
  - could also be classified under Extracting Data From Text (1.2.3.3)
1.2.3.2 Text Categorization
- It is often desirable to sort all text into several categories
- There are a number of companies that provide their subscribers access to all news on a particular industry, company or geographic area
  - traditionally, human experts were used to assign the categories
  - in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)
- The context in which text appears is very important, since the same word could be categorized completely differently depending on the context
  - Example: in a dictionary, the primary definition of the word "crude" is "vulgar", but in a large sample of the Wall Street Journal, "crude" refers to oil 100% of the time
1.2.3.3 Extracting Data From Text
- The task of data extraction is to take online text and derive from it some assertions that can be put into a structured database
- Examples of data extraction systems include
  - SCISOR system
    - able to take stock information text (such as the type released by the Dow Jones News Service) and extract important stock information pertaining to
      - events that took place
      - companies involved
      - starting share prices
      - quantity of shares that changed hands
      - effect on stock prices
2.0 Efficient Parsing
- Parsing: the act of analyzing the grammaticality of an utterance according to some specific grammar
  - the previous sentence was parsed according to some grammar of English, and it was determined that it was grammatical
  - we read the words in some order (from left to right, from right to left, or in random order) and analyzed them one by one
- Each parse is a different method of analyzing some target sentence according to some specified grammar
2.0 Efficient Parsing (Cont)
- Simple left-to-right parsing is often insufficient
  - it is hard to determine the nature of the sentence
  - this means that we have to make an initial guess as to what it is the sentence is saying
  - this forces us to backtrack if the guess is incorrect
- Some backtracking is inevitable
  - to make parsing efficient, we want to minimize the amount of backtracking
  - even if a wrong guess is made, we know that a portion of the sentence has already been analyzed; there is no need to start from scratch, since we can use the information that is already available to us
2.0 Efficient Parsing (Cont)
- Example: we have two sentences
  - "Have students in section 2 of Computer Science 203 take the exam."
  - "Have students in section 2 of Computer Science 203 taken the exam?"
  - the first nine words, "Have students in section 2 of Computer Science 203", are exactly the same, although the meanings of the two sentences are completely different
  - if an incorrect guess is made, we can still reuse the analysis of those first nine words when we backtrack
  - this requires a lot less work
2.0 Efficient Parsing (Cont)
- There are three main things that we can do to improve efficiency
  - don't do twice what you can do once
  - don't do once what you can avoid altogether
  - don't represent distinctions that you don't need
- To accomplish these, we can use a data structure known as a chart (a matrix) to store partial results (a minimal illustration follows below)
  - this is a form of dynamic programming
  - results are only calculated if they cannot be found in the chart
  - only the portion of the calculation that cannot be found in the chart is done; the rest is retrieved from the chart
  - algorithms that do this are called chart parsers
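- The chart idea itself (compute each span once, then reuse it) can be shown without a grammar; the span function below, counting binary parse shapes over a sentence, is purely our own toy illustration of the dynamic programming, not a chart parser:

```python
from functools import lru_cache

words = "Have students in section 2 of Computer Science 203 take the exam".split()

@lru_cache(maxsize=None)   # the "chart": each span is computed at most once
def bracketings(i, j):
    """Count binary parse shapes over words[i:j]; overlapping spans are
    shared between larger analyses, so nothing is ever computed twice."""
    if j - i <= 1:
        return 1
    return sum(bracketings(i, k) * bracketings(k, j) for k in range(i + 1, j))

print(bracketings(0, len(words)))   # 58786 shapes, yet only O(n^2) spans computed
```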
2.0 Efficient Parsing (Cont)
- Examples of parsing techniques
  - Top-Down, Depth-First
  - Top-Down, Breadth-First
  - Bottom-Up, Depth-First Chart
  - Prolog
  - Feature Augmented Phrase Structure
- These are not the only parsing techniques that exist
- One is free to come up with one's own algorithm for the order in which the individual words of a sentence are analyzed
2.0 Efficient Parsing (Cont)
- i) Top-Down, Depth-First
  - uses a strategy of searching for phrasal constituents from the highest node (the sentence node) to the terminal nodes (the individual lexical items) to find a match to the possible syntactic structure of the input sentence
  - stores attempts on a possibilities list as a stack data structure (LIFO)
- ii) Top-Down, Breadth-First
  - same searching strategy as Top-Down, Depth-First
  - stores attempts on a possibilities list as a queue data structure (FIFO)
2.0 Efficient Parsing (Cont)
- iii) Bottom-Up, Depth-First Chart
  - the parse begins at the word level and uses the grammar rules to build higher-level structures (bottom-up), which are combined until a goal state is reached or until all the applicable grammar rules have been exhausted
- iv) Prolog
  - relies on the functionality of the Prolog programming language to generate a parse using a Top-Down, Depth-First algorithm
  - naturally deals with constituents and their relationships
- v) Feature Augmented Phrase Structure
  - takes a sentence as input and parses it by accessing information in a featured phrase-structure grammar and lexicon
  - the parser's output is a tree
2.0 Efficient Parsing (Cont)
- Chart parsing can be represented pictorially using a combination of n + 1 vertices and a number of edges
- Notation for edge labels: [<Starting Vertex>, <Ending Vertex>, <Result> → <Part 1> ... <Part n> • <Needed Part 1> ... <Needed Part k>]
  - if the Needed Parts are added to the already available Parts, then Result would be the outcome, spanning the edge from Starting Vertex to Ending Vertex
  - see the examples (two pages down) and the sketch below
- If there are no Needed Parts (if k = 0), then the edge is called complete
  - otherwise the edge is called incomplete
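- Rendered as code, an edge might look like the following sketch; the field names are our own assumptions, not from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """[start, end, result -> found . needed], per the notation above."""
    start: int            # starting vertex
    end: int              # ending vertex
    result: str           # the constituent this edge would build
    found: tuple          # parts already available
    needed: tuple         # parts still needed (k of them)

    @property
    def complete(self):   # complete iff k = 0
        return not self.needed

# [0, 2, S -> NP . VP]: an NP spans 0..2; an S still needs a following VP.
print(Edge(0, 2, "S", ("NP",), ("VP",)).complete)   # False -> incomplete
```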
2.0 Efficient Parsing (Cont)
- Chart-parsing algorithms use a combination of top-down and bottom-up processing
  - this means the parser never has to consider certain constituents that could not lead to a complete parse
  - this also means it can handle grammars with both left-recursive rules and rules with empty right-hand sides without going into an infinite loop
  - the result of the algorithm is a packed forest of parse tree constituents rather than an enumeration of all possible trees
- Chart parsing consists of forming a chart with n + 1 vertices and adding edges to the chart one at a time, trying to produce a complete edge that spans from vertex 0 to n and is of category S (sentence): [0, n, S → NP VP •]
  - there is no backtracking: everything that is put into the chart stays there
2.0 Efficient Parsing (Cont)
Examples
- A) Edge [0, 5, S → NP VP •] says that an NP followed by a VP combine to make an S that spans the string from 0 to 5
- B) Edge [0, 2, S → NP • VP] says that an NP spans the string from 0 to 2, and if we could find a VP to follow it, then we would have an S
2.0 Efficient Parsing (Cont)
- There are four ways to add an edge to the chart
  - Initializer
    - adds an edge to indicate that we are looking for the start symbol of the grammar, S, starting at position 0, but have not found anything yet
  - Predictor
    - takes an incomplete edge that is looking for an X and adds new incomplete edges that, if completed, would build an X in the right place
  - Completer
    - takes an incomplete edge that is looking for an X and ends at vertex j, and a complete edge that begins at j and has X as its left-hand side, and combines them to make a new edge in which the X has been found
  - Scanner
    - similar to the completer, except that it uses the input words rather than existing complete edges to generate the X
2.0 Efficient Parsing (Cont)
Nondeterministic Chart Parsing Algorithm
2.0 Efficient Parsing (Cont)
- Nondeterministic Chart Parsing Algorithm
  - treats the chart as a set of edges
  - a new edge is added to the chart at every step, chosen non-deterministically from the possible additions
  - S is the start symbol and S' is a new nonterminal symbol
  - we start out looking for S (i.e. we currently have an empty string)
  - edges are added using one of the three methods (predictor, completer, scanner), one at a time, until no new edges can be added
  - at the end, if the required parse exists, it has been found
  - if none of the methods can add another edge to the set, the algorithm terminates
- A compact sketch follows below
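The algorithm listing itself did not survive into this text; below is a compact Earley-style sketch of the loop it describes. The toy grammar and lexicon (sized for "I feel it") and the plain-tuple representation of edges are our own assumptions:

```python
# Nondeterministic chart parsing, sketched as an agenda loop.
# Edges are tuples (start, end, lhs, found, needed); an edge is
# complete when `needed` is empty.

GRAMMAR = {
    "S'": [("S",)],                   # S' -> S, seeded by the INITIALIZER
    "S":  [("NP", "VP")],
    "NP": [("Pronoun",)],
    "VP": [("Verb",), ("VP", "NP")],  # left-recursive, yet no infinite loop
}
LEXICON = {"i": "Pronoun", "feel": "Verb", "it": "Pronoun"}

def parse(words):
    chart, agenda = set(), [(0, 0, "S'", (), ("S",))]    # INITIALIZER
    while agenda:                     # stop when no new edge can be added
        edge = agenda.pop()
        if edge in chart:
            continue                  # the chart only ever grows
        chart.add(edge)
        i, j, lhs, found, needed = edge
        if needed:                    # incomplete edge
            x = needed[0]
            for rhs in GRAMMAR.get(x, ()):                      # PREDICTOR
                agenda.append((j, j, x, (), rhs))
            if j < len(words) and LEXICON.get(words[j].lower()) == x:
                agenda.append((i, j + 1, lhs, found + (x,), needed[1:]))  # SCANNER
            for a, b, y, f, n in list(chart):   # COMPLETER: reuse old complete edges
                if a == j and y == x and not n:
                    agenda.append((i, b, lhs, found + (x,), needed[1:]))
        else:                                   # COMPLETER: this new complete edge
            for a, b, y, f, n in list(chart):
                if n and n[0] == lhs and b == i:
                    agenda.append((a, j, y, f + (lhs,), n[1:]))
    return chart

chart = parse("I feel it".split())
print((0, 3, "S'", ("S",), ()) in chart)   # True: a complete parse spans 0..3
```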
2.0 Efficient Parsing (Cont)
Chart for a Parse of "I feel it"
2.0 Efficient Parsing (Cont)
- Using the sample chart on the previous page, the following steps are taken to complete the parse of "I feel it" (page 1/3)
  - 1. INITIALIZER: we add the edge from vertex 0 to vertex 0 for S'; we still need to find S -- (a)
  - 2. PREDICTOR: we are looking for an incomplete edge that, if completed, would give us S; we know that S consists of NP and VP, meaning that by going from 0 to 0 we will have S if we find NP and VP -- (b)
  - 3. PREDICTOR: following a very similar rule, we know that we will have an NP if we can find a Pronoun; this condition can be achieved by going from 0 to 0, looking for a Pronoun -- (c)
  - 4. SCANNER: if we go from 0 to 1, parsing "I", we will have our NP, since a Pronoun is found -- (d)
2.0 Efficient Parsing (Cont)
- Example (continued) -- page 2/3
  - 5. COMPLETER: summarizing the steps above, we are looking for S, and by going from 0 to 1 we have an NP and are still looking for a VP -- (e)
  - 6. PREDICTOR: we are now looking for a VP, and by going from 1 to 1 we will have a VP if we can find a Verb -- (f)
  - 7. PREDICTOR: a VP can also consist of another VP and an NP, meaning that 6 would also work if we can find a VP and an NP -- (g)
  - 8. SCANNER: by going from 1 to 2 we can find a Verb, and thus we can find a VP -- (h)
  - 9. COMPLETER: using 7 and 8, we know that since a VP is found, we can complete the larger VP by going from 1 to 2 and finding an NP -- (i)
  - 10. PREDICTOR: an NP can be completed by going from 2 to 2 and finding a Pronoun -- (j)
2.0 Efficient Parsing (Cont)
- Example (continued) -- page 3/3
  - 11. SCANNER: we can find a Pronoun if we go from 2 to 3, thus completing the NP -- (k)
  - 12. COMPLETER: using 7-11, we know that the VP can be found by going from 1 to 3, thus finding the inner VP and NP -- (l)
  - 13. COMPLETER: using all of the information collected up to this point, one gets S by going from 0 to 3, finding the original NP and VP, where the VP consists of another VP and an NP -- (m)
- All of these steps are summarized in the diagram on the next page
2.0 Efficient Parsing (Cont)
Trace of a Parse of "I feel it"
2.0 Efficient Parsing (Cont)
Left-Corner Parsing Algorithm
2.0 Efficient Parsing (Cont)
- Left-Corner Parsing
  - avoids building some edges that could not possibly be part of an S spanning the whole string
  - builds up a parse tree that starts with the grammar's start symbol and extends down to the last word in the sentence
  - the Nondeterministic Chart Parsing Algorithm is an example of a left-corner parser
  - using the example on the previous slide
    - "ride the horse" would never be considered as a VP
    - this saves time, since unrealistic combinations do not have to be first worked out and then discarded
2.0 Efficient Parsing (Cont)
- Extracting Parses From the Chart: Packing
  - when the chart parsing algorithm finishes, it returns an entire chart (a collection of parse tree constituents)
  - what we really want is a parse tree (or several parse trees), e.g.
    - a) pick out parse trees that span the entire input
    - b) pick out parse trees that for some reason do not span the entire input
  - the easiest way to do this is to modify the COMPLETER so that when it combines two child edges to produce a parent edge, it stores in the parent edge the list of children that comprise it
  - when we are done with the parse, we only need to look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree (see the sketch below)
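- A sketch of that children-list idea; the Edge fields here are our own assumptions, with lexical (SCANNER) edges carrying the matched word:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    lhs: str              # category built by this edge
    children: tuple = ()  # child edges stored by the modified COMPLETER
    word: str = ""        # set on lexical (SCANNER) edges only

def tree(edge):
    """Recursively unfold the stored children lists into a parse tree."""
    if not edge.children:
        return (edge.lhs, edge.word)
    return (edge.lhs, [tree(c) for c in edge.children])

# "I feel it": S -> NP VP, where the VP packs VP -> VP NP inside it.
np1 = Edge("NP", (Edge("Pronoun", word="I"),))
vp  = Edge("VP", (Edge("VP", (Edge("Verb", word="feel"),)),
                  Edge("NP", (Edge("Pronoun", word="it"),))))
print(tree(Edge("S", (np1, vp))))
```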
2.0 Efficient Parsing (Cont)
- A Variant of the Nondeterministic Chart Parsing Algorithm
  - keeps track of the entire parse tree
  - we can look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree
3.0 Scaling Up the Lexicon
- In real text-understanding systems, the input is a sequence of characters from which the words must be extracted
- The four-step process for doing this consists of
  - tokenization
  - morphological analysis
  - dictionary lookup
  - error recovery
- Since many natural languages are fundamentally different, these steps are much harder to apply to some languages than to others
3.0 Scaling Up the Lexicon (Cont)
- a) Tokenization
  - the process of dividing the input into distinct tokens -- words and punctuation marks
  - this is not easy in some languages, like Japanese, where there are no spaces between words
  - the process is much easier in English, although it is not trivial by any means
  - examples of complications include
    - a hyphen at the end of a line may be an interword or an intraword dash
  - tokenization routines are designed to be fast, with the idea that, as long as they are consistent in breaking the input text into tokens, any problems can be handled at some later stage of processing (a toy tokenizer is sketched below)
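- A toy English tokenizer in this spirit; the single regular expression (keep intraword hyphens, split punctuation off) is our own simplification:

```python
import re

# Words (optionally hyphenated) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Fast, consistent splitting; hard cases are deferred to later stages."""
    return TOKEN.findall(text)

print(tokenize("The box is in the pen, obviously."))
# ['The', 'box', 'is', 'in', 'the', 'pen', ',', 'obviously', '.']
print(tokenize("a well-known example"))   # the hyphenated word stays whole
```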
3.0 Scaling Up the Lexicon (Cont)
- b) Morphological Analysis
  - the process of describing a word in terms of the prefixes, suffixes and root forms that comprise it
  - there are three ways that words can be composed
    - Inflectional Morphology
      - reflects the changes to a word that are needed in a particular grammatical context (Ex: most nouns take the suffix "s" when they are plural)
    - Derivational Morphology
      - derives a new word from another word that is usually of a different category (Ex: the noun "softness" is derived from the adjective "soft")
    - Compounding
      - takes two words and puts them together (Ex: "bookkeeper" is a compound of "book" and "keeper")
      - used a lot in morphologically complex languages such as German, Finnish, Turkish, Inuit, and Yupik
3.0 Scaling Up the Lexicon (Cont)
- c) Dictionary Lookup
  - is performed on every token (except for special ones such as punctuation)
  - the task is to find the word in the dictionary and return its definition
  - there are two ways to do dictionary lookup
    - store morphologically complex words first
      - complex words are written to the dictionary and then looked up when needed
    - do morphological analysis first
      - process the word before looking anything up
      - Ex: "walked" -- strip off "ed" and look up "walk" (sketched below)
      - if the verb is not marked as irregular, then "walked" is the past tense of "walk"
  - any implementation of the table abstract data type can serve as a dictionary: hash tables, binary trees, B-trees, and tries
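- A minimal "analysis first" sketch of the walked/walk case; the two-entry dictionary and the suffix list are our own toy assumptions:

```python
# Tiny dictionary and suffix rules, purely for illustration.
DICTIONARY = {
    "walk": {"category": "verb", "irregular": False},
    "box":  {"category": "noun", "irregular": False},
}
SUFFIXES = [("ed", "past tense"), ("ing", "progressive"), ("s", "plural/3sg")]

def lookup(token):
    """Try the token directly, then strip suffixes and retry."""
    if token in DICTIONARY:
        return token, None
    for suffix, feature in SUFFIXES:
        root = token[: -len(suffix)]
        if (token.endswith(suffix) and root in DICTIONARY
                and not DICTIONARY[root]["irregular"]):
            return root, feature       # e.g. walked -> walk + past tense
    return None

print(lookup("walked"))   # ('walk', 'past tense')
print(lookup("boxes"))    # None -- "boxe" isn't a root; more rules needed
```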
3.0 Scaling Up the Lexicon (Cont)
- d) Error Recovery
  - is undertaken when a word is not found in the dictionary
  - there are four types of error recovery
    - morphological rules can guess at the word's syntactic class
      - Ex: "smarply" is not in the dictionary, but it is probably an adverb
    - capitalization is a clue that a word is a proper name
    - other specialized formats denote dates, times, social security numbers, etc.
    - spelling correction routines can be used to find a word in the dictionary that is close to the input word
  - there are two popular models for defining closeness of words
    - Letter-Based Model
    - Sound-Based Model
3.0 Scaling Up the Lexicon (Cont)
- Letter-Based Model
  - an error consists of inserting or deleting a single letter, transposing two adjacent letters, or replacing one letter with another
  - Ex: a 10-letter word is one error away from 555 other words (checked in the sketch below)
    - 10 deletions -- each of the ten letters could be deleted
    - 9 swaps -- _x_x_x_x_x_x_x_x_x_ : there are nine possible swaps, where "x" signifies that the "_" on its left and right could be switched
    - 10 x 25 = 250 replacements -- each of the ten letters can be replaced by any of the (26 - 1) other letters of the alphabet
    - 11 x 26 = 286 insertions -- x_x_x_x_x_x_x_x_x_x_x : each "x" is an insertion point for any of the 26 letters of the alphabet
    - the total is 10 + 9 + 250 + 286 = 555
3.0 Scaling Up the Lexicon (Cont)
- Sound-Based Model
  - words are translated into a canonical form that preserves most of the information needed to pronounce the word, but abstracts away the details (a toy version is sketched below)
  - Ex: a word such as "attention" might be translated into the sequence a, T, a, N, S, H, a, N, where "a" stands for any vowel
  - this would mean that words such as "attension" and "atennshun" translate to the same sequence
  - if no other word in the dictionary translates into the same sequence, then we can unambiguously correct the spelling error
  - NOTE: the letter-based approach would work just as well for "attension", but not for "atennshun", which is 5 errors away from "attention"
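- The full rule set behind the slide's canonical form isn't given, but a three-rule toy version (our own: collapse doubled letters, rewrite "-tion"/"-sion" to their "shun" sound, abstract every vowel to "a") already maps the slide's examples together:

```python
import re

def canonical(word):
    """Toy sound-based canonical form (our own simplified rules)."""
    w = re.sub(r"(.)\1", r"\1", word.lower())          # tt -> t, nn -> n
    w = w.replace("tion", "shan").replace("sion", "shan")
    return re.sub(r"[aeiou]", "a", w)                  # any vowel -> 'a'

for w in ("attention", "attension", "atennshun"):
    print(w, "->", canonical(w))   # all three -> 'atanshan' (a,T,a,N,S,H,a,N)
```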
3.0 Scaling Up the Lexicon (Cont)
- Practical NLP systems have lexicons with from 10,000 to 100,000 root word forms
  - building such a sizable lexicon is very time consuming and expensive
  - this is a cost that dictionary publishing companies and companies with NLP programs have not been willing to share
- WordNet is an exception to this rule
  - a freely available dictionary, developed by a group at Princeton (led by George Miller)
  - the diagram on the next slide gives an example of the type of information returned by WordNet about the word "ride"
3.0 Scaling Up the Lexicon (Cont)
WordNet Example of the Word "ride"
3.0 Scaling Up the Lexicon (Cont)
- Although dictionaries like WordNet are useful, they do not provide all the lexical information one would like
  - frequency information is missing
    - some of the meanings are far more likely than others
    - Ex: "pen" usually means a writing instrument, although (very rarely) it can mean a female swan
  - semantic restrictions are missing
    - we need to know related information
    - Ex: with the word "ride", we may need to know whether we are talking about animals or vehicles, because the actions in the two cases are quite different
4.0 List of References
- http://nats-www.informatik.uni-hamburg.de/ -- Natural Language Systems
- http://www.he.net/hedden/intro_mt.html -- Machine Translation: A Brief Introduction
- http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt -- Masaru Tomita
- http://www.csli.stanford.edu/aac/papers.html -- Ann Copestake's Online Publications
- http://www.aventinus.de/ -- AVENTINUS: advanced information system for multilingual drug enforcement
4.0 List of References (Cont)
- http://ai10.bpa.arizona.edu/ktolle/np.html -- AZ Noun Phraser
- http://www.cam.sri.com/ -- Cambridge Computer Science Research Center
- http://www-cgi.cam.sri.com/highlight/ -- Cambridge Computer Science Research Center, Highlight
- http://www.cogs.susx.ac.uk/lab/nlp/ -- Natural Language Processing and Computational Linguistics at The University of Sussex
- http://www.cogs.susx.ac.uk/lab/nlp/lexsys/ -- LexSys: Analysis of Naturally-Occurring English Text with Stochastic Lexicalized Grammars
4.0 List of References (Cont)
- http://www.georgetown.edu/compling/parsinfo.htm -- Georgetown University: General Description of Parsers
- http://www.georgetown.edu/compling/graminfo.htm -- Georgetown University: General Information about Grammars
- http://www.georgetown.edu/cball/ling361/ling361_nlp1.html -- Georgetown University: Introduction to Computational Linguistics
- http://www.georgetown.edu/compling/module.html -- Georgetown University: Modularity in Natural Language Parsing
4.0 List of References (Cont)
- Elaine Rich, Kevin Knight: Artificial Intelligence
- Patrick Henry Winston: Artificial Intelligence
- Philip C. Jackson: Introduction to Artificial Intelligence
This presentation was brought to you by the letter A, as well as the numbers 40/40 = 100%