Natural Language Processing
- By Tim
- Adrian Gareau
- Edward Dantsiguer
Agenda
- 1.0 Definitions
- 1.1 Characteristics of Successful Machines
- 1.2 Practical Applications
  - 1.2.1 Machine Translation
  - 1.2.2 Database Access
  - 1.2.3 Text Interpretation
    - 1.2.3.1 Information Retrieval
    - 1.2.3.2 Text Categorization
    - 1.2.3.3 Extracting Data from Text
- 2.0 Efficient Parsing
- 3.0 Scaling Up the Lexicon
- 4.0 List of References
1.0 Definitions
- Natural languages are languages that living creatures use for communication
- Artificial languages are mathematically defined classes of signals that can be used for communication with machines
- A language is a set of sentences that may be used as signals to convey semantic information
- The meaning of a sentence is the semantic information it conveys
1.1 Characteristics of Successful Natural Language Systems
- Successful systems share two properties
  - they are focused on a particular domain rather than allowing discussion of any topic
  - they are focused on a particular task rather than attempting to understand language completely
- In other words, a natural language system is more likely to work correctly if the set of possible inputs is restricted: the smaller the space of possible inputs, the greater the likelihood of success
1.2 Practical Applications
- We are going to look at five practical applications of natural language processing
  - machine translation (1.2.1)
  - database access (1.2.2)
  - text interpretation (1.2.3)
    - information retrieval (1.2.3.1)
    - text categorization (1.2.3.2)
    - extracting data from text (1.2.3.3)
1.2.1 Machine Translation
- First suggestions were made in the 1930s by the Russian Smirnov-Troyansky and the Frenchman Georges Artsrouni
- First serious discussions were begun in 1946 by the mathematician Warren Weaver
- There was great hope that computers would be able to translate from one natural language to another (inspired by the success of the Allied code-breaking efforts using the British Colossus computer)
  - Turing's project translated coded messages into intelligible German
- By 1954 there was a machine translation (MT) project at Georgetown University
  - it succeeded in correctly translating several sentences from Russian into English
- After the Georgetown project, MT projects were started at MIT, Harvard and the University of Pennsylvania
1.2.1 Machine Translation (Cont)
- It soon (1966) became apparent that translation is a very complicated task and that it would be practically impossible to account for all the intricacies and nuances of natural languages
  - correct translation requires an in-depth understanding of both languages, since the structure of expressions varies from one natural language to another
- Yehoshua Bar-Hillel declared that MT was impossible (the Bar-Hillel paradox)
  - human analysis of messages relies to some extent on information that is not present in the words that make up the message
    - "The pen is in the box"
      - i.e. the writing instrument is in the container
    - "The box is in the pen"
      - i.e. the container is in the playpen or the pigpen
1.2.1 Machine Translation (Cont)
- There have been no fundamental breakthroughs in machine translation in the last 34 years
- Progress has been made on restricted domains
  - there are dozens of systems that are able to take a subset of one language and translate it fairly accurately into another language
  - these systems operate well enough to save significant sums of money over fully manual techniques (see examples two pages down)
  - among these systems, the ones operating on more restricted input produce the more impressive results
- Machine translation is NOT automatic speech recognition
1.2.1 Machine Translation (Cont)
- Examples of poor machine translations include
  - "the spirit is strong, but the body is weak" translated literally as "the vodka is strong but the meat is rotten"
  - "out of sight, out of mind" translated as "invisible, insane"
  - "hydraulic ram" translated as "male water sheep"
- These do not imply that machine translation is a waste of time
  - some mistakes are inevitable regardless of the quality and sophistication of the system
  - one has to realize that human translators also make mistakes
1.2.1 Machine Translation (Cont)
- Example machine translation systems include
  - TAUM-METEO system
    - translates weather reports from English to French
    - works very well, since the language in government weather reports is highly stylized and regular
  - SPANAM system
    - translates Spanish into English
    - worked on a more open domain
    - results were reasonably good, although the resulting English text was not always grammatical and very rarely fluent
  - AVENTINUS system
    - advanced information system for multilingual drug enforcement
    - allows law enforcement officials to know what a foreign document is about
    - sorts, classifies and analyzes drug-related information
1.2.1 Machine Translation (Cont)
- There are three basic types of machine translation
  - Machine-assisted (aided) human translation (MAHT)
    - the translation is performed by a human translator, who uses a computer as a tool to improve or speed up the translation process
  - Human-assisted (aided) machine translation (HAMT)
    - the source language text is modified by a human translator before, during or after it is translated by the computer
  - Fully automatic machine translation (FAMT)
    - the source language text is fed into the computer as a file, and the computer produces a translation automatically, without any human intervention
1.2.1 Machine Translation (Cont)
- Standing on its own, unrestricted machine translation (FAMT) is still inadequate
- Human-assisted machine translation (HAMT) can be used to improve the quality of translation
  - one possibility is to have a human reader go over the text after the translation, correcting grammar errors (post-processing)
    - the human reader can save a lot of time, since some of the text will already be translated correctly
    - sometimes a monolingual human can edit the output without reading the original
  - another possibility is to have a human editor revise the document before translation (pre-processing)
    - make the original conform to a restricted subset of a language
    - this usually allows the system to translate the resulting text without any need for post-editing
1.2.1 Machine Translation (Cont)
- Restricted languages are sometimes called "Caterpillar English"
  - Caterpillar was the first company to try writing its manuals using pre-processing
  - Xerox was the first company to use the pre-processing approach really successfully (SYSTRAN system)
    - the language defined for its manuals was highly restricted, so translation into other languages worked quite well
- There is a substantial start-up cost to any machine translation effort
  - to achieve broad coverage, translation systems need lexicons of 20,000 to 100,000 words and grammars of 100 to 10,000 rules (depending on the choice of formalism)
1.2.1 Machine Translation (Cont)
- There are several basic theoretical approaches to machine translation
  - Direct MT strategy
    - based on good glossaries and morphological analysis
    - always between a specific pair of languages
  - Transfer MT strategy
    - first, the source language is parsed into an abstract internal representation
    - a transfer is then made into the corresponding structures of the target language
  - Interlingua MT strategy
    - the idea is to create an artificial intermediate language
    - it shares all the features, and makes all the distinctions, of all languages
  - Knowledge-based strategy
    - similar to the above
    - the intermediate form is semantic in nature rather than syntactic
1.2.2 Database Access
- The first major success of natural language processing
- There was hope that databases could be controlled by natural language instead of complicated data retrieval commands
  - this was a major problem in the early 1970s, since the staff in charge of data retrieval could not keep up with users' demand for data
- The LUNAR system was the first such interface
  - built by William Woods in 1973 for the NASA Manned Spacecraft Center
  - the system was able to correctly answer 78% of questions such as "What is the average modal plagioclase concentration for lunar samples that contain rubidium?"
1.2.2 Database Access (Cont)
- Other examples of data retrieval systems include
  - CHAT system
    - developed by Fernando Pereira in 1983
    - similar level of complexity to the LUNAR system
    - worked on geographical databases
    - was restricted
    - question wording was very important
  - TEAM system
    - could handle a wider set of problems than CHAT
    - was still restricted and unable to handle all types of input
1.2.2 Database Access (Cont)
- Companies such as Natural Language Inc. and Symantec are still selling database tools that use natural language
- The ability to control databases with natural language is not as big a concern as it was in the 1970s
  - graphical user interfaces and the integration of spreadsheets, word processors, graphing utilities, report-generating utilities, etc. are of greater concern to database buyers today
  - mathematical or set notation seems to be a more natural way of communicating with a database than plain English
  - with the advent of SQL, the problem of data retrieval is not as severe as it was in the past
1.2.3 Text Interpretation
- In the early 1980s, most online information was stored in databases and spreadsheets
- Now, most online information is text: email, news, journals, articles, books, encyclopedias, reports, essays, etc.
  - there is a need to sort this information to reduce it to some comprehensible amount
- Text interpretation has become a major field in natural language processing
  - becoming more and more important with the expansion of the Internet
  - consists of
    - information retrieval
    - text categorization
    - data extraction
1.2.3.1 Information Retrieval
- Information retrieval (IR) is also known as information extraction (IE)
- Information retrieval systems analyze unrestricted text in order to extract specific types of information
- IR systems do not attempt to understand all of the text in all of the documents, but they do analyze those portions of each document that contain relevant information
  - relevance is determined by pre-defined domain guidelines, which must specify, as accurately as possible, exactly what types of information the system is expected to find
  - a query is a good example of such a pre-defined domain
  - documents that contain relevant information are retrieved while the others are ignored
1.2.3.1 Information Retrieval (Cont)
- Sometimes documents can be represented by a surrogate, such as the title plus a list of keywords and/or an abstract
- It is more common to use the full text, possibly subdivided into sections that each serve as a separate document for retrieval purposes
- The query is normally a list of words typed by the user
- Earlier systems constructed queries from Boolean combinations of words
  - users found it difficult to get good results from Boolean queries
  - it was hard to find a combination of ANDs and ORs that would produce appropriate results
1.2.3.1 Information Retrieval (Cont)
- The Boolean model has been replaced by the vector-space model in modern IR systems
  - in the vector-space model, every list of words (both the documents and the query) is treated as a vector in n-dimensional vector space (where n is the number of distinct tokens in the document collection)
  - one can use a 1 in a vector position if that word appears and a 0 if it does not
  - vectors are then compared to determine which ones are close
  - the vector model is more flexible than the Boolean model
    - documents can be ranked, and the closest matches can be reported first
1.2.3.1 Information Retrieval (Cont)
- There are many variations on the vector-space model
  - some allow stating that two words must appear near each other
  - some use a thesaurus to automatically augment the words in the query with their synonyms
- A good discriminator must be chosen in order for the system to be effective
  - common words like "a" and "the" don't tell us much, since they occur in just about every document
  - a good way to set up the retrieval is to give a term a larger weight if it appears in a small number of documents (see the sketch below)
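- A minimal sketch of this weighting idea (inverse document frequency); the toy three-document corpus and the name idf are our own assumptions, not from the slides:

```python
import math

# Toy corpus: each document is represented by its set of distinct tokens.
docs = [
    {"the", "pen", "is", "in", "box"},
    {"the", "box", "is", "in", "pen"},
    {"the", "crude", "oil", "market"},
]

def idf(term):
    """Inverse document frequency: terms that occur in few documents
    get a large weight; words that occur everywhere score near zero."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

print(idf("the"))    # 0.0  -- appears in every document, poor discriminator
print(idf("crude"))  # ~1.1 -- appears in one document, good discriminator
```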
1.2.3.1 Information Retrieval (Cont)
- Another way to think about IR is in terms of
databases. An IR system attempts to convert
unstructured text documents into codified
database entries. Database entries might be
drawn from a set of fixed values, or they can be
actual sub-strings pulled from the original
source text. - From a language processing perspective, IR
systems must operate at many levels, from word
recognition to sentence analysis, and from
understanding at the sentence level on up to
discourse analysis at the level of full text
document. - Dictionary coverage is an especially challenging
problem since open-ended documents can be filled
with all manner of jargon, abbreviations, and
proper names, not to mention typos and
telegraphic writing styles.
1.2.3.1 Information Retrieval (Cont)
- Example (vector-space model): assume that we have one very short document containing the single sentence "CPSC 533 is the best Computer Science course at UofC", and that our query is "UofC"
  - we need to set up our n-dimensional vector space: we have 10 distinct tokens (one for every word in the sentence)
  - we set up the following vector to represent the sentence: (1,1,1,1,1,1,1,1,1,1), indicating that all ten words are present
  - we set the following vector for the query: (0,0,0,0,0,0,0,0,0,1), indicating that "UofC" is the only word present in the query
  - by ANDing the two vectors together, we get (0,0,0,0,0,0,0,0,0,1), meaning that our document contains "UofC", as expected (see the sketch below)
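- The example above can be reproduced in a few lines; the naive whitespace tokenization is our own simplification for illustration:

```python
# Binary vector-space sketch of the slide's example.
document = "CPSC 533 is the best Computer Science course at UofC"
query = "UofC"

vocab = document.split()                              # the 10 distinct tokens
doc_tokens, qry_tokens = set(document.split()), set(query.split())

doc_vec = [1 if w in doc_tokens else 0 for w in vocab]   # (1,1,...,1)
qry_vec = [1 if w in qry_tokens else 0 for w in vocab]   # (0,...,0,1)

# Component-wise AND of the two 0/1 vectors.
print([d & q for d, q in zip(doc_vec, qry_vec)])   # [0,...,0,1]: "UofC" matches
```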
1.2.3.1 Information Retrieval (Cont)
- Example commercial system (HIGHLIGHT)
  - helps users find relevant information in large volumes of text and presents it in a structured fashion
  - it can extract information from newswire reports for a specific topic area (such as global banking or the oil industry) as well as current and historical financial and other data
  - although its accuracy will never match the decision-making skills of a trained human expert, HIGHLIGHT can process large amounts of text very quickly, allowing users to discover more information than even the most trained professional would have time to look for
  - see the demo at http://www-cgi.cam.sri.com/highlight/
  - could also be classified under Extracting Data From Text (1.2.3.3)
1.2.3.2 Text Categorization
- It is often desirable to sort all text into several categories
- There are a number of companies that provide their subscribers access to all news on a particular industry, company or geographic area
  - traditionally, human experts were used to assign the categories
  - in the last few years, NLP systems have proven very accurate (correctly categorizing over 90% of the news stories)
- The context in which text appears is very important, since the same word could be categorized completely differently depending on the context
  - Example: in a dictionary, the primary definition of the word "crude" is "vulgar", but in a large sample of the Wall Street Journal, "crude" refers to oil 100% of the time
1.2.3.3 Extracting Data From Text
- The task of data extraction is to take online text and derive from it some assertions that can be put into a structured database
- Examples of data extraction systems include
  - SCISOR system
    - able to take stock information text (such as the type released by the Dow Jones News Service) and extract important stock information pertaining to
      - events that took place
      - companies involved
      - starting share prices
      - quantity of shares that changed hands
      - effect on stock prices
2.0 Efficient Parsing
- Parsing: the act of analyzing the grammaticality of an utterance according to some specific grammar
  - the previous sentence was parsed according to some grammar of English, and it was determined that it was grammatical
  - we read the words in some order (from left to right, from right to left, or in random order) and analyzed them one by one
- Each parse is a different method of analyzing some target sentence according to some specified grammar
2.0 Efficient Parsing (Cont)
- Simple left-to-right parsing is often insufficient
  - it is hard to determine the nature of the sentence
  - this means that we have to make an initial guess as to what it is the sentence is saying
  - this forces us to backtrack if the guess is incorrect
- Some backtracking is inevitable
  - to make parsing efficient, we want to minimize the amount of backtracking
  - even if a wrong guess is made, we know that a portion of the sentence has already been analyzed; there is no need to start from scratch, since we can use the information that is already available to us
2.0 Efficient Parsing (Cont)
- Example: we have two sentences
  - "Have students in section 2 of Computer Science 203 take the exam."
  - "Have students in section 2 of Computer Science 203 taken the exam?"
  - the first nine words, "Have students in section 2 of Computer Science 203", are exactly the same, although the meanings of the two sentences are completely different
  - if an incorrect guess is made, we can still reuse the analysis of those first nine words when we backtrack
  - this requires a lot less work
2.0 Efficient Parsing (Cont)
- There are three main things that we can do to improve efficiency
  - don't do twice what you can do once
  - don't do once what you can avoid altogether
  - don't represent distinctions that you don't need
- To accomplish these, we can use a data structure known as a chart (a matrix) to store partial results (a minimal illustration follows below)
  - this is a form of dynamic programming
  - results are only calculated if they cannot be found in the chart
  - only the portion of the calculation that cannot be found in the chart is done; the rest is retrieved from the chart
  - algorithms that do this are called chart parsers
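- The chart idea itself (compute each span once, then reuse it) can be shown without a grammar; the span function below, counting binary parse shapes over a sentence, is purely our own toy illustration of the dynamic programming, not a chart parser:

```python
from functools import lru_cache

words = "Have students in section 2 of Computer Science 203 take the exam".split()

@lru_cache(maxsize=None)   # the "chart": each span is computed at most once
def bracketings(i, j):
    """Count binary parse shapes over words[i:j]; overlapping spans are
    shared between larger analyses, so nothing is ever computed twice."""
    if j - i <= 1:
        return 1
    return sum(bracketings(i, k) * bracketings(k, j) for k in range(i + 1, j))

print(bracketings(0, len(words)))   # 58786 shapes, yet only O(n^2) spans computed
```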
2.0 Efficient Parsing (Cont)
- Examples of parsing techniques
  - Top-Down, Depth-First
  - Top-Down, Breadth-First
  - Bottom-Up, Depth-First Chart
  - Prolog
  - Feature Augmented Phrase Structure
- These are not the only parsing techniques that exist
- One is free to come up with one's own algorithm for the order in which the individual words of a sentence are analyzed
2.0 Efficient Parsing (Cont)
- i) Top-Down, Depth-First
  - uses a strategy of searching for phrasal constituents from the highest node (the sentence node) to the terminal nodes (the individual lexical items) to find a match to the possible syntactic structure of the input sentence
  - stores attempts on a possibilities list as a stack data structure (LIFO)
- ii) Top-Down, Breadth-First
  - same searching strategy as Top-Down, Depth-First
  - stores attempts on a possibilities list as a queue data structure (FIFO)
2.0 Efficient Parsing (Cont)
- iii) Bottom-Up, Depth-First Chart
  - the parse begins at the word level and uses the grammar rules to build higher-level structures (bottom-up), which are combined until a goal state is reached or until all the applicable grammar rules have been exhausted
- iv) Prolog
  - relies on the functionality of the Prolog programming language to generate a parse using a Top-Down, Depth-First algorithm
  - naturally deals with constituents and their relationships
- v) Feature Augmented Phrase Structure
  - takes a sentence as input and parses it by accessing information in a featured phrase-structure grammar and lexicon
  - the parser's output is a tree
2.0 Efficient Parsing (Cont)
- Chart parsing can be represented pictorially using a combination of n + 1 vertices and a number of edges
- Notation for edge labels: [<Starting Vertex>, <Ending Vertex>, <Result> → <Part 1> ... <Part n> • <Needed Part 1> ... <Needed Part k>]
  - if the Needed Parts are added to the already available Parts, then Result would be the outcome, spanning the edge from Starting Vertex to Ending Vertex
  - see the examples (two pages down) and the sketch below
- If there are no Needed Parts (if k = 0), then the edge is called complete
  - otherwise the edge is called incomplete
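- Rendered as code, an edge might look like the following sketch; the field names are our own assumptions, not from the slides:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    """[start, end, result -> found . needed], per the notation above."""
    start: int            # starting vertex
    end: int              # ending vertex
    result: str           # the constituent this edge would build
    found: tuple          # parts already available
    needed: tuple         # parts still needed (k of them)

    @property
    def complete(self):   # complete iff k = 0
        return not self.needed

# [0, 2, S -> NP . VP]: an NP spans 0..2; an S still needs a following VP.
print(Edge(0, 2, "S", ("NP",), ("VP",)).complete)   # False -> incomplete
```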
2.0 Efficient Parsing (Cont)
- Chart-parsing algorithms use a combination of top-down and bottom-up processing
  - this means the parser never has to consider certain constituents that could not lead to a complete parse
  - this also means it can handle grammars with both left-recursive rules and rules with empty right-hand sides without going into an infinite loop
  - the result of the algorithm is a packed forest of parse tree constituents rather than an enumeration of all possible trees
- Chart parsing consists of forming a chart with n + 1 vertices and adding edges to the chart one at a time, trying to produce a complete edge that spans from vertex 0 to n and is of category S (sentence): [0, n, S → NP VP •]
  - there is no backtracking: everything that is put into the chart stays there
2.0 Efficient Parsing (Cont)
Examples
- A) Edge [0, 5, S → NP VP •] says that an NP followed by a VP combine to make an S that spans the string from 0 to 5
- B) Edge [0, 2, S → NP • VP] says that an NP spans the string from 0 to 2, and if we could find a VP to follow it, then we would have an S
2.0 Efficient Parsing (Cont)
- There are four ways to add an edge to the chart
  - Initializer
    - adds an edge to indicate that we are looking for the start symbol of the grammar, S, starting at position 0, but have not found anything yet
  - Predictor
    - takes an incomplete edge that is looking for an X and adds new incomplete edges that, if completed, would build an X in the right place
  - Completer
    - takes an incomplete edge that is looking for an X and ends at vertex j, and a complete edge that begins at j and has X as its left-hand side, and combines them to make a new edge in which the X has been found
  - Scanner
    - similar to the completer, except that it uses the input words rather than existing complete edges to generate the X
2.0 Efficient Parsing (Cont)
Nondeterministic Chart Parsing Algorithm
2.0 Efficient Parsing (Cont)
- Nondeterministic Chart Parsing Algorithm
  - treats the chart as a set of edges
  - a new edge is added to the chart at every step, chosen non-deterministically from the possible additions
  - S is the start symbol and S' is a new nonterminal symbol
  - we start out looking for S (i.e. we currently have an empty string)
  - edges are added using one of the three methods (predictor, completer, scanner), one at a time, until no new edges can be added
  - at the end, if the required parse exists, it has been found
  - if none of the methods can add another edge to the set, the algorithm terminates
- A compact sketch follows below
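The algorithm listing itself did not survive into this text; below is a compact Earley-style sketch of the loop it describes. The toy grammar and lexicon (sized for "I feel it") and the plain-tuple representation of edges are our own assumptions:

```python
# Nondeterministic chart parsing, sketched as an agenda loop.
# Edges are tuples (start, end, lhs, found, needed); an edge is
# complete when `needed` is empty.

GRAMMAR = {
    "S'": [("S",)],                   # S' -> S, seeded by the INITIALIZER
    "S":  [("NP", "VP")],
    "NP": [("Pronoun",)],
    "VP": [("Verb",), ("VP", "NP")],  # left-recursive, yet no infinite loop
}
LEXICON = {"i": "Pronoun", "feel": "Verb", "it": "Pronoun"}

def parse(words):
    chart, agenda = set(), [(0, 0, "S'", (), ("S",))]    # INITIALIZER
    while agenda:                     # stop when no new edge can be added
        edge = agenda.pop()
        if edge in chart:
            continue                  # the chart only ever grows
        chart.add(edge)
        i, j, lhs, found, needed = edge
        if needed:                    # incomplete edge
            x = needed[0]
            for rhs in GRAMMAR.get(x, ()):                      # PREDICTOR
                agenda.append((j, j, x, (), rhs))
            if j < len(words) and LEXICON.get(words[j].lower()) == x:
                agenda.append((i, j + 1, lhs, found + (x,), needed[1:]))  # SCANNER
            for a, b, y, f, n in list(chart):   # COMPLETER: reuse old complete edges
                if a == j and y == x and not n:
                    agenda.append((i, b, lhs, found + (x,), needed[1:]))
        else:                                   # COMPLETER: this new complete edge
            for a, b, y, f, n in list(chart):
                if n and n[0] == lhs and b == i:
                    agenda.append((a, j, y, f + (lhs,), n[1:]))
    return chart

chart = parse("I feel it".split())
print((0, 3, "S'", ("S",), ()) in chart)   # True: a complete parse spans 0..3
```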
2.0 Efficient Parsing (Cont)
Chart for a Parse of "I feel it"
2.0 Efficient Parsing (Cont)
- Using the sample chart on the previous page, the following steps are taken to complete the parse of "I feel it" (page 1/3)
  - 1. INITIALIZER: we add the edge from vertex 0 to vertex 0 for S'; we still need to find S -- (a)
  - 2. PREDICTOR: we are looking for an incomplete edge that, if completed, would give us S; we know that S consists of NP and VP, meaning that by going from 0 to 0 we will have S if we find NP and VP -- (b)
  - 3. PREDICTOR: following a very similar rule, we know that we will have an NP if we can find a Pronoun; this condition can be achieved by going from 0 to 0, looking for a Pronoun -- (c)
  - 4. SCANNER: if we go from 0 to 1, parsing "I", we will have our NP, since a Pronoun is found -- (d)
2.0 Efficient Parsing (Cont)
- Example (continued) -- page 2/3
  - 5. COMPLETER: summarizing the steps above, we are looking for S, and by going from 0 to 1 we have an NP and are still looking for a VP -- (e)
  - 6. PREDICTOR: we are now looking for a VP, and by going from 1 to 1 we will have a VP if we can find a Verb -- (f)
  - 7. PREDICTOR: a VP can also consist of another VP and an NP, meaning that 6 would also work if we can find a VP and an NP -- (g)
  - 8. SCANNER: by going from 1 to 2 we can find a Verb, and thus we can find a VP -- (h)
  - 9. COMPLETER: using 7 and 8, we know that since a VP is found, we can complete the larger VP by going from 1 to 2 and finding an NP -- (i)
  - 10. PREDICTOR: an NP can be completed by going from 2 to 2 and finding a Pronoun -- (j)
2.0 Efficient Parsing (Cont)
- Example (continued) -- page 3/3
  - 11. SCANNER: we can find a Pronoun if we go from 2 to 3, thus completing the NP -- (k)
  - 12. COMPLETER: using 7-11, we know that the VP can be found by going from 1 to 3, thus finding the inner VP and NP -- (l)
  - 13. COMPLETER: using all of the information collected up to this point, one gets S by going from 0 to 3, finding the original NP and VP, where the VP consists of another VP and an NP -- (m)
- All of these steps are summarized in the diagram on the next page
2.0 Efficient Parsing (Cont)
Trace of a Parse of "I feel it"
2.0 Efficient Parsing (Cont)
Left-Corner Parsing Algorithm
2.0 Efficient Parsing (Cont)
- Left-Corner Parsing
  - avoids building some edges that could not possibly be part of an S spanning the whole string
  - builds up a parse tree that starts with the grammar's start symbol and extends down to the last word in the sentence
  - the Nondeterministic Chart Parsing Algorithm is an example of a left-corner parser
  - using the example on the previous slide
    - "ride the horse" would never be considered as a VP
    - this saves time, since unrealistic combinations do not have to be first worked out and then discarded
2.0 Efficient Parsing (Cont)
- Extracting Parses From the Chart: Packing
  - when the chart parsing algorithm finishes, it returns an entire chart (a collection of parse tree constituents)
  - what we really want is a parse tree (or several parse trees), e.g.
    - a) pick out parse trees that span the entire input
    - b) pick out parse trees that for some reason do not span the entire input
  - the easiest way to do this is to modify the COMPLETER so that when it combines two child edges to produce a parent edge, it stores in the parent edge the list of children that comprise it
  - when we are done with the parse, we only need to look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree (see the sketch below)
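- A sketch of that children-list idea; the Edge fields here are our own assumptions, with lexical (SCANNER) edges carrying the matched word:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    lhs: str              # category built by this edge
    children: tuple = ()  # child edges stored by the modified COMPLETER
    word: str = ""        # set on lexical (SCANNER) edges only

def tree(edge):
    """Recursively unfold the stored children lists into a parse tree."""
    if not edge.children:
        return (edge.lhs, edge.word)
    return (edge.lhs, [tree(c) for c in edge.children])

# "I feel it": S -> NP VP, where the VP packs VP -> VP NP inside it.
np1 = Edge("NP", (Edge("Pronoun", word="I"),))
vp  = Edge("VP", (Edge("VP", (Edge("Verb", word="feel"),)),
                  Edge("NP", (Edge("Pronoun", word="it"),))))
print(tree(Edge("S", (np1, vp))))
```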
2.0 Efficient Parsing (Cont)
- A Variant of the Nondeterministic Chart Parsing Algorithm
  - keeps track of the entire parse tree
  - we can look in chart[n] for an edge that starts at 0, and recursively look at the children lists to reproduce a complete parse tree
3.0 Scaling Up the Lexicon
- In real text-understanding systems, the input is a sequence of characters from which the words must be extracted
- The four-step process for doing this consists of
  - tokenization
  - morphological analysis
  - dictionary lookup
  - error recovery
- Since many natural languages are fundamentally different, these steps are much harder to apply to some languages than to others
3.0 Scaling Up the Lexicon (Cont)
- a) Tokenization
  - the process of dividing the input into distinct tokens -- words and punctuation marks
  - this is not easy in some languages, like Japanese, where there are no spaces between words
  - the process is much easier in English, although it is not trivial by any means
  - examples of complications include
    - a hyphen at the end of a line may be an interword or an intraword dash
  - tokenization routines are designed to be fast, with the idea that, as long as they are consistent in breaking the input text into tokens, any problems can be handled at some later stage of processing (a toy tokenizer is sketched below)
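- A toy English tokenizer in this spirit; the single regular expression (keep intraword hyphens, split punctuation off) is our own simplification:

```python
import re

# Words (optionally hyphenated) or single punctuation marks.
TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Fast, consistent splitting; hard cases are deferred to later stages."""
    return TOKEN.findall(text)

print(tokenize("The box is in the pen, obviously."))
# ['The', 'box', 'is', 'in', 'the', 'pen', ',', 'obviously', '.']
print(tokenize("a well-known example"))   # the hyphenated word stays whole
```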
3.0 Scaling Up the Lexicon (Cont)
- b) Morphological Analysis
  - the process of describing a word in terms of the prefixes, suffixes and root forms that comprise it
  - there are three ways that words can be composed
    - Inflectional Morphology
      - reflects the changes to a word that are needed in a particular grammatical context (Ex: most nouns take the suffix "s" when they are plural)
    - Derivational Morphology
      - derives a new word from another word that is usually of a different category (Ex: the noun "softness" is derived from the adjective "soft")
    - Compounding
      - takes two words and puts them together (Ex: "bookkeeper" is a compound of "book" and "keeper")
      - used a lot in morphologically complex languages such as German, Finnish, Turkish, Inuit, and Yupik
3.0 Scaling Up the Lexicon (Cont)
- c) Dictionary Lookup
  - is performed on every token (except for special ones such as punctuation)
  - the task is to find the word in the dictionary and return its definition
  - there are two ways to do dictionary lookup
    - store morphologically complex words first
      - complex words are written to the dictionary and then looked up when needed
    - do morphological analysis first
      - process the word before looking anything up
      - Ex: "walked" -- strip off "ed" and look up "walk" (sketched below)
      - if the verb is not marked as irregular, then "walked" is the past tense of "walk"
  - any implementation of the table abstract data type can serve as a dictionary: hash tables, binary trees, B-trees, and tries
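- A minimal "analysis first" sketch of the walked/walk case; the two-entry dictionary and the suffix list are our own toy assumptions:

```python
# Tiny dictionary and suffix rules, purely for illustration.
DICTIONARY = {
    "walk": {"category": "verb", "irregular": False},
    "box":  {"category": "noun", "irregular": False},
}
SUFFIXES = [("ed", "past tense"), ("ing", "progressive"), ("s", "plural/3sg")]

def lookup(token):
    """Try the token directly, then strip suffixes and retry."""
    if token in DICTIONARY:
        return token, None
    for suffix, feature in SUFFIXES:
        root = token[: -len(suffix)]
        if (token.endswith(suffix) and root in DICTIONARY
                and not DICTIONARY[root]["irregular"]):
            return root, feature       # e.g. walked -> walk + past tense
    return None

print(lookup("walked"))   # ('walk', 'past tense')
print(lookup("boxes"))    # None -- "boxe" isn't a root; more rules needed
```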
3.0 Scaling Up the Lexicon (Cont)
- d) Error Recovery
  - is undertaken when a word is not found in the dictionary
  - there are four types of error recovery
    - morphological rules can guess at the word's syntactic class
      - Ex: "smarply" is not in the dictionary, but it is probably an adverb
    - capitalization is a clue that a word is a proper name
    - other specialized formats denote dates, times, social security numbers, etc.
    - spelling correction routines can be used to find a word in the dictionary that is close to the input word
  - there are two popular models for defining closeness of words
    - Letter-Based Model
    - Sound-Based Model
3.0 Scaling Up the Lexicon (Cont)
- Letter-Based Model
  - an error consists of inserting or deleting a single letter, transposing two adjacent letters, or replacing one letter with another
  - Ex: a 10-letter word is one error away from 555 other words (checked in the sketch below)
    - 10 deletions -- each of the ten letters could be deleted
    - 9 swaps -- _x_x_x_x_x_x_x_x_x_ : there are nine possible swaps, where "x" signifies that the "_" on its left and right could be switched
    - 10 x 25 = 250 replacements -- each of the ten letters can be replaced by any of the (26 - 1) other letters of the alphabet
    - 11 x 26 = 286 insertions -- x_x_x_x_x_x_x_x_x_x_x : each "x" is an insertion point for any of the 26 letters of the alphabet
    - the total is 10 + 9 + 250 + 286 = 555
3.0 Scaling Up the Lexicon (Cont)
- Sound-Based Model
  - words are translated into a canonical form that preserves most of the information needed to pronounce the word, but abstracts away the details (a toy version is sketched below)
  - Ex: a word such as "attention" might be translated into the sequence a, T, a, N, S, H, a, N, where "a" stands for any vowel
  - this would mean that words such as "attension" and "atennshun" translate to the same sequence
  - if no other word in the dictionary translates into the same sequence, then we can unambiguously correct the spelling error
  - NOTE: the letter-based approach would work just as well for "attension", but not for "atennshun", which is 5 errors away from "attention"
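- The full rule set behind the slide's canonical form isn't given, but a three-rule toy version (our own: collapse doubled letters, rewrite "-tion"/"-sion" to their "shun" sound, abstract every vowel to "a") already maps the slide's examples together:

```python
import re

def canonical(word):
    """Toy sound-based canonical form (our own simplified rules)."""
    w = re.sub(r"(.)\1", r"\1", word.lower())          # tt -> t, nn -> n
    w = w.replace("tion", "shan").replace("sion", "shan")
    return re.sub(r"[aeiou]", "a", w)                  # any vowel -> 'a'

for w in ("attention", "attension", "atennshun"):
    print(w, "->", canonical(w))   # all three -> 'atanshan' (a,T,a,N,S,H,a,N)
```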
3.0 Scaling Up the Lexicon (Cont)
- Practical NLP systems have lexicons with from 10,000 to 100,000 root word forms
  - building such a sizable lexicon is very time consuming and expensive
  - this is a cost that dictionary publishing companies and companies with NLP programs have not been willing to share
- WordNet is an exception to this rule
  - a freely available dictionary, developed by a group at Princeton (led by George Miller)
  - the diagram on the next slide gives an example of the type of information returned by WordNet about the word "ride"
3.0 Scaling Up the Lexicon (Cont)
WordNet Example of the Word "ride"
3.0 Scaling Up the Lexicon (Cont)
- Although dictionaries like WordNet are useful, they do not provide all the lexical information one would like
  - frequency information is missing
    - some of the meanings are far more likely than others
    - Ex: "pen" usually means a writing instrument, although (very rarely) it can mean a female swan
  - semantic restrictions are missing
    - we need to know related information
    - Ex: with the word "ride", we may need to know whether we are talking about animals or vehicles, because the actions in the two cases are quite different
4.0 List of References
- http://nats-www.informatik.uni-hamburg.de/ -- Natural Language Systems
- http://www.he.net/hedden/intro_mt.html -- Machine Translation: A Brief Introduction
- http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt -- Masaru Tomita
- http://www.csli.stanford.edu/aac/papers.html -- Ann Copestake's Online Publications
- http://www.aventinus.de/ -- AVENTINUS: advanced information system for multilingual drug enforcement
4.0 List of References (Cont)
- http://ai10.bpa.arizona.edu/ktolle/np.html -- AZ Noun Phraser
- http://www.cam.sri.com/ -- Cambridge Computer Science Research Center
- http://www-cgi.cam.sri.com/highlight/ -- Cambridge Computer Science Research Center, Highlight
- http://www.cogs.susx.ac.uk/lab/nlp/ -- Natural Language Processing and Computational Linguistics at The University of Sussex
- http://www.cogs.susx.ac.uk/lab/nlp/lexsys/ -- LexSys: Analysis of Naturally-Occurring English Text with Stochastic Lexicalized Grammars
4.0 List of References (Cont)
- http://www.georgetown.edu/compling/parsinfo.htm -- Georgetown University: General Description of Parsers
- http://www.georgetown.edu/compling/graminfo.htm -- Georgetown University: General Information about Grammars
- http://www.georgetown.edu/cball/ling361/ling361_nlp1.html -- Georgetown University: Introduction to Computational Linguistics
- http://www.georgetown.edu/compling/module.html -- Georgetown University: Modularity in Natural Language Parsing
4.0 List of References (Cont)
- Elaine Rich, Kevin Knight: Artificial Intelligence
- Patrick Henry Winston: Artificial Intelligence
- Philip C. Jackson: Introduction to Artificial Intelligence
This presentation was brought to you by the letter A, as well as the numbers 40/40 = 100%