1
Natural Language Processing
  • By Tim
  • Adrian Gareau
  • Edward Dantsiguer

2
Agenda
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

3
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

4
1.0 Definitions
  • Natural languages are languages that living
    creatures use for communication
  • Artificial Languages are mathematically defined
    classes of signals that can be used for
    communication with machines
  • A language is a set of sentences that may be used
    as signals to convey semantic information
  • The meaning of a sentence is the semantic
    information it conveys

5
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

6
1.1 Characteristics of Successful Natural
Language Systems
  • Successful systems share two properties
  • they are focused on a particular domain rather
    than allowing discussion of any topic
  • they are focused on a particular task rather than
    attempting to understand language completely
  • This means that a natural language system is more
    likely to work correctly if the set of possible
    inputs is restricted -- the size of the input
    space is inversely proportional to the likelihood
    of success

7
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

8
1.2 Practical Applications
  • We are going to look at five practical
    applications of natural language processing
  • machine translation (1.2.1)
  • database access (1.2.2)
  • text interpretation (1.2.3)
  • information retrieval (1.2.3.1)
  • text categorization (1.2.3.2)
  • extracting data from text (1.2.3.3)

9
1.2.1 Machine Translation
  • First suggestions made by the Russian
    Smirnov-Troyansky and the Frenchman C.G. Artsouni
    in the 1930s
  • First serious discussions were begun in 1946 by
    mathematician Warren Weaver
  • There was great hope that computers would be able
    to translate from one natural language to another
    (inspired by the success of the Allied efforts
    using the British Colossus computer)
  • Turing's project translated coded messages into
    intelligible German
  • By 1954 there was a machine translation (MT)
    project at Georgetown University
  • succeeded in correctly translating several
    sentences from Russian into English
  • After Georgetown project, MT projects were
    started up at MIT, Harvard and the University of
    Pennsylvania

10
1.2.1 Machine Translation (Cont)
  • It soon (1966) became apparent that translation
    is a very complicated task and it would be
    practically impossible to account for all
    intricacies and nuances of natural languages
  • correct translation would require an in-depth
    understanding of both natural languages since
    structure of expressions varies in every natural
    language
  • Yehoshua Bar-Hillel declared that MT was
    impossible (Bar-Hillel Paradox)
  • human analysis of messages relies to some extent
    on information that is not present in the words
    that make up the message
  • "The pen is in the box"
  • i.e. the writing instrument is in the container
  • "The box is in the pen"
  • i.e. the container is in the playpen or the
    pigpen

11
1.2.1 Machine Translation (Cont)
  • There have been no fundamental breakthroughs in
    machine translation in the last 34 years
  • Progress has been made on restricted domains
  • there are dozens of systems that are able to take
    a subset of one language and, fairly accurately,
    translate it to another language
  • these systems operate well enough to save
    significant sums of money over fully manual
    techniques (see examples two pages down)
  • Of these systems, the ones operating on a more
    restricted set produce more impressive results
  • Machine translation is NOT automatic speech
    recognition

12
1.2.1 Machine Translation (Cont)
  • Examples of poor machine translations would
    include
  • "the spirit is strong, but the body is weak" was
    translated literally as "the vodka is strong but
    the meat is rotten
  • "Out of sight, out of mind was translated as
    "Invisible, insane
  • "hydraulic ram was translated as "male water
    sheep
  • These do not imply that machine translation is a
    waste of time
  • some mistakes are inevitable regardless of the
    quality and sophistication of the system
  • one has to realize that human translators also
    make mistakes

13
1.2.1 Machine Translation (Cont)
  • Examples of machine translation systems include
  • TAUM-METEO system
  • translates weather reports from English to French
  • works very well since language in government
    weather reports is highly stylized and regular
  • SPANAM system
  • translates Spanish into English
  • worked on a more open domain
  • results were reasonably good although resulting
    English text was not always grammatical and very
    rarely fluent
  • AVENTINUS system
  • advanced information system for multilingual drug
    enforcement
  • allows law enforcement officials to know what the
    foreign document is about
  • sorts, classifies and analyzes drug related
    information

14
1.2.1 Machine Translation (Cont)
  • There are three basic types of machine
    translation
  • Machine-assisted (aided) human translation (MAHT)
  • the translation is performed by a human
    translator, who uses a computer as a tool to
    improve or speed up the translation process
  • Human-assisted (aided) machine translation (HAMT)
  • the source language text is modified by a human
    translator either before, during or after it is
    translated by the computer
  • Fully automatic machine translation (FAMT)
  • the source language text is fed into the computer
    as a file, and the computer produces a
    translation automatically without any human
    intervention

15
1.2.1 Machine Translation (Cont)
  • Standing on its own, unrestricted machine
    translation (FAMT) is still inadequate
  • Human-assisted machine translation (HAMT) could
    be used to improve the quality of translation
  • one possibility is to have a human reader go over
    the text after the translation, correcting
    grammar errors (post-processing)
  • human reader can save a lot of time since some of
    the text will be translated correctly
  • sometimes a monolingual human can edit the output
    without reading the original
  • another possibility is to have a human reader
    edit the document before translation
    (pre-processing)
  • make the original conform to a restricted
    subset of the language
  • this will usually allow the system to translate
    the resulting text without any requirement for
    post-editing

16
1.2.1 Machine Translation (Cont)
  • Restricted languages are sometimes called
    Caterpillar English
  • Caterpillar was the first company to try writing
    their manuals using pre-processing
  • Xerox was the first company to really
    successfully use of the pre-processing approach
    (SYSTRAN system)
  • language defined for their manuals was highly
    restricted, thus translation into other languages
    worked quite well
  • There is a substantial start-up cost to any
    machine translation effort
  • to achieve broad coverage, translation systems
    should have lexicons of 20,000 to 100,000 words
    and grammars of 100 to 10,000 rules (depending on
    the choice of formalism)

17
1.2.1 Machine Translation (Cont)
  • There are several basic theoretical approaches to
    machine translation
  • Direct MT Strategy
  • based on good glossaries and morphological
    analysis
  • always between a pair of languages
  • Transfer MT Strategy
  • first, source language is parsed into an abstract
    internal representation
  • a transfer is then made into the corresponding
    structures in the target language
  • Interlingua MT Strategy
  • the idea is to create an artificial language
  • it shares all the features and makes all the
    distinctions of all languages
  • Knowledge-Based Strategy
  • similar to the above
  • intermediate form is of semantic nature rather
    than a syntactic one

18
1.2.2 Database Access
  • The first major success of natural language
    processing
  • There was a hope that databases could be
    controlled by natural languages instead of
    complicated data retrieval commands
  • this was a major problem in the early 1970s since
    the staff in charge of data retrieval could not
    keep up with demand of users for data
  • LUNAR system was the first such interface
  • built by William Woods in 1973 for NASA Manned
    Spacecraft Center
  • the system was able to correctly answer 78% of
    questions such as "What is the average modal
    plagioclase concentration for lunar samples that
    contain rubidium?"

19
1.2.2 Database Access (Cont)
  • Other examples of data retrieval systems would
    include
  • CHAT system
  • developed by Fernando Pereira in 1983
  • similar level of complexity to LUNAR system
  • worked on geographical databases
  • was restricted
  • question wording was very important
  • TEAM system
  • could handle a wider set of problems than CHAT
  • was still restricted and unable to handle all
    types of input

20
1.2.2 Database Access (Cont)
  • Companies such as Natural Language Inc. and
    Symantec are still selling database tools that
    use natural language
  • The ability to have natural language control of
    databases is not as big a concern as it was in
    the 1970s
  • graphical user interfaces and integration of
    spreadsheets, word processors, graphing
    utilities, report generating utilities, etc. are
    of greater concern to database buyers today
  • mathematical or set notation seems to be a more
    natural way of communicating with a database than
    plain English
  • with the advent of SQL, the problem of data
    retrieval is not as major as it was in the past

21
1.2.3 Text Interpretation
  • In the early 1980s, most online information was
    stored in databases and spreadsheets
  • Now, most online information is text: email,
    news, journals, articles, books, encyclopedias,
    reports, essays, etc.
  • there is a need to sort this information to
    reduce it to some comprehensible amount
  • Has become a major field in natural language
    processing
  • becoming more and more important with expansion
    of the Internet
  • consists of
  • information retrieval
  • text categorization
  • data extraction

22
1.2.3.1 Information Retrieval
  • Information retrieval (IR) is also known as
    information extraction (IE)
  • Information retrieval systems analyze
    unrestricted text in order to extract specific
    types of information
  • IR systems do not attempt to understand all of
    the text in all of the documents, but they do
    analyze those portions of each document that
    contain relevant information
  • relevance is determined by pre-defined domain
    guidelines which must specify, as accurately as
    possible, exactly what types of information the
    system is expected to find
  • a query would be a good example of such a
    pre-defined domain
  • documents that contain relevant information are
    retrieved while others are ignored

23
1.2.3.1 Information Retrieval (Cont)
  • Sometimes documents can be represented by a
    surrogate, such as the title and a list of
    key words and/or an abstract
  • It is more common to use the full text, possibly
    subdivided into sections that each serve as a
    separate document for retrieval purposes
  • The query is normally a list of words typed by
    the user
  • Boolean combinations of words were used by
    earlier systems to construct queries
  • users found it difficult to get good results from
    Boolean queries
  • it was hard to find a combination of ANDs and
    ORs that will produce appropriate results

24
1.2.3.1 Information Retrieval (Cont)
  • Boolean model has been replaced by vector-space
    model in modern IR systems
  • in vector-space model every list of words (both
    the documents and query) is treated as a vector
    in n-dimensional vector space (where n is the
    number of distinct tokens in the document
    collection)
  • can use a 1 in a vector position if that word
    appears and 0 if it does not
  • vectors are then compared to determine which ones
    are close
  • vector model is more flexible than Boolean model
  • documents can be ranked and closest matches could
    be reported first

25
1.2.3.1 Information Retrieval (Cont)
  • There are many variations on vector-space model
  • some allow stating that two words must appear
    near each other
  • some use a thesaurus to automatically augment the
    words in the query with their synonyms
  • A good discriminator must be chosen in order for
    the system to be effective
  • common words like "a" and "the" don't tell us
    much since they occur in just about every document
  • a good way to set up the retrieval is to give a
    term a larger weight if it appears in a small
    number of documents (see the small sketch below)
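As a rough illustration of the weighting idea above, one common choice (not named in the slides) is an inverse-document-frequency weight; the add-one smoothing and the toy documents below are assumptions made only for this sketch.

```python
import math

# Hypothetical helper: weight a term higher the fewer documents it appears in
# (inverse document frequency). The +1 smoothing is an arbitrary choice here.
def idf(term: str, documents: list[set[str]]) -> float:
    df = sum(1 for doc in documents if term in doc)    # document frequency
    return math.log((len(documents) + 1) / (df + 1))

docs = [set(d.lower().split()) for d in
        ["the pen is in the box", "the box is in the pen", "crude oil prices rose"]]
print(idf("the", docs), idf("crude", docs))   # the common word gets the lower weight
```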

26
1.2.3.1 Information Retrieval (Cont)
  • Another way to think about IR is in terms of
    databases. An IR system attempts to convert
    unstructured text documents into codified
    database entries. Database entries might be
    drawn from a set of fixed values, or they can be
    actual sub-strings pulled from the original
    source text.
  • From a language processing perspective, IR
    systems must operate at many levels, from word
    recognition to sentence analysis, and from
    understanding at the sentence level on up to
    discourse analysis at the level of full text
    document.
  • Dictionary coverage is an especially challenging
    problem since open-ended documents can be filled
    with all manner of jargon, abbreviations, and
    proper names, not to mention typos and
    telegraphic writing styles.

27
1.2.3.1 Information Retrieval (Cont)
  • Example (Vector-Space Model): assume that we
    have one very short document that contains the
    sentence "CPSC 533 is the best Computer Science
    course at UofC", and that our query is "UofC"
  • we need to set up our n-dimensional vector space:
    we have 10 distinct tokens (one for every word in
    the sentence)
  • we are going to set up the following vector to
    represent the sentence (1,1,1,1,1,1,1,1,1,1) --
    indicating that all ten words are present
  • we are going to set the following vector for the
    query (0,0,0,0,0,0,0,0,0,1) -- indicating that
    "UofC" is the only word present in the query
  • by ANDing the two vectors together, we get
    (0,0,0,0,0,0,0,0,0,1), meaning that our document
    contains "UofC", as expected (a short code sketch
    of this follows below)
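The example above can be written out directly. The following is a minimal Python rendering of the binary vector-space match; the helper names and the raw overlap count used as a closeness score are assumptions made for illustration, not part of the original slides.

```python
# A minimal rendering of the binary vector-space example above.
document = "CPSC 533 is the best Computer Science course at UofC"
query = "UofC"

# The n-dimensional space is spanned by the distinct tokens in the collection.
vocabulary = sorted(set(document.split()))            # 10 distinct tokens

def to_vector(text: str) -> list[int]:
    """1 in a position if that vocabulary word appears in the text, else 0."""
    tokens = set(text.split())
    return [1 if word in tokens else 0 for word in vocabulary]

doc_vec = to_vector(document)      # (1,1,1,1,1,1,1,1,1,1)
query_vec = to_vector(query)       # a single 1 in the "UofC" position

# "AND" the two vectors position by position, as in the slide's example;
# the number of shared terms serves as a crude closeness score for ranking.
overlap = [d & q for d, q in zip(doc_vec, query_vec)]
print(overlap, "score =", sum(overlap))
```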

28
1.2.3.1 Information Retrieval (Cont)
  • Example Commercial System (HIGHLIGHT)
  • helps users find relevant information in large
    volumes of text and present it in a structured
    fashion
  • it can extract information from newswire reports
    for a specific topic area - such as global
    banking, or the oil industry - as well as current
    and historical financial and other data
  • although its accuracy will never match the
    decision-making skills of a trained human expert,
    HIGHLIGHT can process large amounts of text very
    quickly, allowing users to discover more
    information than even the most highly trained
    professional would have time to look for
  • see demo at http://www-cgi.cam.sri.com/highlight/
  • could be classified under Extracting Data From
    Text (1.2.3.3)

29
1.2.3.2 Text Categorization
  • It is often desirable to sort all text into
    several categories
  • There are a number of companies that provide
    their subscribers with access to all news on a
    particular industry, company or geographic area
  • traditionally, human experts were used to assign
    the categories
  • in the last few years, NLP systems have proven
    very accurate (correctly categorizing over 90% of
    the news stories)
  • Context in which text appears is very important
    since the same word could be categorized
    completely differently depending on the context
  • Example: in a dictionary, the primary definition
    of the word "crude" is "vulgar", but in a large
    sample of the Wall Street Journal, "crude" refers
    to oil 100% of the time

30
1.2.3.3 Extracting Data From Text
  • The task of data extraction is to take on-line
    text and derive from it assertions that can be
    put into a structured database
  • Examples of data extraction systems include
  • SCISOR system
  • able to take stock information text (such as the
    type released by Dow Jones News Service) and
    extract important stock information pertaining
    to
  • events that took place
  • companies involved
  • starting share prices
  • quantity of shares that changed hands
  • effect on stock prices

31
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

32
2.0 Efficient Parsing
  • Parsing -- the act of analyzing the
    grammaticality of an utterance according to some
    specific grammar
  • the previous sentence was parsed according to
    some grammar of English and it was determined to
    be grammatical
  • we read the words in some order (from left to
    right, from right to left, or in random order)
    and analyzed them one by one
  • Each parse is a different method of analyzing
    some target sentence according to some specified
    grammar

33
2.0 Efficient Parsing (Cont)
  • Simple left-to-right parsing is often
    insufficient
  • it is hard to determine the nature of the
    sentence
  • this means that we have to make an initial guess
    as to what it is the sentence is saying
  • this forces us to backtrack if the guess is
    incorrect
  • Some backtracking is inevitable
  • to make parsing efficient, we want to minimize
    the amount of backtracking
  • even if a wrong guess is made, we know that a
    portion of the sentence has already been analyzed
    -- there is no need to start from scratch since
    we can use the information that is available to us

34
2.0 Efficient Parsing (Cont)
  • Example: we have two sentences
  • "Have students in section 2 of Computer Science
    203 take the exam."
  • "Have students in section 2 of Computer Science
    203 taken the exam?"
  • the first nine words "Have students in section 2
    of Computer Science 203" are exactly the same,
    although the meanings of the two sentences are
    completely different
  • if an incorrect guess is made, we can still use
    the first nine words when we backtrack
  • this will require a lot less work

35
2.0 Efficient Parsing (Cont)
  • There are three main things that we can do to
    improve efficiency
  • don't do twice what you can do once
  • don't do once what you can avoid altogether
  • don't represent distinctions that you don't need
  • To accomplish these, we can use a data structure
    known as a chart (matrix) to store partial results
  • this is a form of dynamic programming
  • results are only calculated if they cannot be
    found in the chart
  • only the portion of the calculations that cannot
    be found in the chart is done, while the rest is
    retrieved from the chart
  • algorithms that do this are called chart parsers

36
2.0 Efficient Parsing (Cont)
  • Examples of parsing techniques
  • Top-Down, Depth-First
  • Top-Down, Breadth-First
  • Bottom-Up, Depth-First Chart
  • Prolog
  • Feature Augmented Phrase Structure
  • These are not the only parsing techniques that
    exist
  • One is free to come up with his or her own
    algorithm for the order in which individual words
    in every sentence will be analyzed

37
2.0 Efficient Parsing (Cont)
  • i) Top-Down, Depth-First
  • uses a strategy of searching for phrasal
    constituents from the highest node (the sentence
    node) to the terminal nodes (the individual
    lexical items) to find a match to the possible
    syntactic structure of the input sentence
  • stores attempts on a possibilities list as a
    stacked data structure (LIFO)
  • ii) Top-Down, Breadth-First
  • same searching strategy as Top-Down, Depth-First
  • stores attempts on a possibilities list as a
    queued data structure (FIFO)

38
2.0 Efficient Parsing (Cont)
  • iii) Bottom-Up, Depth-First Chart
  • parse begins at the word level and uses the
    grammar rules to build higher-level structures
    (bottom-up), which are combined until a goal
    state is reached or until all the applicable
    grammar rules have been exhausted
  • iv) Prolog
  • relies on the functionality of Prolog Programming
    Language to generate a parse using Top-Down,
    Depth-First algorithm
  • naturally deals with constituents and their
    relationships
  • v) Feature Augmented Phrase Structure
  • takes a sentence as input and parses it by
    accessing information in a feature-augmented
    phrase-structure grammar and lexicon
  • parser output is a tree

39
2.0 Efficient Parsing (Cont)
  • Chart parsing can be represented pictorially
    using a combination of n + 1 vertices and a
    number of edges
  • Notation for edge labels: [<Starting Vertex>,
    <Ending Vertex>, <Result> → <Part 1> ... <Part n> •
    <Needed Part 1> ... <Needed Part k>]
  • if the Needed Parts are added to the already
    available Parts, then Result would be the
    outcome, spanning the edge from Starting Vertex
    to Ending Vertex
  • see examples (two pages down)
  • If there are no Needed Parts (if k = 0), then the
    edge is called complete
  • the edge is called incomplete otherwise

40
2.0 Efficient Parsing (Cont)
  • Chart-parsing algorithms use a combination of
    top-down and bottom-up processing
  • this means that it never has to consider certain
    constituents that could not lead to a complete
    parse
  • this also means that it can handle grammars with
    both left-recursive rules and rules with empty
    right-hand sides without going into an infinite
    loop
  • result of our algorithm is a packed forest of
    parse tree constituents rather than an
    enumeration of all possible trees
  • Chart Parsing consists of forming a chart with
    n + 1 vertices and adding edges to the chart one
    at a time, trying to produce a complete edge that
    spans from vertex 0 to n and is of category S
    (sentence), i.e. [0, n, S → NP VP •]. There is no
    backtracking -- everything that is put into the
    chart stays there

41
2.0 Efficient Parsing (Cont)
Examples
  • A) Edge [0, 5, S → NP VP •] -- says an NP
    followed by a VP combine to make an S that spans
    the string from 0 to 5
  • B) Edge [0, 2, S → NP • VP] -- says that an NP
    spans the string from 0 to 2, and if we could
    find a VP to follow it, then we would have an S

42
2.0 Efficient Parsing (Cont)
  • There are four ways to add an edge to the chart
  • Initializer
  • adds an edge to indicate that we are looking for
    the start symbol of the grammar, S, starting at
    position 0, but have not found anything yet
  • Predictor
  • takes an incomplete edge that is looking for an X
    and adds new incomplete edges, that if completed,
    would build an X in the right place
  • Completer
  • takes an incomplete edge that is looking for an X
    and ends at vertex j and a complete edge that
    begins at j and has X as the left-hand side, and
    combines them to make a new edge where the X has
    been found
  • Scanner
  • similar to the completer, except that it uses the
    input words rather than existing complete edges
    to generate the X

43
2.0 Efficient Parsing (Cont)
Nondeterministic Chart Parsing Algorithm
44
2.0 Efficient Parsing (Cont)
  • Nondeterministic Chart Parsing Algorithm
  • treats the chart as a set of edges
  • a new edge is non-deterministically added to the
    chart at every step (an edge is
    non-deterministically chosen from the possible
    additions)
  • S is the start symbol and S' is a new
    nonterminal symbol
  • we start out looking for S (i.e. we currently
    have an empty string)
  • add edges using one of the three methods
    (predictor, completer, scanner), one at a time
    until no new edges can be added
  • at the end, if the required parse exists, it is
    found
  • if none of the methods can be used to add
    another edge to the set, the algorithm terminates
    (a minimal code sketch follows below)
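Below is a minimal sketch of this kind of nondeterministic (Earley-style) chart parser, specialized to the tiny grammar of the "I feel it" example on the following slides. The edge layout, the agenda-based control loop, and the grammar/lexicon entries are assumptions made for this sketch; the slides themselves only name the four edge-adding operations.

```python
# A minimal sketch of a nondeterministic, Earley-style chart parser.
GRAMMAR = {                       # left-hand side -> list of right-hand sides
    "S'": [["S"]],                # S' is the new start symbol the INITIALIZER looks for
    "S":  [["NP", "VP"]],
    "NP": [["Pronoun"]],
    "VP": [["Verb"], ["VP", "NP"]],   # deliberately left-recursive
}
LEXICON = {"i": "Pronoun", "feel": "Verb", "it": "Pronoun"}

# An edge is (start, end, lhs, found, needed, children), i.e.
# [start, end, lhs -> found . needed] plus the sub-edges already found.
def parse(words):
    n = len(words)
    chart = set()
    agenda = [(0, 0, "S'", (), ("S",), ())]          # INITIALIZER
    while agenda:
        edge = agenda.pop()
        if edge in chart:                            # no backtracking: edges only accumulate
            continue
        chart.add(edge)
        start, end, lhs, found, needed, children = edge
        if needed:                                   # incomplete edge, looking for needed[0]
            x = needed[0]
            # PREDICTOR: incomplete edges that, if completed, would build an X here
            for rhs in GRAMMAR.get(x, []):
                agenda.append((end, end, x, (), tuple(rhs), ()))
            # SCANNER: let the next input word supply the X
            if end < n and LEXICON.get(words[end]) == x:
                agenda.append((start, end + 1, lhs, found + (x,),
                               needed[1:], children + ((x, words[end]),)))
            # COMPLETER: use complete edges already in the chart
            for other in list(chart):
                if other[0] == end and not other[4] and other[2] == x:
                    agenda.append((start, other[1], lhs, found + (x,),
                                   needed[1:], children + (other,)))
        else:
            # COMPLETER (other direction): extend edges that were waiting for this lhs
            for o in list(chart):
                if o[1] == start and o[4] and o[4][0] == lhs:
                    agenda.append((o[0], end, o[2], o[3] + (lhs,),
                                   o[4][1:], o[5] + (edge,)))
    return [e for e in chart if e[0] == 0 and e[1] == n and e[2] == "S" and not e[4]]

for s_edge in parse("I feel it".lower().split()):
    print(s_edge[:5])   # e.g. (0, 3, 'S', ('NP', 'VP'), ()) -- a complete S spanning 0..3
```

Because every edge is checked against the chart before it is added, the left-recursive rule VP → VP NP does not cause an infinite loop, consistent with the claim made above about left-recursive grammars.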

45
2.0 Efficient Parsing (Cont)
Chart for a Parse of "I feel it"
46
2.0 Efficient Parsing (Cont)
  • Using the sample chart on the previous page, the
    following steps are taken to complete the parse
    of "I feel it" -- page 1/3
  • 1. INITIALIZER: going from vertex 0 to vertex 0
    and looking for S', we still need to find S -- (a)
  • 2. PREDICTOR: we are looking for an incomplete
    edge that, if completed, would give us S -- we
    know that S consists of NP and VP, meaning that
    by going from 0 to 0 we will have an S if we find
    an NP and a VP -- (b)
  • 3. PREDICTOR: following a very similar rule, we
    know that we will have an NP if we can find a
    Pronoun; this can be achieved by going from 0 to
    0, looking for a Pronoun -- (c)
  • 4. SCANNER: if we go from 0 to 1, parsing "I", we
    will have our NP since a Pronoun is found -- (d)

47
2.0 Efficient Parsing (Cont)
  • Example (continued) -- page 2/3
  • 5. COMPLETER: summarizing the above steps, we are
    looking for S, and by going from 0 to 1 we have
    an NP and are still looking for a VP -- (e)
  • 6. PREDICTOR: we are now looking for a VP, and by
    going from 1 to 1 we will have a VP if we can
    find a Verb -- (f)
  • 7. PREDICTOR: a VP can consist of another VP and
    an NP, meaning that 6 would also work if we can
    find a VP and an NP -- (g)
  • 8. SCANNER: by going from 1 to 2 we can find a
    Verb, thus we have a VP -- (h)
  • 9. COMPLETER: using 7 and 8, we know that since a
    VP is found, we can complete the larger VP by
    going from 1 to 2 and then finding an NP -- (i)
  • 10. PREDICTOR: an NP can be completed by going
    from 2 to 2 and finding a Pronoun -- (j)

48
2.0 Efficient Parsing (Cont)
  • Example (continued) -- page 3/3
  • 11. SCANNER: we can find a Pronoun if we go from
    2 to 3, thus completing the NP -- (k)
  • 12. COMPLETER: using 7-11, we know that a VP can
    be found by going from 1 to 3, combining the
    inner VP and NP -- (l)
  • 13. COMPLETER: using all of the information
    collected up to this point, one can get an S by
    going from 0 to 3, combining the original NP with
    the VP, where the VP consists of another VP and
    an NP -- (m)
  • All of these steps are summarized on the diagram
    on the next page

49
2.0 Efficient Parsing (Cont)
Trace of a Parse of "I feel it"
50
2.0 Efficient Parsing (Cont)
Left-Corner Parsing Algorithm
51
2.0 Efficient Parsing (Cont)
  • Left-Corner Parsing
  • avoids building some edges that could not
    possibly be part of an S spanning the whole
    string
  • builds up a parse tree that starts with the
    grammar's start symbol and extends down to the
    last word in the sentence
  • the Non-deterministic Chart Parsing Algorithm is
    an example of a left-corner parser
  • using the example on the previous slide
  • "ride the horse" would never be considered as a VP
  • this saves time since unrealistic combinations do
    not have to be first worked out and then discarded

52
2.0 Efficient Parsing (Cont)
  • Extracting Parses From the Chart (Packing)
  • when the chart parsing algorithm finishes, it
    returns an entire chart (collection of parse
    trees)
  • what we really want is a parse tree (or several
    parse trees)
  • Ex
  • a) pick out parse trees that span the entire
    input
  • b) pick out parse trees that for some reason do
    not span the entire input
  • the easiest way to do this is to modify COMPLETER
    so that when it combines two child edges to
    produce a parent edge, it stores in the parent
    edge the list of children that comprise it.
  • when we are done with the parse, we only need to
    look in chart[n] for an edge that starts at 0,
    and recursively look at the children lists to
    reproduce a complete parse tree

53
2.0 Efficient Parsing (Cont)
A Variant of Nondeterministic Chart Parsing
Algorithm
  • Keeps track of the entire parse tree
  • We can look in chart[n] for an edge that starts
    at 0, and recursively look at the children lists
    to reproduce a complete parse tree (see the short
    extraction sketch below)
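Continuing the sketch given after the algorithm description, where each edge already carries the list of its children, recovering a tree is a short recursion. The tuple layout below is the same assumed representation, not the slides' own code.

```python
# Rebuild a nested (category, ...) tree from an edge in the representation used
# by the earlier chart-parsing sketch: each edge is a 6-tuple whose last element
# holds its child edges, with lexical children stored as (category, word) pairs.
def extract_tree(edge):
    lhs, children = edge[2], edge[5]
    return (lhs,) + tuple(child if len(child) == 2 else extract_tree(child)
                          for child in children)

# Tiny hand-built example in that same representation:
np_edge = (0, 1, "NP", ("Pronoun",), (), (("Pronoun", "i"),))
vp_edge = (1, 2, "VP", ("Verb",), (), (("Verb", "feel"),))
s_edge  = (0, 2, "S", ("NP", "VP"), (), (np_edge, vp_edge))
print(extract_tree(s_edge))  # ('S', ('NP', ('Pronoun', 'i')), ('VP', ('Verb', 'feel')))
```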

54
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

55
3.0 Scaling Up the Lexicon
  • In real text-understanding systems, the input is
    a sequence of characters from which the words
    must be extracted
  • The four-step process for doing this consists of
  • tokenization
  • morphological analysis
  • dictionary lookup
  • error recovery
  • Since many natural languages are fundamentally
    different, these steps would be much harder to
    apply to some languages than others

56
3.0 Scaling Up the Lexicon (Cont)
  • a) Tokenization
  • process of dividing the input into distinct
    tokens -- words and punctuation marks.
  • this is not easy in some languages, like
    Japanese, where there are no spaces between words
  • this process is much easier in English although
    it is not trivial by any means
  • examples of complications may include
  • A hyphen at the end of the line may be an
    interword or an intraword dash
  • tokenization routines are designed to be fast,
    with the idea that as long as they are consistent
    in breaking up the input text into tokens, any
    problems can always be handled at some later
    stage of processing (a rough sketch follows below)
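A rough regex-based tokenizer along these lines; the particular token pattern, and the decision to leave end-of-line hyphens for a later stage, are assumptions made for the sketch.

```python
import re

# Hypothetical token pattern: keep hyphenated and apostrophized words together,
# and emit every other non-space character as its own punctuation token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]")

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("Don't pre-process the box; it's in the pen."))
# ["Don't", 'pre-process', 'the', 'box', ';', "it's", 'in', 'the', 'pen', '.']
```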

57
3.0 Scaling Up the Lexicon (Cont)
  • b) Morphological Analysis
  • the process of describing a word in terms of the
    prefixes, suffixes and root forms that comprise
    it
  • there are three ways that words can be composed
  • Inflectional Morphology
  • reflects the changes to a word that are needed
    in a particular grammatical context (Ex: most
    nouns take the suffix "s" when they are plural)
  • Derivational Morphology
  • derives a new word from another word that is
    usually of a different category (Ex: the noun
    "softness" is derived from the adjective "soft")
  • Compounding
  • takes two words and puts them together (Ex:
    "bookkeeper" is a compound of "book" and
    "keeper")
  • used a lot in morphologically complex languages
    such as German, Finnish, Turkish, Inuit, and Yupik

58
3.0 Scaling Up the Lexicon (Cont)
  • c) Dictionary Lookup
  • is performed on every token (except for special
    ones such as punctuation)
  • the task is to find the word in the dictionary
    and return its definition
  • two ways to do dictionary lookup
  • store morphologically complex words first
  • complex words are written to the dictionary and
    then looked up when needed
  • do morphological analysis first
  • process the word before looking anything up
  • Ex: "walked" -- strip off "ed" and look up "walk"
  • if the verb is not marked as irregular, then
    "walked" would be the past tense of "walk" (see
    the sketch after this list)
  • any implementation of the table abstract data
    type can serve as a dictionary: hash tables,
    binary trees, B-trees, and tries
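A small sketch of the "morphological analysis first" option: strip a known suffix, look up the root, and derive the entry. The dictionary contents, the suffix list, and the irregular-verb flag are illustrative assumptions, not part of the original slides.

```python
# Sketch of "morphological analysis first" dictionary lookup.
DICTIONARY = {"walk": {"pos": "verb", "irregular": False},
              "go":   {"pos": "verb", "irregular": True}}

SUFFIXES = [("ed", "past tense"), ("ing", "present participle"), ("s", "plural/3rd person")]

def lookup(word: str):
    if word in DICTIONARY:                        # stored form found directly
        return word, DICTIONARY[word], None
    for suffix, feature in SUFFIXES:              # try stripping each suffix
        root = word[:-len(suffix)]
        if word.endswith(suffix) and root in DICTIONARY:
            entry = DICTIONARY[root]
            if not entry.get("irregular"):        # "walked" -> past tense of "walk"
                return root, entry, feature
    return None                                   # hand off to error recovery

print(lookup("walked"))   # ('walk', {...}, 'past tense')
print(lookup("smarply"))  # None -> error recovery stage
```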

59
3.0 Scaling Up the Lexicon (Cont)
  • d) Error Recovery
  • is undertaken when a word is not found in the
    dictionary
  • there are four types of error recovery
  • morphological rules can guess at the word's
    syntactic class
  • Ex: "smarply" is not in the dictionary but it is
    probably an adverb
  • capitalization is a clue that a word is a proper
    name
  • other specialized formats denote dates, times,
    social security numbers, etc
  • spelling correction routines can be used to find
    a word in the dictionary that is close to the
    input word
  • there are two popular models for defining
    closeness in words
  • Letter-Based Model
  • Sound-Based Model

60
3.0 Scaling Up the Lexicon (Cont)
  • Letter-Based Model
  • an error consists of inserting or deleting a
    single letter, transposing two adjacent letters
    or replacing one letter with another
  • Ex: a 10-letter word is one error away from up
    to 555 other words
  • 10 deletions -- each of the ten letters could be
    deleted
  • 9 swaps -- _x_x_x_x_x_x_x_x_x_ there are nine
    possible swaps, where each x signifies that the
    letters on its left and right could be switched
  • 10 x 25 replacements -- each of the ten letters
    can be replaced by (26 - 1) letters of the
    alphabet
  • 11 x 26 insertions -- x_x_x_x_x_x_x_x_x_x_x and
    each x can be one of the 26 letters of the
    alphabet
  • total is 10 + 9 + 250 + 286 = 555 (see the sketch
    below)
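The enumeration above can be coded directly; this is essentially the standard one-edit-distance candidate generator, with the dictionary of valid words left as an assumption.

```python
import string

def one_edit_away(word: str) -> set[str]:
    """All strings reachable by one deletion, adjacent swap, replacement, or insertion."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]                        # 10 for a 10-letter word
    swaps    = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1] # 9 adjacent swaps
    replaces = [a + c + b[1:] for a, b in splits if b
                for c in letters if c != b[0]]                            # 10 x 25
    inserts  = [a + c + b for a, b in splits for c in letters]            # 11 x 26
    return set(deletes + swaps + replaces + inserts)

def candidate_corrections(word: str, dictionary: set[str]) -> set[str]:
    """Dictionary words that are one edit away from the misspelled input."""
    return one_edit_away(word) & dictionary

print(len(one_edit_away("dictionary")))   # at most 10 + 9 + 250 + 286 = 555 distinct strings
```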

61
3.0 Scaling Up the Lexicon (Cont)
  • Sound-Based Model
  • words are translated into a canonical form that
    preserves most of the information needed to
    pronounce the word, but abstracts away the details
  • Ex: a word such as "attention" might be
    translated into the sequence [a, T, a, N, S, H,
    a, N], where "a" stands for any vowel
  • this would mean that words such as "attension"
    and "atennshun" translate to the same sequence
  • if no other word in the dictionary translates
    into the same sequence, then we can unambiguously
    correct the spelling error
  • NOTE: the letter-based approach would work just
    as well for "attension" but not for "atennshun",
    which is 5 errors away from "attention" (a toy
    sketch follows below)
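A toy version of the canonical-form idea: collapse vowels to a placeholder and map a few spellings to the same "sound" symbol. The particular rewrite rules below are assumptions chosen only so that the three spellings in the example coincide; a real system would use a proper phonetic code rather than this ad-hoc mapping.

```python
import re

# Toy canonicalization rules (illustrative assumptions, not a real algorithm).
RULES = [(r"tion|sion|shun", "SH"),   # a few spellings of the same final sound
         (r"[aeiouy]+", "a"),         # collapse every vowel run to a placeholder vowel
         (r"(.)\1", r"\1")]           # collapse doubled consonants

def canonical(word: str) -> str:
    word = word.lower()
    for pattern, repl in RULES:
        word = re.sub(pattern, repl, word)
    return word

for w in ["attention", "attension", "atennshun"]:
    print(w, "->", canonical(w))      # all three map to the same canonical form
```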

62
3.0 Scaling Up the Lexicon (Cont)
  • Practical NLP systems have lexicons with from
    10,000 to 100,000 root word forms
  • building such a sizable lexicon is very time
    consuming and expensive
  • this has been a cost that dictionary publishing
    companies and companies with NLP programs have
    not been willing to share
  • WordNet is an exception to this rule
  • freely available dictionary, developed by a group
    at Princeton (led by George Miller)
  • the diagram on the next slide gives an example of
    the type of information returned by WordNet about
    the word "ride"

63
3.0 Scaling Up the Lexicon (Cont)
WordNet Example of the Word "ride"
64
3.0 Scaling Up the Lexicon (Cont)
  • Although dictionaries like Wordnet are useful,
    they do not provide all the lexical information
    one would like
  • frequency information is missing
  • some of the meanings are far more likely than
    others
  • Ex: "pen" usually means a writing instrument,
    although (very rarely) it can mean a female swan
  • semantic restrictions are missing
  • we need to know related information
  • Ex: with the word "ride", we may need to know
    whether we are talking about animals or vehicles
    because the actions in the two cases are quite
    different

65
Current Topic
  • 1.0 Definitions
  • 1.1 Characteristics of Successful Machines
  • 1.2 Practical Applications
  • 1.2.1 machine translation
  • 1.2.2 database access
  • 1.2.3 text interpretation
  • 1.2.3.1 information retrieval
  • 1.2.3.2 text categorization
  • 1.2.3.3 extracting data from text
  • 2.0 Efficient Parsing
  • 3.0 Scaling up the Lexicon
  • 4.0 List of References

66
4.0 List of References
  • http://nats-www.informatik.uni-hamburg.de/
    Natural Language Systems
  • http://www.he.net/hedden/intro_mt.html Machine
    Translation: A Brief Introduction
  • http://foxnet.cs.cmu.edu/people/spot/frg/Tomita.txt
    Masaru Tomita
  • http://www.csli.stanford.edu/aac/papers.html Ann
    Copestake's Online Publications
  • http://www.aventinus.de/ AVENTINUS advanced
    information system for multilingual drug
    enforcement

67
4.0 List of References (Cont)
  • http://ai10.bpa.arizona.edu/ktolle/np.html AZ
    Noun Phraser
  • http://www.cam.sri.com/ Cambridge Computer
    Science Research Center
  • http://www-cgi.cam.sri.com/highlight/ Cambridge
    Computer Science Research Center, Highlight
  • http://www.cogs.susx.ac.uk/lab/nlp/ Natural
    Language Processing and Computational Linguistics
    at The University of Sussex
  • http://www.cogs.susx.ac.uk/lab/nlp/lexsys/
    LexSys: Analysis of Naturally-Occurring English
    Text with Stochastic Lexicalized Grammars

68
4.0 List of References (Cont)
  • http://www.georgetown.edu/compling/parsinfo.htm
    Georgetown University: General Description of
    Parsers
  • http://www.georgetown.edu/compling/graminfo.htm
    Georgetown University: General Information about
    Grammars
  • http://www.georgetown.edu/cball/ling361/ling361_nlp1.html
    Georgetown University: Introduction to
    Computational Linguistics
  • http://www.georgetown.edu/compling/module.html
    Georgetown University: Modularity in Natural
    Language Parsing

69
4.0 List of References (Cont)
  • Elaine Rich and Kevin Knight, Artificial
    Intelligence
  • Patrick Henry Winston, Artificial Intelligence
  • Philip C. Jackson, Introduction to Artificial
    Intelligence

70
This presentation was brought to you by the
letter A as well as the numbers 40/40 = 100%