Transcript and Presenter's Notes

Title: Computational Linguistics


1
Computational Linguistics
  • James Pustejovsky
  • Brandeis University
  • Boston Computational Linguistics Olympiad Team
  • Fall, 2007

2
What is Computational Linguistics?
  • Computational Linguistics is the computational
    analysis of natural languages.
  • Process information contained in natural
    language.
  • Can machines understand human language?
  • Define "understand".
  • Understanding is the ultimate goal. However, one
    doesn't need to fully understand to be useful.

3
Goals of this Lecture
  • Learn about the problems and possibilities of
    natural language analysis
  • What are the major issues?
  • What are the major solutions?
  • At the end you should
  • Agree that language is subtle and interesting!
  • Know about some of the algorithms.
  • Know how difficult it can be!

4
It's 2007, but we're not anywhere close to
realizing the dream (or nightmare) of 2001
5
  • Dave Bowman: Open the pod bay doors.

Dave Bowman: Open the pod bay doors, please,
HAL.
HAL 9000: I'm sorry, Dave. I'm afraid I can't do
that.
6
Why is NLP difficult?
  • Computers are not brains
  • There is evidence that much of language
    understanding is built-in to the human brain
  • Computers do not socialize
  • Much of language is about communicating with
    people
  • Key problems
  • Representation of meaning
  • Language presupposes knowledge about the world
  • Language only reflects the surface of meaning
  • Language presupposes communication between people

7
Hidden Structure
  • English plural pronunciation
  • Toy + s → toyz (add z)
  • Book + s → books (add s)
  • Church + s → churchiz (add iz)
  • Box + s → boxiz (add iz)
  • Sheep + s → sheep (add nothing)
  • What about new words?
  • Bach + s → Bachs (why not Bachiz?)

8
Language subtleties
  • Adjective order and placement
  • A big black dog
  • A big black scary dog
  • A big scary dog
  • A scary big dog
  • A black big dog
  • Antonyms
  • Which sizes go together?
  • Big and little
  • Big and small
  • Large and small
  • Large and little

9
World Knowledge is subtle
  • He arrived at the lecture.
  • He chuckled at the lecture.
  • He arrived drunk.
  • He chuckled drunk.
  • He chuckled his way through the lecture.
  • He arrived his way through the lecture.

10
Words are ambiguous(have multiple meanings)
  • I know that.
  • I know that block.
  • I know that blocks the sun.
  • I know that block blocks the sun.

11
Headline Ambiguity
  • Iraqi Head Seeks Arms
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • Kids Make Nutritious Snacks
  • British Left Waffles on Falkland Islands
  • Red Tape Holds Up New Bridges
  • Bush Wins on Budget, but More Lies Ahead
  • Hospitals are Sued by 7 Foot Doctors
  • Ban on nude dancing on Governor's desk
  • Local high school dropouts cut in half

12
The Role of Memorization
  • Children learn words quickly
  • As many as 9 words/day
  • Often only need one exposure to associate meaning
    with word
  • Can make mistakes, e.g., overgeneralization
  • I goed to the store.
  • Exactly how they do this is still under study

13
The Role of Memorization
  • Dogs can do word association too!
  • Rico, a border collie in Germany
  • Knows the names of each of 100 toys
  • Can retrieve items called out to him with over
    90% accuracy.
  • Can also learn and remember the names of
    unfamiliar toys after just one encounter, putting
    him on a par with a three-year-old child.

http://www.nature.com/news/2004/040607/pf/040607-8_pf.html
14
But there is too much to memorize!
  • establish
  • establishment
  • the Church of England as the official state
    church.
  • disestablishment
  • antidisestablishment
  • antidisestablishmentarian
  • antidisestablishmentarianism
  • is a political philosophy that is opposed to the
    separation of church and state.

15
Rules and Memorization
  • Current thinking in psycholinguistics is that we
    use a combination of rules and memorization
  • However, this is very controversial
  • Mechanism
  • If there is an applicable rule, apply it
  • However, if there is a memorized version, that
    takes precedence. (Important for irregular
    words.)
  • Artists paint still lifes
  • Not still lives
  • Past tense of
  • think → thought
  • blink → blinked
  • This is a simplification; for more on this, see
    Pinker's Words and Rules and The Language
    Instinct.

16
Representation of Meaning
  • I know that block blocks the sun.
  • How do we represent the meanings of block?
  • How do we represent "I know"?
  • How does that differ from "I know that."?
  • Who is "I"?
  • How do we indicate that we are talking about
    earth's sun vs. some other planet's sun?
  • When did this take place? What if I move the
    block? What if I move my viewpoint? How do we
    represent this?

17
How to tackle these problems?
  • The field was stuck for quite some time.
  • A new approach started around 1990
  • Well, not really new, but the first time around,
    in the 50s, they didn't have the text, disk
    space, or GHz
  • Main idea: combine memorizing and rules
  • How to do it:
  • Get large text collections (corpora)
  • Compute statistics over the words in those
    collections
  • Surprisingly effective
  • Even better now with the Web

18
Corpus-based Example Pre-Nominal Adjective
Ordering
  • Important for translation and generation
  • Examples
  • big fat Greek wedding
  • fat Greek big wedding
  • Some approaches try to characterize this as
    semantic rules, e.g.
    Age < color, value < dimension
  • Data-intensive approaches
  • Assume adjective ordering is independent of the
    noun they modify
  • Compare how often you see <a, b> vs. <b, a>

Keller & Lapata, The Web as a Baseline,
HLT-NAACL'04
19
Corpus-based Example Pre-Nominal Adjective
Ordering
  • Data-intensive approaches
  • Compare how often you see a, b vs b, a
  • What happens when you encounter an unseen pair?
  • Shaw and Hatzivassiloglou 99 use transitive
    closures
  • Malouf 00 uses a back-off bigram model
  • P(<a,b> | a,b) vs. P(<b,a> | a,b)
  • He also uses morphological analysis, semantic
    similarity calculations and positional
    probabilities
  • Keller and Lapata 04 use just the very simple
    algorithm
  • But they use the web as their training set
  • Gets 90% accuracy on 1000 sequences
  • As good as or better than the complex algorithms

Keller & Lapata, The Web as a Baseline,
HLT-NAACL'04
20
Real-World Applications of NLP
  • Spelling Suggestions/Corrections
  • Grammar Checking
  • Synonym Generation
  • Information Extraction
  • Text Categorization
  • Automated Customer Service
  • Speech Recognition (limited)
  • Machine Translation
  • In the (near?) future
  • Question Answering
  • Improving Web Search Engine results
  • Automated Metadata Assignment
  • Online Dialogs

21
Synonym Generation
22
Synonym Generation
23
Synonym Generation
24
Levels of Language
  • Sound Structure (Phonetics and Phonology)
  • The sounds of speech and their production
  • The systematic way that sounds are differently
    realized in different environments.
  • Word Structure (Morphology)
  • From morphos, "shape" (not "transform", as in "morph")
  • Analyzes how words are formed from minimal units
    of meaning; also derivational rules
  • dog + s → dogs; eat, eats, ate
  • Phrase Structure (Syntax)
  • From the Greek syntaxis, "arrange together"
  • Describes grammatical arrangements of words into
    hierarchical structure

25
Levels of Language
  • Thematic Structure
  • Getting closer to meaning
  • Who did what to whom
  • Subject, object, predicate
  • Semantic Structure
  • How the lower levels combine to convey meaning
  • Pragmatics and Discourse Structure
  • How language is used across sentences.

26
Parsing at Every Level
  • Transforming from a surface representation to an
    underlying representation
  • It's not straightforward to do any of these
    mappings!
  • Ambiguity at every level
  • Word: is "saw" a verb or a noun?
  • Phrase: I saw the guy on the hill with the
    telescope.
  • Who is on the hill?
  • Semantic: which hill?

27
Tokens and Types
  • The term word can be used in two different ways
  • To refer to an individual occurrence of a word
  • To refer to an abstract vocabulary item
  • For example, the sentence my dog likes his dog
    contains five occurrences of words, but four
    vocabulary items.
  • To avoid confusion, use more precise terminology:
  • Word token: an occurrence of a word
  • Word type: a vocabulary item

28
Tokenization (continued)
  • Tokenization is harder than it seems
  • I'll see you in New York.
  • The aluminum-export ban.
  • The simplest approach is to use graphic words
    (i.e., separate words using whitespace)
  • Another approach is to use regular expressions to
    specify which substrings are valid words (see the
    sketch below).
  • NLTK provides a generic tokenization interface:
    TokenizerI
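
A minimal sketch of the regular-expression approach in Python (the token
pattern below is an illustrative assumption, not NLTK's actual TokenizerI):

import re

# Words (optionally with a clitic such as "I'll"), numbers, and single
# punctuation symbols.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(text):
    """Return the list of token strings found in text."""
    return TOKEN_RE.findall(text)

print(tokenize("I'll see you in New York."))
# ["I'll", 'see', 'you', 'in', 'New', 'York', '.']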

29
Terminology
  • Tagging
  • The process of associating labels with each token
    in a text
  • Tags
  • The labels
  • Tag Set
  • The collection of tags used for a particular task

30
Example
  • Typically a tagged text is a sequence of
    white-space separated base/tag tokens
  • The/at Pantheon's/np interior/nn ,/, still/rb
    in/in its/pp original/jj form/nn ,/, is/bez
    truly/ql majestic/jj and/cc an/at
    architectural/jj triumph/nn ./. Its/pp rotunda/nn
    forms/vbz a/at perfect/jj circle/nn whose/wp
    diameter/nn is/bez equal/jj to/in the/at
    height/nn from/in the/at floor/nn to/in the/at
    ceiling/nn ./.

31
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • e.g. all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • e.g. deal tagged with NN or VB
  • e.g. deal tagged with DEAL1 or DEAL2
  • Helps classification and prediction

32
Significance of Parts of Speech
  • A word's POS tells us a lot about the word and
    its neighbors
  • Limits the range of meanings (deal),
    pronunciation (OBject vs. obJECT) or both (wind)
  • Helps in stemming
  • Limits the range of following words for Speech
    Recognition
  • Can help select nouns from a document for IR
  • Basis for partial parsing (chunked parsing)
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon

33
Choosing a tagset
  • The choice of tagset greatly affects the
    difficulty of the problem
  • Need to strike a balance between
  • Getting better information about context (best to
    introduce more distinctions)
  • Making it possible for classifiers to do their job
    (need to minimize distinctions)

34
Some of the best-known Tagsets
  • Brown corpus: 87 tags
  • Penn Treebank: 45 tags
  • Lancaster UCREL C5 (used to tag the BNC): 61 tags
  • Lancaster C7: 145 tags

35
The Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each 2000 words long
  • From American books, newspapers, magazines
  • Representing genres
  • Science fiction, romance fiction, press reportage,
    scientific writing, popular lore

36
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

37
How hard is POS tagging?
In the Brown corpus, 11.5% of word types are
ambiguous, but 40% of word tokens are.
38
Important Penn Treebank tags
39
Verb inflection tags
40
The entire Penn Treebank tagset
41
Tagging methods
  • Hand-coded
  • Statistical taggers
  • Brill (transformation-based) tagger

42
Default Tagger
  • We need something to use for unseen words
  • E.g., guess NNP for a word with an initial
    capital
  • How to do this? (a sketch follows below)
  • Apply a sequence of regular expression tests
  • Assign the word to a suitable tag
  • If there are no matches
  • Assign it the most frequent tag for unknown
    words, NN
  • Other common ones are verb, proper noun,
    adjective
  • Note the role of closed-class words in English
  • Prepositions, auxiliaries, etc.
  • New ones do not tend to appear.
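
A minimal sketch of such a regular-expression cascade in Python (the patterns
and their ordering are illustrative assumptions, not the lecture's exact rules):

import re

PATTERNS = [
    (r'^[A-Z][a-z]+$', 'NNP'),   # initial capital -> proper noun
    (r'.*ing$',        'VBG'),   # gerund
    (r'.*ed$',         'VBD'),   # past tense
    (r'^\d+(\.\d+)?$', 'CD'),    # number
    (r'.*s$',          'NNS'),   # plural noun
]

def default_tag(word):
    """Return the first matching tag, falling back to NN for unknown words."""
    for pattern, tag in PATTERNS:
        if re.match(pattern, word):
            return tag
    return 'NN'

print([(w, default_tag(w)) for w in ['Brandeis', 'racing', '42', 'blocks']])
# [('Brandeis', 'NNP'), ('racing', 'VBG'), ('42', 'CD'), ('blocks', 'NNS')]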

43
Training vs. Testing
  • A fundamental idea in computational linguistics
  • Start with a collection labeled with the right
    answers
  • Supervised learning
  • Usually the labels are done by hand
  • "Train" or "teach" the algorithm on a subset of
    the labeled text.
  • Test the algorithm on a different set of data.
  • Why?
  • If memorization worked, we'd be done.
  • Need to generalize so the algorithm works on
    examples that you haven't seen yet.
  • Thus testing only makes sense on examples you
    didn't train on.

44
Evaluating a Tagger
  • Tagged tokens: the original data
  • Untag (exclude) the data
  • Tag the data with your own tagger
  • Compare the original and new tags
  • Iterate over the two lists checking for identity
    and counting
  • Accuracy = fraction correct (sketched below)
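
A minimal sketch of this evaluation loop in Python (tagged_tokens and my_tagger
are hypothetical stand-ins for your gold-standard data and your own tagger):

def evaluate(tagged_tokens, my_tagger):
    """tagged_tokens: list of (word, gold_tag) pairs. Returns accuracy."""
    words = [w for w, _ in tagged_tokens]        # "untag" the data
    predicted = [my_tagger(w) for w in words]    # re-tag it with our tagger
    correct = sum(1 for (_, gold), pred in zip(tagged_tokens, predicted)
                  if gold == pred)
    return correct / len(tagged_tokens)

gold = [('the', 'DT'), ('race', 'NN'), ('ended', 'VBD')]
print(evaluate(gold, lambda w: 'NN'))   # tag-everything-NN baseline: 1/3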

45
Language Modeling
  • Another fundamental concept in NLP
  • Main idea
  • For a given language, some words are more likely
    than others to follow each other, or
  • You can predict (with some degree of accuracy)
    the probability that a given word will follow
    another word.

46
N-Grams
  • The N stands for how many terms are used
  • Unigram: 1 term
  • Bigram: 2 terms
  • Trigram: 3 terms
  • Usually don't go beyond this
  • You can use different kinds of terms, e.g.
  • Character-based n-grams
  • Word-based n-grams
  • POS-based n-grams
  • Ordering
  • Often adjacent, but not required
  • We use n-grams to help determine the context in
    which some linguistic phenomenon happens (see the
    sketch below).
  • E.g., look at the words before and after the
    period to see if it is the end of a sentence or
    not.
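
A minimal sketch of word-based n-gram extraction in Python (character- or
POS-based n-grams work the same way over a different sequence of items):

def ngrams(tokens, n):
    """Return the list of adjacent n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the mythical unicorn grazed".split()
print(ngrams(words, 1))   # unigrams
print(ngrams(words, 2))   # bigrams: ('the', 'mythical'), ('mythical', 'unicorn'), ...
print(ngrams(words, 3))   # trigrams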

47
Features and Contexts
(Diagram: a window of tokens w_n-2 w_n-1 w_n w_n+1,
labeled CONTEXT | FEATURE | CONTEXT, with the
corresponding tags t_n-2 t_n-1 t_n t_n+1 below.)
48
Unigram Tagger
  • Trained using a tagged corpus to determine which
    tags are most common for each word (sketched
    below).
  • E.g., in a tagged WSJ sample, "deal" is tagged
    with NN 11 times, with VB 1 time, and with VBP 1
    time
  • Performance is highly dependent on the quality of
    its training set.
  • Can't be too small
  • Can't be too different from texts we actually
    want to tag
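
A minimal sketch of a unigram tagger in Python (the toy training counts are an
assumption standing in for a real tagged corpus):

from collections import Counter, defaultdict

def train_unigram_tagger(tagged_tokens):
    """Memorize the most common tag for each word; back off to NN."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    best = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    return lambda word: best.get(word, 'NN')

train = [('deal', 'NN')] * 11 + [('deal', 'VB')] + [('the', 'DT')] * 5
tagger = train_unigram_tagger(train)
print(tagger('deal'), tagger('the'), tagger('unicorn'))   # NN DT NN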

49
Nth Order Tagging
  • Order refers to how much context
  • It's one less than the N in N-gram here because
    we use the target word itself as part of the
    context.
  • 0th order: unigram tagger
  • 1st order: bigram tagger
  • 2nd order: trigram tagger
  • Bigram tagger (sketched below)
  • For tagging, in addition to considering the
    token's type, the context also considers the tags
    of the n preceding tokens
  • What is the most likely tag for w_n, given w_n-1
    and t_n-1?
  • The tagger picks the tag which is most likely for
    that context.
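
A minimal sketch of a bigram-context tagger in Python, backing off to a simpler
tagger when the (previous tag, word) context was never seen in training (the
toy sentences are assumptions):

from collections import Counter, defaultdict

def train_bigram_tagger(tagged_sents, backoff):
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = '<s>'
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    best = {ctx: tags.most_common(1)[0][0] for ctx, tags in counts.items()}

    def tag_sentence(words):
        prev, out = '<s>', []
        for w in words:
            t = best.get((prev, w), backoff(w))   # back off if context unseen
            out.append((w, t))
            prev = t
        return out
    return tag_sentence

sents = [[('to', 'TO'), ('race', 'VB')], [('the', 'DT'), ('race', 'NN')]]
tagger = train_bigram_tagger(sents, backoff=lambda w: 'NN')
print(tagger(['to', 'race']))   # [('to', 'TO'), ('race', 'VB')]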

50
Tagging with lexical frequencies
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • Problem: assign a tag to "race" given its lexical
    frequency
  • Solution: we choose the tag that has the greater
    probability
  • P(race|VB)
  • P(race|NN)
  • Actual estimates from the Switchboard corpus:
  • P(race|NN) = .00041
  • P(race|VB) = .00003

51
Rule-Based Tagger
  • The Linguistic Complaint
  • Where is the linguistic knowledge of a tagger?
  • Just a massive table of numbers
  • Aren't there any linguistic insights that could
    emerge from the data?
  • Could thus use handcrafted sets of rules to tag
    input sentences; for example, if a word follows a
    determiner, tag it as a noun.

52
The Brill tagger
  • An example of TRANSFORMATION-BASED LEARNING
  • Very popular (freely available, works fairly
    well)
  • A SUPERVISED method requires a tagged corpus
  • Basic idea: do a quick job first (using
    frequency), then revise it using contextual rules

53
Brill Tagging: In more detail
  • Start with simple (less accurate) rules; learn
    better ones from a tagged corpus
  • Tag each word initially with most likely POS
  • Examine set of transformations to see which
    improves tagging decisions compared to tagged
    corpus
  • Re-tag corpus using best transformation
  • Repeat until, e.g., performance doesn't improve
  • Result: a tagging procedure (ordered list of
    transformations) which can be applied to new,
    untagged text

54
An example
  • Examples
  • It is expected to race tomorrow.
  • The race for outer space.
  • Tagging algorithm
  • Tag all uses of "race" as NN (most likely tag in
    the Brown corpus)
  • It is expected to race/NN tomorrow
  • the race/NN for outer space
  • Use a transformation rule to replace the tag NN
    with VB for all uses of "race" preceded by the
    tag TO
  • It is expected to race/VB tomorrow
  • the race/NN for outer space

55
Transformation-based learning in the Brill tagger
  • Tag the corpus with the most likely tag for each
    word
  • Choose a TRANSFORMATION that deterministically
    replaces an existing tag with a new one such that
    the resulting tagged corpus has the lowest error
    rate
  • Apply that transformation to the training corpus
  • Repeat
  • Return a tagger that
  • first tags using unigrams
  • then applies the learned transformations in order
    (a sketch of this step follows below)
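
A minimal sketch in Python of the last step, applying an ordered list of
learned transformations after the initial unigram tagging (the single rule
mirrors the race/NN example on the previous slide; real learned rule lists
are much longer):

def apply_transformations(tagged, rules):
    """tagged: list of (word, tag); rules: list of (from_tag, to_tag, prev_tag)."""
    tags = [t for _, t in tagged]
    for from_tag, to_tag, prev_tag in rules:      # apply rules in learned order
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip([w for w, _ in tagged], tags))

initial = [('to', 'TO'), ('race', 'NN'), ('tomorrow', 'NN')]
rules = [('NN', 'VB', 'TO')]                      # change NN to VB after TO
print(apply_transformations(initial, rules))
# [('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]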

56
Examples of learned transformations
57
Templates
58
Probabilities in Language Modeling
  • A fundamental concept in NLP
  • Main idea
  • For a given language, some words are more likely
    than others to follow each other, or
  • You can predict (with some degree of accuracy)
    the probability that a given word will follow
    another word.

59
Next Word Prediction
  • From a NY Times story...
  • Stocks ...
  • Stocks plunged this .
  • Stocks plunged this morning, despite a cut in
    interest rates
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    ...
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began

60
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last
  • Stocks plunged this morning, despite a cut in
    interest rates by the Federal Reserve, as Wall
    Street began trading for the first time since
    last Tuesday's terrorist attacks.

61
Human Word Prediction
  • Clearly, at least some of us have the ability to
    predict future words in an utterance.
  • How?
  • Domain knowledge
  • Syntactic knowledge
  • Lexical knowledge

62
Claim
  • A useful part of the knowledge needed to allow
    word prediction can be captured using simple
    statistical techniques
  • In particular, we'll rely on the notion of the
    probability of a sequence (a phrase, a sentence)

63
Applications
  • Why do we want to predict a word, given some
    preceding words?
  • Rank the likelihood of sequences containing
    various alternative hypotheses, e.g. for ASR
  • Theatre owners say popcorn/unicorn sales have
    doubled...
  • Assess the likelihood/goodness of a sentence
  • for text generation or machine translation.
  • The doctor recommended a cat scan.
  • El doctor recomendó una exploración del gato.

64
N-Gram Models of Language
  • Use the previous N-1 words in a sequence to
    predict the next word
  • Language Model (LM)
  • unigrams, bigrams, trigrams,
  • How do we train these models?
  • Very large corpora

65
Simple N-Grams
  • Assume a language has V word types in its
    lexicon, how likely is word x to follow word y?
  • Simplest model of word probability: 1/V
  • Alternative 1: estimate likelihood of x occurring
    in new text based on its general frequency of
    occurrence estimated from a corpus (unigram
    probability)
  • popcorn is more likely to occur than unicorn
  • Alternative 2: condition the likelihood of x
    occurring in the context of previous words
    (bigrams, trigrams, ...)
  • mythical unicorn is more likely than mythical
    popcorn

66
A Word on Notation
  • P(unicorn)
  • Read this as "the probability of seeing the token
    unicorn"
  • Unigram tagger uses this.
  • P(unicorn | mythical)
  • Called the Conditional Probability.
  • Read this as "the probability of seeing the token
    unicorn given that you've seen the token mythical"
  • Bigram tagger uses this.
  • Related to the conditional frequency
    distributions that we've been working with.

67
Computing the Probability of a Word Sequence
  • Compute the product of component conditional
    probabilities?
  • P(the mythical unicorn) =
  • P(the) P(mythical | the) P(unicorn | the mythical)
  • The longer the sequence, the less likely we are
    to find it in a training corpus
  • P(Most biologists and folklore specialists
    believe that in fact the mythical unicorn horns
    derived from the narwhal)
  • Solution: approximate using n-grams

68
Bigram Model
  • Approximate
  • P(unicorn | the mythical) by P(unicorn | mythical)
  • Markov assumption
  • The probability of a word depends only on the
    probability of a limited history
  • Generalization
  • The probability of a word depends only on the
    probability of the n previous words
  • trigrams, 4-grams,
  • the higher n is, the more data needed to train
  • backoff models

69
Using N-Grams
  • For N-gram models
  • P(w_n-1, w_n) = P(w_n | w_n-1) P(w_n-1)
  • By the Chain Rule we can decompose a joint
    probability, e.g. P(w1,w2,w3)
  • P(w_1, w_2, ..., w_n) = P(w_1 | w_2, w_3, ..., w_n)
    P(w_2 | w_3, ..., w_n) ... P(w_n-1 | w_n) P(w_n)
  • For bigrams then, the probability of a sequence
    is just the product of the conditional
    probabilities of its bigrams (see the sketch below)
  • P(the, mythical, unicorn) = P(unicorn | mythical)
    P(mythical | the) P(the | <start>)
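
A minimal sketch in Python of estimating bigram probabilities from counts and
multiplying them along a sequence (the tiny corpus is made up for illustration,
and no smoothing is used):

from collections import Counter

corpus = "<s> the mythical unicorn grazed </s> <s> the mythical forest slept </s>".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(word, prev):
    """P(word | prev) as the relative frequency of the bigram."""
    return bigrams[(prev, word)] / unigrams[prev]

def sequence_prob(words, start='<s>'):
    prob, prev = 1.0, start
    for w in words:
        prob *= p(w, prev)   # product of bigram conditional probabilities
        prev = w
    return prob

print(sequence_prob(['the', 'mythical', 'unicorn']))   # 1.0 * 1.0 * 0.5 = 0.5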

70
Training and Testing
  • N-Gram probabilities come from a training corpus
  • overly narrow corpus: probabilities don't
    generalize
  • overly general corpus: probabilities don't
    reflect the task or domain
  • A separate test corpus is used to evaluate the
    model, typically using standard metrics
  • held-out test set; development test set
  • cross validation
  • results tested for statistical significance

71
Shallow (Chunk) Parsing
  • Goal: divide a sentence into a sequence of
    chunks.
  • Chunks are non-overlapping regions of a text
  • I saw a tall man in the park.
  • Chunks are non-recursive
  • A chunk can not contain other chunks
  • Chunks are non-exhaustive
  • Not all words are included in chunks

72
Chunk Parsing Examples
  • Noun-phrase chunking
  • I saw a tall man in the park.
  • Verb-phrase chunking
  • The man who was in the park saw me.
  • Prosodic chunking
  • I saw a tall man in the park.
  • Question answering
  • What Spanish explorer discovered the
    Mississippi River?

73
Shallow Parsing Motivation
  • Locating information
  • e.g., text retrieval
  • Index a document collection on its noun phrases
  • Ignoring information
  • Generalize in order to study higher-level
    patterns
  • e.g., phrases involving "gave" in the Penn Treebank:
  • gave NP; gave up NP in NP; gave NP up; gave NP
    help; gave NP to NP
  • Sometimes a full parse has too much structure
  • Too nested
  • Chunks usually are not recursive

74
Representation
  • BIO (or IOB) tags
  • Trees

75
Comparison with Full Syntactic Parsing
  • Parsing is usually an intermediate stage
  • Builds structures that are used by later stages
    of processing
  • Full parsing is a sufficient but not necessary
    intermediate stage for many NLP tasks
  • Parsing often provides more information than we
    need
  • Shallow parsing is an easier problem
  • Less word-order flexibility within chunks than
    between chunks
  • More locality
  • Fewer long-range dependencies
  • Less context-dependence
  • Less ambiguity

76
Chunks and Constituency
  • Constituents: a tall man in the park.
  • Chunks: a tall man in the park.
  • A constituent is part of some higher unit in the
    hierarchical syntactic parse
  • Chunks are not constituents
  • Constituents are recursive
  • But, chunks are typically subsequences of
    constituents
  • Chunks do not cross major constituent boundaries

77
Chunking
  • Define a regular expression that matches the
    sequences of tags in a chunk
  • A simple noun phrase chunk regexp
  • (Note that <NN.?> matches any tag starting with
    NN)
  • <DT>? <JJ>* <NN.?>
  • Chunk all matching subsequences (see the sketch
    below)
  • the/DT little/JJ cat/NN sat/VBD on/IN the/DT
    mat/NN
  • [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT
    mat/NN]
  • If matching subsequences overlap, the first one
    gets priority
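
A sketch of this chunk rule using NLTK's RegexpParser (assuming NLTK is
installed; the grammar string is the tag pattern above):

import nltk

grammar = "NP: {<DT>?<JJ>*<NN.?>}"   # optional determiner, any adjectives, a noun
chunker = nltk.RegexpParser(grammar)

sentence = [('the', 'DT'), ('little', 'JJ'), ('cat', 'NN'),
            ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
tree = chunker.parse(sentence)
print(tree)   # chunks (NP the/DT little/JJ cat/NN) and (NP the/DT mat/NN)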

78
Unchunking
  • Remove any chunk with a given pattern
  • e.g., UnChunkRule('<NN|DT>', 'Unchunk NN and DT')
  • Combine with ChunkRule('<NN|DT|JJ>')
  • Chunk all matching subsequences
  • Input
  • the/DT little/JJ cat/NN sat/VBD on/IN the/DT
    mat/NN
  • Apply chunk rule
  • [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT
    mat/NN]
  • Apply unchunk rule
  • [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT
    mat/NN

79
Chinking
  • A chink is a subsequence of the text that is not
    a chunk.
  • Define a regular expression that matches the
    sequences of tags in a chink
  • A simple chink regexp for finding NP chunks
  • (<VB.?>|<IN>)
  • First apply a chunk rule to chunk everything
  • Input
  • the/DT little/JJ cat/NN sat/VBD on/IN the/DT
    mat/NN
  • ChunkRule('<.*>+', 'Chunk everything')
  • [the/DT little/JJ cat/NN sat/VBD on/IN the/DT
    mat/NN]
  • Apply the chink rule above
  • [the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT
    mat/NN]

Chunk
Chink
Chunk
80
Merging
  • Combine adjacent chunks into a single chunk
  • Define a regular expression that matches the
    sequences of tags on both sides of the point to
    be merged
  • Example
  • Merge a chunk ending in JJ with a chunk starting
    with NN
  • MergeRule('<JJ>', '<NN>', 'Merge adjs and
    nouns')
  • [the/DT little/JJ] [cat/NN] sat/VBD on/IN
    the/DT mat/NN
  • [the/DT little/JJ cat/NN] sat/VBD on/IN the/DT
    mat/NN
  • Splitting is the opposite of merging

81
Applying Chunking to Treebank Data
82
(No Transcript)
83
(No Transcript)
84
Classifying at Different Granularies
  • Text Categorization
  • Classify an entire document
  • Information Extraction (IE)
  • Identify and classify small units within
    documents
  • Named Entity Extraction (NE)
  • A subset of IE
  • Identify and classify proper names
  • People, locations, organizations

85
Example The Problem Looking for a Job
Martin Baker, a person
Genomics job
Employer's job posting form
86
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
87
What is Information Extraction
As a task
Filling slots in a database from sub-segments of
text.
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
IE
NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..
88
What is Information Extraction
As a familyof techniques
Information Extraction segmentation
classification association
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
aka named entity extraction
89
What is Information Extraction
A familyof techniques
Information Extraction segmentation
classification association
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
90
What is Information Extraction
A familyof techniques
Information Extraction segmentation
classification association
October 14, 2002, 4:00 a.m. PT. For years,
Microsoft Corporation CEO Bill Gates railed
against the economic philosophy of open-source
software with Orwellian fervor, denouncing its
communal licensing as a "cancer" that stifled
technological innovation. Today, Microsoft
claims to "love" the open-source concept, by
which software code is made public to encourage
improvement and development by outside
programmers. Gates himself says Microsoft will
gladly disclose its crown jewels--the coveted
code behind the Windows operating system--to
select customers. "We can be open source. We
love the concept of shared source," said Bill
Veghte, a Microsoft VP. "That's a super-important
shift for us in terms of code access. Richard
Stallman, founder of the Free Software
Foundation, countered saying
Microsoft Corporation
CEO
Bill Gates
Microsoft
Gates
Microsoft
Bill Veghte
Microsoft
VP
Richard Stallman
founder
Free Software Foundation
91
IE in Context
(Diagram: a document collection is spidered and
filtered by relevance, then fed to IE (segment,
classify, associate, cluster); the results are
loaded into a database for querying, searching,
and data mining; an ontology and labeled training
data are used to train the extraction models.)
92
Landscape of IE Tasks: Degree of Formatting
93
Landscape of IE Tasks: Intended Breadth of
Coverage
(Diagram contrasting three breadths: web-site-specific
extraction driven by formatting, e.g. Amazon.com book
pages; genre-specific extraction driven by layout,
e.g. resumes; and wide, non-specific extraction driven
by language, e.g. university names.)
94
Landscape of IE Tasks: Complexity
95
Landscape of IE Tasks: Single Field/Record
Jack Welch will retire as CEO of General Electric
tomorrow. The top role at the Connecticut
company will be filled by Jeffrey Immelt.
Single entity (named entity extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut
Binary relationship:
  Relation: Person-Title; Person: Jack Welch; Title: CEO
  Relation: Company-Location; Company: General Electric; Location: Connecticut
N-ary record:
  Relation: Succession; Company: General Electric; Title: CEO;
  Out: Jack Welch; In: Jeffrey Immelt
96
State of the Art Performance: a sample
  • Named entity recognition from newswire text
  • Person, Location, Organization,
  • F1 in high 80s or low- to mid-90s
  • Binary relation extraction
  • Contained-in(Location1, Location2); Member-of
    (Person1, Organization1)
  • F1 in 60s or 70s or 80s
  • Web site structure recognition
  • Extremely accurate performance obtainable
  • Human effort (10min?) required on each site

97
Three generations of IE systems
  • Hand-Built Systems (Knowledge Engineering),
    1980s
  • Rules written by hand
  • Require experts who understand both the systems
    and the domain
  • Iterative guess-test-tweak-repeat cycle
  • Automatic, Trainable Rule-Extraction Systems,
    1990s
  • Rules discovered automatically using predefined
    templates, using automated rule learners
  • Require huge, labeled corpora (effort is just
    moved!)
  • Statistical Models 1997
  • Use machine learning to learn which features
    indicate boundaries and types of entities.
  • Learning is usually supervised; may be partially
    unsupervised

98
Landscape of IE Techniques
(Diagram: a range of models, from lexicons on up. Lexicon
example: is a candidate string in "Abraham Lincoln was
born in Kentucky." a member of a list of state names such
as Alabama, Alaska, ..., Wisconsin, Wyoming?)
Any of these models can be used to capture words,
formatting, or both.
99
Trainable IE systems
  • Pros
  • Annotating text is simpler and faster than writing
    rules.
  • Domain independent
  • Domain experts don't need to be linguists or
    programmers.
  • Learning algorithms ensure full coverage of
    examples.
  • Cons
  • Hand-crafted systems perform better, especially
    at hard tasks. (but this is changing)
  • Training data might be expensive to acquire
  • May need huge amount of training data
  • Hand-writing rules isn't that hard!!

100
MUC: the genesis of IE
  • DARPA funded significant efforts in IE in the
    early to mid 1990s.
  • Message Understanding Conference (MUC) was an
    annual event/competition where results were
    presented.
  • Focused on extracting information from news
    articles
  • Terrorist events
  • Industrial joint ventures
  • Company management changes
  • Information extraction of particular interest to
    the intelligence community (CIA, NSA). (Note
    early 90s)

101
Message Understanding Conference (MUC)
  • Named entity
  • Person, Organization, Location
  • Co-reference
  • Clinton → President Bill Clinton
  • Template element
  • Perpetrator, Target
  • Template relation
  • Incident
  • Multilingual

102
MUC Typical Text
  • Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan. The joint venture,
    Bridgestone Sports Taiwan Co., capitalized at 20
    million new Taiwan dollars, will start production
    of 20,000 iron and metal wood clubs a month

103
MUC Typical Text
  • Bridgestone Sports Co. said Friday it has set up
    a joint venture in Taiwan with a local concern
    and a Japanese trading house to produce golf
    clubs to be shipped to Japan. The joint venture,
    Bridgestone Sports Taiwan Co., capitalized at 20
    million new Taiwan dollars, will start production
    of 20,000 iron and metal wood clubs a month

104
MUC Templates
  • Relationship
  • tie-up
  • Entities
  • Bridgestone Sports Co, a local concern, a
    Japanese trading house
  • Joint venture company
  • Bridgestone Sports Taiwan Co
  • Activity
  • ACTIVITY 1
  • Amount
  • NT$2,000,000

105
MUC Templates
  • ACTIVITY 1
  • Activity
  • Production
  • Company
  • Bridgestone Sports Taiwan Co
  • Product
  • Iron and metal wood clubs
  • Start Date
  • January 1990

106
Example of IE from FASTUS (1993)
107
Example of IE: FASTUS (1993)
108
Example of IE: FASTUS (1993), Resolving anaphora
109
Evaluating IE Accuracy
  • Always evaluate performance on independent,
    manually-annotated test data not used during
    system development.
  • Measure for each test document
  • Total number of correct extractions in the
    solution template: N
  • Total number of slot/value pairs extracted by the
    system: E
  • Number of extracted slot/value pairs that are
    correct (i.e. in the solution template): C
  • Compute average value of metrics adapted from IR
    (sketched below)
  • Recall = C/N
  • Precision = C/E
  • F-measure = harmonic mean of recall and precision
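
A minimal sketch of these metrics in Python, over sets of extracted and gold
(solution-template) slot/value pairs (the example pairs are made up):

def evaluate_ie(extracted, gold):
    correct = len(extracted & gold)           # C
    recall = correct / len(gold)              # C / N
    precision = correct / len(extracted)      # C / E
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {('NAME', 'Bill Gates'), ('TITLE', 'CEO'), ('ORG', 'Microsoft')}
extracted = {('NAME', 'Bill Gates'), ('TITLE', 'VP')}
print(evaluate_ie(extracted, gold))   # precision 0.5, recall 0.33, F1 0.4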

110
MUC Information Extraction: State of the Art c.
1997
NE = named entity recognition; CO = coreference
resolution; TE = template element construction;
TR = template relation construction; ST = scenario
template production
111
Finite State Transducers for IE
  • Basic method for extracting relevant information
  • IE systems generally use a collection of
    specialized FSTs (a regex sketch follows below)
  • Company Name detection
  • Person Name detection
  • Relationship detection
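
A minimal sketch in Python of one such specialized detector written as a plain
regular expression (the company-suffix list is an illustrative assumption; real
systems use cascades of finite-state transducers):

import re

COMPANY_RE = re.compile(r'\b(?:[A-Z][A-Za-z]*\s)+(?:Co\.|Corp\.|Corporation|Inc\.)')

text = ("Bridgestone Sports Co. said Friday it has set up a joint venture "
        "with Microsoft Corporation.")
print(COMPANY_RE.findall(text))
# ['Bridgestone Sports Co.', 'Microsoft Corporation']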

112
Three Equivalent Representations
Regular expressions
Each can describe the others
Regular languages
Finite automata
Theorem: For every regular expression, there is
a deterministic finite-state automaton that
defines the same language, and vice versa.
113
Question Answering
  • Today
  • Introduction to QA
  • A typical full-fledged QA system
  • A very simple system, in response to this
  • An intermediate approach
  • Wednesday
  • Using external resources
  • WordNet
  • Encyclopedias, Gazetteers
  • Incorporating a reasoning system
  • Machine Learning of mappings
  • Other question types (e.g., biography,
    definitions)

114
A Spectrum of Search Types
  • What is the typical height of a giraffe?
  • What are some good ideas for landscaping my
    client's yard?
  • What are some promising untried treatments for
    Raynaud's disease?

115
Beyond Document Retrieval
  • Document Retrieval
  • Users submit queries corresponding to their
    information needs.
  • System returns (voluminous) list of full-length
    documents.
  • It is the responsibility of the users to find
    information of interest within the returned
    documents.
  • Open-Domain Question Answering (QA)
  • Users ask questions in natural language.
  • What is the highest volcano in Europe?
  • System returns list of short answers.
  • Under Mount Etna, the highest volcano
    in Europe, perches the fabulous town
  • A real use for NLP

116
Questions and Answers
  • What is the height of a typical giraffe?
  • The result can be a simple answer, extracted from
    existing web pages.
  • Can specify with keywords or a natural language
    query
  • However, most web search engines are not set up
    to handle questions properly.
  • Get different results using a question vs.
    keywords

117
(No Transcript)
118
(No Transcript)
119
(No Transcript)
120
(No Transcript)
121
The Problem of Question Answering
When was the San Francisco fire? were
driven over it. After the ceremonial tie was
removed - it burned in the San Francisco fire of
1906 - historians believe an unknown Chinese
worker probably drove the last steel spike into a
wooden tie. If so, it was only
What is the nationality of Pope John Paul II?
stabilize the country with its help, the
Catholic hierarchy stoutly held out for
pluralism, in large part at the urging of
Polish-born Pope John Paul II. When the Pope
emphatically defended the Solidarity trade union
during a 1987 tour of the
Where is the Taj Mahal? list of more
than 360 cities around the world includes the
Great Reef in Australia, the Taj Mahal in India,
Chartres Cathedral in France, and Serengeti
National Park in Tanzania. The four sites Japan
has listed include
122
The Problem of Question Answering
Natural language question, not keyword queries
What is the nationality of Pope John Paul II?
stabilize the country with its help, the
Catholic hierarchy stoutly held out for
pluralism, in large part at the urging of
Polish-born Pope John Paul II. When the Pope
emphatically defended the Solidarity trade union
during a 1987 tour of the
Short text fragment, not URL list
123
Question Answering from text
  • With massive collections of full-text documents,
    simply finding relevant documents is of limited
    use; we want answers
  • QA: give the user a (short) answer to their
    question, perhaps supported by evidence.
  • An alternative to standard IR
  • The first problem area in IR where NLP is really
    making a difference.

124
People want to ask questions
Examples from AltaVista query log who invented
surf music? how to make stink bombs where are the
snowdens of yesteryear? which english translation
of the bible is used in official catholic
liturgies? how to do clayart how to copy psx how
tall is the sears tower? Examples from Excite
query log (12/1999) how can i find someone in
texas where can i find information on puritan
religion? what are the 7 wonders of the world how
can i eliminate stress What vacuum cleaner does
Consumers Guide recommend
125
A Brief (Academic) History
  • In some sense question answering is not a new
    research area
  • Question answering systems can be found in many
    areas of NLP research, including
  • Natural language database systems
  • A lot of early NLP work on these
  • Problem-solving systems
  • STUDENT (Winograd 77)
  • LUNAR (Woods & Kaplan 77)
  • Spoken dialog systems
  • Currently very active and commercially relevant
  • The focus on open-domain QA, however, is new
  • First modern system: MURAX (Kupiec, SIGIR'93)
  • Trivial Pursuit questions
  • Encyclopedia answers
  • FAQFinder (Burke et al. 97)
  • TREC QA competition (NIST, 1999-present)

126
AskJeeves
  • AskJeeves is probably the most hyped example of
    question answering
  • How it used to work
  • Do pattern matching to match a question to their
    own knowledge base of questions
  • If a match is found, returns a human-curated
    answer to that known question
  • If that fails, it falls back to regular web
    search
  • (Seems to be more of a meta-search engine now)
  • A potentially interesting middle ground, but a
    fairly weak shadow of real QA

127
Question Answering at TREC
  • Question answering competition at TREC consists
    of answering a set of 500 fact-based questions,
    e.g.,
  • When was Mozart born?
  • Has really pushed the field forward.
  • The document set
  • Newswire textual documents from the LA Times, San
    Jose Mercury News, Wall Street Journal, NY Times,
    etc.; over 1M documents now.
  • Well-formed lexically, syntactically and
    semantically (were reviewed by professional
    editors).
  • The questions
  • Hundreds of new questions every year, the total
    is 2400
  • Task
  • Initially, extract at most 5 answers: long (250B)
    and short (50B).
  • Now extract only one exact answer.
  • Several other sub-tasks added later: definition,
    list, biography.

128
Sample TREC questions
1. Who is the author of the book, "The Iron Lady:
A Biography of Margaret Thatcher"? 2. What was
the monetary value of the Nobel Peace Prize in
1989? 3. What does the Peugeot company
manufacture? 4. How much did Mercury spend on
advertising in 1993? 5. What is the name of the
managing director of Apricot Computer? 6. Why
did David Koresh ask the FBI for a word
processor? 7. What is the name of the rare
neurological disease with symptoms such as
involuntary movements (tics), swearing, and
incoherent vocalizations (grunts, shouts, etc.)?
129
TREC Scoring
  • For the first three years systems were allowed to
    return 5 ranked answer snippets (50/250 bytes) to
    each question.
  • Mean Reciprocal Rank Scoring (MRR); see the
    sketch below
  • Each question is assigned the reciprocal rank of
    the first correct answer. If the correct answer is
    at position k, the score is 1/k.
  • 1, 0.5, 0.33, 0.25, 0.2 for positions 1 through 5,
    and 0 if no correct answer is in the top 5
  • Mainly Named Entity answers (person, place, date,
    ...)
  • From 2002 on, the systems are only allowed to
    return a single exact answer and the notion of
    confidence has been introduced.
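
A minimal sketch of MRR in Python (the ranked answer lists and gold answers
are made-up examples):

def mrr(ranked_answer_lists, gold_answers):
    """Average of 1/k over questions, where k is the rank of the first correct answer."""
    total = 0.0
    for answers, gold in zip(ranked_answer_lists, gold_answers):
        for rank, answer in enumerate(answers, start=1):
            if answer == gold:
                total += 1.0 / rank
                break                      # only the first correct answer counts
    return total / len(gold_answers)

system_answers = [['1756', '1791', '1770'],   # correct at rank 1 -> 1.0
                  ['Vienna', 'Salzburg']]     # correct at rank 2 -> 0.5
gold = ['1756', 'Salzburg']
print(mrr(system_answers, gold))              # (1.0 + 0.5) / 2 = 0.75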

130
Top Performing Systems
  • In 2003, the best performing systems at TREC could
    answer approximately 60-70% of the questions
  • Approaches and successes have varied a fair deal
  • Knowledge-rich approaches, using a vast array of
    NLP techniques, stole the show in 2000-2003
  • Notably Harabagiu, Moldovan et al. (SMU/UTD/LCC)
  • Statistical systems starting to catch up
  • AskMSR system stressed how much could be achieved
    by very simple methods with enough text (and now
    various copycats)
  • People are experimenting with machine learning
    methods
  • Middle ground is to use large collection of
    surface matching patterns (ISI)

131
Example QA System
  • This system contains many components used by
    other systems, but more complex in some ways
  • Most work was completed in 2001; there have been
    advances by this group and others since then.
  • Next slides based mainly on
  • Pasca and Harabagiu, High-Performance Question
    Answering from Large Text Collections, SIGIR'01.
  • Pasca and Harabagiu, Answer Mining from Online
    Documents, ACL'01.
  • Harabagiu, Pasca, Maiorano, Experiments with
    Open-Domain Textual Question Answering, COLING'00

132
QA Block Architecture
(Diagram: Q → Question Processing → question
semantics and keywords → Passage Retrieval (backed
by Document Retrieval) → passages → Answer
Extraction → A. Question Processing and Answer
Extraction each use WordNet, a parser, and a NER
component.)
133
Question Processing Flow
(Diagram: Q → question parsing, which feeds three
steps: construction of the question representation
→ question semantic representation; answer type
detection → AT category; keyword selection →
keywords.)
134
Question Stems and Answer Types
Identify the semantic category of expected answers
  • Other question stems: Who, Which, Name, How
    hot...
  • Other answer types: Country, Number, Product...

135
Detecting the Expected Answer Type
  • In some cases, the question stem is sufficient to
    indicate the answer type (AT)
  • Why → REASON
  • When → DATE
  • In many cases, the question stem is ambiguous
  • Examples
  • What was the name of Titanic's captain?
  • What U.S. Government agency registers trademarks?
  • What is the capital of Kosovo?
  • Solution: select additional question concepts (AT
    words) that help disambiguate the expected answer
    type (see the sketch below)
  • Examples
  • captain
  • agency
  • capital
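
A minimal sketch of this two-step idea in Python (the stem and AT-word tables
are tiny illustrative assumptions, not the system's 8707-concept taxonomy):

STEM_TO_TYPE = {'why': 'REASON', 'when': 'DATE', 'where': 'LOCATION'}
AT_WORD_TO_TYPE = {'captain': 'PERSON', 'agency': 'ORGANIZATION', 'capital': 'CITY'}

def answer_type(question):
    words = question.lower().rstrip('?').split()
    if words[0] in STEM_TO_TYPE:              # unambiguous question stem
        return STEM_TO_TYPE[words[0]]
    for w in words:                           # otherwise fall back to an AT word
        if w in AT_WORD_TO_TYPE:
            return AT_WORD_TO_TYPE[w]
    return 'UNKNOWN'

print(answer_type("When was Mozart born?"))                     # DATE
print(answer_type("What was the name of Titanic's captain?"))   # PERSON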

136
Answer Type Taxonomy
  • Encodes 8707 English concepts to help recognize
    expected answer type
  • Mapping to parts of WordNet done by hand
  • Can connect to Noun, Adj, and/or Verb
    subhierarchies

137
Answer Type Detection Algorithm
  • Select the answer type word from the question
    representation.
  • Select the word(s) connected to the question.
    Some content-free words are skipped (e.g.
    name).
  • From the previous set select the word with the
    highest connectivity in the question
    representation.
  • Map the AT word in a previously built AT
    hierarchy
  • The AT hierarchy is based on WordNet, with some
    concepts associated with semantic categories,
    e.g. writer → PERSON.
  • Select the AT(s) from the first hypernym(s)
    associated with a semantic category.

138
Answer Type Hierarchy
(Diagram: a fragment of the answer type hierarchy
rooted at PERSON.)
139
Understanding a Simple Narrative
  • Feb. 18, 2004
  • Yesterday Holly was running a marathon when she
    twisted her ankle. David had pushed her.
  • Temporal Awareness of the Narrative
  • Time
  • Events, Activities, and States
  • Anchoring the Events
  • Ordering the Events

140
Temporal Aspects of Narrative Text
  • Feb. 18, 2004
  • Yesterday Holly was running a marathon when she
    twisted her ankle. David had pushed her.

1. When did the running occur? Yesterday.
2. When did the twisting occur? Yesterday, during the running.
3. Did the pushing occur before the twisting? Yes.
4. Did Holly keep running after twisting her ankle? Probably not.
141
Temporal Assumptions
  • Time primitives are temporal intervals.
  • No branching into the future or the past
  • 13 basic (binary) interval relations
  • b, a, eq, o, oi, s, si, f, fi, d, di, m, mi
  • (six are inverses of the other six)
  • Supported by a transitivity table that defines
    the conjunction of any two relations.
  • All 13 relations can be expressed using meets
    (see the sketch below)
  • Before(X, Y) ≡ ∃Z (meets(X, Z) ∧ meets(Z, Y))
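
A minimal sketch in Python of a few of these interval relations, representing
intervals as (start, end) pairs on a numeric timeline (an assumption made for
illustration; the slides treat intervals as primitives):

def before(x, y):   return x[1] < y[0]
def meets(x, y):    return x[1] == y[0]
def overlaps(x, y): return x[0] < y[0] < x[1] < y[1]
def during(x, y):   return y[0] < x[0] and x[1] < y[1]

running  = (1, 5)    # Holly was running a marathon ...
pushing  = (2, 3)    # David had pushed her ...
twisting = (3, 4)    # ... when she twisted her ankle

print(during(twisting, running))   # True: the twisting happened during the running
print(meets(pushing, twisting))    # True: the push immediately precedes the twist
print(before(pushing, running))    # False: the push happened during the marathon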

142
Allen's 13 Temporal Relations
(Diagram of paired intervals A and B illustrating:)
  • A is EQUAL to B / B is EQUAL to A
  • A is BEFORE B / B is AFTER A
  • A MEETS B / B is MET by A
  • A OVERLAPS B / B is OVERLAPPED by A
  • A STARTS B / B is STARTED by A
  • A FINISHES B