1
NLP/CL Review
School of Computing FACULTY OF ENGINEERING
  • Eric Atwell, Language Research Group
  • (with thanks to other contributors)

2
Objectives of the module
  • On completion of this module, students should be
    able to: understand the theory and terminology of
    empirical modelling of natural language;
    understand and use algorithms, resources and
    techniques for implementing and evaluating NLP
    systems; be familiar with some of the main
    language engineering application areas; and
    appreciate why unrestricted natural language
    processing is still a major research task.
  • In a nutshell
  • Why NLP is difficult: language is a complex
    system
  • How to solve it? Corpus-based machine-learning
    approaches
  • Motivation: applications of The Language
    Machine

3
The main sub-areas of linguistics
  • Phonetics and Phonology: the study of
    linguistic sounds or speech.
  • Morphology: the study of the meaningful
    components of words.
  • Syntax: the study of the structural
    relationships between words.
  • Semantics: the study of the meanings of words,
    phrases and sentences.
  • Discourse: the study of linguistic units larger
    than a single utterance.
  • Pragmatics: the study of how language is used
    to accomplish goals.

4
Python, NLTK, WEKA
  • Python: a good programming language for NLP
  • Interpreted
  • Object-oriented
  • Easy to interface to other things (text files,
    web, DBMS)
  • Good stuff from Java, Lisp, Tcl, Perl
  • Easy to learn
  • FUN!
  • Python NLTK: Natural Language Toolkit, with demos
    and tutorials (see the sketch below)
  • WEKA: machine learning toolkit with classifiers,
    e.g. J48 decision trees
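  • A minimal sketch of NLTK in use, assuming NLTK and
    its tokeniser and tagger data are installed (e.g.
    via nltk.download); the tags shown are indicative:

      import nltk

      sentence = "Colorless green ideas sleep furiously."
      tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
      tagged = nltk.pos_tag(tokens)           # assign a part-of-speech tag to each token
      print(tagged)  # e.g. [('Colorless', 'JJ'), ('green', 'JJ'), ('ideas', 'NNS'), ...]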

5
Why is NLP difficult?
  • Computers are not brains
  • There is evidence that much of language
    understanding is built into the human brain
  • Computers do not socialize
  • Much of language is about communicating with
    people
  • Key problems
  • Representation of meaning and hidden structure
  • Language presupposes knowledge about the world
  • Language is ambiguous: a message can have many
    interpretations
  • Language presupposes communication between people

6
Ambiguity: Grammar (PoS) and Meaning
  • Iraqi Head Seeks Arms
  • Juvenile Court to Try Shooting Defendant
  • Teacher Strikes Idle Kids
  • Kids Make Nutritious Snacks
  • British Left Waffles on Falkland Islands
  • Red Tape Holds Up New Bridges
  • Bush Wins on Budget, but More Lies Ahead
  • Hospitals are Sued by 7 Foot Doctors
  • (Headlines leave out punctuation and
    function-words)
  • Lynne Truss, 2003. Eats, Shoots & Leaves:
    The Zero Tolerance Approach to Punctuation

7
The Role of Memorization
  • Children learn words quickly
  • Around age two they learn about 1 word every 2
    hours.
  • (Or 9 words/day)
  • Often only need one exposure to associate meaning
    with word
  • Can make mistakes, e.g., overgeneralization
  • I goed to the store.
  • Exactly how they do this is still under study
  • Adult vocabulary
  • Typical adult: about 60,000 words
  • Literate adults: about twice that.

8
But there is too much to memorize!
  • establish
  • establishment
    the Church of England as the official state
    church.
  • disestablishment
  • antidisestablishment
  • antidisestablishmentarian
  • antidisestablishmentarianism
  • is a political philosophy that is opposed to the
    separation of church and state.
  • MAYBE we don't remember every word separately
  • MAYBE we remember MORPHEMES and how to combine
    them

9
Rationalism v Empiricism
  • Rationalism: the doctrine that knowledge is
    acquired by reason without regard to experience
    (Collins English Dictionary)
  • Noam Chomsky, 1957: Syntactic Structures
  • Argued that we should build models through
    introspection
  • A language model is a set of rules thought up by
    an expert
  • Like Expert Systems
  • Chomsky thought data was full of errors; better
    to rely on linguists' intuitions

10
Empiricism v Rationalism
  • Empiricism: the doctrine that all knowledge
    derives from experience (Collins English
    Dictionary)
  • The field was stuck for quite some time:
    rationalist linguistic models built for a specific
    example did not generalise.
  • A new approach started around 1990: Corpus
    Linguistics
  • Well, not really new, but in the '50s to '80s,
    they didn't have the text, disk space, or GHz
  • Main idea: machine learning from CORPUS data
  • How to do corpus linguistics
  • Get a large text collection (a corpus; plural:
    corpora)
  • Compute statistical models over the
    words/PoS/parses/etc. in the corpus
  • Surprisingly effective

11
Example Problem
  • Grammar checking example
  • Which word to use?
  • <principal> vs <principle>
  • Empirical solution: look at which words surround
    each use
  • I am in my third year as the principal of Anamosa
    High School.
  • School-principal transfers caused some upset.
  • This is a simple formulation of the quantum
    mechanical uncertainty principle.
  • Power without principle is barren, but principle
    without power is futile. (Tony Blair)

12
Using Very Large Corpora
  • Keep track of which words are the neighbors of
    each spelling in well-edited text, e.g.
  • Principal: high school
  • Principle: rule
  • At grammar-check time, choose the spelling best
    predicted by the probability of co-occurring with
    the surrounding words (see the sketch below)
  • No need to understand the meaning !?
  • Surprising results
  • Log-linear improvement even to a billion words!
  • Getting more data is better than fine-tuning
    algorithms!
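  • A toy sketch of this idea in Python; the neighbor
    counts and the scoring function are illustrative
    assumptions, not the method of any real grammar
    checker:

      from collections import Counter

      # Neighbor counts that would normally come from a large, well-edited corpus
      neighbors = {
          "principal": Counter({"school": 120, "high": 95, "deputy": 40, "rule": 2}),
          "principle": Counter({"rule": 80, "uncertainty": 60, "moral": 45, "school": 3}),
      }

      def choose(context_words):
          """Pick the spelling whose recorded neighbors best match the context."""
          return max(neighbors, key=lambda w: sum(neighbors[w][c] for c in context_words))

      print(choose(["high", "school"]))        # -> principal
      print(choose(["uncertainty", "rule"]))   # -> principle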

13
The Effects of LARGE Datasets
  • From Banko & Brill, 2001, "Scaling to Very Very
    Large Corpora for Natural Language
    Disambiguation", Proc. ACL

14
Corpus, word tokens and types
  • Corpus: text selected by language, genre, domain, etc.
  • Brown, LOB, BNC, Penn Treebank, MapTask, CCA, etc.
  • Corpus annotation: text headers, PoS, parses, etc.
  • Corpus size is the number of words; it depends on
    tokenisation
  • We can count word tokens, word types, and the
    type-token distribution
  • Lexeme/lemma is the root form, vs inflections (be
    vs am/is/was)

15
Tokenization and Morphology
  • Tokenization: by whitespace, regular expressions
  • Problems: "It's", "data-base", "New York"
  • Jabberwocky shows we can break words into
    morphemes
  • Morpheme types: root/stem, affix, clitic
  • Derivational vs. inflectional
  • Regular vs. irregular
  • Concatenative vs. templatic (root-and-pattern)
  • Morphological analysers: Porter stemmer, Morphy,
    PC-Kimmo (see the sketch below)
  • Morphology by lookup: CatVar, CELEX, OALD
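  • A small sketch of both steps with NLTK, assuming
    NLTK is installed; the tokenisation pattern is a
    simple illustrative choice, not a full solution:

      import nltk
      from nltk.stem import PorterStemmer

      text = "It's a data-base in New York, isn't it?"
      # keep hyphenated/apostrophised words together; punctuation as separate tokens
      tokens = nltk.regexp_tokenize(text, r"\w+(?:[-']\w+)*|[^\w\s]")
      print(tokens)  # ["It's", 'a', 'data-base', 'in', 'New', 'York', ',', "isn't", 'it', '?']

      stemmer = PorterStemmer()
      print([stemmer.stem(w) for w in ["establish", "establishment", "establishing"]])
      # -> ['establish', 'establish', 'establish']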

16
Corpus word-counts and n-grams
  • FreqDist: counts of tokens and their distribution
    can be useful
  • E.g. find main characters in Gutenberg texts
  • E.g. compare word-lengths in different languages
  • Humans can predict the next word
  • N-gram models are based on counts in a large
    corpus (see the sketch below)
  • Auto-generate a story ... (but it gets stuck in a
    local maximum)
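  • A sketch of word counts and bigram counts over a
    Gutenberg text, assuming the NLTK 'gutenberg'
    corpus has been downloaded:

      import nltk
      from nltk import FreqDist, bigrams

      words = [w.lower() for w in nltk.corpus.gutenberg.words('austen-emma.txt') if w.isalpha()]

      fdist = FreqDist(words)
      print(fdist.most_common(5))        # the most frequent word types in the text

      # Conditional counts over bigrams: which words tend to follow a given word?
      cfd = nltk.ConditionalFreqDist(bigrams(words))
      print(cfd['emma'].most_common(3))  # frequent successors of 'emma'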

17
Word-counts follow Zipf's Law
  • Zipf's law applies to a word type-token frequency
    distribution: frequency is proportional to the
    inverse of the rank in a ranked list
  • f × r ≈ k, where f is frequency, r is rank, and k
    is a constant
  • i.e. a few very common words, a small to medium
    number of middle-frequency words, and a long tail
    of low-frequency (often frequency 1) words
  • Chomsky argued against corpus evidence as it is
    finite and limited compared to introspection;
    Zipf's law shows that many words/structures appear
    only once or not at all in a given corpus,
    supporting the argument that corpus evidence is
    limited compared to introspection (a rough check
    of the law is sketched below)
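  • A rough check of the law on a Gutenberg text,
    assuming the NLTK 'gutenberg' corpus is available;
    an illustration, not a rigorous fit:

      import nltk
      from nltk import FreqDist

      words = [w.lower() for w in nltk.corpus.gutenberg.words('austen-emma.txt') if w.isalpha()]
      fdist = FreqDist(words)

      for rank, (word, freq) in enumerate(fdist.most_common(1000), start=1):
          if rank in (1, 10, 100, 1000):
              # if Zipf's law holds, rank * freq stays in roughly the same range
              print(rank, word, freq, rank * freq)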

18
Kilgarriff's Sketch Engine
  • Sketch Engine shows a Word Sketch, or list of
    collocates: words co-occurring with the target
    word more frequently than predicted by their
    independent probabilities
  • A lexicographer can colour-code groups of related
    collocates, indicating different senses or
    meanings of the target word
  • With a large corpus the lexicographer should find
    all current senses, better than relying on
    intuition/introspection
  • Large user-base of experience, used in
    development of several published dictionaries for
    English
  • For minority languages with few existing corpus
    resources, Sketch Engine is combined with
    Web-Bootcat to enable lexicographers to collect
    their own Web-as-Corpus

19
Parts of Speech
  • Parts of Speech group words into grammatical
    categories
  • and separate different functions of a word
  • In English, many words are ambiguous between 2 or
    more PoS-tags
  • Very simple tagger: everything is NN
  • Better PoS-taggers: unigram, bigram, trigram,
    Brill, etc. (see the sketch below)
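  • A sketch of these taggers with NLTK, assuming the
    Brown corpus has been downloaded; accuracies vary
    with the split, and recent NLTK versions use
    .accuracy() where older ones used .evaluate():

      import nltk
      from nltk.corpus import brown

      tagged_sents = brown.tagged_sents(categories='news')
      train, test = tagged_sents[:4000], tagged_sents[4000:]

      default = nltk.DefaultTagger('NN')                    # "everything is NN" baseline
      unigram = nltk.UnigramTagger(train, backoff=default)  # most frequent tag per word type
      bigram = nltk.BigramTagger(train, backoff=unigram)    # condition on the previous tag

      for tagger in (default, unigram, bigram):
          print(type(tagger).__name__, round(tagger.accuracy(test), 3))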

20
Training and Testing of Machine Learning
Algorithms
  • Algorithms that learn from data see a set of
    examples and try to generalize from them.
  • Training set
  • Examples trained on
  • Test set
  • Also called held-out data and unseen data
  • Use this for testing your algorithm
  • Must be separate from the training set
  • Otherwise, you cheated!
  • Gold Standard
  • A test set that a community has agreed on and
    uses as a common benchmark; use it for final
    evaluation (a minimal train/test split is sketched
    below)
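  • A minimal illustration of a held-out split; the
    90/10 ratio is an assumption, not a rule:

      import random

      random.seed(0)
      examples = list(range(1000))   # stand-in for labelled training examples
      random.shuffle(examples)

      split = int(0.9 * len(examples))
      train_set, test_set = examples[:split], examples[split:]

      assert not set(train_set) & set(test_set)   # test data must never appear in training
      print(len(train_set), len(test_set))        # 900 100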

21
Grammar and Parsing
  • Context-Free Grammars and Constituency
  • Some common CFG phenomena for English
  • Sentence-level constructions
  • NP, PP, VP
  • Coordination
  • Subcategorization
  • Top-down and Bottom-up Parsing

22
Problems with context-free grammars and parsers
  • Parse-trees show syntactic structure of sentences
  • Key constituents S, NP, VP, PP
  • You can draw a parse-tree and corresponding CFG
  • Problems with Context-Free Grammar
  • Coordination: X → X and X is a meta-rule, not a
    strict CFG rule
  • Agreement: needs duplicate CFG rules for
    singular/plural etc.
  • Subcategorization: needs separate CFG
    non-terminals for transitive/intransitive/etc.
  • Movement: the object/subject of a verb may be
    moved in questions
  • Dependency parsing captures deeper semantics but
    is harder
  • Parsing: top-down vs bottom-up vs combined
  • Ambiguity causes backtracking, so a CHART PARSER
    stores partial parses (see the sketch below)
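  • A sketch of a tiny CFG and a chart parser in NLTK;
    the toy grammar is an assumption, just enough to
    show two attachment readings of one sentence:

      import nltk

      grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      NP -> Det N | Det N PP
      VP -> V NP | V NP PP
      PP -> P NP
      Det -> 'the'
      N -> 'dog' | 'man' | 'park'
      V -> 'saw'
      P -> 'in'
      """)

      parser = nltk.ChartParser(grammar)   # stores partial parses; ambiguity needs no backtracking
      for tree in parser.parse("the dog saw the man in the park".split()):
          print(tree)                      # both PP-attachment readings are returned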

23
Parsing sentences left-to-right
  • The horse raced past the barn
  • [S [NP [A the] [N horse]] [VP [V raced] [PP [I past]
    [NP [A the] [N barn]]]]]
  • The horse raced past the barn fell
  • [S [NP [NP [A the] [N horse]] [VP [V raced] [PP
    [I past] [NP [A the] [N barn]]]]] [VP [V fell]]]

24
Chunking or Shallow Parsing
  • Break text up into non-overlapping contiguous
    subsets of tokens.
  • Shallow parsing or Chunking is useful for
  • Entity recognition
  • people, locations, organizations
  • Studying linguistic patterns
  • gave NP
  • gave up NP in NP
  • gave NP NP
  • gave NP to NP
  • Prosodic phrase breaks: pauses in speech
  • Can ignore complex structure when not relevant
  • Chunking can be done via regular expressions over
    PoS-tags (see the sketch below)
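  • A sketch of regex chunking over PoS-tags in NLTK,
    assuming a tagger model is available; the chunk
    pattern is a simple illustrative one:

      import nltk

      sentence = "The quick brown fox gave the lazy dog a bone"
      tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

      # NP chunk = optional determiner, any adjectives, one or more nouns
      chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
      print(chunker.parse(tagged))   # non-overlapping NP chunks; other tokens left unchunked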

25
Information Extraction
  • Partial parsing gives us NP chunks
  • IE: Named Entity Recognition
  • People, places, companies, dates etc.
  • In a cohesive text, some NPs refer to the same
    thing/person
  • We need an algorithm for NP coreference
    resolution, e.g.
  • "Hudda", "ten of hearts", "Mrs Anthrax", "she"
    all refer to the same Named Entity (an NER sketch
    follows below)
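  • A sketch of the NER step with NLTK's ne_chunk,
    assuming the maxent_ne_chunker and words data are
    installed; coreference resolution itself would
    need a separate step:

      import nltk

      sentence = "Eric Atwell works at the University of Leeds in England."
      tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
      print(nltk.ne_chunk(tagged))   # PERSON, ORGANIZATION, GPE chunks, depending on the model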

26
Semantics: Word Sense Disambiguation
  • e.g. mouse (animal / PC-interface)
  • It's a hard task (very)
  • Humans very good at it
  • Computers not
  • Active field of research for over 50 years
  • Mistakes in disambiguation have negative results
  • Beginning to be of practical use (a simple
    baseline is sketched below)
  • Desirable skill (Google, M)
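  • A sketch of one simple WSD baseline, the
    simplified Lesk algorithm in NLTK (assumes the
    WordNet data is installed; far from state of the
    art):

      import nltk
      from nltk.wsd import lesk

      context = "I moved the mouse and clicked the left button".split()
      sense = lesk(context, 'mouse')   # pick the WordNet sense with most overlap with the context
      print(sense, '-', sense.definition() if sense else 'no sense found')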

27
Machine learning v cognitive modelling
  • NLP has been successful using ML from data,
    without linguistic / cognitive models
  • Supervised ML: given labelled data (e.g. PoS-tagged
    text to train a PoS-tagger, to tag new text in the
    style of the training text)
  • Unsupervised ML: no labelled data (e.g. clustering
    words with similar contexts gives PoS-tag
    categories)
  • Unsupervised ML is harder, but increasingly
    successful!

28
NLP applications
  • Machine Translation
  • Localization: adapting text (e.g. ads) to the
    local language
  • Information Retrieval (Google, etc)
  • Information Extraction
  • Detecting Terrorist Activities
  • Understanding the Quran
  • For more, see The Language Machine

29
And Finally
  • Any final questions?
  • Feedback please (e.g. email me)
  • Good luck in the exam!
  • Look at past exam papers
  • BUT note changes in topics covered
  • And if you do use NLP in your career, please let
    me know