1
Seminar Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez TALP Research Center Llenguatges i
Sistemes Informàtics Universitat Politècnica de
Catalunya
Girona, June 2003
2
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

3
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

4
Machine Learning
ML4NLP
  • There are many general-purpose definitions of
    Machine Learning (or artificial learning)
  • Learners are computers: we study learning
    algorithms
  • Resources are scarce: time, memory, data, etc.
  • It has (almost) nothing to do with cognitive
    science, neuroscience, the theory of scientific
    discovery and research, etc.
  • Biological plausibility is welcome, but it is not
    the main goal

5
Machine Learning
ML4NLP
  • Learning... but what for?
  • To perform some particular task
  • To react to environmental inputs
  • Concept learning from data:
  • modelling the concepts underlying the data
  • predicting unseen observations
  • compacting the knowledge representation
  • knowledge discovery for expert systems
  • We will concentrate on:
  • supervised inductive learning for classification,
    i.e., discriminative learning

6
Machine Learning
ML4NLP
A more precise definition (Mitchell, 1997): a computer
program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E
  • What to read?
  • Machine Learning (Mitchell, 1997)

7
Empirical NLP
ML4NLP
1990s: application of Machine Learning (ML) techniques
to NLP problems
  • Lexical and structural ambiguity problems
  • Word selection (SR, MT)
  • Part-of-speech tagging
  • Semantic ambiguity (polysemy)
  • Prepositional phrase attachment
  • Reference ambiguity (anaphora)
  • etc.
  • What to read? Foundations of Statistical Natural
    Language Processing (Manning & Schütze, 1999)

8
NLP classification problems
ML4NLP
  • Ambiguity is a crucial problem for natural
    language understanding/processing.
    Ambiguity resolution ⇒ classification

He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
9
NLP classification problems
ML4NLP
  • Morpho-syntactic ambiguity

He was shot in the hand as he chased the robbers
in the back street
NN vs. VB    JJ vs. VB    NN vs. VB    (candidate tags for the ambiguous words)
(The Wall Street Journal Corpus)
10
NLP classification problems
ML4NLP
  • Morpho-syntactic ambiguity
    Part of Speech Tagging

He was shot in the hand as he chased the robbers
in the back street
NN vs. VB    JJ vs. VB    NN vs. VB    (candidate tags for the ambiguous words)
(The Wall Street Journal Corpus)
11
NLP classification problems
ML4NLP
  • Semantic (lexical) ambiguity

He was shot in the hand as he chased the robbers
in the back street
body-part vs. clock-part (candidate senses of "hand")
(The Wall Street Journal Corpus)
12
NLP classification problems
ML4NLP
  • Semantic (lexical) ambiguity
    Word Sense Disambiguation

He was shot in the hand as he chased the robbers
in the back street
body-part vs. clock-part (candidate senses of "hand")
(The Wall Street Journal Corpus)
13
NLP classification problems
ML4NLP
  • Structural (syntactic) ambiguity

He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
14
NLP classification problems
ML4NLP
  • Structural (syntactic) ambiguity

He was shot in the hand as he chased the robbers
in the back street
(The Wall Street Journal Corpus)
15
NLP classification problems
ML4NLP
  • Structural (syntactic) ambiguity
    PP-attachment disambiguation

He was shot in the hand as he (chased (the
robbers)NP (in the back street)PP)
(The Wall Street Journal Corpus)
16
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms in detail
  • Applications to NLP

17
Feature Vector Classification
Classification
AI perspective
  • An instance is a vector x = ⟨x1, …, xn⟩ whose
    components, called features (or attributes), are
    discrete or real-valued.
  • Let X be the space of all possible instances.
  • Let Y = {y1, …, ym} be the set of categories (or
    classes).
  • The goal is to learn an unknown target function,
    f : X → Y
  • A training example is an instance x belonging to
    X, labelled with the correct value for f(x),
    i.e., a pair ⟨x, f(x)⟩
  • Let D be the set of all training examples.
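
To make the setting concrete (an illustration added here, not part of
the original slides): a PP-attachment decision from the earlier
examples can be encoded exactly in these terms. The feature names and
the two training examples below are invented.

    # Illustrative sketch; feature names and data are hypothetical.
    from typing import Dict, List, Tuple

    Instance = Dict[str, str]          # feature name -> discrete value
    Example = Tuple[Instance, str]     # a pair (x, f(x))

    # Y = {"verb-attach", "noun-attach"} for PP-attachment disambiguation
    D: List[Example] = [
        ({"verb": "shot", "noun1": "robbers", "prep": "in",   "noun2": "street"},    "verb-attach"),
        ({"verb": "ate",  "noun1": "pizza",   "prep": "with", "noun2": "anchovies"}, "noun-attach"),
    ]

    # A (trivial) hypothesis h : X -> Y from some hypotheses space H
    def h(x: Instance) -> str:
        return "noun-attach" if x["prep"] == "with" else "verb-attach"

    print([h(x) == y for x, y in D])   # consistency of h with D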

18
Feature Vector Classification
Classification
  • The hypotheses space, H, is the set of functions
    h : X → Y that the learner can consider as
    possible definitions
  • The goal is to find a function h belonging to H
    such that for every pair ⟨x, f(x)⟩ belonging to
    D, h(x) = f(x)

19
An Example
Classification
otherwise ⇒ negative
20
An Example
Classification
21
Some important concepts
Classification
  • Inductive Bias
  • "Any means that a classification learning
    system uses to choose between two functions that
    are both consistent with the training data is
    called inductive bias" (Mooney & Cardie, 99)
  • Language bias / search bias

22
Some important concepts
Classification
  • Inductive Bias
  • Training error and generalization error
  • Generalization ability and overfitting
  • Batch learning vs. on-line learning
  • Symbolic vs. statistical Learning
  • Propositional vs. first-order learning

23
Classification
Propositional vs. Relational Learning
  • Propositional learning

color(red) ∧ shape(circle) ⇒ classA
24
The Classification Setting: Class, Point, Example,
Data Set, ...
Classification
CoLT/SLT perspective
  • Input space: X ⊆ R^n
  • (binary) Output space: Y = {+1, −1}
  • A point, pattern or instance: x ∈ X,
    x = (x1, x2, …, xn)
  • Example: (x, y) with x ∈ X, y ∈ Y
  • Training set: a set of m examples generated
    i.i.d. according to an unknown distribution
    P(x, y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)^m

25
The Classification Setting: Learning, Error, ...
Classification
  • The hypotheses space, H, is the set of functions
    h : X → Y that the learner can consider as
    possible definitions. In SVMs they are of the form
    h(x) = sign(⟨w, x⟩ + b)
  • The goal is to find a function h belonging to H
    such that the expected misclassification error on
    new examples, also drawn from P(x, y), is minimal
    (Risk Minimization, RM)

26
The Classification Setting: Learning, Error, ...
Classification
  • Expected error (risk): the probability of
    misclassifying a new example drawn from P(x, y)
  • The distribution P(x, y) itself is unknown; only
    the training examples are known → an induction
    principle is needed
  • Empirical Risk Minimization (ERM): find the
    function h belonging to H for which the training
    error (empirical risk) is minimal
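
For reference, the standard formulas behind these bullets (the
slide's own equations are not reproduced in this transcript):

    R(h) = \int \mathbf{1}[\, h(x) \neq y \,] \, dP(x, y)                      (expected risk)

    R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[\, h(x_i) \neq y_i \,]    (empirical risk)

    \mathrm{ERM}: \quad \hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h)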

27
The Classification Setting: Error,
Over(under)fitting, ...
Classification
  • Low training error ⇒ low true error?
  • The overfitting dilemma

(Müller et al., 2001)
  • Trade-off between training error and complexity
  • Different learning biases can be used

28
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

29
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Decision Trees
  • AdaBoost
  • Support Vector Machines
  • Applications to NLP

30
Learning Paradigms
Algorithms
  • Statistical learning
  • HMM, Bayesian Networks, ME, CRF, etc.
  • Traditional methods from Artificial Intelligence
    (ML, AI)
  • Decision trees/lists, exemplar-based learning,
    rule induction, neural networks, etc.
  • Methods from Computational Learning Theory
    (CoLT/SLT)
  • Winnow, AdaBoost, SVMs, etc.

31
Learning Paradigms
Algorithms
  • Classifier combination
  • Bagging, Boosting, Randomization, ECOC, Stacking,
    etc.
  • Semi-supervised learning: learning from labelled
    and unlabelled examples
  • Bootstrapping, EM, Transductive learning (SVMs,
    AdaBoost), Co-Training, etc.
  • etc.

32
Decision Trees
Algorithms
  • Decision trees are a way to represent rules
    underlying training data, with hierarchical
    structures that recursively partition the data.
  • They have been used by many research communities
    (Pattern Recognition, Statistics, ML, etc.) for
    data exploration with some of the following
    purposes: Description, Classification, and
    Generalization.
  • From a machine-learning perspective, Decision
    Trees are n-ary branching trees that represent
    classification rules for classifying the objects
    of a certain domain into a set of mutually
    exclusive classes

33
Decision Trees
Algorithms
  • Acquisition
    Top-Down Induction of Decision Trees (TDIDT)
  • Systems
  • CART (Breiman et al. 84),
  • ID3, C4.5, C5.0 (Quinlan 86,93,98),
  • ASSISTANT, ASSISTANT-R (Cestnik et al. 87)
    (Kononenko et al. 95)
  • etc.

34
An Example
Algorithms
35
Learning Decision Trees
Algorithms
36
General Induction Algorithm
Algorithms
37
General Induction Algorithm
Algorithms
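The induction algorithm on these two slides is shown as a figure that
is not captured in this transcript. A minimal sketch of generic
top-down induction (TDIDT) in Python, assuming discrete-valued
features, examples given as (feature-dict, label) pairs, and a
pluggable feature-selection criterion, might look like this:

    # Minimal TDIDT sketch (our illustration, not the slide's code).
    from collections import Counter

    def tdidt(examples, features, select_feature):
        labels = [y for _, y in examples]
        # Leaf: node is pure, or no features left -> return the majority class
        if len(set(labels)) == 1 or not features:
            return Counter(labels).most_common(1)[0][0]
        best = select_feature(examples, features)      # splitting criterion (next slide)
        node = {"feature": best, "children": {}}
        for value in {x[best] for x, _ in examples}:   # one branch per observed value
            subset = [(x, y) for x, y in examples if x[best] == value]
            rest = [f for f in features if f != best]
            node["children"][value] = tdidt(subset, rest, select_feature)
        return node

    def classify(tree, x):
        # Follow the branches; unseen values fall off the tree and return None
        while isinstance(tree, dict):
            tree = tree["children"].get(x[tree["feature"]])
        return tree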
38
Feature Selection Criteria
Algorithms
  • Functions derived from Information Theory:
  • Information Gain (sketched below), Gain Ratio (Quinlan 86)
  • Functions derived from Distance Measures:
  • Gini Diversity Index (Breiman et al. 84)
  • RLM (López de Mántaras 91)
  • Statistically based:
  • Chi-square test (Sestito & Dillon 94)
  • Symmetrical Tau (Zhou & Dillon 91)
  • RELIEFF-IG: a variant of RELIEFF (Kononenko 94)
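
As a concrete instance of the first family of criteria above, a small
Python sketch of Information Gain over discrete features (it plugs
into the TDIDT sketch of the previous slide; helper names are ours):

    from collections import Counter
    from math import log2

    def entropy(examples):
        counts = Counter(y for _, y in examples)
        total = len(examples)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, feature):
        # IG(S, f) = H(S) - sum_v |S_v|/|S| * H(S_v)
        gain = entropy(examples)
        for value in {x[feature] for x, _ in examples}:
            subset = [(x, y) for x, y in examples if x[feature] == value]
            gain -= len(subset) / len(examples) * entropy(subset)
        return gain

    # Usage with the TDIDT sketch above:
    # select_feature = lambda ex, fs: max(fs, key=lambda f: information_gain(ex, f))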

39
Extensions of DTs
Algorithms
(Murthy 95)
  • Pruning (pre/post)
  • Minimizing the effect of the greedy approach:
    lookahead
  • Non-linear splits
  • Combination of multiple models
  • Incremental learning (on-line)
  • etc.

40
Decision Trees and NLP
Algorithms
  • Speech processing (Bahl et al. 89; Bakiri &
    Dietterich 99)
  • POS Tagging (Cardie 93; Schmid 94b; Magerman 95;
    Màrquez & Rodríguez 95,97; Màrquez et al. 00)
  • Word sense disambiguation (Brown et al. 91;
    Cardie 93; Mooney 96)
  • Parsing (Magerman 95,96; Haruno et al. 98,99)
  • Text categorization (Lewis & Ringuette 94; Weiss
    et al. 99)
  • Text summarization (Mani & Bloedorn 98)
  • Dialogue act tagging (Samuel et al. 98)

41
Decision Trees and NLP
Algorithms
  • Noun phrase coreference
    (Aone & Benett 95; McCarthy &
    Lehnert 95)
  • Discourse analysis in information extraction
    (Soderland & Lehnert 94)
  • Cue phrase identification in text and speech
    (Litman 94; Siegel & McKeown 94)
  • Verb classification in Machine Translation
    (Tanaka 96; Siegel 97)

42
Decision Trees: pros & cons
Algorithms
  • Advantages
  • Acquires symbolic knowledge in an understandable
    way
  • Very well studied ML algorithm, with many variants
  • Can be easily translated into rules
  • Available software: C4.5, C5.0, etc.
  • Can be easily integrated into an ensemble

43
Decision Trees: pros & cons
Algorithms
  • Drawbacks
  • Computationally expensive when scaling to large
    natural language domains: many training examples,
    features, etc.
  • Data sparseness and data fragmentation: the
    problem of small disjuncts ⇒ probability
    estimation
  • DTs are a model with high variance (unstable)
  • Tendency to overfit the training data: pruning is
    necessary
  • Requires quite a big effort in tuning the model

44
Boosting algorithms
Algorithms
  • Idea:
  • to combine many simple and moderately accurate
    hypotheses (weak classifiers) into a single,
    highly accurate classifier
  • AdaBoost (Freund & Schapire 95) has been
    theoretically and empirically studied extensively
  • Many other variants and extensions (1997-2003)
  • http://www.lsi.upc.es/lluism/seminari/mlnlp.html

45
AdaBoost general scheme
Algorithms
TRAINING
46
AdaBoost algorithm
Algorithms
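The algorithm itself appears on this slide as a figure that is not
captured in the transcript. A compact numpy sketch of discrete
AdaBoost with decision-stump weak hypotheses, written here only for
illustration, is:

    # Discrete AdaBoost sketch with decision stumps (our illustration, not the slide's code).
    import numpy as np

    def train_stump(X, y, w):
        # Exhaustively pick the (feature, threshold, polarity) stump with lowest weighted error.
        best = None
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for polarity in (1, -1):
                    pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, polarity)
        return best

    def adaboost(X, y, T=10):
        m = len(y)
        w = np.full(m, 1.0 / m)                    # uniform initial distribution D_1
        ensemble = []
        for _ in range(T):
            err, j, thr, pol = train_stump(X, y, w)
            err = max(err, 1e-12)                  # avoid division by zero
            alpha = 0.5 * np.log((1 - err) / err)  # weight of the weak hypothesis
            pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
            w *= np.exp(-alpha * y * pred)         # increase weight of misclassified examples
            w /= w.sum()
            ensemble.append((alpha, j, thr, pol))
        return ensemble

    def predict(ensemble, X):
        score = sum(a * np.where(p * (X[:, j] - t) >= 0, 1, -1) for a, j, t, p in ensemble)
        return np.sign(score)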
47
AdaBoost example
Algorithms
Weak hypotheses: vertical/horizontal hyperplanes
48
AdaBoost round 1
Algorithms
49
AdaBoost round 2
Algorithms
50
AdaBoost round 3
Algorithms
51
Combined Hypothesis
Algorithms
52
AdaBoost and NLP
Algorithms
  • POS Tagging (Abney et al. 99; Màrquez 99)
  • Text and Speech Categorization
    (Schapire & Singer 98; Schapire et al. 98; Weiss
    et al. 99)
  • PP-attachment Disambiguation (Abney et al. 99)
  • Parsing (Haruno et al. 99)
  • Word Sense Disambiguation (Escudero et al. 00,
    01)
  • Shallow parsing (Carreras & Màrquez, 01a; 02)
  • Email spam filtering (Carreras & Màrquez, 01b)
  • Term Extraction (Vivaldi et al. 01)

53
AdaBoost: pros & cons
Algorithms
  • Easy to implement and few parameters to set
  • Time and space grow linearly with the number of
    examples: ability to manage very large learning
    problems
  • Does not constrain explicitly the complexity of
    the learner
  • Naturally combines feature selection with
    learning
  • Has been successfully applied to many practical
    problems

54
AdaBoost: pros & cons
Algorithms
  • Seems to be rather robust to overfitting
    (in the number of rounds), but sensitive to noise
  • Performance is very good when there are
    relatively few relevant terms (features)
  • Can perform poorly when there is insufficient
    training data relative to the complexity of the
    base classifiers, or when the training errors of
    the base classifiers become too large too quickly

55
Algorithms
SVM: A General Definition
  • Support Vector Machines (SVM) are learning
    systems that use a hypothesis space of linear
    functions in a high dimensional feature space,
    trained with a learning algorithm from
    optimisation theory that implements a learning
    bias derived from statistical learning theory.
    (Cristianini & Shawe-Taylor, 2000)

56
SVM: A General Definition
Algorithms
  • Support Vector Machines (SVM) are learning
    systems that use a hypothesis space of linear
    functions in a high dimensional feature space,
    trained with a learning algorithm from
    optimisation theory that implements a learning
    bias derived from statistical learning theory.
    (Cristianini & Shawe-Taylor, 2000)

Key Concepts
57
Linear Classifiers
Algorithms
  • Hyperplanes in R^N.
  • Defined by a weight vector (w) and a threshold
    (b).
  • They induce a classification rule
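
The rule referred to here (a standard formulation; the slide's own
formula is not reproduced in the transcript):

    h(\mathbf{x}) = \operatorname{sign}(\langle \mathbf{w}, \mathbf{x} \rangle + b)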

58
Optimal Hyperplane: Geometric Intuition
Algorithms
59
Optimal Hyperplane: Geometric Intuition
Algorithms
Maximal Margin Hyperplane
60
Linearly separable data
Algorithms
Quadratic Programming
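The quadratic program in question is the standard maximal-margin
formulation for linearly separable data (the slide's own statement of
it is not in the transcript):

    \min_{\mathbf{w},\, b}\ \tfrac{1}{2}\,\|\mathbf{w}\|^2
    \quad \text{subject to} \quad y_i\,(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1,\quad i = 1, \dots, m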
61
Non-separable case (soft margin)
Algorithms
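For the non-separable case, the usual soft-margin formulation (again a
standard form rather than the slide's own) adds slack variables \xi_i
and the trade-off parameter C that the toy examples below vary:

    \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \tfrac{1}{2}\,\|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i
    \quad \text{subject to} \quad y_i\,(\langle \mathbf{w}, \mathbf{x}_i \rangle + b) \ge 1 - \xi_i,\quad \xi_i \ge 0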
62
Non-linear SVMs
Algorithms
  • Implicit mapping into feature space via kernel
    functions

63
Non-linear SVMs
Algorithms
  • Kernel functions
  • Must be efficiently computable
  • Characterization via Mercer's theorem
  • "One of the curious facts about using a kernel is
    that we do not need to know the underlying
    feature map in order to be able to learn in the
    feature space!" (Cristianini & Shawe-Taylor, 2000)
  • Examples: polynomials, Gaussian radial basis
    functions, two-layer sigmoidal neural networks,
    etc.
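
For concreteness, two of the kernels just listed, written as plain
numpy functions (an added illustration; parameter names are ours):

    import numpy as np

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # K(x, z) = (<x, z> + c)^d : implicit space of all monomials up to degree d
        return (np.dot(x, z) + c) ** degree

    def rbf_kernel(x, z, gamma=0.5):
        # K(x, z) = exp(-gamma * ||x - z||^2) : Gaussian radial basis function
        return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(z)) ** 2))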

64
Non-linear SVMs
Algorithms
Degree 3 polynomial kernel: linearly non-separable data
(input space) becomes linearly separable (feature space)
65
Toy Examples
Algorithms
  • All examples have been run with the 2D graphic
    interface of LIBSVM (Chang and Lin, National
    University of Taiwan)
  • "LIBSVM is an integrated software for support
    vector classification (C-SVC, nu-SVC),
    regression (epsilon-SVR, nu-SVR) and distribution
    estimation (one-class SVM). It supports
    multi-class classification. The basic algorithm
    is a simplification of both SMO by Platt and
    SVMLight by Joachims. It is also a simplification
    of the modification 2 of SMO by Keerthi et al.
    Our goal is to help users from other fields to
    easily use SVM as a tool. LIBSVM provides a
    simple interface where users can easily link it
    with their own programs"
  • Available from www.csie.ntu.edu.tw/cjlin/libsvm
    (it includes a Web-integrated demo tool)
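
As a rough modern equivalent of these toy runs (scikit-learn's SVC
wraps LIBSVM; this sketch is ours and is not the 2D demo tool from the
slide), the effect of the kernel and of the C parameter explored in
the next slides can be reproduced like this:

    # Sketch using scikit-learn's LIBSVM wrapper; the data set is synthetic.
    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = rng.randn(40, 2)
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)       # a linearly separable toy problem

    for C in (100.0, 0.01):                           # high C vs. low C: margin/error trade-off
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, "support vectors per class:", clf.n_support_)

    # Non-linear variant with an RBF kernel
    clf_rbf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)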

66
Toy Examples (I)
Algorithms
Linearly separable data set; linear SVM; maximal
margin hyperplane
67
Toy Examples (I)
Algorithms
(still) Linearly separable data set; linear SVM;
high value of the C parameter; maximal margin
hyperplane
The example is correctly classified
68
Toy Examples (I)
Algorithms
(still) Linearly separable data set; linear SVM;
low value of the C parameter; trade-off between
margin and training error
The example is now a bounded SV
69
Toy Examples (II)
Algorithms
70
Toy Examples (II)
Algorithms
71
Toy Examples (II)
Algorithms
72
Toy Examples (III)
Algorithms
73
SVM Summary
Algorithms
  • SVMs introduced at COLT'92 (Boser, Guyon &
    Vapnik, 1992). Great development since then
  • Kernel-induced feature spaces: SVMs work
    efficiently in very high dimensional feature
    spaces
  • Learning bias: maximal margin optimisation.
    Reduces the danger of overfitting. Generalization
    bounds for SVMs
  • Compact representation of the induced hypothesis:
    the solution is sparse in terms of SVs

74
SVM Summary
Algorithms
  • Due to Mercer's conditions on the kernels, the
    optimisation problems are convex. No local
    minima
  • Optimisation theory guides the implementation.
    Efficient learning
  • Mainly for classification, but also for
    regression, density estimation, clustering, etc.
  • Success in many real-world applications: OCR,
    vision, bioinformatics, speech recognition, NLP
    (TextCat, POS tagging, chunking, parsing, etc.)
  • Parameter tuning is needed, with implications for
    convergence times, sparsity of the solution, etc.

75
Outline
  • Machine Learning for NLP
  • The Classification Problem
  • Three ML Algorithms
  • Applications to NLP

76
NLP problems
Applications
  • Warning! We will not focus on final NLP
    applications, but on intermediate tasks...
  • We will classify the NLP tasks according to their
    (structural) complexity

77
NLP problems structural complexity
Applications
  • Decisional problems
  • Text Categorization, Document filtering, Word
    Sense Disambiguation, etc.
  • Sequence tagging and detection of sequential
    structures
  • POS tagging, Named Entity extraction, syntactic
    chunking, etc.
  • Hierarchical structures
  • Clause detection, full parsing, IE of complex
    concepts, composite Named Entities, etc.

78
POS tagging
Applications
  • Morpho-syntactic ambiguity
    Part of Speech Tagging

He was shot in the hand as he chased the robbers
in the back street
NN vs. VB    JJ vs. VB    NN vs. VB    (candidate tags for the ambiguous words)
(The Wall Street Journal Corpus)
79
POS tagging
Applications
preposition-adverb tree
80
POS tagging
Applications
preposition-adverb tree
Collocations
as_RB much_RB as_IN
as_RB soon_RB as_IN
as_RB well_RB as_IN
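
To make the classification view of tagging concrete, a typical
window-based feature extractor (our illustrative sketch; the actual
feature set of RTT/STT is not given in this transcript) might be:

    # Window-based feature extraction for tagging the word at position i (illustrative only).
    def tag_features(words, prev_tags, i):
        return {
            "word": words[i].lower(),
            "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
            "next_word": words[i + 1].lower() if i + 1 < len(words) else "<EOS>",
            "prev_tag": prev_tags[i - 1] if i > 0 else "<BOS>",
            "suffix3": words[i][-3:],
            "is_capitalized": words[i][0].isupper(),
        }

    sentence = "He was shot in the hand".split()
    print(tag_features(sentence, ["PRP", "VBD", "VBN", "IN", "DT"], 5))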
81
POS tagging
Applications
RTT (Màrquez & Rodríguez 97)
(flowchart) raw text → morphological analysis →
disambiguation loop (classify / filter / update, driven by a
language model, until the stop condition holds) → tagged text
82
POS tagging
Applications
STT (Màrquez & Rodríguez 97)
83
Detection of sequential and hierarchical
structures
Applications
  • Named Entity recognition
  • Clause detection

84
Summary/conclusions
Conclusions
  • We have briefly outlined:
  • The ML setting: supervised learning for
    classification
  • Three concrete machine learning algorithms
  • How to apply them to solve intermediate NLP tasks

85
Conclusions
Summary/conclusions
  • Any ML algorithm for NLP should be:
  • Robust to noise and outliers
  • Efficient in large feature/example spaces
  • Adaptive to new/changing domains:
    portability, tuning, etc.
  • Able to take advantage of unlabelled examples:
    semi-supervised learning

86
Summary/conclusions
Conclusions
  • Statistical and ML-based Natural Language
    Processing is a very active and multidisciplinary
    area of research

87
Some current research lines
Conclusions
  • Appropriate learning paradigms for all kinds of NLP
    problems: TiMBL (DBZ99), TBEDL (Brill 95), ME
    (Ratnaparkhi 98), SNoW (Roth 98), CRF (Pereira &
    Singer 02).
  • Definition of an adequate (and task-specific)
    feature space: mapping from the input space to a
    high dimensional feature space, kernels, etc.
  • Resolution of complex NLP problems: inference
    with classifiers, constraint satisfaction
  • etc.

88
Bibliography
Conclusions
  • You may find additional information at:
  • http://www.lsi.upc.es/lluism/
  • tesi.html
  • publicacions/pubs.html
  • cursos/talks.html
  • cursos/MLandNL.html
  • cursos/emnlp1.html
  • This talk at:
  • http://www.lsi.upc.es/lluism/udg03.ppt.gz

89
Seminar Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez TALP Research Center Llenguatges i
Sistemes Informàtics Universitat Politècnica de
Catalunya
Girona, June 2003