Machine Learning: Basic Introduction

1
Machine Learning: Basic Introduction
  • Jan Odijk
  • January 2011
  • LOT Winter School 2011

2
Overview
  • Introduction
  • Rule-based Approaches
  • Machine Learning Approaches
  • Statistical Approach
  • Memory Based Learning
  • Methodology
  • Evaluation
  • Machine Learning CLARIN

3
Introduction
  • As a scientific discipline
  • Studies algorithms that allow computers to evolve
    behaviors based on empirical data
  • Learning: empirical data are used to improve
    performance on some tasks
  • Core concept: generalize from observed data

4
Introduction
  • Plural Formation
  • Observed: list of (singular form, plural form)
    pairs
  • Generalize: predict the plural form given a
    singular form for new words (not in the observed
    list)
  • PoS tagging
  • Observed: text corpus with PoS-tag annotations
  • Generalize: predict the PoS-tag of each token in
    a new text corpus

5
Introduction
  • Supervised Learning
  • Map input into desired output, e.g. classes
  • Requires a training set
  • Unsupervised Learning
  • Model a set of inputs (e.g. into clusters)
  • No training set required

6
Introduction
  • Many approaches
  • Decision Tree Learning
  • Artificial Neural Networks
  • Genetic programming
  • Support Vector Machines
  • Statistical Approaches
  • Memory Based Learning

7
Introduction
  • Focus here
  • Supervised learning
  • Statistical Approaches
  • Memory-based learning

8
Rule-Based Approaches
  • Rule based systems for language
  • Lexicon
  • Lists all idiosyncratic properties of lexical
    items
  • Unpredictable properties, e.g. man is a noun
  • Exceptions to rules, e.g. past tense(go) = went
  • Hand-crafted
  • In a fully formalized manner

9
Rule-Based Approaches
  • Rule based systems for language (cont.)
  • Rules
  • Specify regular properties of the language
  • E.g. direct object directly follows verb (in
    English)
  • Hand-crafted
  • In a fully formalized manner

10
Rule-Based Approaches
  • Problems for rule based systems
  • Lexicon
  • Very difficult to specify and create
  • Always incomplete
  • Existing dictionaries
  • Were developed for use by humans
  • Do not specify enough properties
  • Do not specify the properties in a formalized
    manner

11
Rule-Based Approaches
  • Problems for rule based systems (cont.)
  • Rules
  • Extremely difficult to describe a language (or
    even a significant subset of language) by rules
  • Rule systems become very large and difficult to
    maintain
  • (No robustness (fail softly) for unexpected
    input)

12
Machine Learning
  • Machine Learning
  • A machine learns
  • Lexicon
  • Regularities of language
  • From a large corpus of observed data

13
Statistical Approach
  • Statistical approach
  • Goal: get output O given some input I
  • Given a word in English, get its translation in
    Spanish
  • Given an acoustic signal containing speech, get
    the written transcription of the spoken words
  • Given the preceding tags and the following
    ambitag, get the tag of the current word
  • Work with probabilities P(O|I)

14
Statistical Approach
  • P(A): probability of event A
  • A: an event (usually modelled as a set)
  • Event space Ω: all possible elementary events
  • 0 ≤ P(A) ≤ 1
  • For a finite event space and a uniform
    distribution: P(A) = |A| / |Ω|

15
Statistical Approach
  • Simple Example
  • A fair coin is tossed 3 times
  • What is the probability of (exactly) two heads?
  • 2 possibilities for each toss: Heads or Tails
  • Solution
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • P(A) = |A| / |Ω| = 3/8
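
A minimal Python sketch of this computation, enumerating the event space directly (the variable names are illustrative, not from the slides):

  from itertools import product

  # Event space Omega: all sequences of 3 tosses, each H or T
  omega = list(product("HT", repeat=3))

  # Event A: exactly two heads
  A = [o for o in omega if o.count("H") == 2]

  # Uniform distribution: P(A) = |A| / |Omega|
  print(len(A) / len(omega))  # 0.375 = 3/8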

16
Statistical Approach
  • Conditional Probability
  • P(A|B)
  • Probability of event A given that event B has
    occurred
  • P(A|B) = P(A ∩ B) / P(B) (for P(B) > 0)
  • (Venn diagram: sets A and B overlapping in A ∩ B)

17
Statistical Approach
  • A fair coin is tossed 3 times
  • What is the probability of (exactly) two heads
    (A) if the first toss has occurred and is H (B)?
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • B = {HHH, HHT, HTH, HTT}
  • A ∩ B = {HHT, HTH}
  • P(A|B) = P(A ∩ B) / P(B) = (2/8) / (4/8) = 2/4 = 1/2

18
Statistical Approach
  • Given
  • P(A|B) = P(A ∩ B) / P(B) ⇒ (multiply by P(B))
  • P(A ∩ B) = P(A|B) · P(B)
  • P(B ∩ A) = P(B|A) · P(A)
  • P(A ∩ B) = P(B ∩ A) ⇒
  • P(A ∩ B) = P(B|A) · P(A)
  • Bayes' Theorem
  • P(A|B) = P(A ∩ B) / P(B) = P(B|A) · P(A) / P(B)

19
Statistical Approach
  • Bayes' Theorem: check
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • B = {HHH, HHT, HTH, HTT}
  • A ∩ B = {HHT, HTH}
  • P(B|A) = P(B ∩ A) / P(A) = (2/8) / (3/8) = 2/3
  • P(A|B) = P(B|A) · P(A) / P(B) = (2/3 · 3/8) / (4/8)
    = 2 · 6/24 = 1/2
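
The same check can be reproduced by enumeration; a minimal Python sketch (the set names follow the slides, the rest is illustrative):

  from itertools import product
  from fractions import Fraction

  omega = list(product("HT", repeat=3))
  A = {o for o in omega if o.count("H") == 2}   # exactly two heads
  B = {o for o in omega if o[0] == "H"}         # first toss is heads

  def P(event):
      # Uniform distribution over the finite event space
      return Fraction(len(event), len(omega))

  # Direct definition of the conditional probability
  p_direct = P(A & B) / P(B)
  # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with P(B|A) = P(B ∩ A) / P(A)
  p_bayes = (P(A & B) / P(A)) * P(A) / P(B)
  print(p_direct, p_bayes)  # both 1/2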

20
Statistical Approach
  • Statistical approach
  • Using Bayesian inference (noisy channel model)
  • get P(O|I) for all possible O, given I
  • take the O for which P(O|I) is highest, given
    input I: Ô
  • Ô = argmax_O P(O|I)

21
Statistical Approach
  • Statistical approach
  • How to obtain P(O|I)?
  • Bayes' Theorem:
  • P(O|I) = P(I|O) · P(O) / P(I)

22
Statistical Approach
  • Did we gain anything?
  • Yes!
  • P(O) and P(I|O) are often easier to estimate
    than P(O|I)
  • P(I) can be ignored: it is independent of O
  • (though then we no longer have true probabilities)
  • In particular
  • argmax_O P(O|I) = argmax_O P(I|O) · P(O)
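
A minimal sketch of the argmax step in Python, using invented toy tables for P(O) and P(I|O) (the numbers are not from the slides):

  # Toy prior P(O) over candidate outputs, and likelihood P(I|O) for one observed input I
  p_o = {"bank": 0.6, "banc": 0.4}
  p_i_given_o = {"bank": 0.02, "banc": 0.05}

  # O-hat = argmax_O P(I|O) * P(O); P(I) is dropped because it does not depend on O
  o_hat = max(p_o, key=lambda o: p_i_given_o[o] * p_o[o])
  print(o_hat)  # "banc" (0.4 * 0.05 > 0.6 * 0.02)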

23
Statistical Approach
  • P(O) (also called the prior probability)
  • Used for the language model in MT and ASR
  • Cannot be computed; must be estimated
  • P(w) is estimated using the relative frequency of
    w in a (representative) corpus
  • Count how often w occurs in the corpus
  • Divide by the total number of word tokens in the
    corpus
  • Take this relative frequency as P(w)
  • (ignoring smoothing)
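
A minimal sketch of this relative-frequency estimate (the tiny corpus is invented for illustration, and smoothing is ignored as on the slide):

  from collections import Counter

  corpus = "the man saw the dog and the dog saw the man".split()
  counts = Counter(corpus)
  total = len(corpus)

  # P(w) is estimated as count(w) / total number of word tokens
  p = {w: c / total for w, c in counts.items()}
  print(p["the"])  # 4/11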

24
Statistical Approach
  • P(I|O) (also called the likelihood)
  • Cannot easily be computed
  • But can be estimated on the basis of a corpus
  • Speech recognition
  • Transcribed speech corpus
  • → Acoustic Model
  • Machine translation
  • Aligned parallel corpus
  • → Translation Model

25
Statistical Approach
  • How to deal with sentences instead of words?
  • Sentence S = w1 .. wn
  • P(S) = P(w1) · .. · P(wn)?
  • NO! This misses the connections between the words
  • P(S) = (chain rule)
  • P(w1) · P(w2|w1) · P(w3|w1w2) · .. · P(wn|w1..wn-1)

26
Statistical Approach
  • N-grams needed (not really feasible)
  • Probabilities of n-grams are estimated by the
    relative frequency of n-grams in a corpus
  • Frequencies get too low for n-grams with n > 3 to
    be useful
  • In practice: use bigrams or trigrams (sometimes
    4-grams)
  • E.g. bigram model
  • P(S) ≈ P(w2|w1) · P(w3|w2) · .. · P(wn|wn-1)
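
A minimal sketch of a bigram language model estimated by relative frequency (the toy corpus is invented for illustration, no smoothing):

  from collections import Counter

  sentences = [["the", "man", "ate", "pizza"],
               ["the", "man", "saw", "the", "dog"],
               ["the", "dog", "ate", "pizza"]]

  unigram_counts = Counter(w for s in sentences for w in s)
  bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

  def p_bigram(w_prev, w):
      # Relative-frequency estimate of P(w | w_prev)
      return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

  def p_sentence(words):
      # Bigram approximation: P(S) is the product of P(w_i | w_{i-1})
      p = 1.0
      for w_prev, w in zip(words, words[1:]):
          p *= p_bigram(w_prev, w)
      return p

  print(p_sentence(["the", "man", "ate", "pizza"]))  # 0.25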

27
Memory Based Learning
  • Classification
  • Determine input features
  • Determine output classes
  • Store observed examples
  • Use similarity metrics to classify unseen cases

28
Memory Based Learning
  • Example: PP-attachment
  • Given an input sequence V .. N .. PP
  • does the PP attach to V, or
  • does the PP attach to N?
  • Examples
  • John ate crisps with Mary
  • John ate pizza with fresh anchovies
  • John had pizza with his best friends

29
Memory Based Learning
  • Input features (feature vector)
  • Verb
  • Head noun of complement NP
  • Preposition
  • Head noun of complement NP in PP
  • Output classes (indicated by class labels)
  • Verb (i.e. attaches to the verb)
  • Noun (i.e. attaches to the noun)

30
Memory Based Learning
  • Training Corpus

Id Verb Noun1 Prep Noun2 Class
1 ate crisps with Mary Verb
2 ate pizza with anchovies Noun
3 had pizza with friends Verb
4 has pizza with John Verb
5
31
Memory Based Learning
  • MBL: store the training corpus (feature vectors
    + associated class) in memory
  • For new cases:
  • Stored in memory?
  • Yes: assign the associated class
  • No: use similarity metrics

32
Similarity Metrics
  • (actually distance metrics)
  • Input: eats pizza with Liam
  • Compare the input feature vector X with each
    vector Y in memory: Δ(X,Y)
  • Comparing vectors: sum the differences for the n
    individual features: Δ(X,Y) = Σ_{i=1..n} δ(xi, yi)

33
Similarity Metrics
  • δ(f1, f2)
  • f1, f2 numeric:
  • δ = |f1 - f2| / (max - min)
  • |12 - 2| = 10 in a range of 0..100 → 10/100 = 0.1
  • |12 - 2| = 10 in a range of 0..20 → 10/20 = 0.5
  • f1, f2 not numeric:
  • δ = 0 if f1 = f2 (no difference → distance 0)
  • δ = 1 if f1 ≠ f2 (difference → distance 1)
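
A minimal sketch of this per-feature distance in Python (scaled absolute difference for numeric features, overlap for symbolic ones):

  def delta(x, y, lo=None, hi=None):
      # Numeric features: absolute difference scaled by the feature's range
      if isinstance(x, (int, float)) and isinstance(y, (int, float)):
          return abs(x - y) / (hi - lo)
      # Symbolic features: 0 if equal, 1 otherwise (overlap metric)
      return 0 if x == y else 1

  print(delta(12, 2, lo=0, hi=100))  # 0.1
  print(delta(12, 2, lo=0, hi=20))   # 0.5
  print(delta("with", "with"))       # 0
  print(delta("ate", "eats"))        # 1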

34
Similarity Metrics
(Per-feature distances to the new vector are given in parentheses)
Id       Verb     Noun1       Prep      Noun2          Class  Δ(X,Y)
New (X)  eats     pizza       with      Liam           ??
Mem 1    ate (1)  crisps (1)  with (0)  Mary (1)       Verb   3
Mem 2    ate (1)  pizza (0)   with (0)  anchovies (1)  Noun   2
Mem 3    had (1)  pizza (0)   with (0)  friends (1)    Verb   2
Mem 4    has (1)  pizza (0)   with (0)  John (1)       Verb   2
Mem 5
35
Similarity Metrics
  • Look at the k nearest neighbours (k-NN)
  • (k = 1): look at the nearest set of vectors
  • The set of feature vectors with ids 2, 3, 4 has
    the smallest distance (viz. 2)
  • Take the most frequent class occurring in this
    set: Verb
  • Assign this as class to the new example
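
Putting the pieces together, a minimal k-NN sketch over the training table above, using the overlap distance (the function names are illustrative):

  from collections import Counter

  # Training instances from the slides: (Verb, Noun1, Prep, Noun2) -> class
  memory = [(("ate", "crisps", "with", "Mary"), "Verb"),
            (("ate", "pizza", "with", "anchovies"), "Noun"),
            (("had", "pizza", "with", "friends"), "Verb"),
            (("has", "pizza", "with", "John"), "Verb")]

  def distance(x, y):
      # Overlap metric summed over the four features
      return sum(0 if a == b else 1 for a, b in zip(x, y))

  def classify(x):
      dists = [(distance(x, feats), label) for feats, label in memory]
      nearest = min(d for d, _ in dists)
      # k = 1 "nearest distance set": majority class among all vectors at that distance
      labels = [label for d, label in dists if d == nearest]
      return Counter(labels).most_common(1)[0][0]

  print(classify(("eats", "pizza", "with", "Liam")))  # Verb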

36
Similarity Metrics
  • With Δ(X,Y) = Σ_{i=1..n} δ(xi, yi)
  • every feature is equally important
  • Perhaps some features are more important
  • Adaptation:
  • Δ(X,Y) = Σ_{i=1..n} wi · δ(xi, yi)
  • where wi is the weight of feature i

37
Similarity Metrics
  • How to obtain the weight of a feature?
  • Can be based on knowledge
  • Can be computed from the training corpus
  • In various ways
  • Information Gain
  • Gain Ratio
  • χ² (chi-square)
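
A minimal sketch of one such weighting, information gain, computed from the toy training corpus above (gain ratio and chi-square follow the same pattern):

  from collections import Counter
  from math import log2

  def entropy(labels):
      total = len(labels)
      return -sum(c / total * log2(c / total) for c in Counter(labels).values())

  def information_gain(instances, labels, i):
      # IG(f_i) = H(C) - sum over values v of P(f_i = v) * H(C | f_i = v)
      base = entropy(labels)
      remainder = 0.0
      for v in set(x[i] for x in instances):
          subset = [c for x, c in zip(instances, labels) if x[i] == v]
          remainder += len(subset) / len(labels) * entropy(subset)
      return base - remainder

  instances = [("ate", "crisps", "with", "Mary"),
               ("ate", "pizza", "with", "anchovies"),
               ("had", "pizza", "with", "friends"),
               ("has", "pizza", "with", "John")]
  labels = ["Verb", "Noun", "Verb", "Verb"]
  print([round(information_gain(instances, labels, i), 3) for i in range(4)])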

38
Methodology
  • Split corpus into
  • Training corpus
  • Test Corpus
  • Essential to keep test corpus separate
  • (Ideally) Keep Test Corpus unseen
  • Sometimes
  • Development set
  • To do tests while developing

39
Methodology
  • Split
  • Training: 50%
  • Test: 50%
  • Pro
  • Large test set
  • Con
  • Small training set

40
Methodology
  • Split
  • Training: 90%
  • Test: 10%
  • Pro
  • Large training set
  • Con
  • Small test set

41
Methodology
  • 10-fold cross-validation
  • Split the corpus into 10 equal subsets
  • Train on 9, test on 1 (in all 10 combinations)
  • Pro
  • Large training sets
  • Still independent test sets
  • Con: training set still not maximal
  • requires a lot of computation
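
A minimal sketch of 10-fold cross-validation in Python; the evaluation function here is an invented majority-class baseline, purely to make the example runnable:

  from collections import Counter

  def cross_validate(corpus, evaluate, k=10):
      # Split the corpus into k roughly equal, disjoint folds
      folds = [corpus[i::k] for i in range(k)]
      scores = []
      for i in range(k):
          test = folds[i]
          train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
          scores.append(evaluate(train, test))
      return sum(scores) / k

  def majority_baseline(train, test):
      # Predict the most frequent class seen in the training data
      majority = Counter(label for _, label in train).most_common(1)[0][0]
      return sum(label == majority for _, label in test) / len(test)

  corpus = [(i, "Verb" if i % 3 else "Noun") for i in range(100)]  # invented data
  print(cross_validate(corpus, majority_baseline))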

42
Methodology
  • Leave One Out
  • Use all examples in training set except 1
  • Test on 1 example (in all combinations)
  • Pro
  • Maximal training sets
  • Still independent test sets
  • Con: requires a lot of computation

43
Evaluation
                     True class            True class
                     Positive (P)          Negative (N)
Predicted positive   True Positive (TP)    False Positive (FP)
Predicted negative   False Negative (FN)   True Negative (TN)
44
Evaluation
  • TP: examples that have class C and are predicted
    to have class C
  • FP: examples that do not have class C but are
    predicted to have class C
  • FN: examples that have class C but are predicted
    not to have class C
  • TN: examples that do not have class C and are
    predicted not to have class C

45
Evaluation
  • Precision = TP / (TP + FP)
  • Recall = True Positive Rate = TP / P = TP / (TP + FN)
  • False Positive Rate = FP / N = FP / (FP + TN)
  • F-score = (2 · Prec · Rec) / (Prec + Rec)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
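
A minimal sketch computing these measures from the four counts (the example counts are invented):

  def evaluation_scores(tp, fp, fn, tn):
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)              # = TP / P, the true positive rate
      fpr = fp / (fp + tn)                 # false positive rate = FP / N
      f_score = 2 * precision * recall / (precision + recall)
      accuracy = (tp + tn) / (tp + tn + fp + fn)
      return precision, recall, fpr, f_score, accuracy

  print(evaluation_scores(tp=40, fp=10, fn=20, tn=30))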

46
Example Applications
  • Morphology for Dutch
  • Segmentation into stems and affixes
  • Abnormaliteiten -> abnormaal + iteit + en
  • Map to morphological features (e.g. inflectional)
  • liepen -> lopen + past + plural
  • Instance for each character
  • Features: focus character, the 5 preceding and 5
    following letters; class label
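
A minimal sketch of turning a word into one instance per character with 5 letters of context on each side; the padding symbol "_" is an assumption, not from the slides:

  def character_instances(word, context=5, pad="_"):
      # One instance per character: 5 preceding letters, the focus letter, 5 following letters
      padded = pad * context + word + pad * context
      instances = []
      for i, focus in enumerate(word):
          left = padded[i:i + context]
          right = padded[i + context + 1:i + 2 * context + 1]
          instances.append(tuple(left) + (focus,) + tuple(right))
      return instances

  for inst in character_instances("liepen"):
      print(inst)  # the class label (segmentation decision) would be added per instance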

47
Example Applications
  • Morphology for Dutch: Results
  •              Prec   Rec    F-score
  • Full         81.1   80.7   80.9
  • Typed Seg    90.3   89.9   90.1
  • Untyped Seg  90.4   90.0   90.2
  • Seg = correctly segmented
  • Typed = assigned correct type
  • Full = typed segmentation + correct spelling
    changes

48
Example Applications
  • Part-of-Speech Tagging
  • Assignment of tags to words in context
  • word -> (word, tag)
  • book that flight ->
  • (book, Verb) (that, Det) (flight, Noun)
  • Book in isolation is ambiguous between noun and
    verb, marked by the ambitag noun/verb

49
Example Applications
  • Part-of-Speech Tagging: Features
  • Context
  • preceding tag, following ambitag
  • Word
  • Actual word form for the 1000 most frequent words
  • some features of the word
  • ambitag of the word
  • +/- capitalized
  • +/- contains digits
  • +/- contains a hyphen
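
A minimal sketch of extracting such a feature vector for one token; the lexicon entries and the frequent-word list are invented placeholders:

  def token_features(word, prev_tag, next_ambitag, word_ambitag, frequent_words):
      # Word form is used only for frequent words, otherwise a generic placeholder
      form = word.lower() if word.lower() in frequent_words else "<RARE>"
      return {
          "prev_tag": prev_tag,          # tag already assigned to the preceding word
          "next_ambitag": next_ambitag,  # ambitag of the following word
          "form": form,
          "word_ambitag": word_ambitag,  # possible tags of the focus word itself
          "capitalized": word[0].isupper(),
          "has_digit": any(ch.isdigit() for ch in word),
          "has_hyphen": "-" in word,
      }

  print(token_features("Book", prev_tag="<START>", next_ambitag="Det",
                       word_ambitag="noun/verb", frequent_words={"the", "book", "that"}))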

50
Example Applications
  • Part-of-Speech Tagging: Results
  • WSJ: 96.4% accuracy
  • LOB Corpus: 97.0% accuracy

51
Example Applications
  • Phrase Chunking
  • Marking of major phrase boundaries
  • The man gave the boy the money ->
  • [NP the man] gave [NP the boy] [NP the money]
  • Usually encoded with tags per word
  • I-X = inside X, O = outside, B-X = beginning of a
    new X
  • the/I-NP man/I-NP gave/O the/I-NP boy/I-NP
    the/B-NP money/I-NP
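
A minimal sketch that decodes such a per-word tag sequence back into bracketed chunks (the function name is illustrative):

  def chunks_from_tags(tokens, tags):
      # Group consecutive I-X / B-X tags into phrases of type X; O tokens stand alone
      phrases, current, current_type = [], [], None
      for token, tag in zip(tokens, tags):
          if tag == "O" or tag.startswith("B-") or (current_type and tag[2:] != current_type):
              if current:
                  phrases.append((current_type, current))
              current, current_type = [], None
          if tag == "O":
              phrases.append((None, [token]))
          else:
              current.append(token)
              current_type = tag[2:]
      if current:
          phrases.append((current_type, current))
      return phrases

  tokens = ["the", "man", "gave", "the", "boy", "the", "money"]
  tags = ["I-NP", "I-NP", "O", "I-NP", "I-NP", "B-NP", "I-NP"]
  print(chunks_from_tags(tokens, tags))
  # [('NP', ['the', 'man']), (None, ['gave']), ('NP', ['the', 'boy']), ('NP', ['the', 'money'])]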

52
Example Applications
  • Phrase Chunking: Features
  • Word form
  • PoS-tags of
  • 2 preceding words
  • The focus word
  • 1 word to the right

53
Example Applications
  • Phrase Chunking: Results
  •        Prec   Rec    F-score
  • NP     92.5   92.2   92.3
  • VP     91.9   91.7   91.8
  • ADJP   68.4   65.0   66.7
  • ADVP   78.0   77.9   77.9
  • PP     91.9   92.2   92.0

54
Example Applications
  • Coreference Marking
  • COREA project
  • Demo (numbers are coreference chain indices):
  • Een 21-jarige dronkenlap[3] besloot maandagnacht
    zijn[50053] roes uit te slapen op de snelweg A19
    bij Naarden. De politie[129] trof de man[145005]
    slapend aan achter het stuur van zijn[501714]
    auto[18], terwijl de motor nog draaide
  • ('A 21-year-old drunk decided on Monday night to
    sleep off his intoxication on the A19 motorway
    near Naarden. The police found the man asleep
    behind the wheel of his car, while the engine was
    still running.')

55
Machine Learning CLARIN
  • Web services in workflow systems are created for
    several MBL-based tools
  • Orthographic Normalization
  • Morphological analysis
  • Lemmatization
  • PoS tagging
  • Chunking
  • Coreference assignment
  • Semantic annotation (semantic roles, locative and
    temporal adverbs)

56
Machine Learning CLARIN
  • Web services in workflow systems are created for
    statistically based tools such as
  • Speech recognition
  • Audio mining
  • All based on SPRAAK
  • Tomorrow more on this!