Machine Learning: Basic Introduction

1
Machine Learning: Basic Introduction
  • Jan Odijk
  • January 2011
  • LOT Winter School 2011

2
Overview
  • Introduction
  • Rule-based Approaches
  • Machine Learning Approaches
  • Statistical Approach
  • Memory Based Learning
  • Methodology
  • Evaluation
  • Machine Learning CLARIN

3
Introduction
  • As a scientific discipline
  • Studies algorithms that allow computers to evolve
    behaviors based on empirical data
  • Learning: empirical data are used to improve
    performance on some tasks
  • Core concept: generalize from observed data

4
Introduction
  • Plural Formation
  • Observed: list of (singular form, plural form)
    pairs
  • Generalize: predict the plural form given a
    singular form for new words (not in the observed
    list)
  • PoS tagging
  • Observed: text corpus with PoS-tag annotations
  • Generalize: predict the PoS-tag of each token in
    a new text corpus

5
Introduction
  • Supervised Learning
  • Map input into desired output, e.g. classes
  • Requires a training set
  • Unsupervised Learning
  • Model a set of inputs (e.g. into clusters)
  • No training set required

6
Introduction
  • Many approaches
  • Decision Tree Learning
  • Artificial Neural Networks
  • Genetic programming
  • Support Vector Machines
  • Statistical Approaches
  • Memory Based Learning

7
Introduction
  • Focus here
  • Supervised learning
  • Statistical Approaches
  • Memory-based learning

8
Rule-Based Approaches
  • Rule based systems for language
  • Lexicon
  • Lists all idiosyncratic properties of lexical
    items
  • Unpredictable properties, e.g. man is a noun
  • Exceptions to rules, e.g. past tense(go) = went
  • Hand-crafted
  • In a fully formalized manner

9
Rule-Based Approaches
  • Rule based systems for language (cont.)
  • Rules
  • Specify regular properties of the language
  • E.g. direct object directly follows verb (in
    English)
  • Hand-crafted
  • In a fully formalized manner

10
Rule-Based Approaches
  • Problems for rule based systems
  • Lexicon
  • Very difficult to specify and create
  • Always incomplete
  • Existing dictionaries
  • Were developed for use by humans
  • Do not specify enough properties
  • Do not specify the properties in a formalized
    manner

11
Rule-Based Approaches
  • Problems for rule based systems (cont.)
  • Rules
  • Extremely difficult to describe a language (or
    even a significant subset of language) by rules
  • Rule systems become very large and difficult to
    maintain
  • (No robustness (fail softly) for unexpected
    input)

12
Machine Learning
  • Machine Learning
  • A machine learns
  • Lexicon
  • Regularities of language
  • From a large corpus of observed data

13
Statistical Approach
  • Statistical approach
  • Goal: get output O given some input I
  • Given a word in English, get its translation in
    Spanish
  • Given an acoustic signal containing speech, get
    the written transcription of the spoken words
  • Given the preceding tags and the following
    ambitag, get the tag of the current word
  • Work with probabilities P(O|I)

14
Statistical Approach
  • P(A): probability of event A
  • A: an event (usually modelled as a set)
  • Event space Ω: all possible elementary events
  • 0 ≤ P(A) ≤ 1
  • For a finite event space and a uniform
    distribution: P(A) = |A| / |Ω|

15
Statistical Approach
  • Simple Example
  • A fair coin is tossed 3 times
  • What is the probability of (exactly) two heads?
  • 2 possibilities for each toss: Heads or Tails
  • Solution
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • P(A) = |A| / |Ω| = 3/8
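
A minimal Python sketch of this computation, enumerating the event space directly (the variable names are illustrative, not from the slides):

  from itertools import product

  # Event space Omega: all sequences of 3 tosses, each H or T
  omega = list(product("HT", repeat=3))

  # Event A: exactly two heads
  A = [o for o in omega if o.count("H") == 2]

  # Uniform distribution: P(A) = |A| / |Omega|
  print(len(A) / len(omega))  # 0.375 = 3/8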

16
Statistical Approach
  • Conditional Probability
  • P(A|B)
  • Probability of event A given that event B has
    occurred
  • P(A|B) = P(A ∩ B) / P(B) (for P(B) > 0)
  • (Venn diagram: sets A and B overlapping in A ∩ B)

17
Statistical Approach
  • A fair coin is tossed 3 times
  • What is the probability of (exactly) two heads
    (A) if the first toss has occurred and is H (B)?
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • B = {HHH, HHT, HTH, HTT}
  • A ∩ B = {HHT, HTH}
  • P(A|B) = P(A ∩ B) / P(B) = (2/8) / (4/8) = 2/4 = 1/2

18
Statistical Approach
  • Given
  • P(A|B) = P(A ∩ B) / P(B) ⇒ (multiply by P(B))
  • P(A ∩ B) = P(A|B) · P(B)
  • P(B ∩ A) = P(B|A) · P(A)
  • P(A ∩ B) = P(B ∩ A) ⇒
  • P(A ∩ B) = P(B|A) · P(A)
  • Bayes' Theorem
  • P(A|B) = P(A ∩ B) / P(B) = P(B|A) · P(A) / P(B)

19
Statistical Approach
  • Bayes' Theorem: check
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • A = {HHT, HTH, THH}
  • B = {HHH, HHT, HTH, HTT}
  • A ∩ B = {HHT, HTH}
  • P(B|A) = P(B ∩ A) / P(A) = (2/8) / (3/8) = 2/3
  • P(A|B) = P(B|A) · P(A) / P(B) = (2/3 · 3/8) / (4/8)
    = 2 · 6/24 = 1/2
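
The same check can be reproduced by enumeration; a minimal Python sketch (the set names follow the slides, the rest is illustrative):

  from itertools import product
  from fractions import Fraction

  omega = list(product("HT", repeat=3))
  A = {o for o in omega if o.count("H") == 2}   # exactly two heads
  B = {o for o in omega if o[0] == "H"}         # first toss is heads

  def P(event):
      # Uniform distribution over the finite event space
      return Fraction(len(event), len(omega))

  # Direct definition of the conditional probability
  p_direct = P(A & B) / P(B)
  # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with P(B|A) = P(B ∩ A) / P(A)
  p_bayes = (P(A & B) / P(A)) * P(A) / P(B)
  print(p_direct, p_bayes)  # both 1/2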

20
Statistical Approach
  • Statistical approach
  • Using Bayesian inference (noisy channel model)
  • get P(O|I) for all possible O, given I
  • take the O for which P(O|I) is highest, given
    input I: Ô
  • Ô = argmax_O P(O|I)

21
Statistical Approach
  • Statistical approach
  • How to obtain P(O|I)?
  • Bayes' Theorem:
  • P(O|I) = P(I|O) · P(O) / P(I)

22
Statistical Approach
  • Did we gain anything?
  • Yes!
  • P(O) and P(I|O) are often easier to estimate
    than P(O|I)
  • P(I) can be ignored: it is independent of O
  • (though then we no longer have true probabilities)
  • In particular
  • argmax_O P(O|I) = argmax_O P(I|O) · P(O)
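
A minimal sketch of the argmax step in Python, using invented toy tables for P(O) and P(I|O) (the numbers are not from the slides):

  # Toy prior P(O) over candidate outputs, and likelihood P(I|O) for one observed input I
  p_o = {"bank": 0.6, "banc": 0.4}
  p_i_given_o = {"bank": 0.02, "banc": 0.05}

  # O-hat = argmax_O P(I|O) * P(O); P(I) is dropped because it does not depend on O
  o_hat = max(p_o, key=lambda o: p_i_given_o[o] * p_o[o])
  print(o_hat)  # "banc" (0.4 * 0.05 > 0.6 * 0.02)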

23
Statistical Approach
  • P(O) (also called the prior probability)
  • Used for the language model in MT and ASR
  • Cannot be computed; must be estimated
  • P(w) is estimated using the relative frequency of
    w in a (representative) corpus
  • Count how often w occurs in the corpus
  • Divide by the total number of word tokens in the
    corpus
  • Take this relative frequency as P(w)
  • (ignoring smoothing)
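
A minimal sketch of this relative-frequency estimate (the tiny corpus is invented for illustration, and smoothing is ignored as on the slide):

  from collections import Counter

  corpus = "the man saw the dog and the dog saw the man".split()
  counts = Counter(corpus)
  total = len(corpus)

  # P(w) is estimated as count(w) / total number of word tokens
  p = {w: c / total for w, c in counts.items()}
  print(p["the"])  # 4/11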

24
Statistical Approach
  • P(I|O) (also called the likelihood)
  • Cannot easily be computed
  • But can be estimated on the basis of a corpus
  • Speech recognition
  • Transcribed speech corpus
  • → Acoustic Model
  • Machine translation
  • Aligned parallel corpus
  • → Translation Model

25
Statistical Approach
  • How to deal with sentences instead of words?
  • Sentence S = w1 .. wn
  • P(S) = P(w1) · .. · P(wn)?
  • NO! This misses the connections between the words
  • P(S) = (chain rule)
  • P(w1) · P(w2|w1) · P(w3|w1w2) · .. · P(wn|w1..wn-1)

26
Statistical Approach
  • N-grams needed (not really feasible)
  • Probabilities of n-grams are estimated by the
    relative frequency of n-grams in a corpus
  • Frequencies get too low for n-grams with n > 3 to
    be useful
  • In practice: use bigrams or trigrams (sometimes
    4-grams)
  • E.g. bigram model
  • P(S) ≈ P(w2|w1) · P(w3|w2) · .. · P(wn|wn-1)
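
A minimal sketch of a bigram language model estimated by relative frequency (the toy corpus is invented for illustration, no smoothing):

  from collections import Counter

  sentences = [["the", "man", "ate", "pizza"],
               ["the", "man", "saw", "the", "dog"],
               ["the", "dog", "ate", "pizza"]]

  unigram_counts = Counter(w for s in sentences for w in s)
  bigram_counts = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

  def p_bigram(w_prev, w):
      # Relative-frequency estimate of P(w | w_prev)
      return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

  def p_sentence(words):
      # Bigram approximation: P(S) is the product of P(w_i | w_{i-1})
      p = 1.0
      for w_prev, w in zip(words, words[1:]):
          p *= p_bigram(w_prev, w)
      return p

  print(p_sentence(["the", "man", "ate", "pizza"]))  # 0.25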

27
Memory Based Learning
  • Classification
  • Determine input features
  • Determine output classes
  • Store observed examples
  • Use similarity metrics to classify unseen cases

28
Memory Based Learning
  • Example: PP-attachment
  • Given an input sequence V .. N .. PP
  • does the PP attach to V, or
  • does the PP attach to N?
  • Examples
  • John ate crisps with Mary
  • John ate pizza with fresh anchovies
  • John had pizza with his best friends

29
Memory Based Learning
  • Input features (feature vector)
  • Verb
  • Head noun of complement NP
  • Preposition
  • Head noun of complement NP in PP
  • Output classes (indicated by class labels)
  • Verb (i.e. attaches to the verb)
  • Noun (i.e. attaches to the noun)

30
Memory Based Learning
  • Training Corpus

Id Verb Noun1 Prep Noun2 Class
1 ate crisps with Mary Verb
2 ate pizza with anchovies Noun
3 had pizza with friends Verb
4 has pizza with John Verb
5
31
Memory Based Learning
  • MBL: store the training corpus (feature vectors
    + associated class) in memory
  • For new cases:
  • Stored in memory?
  • Yes: assign the associated class
  • No: use similarity metrics

32
Similarity Metrics
  • (actually distance metrics)
  • Input: eats pizza with Liam
  • Compare the input feature vector X with each
    vector Y in memory: Δ(X,Y)
  • Comparing vectors: sum the differences for the n
    individual features: Δ(X,Y) = Σ_{i=1..n} δ(xi, yi)

33
Similarity Metrics
  • δ(f1, f2)
  • f1, f2 numeric:
  • δ = |f1 - f2| / (max - min)
  • |12 - 2| = 10 in a range of 0..100 → 10/100 = 0.1
  • |12 - 2| = 10 in a range of 0..20 → 10/20 = 0.5
  • f1, f2 not numeric:
  • δ = 0 if f1 = f2 (no difference → distance 0)
  • δ = 1 if f1 ≠ f2 (difference → distance 1)
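
A minimal sketch of this per-feature distance in Python (scaled absolute difference for numeric features, overlap for symbolic ones):

  def delta(x, y, lo=None, hi=None):
      # Numeric features: absolute difference scaled by the feature's range
      if isinstance(x, (int, float)) and isinstance(y, (int, float)):
          return abs(x - y) / (hi - lo)
      # Symbolic features: 0 if equal, 1 otherwise (overlap metric)
      return 0 if x == y else 1

  print(delta(12, 2, lo=0, hi=100))  # 0.1
  print(delta(12, 2, lo=0, hi=20))   # 0.5
  print(delta("with", "with"))       # 0
  print(delta("ate", "eats"))        # 1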

34
Similarity Metrics
(Per-feature distances to the new vector are given in parentheses)
Id       Verb     Noun1       Prep      Noun2          Class  Δ(X,Y)
New (X)  eats     pizza       with      Liam           ??
Mem 1    ate (1)  crisps (1)  with (0)  Mary (1)       Verb   3
Mem 2    ate (1)  pizza (0)   with (0)  anchovies (1)  Noun   2
Mem 3    had (1)  pizza (0)   with (0)  friends (1)    Verb   2
Mem 4    has (1)  pizza (0)   with (0)  John (1)       Verb   2
Mem 5
35
Similarity Metrics
  • Look at the k nearest neighbours (k-NN)
  • (k = 1): look at the nearest set of vectors
  • The set of feature vectors with ids 2, 3, 4 has
    the smallest distance (viz. 2)
  • Take the most frequent class occurring in this
    set: Verb
  • Assign this as class to the new example
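
Putting the pieces together, a minimal k-NN sketch over the training table above, using the overlap distance (the function names are illustrative):

  from collections import Counter

  # Training instances from the slides: (Verb, Noun1, Prep, Noun2) -> class
  memory = [(("ate", "crisps", "with", "Mary"), "Verb"),
            (("ate", "pizza", "with", "anchovies"), "Noun"),
            (("had", "pizza", "with", "friends"), "Verb"),
            (("has", "pizza", "with", "John"), "Verb")]

  def distance(x, y):
      # Overlap metric summed over the four features
      return sum(0 if a == b else 1 for a, b in zip(x, y))

  def classify(x):
      dists = [(distance(x, feats), label) for feats, label in memory]
      nearest = min(d for d, _ in dists)
      # k = 1 "nearest distance set": majority class among all vectors at that distance
      labels = [label for d, label in dists if d == nearest]
      return Counter(labels).most_common(1)[0][0]

  print(classify(("eats", "pizza", "with", "Liam")))  # Verb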

36
Similarity Metrics
  • With Δ(X,Y) = Σ_{i=1..n} δ(xi, yi)
  • every feature is equally important
  • Perhaps some features are more important
  • Adaptation:
  • Δ(X,Y) = Σ_{i=1..n} wi · δ(xi, yi)
  • where wi is the weight of feature i

37
Similarity Metrics
  • How to obtain the weight of a feature?
  • Can be based on knowledge
  • Can be computed from the training corpus
  • In various ways
  • Information Gain
  • Gain Ratio
  • χ² (chi-square)
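
A minimal sketch of one such weighting, information gain, computed from the toy training corpus above (gain ratio and chi-square follow the same pattern):

  from collections import Counter
  from math import log2

  def entropy(labels):
      total = len(labels)
      return -sum(c / total * log2(c / total) for c in Counter(labels).values())

  def information_gain(instances, labels, i):
      # IG(f_i) = H(C) - sum over values v of P(f_i = v) * H(C | f_i = v)
      base = entropy(labels)
      remainder = 0.0
      for v in set(x[i] for x in instances):
          subset = [c for x, c in zip(instances, labels) if x[i] == v]
          remainder += len(subset) / len(labels) * entropy(subset)
      return base - remainder

  instances = [("ate", "crisps", "with", "Mary"),
               ("ate", "pizza", "with", "anchovies"),
               ("had", "pizza", "with", "friends"),
               ("has", "pizza", "with", "John")]
  labels = ["Verb", "Noun", "Verb", "Verb"]
  print([round(information_gain(instances, labels, i), 3) for i in range(4)])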

38
Methodology
  • Split corpus into
  • Training corpus
  • Test Corpus
  • Essential to keep test corpus separate
  • (Ideally) Keep Test Corpus unseen
  • Sometimes
  • Development set
  • To do tests while developing

39
Methodology
  • Split
  • Training: 50%
  • Test: 50%
  • Pro
  • Large test set
  • Con
  • Small training set

40
Methodology
  • Split
  • Training: 90%
  • Test: 10%
  • Pro
  • Large training set
  • Con
  • Small test set

41
Methodology
  • 10-fold cross-validation
  • Split the corpus into 10 equal subsets
  • Train on 9, test on 1 (in all 10 combinations)
  • Pro
  • Large training sets
  • Still independent test sets
  • Con: training set still not maximal
  • requires a lot of computation
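
A minimal sketch of 10-fold cross-validation in Python; the evaluation function here is an invented majority-class baseline, purely to make the example runnable:

  from collections import Counter

  def cross_validate(corpus, evaluate, k=10):
      # Split the corpus into k roughly equal, disjoint folds
      folds = [corpus[i::k] for i in range(k)]
      scores = []
      for i in range(k):
          test = folds[i]
          train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
          scores.append(evaluate(train, test))
      return sum(scores) / k

  def majority_baseline(train, test):
      # Predict the most frequent class seen in the training data
      majority = Counter(label for _, label in train).most_common(1)[0][0]
      return sum(label == majority for _, label in test) / len(test)

  corpus = [(i, "Verb" if i % 3 else "Noun") for i in range(100)]  # invented data
  print(cross_validate(corpus, majority_baseline))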

42
Methodology
  • Leave One Out
  • Use all examples in training set except 1
  • Test on 1 example (in all combinations)
  • Pro
  • Maximal training sets
  • Still independent test sets
  • Con: requires a lot of computation

43
Evaluation
                     True class            True class
                     Positive (P)          Negative (N)
Predicted positive   True Positive (TP)    False Positive (FP)
Predicted negative   False Negative (FN)   True Negative (TN)
44
Evaluation
  • TP: examples that have class C and are predicted
    to have class C
  • FP: examples that do not have class C but are
    predicted to have class C
  • FN: examples that have class C but are predicted
    not to have class C
  • TN: examples that do not have class C and are
    predicted not to have class C

45
Evaluation
  • Precision = TP / (TP + FP)
  • Recall = True Positive Rate = TP / P = TP / (TP + FN)
  • False Positive Rate = FP / N = FP / (FP + TN)
  • F-score = (2 · Prec · Rec) / (Prec + Rec)
  • Accuracy = (TP + TN) / (TP + TN + FP + FN)
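
A minimal sketch computing these measures from the four counts (the example counts are invented):

  def evaluation_scores(tp, fp, fn, tn):
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)              # = TP / P, the true positive rate
      fpr = fp / (fp + tn)                 # false positive rate = FP / N
      f_score = 2 * precision * recall / (precision + recall)
      accuracy = (tp + tn) / (tp + tn + fp + fn)
      return precision, recall, fpr, f_score, accuracy

  print(evaluation_scores(tp=40, fp=10, fn=20, tn=30))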

46
Example Applications
  • Morphology for Dutch
  • Segmentation into stems and affixes
  • Abnormaliteiten -> abnormaal + iteit + en
  • Map to morphological features (e.g. inflectional)
  • liepen -> lopen + past + plural
  • Instance for each character
  • Features: focus character, the 5 preceding and 5
    following letters; class label
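
A minimal sketch of turning a word into one instance per character with 5 letters of context on each side; the padding symbol "_" is an assumption, not from the slides:

  def character_instances(word, context=5, pad="_"):
      # One instance per character: 5 preceding letters, the focus letter, 5 following letters
      padded = pad * context + word + pad * context
      instances = []
      for i, focus in enumerate(word):
          left = padded[i:i + context]
          right = padded[i + context + 1:i + 2 * context + 1]
          instances.append(tuple(left) + (focus,) + tuple(right))
      return instances

  for inst in character_instances("liepen"):
      print(inst)  # the class label (segmentation decision) would be added per instance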

47
Example Applications
  • Morphology for Dutch: Results
  •              Prec   Rec    F-score
  • Full         81.1   80.7   80.9
  • Typed Seg    90.3   89.9   90.1
  • Untyped Seg  90.4   90.0   90.2
  • Seg = correctly segmented
  • Typed = assigned correct type
  • Full = typed segmentation + correct spelling
    changes

48
Example Applications
  • Part-of-Speech Tagging
  • Assignment of tags to words in context
  • word -> (word, tag)
  • book that flight ->
  • (book, Verb) (that, Det) (flight, Noun)
  • Book in isolation is ambiguous between noun and
    verb, marked by the ambitag noun/verb

49
Example Applications
  • Part-of-Speech Tagging: Features
  • Context
  • preceding tag, following ambitag
  • Word
  • Actual word form for the 1000 most frequent words
  • some features of the word
  • ambitag of the word
  • +/- capitalized
  • +/- contains digits
  • +/- contains a hyphen
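
A minimal sketch of extracting such a feature vector for one token; the lexicon entries and the frequent-word list are invented placeholders:

  def token_features(word, prev_tag, next_ambitag, word_ambitag, frequent_words):
      # Word form is used only for frequent words, otherwise a generic placeholder
      form = word.lower() if word.lower() in frequent_words else "<RARE>"
      return {
          "prev_tag": prev_tag,          # tag already assigned to the preceding word
          "next_ambitag": next_ambitag,  # ambitag of the following word
          "form": form,
          "word_ambitag": word_ambitag,  # possible tags of the focus word itself
          "capitalized": word[0].isupper(),
          "has_digit": any(ch.isdigit() for ch in word),
          "has_hyphen": "-" in word,
      }

  print(token_features("Book", prev_tag="<START>", next_ambitag="Det",
                       word_ambitag="noun/verb", frequent_words={"the", "book", "that"}))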

50
Example Applications
  • Part-of-Speech Tagging: Results
  • WSJ: 96.4% accuracy
  • LOB Corpus: 97.0% accuracy

51
Example Applications
  • Phrase Chunking
  • Marking of major phrase boundaries
  • The man gave the boy the money ->
  • [NP the man] gave [NP the boy] [NP the money]
  • Usually encoded with tags per word
  • I-X = inside X, O = outside, B-X = beginning of a
    new X
  • the/I-NP man/I-NP gave/O the/I-NP boy/I-NP
    the/B-NP money/I-NP
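
A minimal sketch that decodes such a per-word tag sequence back into bracketed chunks (the function name is illustrative):

  def chunks_from_tags(tokens, tags):
      # Group consecutive I-X / B-X tags into phrases of type X; O tokens stand alone
      phrases, current, current_type = [], [], None
      for token, tag in zip(tokens, tags):
          if tag == "O" or tag.startswith("B-") or (current_type and tag[2:] != current_type):
              if current:
                  phrases.append((current_type, current))
              current, current_type = [], None
          if tag == "O":
              phrases.append((None, [token]))
          else:
              current.append(token)
              current_type = tag[2:]
      if current:
          phrases.append((current_type, current))
      return phrases

  tokens = ["the", "man", "gave", "the", "boy", "the", "money"]
  tags = ["I-NP", "I-NP", "O", "I-NP", "I-NP", "B-NP", "I-NP"]
  print(chunks_from_tags(tokens, tags))
  # [('NP', ['the', 'man']), (None, ['gave']), ('NP', ['the', 'boy']), ('NP', ['the', 'money'])]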

52
Example Applications
  • Phrase Chunking: Features
  • Word form
  • PoS-tags of
  • 2 preceding words
  • The focus word
  • 1 word to the right

53
Example Applications
  • Phrase Chunking: Results
  •        Prec   Rec    F-score
  • NP     92.5   92.2   92.3
  • VP     91.9   91.7   91.8
  • ADJP   68.4   65.0   66.7
  • ADVP   78.0   77.9   77.9
  • PP     91.9   92.2   92.0

54
Example Applications
  • Coreference Marking
  • COREA project
  • Demo (numbers are coreference chain indices):
  • Een 21-jarige dronkenlap[3] besloot maandagnacht
    zijn[50053] roes uit te slapen op de snelweg A19
    bij Naarden. De politie[129] trof de man[145005]
    slapend aan achter het stuur van zijn[501714]
    auto[18], terwijl de motor nog draaide
  • ('A 21-year-old drunk decided on Monday night to
    sleep off his intoxication on the A19 motorway
    near Naarden. The police found the man asleep
    behind the wheel of his car, while the engine was
    still running.')

55
Machine Learning CLARIN
  • Web services in workflow systems are created for
    several MBL-based tools
  • Orthographic Normalization
  • Morphological analysis
  • Lemmatization
  • PoS tagging
  • Chunking
  • Coreference assignment
  • Semantic annotation (semantic roles, locative and
    temporal adverbs)

56
Machine Learning CLARIN
  • Web services in workflow systems are created for
    statistically based tools such as
  • Speech recognition
  • Audio mining
  • All based on SPRAAK
  • Tomorrow more on this!