LING 572: Transcript and Presenter's Notes
1
Introduction
  • LING 572
  • Fei Xia
  • Week 1 1/4/06

2
Outline
  • Course overview
  • Mathematical foundation (Prereq)
  • Probability theory
  • Information theory
  • Basic concepts in the classification task

3
Course overview
4
General info
  • Course URL: http://courses.washington.edu/ling572
  • Syllabus (incl. slides, assignments, and papers)
    updated every week.
  • Message board
  • ESubmit
  • Slides
  • I will try to put the slides online before class.
  • Additional slides are not required and not
    covered in class.

5
Office hour
  • Fei
  • Email
  • Email address: fxia_at_u
  • Subject line should include "ling572"
  • The 48-hour rule
  • Office hour
  • Time: Fri 10-11:20am
  • Location: Padelford A-210G

6
Lab session
  • Bill McNeil
  • Email: billmcn_at_u
  • Lab session: what time is good for you?
  • Explaining homework and solutions
  • Mallet-related questions
  • Reviewing class material
  • → I highly recommend that you attend the lab
    sessions, especially the first few.

7
Time for Lab Session
  • Time
  • Monday 10:00am - 12:20pm, or
  • Tuesday 10:30am - 11:30am, or
  • ??
  • Location: ??
  • → Thursday 3-4pm, MGH 271?

8
Misc
  • Ling572 mailing list: ling572a_wi07_at_u
  • EPost
  • Mallet developer mailing list
  • mallet-dev_at_cs.umass.edu

9
Prerequisites
  • Ling570
  • Some basic algorithms: FSA, HMM, ...
  • NLP tasks: tokenization, POS tagging, ...
  • Programming: if you don't know Java well, talk to
    me.
  • Java → Mallet
  • Basic concepts in probability and statistics
  • Ex: random variables, chain rule, Gaussian
    distribution, ...
  • Basic concepts in Information Theory
  • Ex: entropy, relative entropy, ...

10
Expectations
  • Reading
  • Papers are online
  • Reference book: Manning & Schütze (M&S)
  • Finish reading the papers before class
  • → I will ask you questions.

11
Grades
  • Assignments (9 parts): 90%
  • Programming language: Java
  • Class participation: 10%
  • No quizzes, no final exams
  • No incompletes unless you can prove your case.

12
Course objectives
  • Covering basic statistical methods that produce
    state-of-the-art results
  • Focusing on classification algorithms
  • Touching on unsupervised and semi-supervised
    algorithms
  • Some material is not easy. We will focus on
    applications, not theoretical proofs.

13
Course layout
  • Supervised methods
  • Classification algorithms
  • Individual classifiers
  • Naïve Bayes
  • kNN and Rocchio
  • Decision tree
  • Decision list ??
  • Maximum Entropy (MaxEnt)
  • Classifier ensemble
  • Bagging
  • Boosting
  • System combination

14
Course layout (cont)
  • Supervised algorithms (cont)
  • Sequence labeling algorithms
  • Transformation-based learning (TBL)
  • FST, HMM, ...
  • Semi-supervised methods
  • Self-training
  • Co-training

15
Course layout (cont)
  • Unsupervised methods
  • EM algorithm
  • Forward-backward algorithm
  • Inside-outside algorithm

16
Questions for each method
  • Modeling
  • What is the model?
  • How does the decomposition work?
  • What kind of assumption is made?
  • How many types of model parameters?
  • How many internal (or non-model) parameters?
  • How to handle multi-class problem?
  • How to handle non-binary features?

17
Questions for each method (cont)
  • Training how to estimate parameters?
  • Decoding how to find the best solution?
  • Weaknesses and strengths?
  • Is the algorithm
  • robust? (e.g., handling outliers)
  • scalable?
  • prone to overfitting?
  • efficient in training time? Test time?
  • How much data is needed?
  • Labeled data
  • Unlabeled data

18
Relation between 570/571 and 572
  • 570/571 are organized by tasks; 572 is organized
    by learning methods.
  • 572 focuses on statistical methods.

19
NLP tasks covered in Ling570
  • Tokenization
  • Morphological analysis
  • POS tagging
  • Shallow parsing
  • WSD
  • NE tagging

20
NLP tasks covered in Ling571
  • Parsing
  • Semantics
  • Discourse
  • Dialogue
  • Natural language generation (NLG)

21
A ML method for multiple NLP tasks
  • Task (570/571)
  • Tokenization
  • POS tagging
  • Parsing
  • Reference resolution
  • Method (572)
  • MaxEnt

22
Multiple methods for one NLP task
  • Task (570/571) POS tagging
  • Method (572)
  • Decision tree
  • MaxEnt
  • Boosting
  • Bagging
  • .

23
Projects Task 1
  • Text classification task: 20 newsgroups
  • P1: First look at the Mallet package
  • P2: Your first tui class
  • Naïve Bayes
  • P3: Feature selection
  • Decision tree
  • P4: Bagging
  • Boosting
  • Individual project

24
Projects Task 2
  • Sequence labeling task: IGT detection
  • P5: MaxEnt
  • P6: Beam search
  • P7: TBA
  • P8: Presentation (final class)
  • P9: Final report
  • Group project (?)

25
Both projects
  • Use Mallet, a Java package
  • Two types of work
  • Reading code to understand ML methods
  • Writing code to solve problems

26
Feedback on assignments
  • Misc section in each assignment
  • How long did it take to finish the homework?
  • Which part is difficult?

27
Mallet overview
  • It is a Java package that includes many
  • classifiers,
  • sequence labeling algorithms,
  • optimization algorithms,
  • useful data classes, ...
  • You should
  • read the Mallet Guides
  • attend the Mallet tutorial next Tuesday,
    10:30-11:30am, LLC 109
  • start on Hw1
  • I will use Mallet class/method names if possible.

28
Questions for course overview?
29
Outline
  • Course overview
  • Mathematical foundation
  • Probability theory
  • Information theory
  • Basic concepts in the classification task

30
Probability Theory
31
Basic concepts
  • Sample space, event, event space
  • Random variable and random vector
  • Conditional probability, joint probability,
    marginal probability (prior)

32
Sample space, event, event space
  • Sample space (Ω): a collection of basic outcomes.
  • Ex: toss a coin twice: {HH, HT, TH, TT}
  • Event: an event is a subset of Ω.
  • Ex: {HT, TH}
  • Event space (2^Ω): the set of all possible events.

33
Random variable
  • The outcome of an experiment need not be a
    number.
  • We often want to represent outcomes as numbers.
  • A random variable X is a function Ω → R.
  • Ex: toss a coin twice: X(HH)=0, X(HT)=1, ...

34
Two types of random variables
  • Discrete: X takes on only a countable number of
    possible values.
  • Ex: toss a coin 10 times; X is the number of
    tails that are noted.
  • Continuous: X takes on an uncountable number of
    possible values.
  • Ex: X is the lifetime (in hours) of a light bulb.

35
Probability function
  • The probability function of a discrete variable X
    is a function which gives the probability p(x_i)
    that X equals x_i, i.e., p(x_i) = P(X = x_i).

36
Random vector
  • A random vector is a finite-dimensional vector of
    random variables: X = (X1, ..., Xk).
  • P(x) = P(x1, x2, ..., xn) = P(X1=x1, ..., Xn=xn)
  • Ex: P(w1, ..., wn, t1, ..., tn)

37
Three types of probability
  • Joint prob P(x,y): prob of x and y happening
    together
  • Conditional prob P(x|y): prob of x given a
    specific value of y
  • Marginal prob P(x): prob of x over all possible
    values of y

38
Common tricks (I): Marginal prob ← joint prob
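The standard identity for recovering the marginal from the joint:

  P(x) = \sum_y P(x, y)
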
39
Common tricks (II): Chain rule
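The chain rule in its standard form:

  P(x_1, x_2, \ldots, x_n) = P(x_1)\, P(x_2 \mid x_1) \cdots P(x_n \mid x_1, \ldots, x_{n-1})
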
40
Common tricks (III): Bayes rule
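Bayes rule in its standard form:

  P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}
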
41
Common tricks (IV): Independence assumption
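One common form of the independence assumption (a bigram-style Markov assumption; the exact form depends on the model):

  P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-1}),
  \quad \text{so} \quad
  P(x_1, \ldots, x_n) \approx \prod_i P(x_i \mid x_{i-1})
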
42
Prior and Posterior distribution
  • Prior distribution P(θ)
  • a distribution over parameter values θ, set
    prior to observing any data.
  • Posterior distribution P(θ | data)
  • It represents our belief that θ is true
    after observing the data.
  • Likelihood of the model θ: P(data | θ)
  • Relation among the three: Bayes rule
  • P(θ | data) = P(data | θ) P(θ) / P(data)

43
Two ways of estimating θ
  • Maximum likelihood (ML)
  • θ* = arg max_θ P(data | θ)
  • Maximum a posteriori (MAP)
  • θ* = arg max_θ P(θ | data)
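A standard worked example, assuming a coin flipped n times with h heads observed and a Beta(a, b) prior on the head probability θ:

  \hat{\theta}_{ML}  = \arg\max_\theta P(data \mid \theta) = h / n
  \hat{\theta}_{MAP} = \arg\max_\theta P(\theta \mid data) = (h + a - 1) / (n + a + b - 2)

With a uniform prior (a = b = 1), the two estimates coincide.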

44
Information Theory
45
Information theory
  • It is the use of probability theory to quantify
    and measure information.
  • Basic concepts
  • Entropy
  • Joint entropy and conditional entropy
  • Cross entropy and relative entropy
  • Mutual information and perplexity

46
Entropy
  • Entropy is a measure of the uncertainty
    associated with a distribution.
  • The lower bound on the number of bits it takes to
    transmit messages.
  • An example
  • Display the results of horse races.
  • Goal: minimize the number of bits to encode the
    results.
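For a discrete random variable X, the entropy (in bits) is:

  H(X) = -\sum_x p(x) \log_2 p(x)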

47
An example
  • Uniform distribution: p_i = 1/8.
  • Non-uniform distribution: (1/2, 1/4, 1/8, 1/16,
    1/64, 1/64, 1/64, 1/64)

(0, 10, 110, 1110, 111100, 111101, 111110, 111111)
  • The uniform distribution has higher entropy.
  • MaxEnt: make the distribution as uniform as
    possible.
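Working out both cases for the eight horses:

  H_{uniform} = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \text{ bits}
  H_{skewed}  = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{16}(4) + 4 \cdot \tfrac{1}{64}(6) = 2 \text{ bits}

The skewed entropy equals the average length of the variable-length code shown above, so on average only 2 bits per race are needed.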

48
Joint and conditional entropy
  • Joint entropy
  • Conditional entropy
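The standard definitions are:

  H(X, Y) = -\sum_{x,y} p(x, y) \log_2 p(x, y)
  H(Y \mid X) = -\sum_{x,y} p(x, y) \log_2 p(y \mid x) = H(X, Y) - H(X)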

49
Cross Entropy
  • Entropy
  • Cross Entropy
  • Cross entropy is a distance measure between p(x)
    and q(x): p(x) is the true distribution; q(x) is
    our estimate of p(x).
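In standard notation, the two formulas are:

  H(p) = -\sum_x p(x) \log_2 p(x)
  H(p, q) = -\sum_x p(x) \log_2 q(x)

The cross entropy is never smaller than the true entropy: H(p, q) \ge H(p), with equality exactly when q = p.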

50
Relative Entropy
  • Also called Kullback-Leibler divergence
  • Another distance measure between prob functions
    p and q.
  • KL divergence is asymmetric (not a true
    distance)
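The standard definition, which also ties relative entropy to cross entropy:

  KL(p \| q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)} = H(p, q) - H(p)

It is always non-negative, zero only when p = q, and in general KL(p \| q) \ne KL(q \| p).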

51
Mutual information
  • It measures how much is in common between X and
    Y:
  • I(X;Y) = KL(p(x,y) || p(x)p(y))
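Expanded, the standard form and its relation to entropy:

  I(X; Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
          = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)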

52
Perplexity
  • Perplexity is 2^H.
  • Perplexity is the weighted average number of
    choices a random variable has to make.

53
Questions for Mathematical foundation?
54
Outline
  • Course overview
  • Mathematical foundation
  • Probability theory
  • Information theory
  • Basic concepts in the classification task

55
Types of ML problems
  • Classification problem
  • Estimation problem
  • Clustering
  • Discovery
  • A learning method can be applied to one or more
    types of ML problems.
  • We will focus on the classification problem.

56
Definition of classification problem
  • Task
  • C = {c1, c2, ..., cm} is a set of pre-defined
    classes (a.k.a., labels, categories).
  • D = {d1, d2, ...} is a set of input instances to
    be classified.
  • A classifier is a function D × C → {0, 1}.
  • Multi-label vs. single-label
  • Single-label: for each di, only one class is
    assigned to it.
  • Multi-class vs. binary classification problem
  • Binary: |C| = 2.

57
Conversion to single-label binary problem
  • Multi-label → single-label
  • We will focus on the single-label problem.
  • A classifier D × C → {0, 1}
  • becomes D → C
  • More general definition: D × C → [0, 1]
  • Multi-class → binary problem
  • Positive examples vs. negative examples

58
Examples of classification problems
  • Text classification
  • Document filtering
  • Language/Author/Speaker id
  • WSD
  • PP attachment
  • Automatic essay grading

59
Problems that can be treated as a classification
problem
  • Tokenization / Word segmentation
  • POS tagging
  • NE detection
  • NP chunking
  • Parsing
  • Reference resolution

60
Labeled vs. unlabeled data
  • Labeled data
  • {(xi, yi)} is a set of labeled data.
  • xi ∈ D: data/input, often represented as a
    feature vector.
  • yi ∈ C: target/label
  • Unlabeled data
  • xi without yi.

61
Instance, training and test data
  • xi with or without yi is called an instance.
  • Training data: a set of (labeled) instances.
  • Test data: a set of unlabeled instances.
  • In Mallet, the training data is stored in an
    InstanceList, and so is the test data.

62
Attribute-value table
  • Each row corresponds to an instance.
  • Each column corresponds to a feature.
  • A feature type (a.k.a. a feature template): w-1
  • A feature: w-1=book
  • Binary feature vs. non-binary feature

63
Attribute-value table
       f1    f2    f3    ...   fK      Target
d1     yes   1     no    ...   -1000   c2
d2
d3
...
dn
64
Feature sequence vs. Feature vector
  • Feature sequence: a (featName, featValue) list
    for the features that are present.
  • Feature vector: a (featName, featValue) list for
    all the features.
  • Representing data x as a feature vector.
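A minimal Java sketch (illustrative names, not Mallet classes) contrasting the two representations for one instance:

  import java.util.LinkedHashMap;
  import java.util.Map;

  public class FeatureRepresentations {
      public static void main(String[] args) {
          // Feature sequence: (featName, featValue) pairs for the features that are present.
          Map<String, Double> featureSequence = new LinkedHashMap<>();
          featureSequence.put("w-1=book", 1.0);
          featureSequence.put("w+1=the", 1.0);

          // Feature vector: one value for every feature in a fixed, global feature set.
          String[] allFeatures = {"w-1=book", "w-1=car", "w+1=the", "w+1=a"};
          double[] featureVector = new double[allFeatures.length];
          for (int i = 0; i < allFeatures.length; i++) {
              featureVector[i] = featureSequence.getOrDefault(allFeatures[i], 0.0);
          }
          System.out.println(java.util.Arrays.toString(featureVector));  // [1.0, 0.0, 1.0, 0.0]
      }
  }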

65
Data/Input → a feature vector
  • Example
  • Task: text classification
  • Original x: a document
  • Feature vector: bag-of-words approach
  • In Mallet, the process is handled by a sequence
    of pipes:
  • Tokenization
  • Lowercase
  • Merging the counts
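A minimal Java sketch of those three steps; the tokenizer here is a crude stand-in, not one of Mallet's pipe classes:

  import java.util.HashMap;
  import java.util.Map;

  public class BagOfWords {
      // Convert a document into a (word -> count) feature representation.
      public static Map<String, Integer> toFeatures(String document) {
          Map<String, Integer> counts = new HashMap<>();
          for (String token : document.split("[^A-Za-z]+")) {   // tokenization (crude)
              if (token.isEmpty()) continue;
              String word = token.toLowerCase();                 // lowercasing
              counts.merge(word, 1, Integer::sum);               // merging the counts
          }
          return counts;
      }

      public static void main(String[] args) {
          System.out.println(toFeatures("The book, the whole book."));
          // prints {the=2, whole=1, book=2} (order depends on HashMap)
      }
  }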

66
Classifier and decision matrix
  • A classifier is a function f: f(x) = {(ci,
    scorei)}. It fills out a decision matrix.
  • (ci, scorei) is called a Classification in
    Mallet.

       d1     d2     d3     ...
c1     0.1    0.4    0
c2     0.9    0.1    0
c3

67
Trainer (a.k.a. Learner)
  • A trainer is a function that takes an
    InstanceList as input, and outputs a classifier.
  • Training stage:
  • Classifier train(instanceList)
  • Test stage:
  • Classification classify(instance)
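A minimal Java sketch of this trainer/classifier contract; the interface names are illustrative, not Mallet's actual API:

  import java.util.List;
  import java.util.Map;

  // One example: a feature vector, labeled during training, unlabeled (null) at test time.
  interface Instance {
      Map<String, Double> getFeatures();
      String getLabel();
  }

  // The (class, score) pairs produced for one instance.
  interface Classification {
      Map<String, Double> getScores();
  }

  interface Classifier {
      Classification classify(Instance instance);      // test stage
  }

  interface Trainer {
      Classifier train(List<Instance> instanceList);   // training stage
  }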

68
Important concepts (summary)
  • Instance, InstanceList
  • Labeled data, unlabeled data
  • Training data, test data
  • Feature, feature template
  • Feature vector
  • Attribute-value table
  • Trainer, classifier
  • Training stage, test stage

69
Steps for solving an NLP task with classifiers
  • Convert the task into a classification problem
    (optional)
  • Split data into training/test/validation
  • Convert the data into an attribute-value table
  • Training
  • Decoding
  • Evaluation

70
Important subtasks (for you)
  • Converting the data into an attribute-value table
  • Defining feature types
  • Feature selection
  • Converting an instance into a feature vector
  • Understanding the training/decoding algorithms for
    the various methods.

71
Notation
                Classification in general   Text categorization
Input/data      xi                          di
Target/label    yi                          ci
Features        fk                          tk (term)

72
Questions for Concepts in a classification task?
73
Summary
  • Course overview
  • Mathematical foundation
  • Probability theory
  • Information theory
  • M&S Ch. 2
  • Basic concepts in the classification task

74
Downloading
  • Hw1
  • Mallet Guide
  • Homework Guide

75
Coming up
  • Next Tuesday
  • Mallet tutorial on 1/8 (Tues), 10:30-11:30am, at
    LLC 109.
  • Classification algorithm overview and Naïve
    Bayes: read the paper beforehand.
  • Next Thursday
  • kNN and Rocchio: read the other paper
  • Hw1 is due at 11pm on 1/13

76
Additional slides
77
An example
  • 570/571
  • POS tagging: HMM
  • Parsing: PCFG
  • MT: Model 1-4 training
  • 572
  • HMM: forward-backward algorithm
  • PCFG: inside-outside algorithm
  • MT: EM algorithm
  • → All are special cases of the EM algorithm, one
    method of unsupervised learning.

78
Proof: Relative entropy is always non-negative
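A standard proof sketch uses Jensen's inequality and the concavity of the log function:

  -KL(p \| q) = \sum_x p(x) \log_2 \frac{q(x)}{p(x)}
             \le \log_2 \sum_x p(x) \frac{q(x)}{p(x)}
              = \log_2 \sum_x q(x) = \log_2 1 = 0

Hence KL(p \| q) \ge 0, with equality iff p = q.
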
79
Entropy of a language
  • The entropy of a language L:
  • If we make certain assumptions that the language
    is "nice", then the entropy can be
    calculated as:
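In the standard formulation (as in M&S), with x_{1n} ranging over word sequences of length n:

  H(L) = \lim_{n \to \infty} -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 p(x_{1n})

and, under the "nice" (stationary, ergodic) assumptions, it can be estimated from a single long sample:

  H(L) \approx -\frac{1}{n} \log_2 p(x_{1n})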

80
Cross entropy of a language
  • The cross entropy of a language L:
  • If we make certain assumptions that the language
    is "nice", then the cross entropy can be
    calculated as:
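In the standard formulation, where m is our model of the true distribution p:

  H(L, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 m(x_{1n})

and, under the same "nice" assumptions:

  H(L, m) \approx -\frac{1}{n} \log_2 m(x_{1n})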

81
Conditional Entropy