Natural Language Processing (2): Basic Probability (PowerPoint transcript)
1
Natural Language Processing (2): Basic Probability
  • Dr. Xuan Wang
  • Intelligence Computing Research Center
  • Harbin Institute of Technology Shenzhen Graduate School
  • Slides from Dr. Mary P. Harper, ECE, Purdue University

2
Motivation
  • Statistical NLP aims to do statistical inference.
  • Statistical inference consists of taking some
    data (generated according to some unknown
    probability distribution) and then making some
    inferences about this distribution.
  • An example of statistical inference is the task
    of language modeling, namely predicting the next
    word given a window of previous words. To do
    this, we need a model of the language.
  • Probability theory helps us to find such a model.

3
Probability Terminology
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial (e.g., tossing a
    coin twice).
  • The collection of basic outcomes (or sample
    points) for our experiment is called the sample
    space.

4
Probability Terminology
  • An event is a subset of the sample space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1, certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample
    space.

5
Experiments and Sample Spaces
  • The set of possible basic outcomes of an experiment is the sample space Ω:
  • coin toss: Ω = {head, tail}
  • tossing a coin 2 times: Ω = {HH, HT, TH, TT}
  • dice roll: Ω = {1, 2, 3, 4, 5, 6}
  • missing word: |Ω| ≈ vocabulary size
  • Discrete (countable) versus continuous (uncountable)
  • Every observation/trial is a basic outcome or sample point.
  • An event A is a set of basic outcomes with A ⊆ Ω; ∅ is the impossible event.

6
Events and Probability
  • The probability of event A is denoted p(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
  • Example experiment: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • cases with two or more tails:
  • A = {HTT, THT, TTH, TTT}
  • P(A) = |A| / |Ω| = 1/2 (assuming a uniform distribution)
  • all heads:
  • A = {HHH}
  • P(A) = |A| / |Ω| = 1/8 (a small sketch of this computation follows below)
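Not part of the original slides: a minimal Python sketch of this computation, enumerating the sample space and counting outcomes under a uniform distribution.

```python
from itertools import product

# Sample space for tossing a coin three times: Omega = {HHH, HHT, ..., TTT}
omega = [''.join(t) for t in product('HT', repeat=3)]

# Event A: outcomes with two or more tails
A = [o for o in omega if o.count('T') >= 2]

# Under a uniform distribution, P(A) = |A| / |Omega|
print(A, len(A) / len(omega))   # ['HTT', 'THT', 'TTH', 'TTT'] 0.5
```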

7
Probability Properties
  • Basic properties:
  • P(A) ∈ [0, 1]
  • P(Ω) = 1
  • For disjoint events: P(∪i Ai) = Σi P(Ai)
  • N.B.: the axiomatic definition of probability takes the above three conditions as axioms.
  • Immediate consequences:
  • P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)

8
Joint Probability
  • Joint probability of A and B: P(A,B) = P(A ∩ B)
  • A 2-dimensional table (A × B) with a value in every cell giving the probability of that specific pair occurring.

9
Conditional Probability
  • Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful.
  • If we know that event B is true, then we can determine the probability that A is true given this knowledge:
  • P(A|B) = P(A,B) / P(B)

10
Conditional and Joint Probabilities
  • P(A|B) = P(A,B)/P(B), so P(A,B) = P(A|B) P(B)
  • P(B|A) = P(A,B)/P(A), so P(A,B) = P(B|A) P(A)
  • Chain rule: P(A1, A2, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1,...,An-1)

11
Bayes' Rule
  • Since P(A,B) = P(B,A) (i.e., P(A ∩ B) = P(B ∩ A)), and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
  • P(A|B) = P(A,B)/P(B) = P(B|A) P(A) / P(B)
  • P(B|A) = P(A,B)/P(A) = P(A|B) P(B) / P(A)

12
Example
  • S = have a stiff neck, M = have meningitis
  • P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
  • I have a stiff neck; should I worry? (a worked answer follows below)
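Not spelled out on the original slide: plugging the numbers into Bayes' rule gives

```latex
P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)}
            = \frac{0.5 \times 1/50{,}000}{1/20}
            = \frac{0.00001}{0.05}
            = 0.0002
```

so the posterior probability of meningitis is about 1 in 5,000: ten times the base rate of 1 in 50,000, but still small.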

13
Independence
Two events A and B are independent of each other if P(A) = P(A|B).
Example: two coin tosses; the weather today and the weather on March 4th, 1789.
If A and B are independent, then we can compute P(A,B) from P(A) and P(B) as
P(A,B) = P(A|B) P(B) = P(A) P(B).
Two events A and B are conditionally independent of each other given C if
P(A|C) = P(A|B,C).
14
A Golden Rule (of Statistical NLP)
  • If we are interested in which event B is most likely given A, we can use Bayes' rule and take the max over all B: argmax_B P(B|A) = argmax_B P(A|B) P(B) / P(A).
  • P(A) is a normalizing constant.
  • Since the denominator is the same for every B, it can be ignored when taking the argmax (a small sketch follows below).
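Not from the slides: a minimal Python sketch of the argmax form of Bayes' rule; the events B1-B3 and all numbers are made up for illustration.

```python
# Choose the B that maximizes P(A|B) * P(B); the denominator P(A) is the same
# for every B, so it can be dropped from the comparison.
likelihood = {'B1': 0.10, 'B2': 0.40, 'B3': 0.05}   # hypothetical P(A | B)
prior      = {'B1': 0.60, 'B2': 0.10, 'B3': 0.30}   # hypothetical P(B)

best_B = max(likelihood, key=lambda b: likelihood[b] * prior[b])
print(best_B)   # 'B1': 0.10*0.60 = 0.06 beats 0.40*0.10 = 0.04 and 0.05*0.30 = 0.015
```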

15
Random Variables (RV)
  • Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
  • An RV is a function X: Ω → Q
  • in general Q ⊆ Rⁿ, typically Q ⊆ R
  • it is easier to handle real numbers than real-world events
  • An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}.
  • We can define a probability mass function (pmf) for RV X that gives the probability it takes different values:
  • p_X(x) = P(X = x) = P(A_x), where A_x = { ω ∈ Ω : X(ω) = x }
  • often written just p(x) if it is clear from context what X is

16
Example
  • Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, 3, ..., 11, 12} with the pmf as follows (a sketch that derives it by enumeration is given below):
  • P(X=2) = 1/36,
  • P(X=3) = 2/36,
  • P(X=4) = 3/36,
  • P(X=5) = 4/36,
  • P(X=6) = 5/36,
  • P(X=7) = 6/36,
  • P(X=8) = 5/36,
  • P(X=9) = 4/36,
  • P(X=10) = 3/36,
  • P(X=11) = 2/36,
  • P(X=12) = 1/36
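Not on the slide: a minimal Python sketch that derives this pmf by enumerating the 36 equally likely outcomes.

```python
from itertools import product
from collections import Counter

# pmf of X = sum of the faces of two fair dice
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in sorted(counts.items())}
print(pmf)   # {2: 1/36, 3: 2/36, ..., 7: 6/36, ..., 12: 1/36} as decimals
```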

17
Expectation and Variance
  • The expectation is the mean or average of an RV, defined as shown below.
  • The variance of an RV is a measure of how much the values of the RV tend to vary over trials.
  • The standard deviation (σ) is the square root of the variance.
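The standard definitions for a discrete RV with pmf p_X (reconstructed here, since the slide's formulas did not survive extraction):

```latex
E(X) = \sum_{x} x\, p_X(x), \qquad
\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2, \qquad
\sigma = \sqrt{\mathrm{Var}(X)}
```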

18
Examples
  • What is the expectation of the sum of the numbers on two dice?
  • Term by term:
  • 2 · P(X=2) = 2 · 1/36 = 1/18
  • 3 · P(X=3) = 3 · 2/36 = 3/18
  • 4 · P(X=4) = 4 · 3/36 = 6/18
  • 5 · P(X=5) = 5 · 4/36 = 10/18
  • 6 · P(X=6) = 6 · 5/36 = 15/18
  • 7 · P(X=7) = 7 · 6/36 = 21/18
  • 8 · P(X=8) = 8 · 5/36 = 20/18
  • 9 · P(X=9) = 9 · 4/36 = 18/18
  • 10 · P(X=10) = 10 · 3/36 = 15/18
  • 11 · P(X=11) = 11 · 2/36 = 11/18
  • 12 · P(X=12) = 12 · 1/36 = 6/18
  • E(SUM) = 126/18 = 7
  • Or more simply:
  • E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  • E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 21/6
  • Hence, E(SUM) = 21/6 + 21/6 = 7 (a short computational check follows below)
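Not on the slide: a small Python check of the expectation (and, for comparison, the variance) computed from the pmf above.

```python
from itertools import product
from collections import Counter

# E(SUM) and Var(SUM) for the sum of two fair dice
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())                 # 7.0
var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # 35/6, about 5.83
print(mean, var)
```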

19
Examples
20
Joint and Conditional Distributions for RVs
21
Estimating Probability Functions
22
Parametric Methods
  • Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
  • By assuming an explicit probabilistic model of the process by which the data was generated, determining a particular probability distribution within the family requires only the specification of a few parameters, and therefore less training data (i.e., only a small number of parameters need to be estimated).

23
Non-parametric Methods
  • No assumption is made about the underlying distribution of the data.
  • For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
  • Given less prior information, more training data is needed.

24
Estimating Probability
  • Example: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Count cases with exactly two tails: A = {HTT, THT, TTH}
  • Run an experiment 1000 times (i.e., 3000 tosses)
  • Counted 386 cases with two tails (HTT, THT, or TTH)
  • Estimate of p(A) = 386 / 1000 = 0.386
  • Run again: 373, 399, 382, 355, 372, 406, 359
  • p(A) ≈ 0.379 (weighted average), or simply 3032 / 8000
  • Uniform-distribution assumption: p(A) = 3/8 = 0.375 (a small simulation sketch follows below)
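Not from the slides: a minimal Monte Carlo sketch of this distribution-free estimate; repeated runs should scatter around the true value 3/8 = 0.375.

```python
import random

# Estimate P(exactly two tails in three tosses of a fair coin) by counting
def estimate(runs=1000):
    hits = sum(1 for _ in range(runs)
               if [random.choice('HT') for _ in range(3)].count('T') == 2)
    return hits / runs

print([round(estimate(), 3) for _ in range(8)])   # e.g. values near 0.375
```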

25
Standard Distributions
  • In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
  • Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

26
Standard Distributions: Binomial
27
Binomial Distribution
  • Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
  • Use it when counting whether something has a certain property or not (assuming independence).
  • Actually quite common in Statistical NLP: e.g., looking through a corpus to estimate the percentage of sentences that have a certain word in them, or how often a verb is used as transitive versus intransitive.
  • Expectation is np, variance is np(1 − p). (A small sketch of the binomial pmf follows below.)
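Not on the slide: a minimal sketch of the binomial pmf and its moments; the parameters n = 10, p = 0.5 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
print(binomial_pmf(5, n, p))       # about 0.246
print(n * p, n * p * (1 - p))      # expectation 5.0, variance 2.5
```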

28
Standard Distributions: Normal
29
Frequentist Statistics
30
Bayesian Statistics I: Bayesian Updating
  • Updating
  • Assume that the data arrive sequentially and are independent.
  • Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
  • The MAP probability becomes the new prior, and the process repeats with each new datum. (A small sketch of such updating follows below.)
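Not part of the slides: a minimal sketch of sequential Bayesian updating over two hypothetical coin models; the models, priors, and data stream are invented for illustration.

```python
# Two hypothetical models of a coin: "fair" (P(H) = 0.5) and "biased" (P(H) = 0.8).
# After each observed toss the posterior is computed with Bayes' rule and
# becomes the new prior for the next datum.
models = {'fair': 0.5, 'biased': 0.8}          # P(heads | model)
prior  = {'fair': 0.5, 'biased': 0.5}          # initial beliefs

for toss in ['H', 'H', 'T', 'H', 'H']:         # made-up data stream
    likelihood = {m: (p if toss == 'H' else 1 - p) for m, p in models.items()}
    unnorm = {m: likelihood[m] * prior[m] for m in models}
    z = sum(unnorm.values())                   # normalizing constant P(datum)
    prior = {m: v / z for m, v in unnorm.items()}   # posterior becomes new prior
    print(toss, prior)

# The MAP model after all the data is the one with the largest posterior:
print(max(prior, key=prior.get))
```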

31
Bayesian Statistics: MAP
32
Bayesian Statistics II: Bayesian Decision Theory
  • Bayesian statistics can be used to evaluate which model or family of models better explains some data.
  • We define two different models of the event and calculate the likelihood ratio between these two models.

33
Bayesian Decision Theory
34
Essential Information Theory
  • Developed by Shannon in the 1940s.
  • The goal is to maximize the amount of information that can be transmitted over an imperfect communication channel.
  • Shannon wished to determine the theoretical maxima for data compression (entropy H) and transmission rate (channel capacity C).
  • If a message is transmitted at a rate slower than C, then the probability of transmission errors can be made as small as desired.

35
Entropy
36
Entropy (cont.)
37
Using the Formula: Examples
38
The Limits
39
Coding Interpretation of Entropy
  • The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
  • Compression algorithms:
  • do well on data with repeating (easily predictable, low-entropy) patterns
  • their output has high entropy ⇒ compressing compressed data does nothing
  • (A small sketch of the entropy computation follows below.)
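Not on the slide: a minimal Python sketch of the standard entropy formula H(p) = −Σ_x p(x) log2 p(x), in bits.

```python
from math import log2

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(px * log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit  (fair coin)
print(entropy([0.25] * 4))     # 2.0 bits (4 equally likely outcomes)
print(entropy([0.9, 0.1]))     # about 0.469 bits (predictable, low entropy)
```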

40
Coding Example
41
Properties of Entropy I
42
Joint Entropy
43
Conditional Entropy
44
Properties of Entropy II
45
Chain Rule for Entropy
46
Entropy Rate
  • Because the amount of information contained in a message depends on its length, we may want to compare messages using the entropy rate (the entropy per unit).
  • The entropy rate of a language is the limit of the entropy rate of a sample of the language as the sample gets longer and longer. (The standard formulas are given below.)
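Not on the slide: the standard formulas, for the per-symbol entropy rate of a message of length n and for the entropy rate of a language L.

```latex
H_{\text{rate}} = \frac{1}{n}\, H(X_1, X_2, \ldots, X_n),
\qquad
H_{\text{rate}}(L) = \lim_{n \to \infty} \frac{1}{n}\, H(X_1, X_2, \ldots, X_n)
```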

47
Mutual Information
48
Relationship between I and H
49
Mutual Information (cont)
50
Mutual Information (cont)
51
Mutual Information and Entropy
52
The Noisy Channel Model
  • We want to optimize communication across a channel in terms of throughput and accuracy: the communication of messages in the presence of noise in the channel.
  • There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).

53
The Noisy Channel Model
  • Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors.

54
Language and the Noisy Channel Model
  • In language we can't control the encoding phase; we can only decode the output to give the most likely input.
  • Determine the most likely input given the output!

55
The Noisy Channel Model
56
Relative Entropy: Kullback-Leibler Divergence
57
Entropy and Language
  • Entropy is a measure of uncertainty: the more we know about something, the lower the entropy.
  • If one language model captures more of the structure of the language than another, its entropy should be lower.
  • Entropy can be thought of as a measure of how surprised we will be to see the next word, given the previous words we have already seen.

58
Entropy and Language
  • [The Chinese text of this slide did not survive extraction. The surviving numbers suggest a comparison of per-symbol entropy values for several languages (3.98, 4.00, 4.01, 4.03, 4.10, 4.12, and 4.35 bits) with larger values (9.71, 10.0, and 11.46 bits), apparently for Chinese characters.]
  • [A second garbled passage lists related percentages (80, 67, 73, 70, 73, 55, 63), apparently redundancy figures.]

59
An example: Meng's Profile
  • http://219.223.235.139/weblog/profile.php?u=mengxj
  • Aoccdrnig to rscheearch at an Elingsh
    uinervtisy, it deosn't mttaer in waht oredr the
    ltteers in a wrod are, the olny iprmoetnt tihng
    is taht the frist and lsat ltteer are in the
    rghit pclae. The rset can be a toatl mses and you
    can sitll raed it wouthit a porbelm. Tihs is
    bcuseae we do not raed ervey lteter by it slef
    but the wrod as a wlohe and the biran fguiers it
    out aynawy. so please excuse me for every typo in
    the blog, btw fixes and patches are welcome.

60
Perplexity
  • A measure related to the notion of cross entropy and used in the speech-recognition community is called perplexity.
  • A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equi-probable choices at each step. (A small sketch follows below.)
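Not on the slide: a minimal sketch, assuming the common definition perplexity = 2^(cross entropy), where the cross entropy is the average negative log2 probability the model assigns to each word.

```python
from math import log2

def perplexity(word_probs):
    """2 ** average negative log2 probability per word."""
    cross_entropy = -sum(log2(p) for p in word_probs) / len(word_probs)
    return 2 ** cross_entropy

# If the model assigns probability 1/8 to every word, perplexity is 8:
# like guessing among 8 equi-probable choices at each step.
print(perplexity([1/8] * 10))   # 8.0
```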

61
World Super Star
  • A.M. Turing Award
  • ACM's most prestigious technical award is
    accompanied by a prize of $100,000. It is given
    to an individual selected for contributions of a
    technical nature made to the computing community.
    The contributions should be of lasting and major
    technical importance to the computer field.
    Financial support of the Turing Award is provided
    by the Intel Corporation.

62
Alan Turing (1912-1954)
63
2004 Winners: Vinton G. Cerf and Robert E. Kahn
  • Citation
  • For pioneering work on internetworking, including
    the design and implementation of the Internet's
    basic communications protocols, TCP/IP, and for
    inspired leadership in networking

64
Vinton Cerf
  • [The Chinese text of this slide did not survive extraction. From the surviving fragments, it introduces Cerf as chairman of ICANN (Internet Corporation for Assigned Names and Numbers), founded in October 1998, which coordinates the IP address space, protocol parameters, the Domain Name System, and the root server system; it notes that Cerf is known as a "father of the Internet" for co-designing TCP/IP, and that in December 1997 Cerf and Robert E. Kahn were jointly honored for their work on the Internet.]

65
Vinton G. Cerf
  • Vinton Gray Cerf grew up in San Francisco, the son of a Navy officer who fought in World War II. He would receive a B.S. in mathematics from Stanford, and graduate degrees from UCLA. However, it was his research work at Stanford that would begin his lifelong involvement in the Internet.

66
Robert E. Kahn
  • [The Chinese text of this slide did not survive extraction. From the surviving fragments, it describes Robert E. Kahn as a co-designer of the TCP/IP protocols and a contributor to the Arpanet; a member of the National Academy of Engineering, an IEEE fellow, a fellow of the American Association for Artificial Intelligence, and an ACM fellow; founder (in 1986) and president of CNRI (Corporation for National Research Initiatives), which supported the IETF; born in 1938; involved in the Interface Message Processor (IMP) work in 1969 and the Network Control Protocol (NCP) in 1970, and later in the National Information Infrastructure (NII) initiative; and jointly honored with Vinton Cerf in 1997.]

67
  • Thanks!