Natural Language Processing (2): Basic Probability (PowerPoint transcript)
1
Natural Language Processing (2): Basic Probability
  • Dr. Xuan Wang
  • Intelligence Computing Research Center
  • Harbin Institute of Technology Shenzhen Graduate School
  • Slides from Dr. Mary P. Harper, ECE, Purdue University

2
Motivation
  • Statistical NLP aims to do statistical inference.
  • Statistical inference consists of taking some
    data (generated according to some unknown
    probability distribution) and then making some
    inferences about this distribution.
  • An example of statistical inference is the task
    of language modeling, namely predicting the next
    word given a window of previous words. To do
    this, we need a model of the language.
  • Probability theory helps us to find such a model.

3
Probability Terminology
  • Probability theory deals with predicting how
    likely it is that something will happen.
  • The process by which an observation is made is
    called an experiment or a trial (e.g., tossing a
    coin twice).
  • The collection of basic outcomes (or sample
    points) for our experiment is called the sample
    space.

4
Probability Terminology
  • An event is a subset of the sample space.
  • Probabilities are numbers between 0 and 1, where
    0 indicates impossibility and 1, certainty.
  • A probability function/distribution distributes a
    probability mass of 1 throughout the sample
    space.

5
Experiments and Sample Spaces
  • The set of possible basic outcomes of an experiment is the sample space Ω:
  • coin toss: Ω = {head, tail}
  • tossing a coin 2 times: Ω = {HH, HT, TH, TT}
  • dice roll: Ω = {1, 2, 3, 4, 5, 6}
  • missing word: |Ω| ≈ vocabulary size
  • Discrete (countable) versus continuous (uncountable)
  • Every observation/trial is a basic outcome or sample point.
  • An event A is a set of basic outcomes with A ⊆ Ω; ∅ is the impossible event.

6
Events and Probability
  • The probability of event A is denoted p(A) (also called the prior probability, i.e., the probability before we consider any additional knowledge).
  • Example experiment: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • cases with two or more tails:
  • A = {HTT, THT, TTH, TTT}
  • P(A) = |A| / |Ω| = 1/2 (assuming a uniform distribution)
  • all heads:
  • A = {HHH}
  • P(A) = |A| / |Ω| = 1/8 (a small sketch of this computation follows below)
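Not part of the original slides: a minimal Python sketch of this computation, enumerating the sample space and counting outcomes under a uniform distribution.

```python
from itertools import product

# Sample space for tossing a coin three times: Omega = {HHH, HHT, ..., TTT}
omega = [''.join(t) for t in product('HT', repeat=3)]

# Event A: outcomes with two or more tails
A = [o for o in omega if o.count('T') >= 2]

# Under a uniform distribution, P(A) = |A| / |Omega|
print(A, len(A) / len(omega))   # ['HTT', 'THT', 'TTH', 'TTT'] 0.5
```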

7
Probability Properties
  • Basic properties:
  • P(A) ∈ [0, 1]
  • P(Ω) = 1
  • For disjoint events: P(∪i Ai) = Σi P(Ai)
  • N.B.: the axiomatic definition of probability takes the above three conditions as axioms.
  • Immediate consequences:
  • P(∅) = 0, P(Ā) = 1 − P(A), A ⊆ B ⇒ P(A) ≤ P(B)

8
Joint Probability
  • Joint probability of A and B: P(A,B) = P(A ∩ B)
  • A 2-dimensional table (A × B) with a value in every cell giving the probability of that specific pair occurring.

9
Conditional Probability
  • Sometimes we have partial knowledge about the outcome of an experiment; then the conditional (or posterior) probability can be helpful.
  • If we know that event B is true, then we can determine the probability that A is true given this knowledge:
  • P(A|B) = P(A,B) / P(B)

10
Conditional and Joint Probabilities
  • P(A|B) = P(A,B)/P(B), so P(A,B) = P(A|B) P(B)
  • P(B|A) = P(A,B)/P(A), so P(A,B) = P(B|A) P(A)
  • Chain rule: P(A1, A2, ..., An) = P(A1) P(A2|A1) P(A3|A1,A2) ... P(An|A1,...,An-1)

11
Bayes' Rule
  • Since P(A,B) = P(B,A) (i.e., P(A ∩ B) = P(B ∩ A)), and P(A,B) = P(A|B) P(B) = P(B|A) P(A):
  • P(A|B) = P(A,B)/P(B) = P(B|A) P(A) / P(B)
  • P(B|A) = P(A,B)/P(A) = P(A|B) P(B) / P(A)

12
Example
  • S = have a stiff neck, M = have meningitis
  • P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
  • I have a stiff neck; should I worry? (a worked answer follows below)
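Not spelled out on the original slide: plugging the numbers into Bayes' rule gives

```latex
P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)}
            = \frac{0.5 \times 1/50{,}000}{1/20}
            = \frac{0.00001}{0.05}
            = 0.0002
```

so the posterior probability of meningitis is about 1 in 5,000: ten times the base rate of 1 in 50,000, but still small.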

13
Independence
Two events A and B are independent of each other if P(A) = P(A|B).
Example: two coin tosses; the weather today and the weather on March 4th, 1789.
If A and B are independent, then we can compute P(A,B) from P(A) and P(B) as
P(A,B) = P(A|B) P(B) = P(A) P(B).
Two events A and B are conditionally independent of each other given C if
P(A|C) = P(A|B,C).
14
A Golden Rule (of Statistical NLP)
  • If we are interested in which event B is most likely given A, we can use Bayes' rule and take the max over all B: argmax_B P(B|A) = argmax_B P(A|B) P(B) / P(A).
  • P(A) is a normalizing constant.
  • Since the denominator is the same for every B, it can be ignored when taking the argmax (a small sketch follows below).
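Not from the slides: a minimal Python sketch of the argmax form of Bayes' rule; the events B1-B3 and all numbers are made up for illustration.

```python
# Choose the B that maximizes P(A|B) * P(B); the denominator P(A) is the same
# for every B, so it can be dropped from the comparison.
likelihood = {'B1': 0.10, 'B2': 0.40, 'B3': 0.05}   # hypothetical P(A | B)
prior      = {'B1': 0.60, 'B2': 0.10, 'B3': 0.30}   # hypothetical P(B)

best_B = max(likelihood, key=lambda b: likelihood[b] * prior[b])
print(best_B)   # 'B1': 0.10*0.60 = 0.06 beats 0.40*0.10 = 0.04 and 0.05*0.30 = 0.015
```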

15
Random Variables (RV)
  • Random variables (RVs) allow us to talk about the probabilities of numerical values that are related to the event space (with a specific numeric range).
  • An RV is a function X: Ω → Q
  • in general Q ⊆ Rⁿ, typically Q ⊆ R
  • it is easier to handle real numbers than real-world events
  • An RV is discrete if Q is a countable subset of R; it is an indicator RV (or Bernoulli trial) if Q = {0, 1}.
  • We can define a probability mass function (pmf) for RV X that gives the probability it takes different values:
  • p_X(x) = P(X = x) = P(A_x), where A_x = { ω ∈ Ω : X(ω) = x }
  • often written just p(x) if it is clear from context what X is

16
Example
  • Suppose we define a discrete RV X that is the sum of the faces of two dice; then Q = {2, 3, ..., 11, 12} with the pmf as follows (a sketch that derives it by enumeration is given below):
  • P(X=2) = 1/36,
  • P(X=3) = 2/36,
  • P(X=4) = 3/36,
  • P(X=5) = 4/36,
  • P(X=6) = 5/36,
  • P(X=7) = 6/36,
  • P(X=8) = 5/36,
  • P(X=9) = 4/36,
  • P(X=10) = 3/36,
  • P(X=11) = 2/36,
  • P(X=12) = 1/36
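Not on the slide: a minimal Python sketch that derives this pmf by enumerating the 36 equally likely outcomes.

```python
from itertools import product
from collections import Counter

# pmf of X = sum of the faces of two fair dice
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in sorted(counts.items())}
print(pmf)   # {2: 1/36, 3: 2/36, ..., 7: 6/36, ..., 12: 1/36} as decimals
```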

17
Expectation and Variance
  • The expectation is the mean or average of an RV, defined as shown below.
  • The variance of an RV is a measure of how much the values of the RV tend to vary over trials.
  • The standard deviation (σ) is the square root of the variance.
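The standard definitions for a discrete RV with pmf p_X (reconstructed here, since the slide's formulas did not survive extraction):

```latex
E(X) = \sum_{x} x\, p_X(x), \qquad
\mathrm{Var}(X) = E\big((X - E(X))^2\big) = E(X^2) - E(X)^2, \qquad
\sigma = \sqrt{\mathrm{Var}(X)}
```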

18
Examples
  • What is the expectation of the sum of the numbers on two dice?
  • Term by term:
  • 2 · P(X=2) = 2 · 1/36 = 1/18
  • 3 · P(X=3) = 3 · 2/36 = 3/18
  • 4 · P(X=4) = 4 · 3/36 = 6/18
  • 5 · P(X=5) = 5 · 4/36 = 10/18
  • 6 · P(X=6) = 6 · 5/36 = 15/18
  • 7 · P(X=7) = 7 · 6/36 = 21/18
  • 8 · P(X=8) = 8 · 5/36 = 20/18
  • 9 · P(X=9) = 9 · 4/36 = 18/18
  • 10 · P(X=10) = 10 · 3/36 = 15/18
  • 11 · P(X=11) = 11 · 2/36 = 11/18
  • 12 · P(X=12) = 12 · 1/36 = 6/18
  • E(SUM) = 126/18 = 7
  • Or more simply:
  • E(SUM) = E(D1 + D2) = E(D1) + E(D2)
  • E(D1) = E(D2) = 1 · 1/6 + 2 · 1/6 + ... + 6 · 1/6 = (1/6)(1 + 2 + 3 + 4 + 5 + 6) = 21/6
  • Hence, E(SUM) = 21/6 + 21/6 = 7 (a short computational check follows below)
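Not on the slide: a small Python check of the expectation (and, for comparison, the variance) computed from the pmf above.

```python
from itertools import product
from collections import Counter

# E(SUM) and Var(SUM) for the sum of two fair dice
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
pmf = {x: c / 36 for x, c in counts.items()}

mean = sum(x * p for x, p in pmf.items())                 # 7.0
var = sum((x - mean) ** 2 * p for x, p in pmf.items())    # 35/6, about 5.83
print(mean, var)
```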

19
Examples
20
Joint and Conditional Distributions for RVs
21
Estimating Probability Functions
22
Parametric Methods
  • Assume that the language phenomenon is acceptably modeled by one of the well-known standard distributions (e.g., binomial, normal).
  • By assuming an explicit probabilistic model of the process by which the data was generated, determining a particular probability distribution within the family requires only the specification of a few parameters, and therefore less training data (i.e., only a small number of parameters need to be estimated).

23
Non-parametric Methods
  • No assumption is made about the underlying distribution of the data.
  • For example, simply estimating P empirically by counting a large number of random events is a distribution-free method.
  • Given less prior information, more training data is needed.

24
Estimating Probability
  • Example: toss a coin three times
  • Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Count cases with exactly two tails: A = {HTT, THT, TTH}
  • Run an experiment 1000 times (i.e., 3000 tosses)
  • Counted 386 cases with two tails (HTT, THT, or TTH)
  • Estimate of p(A) = 386 / 1000 = 0.386
  • Run again: 373, 399, 382, 355, 372, 406, 359
  • p(A) ≈ 0.379 (weighted average), or simply 3032 / 8000
  • Uniform-distribution assumption: p(A) = 3/8 = 0.375 (a small simulation sketch follows below)
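Not from the slides: a minimal Monte Carlo sketch of this distribution-free estimate; repeated runs should scatter around the true value 3/8 = 0.375.

```python
import random

# Estimate P(exactly two tails in three tosses of a fair coin) by counting
def estimate(runs=1000):
    hits = sum(1 for _ in range(runs)
               if [random.choice('HT') for _ in range(3)].count('T') == 2)
    return hits / runs

print([round(estimate(), 3) for _ in range(8)])   # e.g. values near 0.375
```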

25
Standard Distributions
  • In practice, one commonly finds the same basic form of a probability mass function, but with different constants employed.
  • Families of pmfs are called distributions, and the constants that define the different possible pmfs in one family are called parameters.
  • Discrete distributions: the binomial distribution, the multinomial distribution, the Poisson distribution.
  • Continuous distributions: the normal distribution, the standard normal distribution.

26
Standard Distributions: Binomial
27
Binomial Distribution
  • Works well for tossing a coin. However, for linguistic corpora one never has complete independence from one sentence to the next, so it is only an approximation.
  • Use it when counting whether something has a certain property or not (assuming independence).
  • Actually quite common in Statistical NLP: e.g., looking through a corpus to estimate the percentage of sentences that have a certain word in them, or how often a verb is used as transitive versus intransitive.
  • Expectation is np, variance is np(1 − p). (A small sketch of the binomial pmf follows below.)
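Not on the slide: a minimal sketch of the binomial pmf and its moments; the parameters n = 10, p = 0.5 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p): k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
print(binomial_pmf(5, n, p))       # about 0.246
print(n * p, n * p * (1 - p))      # expectation 5.0, variance 2.5
```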

28
Standard Distributions: Normal
29
Frequentist Statistics
30
Bayesian Statistics I: Bayesian Updating
  • Updating
  • Assume that the data arrive sequentially and are independent.
  • Given an a priori probability distribution, we can update our beliefs when a new datum comes in by calculating the Maximum A Posteriori (MAP) distribution.
  • The MAP probability becomes the new prior, and the process repeats with each new datum. (A small sketch of such updating follows below.)
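Not part of the slides: a minimal sketch of sequential Bayesian updating over two hypothetical coin models; the models, priors, and data stream are invented for illustration.

```python
# Two hypothetical models of a coin: "fair" (P(H) = 0.5) and "biased" (P(H) = 0.8).
# After each observed toss the posterior is computed with Bayes' rule and
# becomes the new prior for the next datum.
models = {'fair': 0.5, 'biased': 0.8}          # P(heads | model)
prior  = {'fair': 0.5, 'biased': 0.5}          # initial beliefs

for toss in ['H', 'H', 'T', 'H', 'H']:         # made-up data stream
    likelihood = {m: (p if toss == 'H' else 1 - p) for m, p in models.items()}
    unnorm = {m: likelihood[m] * prior[m] for m in models}
    z = sum(unnorm.values())                   # normalizing constant P(datum)
    prior = {m: v / z for m, v in unnorm.items()}   # posterior becomes new prior
    print(toss, prior)

# The MAP model after all the data is the one with the largest posterior:
print(max(prior, key=prior.get))
```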

31
Bayesian Statistics: MAP
32
Bayesian Statistics II: Bayesian Decision Theory
  • Bayesian statistics can be used to evaluate which model or family of models better explains some data.
  • We define two different models of the event and calculate the likelihood ratio between these two models.

33
Bayesian Decision Theory
34
Essential Information Theory
  • Developed by Shannon in the 1940s.
  • The goal is to maximize the amount of information that can be transmitted over an imperfect communication channel.
  • Shannon wished to determine the theoretical maxima for data compression (entropy H) and transmission rate (channel capacity C).
  • If a message is transmitted at a rate slower than C, then the probability of transmission errors can be made as small as desired.

35
Entropy
36
Entropy (cont.)
37
Using the Formula: Examples
38
The Limits
39
Coding Interpretation of Entropy
  • The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
  • Compression algorithms:
  • do well on data with repeating (easily predictable, low-entropy) patterns
  • their output has high entropy ⇒ compressing compressed data does nothing
  • (A small sketch of the entropy computation follows below.)
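Not on the slide: a minimal Python sketch of the standard entropy formula H(p) = −Σ_x p(x) log2 p(x), in bits.

```python
from math import log2

def entropy(p):
    """H(p) = -sum p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(px * log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))     # 1.0 bit  (fair coin)
print(entropy([0.25] * 4))     # 2.0 bits (4 equally likely outcomes)
print(entropy([0.9, 0.1]))     # about 0.469 bits (predictable, low entropy)
```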

40
Coding Example
41
Properties of Entropy I
42
Joint Entropy
43
Conditional Entropy
44
Properties of Entropy II
45
Chain Rule for Entropy
46
Entropy Rate
  • Because the amount of information contained in a message depends on its length, we may want to compare messages using the entropy rate (the entropy per unit).
  • The entropy rate of a language is the limit of the entropy rate of a sample of the language as the sample gets longer and longer. (The standard formulas are given below.)
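Not on the slide: the standard formulas, for the per-symbol entropy rate of a message of length n and for the entropy rate of a language L.

```latex
H_{\text{rate}} = \frac{1}{n}\, H(X_1, X_2, \ldots, X_n),
\qquad
H_{\text{rate}}(L) = \lim_{n \to \infty} \frac{1}{n}\, H(X_1, X_2, \ldots, X_n)
```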

47
Mutual Information
48
Relationship between I and H
49
Mutual Information (cont)
50
Mutual Information (cont)
51
Mutual Information and Entropy
52
The Noisy Channel Model
  • We want to optimize communication across a channel in terms of throughput and accuracy: the communication of messages in the presence of noise in the channel.
  • There is a duality between compression (achieved by removing all redundancy) and transmission accuracy (achieved by adding controlled redundancy so that the input can be recovered in the presence of noise).

53
The Noisy Channel Model
  • Goal: encode the message in such a way that it occupies minimal space while still containing enough redundancy to be able to detect and correct errors.

54
Language and the Noisy Channel Model
  • In language we can't control the encoding phase; we can only decode the output to give the most likely input.
  • Determine the most likely input given the output!

55
The Noisy Channel Model
56
Relative Entropy: Kullback-Leibler Divergence
57
Entropy and Language
  • Entropy is a measure of uncertainty: the more we know about something, the lower the entropy.
  • If one language model captures more of the structure of the language than another, its entropy should be lower.
  • Entropy can be thought of as a measure of how surprised we will be to see the next word, given the previous words we have already seen.

58
Entropy and Language
  • [The Chinese text of this slide did not survive extraction. The surviving numbers suggest a comparison of per-symbol entropy values for several languages (3.98, 4.00, 4.01, 4.03, 4.10, 4.12, and 4.35 bits) with larger values (9.71, 10.0, and 11.46 bits), apparently for Chinese characters.]
  • [A second garbled passage lists related percentages (80, 67, 73, 70, 73, 55, 63), apparently redundancy figures.]

59
An example: Meng's Profile
  • http://219.223.235.139/weblog/profile.php?u=mengxj
  • Aoccdrnig to rscheearch at an Elingsh
    uinervtisy, it deosn't mttaer in waht oredr the
    ltteers in a wrod are, the olny iprmoetnt tihng
    is taht the frist and lsat ltteer are in the
    rghit pclae. The rset can be a toatl mses and you
    can sitll raed it wouthit a porbelm. Tihs is
    bcuseae we do not raed ervey lteter by it slef
    but the wrod as a wlohe and the biran fguiers it
    out aynawy. so please excuse me for every typo in
    the blog, btw fixes and patches are welcome.

60
Perplexity
  • A measure related to the notion of cross entropy and used in the speech-recognition community is called perplexity.
  • A perplexity of k means that you are as surprised on average as you would have been if you had had to guess between k equi-probable choices at each step. (A small sketch follows below.)
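Not on the slide: a minimal sketch, assuming the common definition perplexity = 2^(cross entropy), where the cross entropy is the average negative log2 probability the model assigns to each word.

```python
from math import log2

def perplexity(word_probs):
    """2 ** average negative log2 probability per word."""
    cross_entropy = -sum(log2(p) for p in word_probs) / len(word_probs)
    return 2 ** cross_entropy

# If the model assigns probability 1/8 to every word, perplexity is 8:
# like guessing among 8 equi-probable choices at each step.
print(perplexity([1/8] * 10))   # 8.0
```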

61
World Super Star
  • A.M. Turing Award
  • ACM's most prestigious technical award is
    accompanied by a prize of $100,000. It is given
    to an individual selected for contributions of a
    technical nature made to the computing community.
    The contributions should be of lasting and major
    technical importance to the computer field.
    Financial support of the Turing Award is provided
    by the Intel Corporation.

62
Alan Turing (1912-1954)
63
2004 Winners: Vinton G. Cerf and Robert E. Kahn
  • Citation
  • For pioneering work on internetworking, including
    the design and implementation of the Internet's
    basic communications protocols, TCP/IP, and for
    inspired leadership in networking

64
Vinton Cerf
  • [The Chinese text of this slide did not survive extraction. From the surviving fragments, it introduces Cerf as chairman of ICANN (Internet Corporation for Assigned Names and Numbers), founded in October 1998, which coordinates the IP address space, protocol parameters, the Domain Name System, and the root server system; it notes that Cerf is known as a "father of the Internet" for co-designing TCP/IP, and that in December 1997 Cerf and Robert E. Kahn were jointly honored for their work on the Internet.]

65
Vinton G. Cerf
  • Vinton Gray Cerf grew up in San Francisco, the son of a Navy officer who fought in World War II. He would receive a B.S. in mathematics from Stanford, and graduate degrees from UCLA. However, it was his research work at Stanford that would begin his lifelong involvement in the Internet.

66
Robert E. Kahn
  • [The Chinese text of this slide did not survive extraction. From the surviving fragments, it describes Robert E. Kahn as a co-designer of the TCP/IP protocols and a contributor to the Arpanet; a member of the National Academy of Engineering, an IEEE fellow, a fellow of the American Association for Artificial Intelligence, and an ACM fellow; founder (in 1986) and president of CNRI (Corporation for National Research Initiatives), which supported the IETF; born in 1938; involved in the Interface Message Processor (IMP) work in 1969 and the Network Control Protocol (NCP) in 1970, and later in the National Information Infrastructure (NII) initiative; and jointly honored with Vinton Cerf in 1997.]

67
  • Thanks!