1
Some basic concepts of Information Theory and Entropy
  • Information theory, IT
  • Entropy
  • Mutual Information
  • Use in NLP

2
Entropy
  • Related to coding theory: a more efficient code
    uses fewer bits for the more frequent messages at
    the cost of more bits for the less frequent ones

3
  • EXAMPLE: you have to send messages about the two
    occupants of a house every five minutes
  • Equal probability: the four situations are equally
    likely, so a fixed 2-bit code is enough
  • 0 no occupants
  • 1 first occupant
  • 2 second occupant
  • 3 both occupants
  • Different probabilities: a shorter code for the
    frequent situations pays off (a sketch follows
    this table)

    Situation         Probability   Code
    no occupants      .5            0
    first occupant    .125          110
    second occupant   .125          111
    both occupants    .25           10
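
A minimal Python sketch (not part of the slides) comparing the average cost per message of the two codes above:

    # Expected number of bits per message for the occupancy example.
    probs = {"no occupants": 0.5, "first occupant": 0.125,
             "second occupant": 0.125, "both occupants": 0.25}
    codes = {"no occupants": "0", "first occupant": "110",
             "second occupant": "111", "both occupants": "10"}

    fixed_cost = 2  # equal-probability case: 2 bits for each of the 4 messages
    variable_cost = sum(p * len(codes[m]) for m, p in probs.items())
    print(fixed_cost, variable_cost)  # 2 vs 1.75 bits on average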

4
  • Let X be a random variable taking values x1, x2,
    ..., xn from a domain according to a probability
    distribution p
  • We can define the expected value of X, E(X), as
    the sum of the possible values weighted by their
    probabilities
  • E(X) = p(x1)X(x1) + p(x2)X(x2) + ... + p(xn)X(xn)
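
As a sketch (the values and probabilities are made up for illustration):

    # E(X) = sum over i of p(x_i) * X(x_i)
    values = [1, 2, 3, 4]
    probs  = [0.5, 0.25, 0.125, 0.125]
    expected = sum(p * x for p, x in zip(probs, values))
    print(expected)  # 0.5*1 + 0.25*2 + 0.125*3 + 0.125*4 = 1.875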

5
Entropy
  • A message can be thought of as a random variable W
    that can take one of several values V(W) according
    to a probability distribution P
  • Is there a lower bound on the number of bits
    needed to encode a message? Yes, the entropy
  • It is possible to get close to this minimum (lower
    bound)
  • It is also a measure of our uncertainty about what
    the message says (many bits - uncertain, few bits -
    certain)

6
  • Given an event, we want to associate with it an
    information content (I)
  • From Shannon in the 1940s
  • Two constraints
  • Significance
  • The less probable an event is, the more
    information it contains
  • P(x1) > P(x2) implies I(x2) > I(x1)
  • Additivity
  • If two events are independent
  • I(x1 x2) = I(x1) + I(x2)

7
  • I(m) = 1/p(m) does not satisfy the second
    requirement (additivity)
  • I(x) = -log p(x) satisfies both
  • So we define I(X) = -log p(X)
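
A minimal sketch of I(x) = -log2 p(x) checking both constraints (the probabilities are illustrative):

    import math

    def information(p):
        """Information content I(x) = -log2 p(x), in bits."""
        return -math.log2(p)

    # Significance: less probable events carry more information.
    print(information(0.5), information(0.125))   # 1.0 bits vs 3.0 bits

    # Additivity for independent events: I(x1 x2) = I(x1) + I(x2).
    p1, p2 = 0.5, 0.125
    assert math.isclose(information(p1 * p2), information(p1) + information(p2))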

8
  • Let X be a random variable, described by p(X),
    with information content I
  • Entropy is the expected value of I: H(X) = E(I)
  • Entropy measures the information content of a
    random variable. We can consider it as the average
    length of the message needed to transmit a value
    of this variable using an optimal coding
  • Entropy measures the degree of disorder
    (uncertainty) of the random variable
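
A minimal Python sketch of this definition (the example distributions are illustrative):

    import math

    def entropy(probs):
        """H(X) = E[I(X)] = -sum p(x) log2 p(x), skipping zero-probability values."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits: uniform over 4 values
    print(entropy([0.5, 0.125, 0.125, 0.25]))  # 1.75 bits: the occupancy example above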

9
  • Uniform distribution of a variable X
  • Each possible value xi ∈ X, with |X| = M, has the
    same probability pi = 1/M
  • If the value xi is coded in binary we need
    log2 M bits of information
  • Non-uniform distribution (by analogy)
  • Each value xi has a different probability pi
  • Let us assume the pi to be independent
  • If Mi = 1/pi we will need log2 Mi = log2(1/pi) =
    -log2 pi bits of information

10
Let X = {a, b, c, d} with pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8
entropy(X) = E(I) = -1/2 log2(1/2) - 1/4 log2(1/4)
- 1/8 log2(1/8) - 1/8 log2(1/8) = 7/4 = 1.75 bits
[Decision-tree figure: ask "X = a?"; if no, ask "X = b?"; if no,
ask "X = c?" to distinguish c from d.
Average number of questions = 1.75]
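
A small simulation sketch (the question counts mirror the tree above) showing that the average number of yes/no questions matches H(X) = 1.75 bits:

    import random

    probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    # Questions asked before each value is identified:
    # "X = a?" (1), then "X = b?" (2), then "X = c?" (3, which also settles d).
    questions = {"a": 1, "b": 2, "c": 3, "d": 3}

    values, weights = zip(*probs.items())
    sample = random.choices(values, weights=weights, k=100_000)
    print(sum(questions[v] for v in sample) / len(sample))  # close to 1.75 = H(X)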
11
Let X have a Bernoulli (two-valued) distribution: X = 0 with
probability p, X = 1 with probability (1 - p)
H(X) = -p log2(p) - (1 - p) log2(1 - p)
p = 0: 1 - p = 1, so H(X) = 0
p = 1: 1 - p = 0, so H(X) = 0
p = 1/2: 1 - p = 1/2, so H(X) = 1
[Plot: H(X) as a function of p, rising from 0 at p = 0 to 1 bit at
p = 1/2 and falling back to 0 at p = 1]
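
A minimal sketch of this binary entropy function:

    import math

    def binary_entropy(p):
        """H(X) for X = 0 with probability p and X = 1 with probability 1 - p."""
        if p in (0.0, 1.0):
            return 0.0   # a certain outcome carries no information
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    print(binary_entropy(0.0), binary_entropy(1.0), binary_entropy(0.5))  # 0.0 0.0 1.0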
12
(No Transcript)
13
  • The joint entropy of two random variables X and Y
    is the average information content needed to
    specify both variables
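
For instance, a sketch with an assumed toy joint distribution:

    import math

    # An assumed toy joint distribution p(x, y); the entries sum to 1.
    joint = {("sun", "hot"): 0.4, ("sun", "mild"): 0.2,
             ("rain", "hot"): 0.1, ("rain", "mild"): 0.3}

    # H(X, Y) = -sum over (x, y) of p(x, y) log2 p(x, y)
    h_xy = -sum(p * math.log2(p) for p in joint.values())
    print(h_xy)   # about 1.85 bits to specify the pair (x, y)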

14
  • The conditional entropy of a random variable Y
    given another random variable X describes how much
    information is needed on average to communicate Y
    when the reader already knows X
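
A sketch on the same assumed joint distribution, computing H(Y | X) = -sum p(x, y) log2 p(y | x):

    import math

    joint = {("sun", "hot"): 0.4, ("sun", "mild"): 0.2,
             ("rain", "hot"): 0.1, ("rain", "mild"): 0.3}

    # Marginal p(x), then condition on it.
    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p
    h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
    print(h_y_given_x)   # about 0.88 bits, less than H(Y) = 1 bit: knowing X helps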

15
Chain rule for probabilities
  • P(A,B) = P(A|B)P(B) = P(B|A)P(A)
  • P(A,B,C,D) = P(A)P(B|A)P(C|A,B)P(D|A,B,C)
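
A tiny numerical check on an assumed joint distribution over two binary events:

    # P(A, B) = P(A | B) P(B) = P(B | A) P(A)
    p = {(True, True): 0.15, (True, False): 0.15,
         (False, True): 0.10, (False, False): 0.60}
    p_a = p[(True, True)] + p[(True, False)]   # P(A) = 0.30
    p_b = p[(True, True)] + p[(False, True)]   # P(B) = 0.25
    p_b_given_a = p[(True, True)] / p_a        # P(B | A)
    p_a_given_b = p[(True, True)] / p_b        # P(A | B)
    assert abs(p[(True, True)] - p_a_given_b * p_b) < 1e-12
    assert abs(p[(True, True)] - p_b_given_a * p_a) < 1e-12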

16
Chain rule for entropies
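
The chain rule for entropies mirrors the one for probabilities: H(X, Y) = H(X) + H(Y | X), and in general H(X1, ..., Xn) = sum of H(Xi | X1, ..., Xi-1). A self-contained numerical check on the assumed toy distribution used above:

    import math

    joint = {("sun", "hot"): 0.4, ("sun", "mild"): 0.2,
             ("rain", "hot"): 0.1, ("rain", "mild"): 0.3}

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    px = {}
    for (x, _), p in joint.items():
        px[x] = px.get(x, 0.0) + p

    h_xy = H(joint.values())
    h_x = H(px.values())
    h_y_given_x = -sum(p * math.log2(p / px[x]) for (x, _), p in joint.items())
    assert math.isclose(h_xy, h_x + h_y_given_x)   # H(X, Y) = H(X) + H(Y | X)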
17
Mutual Information
  • I(X,Y) is the mutual information between X and Y
  • I(X,Y) measures the reduction in the uncertainty
    of X when Y is known
  • It also measures the amount of information X
    carries about Y (or Y about X)

18
  • I = 0 only when X and Y are independent:
    H(X|Y) = H(X)
  • H(X) = H(X) - H(X|X) = I(X,X)
  • Entropy is the self-information (the mutual
    information between X and X)
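
A minimal sketch (the toy distributions are assumed) computing I(X;Y) = H(X) + H(Y) - H(X,Y) and checking the two properties above:

    import math

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def marginals(joint):
        px, py = {}, {}
        for (x, y), p in joint.items():
            px[x] = px.get(x, 0.0) + p
            py[y] = py.get(y, 0.0) + p
        return px, py

    def mutual_information(joint):
        """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
        px, py = marginals(joint)
        return H(px.values()) + H(py.values()) - H(joint.values())

    # Dependent variables: I > 0.
    dependent = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    # Independent variables: p(x, y) = p(x) p(y), so I = 0.
    independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
    # X paired with itself: I(X; X) = H(X).
    copy = {(0, 0): 0.5, (1, 1): 0.5}

    print(mutual_information(dependent))               # > 0
    print(mutual_information(independent))             # 0.0
    print(mutual_information(copy), H([0.5, 0.5]))     # both 1.0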

19
(No Transcript)
20
Pointwise Mutual Information
  • The PMI of a pair of outcomes x and y of two
    discrete random variables quantifies the
    discrepancy between the probability of their
    coincidence under their joint distribution and
    the probability of their coincidence under their
    individual distributions, assuming independence
  • The mutual information of X and Y is the expected
    value of the pointwise (specific) mutual
    information over all possible outcomes
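
A hedged sketch of both ideas; the co-occurrence probabilities and the toy joint distribution are made up for illustration:

    import math

    def pmi(p_xy, p_x, p_y):
        """Pointwise mutual information of one outcome pair (x, y), in bits."""
        return math.log2(p_xy / (p_x * p_y))

    # Illustrative word co-occurrence probabilities (made-up numbers).
    p_new, p_york, p_new_york = 0.02, 0.005, 0.003
    print(pmi(p_new_york, p_new, p_york))   # high PMI: co-occurs far more than chance

    # MI is the expectation of PMI over the joint distribution.
    joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
    px = {0: 0.5, 1: 0.5}
    py = {0: 0.5, 1: 0.5}
    mi = sum(p * pmi(p, px[x], py[y]) for (x, y), p in joint.items())
    print(mi)   # matches I(X; Y) computed from entropies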

21
  • H = entropy of a language L
  • We do not know the true distribution p(X)
  • Let q(X) be a language model (LM)
  • How good is q(X) as an estimate of p(X)?

22
Cross Entropy
Measures the surprise of a model q when it
describes events that actually follow a distribution p
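
A minimal sketch of cross entropy H(p, q) = -sum p(x) log2 q(x), with assumed toy distributions; it is lowest, and equal to H(p), when q = p:

    import math

    def cross_entropy(p, q):
        """H(p, q): p is the true distribution, q is the model."""
        return -sum(p[x] * math.log2(q[x]) for x in p if p[x] > 0)

    p = {"a": 0.5, "b": 0.25, "c": 0.25}        # true distribution
    q_good = {"a": 0.5, "b": 0.3, "c": 0.2}     # model close to p
    q_bad = {"a": 0.1, "b": 0.1, "c": 0.8}      # model far from p

    print(cross_entropy(p, p))        # 1.5 bits: equals H(p), the lower bound
    print(cross_entropy(p, q_good))   # slightly above 1.5
    print(cross_entropy(p, q_bad))    # much larger: the model is "surprised" often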
23
Relative Entropy or Kullback-Leibler
(KL) divergence
Measures the difference between two probability
distributions
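
A minimal sketch, reusing the toy distributions above; D(p || q) = sum p(x) log2(p(x) / q(x)) is zero exactly when q = p and equals the cross entropy minus H(p):

    import math

    def kl_divergence(p, q):
        """D(p || q) in bits; zero only when p = q."""
        return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

    p = {"a": 0.5, "b": 0.25, "c": 0.25}
    q = {"a": 0.5, "b": 0.3, "c": 0.2}

    print(kl_divergence(p, p))   # 0.0
    print(kl_divergence(p, q))   # > 0, equals cross_entropy(p, q) - H(p)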