Transcript and Presenter's Notes

Title: Probability Theory: Bayes Theorem and Naïve Bayes Classification

1
Probability Theory: Bayes Theorem and Naïve Bayes Classification
2
Definition of Probability
  • Probability theory encodes our knowledge or
    belief about the collective likelihood of the
    outcome of an event.
  • We use probability theory to try to predict which
    outcome will occur for a given event.

3
Sample Spaces
  • We think of the sample space as the set of all possible outcomes.
  • For tossing a coin, the possible outcomes are Heads or Tails.
  • For competing in the Olympics, the set of outcomes for a given contest is {gold, silver, bronze, no_award}.
  • For computing part-of-speech, the set of outcomes is {JJ, DT, NN, RB, etc.}.
  • We use probability theory to try to predict which
    outcome will occur for a given event.

4
Axioms of Probability Theory
  • Probabilities are real numbers between 0 and 1 representing the a priori likelihood that a proposition is true.
  • Necessarily true propositions have probability 1; necessarily false ones have probability 0.
  • P(true) = 1, P(false) = 0

5
Probability Axioms
  • A probability law must satisfy certain properties:
  • Nonnegativity
  • P(A) ≥ 0, for every event A
  • Additivity
  • If A and B are two disjoint events, then the probability of their union satisfies P(A ∪ B) = P(A) + P(B)
  • Normalization
  • The probability of the entire sample space S is equal to 1, i.e., P(S) = 1.
  • P(Cold) = 0.1 (the probability that it will be cold)
  • P(¬Cold) = 0.9 (the probability that it will not be cold)

6
An example
  • An experiment involving a single coin toss
  • There are two possible outcomes, Heads and Tails
  • The sample space S is {H, T}
  • If the coin is fair, we should assign equal probabilities to the 2 outcomes
  • Since they have to sum to 1:
  • P(H) = 0.5
  • P(T) = 0.5
  • P({H, T}) = P(H) + P(T) = 1.0

7
Another example
  • An experiment involving 3 coin tosses
  • The outcome is a 3-long string of H or T
  • S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
  • Assume each outcome is equally likely (equiprobable)
  • Uniform distribution
  • What is the probability of the event A that exactly 2 heads occur? (See the sketch below.)
  • A = {HHT, HTH, THH}
  • P(A) = P(HHT) + P(HTH) + P(THH) = 1/8 + 1/8 + 1/8 = 3/8
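A quick way to check such uniform-distribution calculations is to enumerate the sample space directly. A minimal Python sketch (the variable names are just for illustration):

from itertools import product

# Sample space: all 3-long strings of H/T.
sample_space = ["".join(seq) for seq in product("HT", repeat=3)]
assert len(sample_space) == 8

# Event A: exactly 2 heads. Under a uniform distribution, P(A) = |A| / |S|.
event_a = [o for o in sample_space if o.count("H") == 2]
print(len(event_a) / len(sample_space))  # 0.375 == 3/8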

8
Probability definitions
  • In summary:
  •            number of outcomes corresponding to event E
  • P(E) = -------------------------------------------------
  •                  total number of outcomes
  • Probability of drawing a spade from 52 well-shuffled playing cards:
  • 13/52 = 1/4 = 0.25

9
How about non-uniform probabilities? An example
  • A biased coin, twice as likely to come up tails as heads, is tossed twice
  • What is the probability that at least one head occurs?
  • Sample space: {hh, ht, th, tt} (h = heads, t = tails)
  • Sample points/probabilities for the event (each head has probability 1/3, each tail has probability 2/3):
  • ht: 1/3 x 2/3 = 2/9    hh: 1/3 x 1/3 = 1/9
  • th: 2/3 x 1/3 = 2/9    tt: 2/3 x 2/3 = 4/9
  • Answer: 1/9 + 2/9 + 2/9 = 5/9 ≈ 0.56
  • (the sum of the weights for hh, ht, and th, since we want outcomes with at least one head)
  • By contrast, the probability of this event for the unbiased coin is 0.75

10
Moving toward language
  • What's the probability of drawing a 2 from a deck of 52 cards?
  • What's the probability of a random word (from a random dictionary page) being a verb?

11
Probability and part of speech tags
  • What's the probability of a random word (from a random dictionary page) being a verb?
  • How to compute each of these?
  • All words = just count all the words in the dictionary
  • # of ways to get a verb = number of words which are verbs!
  • If a dictionary has 50,000 entries, and 10,000 are verbs, P(V) is 10000/50000 = 1/5 = .20

12
Independence
  • Most things in life depend on one another; that is, they are dependent
  • If I drive to SF, this may affect your attempt to go there.
  • It may also affect the environment, the economy of SF, etc.
  • If 2 events are independent, this means that the occurrence of one event makes it neither more nor less probable that the other occurs.
  • If I flip my coin, it won't have an effect on the outcome of your coin flip.
  • P(A,B) = P(A) P(B) if and only if A and B are independent.
  • P(heads, tails) = P(heads) P(tails) = .5 x .5 = .25
  • Note: P(A|B) = P(A) iff A, B independent
  • P(B|A) = P(B) iff A, B independent

13
Conditional Probability
  • A way to reason about the outcome of an
    experiment based on partial information
  • How likely is it that a person has a disease
    given that a medical test was negative?
  • In a word-guessing game the first letter of the word is a "t". What is the likelihood that the second letter is an "h"?
  • A spot shows up on a radar screen. How likely is
    it that it corresponds to an aircraft?

14
Conditional Probability
  • Conditional probability specifies the probability
    given that the values of some other random
    variables are known.
  • P(Sneeze | Cold) = 0.8
  • P(Cold | Sneeze) = 0.6
  • The probability of a sneeze given a cold is 80%.
  • The probability of a cold given a sneeze is 60%.

15
More precisely
  • Given an experiment, a corresponding sample space
    S, and the probability law
  • Suppose we know that the outcome is within some given event B
  • The first letter was "t"
  • We want to quantify the likelihood that the outcome also belongs to some other given event A.
  • The second letter will be "h"
  • We need a new probability law that gives us the conditional probability of A given B
  • P(A|B): the probability of A given B

16
Joint Probability Distribution
  • The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values: P(X1,...,Xn)
  •              Sneeze   ¬Sneeze
  •   Cold        0.08      0.01
  •   ¬Cold       0.01      0.9
  • The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution.
  • All conditional probabilities can therefore also be calculated, e.g. P(Cold | Sneeze) (see the worked example below)
  • BUT it's often very hard to obtain all the probabilities for a joint distribution
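For instance, P(Cold | Sneeze) can be read off the table above by marginalizing. A minimal sketch:

# Joint distribution from the table, keyed by (cold, sneeze).
joint = {
    (True, True): 0.08,  (True, False): 0.01,
    (False, True): 0.01, (False, False): 0.9,
}

# P(Sneeze) is the marginal: sum over the Cold variable.
p_sneeze = joint[(True, True)] + joint[(False, True)]

# P(Cold | Sneeze) = P(Cold, Sneeze) / P(Sneeze)
print(joint[(True, True)] / p_sneeze)  # 0.08 / 0.09 ≈ 0.89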

17
Conditional Probability Example
  • Let's say A is "it's raining".
  • Let's say P(A) in dry California is .01
  • Let's say B is "it was sunny ten minutes ago"
  • P(A|B) means "what is the probability of it raining now if it was sunny 10 minutes ago?"
  • P(A|B) is probably way less than P(A)
  • Perhaps P(A|B) is .0001
  • Intuition: the knowledge about B should change our estimate of the probability of A.

18
Conditional Probability
  • Let A and B be events
  • P(A,B) and P(A ∩ B) both mean the probability that BOTH A and B occur
  • P(B|A) = the probability of event B occurring given that event A occurs
  • Definition: P(A|B) = P(A ∩ B) / P(B)
  • P(A,B) = P(A|B) P(B) (simple arithmetic)
  • P(A,B) = P(B,A) (AND is symmetric)

19
Bayes Theorem
  • We start with the definition of conditional probability: P(A|B) = P(A,B) / P(B)
  • So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can re-arrange the formula using Bayes Theorem: P(B|A) = P(A|B) P(B) / P(A)

20
Deriving Bayes Rule
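The derivation takes three lines, starting from the definition of conditional probability:
  • P(B|A) = P(A,B) / P(A) and P(A|B) = P(A,B) / P(B)
  • So P(B|A) P(A) = P(A,B) = P(A|B) P(B)
  • Dividing both sides by P(A): P(B|A) = P(A|B) P(B) / P(A)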
21
How to compute probabilities?
  • We don't have the probabilities for most NLP problems
  • We can try to estimate them from data (that's the learning part)
  • Usually we can't actually estimate the probability that something belongs to a given class given the information about it
  • BUT we can estimate the probability that something in a given class has particular values.

22
Simple Bayesian Reasoning
  • If we assume there are n possible disjoint rhetorical zones, z1 … zn
  • P(zi | w) = P(w | zi) P(zi) / P(w)
  • We want to know the probability of the zone given the word.
  • We can count how often we see the word in a sentence from a given zone in the training set, and how often the zone itself occurs.
  • P(w | zi) = the number of times we see this zone with this word, divided by how often we see the zone
  • This is the learning part.
  • Since P(w) is always the same, we can ignore it.
  • So now we only need to compute P(w | zi) P(zi) (see the sketch below)
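A minimal sketch of that counting estimate (the toy zone-labeled training pairs and the names are assumptions for illustration):

from collections import Counter

# Hypothetical training data: (word, zone) pairs from labeled sentences.
training = [("results", "z1"), ("method", "z2"), ("results", "z1"),
            ("we", "z2"), ("results", "z2")]

zone_counts = Counter(zone for _, zone in training)
word_zone_counts = Counter(training)

def score(word, zone):
    """P(w | zi) * P(zi), estimated from counts; P(w) is dropped
    because it is the same for every zone."""
    p_w_given_z = word_zone_counts[(word, zone)] / zone_counts[zone]
    p_z = zone_counts[zone] / len(training)
    return p_w_given_z * p_z

# Pick the most probable zone for a word.
print(max(zone_counts, key=lambda z: score("results", z)))  # z1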

23
Bayes Independence Example
  • Imagine there are diagnoses ALLERGY, COLD, and WELL, and symptoms SNEEZE, COUGH, and FEVER
  •   Prob             Well   Cold   Allergy
  •   P(d)             0.9    0.05   0.05
  •   P(sneeze | d)    0.1    0.9    0.9
  •   P(cough | d)     0.1    0.8    0.7
  •   P(fever | d)     0.01   0.7    0.4

24
Bayes Independence Example
  • By assuming independence, we ignore the possible interactions.
  • If the symptoms are sneeze, cough, and no fever:
  • P(well | e) = (0.9)(0.1)(0.1)(0.99)/P(e) = 0.0089/P(e)
  • P(cold | e) = (0.05)(0.9)(0.8)(0.3)/P(e) = 0.01/P(e)
  • P(allergy | e) = (0.05)(0.9)(0.7)(0.6)/P(e) = 0.019/P(e)
  • P(e) = .0089 + .01 + .019 = .0379
  • P(well | e) = .23
  • P(cold | e) = .26
  • P(allergy | e) = .50
  • Diagnosis: allergy (see the sketch below)
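A minimal sketch reproducing that computation with the table from the previous slide (the slide rounds intermediate products, so the exact posteriors differ slightly in the second decimal):

# P(d) and P(symptom | d); "no fever" contributes 1 - P(fever | d).
priors   = {"well": 0.9,  "cold": 0.05, "allergy": 0.05}
p_sneeze = {"well": 0.1,  "cold": 0.9,  "allergy": 0.9}
p_cough  = {"well": 0.1,  "cold": 0.8,  "allergy": 0.7}
p_fever  = {"well": 0.01, "cold": 0.7,  "allergy": 0.4}

# Unnormalized P(d | e) for evidence e = (sneeze, cough, no fever).
scores = {d: priors[d] * p_sneeze[d] * p_cough[d] * (1 - p_fever[d])
          for d in priors}
p_e = sum(scores.values())  # the normalizer P(e)

for d, s in scores.items():
    print(d, round(s / p_e, 2))     # ≈ well 0.23, cold 0.28, allergy 0.49
print(max(scores, key=scores.get))  # allergy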

25
Kupiec et al. Feature Representation
  • Fixed-phrase feature
  • Certain phrases indicate a summary, e.g. "in summary"
  • Paragraph feature
  • Paragraph-initial/final sentences are more likely to be important.
  • Thematic word feature
  • Repetition is an indicator of importance
  • Uppercase word feature
  • Uppercase often indicates named entities. (Taylor)
  • Sentence length cut-off
  • A summary sentence should be > 5 words.

26
Training
  • Hand-label sentences in training set (good/bad
    summary sentences)
  • Train classifier to distinguish good/bad summary
    sentences
  • Model used: Naïve Bayes
  • Can rank sentences according to score and show
    top n to user.

27
Naïve Bayes Classifier
  • The simpler version of Bayes was:
  • P(B|A) = P(A|B) P(B) / P(A)
  • P(Sentence | feature) = P(feature | S) P(S) / P(feature)
  • Using Naïve Bayes, we expand the number of features by defining a joint probability distribution:
  • P(Sentence, f1, f2, … fn) = P(Sentence) ∏ P(fi | Sentence)
  • We learn P(Sentence) and P(fi | Sentence) in training
  • At test time we need to state P(Sentence | f1, f2, … fn):
  • P(Sentence | f1, f2, … fn) = P(Sentence, f1, f2, … fn) / P(f1, f2, … fn) = P(Sentence) ∏ P(fi | Sentence) / ∏ P(fi)

28
Details: Bayesian Classifier
  • Assuming statistical independence of the features:
  • P(f1, …, fn | Sentence) = ∏ P(fi | Sentence) and P(f1, …, fn) = ∏ P(fi)
29
Bayesian Classifier (Kupiec et al. 95)
  • Each probability is calculated empirically from a corpus
  • See how often each feature is seen with a sentence selected for a summary, vs. how often that feature is seen in any sentence.
  • Higher-probability sentences are chosen to be in the summary
  • Performance
  • For 25% summaries, 84% precision

30
How to compute this?
  • For training, for each feature f:
  • For each sentence s:
  • If the sentence is in the target summary S, increment T
  • Increment F no matter which sentence it is in.
  • P(f|S) = T/N
  • P(f) = F/N
  • For testing, for each document:
  • For each sentence:
  • Multiply the probabilities of all of the features occurring in the sentence times the probability of any sentence being in the summary (a constant). Divide by the probability of the features occurring in any sentence. (See the sketch below.)
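A minimal sketch of that procedure (the data structures and feature predicates are assumptions for illustration; there is no smoothing, so a feature that never fires would need special handling):

def train(sentences, in_summary, features):
    """sentences: list of strings; in_summary: parallel list of booleans;
    features: list of predicate functions over a sentence."""
    n = len(sentences)
    p_f_given_s, p_f = {}, {}
    for f in features:
        t = sum(1 for s, lab in zip(sentences, in_summary) if lab and f(s))
        tot = sum(1 for s in sentences if f(s))
        p_f_given_s[f] = t / n   # P(f | S) = T/N, as on the slide
        p_f[f] = tot / n         # P(f) = F/N
    p_summary = sum(in_summary) / n  # P(any sentence is in the summary)
    return p_summary, p_f_given_s, p_f

def score(sentence, p_summary, p_f_given_s, p_f, features):
    # P(S) times P(f | S) / P(f) for each feature that fires.
    result = p_summary
    for f in features:
        if f(sentence):
            result *= p_f_given_s[f] / p_f[f]
    return result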

31
Learning Sentence Extraction Rules (Kupiec et
al. 95)
  • Results
  • About 87% (498) of the 568 summary sentences could be directly matched to sentences in the source (79% direct matches, 3% direct joins, 5% incomplete joins)
  • Location was the best single feature, at 163/498 = 33%
  • Paragraph + fixed-phrase + sentence-length cut-off gave the best sentence recall performance: 217/498 = 44%
  • At a compression rate of 25% (20 sentences), performance peaked at 84% sentence recall

J. Kupiec, J. Pedersen, and F. Chen. "A Trainable
Document Summarizer", Proceedings of the 18th
ACM-SIGIR Conference, pages 68--73, 1995.
32
Language Identification
33
Language identification
  • (Italian) Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.
  • (German) Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.
  • Universal Declaration of Human Rights, UN, in 363 languages
  • http://www.unhchr.ch/udhr/navigate/alpha.htm

34
Language identification
  • égaux
  • eguali
  • iguales
  • edistämään
  • Ü
  • How do we determine, for a stretch of text, which language it is from?

35
Language Identification
  • Turns out to be really simple
  • Just a few character bigrams can do it (Sibun & Reynar 96)
  • Based on language models for sample languages

36
Language model basics
  • Unigram and bigram models
  • Evaluating N-gram language models
  • Perplexity
  • The intuition of perplexity is that, given two probabilistic models, the better model is the one that better predicts new data (not used to train the model)
  • We can measure better prediction by comparing the probability each model assigns to the test data
  • The better model will assign a higher probability

37
Perplexity
For a test set W = w1 w2 … wN, perplexity is PP(W) = P(w1 w2 … wN)^(-1/N). Minimizing perplexity is equivalent to maximizing the test set probability according to the language model.
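A minimal sketch for a unigram model (the toy corpus is an assumption; real models smooth, and every test word is assumed to have been seen in training):

import math

# Unigram model estimated from a toy training corpus.
train = "the cat sat on the mat".split()
probs = {w: train.count(w) / len(train) for w in set(train)}

# PP(W) = P(w1 ... wN) ** (-1/N), computed in log space for stability.
test = "the cat sat".split()
log_p = sum(math.log(probs[w]) for w in test)
print(math.exp(-log_p / len(test)))  # lower = better prediction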
38
Entropy
  • X: a random variable with probability distribution p(x)
  • Entropy: H(X) = - Σ p(x) log2 p(x)

Entropy measures how predictive a given N-gram model is about what the next word could be
39
KL divergence (relative entropy)
D(p || q) = Σ p(x) log2 ( p(x) / q(x) )
The basis for comparing two probability distributions
40
Language Identification
  • Turns out to be really simple
  • Just a few character bigrams can do it (Sibun & Reynar 96)
  • Used the Kullback-Leibler distance (relative entropy); see the sketch after this list
  • Compare probability distribution of the test set
    to those for the languages trained on
  • Smallest distance determines the language
  • Using special character sets helps a bit, but
    barely
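A minimal sketch of that method (the toy training texts, add-one smoothing, and helper names are assumptions for illustration):

import math
from collections import Counter

def bigram_dist(text, vocab):
    """Character-bigram distribution with add-one smoothing over vocab."""
    grams = Counter(text[i:i+2] for i in range(len(text) - 1))
    total = sum(grams[g] + 1 for g in vocab)
    return {g: (grams[g] + 1) / total for g in vocab}

def kl(p, q):
    """D(p || q) = sum over x of p(x) log(p(x) / q(x))."""
    return sum(p[g] * math.log(p[g] / q[g]) for g in p)

# Toy "language models"; a real system trains on sizable corpora.
samples = {"en": "the quick brown fox jumps over the lazy dog",
           "it": "tutti gli esseri umani nascono liberi ed eguali"}
test = "gli esseri umani sono liberi"

vocab = {t[i:i+2] for t in list(samples.values()) + [test]
         for i in range(len(t) - 1)}
models = {lang: bigram_dist(t, vocab) for lang, t in samples.items()}
test_dist = bigram_dist(test, vocab)

# The smallest KL distance to a trained model determines the language.
print(min(models, key=lambda lang: kl(test_dist, models[lang])))  # it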

41
Language Identification
  • (Sibun & Reynar 96)

42
Confusion Matrix
  • A table that shows, for each class, which ones
    your algorithm got right and which wrong

(Matrix axes: the gold standard vs. the algorithm's guess)
44
Author Identification (Stylometry)
45
Author Identification
  • Also called Stylometry in the humanities
  • An example of a Classification Problem
  • Classifiers
  • Decide which of N buckets to put an item in
  • (Some classifiers allow for multiple buckets)

46
The Disputed Federalist Papers
  • In 1787-1788, Jay, Madison, and Hamilton wrote a series of anonymous essays to convince the voters of New York to ratify the new U.S. Constitution.
  • Scholars agree that:
  • 5 were authored by Jay
  • 51 were authored by Hamilton
  • 14 were authored by Madison
  • 3 were written jointly by Hamilton and Madison
  • 12 remain in dispute: Hamilton or Madison?

47
Author identification
  • Federalist papers
  • In 1963 Mosteller and Wallace solved the problem
  • They identified function words as good candidates for authorship analysis
  • Using statistical inference they concluded the
    author was Madison
  • Since then, other statistical techniques have
    supported this conclusion.

48
Function vs. Content Words
High rates for "by" favor Madison; low rates favor Hamilton. High rates for "from" favor Madison; low rates say little. High rates for "to" favor Hamilton; low rates favor Madison.
49
Function vs. Content Words
No consistent pattern for the content word "war"
50
Federalist Papers Problem
Fung, "The Disputed Federalist Papers: SVM Feature Selection via Concave Minimization", ACM TAPIA '03
51
Classification
  • Goal: assign objects from a universe to two or more classes or categories
  • Examples:
  •   Problem                    Object     Categories
  •   Tagging                    Word       POS
  •   Sense disambiguation       Word       The word's senses
  •   Information retrieval      Document   Relevant/not relevant
  •   Sentiment classification   Document   Positive/negative
  •   Author identification      Document   Authors

52
Text Categorization Applications
  • Web pages organized into category hierarchies
  • Journal articles indexed by subject categories
    (e.g., the Library of Congress, MEDLINE, etc.)
  • Census Bureau responses coded by occupation
  • Patents archived using International Patent
    Classification
  • Patient records coded using international
    insurance categories
  • E-mail message filtering
  • News events tracked and filtered by topics
  • Spam

53
  • Yahoo News Categories

54
Cost of Manual Text Categorization
  • Yahoo!
  • ~200 (?) people for manual labeling of Web pages
  • using a hierarchy of 500,000 categories
  • MEDLINE (National Library of Medicine)
  • $2 million/year for manual indexing of journal articles
  • using MEdical Subject Headings (18,000 categories)
  • Mayo Clinic
  • $1.4 million annually for coding patient-record events
  • using the International Classification of Diseases (ICD) for billing insurance companies
  • US Census Bureau decennial census (1990: 22 million responses)
  • 232 industry categories and 504 occupation categories
  • $15 million if fully done by hand

55
Knowledge Engineering vs. Statistical Learning
  • For the US Census Bureau Decennial Census 1990
  • 232 industry categories and 504 occupation categories
  • $15 million if fully done by hand
  • Knowledge engineering: define classification rules manually
  • Expert System AIOCS
  • Development time: 192 person-months (2 people, 8 years)
  • Accuracy: 47%
  • Statistical learning: learn the classification function
  • Nearest Neighbor classification (Creecy '92: 1-NN)
  • Development time: 4 person-months (Thinking Machine)
  • Accuracy: 60%

56
Text Topic categorization
  • Topic categorization: classify the document into semantic topics

Example 1 (sports): "The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie."

Example 2 (weather): "One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine."
57
The Reuters collection
  • A gold standard
  • A collection of 21,578 newswire documents
  • For research purposes: a standard text collection for comparing systems and algorithms
  • 135 valid topic categories

58
Reuters
  • Top topics in Reuters

59
Reuters Document Example
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE><BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter</BODY></TEXT></REUTERS>
60
Classification vs. Clustering
  • Classification assumes labeled data: we know how many classes there are, and we have examples of each class (labeled data).
  • Classification is supervised
  • In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are
  • Clustering is unsupervised

61
Categories (Labels, Classes)
  • Labeling data
  • 2 problems:
  • Deciding the possible classes (which ones, how many)
  • Domain- and application-dependent
  • Labeling the text
  • Difficult, time-consuming; inconsistency between annotators

62
Reuters Example, revisited
Why not the topic "policy"?
(The same Reuters document as on slide 59: its <TOPICS> tags are only livestock and hog, even though the body discusses "the future direction of farm policy and the tax law".)
63
Binary vs. multi-way classification
  • Binary classification: two classes
  • Multi-way classification: more than two classes
  • Sometimes it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for each class in turn

64
Features
  • >>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
  • >>> label = "sport"
  • >>> labeled_text = LabeledText(text, label)
  • Here the classification takes as input the whole string
  • What's the problem with that?
  • What are the features that could be useful for this example?

65
Feature terminology
  • Feature: an aspect of the text that is relevant to the task
  • Some typical features:
  • Words present in the text
  • Frequency of words
  • Capitalization
  • Are there named entities (NEs)?
  • WordNet
  • Others?

66
Feature terminology
  • Feature: an aspect of the text that is relevant to the task
  • Feature value: the realization of the feature in the text
  • Words present in the text: Kerry, Schumacher, China
  • Frequency of a word: Kerry (10), Schumacher (1)
  • Are there dates? Yes/no
  • Are there PERSONs? Yes/no
  • Are there ORGANIZATIONs? Yes/no
  • WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)

67
Feature Types
  • Boolean (or binary) features
  • Features that generate boolean (binary) values.
  • Boolean features are the simplest and the most common type of feature.
  • f1(text) = 1 if text contains "Kerry", 0 otherwise
  • f2(text) = 1 if text contains a PERSON, 0 otherwise

68
Feature Types
  • Integer Features
  • Features that generate integer values.
  • Integer features can be used to give classifiers
    access to more precise information about the
    text.
  • f1(text) = number of times the text contains "Kerry"
  • f2(text) = number of times the text contains a PERSON
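A minimal sketch of both feature types (the PERSON check stands in for a named-entity recognizer via a hypothetical name list):

# Hypothetical stand-in for a named-entity recognizer.
PERSON_NAMES = {"Kerry", "Schumacher"}

def contains_kerry(text):
    """Boolean feature: 1 if the text contains "Kerry", 0 otherwise."""
    return 1 if "Kerry" in text.split() else 0

def count_persons(text):
    """Integer feature: number of tokens recognized as PERSON."""
    return sum(1 for tok in text.split() if tok in PERSON_NAMES)

text = "Kerry met Schumacher and Kerry smiled"
print(contains_kerry(text), count_persons(text))  # 1 3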

69
Feature selection
  • How do we choose the right features?
  • Next lecture