
Lecture Notes 16: Bayes Theorem and Data Mining

- Zhangxi Lin
- ISQS 6347

Modeling Uncertainty

- Probability Review
- Bayes Classifier
- Value of Information
- Conditional Probability and Bayes Theorem
- Expected Value of Perfect Information
- Expected Value of Imperfect Information

Probability Review

- P(A|B) = P(A and B) / P(B)
- The probability of A given B
- Example: there are 40 female students in a class of 100. 10 of them are from foreign countries. 20 male students are also foreign students.
- Event A: a student from a foreign country
- Event B: a female student
- If we randomly choose a female student to present in the class, the probability that she is a foreign student is P(A|B) = 10 / 40 = 0.25, or P(A|B) = P(A and B) / P(B) = (10 / 100) / (40 / 100) = 0.1 / 0.4 = 0.25
- That is, P(A|B) = (# of A and B) / (# of B) = (# of A and B / Total) / (# of B / Total) = P(A and B) / P(B)
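The arithmetic in the example can be checked with a few lines of Python (the counts are taken directly from the slide):

```python
# Conditional probability P(A|B) for the class example:
# 100 students, 40 female, 10 of the female students are foreign.
total = 100
female = 40
female_foreign = 10

p_b = female / total                 # P(B) = P(female) = 0.4
p_a_and_b = female_foreign / total   # P(A and B) = P(foreign and female) = 0.1
p_a_given_b = p_a_and_b / p_b        # P(A|B) = 0.25

# The direct count-based version gives the same answer:
assert p_a_given_b == female_foreign / female
print(p_a_given_b)  # 0.25
```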

Venn Diagrams

(Venn diagram of the class of 100: the Female circle (30 + 10 = 40) and the Foreign Student circle (20 + 10 = 30) overlap in the 10 female foreign students.)

- Male foreign student only: 20
- Female non-foreign student only: 30
- Female foreign student (overlap): 10
- Male non-foreign student (outside both circles): 40

Probability Review

- Complement: P(not A) = 1 - P(A)

(Venn diagrams: Female vs. Non-Female; Foreign Student vs. Non-Foreign Student.)

Bayes Classifier

Bayes Theorem (From Wikipedia)

- In probability theory, Bayes' theorem (often called Bayes' Law) relates the conditional and marginal probabilities of two random events. It is often used to compute posterior probabilities given observations. For example, a patient may be observed to have certain symptoms. Bayes' theorem can be used to compute the probability that a proposed diagnosis is correct, given that observation.
- As a formal theorem, Bayes' theorem is valid in all interpretations of probability. However, it plays a central role in the debate around the foundations of statistics: frequentist and Bayesian interpretations disagree about the ways in which probabilities should be assigned in applications. Frequentists assign probabilities to random events according to their frequencies of occurrence or to subsets of populations as proportions of the whole, while Bayesians describe probabilities in terms of beliefs and degrees of uncertainty. The articles on Bayesian probability and frequentist probability discuss these debates at greater length.

Bayes Theorem

P(A|B) = P(A and B) / P(B) and P(B|A) = P(A and B) / P(A)

So

P(A|B) = P(B|A) P(A) / P(B)

The above formula is referred to as Bayes' theorem. It is extremely useful in decision analysis when using information.

Example of Bayes Theorem

- Given:
- A doctor knows that meningitis (M) causes stiff neck (S) 50% of the time
- Prior probability of any patient having meningitis is 1/50,000
- Prior probability of any patient having stiff neck is 1/20
- If a patient has a stiff neck, what's the probability he/she has meningitis?

P(M|S) = P(S|M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

Bayes Classifiers

- Consider each attribute and class label as random variables
- Given a record with attributes (A1, A2, ..., An)
- Goal is to predict class C (C in {c1, c2, ..., cm})
- Specifically, we want to find the value of C that maximizes P(C | A1, A2, ..., An)
- Can we estimate P(C | A1, A2, ..., An) directly from data?

Bayes Classifiers

- Approach:
- Compute the posterior probability P(C | A1, A2, ..., An) for all values of C using the Bayes theorem: P(C | A1, A2, ..., An) = P(A1, A2, ..., An | C) P(C) / P(A1, A2, ..., An)
- Choose the value of C that maximizes P(C | A1, A2, ..., An)
- Equivalent to choosing the value of C that maximizes P(A1, A2, ..., An | C) P(C), since the denominator does not depend on C
- How to estimate P(A1, A2, ..., An | C)?

Example

- C = Evade (Yes, No)
- A1 = Refund (Yes, No)
- A2 = Marital Status (Single, Married, Divorced)
- A3 = Taxable Income (60K - 220K)
- We can obtain P(A1, A2, A3 | C), P(A1, A2, A3), and P(C) from the data set
- Then calculate P(C | A1, A2, A3) for predictions given A1, A2, and A3, while C is unknown.

Naïve Bayes Classifier

- Assume independence among attributes Ai when the class is given:
- P(A1, A2, ..., An | Cj) = P(A1 | Cj) P(A2 | Cj) ... P(An | Cj)
- Can estimate P(Ai | Cj) for all Ai and Cj.
- A new point is classified to Cj if P(Cj) ∏ P(Ai | Cj) is maximal.
- Note: if P(Cj) is identical across classes, this is equivalent to finding the j for which ∏ P(Ai | Cj) is maximal.
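The decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production classifier; the toy records, attribute names, and values are invented:

```python
from collections import Counter, defaultdict

# Toy training records: (attribute dict, class label). Invented data.
records = [
    ({"Refund": "Yes", "Status": "Single"},  "No"),
    ({"Refund": "No",  "Status": "Married"}, "No"),
    ({"Refund": "No",  "Status": "Single"},  "Yes"),
    ({"Refund": "No",  "Status": "Married"}, "No"),
    ({"Refund": "No",  "Status": "Single"},  "Yes"),
]

# Estimate P(Cj) and P(Ai = v | Cj) by counting.
class_counts = Counter(label for _, label in records)
cond_counts = defaultdict(Counter)  # (attribute, class) -> value counts
for attrs, label in records:
    for a, v in attrs.items():
        cond_counts[(a, label)][v] += 1

def classify(attrs):
    """Pick the class maximizing P(Cj) * prod_i P(Ai | Cj)."""
    best, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / len(records)                 # P(Cj)
        for a, v in attrs.items():
            score *= cond_counts[(a, c)][v] / nc  # P(Ai = v | Cj)
        if score > best_score:
            best, best_score = c, score
    return best

print(classify({"Refund": "No", "Status": "Single"}))  # Yes
```

With these toy counts, P(Yes) ∏ P(Ai | Yes) = 0.4 × 1 × 1 = 0.4 beats P(No) ∏ P(Ai | No) = 0.6 × 2/3 × 1/3 ≈ 0.13, so the record is classified as Yes.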

How to Estimate Probabilities from Data?

- Class: P(C) = Nc / N
- e.g., P(No) = 7/10, P(Yes) = 3/10
- For discrete attributes: P(Ai | Ck) = |Aik| / Nc
- where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
- Examples:
- P(Status = Married | No) = 4/7, P(Refund = Yes | Yes) = 0

How to Estimate Probabilities from Data?

- For continuous attributes:
- Discretize the range into bins
- one ordinal attribute per bin
- violates the independence assumption
- Two-way split: (A < v) or (A > v)
- choose only one of the two splits as the new attribute
- Probability density estimation:
- Assume the attribute follows a normal distribution
- Use data to estimate the parameters of the distribution (e.g., mean and standard deviation)
- Once the probability distribution is known, we can use it to estimate the conditional probability P(Ai | c)

How to Estimate Probabilities from Data?

- Normal distribution: P(Ai | cj) = (1 / sqrt(2π σij²)) exp(-(Ai - μij)² / (2 σij²))
- One for each (Ai, cj) pair
- For (Income, Class = No):
- sample mean = 110
- sample variance = 2975
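Plugging the slide's sample mean and variance into the normal density reproduces the value used in the worked example on the next slide:

```python
import math

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income attribute for Class = No: sample mean 110, sample variance 2975.
p = normal_pdf(120, mean=110, var=2975)
print(round(p, 4))  # 0.0072
```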

Example of Naïve Bayes Classifier

Given a test record X = (Refund = No, Married, Income = 120K):

- P(X | Class = No) = P(Refund = No | Class = No) × P(Married | Class = No) × P(Income = 120K | Class = No) = 4/7 × 4/7 × 0.0072 = 0.0024
- P(X | Class = Yes) = P(Refund = No | Class = Yes) × P(Married | Class = Yes) × P(Income = 120K | Class = Yes) = 1 × 0 × 1.2 × 10^-9 = 0
- Since P(X | No) P(No) > P(X | Yes) P(Yes)
- Therefore P(No | X) > P(Yes | X) => Class = No
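The comparison can be reproduced numerically; the Gaussian density for Income comes from the model on the previous slide (small rounding differences from the slide's 0.0024 are expected):

```python
import math

def normal_pdf(x, mean, var):
    # Gaussian density used for the continuous Income attribute.
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Conditional probabilities from the slide for the record
# X = (Refund = No, Married, Income = 120K).
p_x_no = (4 / 7) * (4 / 7) * normal_pdf(120, 110, 2975)  # approx 0.0024
p_x_yes = 1 * 0 * 1.2e-9                                 # = 0

p_no, p_yes = 7 / 10, 3 / 10  # class priors
# Compare the unnormalized posteriors P(X | C) P(C).
prediction = "No" if p_x_no * p_no > p_x_yes * p_yes else "Yes"
print(prediction)  # No
```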

Naïve Bayes Classifier

- If one of the conditional probabilities is zero, then the entire expression becomes zero
- Probability estimation (original: Nic / Nc):
- Laplace: P(Ai | C) = (Nic + 1) / (Nc + c)
- m-estimate: P(Ai | C) = (Nic + m p) / (Nc + m)
- where c = number of classes, p = prior probability, m = parameter
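The two smoothing estimates are one-liners; the example call below uses the zero count P(Refund = Yes | Yes) = 0/3 from the earlier slide (the choices p = 0.5 and m = 2 are illustrative, not from the slides):

```python
def laplace(n_ic, n_c, c):
    """Laplace estimate: (N_ic + 1) / (N_c + c)."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, p, m):
    """m-estimate: (N_ic + m * p) / (N_c + m)."""
    return (n_ic + m * p) / (n_c + m)

# P(Refund = Yes | Yes) was 0/3; smoothing keeps it non-zero,
# so one unseen attribute value no longer zeroes the whole product.
print(laplace(0, 3, 2))              # 0.2
print(m_estimate(0, 3, p=0.5, m=2))  # 0.2
```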

Example of Naïve Bayes Classifier

A = attributes, M = mammals, N = non-mammals

Since P(A | M) P(M) > P(A | N) P(N) => Mammals

Naïve Bayes (Summary)

- Robust to isolated noise points
- Handles missing values by ignoring the instance during probability estimate calculations
- Robust to irrelevant attributes
- Independence assumption may not hold for some attributes
- Use other techniques such as Bayesian Belief Networks (BBN)

Value of Information

- When facing uncertain prospects we need information in order to reduce uncertainty
- Information gathering includes consulting experts, conducting surveys, performing mathematical or statistical analyses, etc.

Expected Value of Perfect Information (EVPI)

Problem: A buyer is to buy something online. Net gains by seller type:

- Not use insurance (pay 100):
- Good seller (0.99): gain 20
- Bad seller (0.01): lose 100
- EMV = 0.99 × 20 + 0.01 × (-100) = 18.8
- Use insurance (pay 100 + 2 = 102):
- Good seller (0.99): gain 18
- Bad seller (0.01): lose 2
- EMV = 0.99 × 18 + 0.01 × (-2) = 17.8
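The two expected monetary values can be verified directly from the slide's numbers:

```python
# EMV of the two insurance choices under the prior seller-type probabilities.
p_good, p_bad = 0.99, 0.01

no_insurance = p_good * 20 + p_bad * (-100)  # 18.8
insurance = p_good * 18 + p_bad * (-2)       # 17.8

print(round(no_insurance, 2), round(insurance, 2))  # 18.8 17.8
```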

Expected Value of Imperfect Information (EVII)

- We rarely have access to perfect information. Thus we must extend our analysis to deal with imperfect information.
- Now suppose we can access the online reputation to estimate the risk in trading with a seller.
- Someone provides their suggestions to you according to their experience. Their predictions are not 100% correct:
- If the product is actually good, the person's prediction is 90% correct, whereas the remaining 10% is suggested bad.
- If the product is actually bad, the person's prediction is 80% correct, whereas the remaining 20% is suggested good.
- Although the estimate is not accurate enough, it can be used to improve our decision making
- If we predict the risk is high to buy the product online, we purchase insurance

Decision Tree

Extended from the previous online trading question.

Questions: 1. Given the suggestion, what is your decision? 2. What is the probability w.r.t. the decision you made? 3. How do you estimate the accuracy of a prediction?

(Decision tree: the Buyer first receives a prediction, Predicted Good or Predicted Bad. Under each prediction there are two choices, No Insurance or Insurance, with the same payoffs as before: No Insurance pays 20 on Good (?) and -100 on Bad (?); Insurance pays 18 on Good (?) and -2 on Bad (?). The branch probabilities, marked (?), are the still-unknown posterior probabilities of the seller type.)

Applying Bayes Theorem

- Let Good be event A
- Let Bad be event B
- Let Predicted Good be event G
- Let Predicted Bad be event W
- According to the previous information, for example by data mining the historical data, we know:
- P(G|A) = 0.9, P(W|A) = 0.1
- P(W|B) = 0.8, P(G|B) = 0.2
- P(A) = 0.99, P(B) = 0.01
- We want to learn the probability that the outcome is good provided the prediction is good, i.e., P(A|G) = ?
- We want to learn the probability that the outcome is bad provided the prediction is bad, i.e., P(B|W) = ?
- We may apply Bayes theorem to solve this with imperfect information

Calculate P(G) and P(W)

- P(G) = P(G|A)P(A) + P(G|B)P(B)
- = 0.9 × 0.99 + 0.2 × 0.01
- = 0.893
- P(W) = P(W|B)P(B) + P(W|A)P(A)
- = 0.8 × 0.01 + 0.1 × 0.99
- = 0.107
- = 1 - P(G)

Applying Bayes Theorem

- We have
- P(A|G) = P(G|A)P(A) / P(G)
- = P(G|A)P(A) / (P(G|A)P(A) + P(G|B)P(B))
- = P(G|A)P(A) / (P(G|A)P(A) + P(G|B)(1 - P(A)))
- = 0.9 × 0.99 / (0.9 × 0.99 + 0.2 × 0.01)
- = 0.9978 > 0.99
- P(B|W) = P(W|B)P(B) / P(W)
- = P(W|B)P(B) / (P(W|B)P(B) + P(W|A)P(A))
- = P(W|B)P(B) / (P(W|B)P(B) + P(W|A)(1 - P(B)))
- = 0.8 × 0.01 / (0.8 × 0.01 + 0.1 × 0.99)
- = 0.0748 > 0.01
- Apparently, data mining provides good information and changes the original probability
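The total-probability and Bayes calculations above are easy to reproduce:

```python
p_a, p_b = 0.99, 0.01    # P(Good), P(Bad)
p_g_a, p_w_a = 0.9, 0.1  # P(G|A), P(W|A)
p_w_b, p_g_b = 0.8, 0.2  # P(W|B), P(G|B)

# Total probability of each prediction.
p_g = p_g_a * p_a + p_g_b * p_b  # P(G) = 0.893
p_w = p_w_b * p_b + p_w_a * p_a  # P(W) = 0.107

# Bayes' theorem for the posteriors.
p_a_g = p_g_a * p_a / p_g  # P(A|G) ≈ 0.9978
p_b_w = p_w_b * p_b / p_w  # P(B|W) ≈ 0.0748

print(round(p_g, 3), round(p_w, 3), round(p_a_g, 4), round(p_b_w, 4))
```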

Decision Tree

P(A) = 0.99, P(B) = 0.01

- Predicted Good, P(G) = 0.893:
- No Insurance: Good (0.9978) gain 20, Bad (0.0022) lose 100; EMV = 19.87 (your choice)
- Insurance: Good (0.9978) gain 18, Bad (0.0022) lose 2; EMV = 17.78
- Predicted Bad, P(W) = 0.107:
- No Insurance: Good (0.9252) gain 20, Bad (0.0748) lose 100; EMV = 11.03
- Insurance: Good (0.9252) gain 18, Bad (0.0748) lose 2; EMV = 16.50 (your choice)

Data mining can significantly improve your decision making accuracy!
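The four branch EMVs can be recomputed from the posterior probabilities; the exact figures may differ from the slide in the last digits because the posteriors are rounded, but the choice on each branch comes out the same:

```python
# Posterior seller-type probabilities under each prediction.
p_a_g, p_b_g = 0.9978, 0.0022  # P(Good|G), P(Bad|G)
p_a_w, p_b_w = 0.9252, 0.0748  # P(Good|W), P(Bad|W)

def emv(p_good, p_bad, gain_good, gain_bad):
    """Expected monetary value of a choice given outcome probabilities."""
    return p_good * gain_good + p_bad * gain_bad

print(round(emv(p_a_g, p_b_g, 20, -100), 2))  # predicted good, no insurance
print(round(emv(p_a_g, p_b_g, 18, -2), 2))    # predicted good, insurance
print(round(emv(p_a_w, p_b_w, 20, -100), 2))  # predicted bad, no insurance
print(round(emv(p_a_w, p_b_w, 18, -2), 2))    # predicted bad, insurance
```

As on the slide, skipping insurance wins when the prediction is Good, and buying insurance wins when the prediction is Bad.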

Consequences of a Decision

                  Predicted Good (G)       Predicted Bad (W)
                  (not to buy insurance)   (buy insurance, pay 2)
Actual Good (A)   a: Gain 20               b: Net gain 18
Actual Bad (B)    c: Lose 100              d: Cost 2

- P(A) = (a + b) / (a + b + c + d) = 0.99
- P(B) = (c + d) / (a + b + c + d) = 0.01
- P(G) = (a + c) / (a + b + c + d) = 0.893
- P(W) = (b + d) / (a + b + c + d) = 0.107
- P(G|A) = a / (a + b) = 0.9, P(W|A) = b / (a + b) = 0.1
- P(G|B) = c / (c + d) = 0.2, P(W|B) = d / (c + d) = 0.8

German Bank Credit Decision

                Computed Good (Action A, B)   Computed Bad (Action A, B)   Total
Actual Good     True Positive 600 (6, 0)      False Negative 100 (0, -1)   700
Actual Bad      False Positive 80 (-2, -1)    True Negative 220 (-20, 0)   300
Total           680                           320

This is a modified version of the German Bank credit decision problem.
1. Assume that, because of anti-discrimination regulation, there could be a cost in FN depending on the action taken.
2. The bank has two choices of actions, A and B. Each will have different results.
3. Question 1: When the classification model suggests that a specific loan applicant has a probability of 0.8 to be GOOD, which action should be taken?
4. Question 2: When the classification model suggests that a specific loan applicant has a probability of 0.6 to be GOOD, which action should be taken?

The Payoffs from Two Actions

Action A:
                Computed Good   Computed Bad   Total
Actual Good     TP 600 (6)      FN 100 (0)     700
Actual Bad      FP 100 (-2)     TN 200 (-20)   300
Total           700             300

Action B:
                Computed Good   Computed Bad   Total
Actual Good     TP 600 (0)      FN 100 (-1)    700
Actual Bad      FP 100 (-1)     TN 200 (0)     300
Total           700             300
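One way to attack the two questions is to compare the expected payoff of each action for an applicant the model computes as Good; the per-outcome payoffs below are read from the "Computed Good" columns of the two tables, and this reading of the tables is an assumption, not something the slides state explicitly:

```python
# Payoffs for an applicant computed Good:
# Action A: +6 if actually good, -2 if actually bad.
# Action B:  0 if actually good, -1 if actually bad.
def expected_payoff(p_good, payoff_good, payoff_bad):
    """Expected payoff of an action when the applicant is good with prob p_good."""
    return p_good * payoff_good + (1 - p_good) * payoff_bad

for p in (0.8, 0.6):  # the probabilities from Questions 1 and 2
    a = expected_payoff(p, 6, -2)
    b = expected_payoff(p, 0, -1)
    print(p, round(a, 2), round(b, 2))
```

Under this reading, Action A has the higher expected payoff at both probability levels; the point of the exercise is that the cutoff between actions depends on the payoff structure, not just the classifier's score.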

Summary

- There are two decision scenarios:
- In previous classification problems, when the predicted target is 1 we take an action, otherwise we do nothing. Only the action will make something different.
- There is a cutoff value for this kind of decision. A risk-averse person may set a higher cutoff value when the utility function is not linear with regard to the monetary result.
- The risk-averse person may opt to earn less without the emotional worry of the risk.
- In the current Bayesian decision problem, when the predicted target is 1 we take Action A, otherwise we take Action B. Both actions will result in some outcomes.

Web Page Browsing

P0

Problem When a browsing user Entered P5 from

P2, What is the probability He will proceed to

P3? How to solve the problem in general? 1.

Assume this is the first Order Markovian chain.

2. Construct a transition probability matrix

P1

P2

P5

0.7

P4

0.3

P3

- We notice that
- P(P2P4P0) may not equal to P(P2P4P1)
- There is only one entrance of the web site at P0
- There is no link from P3 to other pages.

Transition Probabilities

P(K, L) = probability of traveling FROM K TO L

From \ To   P1       P2       P3       P4       P5       Exit     P0/H
P0/H        P(H,1)   P(H,2)   P(H,3)   P(H,4)   P(H,5)   P(H,E)   P(H,H)
P1          P(1,1)   P(1,2)   P(1,3)   P(1,4)   P(1,5)   P(1,E)   P(1,H)
P2          P(2,1)   P(2,2)   P(2,3)   P(2,4)   P(2,5)   P(2,E)   P(2,H)
P3          P(3,1)   P(3,2)   P(3,3)   P(3,4)   P(3,5)   P(3,E)   P(3,H)
P4          P(4,1)   P(4,2)   P(4,3)   P(4,4)   P(4,5)   P(4,E)   P(4,H)
P5          P(5,1)   P(5,2)   P(5,3)   P(5,4)   P(5,5)   P(5,E)   P(5,H)
Exit        0        0        0        0        0        0        0
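Under the first-order Markov assumption, each entry of this matrix can be estimated from a web log by counting page-to-page transitions. A minimal sketch, using invented toy sessions rather than the actual Commrex log:

```python
from collections import Counter, defaultdict

# Toy click sessions (invented); "E" marks exit at the end of a session.
sessions = [
    ["P0", "P2", "P5", "P3"],
    ["P0", "P2", "P5", "P4"],
    ["P0", "P1", "P5", "P3"],
]

# Count K -> L transitions across all sessions.
counts = defaultdict(Counter)
for s in sessions:
    for k, l in zip(s, s[1:] + ["E"]):
        counts[k][l] += 1

def transition(k, l):
    """Estimate P(K, L) = count(K -> L) / count(K -> anything)."""
    total = sum(counts[k].values())
    return counts[k][l] / total if total else 0.0

print(transition("P5", "P3"))  # 2/3 for this toy log
```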

Demonstration

- Dataset: Commrex web log data
- Data Exploration
- Link analysis
- The links among nodes
- Calculate the transition matrix
- The Bayesian network model for the web log data
- Reference:
- David Heckerman, "A Tutorial on Learning With Bayesian Networks," Technical Report MSR-TR-95-06, March 1995 (revised November 1996).

Readings

- SPB, Chapter 3
- RG, Chapter 10