
Introduction to Discrete Bayesian Methods

Petri Nokelainen
petri.nokelainen_at_uta.fi

School of Education

University of Tampere, Finland

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

Overview

(Nokelainen, 2008.)

Overview

BDM = Bayesian Dependency Modeling
BCM = Bayesian Classification Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

Bayesian Classification Modeling

http://b-course.cs.helsinki.fi

The classification accuracy of the best model found is 83.48 % (58.57 %).

[Figure: B-Course classification model listing the variables COMMON FACTORS, PUB_T, CC_PR, CC_HE, PA, C_SHO, C_FAIL, CC_AB and CC_ES.]

Bayesian Dependency Modeling

http://b-course.cs.helsinki.fi

Bayesian Unsupervised Model-based Visualization

http://www.bayminer.com

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

Introduction to Bayesian Modeling

- From the social science researcher's point of view, the requirements of traditional frequentistic statistical analysis are very challenging.
- For example, the assumption of normality of both the phenomenon under investigation and the data is a prerequisite for traditional parametric frequentistic calculations.

Introduction to Bayesian Modeling

- In situations where
- a latent construct cannot be appropriately represented as a continuous variable,
- ordinal or discrete indicators do not reflect underlying continuous variables, or
- the latent variables cannot be assumed to be normally distributed,
- traditional Gaussian modeling is clearly not appropriate.
- In addition, normal distribution analysis sets minimum requirements for the number of observations, and the measurement level of the variables should be continuous.

Introduction to Bayesian Modeling

- Frequentistic parametric statistical techniques are designed for normally distributed (both theoretically and empirically) indicators that have linear dependencies.
- Univariate normality
- Multivariate normality
- Bivariate linearity

(Nokelainen, 2008, p. 119)

- The upper part of the figure contains two sections, namely parametric and non-parametric, divided into eight sub-sections (D, N, IO, ML, MD, O, C, S).
- The parametric approach is viable only if
- 1) both the phenomenon modeled and the sample follow a normal distribution,
- 2) the sample size is large enough (at least 30 observations),
- 3) continuous indicators are used, and
- 4) the dependencies between the observed variables are linear.
- Otherwise non-parametric techniques should be applied.

D = Design (ce = controlled experiment, co = correlational study)
N = Sample size
IO = Independent observations
ML = Measurement level (c = continuous, d = discrete, n = nominal)
MD = Multivariate distribution (n = normal, s = similar)
O = Outliers
C = Correlations
S = Statistical dependencies (l = linear, nl = non-linear)

Introduction to Bayesian Modeling

N = 11 500

Introduction to Bayesian Modeling

- The Bayesian method
- (1) is parameter-free: no user input is required; instead, the prior distributions of the model offer a theoretically justifiable method for affecting the model construction;
- (2) works with probabilities and can hence be expected to produce robust results with discrete data containing nominal and ordinal attributes;
- (3) has no limit for minimum sample size;
- (4) is able to analyze both linear and non-linear dependencies;
- (5) assumes no multivariate normal model;
- (6) allows prediction.

Introduction to Bayesian Modeling

- Probability is a mathematical construct that behaves in accordance with certain rules and can be used to represent uncertainty.
- Classical statistical inference is based on a frequency interpretation of probability; Bayesian inference is based on a subjective, degree-of-belief interpretation.
- Bayesian inference uses conditional probabilities to represent uncertainty.
- P(H | E, I): the probability of the unknown things, or hypothesis (H), given the evidence (E) and background information (I).

Introduction to Bayesian Modeling

- The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells us how to update our initial probabilities P(H) if we see evidence E, in order to find out P(H | E).

- A priori probability
- Conditional probability
- Posterior probability

P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ¬H) P(¬H)]

Introduction to Bayesian Modeling

- The theorem was invented by an English reverend, Thomas Bayes (1701-1761), and published posthumously (1763).

Introduction to Bayesian Modeling

- Bayesian inference comprises the following three principal steps:
- (1) Obtain the initial probabilities P(H) for the unknown things. (Prior distribution.)
- (2) Calculate the probabilities of the evidence E (data) given different values for the unknown things, i.e., P(E | H). (Likelihood or conditional distribution.)
- (3) Calculate the probability distribution of interest, P(H | E), using Bayes' theorem. (Posterior distribution.)
- Bayes' theorem can be used sequentially.

Introduction to Bayesian Modeling

- If we first receive some evidence E (data) and calculate the posterior P(H | E), and at some later point in time receive more data E', the calculated posterior can be used in the role of the prior to calculate a new posterior P(H | E, E'), and so on.
- The posterior P(H | E) expresses all the necessary information to perform predictions.
- The more evidence we get, the more certain we become of the unknowns, until all but one value combination for the unknowns have probabilities so close to zero that they can be neglected.
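The sequential use of Bayes' theorem can be sketched in a few lines of Python. This is an illustration only: the two-hypothesis setup and the numbers are invented for the example, not taken from the slides.

```python
# Sequential use of Bayes' theorem: the posterior after evidence E
# becomes the prior when new evidence E' arrives.

def update(prior, likelihood):
    """Return the posterior P(H|E) for each hypothesis H.

    prior      -- dict mapping hypothesis -> P(H)
    likelihood -- dict mapping hypothesis -> P(E|H)
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P(E), the normalizing constant
    return {h: joint[h] / evidence for h in joint}

prior = {"H": 0.5, "not-H": 0.5}
posterior = update(prior, {"H": 0.8, "not-H": 0.3})      # after evidence E
posterior = update(posterior, {"H": 0.8, "not-H": 0.3})  # after E'
```

Each call normalizes by P(E), so the output of one update can be fed straight back in as the prior of the next, exactly as the slide describes.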

Example 1: Applying Bayes' Theorem

- Company A is employing workers for short-term jobs that are well paid.
- The job sets certain prerequisites for the applicants' linguistic abilities.
- Earlier all the applicants were interviewed, but nowadays this has become an impossible task, as both the number of open vacancies and the number of applicants have increased enormously.
- The personnel department of the company was ordered to develop a questionnaire to preselect the most suitable applicants for the interview.

Example 1: Applying Bayes' Theorem

- The psychometrician who developed the instrument estimates that it would work out right on 90 out of 100 applicants, if they are honest.
- We know on the basis of earlier interviews that the terms (linguistic abilities) are valid for one per 100 persons living in the target population.
- The question is: if an applicant gets enough points to participate in the interview, is he or she hired for the job (after an interview)?

Example 1: Applying Bayes' Theorem

- The a priori probability P(H) is described by the number of those people in the target population who really are able to meet the requirements of the task (1 out of 100 = .01).
- The counter-assumption of the a priori is P(¬H), which equals 1 − P(H), thus .99.
- The psychometrician's belief about how the instrument works is called the conditional probability, P(E | H) = .9.
- The instrument's failure to indicate non-valid applicants, i.e., those who are not able to succeed in the following interview, is stated as P(E | ¬H), which equals .1.
- These values need not sum to one!

Example 1: Applying Bayes' Theorem

- A priori probability
- Conditional probability
- Posterior probability

P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ¬H) P(¬H)]
         = (.9)(.01) / [(.9)(.01) + (.1)(.99)]
         ≈ .08
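The screening example can be checked directly in Python. The function name `posterior` and its parameters are mine; the probabilities are those given on the slide.

```python
# Example 1 as a computation: P(H|E) for the applicant screening case.
# P(H) = .01 (applicant truly qualifies), P(E|H) = .9 (instrument accepts
# a qualified applicant), P(E|not-H) = .1 (instrument wrongly accepts).

def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Two-hypothesis Bayes' theorem: P(H|E)."""
    p_not_h = 1 - p_h
    numerator = p_e_given_h * p_h
    return numerator / (numerator + p_e_given_not_h * p_not_h)

p = posterior(0.01, 0.9, 0.1)
print(round(p, 2))  # 0.08, as on the slide
```

Changing the error rate reproduces the what-if scenarios that follow: with a 20 per cent error the same function gives about .04, and with a one per cent error .50.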


Example 1: Applying Bayes' Theorem

- What if the measurement error of the psychometrician's instrument had been 20 per cent?
- P(E | H) = 0.8, P(E | ¬H) = 0.2


Example 1: Applying Bayes' Theorem

- What if the measurement error of the psychometrician's instrument had been only one per cent?
- P(E | H) = 0.99, P(E | ¬H) = 0.01


Example 1: Applying Bayes' Theorem

- Quite often people tend to estimate probabilities to be too high or too low, as they are not able to update their beliefs even in simple decision-making tasks when situations change dynamically (Anderson, 1995).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- One of the most important rules that educational science journals apply to judge the scientific merits of any submitted manuscript is that all the reported results should be based on the so-called null hypothesis significance testing procedure (NHSTP) and its featured product, the p-value.
- Gigerenzer, Krauss and Vitouch (2004, p. 392) describe the "null ritual" as follows:
- 1) Set up a statistical null hypothesis of no mean difference or zero correlation. Don't specify the predictions of your research or of any alternative substantive hypotheses.
- 2) Use 5 per cent as a convention for rejecting the null. If significant, accept your research hypothesis.
- 3) Always perform this procedure.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true: P(D | H0) (id.).
- The first common misunderstanding is that the p-value of, say, a t-test would describe how probable it is to obtain the same result if the study were repeated many times (Thompson, 1994).
- Gerd Gigerenzer and his colleagues (id., p. 393) call this the "replication fallacy," as P(D | H0) is confused with 1 − P(D).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The second misunderstanding, shared by both applied statistics teachers and students, is that the p-value would prove or disprove H0. However, a significance test can only provide probabilities, not prove or disprove the null hypothesis.
- Gigerenzer (id., p. 393) calls this fallacy an "illusion of certainty": despite wishful thinking, P(D | H0) is not the same as P(H0 | D), and a significance test does not and cannot provide a probability for a hypothesis.
- Bayesian statistics provide a way of calculating the probability of a hypothesis (discussed later in this section).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- My statistics course grades (Autumn 2006, n = 12) ranged from one to five as follows: 1) n = 3, 2) n = 2, 3) n = 4, 4) n = 2, 5) n = 1, showing that the frequency of the lowest grade (1) on the course is three (25.0 %).
- Previous data from the same course (2000-2005) show that only five students out of 107 (4.7 %) had the lowest grade.
- Next, I will use the classical statistical approach (the likelihood principle) and Bayesian statistics to calculate whether the number of the lowest course grades is exceptionally high on my latest course compared to my earlier stat courses.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- There are numerous possible reasons behind such a development; for example, I may have become more critical in my assessment, or the students may be less motivated to learn quantitative techniques.
- However, I believe that the most important difference between the last and the preceding courses is that the assessment was based on a computer exercise with statistical computations.
- The preceding courses were assessed only with essay answers.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- I assume that the 12 students earned their grades independently (independent observations) of each other, as the computer exercise was conducted under my or my assistant's supervision.
- I further assume that the chance of getting the lowest grade, θ, is the same for each student.
- Therefore X, the number of lowest grades (1) on the scale from 1 to 5 among the 12 students in the latest stat course, has a binomial (12, θ) distribution: X ~ Bin(12, θ).
- For any integer r between 0 and 12, P(X = r | θ) = C(12, r) θ^r (1 − θ)^(12 − r).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The expected number of lowest grades is 12 × (5/107) ≈ 0.561.
- Theta is obtained by dividing the expected number of lowest grades by the number of students: 0.561 / 12 ≈ 0.05, so θ = 0.05.
- The null hypothesis is formulated as follows: H0: θ = 0.05, stating that the rate of the lowest grades on the current stat course is not a big thing and is comparable to the previous courses' rates.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Three alternative hypotheses are formulated to address the concern of the increased number of lowest grades (6, 7 and 8, respectively): H1: θ = 0.06, H2: θ = 0.07, H3: θ = 0.08.
- H1: 12/(107/6) = .67 → .67/12 = .056, θ ≈ .06
- H2: 12/(107/7) = .79 → .79/12 = .065, θ ≈ .07
- H3: 12/(107/8) = .90 → .90/12 = .075, θ ≈ .08

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- To compare the hypotheses, we calculate binomial distributions for each value of θ.
- For example, the null hypothesis (H0) calculation yields P(3 | .05, 12) ≈ .017.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The results for the alternative hypotheses are as follows:
- P_H1(3 | .06, 12) ≈ .027
- P_H2(3 | .07, 12) ≈ .039
- P_H3(3 | .08, 12) ≈ .053.
- The ratio of the hypotheses is roughly 1 : 2 : 2 : 3, and it could be verbally interpreted with statements like "the second and third hypotheses explain the data about equally well" or "the fourth hypothesis explains the data about three times as well as the first hypothesis."
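The four likelihoods can be reproduced with the binomial formula in a few lines of Python. A minimal sketch; the function name `binom_pmf` is mine, not from the slides.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """P(X = r | theta, n): binomial probability of r lowest grades."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Likelihood of the observed r = 3 under each hypothesized theta
for theta in (0.05, 0.06, 0.07, 0.08):
    print(f"P(3 | {theta}, 12) = {binom_pmf(3, theta):.3f}")
```

The printed values (.017, .027, .039, .053) match the slide's figures for H0 through H3.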

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Lavine (1999) reminds us that P(r | θ, n), as a function of r (3) and θ = .05, .06, .07, .08, describes only how well each hypothesis explains the data; no value of r other than 3 is relevant.
- For example, P(4 | .05, 12) is irrelevant, as it does not describe how well any hypothesis explains the data.
- This likelihood principle, that is, basing statistical inference only on the observed data and not on data that might have been observed, is an essential feature of the Bayesian approach.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The Fisherian, so-called classical approach to testing the null hypothesis (H0: θ = .05) against the alternative hypothesis (H1: θ > .05) is to calculate the p-value, which defines the probability under H0 of observing an outcome at least as extreme as the outcome actually observed: p = P(X ≥ 3 | θ = .05, n = 12).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- As an example, the first part of the formula is solved as follows: P(3 | .05, 12) = C(12, 3)(.05)^3(.95)^9 ≈ .017.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- After the calculations, the p-value of .02 would suggest H0 rejection, if the rejection level of significance is set at 5 per cent.
- Calculation of the p-value violates the likelihood principle by using P(r | θ, n) for values of r other than the observed value of r = 3 (Lavine, 1999).
- The summands P(4 | .05, 12), P(5 | .05, 12), …, P(12 | .05, 12) do not describe how well any hypothesis explains the observed data.
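The tail sum that produces the p-value can be computed directly. A sketch in Python; `binom_pmf` is my own helper name.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """Binomial probability P(X = r | theta, n)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# The p-value sums over r = 3, ..., 12, i.e. outcomes at least as
# extreme as the observed r = 3, under H0 (theta = .05)
p_value = sum(binom_pmf(r, 0.05) for r in range(3, 13))
print(round(p_value, 2))  # 0.02
```

Note how the sum ranges over values of r that were never observed, which is exactly the likelihood-principle violation the slide points out.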

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- A Bayesian approach continues from the same point as the classical approach, namely the probabilities given by the binomial distributions, but it also makes use of other relevant sources of a priori information.
- In this domain, it is plausible to think that the computer test (SPSS exam) would make the number of total failures more probable than in previous times, when the evaluation was based solely on the essays.
- On the other hand, the computer test has only a 40 per cent weight in the equation that defines the final stat course grade: .3(Essay 1) + .3(Essay 2) + .4(Computer test) = Final grade.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Another aspect is to consider the nature of the aforementioned tasks, as the essays are distance work assignments while the computer test is performed under observation.
- Perhaps the course grades of my earlier stat courses have a narrower dispersion due to a violation of the independent observations assumption?
- For example, some students may have copy-pasted text from other sources or collaborated without permission.
- As we see, there are many sources of a priori information, which I judge to be inconclusive; thus I define the null hypothesis to be as likely to be true as false.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- This a priori judgment is expressed mathematically as P(H0) = 1/2 = P(H1) + P(H2) + P(H3).
- I further assume that the alternative hypotheses H1, H2 and H3 share the same likelihood: P(H1) = P(H2) = P(H3) = 1/6.
- These prior distributions summarize the knowledge about θ prior to incorporating the information from my course grades.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- An application of Bayes' theorem yields P(H0 | r = 3) = P(H0) P(3 | .05, 12) / [P(H0) P(3 | .05, 12) + P(H1) P(3 | .06, 12) + P(H2) P(3 | .07, 12) + P(H3) P(3 | .08, 12)] ≈ .30.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Similar calculations for the alternative hypotheses yield P(H1 | r = 3) ≈ .16, P(H2 | r = 3) ≈ .23 and P(H3 | r = 3) ≈ .31.
- These posterior distributions summarize the knowledge about θ after incorporating the grade information.
- The four hypotheses seem to be about equally likely (.30 vs. .16, .23, .31).
- The odds are about 2 to 1 (.30 vs. .70) that the latest stat course had a higher rate of lowest grades than 0.05.
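The whole posterior calculation can be checked numerically. A sketch in Python with my own variable names; the priors (1/2 for H0, 1/6 for each alternative) are those stated above, and the posteriors are normalized so that they sum to one.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """Binomial probability P(X = r | theta, n)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Priors: P(H0) = 1/2, P(H1) = P(H2) = P(H3) = 1/6
priors = {0.05: 1/2, 0.06: 1/6, 0.07: 1/6, 0.08: 1/6}

# Likelihood of the observed r = 3 under each hypothesis
likelihoods = {t: binom_pmf(3, t) for t in priors}

# Bayes' theorem: posterior = prior * likelihood / evidence
evidence = sum(priors[t] * likelihoods[t] for t in priors)
posteriors = {t: priors[t] * likelihoods[t] / evidence for t in priors}

for t, p in posteriors.items():
    print(f"P(theta = {t} | r = 3) = {p:.2f}")
```

The odds against H0 come straight out of this dictionary: `1 - posteriors[0.05]` gives about .70.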

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The difference between classical and Bayesian statistics would be only philosophical (probability vs. inverse probability) if they always led to similar conclusions.
- In this case the p-value would suggest rejection of H0 (p = .02).
- The Bayesian analysis would also suggest evidence against θ = .05 (.30 vs. .70, a ratio of .43).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- What if the number of the lowest grades on the last course had been two?
- The classical approach would no longer suggest H0 rejection (p = .12).
- The Bayesian result would still say that there is more evidence against than for H0 (.39 vs. .61, a ratio of .64).

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

B-Course

Bayesian Classification Modeling

- Bayesian Classification Modeling (BCM) is implemented in the B-Course software, which is based on discrete Bayesian methods.
- This also applies to Bayesian Dependency Modeling, which is discussed later.
- Quantitative indicators with a high measurement level (continuous, interval) lose more information in the discretization process than qualitative indicators (ordinal, nominal), as they are all treated in the analysis as nominal (discrete) indicators.

Bayesian Classification Modeling

- For example, the variable gender may include the numerical values 1 (Female) or 2 (Male), or the text values "Female" and "Male," in a discrete Bayesian analysis.
- This will inevitably lead to a loss of power (Cohen, 1988; Murphy & Myors, 1998); however, ensuring that the sample size is large enough is a simple way to address this problem.

Sample size estimation

- N: population size.
- n: estimated sample size.
- Sampling error (e): difference between the true (unknown) value and the observed values, if the survey were repeated (sample collected) numerous times.
- Confidence interval: spread of the observed values that would be seen if the survey were repeated numerous times.
- Confidence level: how often the observed values would be within the sampling error of the true value if the survey were repeated numerous times.

(Murphy & Myors, 1998.)
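One common shortcut consistent with the definitions above is Yamane's finite-population formula, n = N / (1 + N·e²), which assumes a 95 per cent confidence level and maximum variability (p = .5). The slides themselves do not prescribe a formula, so this is offered only as an illustration.

```python
from math import ceil

def yamane(N, e):
    """Estimated sample size n for population size N and sampling
    error e, using Yamane's finite-population approximation."""
    return ceil(N / (1 + N * e**2))

print(yamane(1000, 0.05))  # 286
```

For example, a population of 1 000 with a 5 per cent sampling error calls for roughly 286 respondents.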

Bayesian Classification Modeling

- The aim of BCM is to select the variables that are the best predictors of different class memberships (e.g., gender, job title, level of giftedness).
- In the classification process, the automatic search looks for the best set of variables to predict the class variable for each data item.

Bayesian Classification Modeling

- The search procedure resembles traditional linear discriminant analysis (LDA, see Huberty, 1994), but the implementation is totally different.
- For example, the variable selection problem that is addressed with a forward, backward or stepwise selection procedure in LDA is replaced with a genetic algorithm approach (e.g., Hilario, Kalousis, Prados & Binz, 2004; Hsu, 2004) in Bayesian classification modeling.

Bayesian Classification Modeling

- The genetic algorithm approach means that variable selection is not limited to one (or two or three) specific approach; instead, many approaches and their combinations are exploited.
- One possible approach is to begin with the presumption that models (i.e., possible predictor variable combinations) that resemble each other a lot (i.e., have almost the same variables and discretizations) are likely to be almost equally good.
- This leads to a search strategy in which models that resemble the current best model are selected for comparison, instead of picking models randomly.

Bayesian Classification Modeling

- Another approach is to abandon the habit of always rejecting the weakest model and instead collect a set of relatively good models.
- The next step is to combine the best parts of these models so that the resulting combined model is better than any of the original models.
- B-Course is capable of mobilizing many more viable approaches, for example, occasionally rejecting the better model (algorithms like hill climbing and simulated annealing) or trying to avoid picking a similar model twice (tabu search).
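To make the search idea concrete, here is an illustrative greedy hill-climbing search over predictor subsets, one of the strategy families the slide mentions. The scoring function is a toy stand-in, and the variable names are borrowed from the slides only for flavor; B-Course's actual scores and search heuristics differ.

```python
def hill_climb(variables, score):
    """Greedily add or remove single variables while the score improves."""
    current = frozenset()
    best = score(current)
    improved = True
    while improved:
        improved = False
        for v in variables:
            # Neighbor models differ from the current one by one variable
            candidate = current - {v} if v in current else current | {v}
            s = score(candidate)
            if s > best:
                current, best, improved = candidate, s, True
    return set(current), best

# Toy score: the closer the subset is to {"SA", "C_SHO"}, the better
target = {"SA", "C_SHO"}
score = lambda s: -len(set(s) ^ target)
print(hill_climb(["SA", "C_SHO", "COMP", "PA"], score))
```

This sketch embodies the first strategy described above: only neighbors of the current best model are compared, instead of models picked at random.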

Bayesian Classification Modeling

Nokelainen, P., Ruohotie, P., & Tirri, H. (1999).

For an example of practical use of BCM, see Nokelainen, Tirri, Campbell and Walberg (2007).

The results of Bayesian classification modeling showed that the estimated classification accuracy of the best model found was 60 %. The left-hand side of Figure 3 shows that only three variables, Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and the Computer Literacy Composite (COMP), were successful predictors of A or C group membership. All the other variables that were not accepted in the model are to be considered connective factors between the two groups. The middle section of Figure 3 shows that the two strongest predictors were Olympians' Conducive Home Atmosphere (20.9 %) and Olympians' School Shortcomings (22.6 %). The confusion matrix shows that most of the A (25 correct out of 39) and the C (29 out of 47) group members were correctly classified. The matrix also shows that nine participants of group A were incorrectly classified into group C, and vice versa.


Figure 4 presents predictive modeling of the A and C groups (A_C, A or C group membership) by Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and the Computer Literacy Composite (COMP). The left-hand side of the figure presents the initial model with no values fixed. The model in the middle presents a scenario where all the A group members are selected. When we compare this model to the one on the right-hand side (i.e., presenting a situation where all the C group members are selected), we notice, for example, that the conditional distribution of Olympians' Conducive Home Atmosphere (SA) has changed. It shows that the highly productive Olympians have reported a more conducive home atmosphere (54.0 %) than the members of the low-productivity group C (23.0 %).


- Modeling of Vocational Excellence in Air Traffic Control
- This paper aims to describe the characteristics and predictors that explain air traffic controllers' (ATCO) vocational expertise and excellence.
- The study analyzes the role of natural abilities, self-regulative abilities and environmental conditions in ATCOs' vocational development.

(Pylväs, Nokelainen & Roisko, in press.)

- Modeling of Vocational Excellence in Air Traffic Control
- The target population of the study consisted of the ATCOs in Finland (N = 300), of whom 28, representing four different airports, were interviewed.
- The research data also included the interviewees' aptitude test scores, study records and employee assessments.

- Modeling of Vocational Excellence in Air Traffic Control
- The research questions were examined by using theoretical concept analysis.
- The qualitative data analysis was conducted with content analysis and Bayesian classification modeling.


- Modeling of Vocational Excellence in Air Traffic Control
- (RQ1a) What are the differences in characteristics between the air traffic controllers representing vocational expertise and vocational excellence?

- Modeling of Vocational Excellence in Air Traffic Control
- "the natural ambition of wanting to be good. Air traffic controllers have perhaps generally a strong professional pride."
- "Interesting and rewarding work, that is the basis of wanting to stay in this work until retiring."

- Modeling of Vocational Excellence in Air Traffic Control
- "I read all the regulations and instructions carefully and precisely, and try to think the majority wave aside of them. It reflects on work."
- "but still I consider myself more precise than the majority; a bad air traffic controller has delays, good air traffic controllers do not have delays, which is something that also pilots appreciate because of the strict time limits."

- Modeling of Vocational Excellence in Air Traffic Control

Classification accuracy 89 %.


Outline

- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

B-Course

Bayesian Dependency Modeling

- Bayesian dependency modeling (BDM) is applied to examine dependencies between variables through both their visual representation and the probability ratio of each dependency.
- The graphical visualization of a Bayesian network contains two components:
- 1) Observed variables, visualized as ellipses.
- 2) Dependencies, visualized as lines between the nodes.

Example 4: Calculation of Bayesian Score

- The Bayesian score (BS), that is, the probability of the model P(M | D), allows the comparison of different models.

Figure 9. An Example of Two Competing Bayesian

Network Structures

(Nokelainen, 2008, p. 121.)

Example 4: Calculation of Bayesian Score

- Let us assume that we have the following data:

x1 x2
1  1
1  1
2  2
1  2
1  1

- Model 1 (M1) represents the two variables, x1 and x2, without a statistical dependency, and model 2 (M2) represents the two variables with a dependency (i.e., with a connecting arc).
- The binomial data might be the result of an experiment where the five participants have drunk a nice cup of tea before (x1) and after (x2) a test of geographic knowledge.

Example 4: Calculation of Bayesian Score

- In order to calculate P(M1 | D) and P(M2 | D), we need to solve P(D | M1) and P(D | M2) for the two models M1 and M2.
- The probability of the data given the model is solved by using the following marginal likelihood equation (Congdon, 2001, p. 473; Myllymäki, Silander, Tirri, & Uronen, 2001; Myllymäki & Tirri, 1998, p. 63):

P(D | M) = ∏_{i=1..n} ∏_{j=1..q_i} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1..r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)]   (Equation 4)

Example 4: Calculation of Bayesian Score

- In Equation 4, the following symbols are used:
- n is the number of variables (i indexes the variables from 1 to n);
- r_i is the number of values of the ith variable (k indexes these values from 1 to r_i);
- q_i is the number of possible configurations of the parents of the ith variable (j indexes these configurations from 1 to q_i);
- N_ij describes the number of rows in the data that have the jth configuration for the parents of the ith variable;
- N_ijk describes how many of the rows in the data that have the kth value for the ith variable also have the jth configuration for the parents of the ith variable;
- N' is the equivalent sample size, set to be the average number of values divided by two.
- The marginal likelihood equation produces a Bayesian Dirichlet score that allows model comparison (Heckerman et al., 1995; Tirri, 1997; Neapolitan & Morris, 2004).

Example 4: Calculation of Bayesian Score

- First, P(D | M1) is calculated given the values of variable x1. With two values per variable, the equivalent sample size is N' = 2/2 = 1, so that N'_ij = (2/2)/1 = 1 and N'_ijk = (2/2)/2 = 0.5.

x1 x2
1  1
1  1
2  2
1  2
1  1

Example 4: Calculation of Bayesian Score

- Second, the values for x2 are calculated.

Example 4: Calculation of Bayesian Score

- The BS, the probability of the first model P(M1 | D), is 0.027 × 0.012 ≈ 0.000324.

Example 4: Calculation of Bayesian Score

- Third, P(D | M2) is calculated given the values of variable x1.

Example 4: Calculation of Bayesian Score

- Fourth, the values for the first parent configuration (x1 = 1) are calculated.

Example 4: Calculation of Bayesian Score

- Fifth, the values for the second parent configuration (x1 = 2) are calculated.

Example 4: Calculation of Bayesian Score

- The BS, the probability of the second model P(M2 | D), is 0.027 × 0.027 × 0.500 ≈ 0.000365.

Example 4: Calculation of Bayesian Score

- Bayes' theorem enables the calculation of the ratio of the two models, M1 and M2.
- As both models share the same a priori probability, P(M1) = P(M2), both probabilities cancel out.
- Also the probability of the data, P(D), cancels out in the following equation, as it appears in both formulas in the same position.

Example 4: Calculation of Bayesian Score

- The result of the model comparison, P(M1 | D) / P(M2 | D) = 0.000324 / 0.000365 ≈ 0.89, shows that since the ratio is less than 1, M2 is more probable than M1.
- This result becomes explicit when we investigate the sample data more closely.
- Even a sample this small (n = 5) shows that there is a clear tendency between the values of x1 and x2 (four out of five value pairs are identical).
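The whole Example 4 computation can be reproduced with gamma functions. This is a sketch with my own function and variable names; the equivalent sample size is N' = 1, as the slides set it to the average number of values (2) divided by two. Note that the exact products (about 0.000320 and 0.000366) differ slightly from the slide's 0.000324 and 0.000365, which come from multiplying the rounded factors; the ratio comes out at exactly 0.875.

```python
from math import lgamma, exp

# Data from the example: (x1, x2) per participant
data = [(1, 1), (1, 1), (2, 2), (1, 2), (1, 1)]

def family_score(counts, n_prime_ij):
    """Marginal likelihood term for one (variable, parent configuration):
    counts[k] is N_ijk, the count of each value k of the variable."""
    n_ij = sum(counts)
    n_prime_ijk = n_prime_ij / len(counts)
    log_score = lgamma(n_prime_ij) - lgamma(n_prime_ij + n_ij)
    for n_ijk in counts:
        log_score += lgamma(n_prime_ijk + n_ijk) - lgamma(n_prime_ijk)
    return exp(log_score)

x1_counts = [4, 1]  # values 1 and 2 of x1
x2_counts = [3, 2]  # values 1 and 2 of x2

# M1: x1 and x2 independent, one parent configuration each (q_i = 1)
m1 = family_score(x1_counts, 1.0) * family_score(x2_counts, 1.0)

# M2: x1 -> x2, so x2 has two parent configurations (q_i = 2, N'_ij = 0.5)
x2_given_x1_is_1 = [3, 1]  # x2 values on the four rows where x1 = 1
x2_given_x1_is_2 = [0, 1]  # x2 values on the one row where x1 = 2
m2 = (family_score(x1_counts, 1.0)
      * family_score(x2_given_x1_is_1, 0.5)
      * family_score(x2_given_x1_is_2, 0.5))

print(f"P(D|M1) = {m1:.6f}, P(D|M2) = {m2:.6f}, ratio = {m1 / m2:.3f}")
```

The three factors of `m2` reproduce the slide's 0.027, 0.027 and 0.500, and the ratio below one confirms that M2 is the more probable model.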

- How many models are there?

For an example of practical use of BDM, see

Nokelainen and Tirri (2010).

Our hypothesis regarding the first research question was that intrinsic goal orientation (INT) is positively related to moral judgment (Batson & Thompson, 2001; Kunda & Schwartz, 1983). It was also hypothesized, based on Blasi's (1999) argument that emotions cannot be predictors of moral action, that fear of failure (the affective motivational section) is not related to moral judgment. The research evidence showed support for both hypotheses: firstly, only intrinsic motivation was directly (positively) related to moral judgment, and secondly, the affective motivational section was not present in the predictive model.

(Nokelainen & Tirri, 2010.)

Conditioning on the three levels of moral judgment showed that there is a positive statistical relationship between moral judgment and intrinsic goal orientation. The probability of belonging to the most intrinsically motivated group three (M = 3.7-5.0) increases from 15 per cent to 90 per cent alongside the moral judgment abilities. There is also a similar but less steep increase in extrinsic goal orientation (from 5 % to 12 %), but we believe that it is mostly tied to the increase in intrinsic goal orientation.

(Nokelainen & Tirri, 2010.)

For an example of practical use of BDM, see Nokelainen and Tirri (2007).

(Nokelainen & Tirri, 2007.)

2 % vs. 90 %

21 % vs. 78 %

EL_iv_17_49: In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.
EL_ii_09_26: My superior sees other people in a positive rather than in a negative light.
EL_ii_09_25: My superior has an optimistic "glass half full" outlook.

(Nokelainen & Tirri, 2007.)

69

66

EL_iv_17_49 In conflict situations, my superior

is able to draw out all parties and understand

the differing perspectives. EL_ii_09_26 My

superior sees other people in positive rather

than in negative light. EL_ii_09_25 My

superior has an optimistic "glass half full"

outlook.

(Nokelainen Tirri, 2007.)

95

85

EL_iv_17_49 In conflict situations, my superior

is able to draw out all parties and understand

the differing perspectives. EL_ii_09_26 My

superior sees other people in positive rather

than in negative light. EL_ii_09_25 My

superior has an optimistic "glass half full"

outlook.

(Nokelainen Tirri, 2007.)

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

BayMiner

Bayesian Unsupervised Model-based Visualization

[Figure: taxonomy of multivariate techniques]
- Supervised: LDA, BSMV
- Unsupervised:
  - Visualization techniques
    - Non-reducing
    - Reducing: projection techniques
      - Linear: PCA, projection pursuit
      - Non-linear: MDS, SOM, neural networks, principal curves, ICA, BUMV
  - Cluster analysis
  - EFA
  - Discrete multivariate analysis

Bayesian Unsupervised Model-based Visualization

- Supervised techniques, for example linear discriminant analysis (LDA) and supervised Bayesian networks (BSMV; see Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000), assume a given structure (Venables & Ripley, 2002, p. 301).
- Unsupervised techniques, for example exploratory factor analysis (EFA), discover the variable structure from the evidence of the data matrix.
- Unsupervised techniques are further divided into four sub-categories: 1) visualization techniques, 2) cluster analysis, 3) factor analysis, and 4) discrete multivariate analysis.


Bayesian Unsupervised Model-based Visualization

- According to Venables and Ripley (id.), visualization techniques are often more effective than clustering techniques in discovering interesting groupings in the data, and they avoid the danger of over-interpreting the results, as the researcher does not input the number of expected latent dimensions.
- In cluster analysis the centroids that represent the clusters are still high-dimensional, and some additional illustration technique is needed for visualization (Kaski, 1997), for example MDS (Kim, Kwon & Cook, 2000).

Bayesian Unsupervised Model-based Visualization

- Several graphical means have been proposed for visualizing high-dimensional data items directly, by letting each dimension govern some aspect of the visualization and then integrating the results into one figure.
- These techniques can be used to visualize any kind of high-dimensional data vectors, either the data items themselves or vectors formed of some descriptors of the data set, like the five-number summaries (Tukey, 1977).

Bayesian Unsupervised Model-based Visualization

- The simplest technique for visualizing a data set is to plot a profile of each item, that is, a two-dimensional graph in which the dimensions are enumerated on the x-axis and the corresponding values on the y-axis.
- Other alternatives are scatter plots and pie diagrams.

Bayesian Unsupervised Model-based Visualization

- The major drawback of all these techniques is that they do not reduce the amount of data.
- If the data set is large, a display portraying all the data items separately will be incomprehensible (Kaski, 1997).
- Techniques that reduce the dimensionality of the data items are called projection techniques.


Bayesian Unsupervised Model-based Visualization

- The goal of the projection is to represent the input data items in a lower-dimensional space in such a way that certain properties of the structure of the data set are preserved as faithfully as possible.
- The projection can be used to visualize the data set if a sufficiently small output dimensionality is chosen (id.).
- Projection techniques are divided into two major groups: linear and non-linear projection techniques.


Bayesian Unsupervised Model-based Visualization

- Linear projection techniques consist of principal component analysis (PCA) and projection pursuit.
- In exploratory projection pursuit (Friedman, 1987) the data is again projected linearly, but this time the aim is to find a projection that reveals as much of the non-normally distributed structure of the data set as possible.
- This is done by assigning a numerical interestingness index to each possible projection and maximizing that index.
- The definition of interestingness is based on how much the projected data deviates from normally distributed data in the main body of its distribution.
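The index-maximization step can be sketched in code. The following is a toy illustration, not Friedman's actual index: it scores a grid of one-dimensional projections of an artificial bimodal sample with a simple non-normality measure (absolute excess kurtosis) and keeps the best direction. The sample, the angle grid, and the index choice are all assumptions made for this sketch:

```python
import math, random

random.seed(0)

# Artificial 2-D data: bimodal along the x-axis, Gaussian along the y-axis.
data = [(random.choice([-3.0, 3.0]) + random.gauss(0, 1), random.gauss(0, 1))
        for _ in range(400)]

def interestingness(sample):
    """Absolute excess kurtosis: near 0 for Gaussian data, larger for 'interesting' shapes."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((s - m) ** 2 for s in sample) / n
    kurt = sum((s - m) ** 4 for s in sample) / (n * var ** 2)
    return abs(kurt - 3.0)

def project(angle):
    """Project each 2-D point onto the direction given by the angle."""
    return [x * math.cos(angle) + y * math.sin(angle) for x, y in data]

# Score a grid of candidate projection directions and keep the best one.
angles = [k * math.pi / 36 for k in range(36)]
scores = {a: interestingness(project(a)) for a in angles}
best = max(scores, key=scores.get)
```

Directions near the x-axis score highest, because the projected data there is strongly bimodal, while projections near the y-axis look Gaussian and score close to zero.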


Bayesian Unsupervised Model-based Visualization

- Non-linear unsupervised projection techniques consist of multidimensional scaling, principal curves and various other techniques, including SOM, neural networks and Bayesian unsupervised networks (Kontkanen, Lahtinen, Myllymäki & Tirri, 2000).


Bayesian Unsupervised Model-based Visualization

- The aforementioned PCA technique, despite its popularity, cannot take into account non-linear structures (structures consisting of arbitrarily shaped clusters or curved manifolds), since it describes the data in terms of a linear subspace.
- Projection pursuit tries to express some non-linearities, but if the data set is high-dimensional and highly non-linear, it may be difficult to visualize it with linear projections onto a low-dimensional display even if the projection angle is chosen carefully (Friedman, 1987).
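This limitation is easy to demonstrate. The sketch below (pure Python; the circular sample is an assumption made for illustration) runs PCA on points lying on a circle, a curved one-dimensional manifold, and the first principal component explains only half of the variance:

```python
import math

# Points on a unit circle: a curved one-dimensional manifold embedded in 2-D.
n = 360
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

# Population covariance matrix of the 2-D sample.
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
sxx = sum((x - mx) ** 2 for x, _ in pts) / n
syy = sum((y - my) ** 2 for _, y in pts) / n
sxy = sum((x - mx) * (y - my) for x, y in pts) / n

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_tr = (sxx + syy) / 2
disc = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_tr + disc, half_tr - disc

explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # 0.5: no dominant linear direction exists on the circle
```

Although the data is intrinsically one-dimensional (a single angle describes every point), PCA needs both linear components equally, so the curved structure is invisible to it.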

Bayesian Unsupervised Model-based Visualization

- Several approaches have been proposed for reproducing non-linear higher-dimensional structures on a lower-dimensional display.
- The most common techniques allocate a representation for each data point in the lower-dimensional space and try to optimize these representations so that the distances between them are as similar as possible to the original distances of the corresponding data items.
- The techniques differ in how the different distances are weighted and how the representations are optimized (Kaski, 1997).

Bayesian Unsupervised Model-based Visualization

- Multidimensional scaling (MDS) is not one specific tool; instead it refers to a group of techniques widely used especially in the behavioral, econometric, and social sciences to analyze subjective evaluations of pairwise similarities of entities.
- The starting point of MDS is a matrix consisting of the pairwise dissimilarities of the entities.
- The basic idea of the MDS technique is to approximate the original set of distances with distances corresponding to a configuration of points in a Euclidean space.
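That approximation step can be sketched as a small metric-MDS computation: plain gradient descent on the raw stress, the sum of squared differences between configuration distances and target dissimilarities. The toy dissimilarity matrix, step size, and iteration count are assumptions for illustration, not the algorithm used in any cited work:

```python
import math, random

# Target dissimilarities between four entities (symmetric, zero diagonal).
delta = [[0, 2, 4, 3],
         [2, 0, 2, 4],
         [4, 2, 0, 2],
         [3, 4, 2, 0]]
n = len(delta)

random.seed(1)
# Random initial 2-D configuration of the four points.
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]

def stress(X):
    """Sum of squared differences between configuration and target distances."""
    return sum((math.dist(X[i], X[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

initial = stress(X)
for _ in range(1000):                      # plain gradient descent on the stress
    grad = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(X[i], X[j]) or 1e-9
            c = 2 * (d - delta[i][j]) / d
            grad[i][0] += c * (X[i][0] - X[j][0])
            grad[i][1] += c * (X[i][1] - X[j][1])
    for i in range(n):
        X[i][0] -= 0.01 * grad[i][0]
        X[i][1] -= 0.01 * grad[i][1]

final = stress(X)
```

The final configuration places the four points so that their Euclidean distances approximate the target dissimilarities, which is exactly the MDS idea described above.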

Bayesian Unsupervised Model-based Visualization

- MDS can be considered an alternative to factor analysis.
- In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects.
- In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix.

Bayesian Unsupervised Model-based Visualization

- With MDS we may analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices, specifying that we want to reproduce the distances based on n dimensions.
- After the matrix is formed, MDS attempts to arrange the objects (e.g., factors of a growth-oriented atmosphere) in a space with a particular number of dimensions so as to reproduce the observed distances.
- As a result, the distances are explained in terms of underlying dimensions.

Bayesian Unsupervised Model-based Visualization

- MDS based on Euclidean distance does not generally reflect the properties of complex problem domains properly.
- In real-world situations the similarity of two vectors is not a universal property: from different points of view they may in the end appear quite dissimilar (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000).
- Another problem with the MDS techniques is that they are computationally very intensive for large data sets.

Bayesian Unsupervised Model-based Visualization

- Bayesian unsupervised model-based visualization (BUMV) is based on Bayesian networks (BN).
- A BN is a representation of a probability distribution over a set of random variables, consisting of a directed acyclic graph (DAG), where the nodes correspond to domain variables and the arcs define a set of independence assumptions which allow the joint probability distribution for a data vector to be factorized as a product of simple conditional probabilities.
- Two vectors are considered similar if they lead to similar predictions when given as input to the same Bayesian network model (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000).
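The factorization can be made concrete with a toy network. The DAG, the binary variables, and the probability tables below are invented for this sketch and do not come from the cited studies:

```python
from itertools import product

# Toy DAG: A -> B, A -> C. The joint factorizes as P(A) * P(B|A) * P(C|A).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_a = {True: {True: 0.6, False: 0.4}, False: {True: 0.5, False: 0.5}}

def joint(a, b, c):
    """Joint probability as a product of simple conditional probabilities."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# The factorized distribution is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))

# Example inference by marginalization: P(B = True) = 0.3*0.8 + 0.7*0.1 = 0.31.
p_b_true = sum(joint(a, True, c) for a, c in product([True, False], repeat=2))
```

The independence assumptions encoded by the arcs are what keep the tables small: eight joint probabilities are recovered from only five free parameters.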

Bayesian Unsupervised Model-based Visualization

- Naturally, there are numerous viable alternatives to BUMV, such as the Self-Organizing Map (SOM) and Independent Component Analysis (ICA).
- SOM is a neural network algorithm that has been used for a wide variety of applications, mostly for engineering problems but also for data analysis (Kohonen, 1995).
- SOM is based on a neighborhood-preserving topological map tuned according to the geometric properties of the sample vectors.
- ICA minimizes the statistical dependence of the components, trying to find a transformation in which the components are as statistically independent as possible (Hyvärinen & Oja, 2000).
- The usage of ICA is comparable to PCA, where the aim is to present the data in a manner that facilitates further analysis.

Bayesian Unsupervised Model-based Visualization

- The first major difference between the Bayesian and neural network approaches, from the educational science researcher's point of view, is that the former operates with a familiar symmetrical probability range from 0 to 1, while the upper limit of the asymmetrical probability scale in the latter approach is unknown.
- The second fundamental difference between the two types of networks is that a perceptron in the hidden layers of a neural network does not in itself have an interpretation in the domain of the system, whereas all the nodes of a Bayesian network represent concepts that are well defined with respect to the domain (Jensen, 1995).

Bayesian Unsupervised Model-based Visualization

- The meaning of a node and its probability table can be subject to discussion, regardless of their function in the network, but it does not make any sense to discuss the meaning of the nodes and weights in a neural network: perceptrons in the hidden layers only have a meaning in the context of the functionality of the network.
- Construction of a Bayesian network requires detailed knowledge of the domain in question.
- If such knowledge can only be obtained through a series of examples (i.e., a database of cases), neural networks seem to be an easier approach. This might be true in cases such as the reading of handwritten letters, face recognition, and other areas where the activity is a 'craftsman-like' skill based solely on experience.

(Jensen, 1995.)

Bayesian Unsupervised Model-based Visualization

- It is often criticized that in order to construct a Bayesian network you have to know too many probabilities.
- However, this number does not differ considerably from the number of weights and thresholds that have to be known in order to build a neural network, and these can only be learnt by training.
- A weakness of neural networks is that you are unable to utilize the knowledge you might have in advance.
- Probabilities, on the other hand, can be assessed using a combination of theoretical insight, empirical studies independent of the constructed system, training, and various more or less subjective estimates.

Bayesian Unsupervised Model-based Visualization

- In the construction of a neural network, it is decided in advance which relations information is gathered about, and which relations the system is expected to compute (the route of inference is fixed).
- Bayesian networks are much more flexible in that respect.

(Jensen, 1995.)

For an example of practical use of BUMV, see

Nokelainen and Ruohotie (2009).

Results showed that managers and teachers had higher growth motivation and a higher level of commitment to work than other personnel, including job titles such as cleaner, caretaker, accountant and computer support. Employees across all job titles in the organization who have temporary or part-time contracts had higher self-reported growth motivation and commitment to work and to the organization than their established colleagues.


Links

- B-Course: http://b-course.cs.helsinki.fi
- BayMiner: http://www.bayminer.com

References

- Anderson, J. (1995). Cognitive Psychology and its Implications. New York: Freeman.
- Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, 53, 370-418.
- Bernardo, J., & Smith, A. (2000). Bayesian theory. New York: Wiley.
- Congdon, P. (2001). Bayesian Statistical Modelling. Chichester: John Wiley & Sons.
- Friedman, J. (1987). Exploratory Projection Pursuit. Journal of the American Statistical Association, 82, 249-266.
- Gigerenzer, G. (2000). Adaptive thinking. New York: Oxford University Press.
- Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 391-408). Thousand Oaks: Sage.

References

- Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton: Chapman & Hall/CRC.
- Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.
- Hilario, M., Kalousis, A., Prados, J., & Binz, P.-A. (2004). Data mining for mass-spectra based diagnosis and biomarker discovery. Drug Discovery Today: BIOSILICO, 2(5), 214-222.
- Huberty, C. (1994). Applied Discriminant Analysis. New York: John Wiley & Sons.
- Hyvärinen, A., & Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5), 411-430.
- Jensen, F. V. (1995). Paradigms of Expert Systems. HUGIN Lite 7.4 User Manual.

References

- Kaski, S. (1997). Data exploration using self-organizing maps. Doctoral dissertation. Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82. Espoo: Finnish Academy of Technology.
- Kim, S., Kwon, S., & Cook, D. (2000). Interactive Visualization of Hierarchical Clusters Using MDS and MST. Metrika, 51(1), 39-51.
- Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer.
- Kontkanen, P., Lahtinen, J., Myllymäki, P., Silander, T., & Tirri, H. (2000). Supervised Model-based Visualization of High-dimensional Data. Intelligent Data Analysis, 4, 213-227.
- Kontkanen, P., Lahtinen, J., Myllymäki, P., & Tirri, H. (2000). Unsupervised Bayesian Visualization of High-Dimensional Data. In R. Ramakrishnan, S. Stolfo, R. Bayardo, & I. Parsa (Eds.), Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 325-329). New York, NY: The Association for Computing Machinery.

References

- Lavine, M. L. (1999). What is Bayesian Statistics and Why Everything Else is Wrong. The Journal of Undergraduate Mathematics and Its Applications, 20, 165-174.
- Lindley, D. V. (1971). Making Decisions. London: Wiley.
- Lindley, D. V. (2001). Harold Jeffreys. In C. C. Heyde & E. Seneta (Eds.), Statisticians of the Centuries (pp. 402-405). New York: Springer.
- Murphy, K. R., & Myors, B. (1998). Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. Mahwah, NJ: Lawrence Erlbaum Associates.
- Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence Tools, 11(3), 369-387.
- Myllymäki, P., & Tirri, H. (1998). Bayes-verkkojen mahdollisuudet [Possibilities of Bayesian Networks]. Teknologiakatsaus 58/98. Helsinki: TEKES.

References

- Neapolitan, R. E., & Morris, S. (2004). Probabilistic Modeling Using Bayesian Networks. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 371-390). Thousand Oaks, CA: Sage.
- Nokelainen, P. (2008). Modeling of Professional Growth and Learning: Bayesian Approach. Tampere: Tampere University Press.
- Nokelainen, P., & Ruohotie, P. (2009). Investigating Growth Prerequisites in a Finnish Polytechnic for Higher Education. Journal of Workplace Learning, 21(1), 36-57.
- Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the Number of Non-linear and Multi-modal Relationships Between Observed Variables Measuring a Growth-oriented Atmosphere. Quality & Quantity, 41(6), 869-890.
- Nokelainen, P., & Tirri, K. (2007). Empirical Investigation of Finnish School Principals' Emotional Leadership Competencies. In S. Saari & T. Varis (Eds.), Professional Growth (pp. 424-438). Hämeenlinna: RCVE.

References

- Nokelainen, P., Ruohotie, P., & Tirri, H. (1999). Professional Growth Determinants: Comparing Bayesian and Linear Approaches to Classification. In P. Ruohotie, H. Tirri, P. Nokelainen, & T. Silander (Eds.), Modern Modeling of Professional Growth, vol. 1 (pp. 85-120). Hämeenlinna: RCVE.
- Nokelainen, P., & Tirri, K. (2010). Role of Motivation in the Moral and Religious Judgment of Mathematically Gifted Adolescents. High Ability Studies, 21(2), 101-116.
- Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2004). Cross-cultural Factors that Account for Adult Productivity. In J. R. Campbell, K. Tirri, P. Ruohotie, & H. Walberg (Eds.), Cross-cultural Research: Basic Issues, Dilemmas, and Strategies (pp. 119-139). Hämeenlinna: RCVE.
- Nokelainen, P., Tirri, K., & Merenti-Välimäki, H.-L. (2007). Investigating the Influence of Attribution Styles on the Development of Mathematical Talent. Gifted Child Quarterly, 51(1), 64-81.
- Pylväs, L., Nokelainen, P., & Roisko, H. (in press). Modeling of Vocational Excellence in Air Traffic Control. Submitted for review.

References

- Tirri, H. (1997). Plausible Prediction by Bayesian Inference. Department of Computer Science, Series of Publications A, Report A-1997-1. Helsinki: University of Helsinki.
- Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
- Venables, W. N., & Ripley, B. D. (2002). Mo