
Introduction to Discrete Bayesian Methods

Petri Nokelainen
petri.nokelainen_at_uta.fi

School of Education

University of Tampere, Finland

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

Overview

(Nokelainen, 2008.)

Overview

BDM = Bayesian Dependency Modeling
BCM = Bayesian Classification Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

Bayesian Classification Modeling

http://b-course.cs.helsinki.fi

The classification accuracy of the best model found is 83.48 % (58.57 %).

[Figure: B-Course classification model listing the variables COMMON FACTORS, PUB_T, CC_PR, CC_HE, PA, C_SHO, C_FAIL, CC_AB and CC_ES.]

Bayesian Dependency Modeling

http://b-course.cs.helsinki.fi

Bayesian Unsupervised Model-based Visualization

http://www.bayminer.com

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

Introduction to Bayesian Modeling

- From the social science researcher's point of view, the requirements of traditional frequentistic statistical analysis are very challenging.
- For example, the assumption of normality of both the phenomenon under investigation and the data is a prerequisite for traditional parametric frequentistic calculations.

Introduction to Bayesian Modeling

- In situations where
- a latent construct cannot be appropriately represented as a continuous variable,
- ordinal or discrete indicators do not reflect underlying continuous variables, or
- the latent variables cannot be assumed to be normally distributed,
- traditional Gaussian modeling is clearly not appropriate.
- In addition, normal distribution analysis sets minimum requirements for the number of observations, and the measurement level of the variables should be continuous.

Introduction to Bayesian Modeling

- Frequentistic parametric statistical techniques are designed for normally distributed (both theoretically and empirically) indicators that have linear dependencies.
- Univariate normality
- Multivariate normality
- Bivariate linearity

(Nokelainen, 2008, p. 119)

- The upper part of the figure contains two sections, namely parametric and non-parametric, divided into eight sub-sections (D, N, IO, ML, MD, O, C, S).
- The parametric approach is viable only if
- 1) both the phenomenon modeled and the sample follow a normal distribution,
- 2) the sample size is large enough (at least 30 observations),
- 3) continuous indicators are used, and
- 4) the dependencies between the observed variables are linear.
- Otherwise non-parametric techniques should be applied.

D = Design (ce = controlled experiment, co = correlational study)
N = Sample size
IO = Independent observations
ML = Measurement level (c = continuous, d = discrete, n = nominal)
MD = Multivariate distribution (n = normal, s = similar)
O = Outliers
C = Correlations
S = Statistical dependencies (l = linear, nl = non-linear)

Introduction to Bayesian Modeling

N = 11 500

Introduction to Bayesian Modeling

- The Bayesian method
- (1) is parameter-free: no user input is required; instead, the prior distributions of the model offer a theoretically justifiable method for affecting the model construction;
- (2) works with probabilities and can hence be expected to produce robust results with discrete data containing nominal and ordinal attributes;
- (3) has no limit for minimum sample size;
- (4) is able to analyze both linear and non-linear dependencies;
- (5) assumes no multivariate normal model;
- (6) allows prediction.

Introduction to Bayesian Modeling

- Probability is a mathematical construct that behaves in accordance with certain rules and can be used to represent uncertainty.
- Classical statistical inference is based on a frequency interpretation of probability; Bayesian inference is based on a subjective, degree-of-belief interpretation.
- Bayesian inference uses conditional probabilities to represent uncertainty.
- P(H | E, I): the probability of the unknown things, or hypothesis (H), given the evidence (E) and background information (I).

Introduction to Bayesian Modeling

- The essence of Bayesian inference is in the rule, known as Bayes' theorem, that tells us how to update our initial probabilities P(H) if we see evidence E, in order to find out P(H | E).

- A priori probability
- Conditional probability
- Posterior probability

P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ¬H) P(¬H)]

Introduction to Bayesian Modeling

- The theorem was invented by an English reverend, Thomas Bayes (1701-1761), and published posthumously (1763).

Introduction to Bayesian Modeling

- Bayesian inference comprises the following three principal steps:
- (1) Obtain the initial probabilities P(H) for the unknown things. (Prior distribution.)
- (2) Calculate the probabilities of the evidence E (data) given different values for the unknown things, i.e., P(E | H). (Likelihood or conditional distribution.)
- (3) Calculate the probability distribution of interest, P(H | E), using Bayes' theorem. (Posterior distribution.)
- Bayes' theorem can be used sequentially.

Introduction to Bayesian Modeling

- If we first receive some evidence E (data) and calculate the posterior P(H | E), and at some later point in time receive more data E', the calculated posterior can be used in the role of the prior to calculate a new posterior P(H | E, E'), and so on.
- The posterior P(H | E) expresses all the necessary information to perform predictions.
- The more evidence we get, the more certain we become of the unknowns, until all but one value combination for the unknowns have probabilities so close to zero that they can be neglected.
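The sequential use of Bayes' theorem can be sketched in a few lines of Python. This is an illustration only: the two-hypothesis setup and the numbers are invented for the example, not taken from the slides.

```python
# Sequential use of Bayes' theorem: the posterior after evidence E
# becomes the prior when new evidence E' arrives.

def update(prior, likelihood):
    """Return the posterior P(H|E) for each hypothesis H.

    prior      -- dict mapping hypothesis -> P(H)
    likelihood -- dict mapping hypothesis -> P(E|H)
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())  # P(E), the normalizing constant
    return {h: joint[h] / evidence for h in joint}

prior = {"H": 0.5, "not-H": 0.5}
posterior = update(prior, {"H": 0.8, "not-H": 0.3})      # after evidence E
posterior = update(posterior, {"H": 0.8, "not-H": 0.3})  # after E'
```

Each call normalizes by P(E), so the output of one update can be fed straight back in as the prior of the next, exactly as the slide describes.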

Example 1: Applying Bayes' Theorem

- Company A is employing workers for short-term jobs that are well paid.
- The job sets certain prerequisites for the applicants' linguistic abilities.
- Earlier all the applicants were interviewed, but nowadays this has become an impossible task, as both the number of open vacancies and the number of applicants have increased enormously.
- The personnel department of the company was ordered to develop a questionnaire to preselect the most suitable applicants for the interview.

Example 1: Applying Bayes' Theorem

- The psychometrician who developed the instrument estimates that it would work out right on 90 out of 100 applicants, if they are honest.
- We know on the basis of earlier interviews that the terms (linguistic abilities) are valid for one per 100 persons living in the target population.
- The question is: if an applicant gets enough points to participate in the interview, is he or she hired for the job (after an interview)?

Example 1: Applying Bayes' Theorem

- The a priori probability P(H) is described by the number of those people in the target population who really are able to meet the requirements of the task (1 out of 100 = .01).
- The counter-assumption of the a priori is P(¬H), which equals 1 − P(H), thus .99.
- The psychometrician's belief about how the instrument works is called the conditional probability, P(E | H) = .9.
- The instrument's failure to indicate non-valid applicants, i.e., those who are not able to succeed in the following interview, is stated as P(E | ¬H), which equals .1.
- These values need not sum to one!

Example 1: Applying Bayes' Theorem

- A priori probability
- Conditional probability
- Posterior probability

P(H | E) = P(E | H) P(H) / [P(E | H) P(H) + P(E | ¬H) P(¬H)]
         = (.9)(.01) / [(.9)(.01) + (.1)(.99)]
         ≈ .08
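The screening example can be checked directly in Python. The function name `posterior` and its parameters are mine; the probabilities are those given on the slide.

```python
# Example 1 as a computation: P(H|E) for the applicant screening case.
# P(H) = .01 (applicant truly qualifies), P(E|H) = .9 (instrument accepts
# a qualified applicant), P(E|not-H) = .1 (instrument wrongly accepts).

def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Two-hypothesis Bayes' theorem: P(H|E)."""
    p_not_h = 1 - p_h
    numerator = p_e_given_h * p_h
    return numerator / (numerator + p_e_given_not_h * p_not_h)

p = posterior(0.01, 0.9, 0.1)
print(round(p, 2))  # 0.08, as on the slide
```

Changing the error rate reproduces the what-if scenarios that follow: with a 20 per cent error the same function gives about .04, and with a one per cent error .50.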


Example 1: Applying Bayes' Theorem

- What if the measurement error of the psychometrician's instrument had been 20 per cent?
- P(E | H) = 0.8, P(E | ¬H) = 0.2


Example 1: Applying Bayes' Theorem

- What if the measurement error of the psychometrician's instrument had been only one per cent?
- P(E | H) = 0.99, P(E | ¬H) = 0.01


Example 1: Applying Bayes' Theorem

- Quite often people tend to estimate probabilities to be too high or too low, as they are not able to update their beliefs even in simple decision-making tasks when situations change dynamically (Anderson, 1995).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- One of the most important rules that educational science journals apply to judge the scientific merits of any submitted manuscript is that all the reported results should be based on the so-called null hypothesis significance testing procedure (NHSTP) and its featured product, the p-value.
- Gigerenzer, Krauss and Vitouch (2004, p. 392) describe the "null ritual" as follows:
- 1) Set up a statistical null hypothesis of no mean difference or zero correlation. Don't specify the predictions of your research or of any alternative substantive hypotheses.
- 2) Use 5 per cent as a convention for rejecting the null. If significant, accept your research hypothesis.
- 3) Always perform this procedure.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- A p-value is the probability of the observed data (or of more extreme data points), given that the null hypothesis H0 is true: P(D | H0) (id.).
- The first common misunderstanding is that the p-value of, say, a t-test would describe how probable it is to obtain the same result if the study were repeated many times (Thompson, 1994).
- Gerd Gigerenzer and his colleagues (id., p. 393) call this the "replication fallacy," as P(D | H0) is confused with 1 − P(D).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The second misunderstanding, shared by both applied statistics teachers and students, is that the p-value would prove or disprove H0. However, a significance test can only provide probabilities, not prove or disprove the null hypothesis.
- Gigerenzer (id., p. 393) calls this fallacy an "illusion of certainty": despite wishful thinking, P(D | H0) is not the same as P(H0 | D), and a significance test does not and cannot provide a probability for a hypothesis.
- Bayesian statistics provide a way of calculating the probability of a hypothesis (discussed later in this section).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- My statistics course grades (Autumn 2006, n = 12) ranged from one to five as follows: 1) n = 3, 2) n = 2, 3) n = 4, 4) n = 2, 5) n = 1, showing that the frequency of the lowest grade (1) on the course is three (25.0 %).
- Previous data from the same course (2000-2005) show that only five students out of 107 (4.7 %) had the lowest grade.
- Next, I will use the classical statistical approach (the likelihood principle) and Bayesian statistics to calculate whether the number of the lowest course grades is exceptionally high on my latest course compared to my earlier stat courses.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- There are numerous possible reasons behind such a development; for example, I may have become more critical in my assessment, or the students may be less motivated to learn quantitative techniques.
- However, I believe that the most important difference between the last and the preceding courses is that the assessment was based on a computer exercise with statistical computations.
- The preceding courses were assessed only with essay answers.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- I assume that the 12 students earned their grades independently (independent observations) of each other, as the computer exercise was conducted under my or my assistant's supervision.
- I further assume that the chance of getting the lowest grade, θ, is the same for each student.
- Therefore X, the number of lowest grades (1) on the scale from 1 to 5 among the 12 students in the latest stat course, has a binomial (12, θ) distribution: X ~ Bin(12, θ).
- For any integer r between 0 and 12, P(X = r | θ) = C(12, r) θ^r (1 − θ)^(12 − r).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The expected number of lowest grades is 12 × (5/107) ≈ 0.561.
- Theta is obtained by dividing the expected number of lowest grades by the number of students: 0.561 / 12 ≈ 0.05, so θ = 0.05.
- The null hypothesis is formulated as follows: H0: θ = 0.05, stating that the rate of the lowest grades on the current stat course is not a big thing and is comparable to the previous courses' rates.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Three alternative hypotheses are formulated to address the concern of the increased number of lowest grades (6, 7 and 8, respectively): H1: θ = 0.06, H2: θ = 0.07, H3: θ = 0.08.
- H1: 12/(107/6) = .67 → .67/12 = .056, θ ≈ .06
- H2: 12/(107/7) = .79 → .79/12 = .065, θ ≈ .07
- H3: 12/(107/8) = .90 → .90/12 = .075, θ ≈ .08

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- To compare the hypotheses, we calculate binomial distributions for each value of θ.
- For example, the null hypothesis (H0) calculation yields P(3 | .05, 12) ≈ .017.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The results for the alternative hypotheses are as follows:
- P_H1(3 | .06, 12) ≈ .027
- P_H2(3 | .07, 12) ≈ .039
- P_H3(3 | .08, 12) ≈ .053.
- The ratio of the hypotheses is roughly 1 : 2 : 2 : 3, and it could be verbally interpreted with statements like "the second and third hypotheses explain the data about equally well" or "the fourth hypothesis explains the data about three times as well as the first hypothesis."
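The four likelihoods can be reproduced with the binomial formula in a few lines of Python. A minimal sketch; the function name `binom_pmf` is mine, not from the slides.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """P(X = r | theta, n): binomial probability of r lowest grades."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Likelihood of the observed r = 3 under each hypothesized theta
for theta in (0.05, 0.06, 0.07, 0.08):
    print(f"P(3 | {theta}, 12) = {binom_pmf(3, theta):.3f}")
```

The printed values (.017, .027, .039, .053) match the slide's figures for H0 through H3.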

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Lavine (1999) reminds us that P(r | θ, n), as a function of r (3) and θ = .05, .06, .07, .08, describes only how well each hypothesis explains the data; no value of r other than 3 is relevant.
- For example, P(4 | .05, 12) is irrelevant, as it does not describe how well any hypothesis explains the data.
- This likelihood principle, that is, basing statistical inference only on the observed data and not on data that might have been observed, is an essential feature of the Bayesian approach.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The Fisherian, so-called classical approach to testing the null hypothesis (H0: θ = .05) against the alternative hypothesis (H1: θ > .05) is to calculate the p-value, which defines the probability under H0 of observing an outcome at least as extreme as the outcome actually observed: p = P(X ≥ 3 | θ = .05, n = 12).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- As an example, the first part of the formula is solved as follows: P(3 | .05, 12) = C(12, 3)(.05)^3(.95)^9 ≈ .017.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- After the calculations, the p-value of .02 would suggest H0 rejection, if the rejection level of significance is set at 5 per cent.
- Calculation of the p-value violates the likelihood principle by using P(r | θ, n) for values of r other than the observed value of r = 3 (Lavine, 1999).
- The summands P(4 | .05, 12), P(5 | .05, 12), …, P(12 | .05, 12) do not describe how well any hypothesis explains the observed data.
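The tail sum that produces the p-value can be computed directly. A sketch in Python; `binom_pmf` is my own helper name.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """Binomial probability P(X = r | theta, n)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# The p-value sums over r = 3, ..., 12, i.e. outcomes at least as
# extreme as the observed r = 3, under H0 (theta = .05)
p_value = sum(binom_pmf(r, 0.05) for r in range(3, 13))
print(round(p_value, 2))  # 0.02
```

Note how the sum ranges over values of r that were never observed, which is exactly the likelihood-principle violation the slide points out.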

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- A Bayesian approach continues from the same point as the classical approach, namely the probabilities given by the binomial distributions, but it also makes use of other relevant sources of a priori information.
- In this domain, it is plausible to think that the computer test (SPSS exam) would make the number of total failures more probable than in previous times, when the evaluation was based solely on the essays.
- On the other hand, the computer test has only a 40 per cent weight in the equation that defines the final stat course grade: .3(Essay 1) + .3(Essay 2) + .4(Computer test) = Final grade.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Another aspect is to consider the nature of the aforementioned tasks, as the essays are distance work assignments while the computer test is performed under observation.
- Perhaps the course grades of my earlier stat courses have a narrower dispersion due to a violation of the independent observations assumption?
- For example, some students may have copy-pasted text from other sources or collaborated without permission.
- As we see, there are many sources of a priori information, which I judge to be inconclusive; thus I define the null hypothesis to be as likely to be true as false.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- This a priori judgment is expressed mathematically as P(H0) = 1/2 = P(H1) + P(H2) + P(H3).
- I further assume that the alternative hypotheses H1, H2 and H3 share the same likelihood: P(H1) = P(H2) = P(H3) = 1/6.
- These prior distributions summarize the knowledge about θ prior to incorporating the information from my course grades.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- An application of Bayes' theorem yields P(H0 | r = 3) = P(H0) P(3 | .05, 12) / [P(H0) P(3 | .05, 12) + P(H1) P(3 | .06, 12) + P(H2) P(3 | .07, 12) + P(H3) P(3 | .08, 12)] ≈ .30.

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- Similar calculations for the alternative hypotheses yield P(H1 | r = 3) ≈ .16, P(H2 | r = 3) ≈ .23 and P(H3 | r = 3) ≈ .31.
- These posterior distributions summarize the knowledge about θ after incorporating the grade information.
- The four hypotheses seem to be about equally likely (.30 vs. .16, .23, .31).
- The odds are about 2 to 1 (.30 vs. .70) that the latest stat course had a higher rate of lowest grades than 0.05.
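The whole posterior calculation can be checked numerically. A sketch in Python with my own variable names; the priors (1/2 for H0, 1/6 for each alternative) are those stated above, and the posteriors are normalized so that they sum to one.

```python
from math import comb

def binom_pmf(r, theta, n=12):
    """Binomial probability P(X = r | theta, n)."""
    return comb(n, r) * theta**r * (1 - theta)**(n - r)

# Priors: P(H0) = 1/2, P(H1) = P(H2) = P(H3) = 1/6
priors = {0.05: 1/2, 0.06: 1/6, 0.07: 1/6, 0.08: 1/6}

# Likelihood of the observed r = 3 under each hypothesis
likelihoods = {t: binom_pmf(3, t) for t in priors}

# Bayes' theorem: posterior = prior * likelihood / evidence
evidence = sum(priors[t] * likelihoods[t] for t in priors)
posteriors = {t: priors[t] * likelihoods[t] / evidence for t in priors}

for t, p in posteriors.items():
    print(f"P(theta = {t} | r = 3) = {p:.2f}")
```

The odds against H0 come straight out of this dictionary: `1 - posteriors[0.05]` gives about .70.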

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- The difference between classical and Bayesian statistics would be only philosophical (probability vs. inverse probability) if they always led to similar conclusions.
- In this case the p-value would suggest rejection of H0 (p = .02).
- The Bayesian analysis would also suggest evidence against θ = .05 (.30 vs. .70, a ratio of .43).

Example 2: Comparison of Traditional Frequentistic and Bayesian Approach

- What if the number of the lowest grades on the last course had been two?
- The classical approach would no longer suggest H0 rejection (p = .12).
- The Bayesian result would still say that there is more evidence against than for H0 (.39 vs. .61, a ratio of .64).

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

B-Course

Bayesian Classification Modeling

- Bayesian Classification Modeling (BCM) is implemented in the B-Course software, which is based on discrete Bayesian methods.
- This also applies to Bayesian Dependency Modeling, which is discussed later.
- Quantitative indicators with a high measurement level (continuous, interval) lose more information in the discretization process than qualitative indicators (ordinal, nominal), as they are all treated in the analysis as nominal (discrete) indicators.

Bayesian Classification Modeling

- For example, the variable gender may include the numerical values 1 (Female) or 2 (Male), or the text values "Female" and "Male," in a discrete Bayesian analysis.
- This will inevitably lead to a loss of power (Cohen, 1988; Murphy & Myors, 1998); however, ensuring that the sample size is large enough is a simple way to address this problem.

Sample size estimation

- N: population size.
- n: estimated sample size.
- Sampling error (e): difference between the true (unknown) value and the observed values, if the survey were repeated (sample collected) numerous times.
- Confidence interval: spread of the observed values that would be seen if the survey were repeated numerous times.
- Confidence level: how often the observed values would be within the sampling error of the true value if the survey were repeated numerous times.

(Murphy & Myors, 1998.)
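One common shortcut consistent with the definitions above is Yamane's finite-population formula, n = N / (1 + N·e²), which assumes a 95 per cent confidence level and maximum variability (p = .5). The slides themselves do not prescribe a formula, so this is offered only as an illustration.

```python
from math import ceil

def yamane(N, e):
    """Estimated sample size n for population size N and sampling
    error e, using Yamane's finite-population approximation."""
    return ceil(N / (1 + N * e**2))

print(yamane(1000, 0.05))  # 286
```

For example, a population of 1 000 with a 5 per cent sampling error calls for roughly 286 respondents.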

Bayesian Classification Modeling

- The aim of BCM is to select the variables that are the best predictors of different class memberships (e.g., gender, job title, level of giftedness).
- In the classification process, the automatic search looks for the best set of variables to predict the class variable for each data item.

Bayesian Classification Modeling

- The search procedure resembles traditional linear discriminant analysis (LDA, see Huberty, 1994), but the implementation is totally different.
- For example, the variable selection problem that is addressed with a forward, backward or stepwise selection procedure in LDA is replaced with a genetic algorithm approach (e.g., Hilario, Kalousis, Prados & Binz, 2004; Hsu, 2004) in Bayesian classification modeling.

Bayesian Classification Modeling

- The genetic algorithm approach means that variable selection is not limited to one (or two or three) specific approach; instead, many approaches and their combinations are exploited.
- One possible approach is to begin with the presumption that models (i.e., possible predictor variable combinations) that resemble each other a lot (i.e., have almost the same variables and discretizations) are likely to be almost equally good.
- This leads to a search strategy in which models that resemble the current best model are selected for comparison, instead of picking models randomly.

Bayesian Classification Modeling

- Another approach is to abandon the habit of always rejecting the weakest model and instead collect a set of relatively good models.
- The next step is to combine the best parts of these models so that the resulting combined model is better than any of the original models.
- B-Course is capable of mobilizing many more viable approaches, for example, occasionally rejecting the better model (algorithms like hill climbing and simulated annealing) or trying to avoid picking a similar model twice (tabu search).
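To make the search idea concrete, here is an illustrative greedy hill-climbing search over predictor subsets, one of the strategy families the slide mentions. The scoring function is a toy stand-in, and the variable names are borrowed from the slides only for flavor; B-Course's actual scores and search heuristics differ.

```python
def hill_climb(variables, score):
    """Greedily add or remove single variables while the score improves."""
    current = frozenset()
    best = score(current)
    improved = True
    while improved:
        improved = False
        for v in variables:
            # Neighbor models differ from the current one by one variable
            candidate = current - {v} if v in current else current | {v}
            s = score(candidate)
            if s > best:
                current, best, improved = candidate, s, True
    return set(current), best

# Toy score: the closer the subset is to {"SA", "C_SHO"}, the better
target = {"SA", "C_SHO"}
score = lambda s: -len(set(s) ^ target)
print(hill_climb(["SA", "C_SHO", "COMP", "PA"], score))
```

This sketch embodies the first strategy described above: only neighbors of the current best model are compared, instead of models picked at random.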

Bayesian Classification Modeling

Nokelainen, P., Ruohotie, P., & Tirri, H. (1999).

For an example of practical use of BCM, see Nokelainen, Tirri, Campbell and Walberg (2007).

The results of Bayesian classification modeling showed that the estimated classification accuracy of the best model found was 60 %. The left-hand side of Figure 3 shows that only three variables, Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and the Computer Literacy Composite (COMP), were successful predictors of A or C group membership. All the other variables that were not accepted in the model are to be considered connective factors between the two groups. The middle section of Figure 3 shows that the two strongest predictors were Olympians' Conducive Home Atmosphere (20.9 %) and Olympians' School Shortcomings (22.6 %). The confusion matrix shows that most of the A (25 correct out of 39) and the C (29 out of 47) group members were correctly classified. The matrix also shows that nine participants of group A were incorrectly classified into group C, and vice versa.


Figure 4 presents predictive modeling of the A and C groups (A_C, A or C group membership) by Olympians' Conducive Home Atmosphere (SA), Olympians' School Shortcomings (C_SHO), and the Computer Literacy Composite (COMP). The left-hand side of the figure presents the initial model with no values fixed. The model in the middle presents a scenario where all the A group members are selected. When we compare this model to the one on the right-hand side (i.e., presenting a situation where all the C group members are selected), we notice, for example, that the conditional distribution of Olympians' Conducive Home Atmosphere (SA) has changed. It shows that the highly productive Olympians have reported a more conducive home atmosphere (54.0 %) than the members of the low-productivity group C (23.0 %).


- Modeling of Vocational Excellence in Air Traffic Control
- This paper aims to describe the characteristics and predictors that explain air traffic controllers' (ATCO) vocational expertise and excellence.
- The study analyzes the role of natural abilities, self-regulative abilities and environmental conditions in ATCOs' vocational development.

(Pylväs, Nokelainen & Roisko, in press.)

- Modeling of Vocational Excellence in Air Traffic Control
- The target population of the study consisted of the ATCOs in Finland (N = 300), of whom 28, representing four different airports, were interviewed.
- The research data also included the interviewees' aptitude test scores, study records and employee assessments.

- Modeling of Vocational Excellence in Air Traffic Control
- The research questions were examined by using theoretical concept analysis.
- The qualitative data analysis was conducted with content analysis and Bayesian classification modeling.


- Modeling of Vocational Excellence in Air Traffic Control
- (RQ1a) What are the differences in characteristics between the air traffic controllers representing vocational expertise and vocational excellence?

- Modeling of Vocational Excellence in Air Traffic Control
- "the natural ambition of wanting to be good. Air traffic controllers have perhaps generally a strong professional pride."
- "Interesting and rewarding work, that is the basis of wanting to stay in this work until retiring."

- Modeling of Vocational Excellence in Air Traffic Control
- "I read all the regulations and instructions carefully and precisely, and try to think the majority wave aside of them. It reflects on work."
- "but still I consider myself more precise than the majority; a bad air traffic controller has delays, good air traffic controllers do not have delays, which is something that also pilots appreciate because of the strict time limits."

- Modeling of Vocational Excellence in Air Traffic Control

Classification accuracy 89 %.


Outline

- Research Overview
- Introduction to Bayesian Modeling
- Investigating Non-linearities with Bayesian Networks
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

B-Course

Bayesian Dependency Modeling

- Bayesian dependency modeling (BDM) is applied to examine dependencies between variables through both their visual representation and the probability ratio of each dependency.
- The graphical visualization of a Bayesian network contains two components:
- 1) Observed variables, visualized as ellipses.
- 2) Dependencies, visualized as lines between the nodes.

Example 4: Calculation of Bayesian Score

- The Bayesian score (BS), that is, the probability of the model P(M | D), allows the comparison of different models.

Figure 9. An Example of Two Competing Bayesian

Network Structures

(Nokelainen, 2008, p. 121.)

Example 4: Calculation of Bayesian Score

- Let us assume that we have the following data:

x1 x2
1  1
1  1
2  2
1  2
1  1

- Model 1 (M1) represents the two variables, x1 and x2, without a statistical dependency, and model 2 (M2) represents the two variables with a dependency (i.e., with a connecting arc).
- The binomial data might be the result of an experiment where the five participants have drunk a nice cup of tea before (x1) and after (x2) a test of geographic knowledge.

Example 4: Calculation of Bayesian Score

- In order to calculate P(M1 | D) and P(M2 | D), we need to solve P(D | M1) and P(D | M2) for the two models M1 and M2.
- The probability of the data given the model is solved by using the following marginal likelihood equation (Congdon, 2001, p. 473; Myllymäki, Silander, Tirri, & Uronen, 2001; Myllymäki & Tirri, 1998, p. 63):

P(D | M) = ∏_{i=1..n} ∏_{j=1..q_i} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1..r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)]   (Equation 4)

Example 4: Calculation of Bayesian Score

- In Equation 4, the following symbols are used:
- n is the number of variables (i indexes the variables from 1 to n);
- r_i is the number of values of the ith variable (k indexes these values from 1 to r_i);
- q_i is the number of possible configurations of the parents of the ith variable (j indexes these configurations from 1 to q_i);
- N_ij describes the number of rows in the data that have the jth configuration for the parents of the ith variable;
- N_ijk describes how many of the rows in the data that have the kth value for the ith variable also have the jth configuration for the parents of the ith variable;
- N' is the equivalent sample size, set to be the average number of values divided by two.
- The marginal likelihood equation produces a Bayesian Dirichlet score that allows model comparison (Heckerman et al., 1995; Tirri, 1997; Neapolitan & Morris, 2004).

Example 4: Calculation of Bayesian Score

- First, P(D | M1) is calculated given the values of variable x1. With two values per variable, the equivalent sample size is N' = 2/2 = 1, so that N'_ij = (2/2)/1 = 1 and N'_ijk = (2/2)/2 = 0.5.

x1 x2
1  1
1  1
2  2
1  2
1  1

Example 4: Calculation of Bayesian Score

- Second, the values for x2 are calculated.

Example 4: Calculation of Bayesian Score

- The BS, the probability of the first model P(M1 | D), is 0.027 × 0.012 ≈ 0.000324.

Example 4: Calculation of Bayesian Score

- Third, P(D | M2) is calculated given the values of variable x1.

Example 4: Calculation of Bayesian Score

- Fourth, the values for the first parent configuration (x1 = 1) are calculated.

Example 4: Calculation of Bayesian Score

- Fifth, the values for the second parent configuration (x1 = 2) are calculated.

Example 4: Calculation of Bayesian Score

- The BS, the probability of the second model P(M2 | D), is 0.027 × 0.027 × 0.500 ≈ 0.000365.

Example 4: Calculation of Bayesian Score

- Bayes' theorem enables the calculation of the ratio of the two models, M1 and M2.
- As both models share the same a priori probability, P(M1) = P(M2), both probabilities cancel out.
- Also the probability of the data, P(D), cancels out in the following equation, as it appears in both formulas in the same position.

Example 4: Calculation of Bayesian Score

- The result of the model comparison, P(M1 | D) / P(M2 | D) = 0.000324 / 0.000365 ≈ 0.89, shows that since the ratio is less than 1, M2 is more probable than M1.
- This result becomes explicit when we investigate the sample data more closely.
- Even a sample this small (n = 5) shows that there is a clear tendency between the values of x1 and x2 (four out of five value pairs are identical).
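The whole Example 4 computation can be reproduced with gamma functions. This is a sketch with my own function and variable names; the equivalent sample size is N' = 1, as the slides set it to the average number of values (2) divided by two. Note that the exact products (about 0.000320 and 0.000366) differ slightly from the slide's 0.000324 and 0.000365, which come from multiplying the rounded factors; the ratio comes out at exactly 0.875.

```python
from math import lgamma, exp

# Data from the example: (x1, x2) per participant
data = [(1, 1), (1, 1), (2, 2), (1, 2), (1, 1)]

def family_score(counts, n_prime_ij):
    """Marginal likelihood term for one (variable, parent configuration):
    counts[k] is N_ijk, the count of each value k of the variable."""
    n_ij = sum(counts)
    n_prime_ijk = n_prime_ij / len(counts)
    log_score = lgamma(n_prime_ij) - lgamma(n_prime_ij + n_ij)
    for n_ijk in counts:
        log_score += lgamma(n_prime_ijk + n_ijk) - lgamma(n_prime_ijk)
    return exp(log_score)

x1_counts = [4, 1]  # values 1 and 2 of x1
x2_counts = [3, 2]  # values 1 and 2 of x2

# M1: x1 and x2 independent, one parent configuration each (q_i = 1)
m1 = family_score(x1_counts, 1.0) * family_score(x2_counts, 1.0)

# M2: x1 -> x2, so x2 has two parent configurations (q_i = 2, N'_ij = 0.5)
x2_given_x1_is_1 = [3, 1]  # x2 values on the four rows where x1 = 1
x2_given_x1_is_2 = [0, 1]  # x2 values on the one row where x1 = 2
m2 = (family_score(x1_counts, 1.0)
      * family_score(x2_given_x1_is_1, 0.5)
      * family_score(x2_given_x1_is_2, 0.5))

print(f"P(D|M1) = {m1:.6f}, P(D|M2) = {m2:.6f}, ratio = {m1 / m2:.3f}")
```

The three factors of `m2` reproduce the slide's 0.027, 0.027 and 0.500, and the ratio below one confirms that M2 is the more probable model.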

- How many models are there?

For an example of practical use of BDM, see

Nokelainen and Tirri (2010).

Our hypothesis regarding the first research question was that intrinsic goal orientation (INT) is positively related to moral judgment (Batson & Thompson, 2001; Kunda & Schwartz, 1983). It was also hypothesized, based on Blasi's (1999) argument that emotions cannot be predictors of moral action, that fear of failure (the affective motivational section) is not related to moral judgment. The research evidence showed support for both hypotheses: firstly, only intrinsic motivation was directly (positively) related to moral judgment, and secondly, the affective motivational section was not present in the predictive model.

(Nokelainen & Tirri, 2010.)

Conditioning on the three levels of moral judgment showed that there is a positive statistical relationship between moral judgment and intrinsic goal orientation. The probability of belonging to the most intrinsically motivated group three (M = 3.7-5.0) increases from 15 per cent to 90 per cent alongside the moral judgment abilities. There is also a similar but less steep increase in extrinsic goal orientation (from 5 % to 12 %), but we believe that it is mostly tied to the increase in intrinsic goal orientation.

(Nokelainen & Tirri, 2010.)

For an example of practical use of BDM, see Nokelainen and Tirri (2007).

(Nokelainen & Tirri, 2007.)

2 % vs. 90 %

21 % vs. 78 %

EL_iv_17_49: In conflict situations, my superior is able to draw out all parties and understand the differing perspectives.
EL_ii_09_26: My superior sees other people in a positive rather than in a negative light.
EL_ii_09_25: My superior has an optimistic "glass half full" outlook.

(Nokelainen & Tirri, 2007.)

69

66

EL_iv_17_49 In conflict situations, my superior

is able to draw out all parties and understand

the differing perspectives. EL_ii_09_26 My

superior sees other people in positive rather

than in negative light. EL_ii_09_25 My

superior has an optimistic "glass half full"

outlook.

(Nokelainen Tirri, 2007.)

95

85

EL_iv_17_49 In conflict situations, my superior

is able to draw out all parties and understand

the differing perspectives. EL_ii_09_26 My

superior sees other people in positive rather

than in negative light. EL_ii_09_25 My

superior has an optimistic "glass half full"

outlook.

(Nokelainen Tirri, 2007.)

Outline

- Overview
- Introduction to Bayesian Modeling
- Bayesian Classification Modeling
- Bayesian Dependency Modeling
- Bayesian Unsupervised Model-based Visualization

BCM = Bayesian Classification Modeling
BDM = Bayesian Dependency Modeling
BUMV = Bayesian Unsupervised Model-based Visualization

BayMiner

Bayesian Unsupervised Model-based Visualization

[Figure: taxonomy of multivariate techniques]
- Supervised: LDA, BSMV
- Unsupervised:
  - Visualization techniques
    - Non-reducing
    - Reducing: projection techniques
      - Linear: PCA, projection pursuit
      - Non-linear: MDS, SOM, neural networks, principal curves, ICA, BUMV
  - Cluster analysis
  - EFA
  - Discrete multivariate analysis

Bayesian Unsupervised Model-based Visualization

- Supervised techniques, for example linear discriminant analysis (LDA) and supervised Bayesian networks (BSMV; see Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000), assume a given structure (Venables & Ripley, 2002, p. 301).
- Unsupervised techniques, for example exploratory factor analysis (EFA), discover the variable structure from the evidence of the data matrix.
- Unsupervised techniques are further divided into four sub-categories: 1) visualization techniques, 2) cluster analysis, 3) factor analysis, and 4) discrete multivariate analysis.


Bayesian Unsupervised Model-based Visualization

- According to Venables and Ripley (id.), visualization techniques are often more effective than clustering techniques in discovering interesting groupings in the data, and they avoid the danger of over-interpreting the results, as the researcher does not input the number of expected latent dimensions.
- In cluster analysis the centroids that represent the clusters are still high-dimensional, and some additional illustration technique is needed for visualization (Kaski, 1997), for example MDS (Kim, Kwon & Cook, 2000).

Bayesian Unsupervised Model-based Visualization

- Several graphical means have been proposed for visualizing high-dimensional data items directly, by letting each dimension govern some aspect of the visualization and then integrating the results into one figure.
- These techniques can be used to visualize any kind of high-dimensional data vectors, either the data items themselves or vectors formed of some descriptors of the data set, like the five-number summaries (Tukey, 1977).

Bayesian Unsupervised Model-based Visualization

- The simplest technique for visualizing a data set is to plot a profile of each item, that is, a two-dimensional graph in which the dimensions are enumerated on the x-axis and the corresponding values on the y-axis.
- Other alternatives are scatter plots and pie diagrams.

Bayesian Unsupervised Model-based Visualization

- The major drawback of all these techniques is that they do not reduce the amount of data.
- If the data set is large, a display portraying all the data items separately will be incomprehensible (Kaski, 1997).
- Techniques that reduce the dimensionality of the data items are called projection techniques.


Bayesian Unsupervised Model-based Visualization

- The goal of the projection is to represent the input data items in a lower-dimensional space in such a way that certain properties of the structure of the data set are preserved as faithfully as possible.
- The projection can be used to visualize the data set if a sufficiently small output dimensionality is chosen (id.).
- Projection techniques are divided into two major groups: linear and non-linear projection techniques.


Bayesian Unsupervised Model-based Visualization

- Linear projection techniques consist of principal component analysis (PCA) and projection pursuit.
- In exploratory projection pursuit (Friedman, 1987) the data is again projected linearly, but this time the aim is to find a projection that reveals as much of the non-normally distributed structure of the data set as possible.
- This is done by assigning a numerical interestingness index to each possible projection and maximizing that index.
- The definition of interestingness is based on how much the projected data deviates from normally distributed data in the main body of its distribution.
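The index-maximization step can be sketched in code. The following is a toy illustration, not Friedman's actual index: it scores a grid of one-dimensional projections of an artificial bimodal sample with a simple non-normality measure (absolute excess kurtosis) and keeps the best direction. The sample, the angle grid, and the index choice are all assumptions made for this sketch:

```python
import math, random

random.seed(0)

# Artificial 2-D data: bimodal along the x-axis, Gaussian along the y-axis.
data = [(random.choice([-3.0, 3.0]) + random.gauss(0, 1), random.gauss(0, 1))
        for _ in range(400)]

def interestingness(sample):
    """Absolute excess kurtosis: near 0 for Gaussian data, larger for 'interesting' shapes."""
    n = len(sample)
    m = sum(sample) / n
    var = sum((s - m) ** 2 for s in sample) / n
    kurt = sum((s - m) ** 4 for s in sample) / (n * var ** 2)
    return abs(kurt - 3.0)

def project(angle):
    """Project each 2-D point onto the direction given by the angle."""
    return [x * math.cos(angle) + y * math.sin(angle) for x, y in data]

# Score a grid of candidate projection directions and keep the best one.
angles = [k * math.pi / 36 for k in range(36)]
scores = {a: interestingness(project(a)) for a in angles}
best = max(scores, key=scores.get)
```

Directions near the x-axis score highest, because the projected data there is strongly bimodal, while projections near the y-axis look Gaussian and score close to zero.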


Bayesian Unsupervised Model-based Visualization

- Non-linear unsupervised projection techniques consist of multidimensional scaling, principal curves and various other techniques, including SOM, neural networks and Bayesian unsupervised networks (Kontkanen, Lahtinen, Myllymäki & Tirri, 2000).


Bayesian Unsupervised Model-based Visualization

- The aforementioned PCA technique, despite its popularity, cannot take into account non-linear structures (structures consisting of arbitrarily shaped clusters or curved manifolds), since it describes the data in terms of a linear subspace.
- Projection pursuit tries to express some non-linearities, but if the data set is high-dimensional and highly non-linear, it may be difficult to visualize it with linear projections onto a low-dimensional display even if the projection angle is chosen carefully (Friedman, 1987).
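This limitation is easy to demonstrate. The sketch below (pure Python; the circular sample is an assumption made for illustration) runs PCA on points lying on a circle, a curved one-dimensional manifold, and the first principal component explains only half of the variance:

```python
import math

# Points on a unit circle: a curved one-dimensional manifold embedded in 2-D.
n = 360
pts = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]

# Population covariance matrix of the 2-D sample.
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
sxx = sum((x - mx) ** 2 for x, _ in pts) / n
syy = sum((y - my) ** 2 for _, y in pts) / n
sxy = sum((x - mx) * (y - my) for x, y in pts) / n

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_tr = (sxx + syy) / 2
disc = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = half_tr + disc, half_tr - disc

explained = lam1 / (lam1 + lam2)
print(round(explained, 3))  # 0.5: no dominant linear direction exists on the circle
```

Although the data is intrinsically one-dimensional (a single angle describes every point), PCA needs both linear components equally, so the curved structure is invisible to it.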

Bayesian Unsupervised Model-based Visualization

- Several approaches have been proposed for reproducing non-linear higher-dimensional structures on a lower-dimensional display.
- The most common techniques allocate a representation for each data point in the lower-dimensional space and try to optimize these representations so that the distances between them are as similar as possible to the original distances of the corresponding data items.
- The techniques differ in how the different distances are weighted and how the representations are optimized (Kaski, 1997).

Bayesian Unsupervised Model-based Visualization

- Multidimensional scaling (MDS) is not one specific tool; instead it refers to a group of techniques widely used especially in the behavioral, econometric, and social sciences to analyze subjective evaluations of pairwise similarities of entities.
- The starting point of MDS is a matrix consisting of the pairwise dissimilarities of the entities.
- The basic idea of the MDS technique is to approximate the original set of distances with distances corresponding to a configuration of points in a Euclidean space.
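That approximation step can be sketched as a small metric-MDS computation: plain gradient descent on the raw stress, the sum of squared differences between configuration distances and target dissimilarities. The toy dissimilarity matrix, step size, and iteration count are assumptions for illustration, not the algorithm used in any cited work:

```python
import math, random

# Target dissimilarities between four entities (symmetric, zero diagonal).
delta = [[0, 2, 4, 3],
         [2, 0, 2, 4],
         [4, 2, 0, 2],
         [3, 4, 2, 0]]
n = len(delta)

random.seed(1)
# Random initial 2-D configuration of the four points.
X = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(n)]

def stress(X):
    """Sum of squared differences between configuration and target distances."""
    return sum((math.dist(X[i], X[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

initial = stress(X)
for _ in range(1000):                      # plain gradient descent on the stress
    grad = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = math.dist(X[i], X[j]) or 1e-9
            c = 2 * (d - delta[i][j]) / d
            grad[i][0] += c * (X[i][0] - X[j][0])
            grad[i][1] += c * (X[i][1] - X[j][1])
    for i in range(n):
        X[i][0] -= 0.01 * grad[i][0]
        X[i][1] -= 0.01 * grad[i][1]

final = stress(X)
```

The final configuration places the four points so that their Euclidean distances approximate the target dissimilarities, which is exactly the MDS idea described above.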

Bayesian Unsupervised Model-based Visualization

- MDS can be considered an alternative to factor analysis.
- In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects.
- In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix.

Bayesian Unsupervised Model-based Visualization

- With MDS we may analyze any kind of similarity or dissimilarity matrix, in addition to correlation matrices, specifying that we want to reproduce the distances based on n dimensions.
- After the matrix is formed, MDS attempts to arrange the objects (e.g., factors of a growth-oriented atmosphere) in a space with a particular number of dimensions so as to reproduce the observed distances.
- As a result, the distances are explained in terms of underlying dimensions.

Bayesian Unsupervised Model-based Visualization

- MDS based on Euclidean distance does not generally reflect the properties of complex problem domains properly.
- In real-world situations the similarity of two vectors is not a universal property: from different points of view they may in the end appear quite dissimilar (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000).
- Another problem with the MDS techniques is that they are computationally very intensive for large data sets.

Bayesian Unsupervised Model-based Visualization

- Bayesian unsupervised model-based visualization (BUMV) is based on Bayesian networks (BN).
- A BN is a representation of a probability distribution over a set of random variables, consisting of a directed acyclic graph (DAG), where the nodes correspond to domain variables and the arcs define a set of independence assumptions which allow the joint probability distribution for a data vector to be factorized as a product of simple conditional probabilities.
- Two vectors are considered similar if they lead to similar predictions when given as input to the same Bayesian network model (Kontkanen, Lahtinen, Myllymäki, Silander & Tirri, 2000).
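The factorization can be made concrete with a toy network. The DAG, the binary variables, and the probability tables below are invented for this sketch and do not come from the cited studies:

```python
from itertools import product

# Toy DAG: A -> B, A -> C. The joint factorizes as P(A) * P(B|A) * P(C|A).
p_a = {True: 0.3, False: 0.7}
p_b_given_a = {True: {True: 0.8, False: 0.2}, False: {True: 0.1, False: 0.9}}
p_c_given_a = {True: {True: 0.6, False: 0.4}, False: {True: 0.5, False: 0.5}}

def joint(a, b, c):
    """Joint probability as a product of simple conditional probabilities."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# The factorized distribution is a proper distribution: it sums to one.
total = sum(joint(a, b, c) for a, b, c in product([True, False], repeat=3))

# Example inference by marginalization: P(B = True) = 0.3*0.8 + 0.7*0.1 = 0.31.
p_b_true = sum(joint(a, True, c) for a, c in product([True, False], repeat=2))
```

The independence assumptions encoded by the arcs are what keep the tables small: eight joint probabilities are recovered from only five free parameters.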

Bayesian Unsupervised Model-based Visualization

- Naturally, there are numerous viable alternatives to BUMV, such as the Self-Organizing Map (SOM) and Independent Component Analysis (ICA).
- SOM is a neural network algorithm that has been used for a wide variety of applications, mostly for engineering problems but also for data analysis (Kohonen, 1995).
- SOM is based on a neighborhood-preserving topological map tuned according to the geometric properties of the sample vectors.
- ICA minimizes the statistical dependence of the components, trying to find a transformation in which the components are as statistically independent as possible (Hyvärinen & Oja, 2000).
- The usage of ICA is comparable to PCA, where the aim is to present the data in a manner that facilitates further analysis.

Bayesian Unsupervised Model-based Visualization

- The first major difference between the Bayesian and neural network approaches, from the educational science researcher's point of view, is that the former operates with a familiar symmetrical probability range from 0 to 1, while the upper limit of the asymmetrical probability scale in the latter approach is unknown.
- The second fundamental difference between the two types of networks is that a perceptron in the hidden layers of a neural network does not in itself have an interpretation in the domain of the system, whereas all the nodes of a Bayesian network represent concepts that are well defined with respect to the domain (Jensen, 1995).

Bayesian Unsupervised Model-based Visualization

- The meaning of a node and its probability table can be subject to discussion, regardless of their function in the network, but it does not make any sense to discuss the meaning of the nodes and weights in a neural network: perceptrons in the hidden layers only have a meaning in the context of the functionality of the network.
- Construction of a Bayesian network requires detailed knowledge of the domain in question.
- If such knowledge can only be obtained through a series of examples (i.e., a database of cases), neural networks seem to be an easier approach. This might be true in cases such as the reading of handwritten letters, face recognition, and other areas where the activity is a 'craftsman-like' skill based solely on experience.

(Jensen, 1995.)

Bayesian Unsupervised Model-based Visualization

- It is often criticized that in order to construct a Bayesian network you have to know too many probabilities.
- However, this number does not differ considerably from the number of weights and thresholds that have to be known in order to build a neural network, and these can only be learnt by training.
- A weakness of neural networks is that you are unable to utilize the knowledge you might have in advance.
- Probabilities, on the other hand, can be assessed using a combination of theoretical insight, empirical studies independent of the constructed system, training, and various more or less subjective estimates.

Bayesian Unsupervised Model-based Visualization

- In the construction of a neural network, it is decided in advance which relations information is gathered about, and which relations the system is expected to compute (the route of inference is fixed).
- Bayesian networks are much more flexible in that respect.

(Jensen, 1995.)

For an example of practical use of BUMV, see

Nokelainen and Ruohotie (2009).

Results showed that managers and teachers had higher growth motivation and a higher level of commitment to work than other personnel, including job titles such as cleaner, caretaker, accountant and computer support. Employees across all job titles in the organization who have temporary or part-time contracts had higher self-reported growth motivation and commitment to work and to the organization than their established colleagues.


Links

- B-Course: http://b-course.cs.helsinki.fi
- BayMiner: http://www.bayminer.com

References

- Anderson, J. (1995). Cognitive Psychology and its Implications. New York: Freeman.
- Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, 53, 370-418.
- Bernardo, J., & Smith, A. (2000). Bayesian theory. New York: Wiley.
- Congdon, P. (2001). Bayesian Statistical Modelling. Chichester: John Wiley & Sons.
- Friedman, J. (1987). Exploratory Projection Pursuit. Journal of the American Statistical Association, 82, 249-266.
- Gigerenzer, G. (2000). Adaptive thinking. New York: Oxford University Press.
- Gigerenzer, G., Krauss, S., & Vitouch, O. (2004). The null ritual: What you always wanted to know about significance testing but were afraid to ask. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 391-408). Thousand Oaks: Sage.

References

- Gill, J. (2002). Bayesian Methods: A Social and Behavioral Sciences Approach. Boca Raton: Chapman & Hall/CRC.
- Heckerman, D., Geiger, D., & Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3), 197-243.
- Hilario, M., Kalousis, A., Prados, J., & Binz, P.-A. (2004). Data mining for mass-spectra based diagnosis and biomarker discovery. Drug Discovery Today: BIOSILICO, 2(5), 214-222.
- Huberty, C. (1994). Applied Discriminant Analysis. New York: John Wiley & Sons.
- Hyvärinen, A., & Oja, E. (2000). Independent Component Analysis: Algorithms and Applications. Neural Networks, 13(4-5), 411-430.
- Jensen, F. V. (1995). Paradigms of Expert Systems. HUGIN Lite 7.4 User Manual.

References

- Kaski, S. (1997). Data exploration using self-organizing maps. Doctoral dissertation. Acta Polytechnica Scandinavica, Mathematics, Computing and Management in Engineering Series No. 82. Espoo: Finnish Academy of Technology.
- Kim, S., Kwon, S., & Cook, D. (2000). Interactive Visualization of Hierarchical Clusters Using MDS and MST. Metrika, 51(1), 39-51.
- Kohonen, T. (1995). Self-Organizing Maps. Berlin: Springer.
- Kontkanen, P., Lahtinen, J., Myllymäki, P., Silander, T., & Tirri, H. (2000). Supervised Model-based Visualization of High-dimensional Data. Intelligent Data Analysis, 4, 213-227.
- Kontkanen, P., Lahtinen, J., Myllymäki, P., & Tirri, H. (2000). Unsupervised Bayesian Visualization of High-Dimensional Data. In R. Ramakrishnan, S. Stolfo, R. Bayardo, & I. Parsa (Eds.), Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (pp. 325-329). New York, NY: The Association for Computing Machinery.

References

- Lavine, M. L. (1999). What is Bayesian Statistics and Why Everything Else is Wrong. The Journal of Undergraduate Mathematics and Its Applications, 20, 165-174.
- Lindley, D. V. (1971). Making Decisions. London: Wiley.
- Lindley, D. V. (2001). Harold Jeffreys. In C. C. Heyde & E. Seneta (Eds.), Statisticians of the Centuries (pp. 402-405). New York: Springer.
- Murphy, K. R., & Myors, B. (1998). Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests. Mahwah, NJ: Lawrence Erlbaum Associates.
- Myllymäki, P., Silander, T., Tirri, H., & Uronen, P. (2002). B-Course: A Web-Based Tool for Bayesian and Causal Data Analysis. International Journal on Artificial Intelligence Tools, 11(3), 369-387.
- Myllymäki, P., & Tirri, H. (1998). Bayes-verkkojen mahdollisuudet [Possibilities of Bayesian Networks]. Teknologiakatsaus 58/98. Helsinki: TEKES.

References

- Neapolitan, R. E., & Morris, S. (2004). Probabilistic Modeling Using Bayesian Networks. In D. Kaplan (Ed.), The SAGE handbook of quantitative methodology for the social sciences (pp. 371-390). Thousand Oaks, CA: Sage.
- Nokelainen, P. (2008). Modeling of Professional Growth and Learning: Bayesian Approach. Tampere: Tampere University Press.
- Nokelainen, P., & Ruohotie, P. (2009). Investigating Growth Prerequisites in a Finnish Polytechnic for Higher Education. Journal of Workplace Learning, 21(1), 36-57.
- Nokelainen, P., Silander, T., Ruohotie, P., & Tirri, H. (2007). Investigating the Number of Non-linear and Multi-modal Relationships Between Observed Variables Measuring a Growth-oriented Atmosphere. Quality & Quantity, 41(6), 869-890.
- Nokelainen, P., & Tirri, K. (2007). Empirical Investigation of Finnish School Principals' Emotional Leadership Competencies. In S. Saari & T. Varis (Eds.), Professional Growth (pp. 424-438). Hämeenlinna: RCVE.

References

- Nokelainen, P., Ruohotie, P., & Tirri, H. (1999). Professional Growth Determinants: Comparing Bayesian and Linear Approaches to Classification. In P. Ruohotie, H. Tirri, P. Nokelainen, & T. Silander (Eds.), Modern Modeling of Professional Growth, vol. 1 (pp. 85-120). Hämeenlinna: RCVE.
- Nokelainen, P., & Tirri, K. (2010). Role of Motivation in the Moral and Religious Judgment of Mathematically Gifted Adolescents. High Ability Studies, 21(2), 101-116.
- Nokelainen, P., Tirri, K., Campbell, J. R., & Walberg, H. (2004). Cross-cultural Factors that Account for Adult Productivity. In J. R. Campbell, K. Tirri, P. Ruohotie, & H. Walberg (Eds.), Cross-cultural Research: Basic Issues, Dilemmas, and Strategies (pp. 119-139). Hämeenlinna: RCVE.
- Nokelainen, P., Tirri, K., & Merenti-Välimäki, H.-L. (2007). Investigating the Influence of Attribution Styles on the Development of Mathematical Talent. Gifted Child Quarterly, 51(1), 64-81.
- Pylväs, L., Nokelainen, P., & Roisko, H. (in press). Modeling of Vocational Excellence in Air Traffic Control. Submitted for review.

References

- Tirri, H. (1997). Plausible Prediction by Bayesian Inference. Department of Computer Science, Series of Publications A, Report A-1997-1. Helsinki: University of Helsinki.
- Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
- Venables, W. N., & Ripley, B. D. (2002). Mo