Introduction to Clinical Biostatistics for Medical Students presentation

1
Introduction to Clinical Biostatistics for
Medical Students

Atif Zafar, MD
Department of Medicine

2
Overview of Presentation

Introductory Concepts (Review)
Hypothesis Testing
Linear Regression and Correlation
Analysis of Variance (ANOVA)
Nonparametric Statistics
Survival Analysis

3
Introductory Concepts
4
Introductory Concepts

Types of Data
Presenting Data
Descriptive Measures
Probability and Distributions
Estimation Techniques

5
Types of Data

Data are usually Discrete or Continuous
Discrete Variables take on a finite set of values
that can be counted
Race, Gender, Year in School etc.
Continuous Variables take on an infinite set of
values
Age, Height/Weight, Blood Pressure

6
Types of Data

A Special type of Discrete Variable is the Binary
Variable which takes on exactly 2 possible values
Gender (M/F)
Pregnant? (Y/N)
Hypertensive? (Y/N)

7
Types of Data

Sometimes, discrete variables have a natural
ordering to them and are called Ordinal Variables
For example, names of consecutive days in a week
(Mon, Tues, Wed, Thurs, Fri, Sat, Sun)
Other types of discrete variables do not have a
natural order and are called Nominal Variables
Race (African American, Caucasian, Asian,
Hispanic etc.)

8
Types of Data

If in an experiment you measure a single
variable, it is called a Univariate experiment
If you measure 2 variables, it is called a
Bivariate experiment
And if you measure multiple variables, it is
called a Multivariate experiment

9
Types of Data

A Random variable is one whose value is
determined by chance or random event
Typically, a variable X is random if it is the
outcome of an experiment where results can occur
by chance or are not completely predictable

10
Types of Data

Nonparametric Variables
Many times in clinical studies, we seek opinion
data (e.g. patient satisfaction scores, relative
value scales etc.)
The data can be ranked but have no absolute scale
that is comparable
This type of data is called nonparametric (ranked) data

11
Presenting Data

There are many ways to present data
Frequency Tables
Pie Charts
Bar Graphs (Histograms)
Line Graphs
Scatter Plots (Scattergrams)
Stem and Leaf Displays
Box Plots

12
Presenting Data

Scatter Plots (Plot of a Bivariate experiment)

13
Presenting Data

Stem and Leaf Displays
Presents a histogram-like picture of the data,
while retaining the original data values
Dataset: 8520, 9274, 8142, 11298, 10624, 7987,
11172, 12899, 10737, 9198, 13625, 9462, 11847,
10178, 12240, 11690, 10069, 11240, 12745, 12995
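A stem and leaf display for this dataset can be sketched in a few lines of Python. This is only an illustration and not part of the original slides: it assumes the thousands form the stem and the remaining three digits form the leaf, and the function name stem_and_leaf is hypothetical.

```python
from collections import defaultdict

# Dataset from the slide above
data = [8520, 9274, 8142, 11298, 10624, 7987, 11172, 12899, 10737, 9198,
        13625, 9462, 11847, 10178, 12240, 11690, 10069, 11240, 12745, 12995]


def stem_and_leaf(values, stem_unit=1000):
    """Group values by stem (value // stem_unit); the remainder is the leaf."""
    rows = defaultdict(list)
    for value in sorted(values):
        rows[value // stem_unit].append(value % stem_unit)
    return dict(rows)


display = stem_and_leaf(data)
for stem in sorted(display):
    # Three-digit leaves keep every original value recoverable
    leaves = " ".join(f"{leaf:03d}" for leaf in display[stem])
    print(f"{stem:>2} | {leaves}")
```

Because each leaf keeps all remaining digits, every original value can be read back from its row, which is the property the slide highlights.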

14
Presenting Data

Boxplots
Complex visual data structures that combine
various measures
Maximum and Minimum Data Points
1st and 3rd Quartile Points
Sort the data points from lowest to highest
Divide the data points into 2 halves
Take the Median value of each half; those are
the 1st and 3rd quartiles (Q1, Q3)
Compute the Inter Quartile Range (IQR)
\(\mathrm{IQR} = Q_3 - Q_1\)
Compute 1.5 x IQR. Compute \(Q_3 + 1.5 \times \mathrm{IQR}\) and
\(Q_1 - 1.5 \times \mathrm{IQR}\)
Data points lying outside this range are called
Outliers
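The quartile and outlier steps above can be sketched as follows. This is a minimal illustration using the dataset from the stem and leaf slide; the function name box_plot_summary is hypothetical, and leaving the middle value out of both halves when the number of points is odd is an assumption, since the slide does not say how to split an odd-sized dataset.

```python
from statistics import median

data = [8520, 9274, 8142, 11298, 10624, 7987, 11172, 12899, 10737, 9198,
        13625, 9462, 11847, 10178, 12240, 11690, 10069, 11240, 12745, 12995]


def box_plot_summary(values):
    """Quartiles by the median-of-halves rule, plus the 1.5 x IQR fences."""
    xs = sorted(values)
    half = len(xs) // 2
    lower, upper = xs[:half], xs[len(xs) - half:]  # assumption: drop middle if odd
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [x for x in xs if x < low_fence or x > high_fence]
    return {"min": xs[0], "q1": q1, "median": median(xs), "q3": q3,
            "max": xs[-1], "iqr": iqr, "outliers": outliers}


summary = box_plot_summary(data)
print(summary)
```

Other quartile conventions exist (statistical packages often interpolate), so results may differ slightly from software output.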

15
Presenting Data

Boxplots

16
Descriptive Measures

Now that we have displayed our data, we want to
be able to characterize it quantitatively
Measures of Central Tendency
Mean, Median, Mode
Measures of Variability
Range, Variance, Standard Deviation
Measures of Relative Standing
Z-Scores, Percentiles, Quartiles

17
Measures of Central Tendency

Mean
Arithmetic Average of a sample of data
Median
If you order the data from smallest to highest,
the median is the middle value, assuming an odd
number of data elements
If you have an even number of elements, it is the
average of the 2 middle numbers.
Mode
The most common value in a set of values

18
Measures of Variability

Once we have located the center of a set of data
points, we want to know how dispersed they are

19
Measures of Variability

Range
This is the difference between the highest and
lowest value
Variance
Defined to be the average of the squares of the
deviations of the individual data points about
their mean (for a sample, the sum of squares is
divided by n - 1 rather than n)
Standard Deviation
This is defined as the square root of the
variance
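As a quick illustration, the measures of central tendency and variability can be computed with Python's standard statistics module. The exam scores below are illustrative values, and the dictionary names are arbitrary.

```python
from statistics import mean, median, mode, pstdev, pvariance, variance

scores = [20, 30, 50, 60, 67, 67, 70, 80, 90, 95]  # illustrative exam scores

center = {
    "mean": mean(scores),
    "median": median(scores),  # average of the 2 middle values (even count)
    "mode": mode(scores),      # most common value
}
spread = {
    "range": max(scores) - min(scores),
    "variance_n": pvariance(scores),         # sum of squares divided by n
    "variance_n_minus_1": variance(scores),  # sum of squares divided by n - 1
    "std_dev_n": pstdev(scores),             # square root of variance_n
}
print(center)
print(spread)
```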

20
Measures of Relative Standing

Sometimes we want to know the position of a
particular observation relative to others in a
data set
Ex: How you performed with respect to your
classmates on an exam
The Z-Score measures this as follows
\(Z = (x - \text{Mean}) / \text{Standard Deviation}\)

21
Measures of Relative Standing

Percentiles and Quartiles also indicate relative
standing but in terms of the categories of scores
from lowest to highest
Given a set of n measurements \(x_1, \ldots, x_n\), the pth
percentile is defined to be the value of x that
exceeds p% of the measurements and is less than
(100-p)% of the values.
Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80,
90, 95
The score 50 is at the 30th percentile, meaning
that 30% of the scores were at or below yours and
70% were higher than yours.
Quartiles similarly reflect in which quarter of
the set of values a particular observation lies
Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90,
95
1st Quartile = 50, 3rd Quartile = 80
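The relative standing of a single score can be checked with a short sketch. It uses the ten exam scores from this slide; the function names are hypothetical, the Z-score uses the sample (n - 1) standard deviation as an assumption, and the percentile is counted as the share of scores at or below the given value.

```python
from statistics import mean, stdev

scores = [20, 30, 50, 60, 67, 67, 70, 80, 90, 95]


def z_score(x, values):
    """Distance of x from the mean, in units of the sample standard deviation."""
    return (x - mean(values)) / stdev(values)


def percent_at_or_below(x, values):
    """Percentage of the values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)


print(f"Z-score of 50: {z_score(50, scores):.2f}")
print(f"Percent of scores at or below 50: {percent_at_or_below(50, scores)}")
```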

22
Probability

Suppose you do an experiment with a finite number
of possible outcomes (ex: coin toss)
The Probability of an event E (H/T) is the chance
(%) that the event will turn out in a given way
in the next repetition of the experiment
Probability values are always between 0 and 1
The notation for probabilities is as follows
Given our coin toss experiment,
P(H) = Probability that a Head will be tossed in
the next round
P(T) = Probability that a Tail will be tossed in
the next round
One can estimate probabilities by repeating the
experiment many times and observing the outcomes
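Estimating a probability by repetition can be illustrated with a simulated coin. This is a sketch under stated assumptions: the coin is fair, the tosses are generated by Python's pseudo-random generator, and the function name and seed are arbitrary.

```python
import random


def estimate_probability_of_heads(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times; return the share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (10, 1000, 100000):
    print(n, estimate_probability_of_heads(n))
```

The estimate tends to settle closer to 0.5 as the number of tosses grows.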

23
Probabilities Some Simple Rules

Arithmetically, one can combine probabilities of
simple and sequential events
Given a complex event composed of N mutually
exclusive simple events, the probability of the
complex event is equal to the sum of the
probabilities of each of the simple events
Ex: Coin toss 1 and Coin toss 2
Event   First Coin   Second Coin   P(Ei)
E1      Heads        Heads         ¼
E2      Heads        Tails         ¼
E3      Tails        Heads         ¼
E4      Tails        Tails         ¼
Let A = {E2, E3}. Then P(A) = P(E2) + P(E3) = ½
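The two-coin example can be verified by listing the four simple events and adding the probabilities of those that make up A (exactly one Head). A minimal sketch, with exact fractions and arbitrary variable names:

```python
from fractions import Fraction
from itertools import product

# The four equally likely simple events: (first coin, second coin)
outcomes = list(product("HT", repeat=2))
p_each = Fraction(1, len(outcomes))

# A = {E2, E3}: exactly one Head in the two tosses
event_a = [o for o in outcomes if o.count("H") == 1]
p_a = sum(p_each for _ in event_a)

print(outcomes)
print("P(A) =", p_a)
```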

24
Probability Distributions

Given a random variable X (either discrete or
continuous), the Probability Distribution gives a
table or formula or graph of the probabilities of
each potential value of X
For a Probability Distribution P(x) the following
must hold
\(0 \le P(x) \le 1\)
\(\sum_{x} P(x) = 1\)

25
Probability Distributions

There are many kinds of probability
distributions
Binomial Distribution
Applies to binary variable experiments where only
2 outcomes are possible
Poisson Distribution
Applies to variables that represent the number of
occurrences of a specified event in a given unit
of time or space
Hypergeometric Distribution
Applies to experiments where the number of
elements in the population is small in comparison
to the sample size (sampling without replacement)
and thus the success of a trial depends on the
outcomes of preceding trials
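For readers who want to see numbers, the three discrete distributions above have simple closed-form probability functions. The sketch below writes out the standard textbook formulas; the function and parameter names are illustrative and do not come from the slides.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """P(k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(k occurrences) when events occur at an average of `rate` per unit."""
    return rate ** k * exp(-rate) / factorial(k)


def hypergeometric_pmf(k, population, successes, draws):
    """P(k successes in `draws` items taken without replacement)."""
    return (comb(successes, k) * comb(population - successes, draws - k)
            / comb(population, draws))


print(binomial_pmf(1, 2, 0.5))           # one Head in two fair coin tosses
print(poisson_pmf(0, 2.0))               # no events when 2 are expected
print(hypergeometric_pmf(1, 10, 4, 3))   # 1 success in 3 draws from 10 (4 successes)
```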

26
Probability Distributions

Normal Distribution (N)
Applies to continuous random variables
Standard Normal Distribution (Z)
A Normal Distribution with
Mean of 0
Standard Deviation of 1

27
Estimation Techniques

So now that we know that certain experiments
can have results distributed in certain ways, how
can we use a sample to draw conclusions about the
population it came from?
This process is called Statistical Inference,
where we estimate a characteristic of a larger
population by analyzing a small sample

28
Estimation Techniques

Populations and Samples
A Population is the larger set of objects we wish
to study
Ex The number of democrats in the country
A Sample is a set of representative objects we
choose in order to estimate the characteristics
of the larger set of objects
Ex Take 100 people from each state and determine
whether they are democrats

29
Estimation Techniques

Parameters and Statistics
A Parameter is the quality of the population we
are trying to estimate
In order to estimate the parameter we measure the
quality in a sample. This sample quality is
called its statistic

30
Estimation Techniques

Many types of samples can be taken
Completely Random Sample
Stratified Random Sample
Divide the population into strata (groups)
Take a sample from each group
Ex Party loyalties of teenagers, adults and
elderly
Cluster Sample
Take a simple random sample of clusters from the
available clusters in a population
Ex Urban vs. Rural sampling

31
Hypothesis Testing

Large Sample Estimation Techniques

32
Introduction to Estimating Techniques

Before we begin, let's review some common terms
Point Estimate: When we do an experiment and
compute a single number from the sample (mean,
etc.) to estimate a population parameter, that
number is called a point estimate. Since each
experiment has some error, there is a margin of
error for every point estimate
Interval Estimate: Instead of a single number, we
can report a range of values around the point
estimate that, if the experiment were repeated
many times over, would contain the true parameter
a stated percentage of the time. This range is
called an interval estimate or a confidence
interval.

33
Confidence Intervals

Typically, the 95% confidence interval for a mean
is defined as follows
\(\mathrm{CI} = \text{Mean} \pm 1.96 \times \text{Standard Deviation} / \sqrt{N}\)
It tells us that if we repeat the experiment
many times over, 95% of the intervals computed
this way will contain the true population Mean
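A worked example may help. The sketch below computes a large-sample 95% confidence interval as the mean plus or minus 1.96 standard errors, where the standard error is the standard deviation divided by sqrt(N). The sample values (mean 670, standard deviation 102, N = 40) are illustrative, and the function name is hypothetical.

```python
from math import sqrt


def confidence_interval_95(sample_mean, std_dev, n):
    """Large-sample 95% CI for a mean: mean +/- 1.96 * SD / sqrt(n)."""
    margin = 1.96 * std_dev / sqrt(n)
    return sample_mean - margin, sample_mean + margin


low, high = confidence_interval_95(sample_mean=670, std_dev=102, n=40)
print(f"95% CI: {low:.2f} to {high:.2f}")
```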

34
Significance Value (α)

Statisticians conventionally (and somewhat
arbitrarily) choose a value of 5% as the cutoff
for results that can be attributed to chance alone
So if a result would occur by chance less than 5%
of the time, it is considered statistically
significant
The 5% value is called a significance value, or α

35
P-Values

A P-value is a useful way to represent the
probability of a certain event and is seen
extensively in the medical literature
Definition
The P-Value is the probability of observing a
result at least as extreme as the one obtained
if only chance were at work
Given our significance level of 5% for chance, we
want P-values to be less than 5% or .05 to be
considered statistically significant

36
Comparing Means

Many times we wish to compare the means of two
subsets of a population
Ex: MCAT scores for Biology vs. Chemistry majors
To do this we would sample MCAT scores from
random samples of biology and chemistry majors
across the country
We would compute the mean of each sample
We would compare the means to determine if they
are significantly different.
This kind of analysis is exactly what is done by
Hypothesis Testing (we hypothesize there is no
difference and then refute this hypothesis)

37
Hypothesis Testing

A statistical test of hypothesis consists of 4
parts
A NULL Hypothesis, termed Ho
An Alternate Hypothesis, termed Ha
A test statistic
A rejection region
The NULL hypothesis is what we want to refute
The Alternate hypothesis is what we want to
support
The test statistic is what we will use to compare
the NULL and the Alternate Hypotheses
The Rejection Region is the set of values of the
test statistic for which Ho will be rejected

38
Hypothesis Testing

So what does this all mean IN LAYMAN'S TERMS?
Basically we are asking the question that given a
test statistic we specify, what is the
probability that a result this supportive of the
hypothesis in question (Ha) is due to chance
alone?
We convert the test statistic into a probability
value by looking it up in a table that specifies
the respective probabilities associated with that
particular statistic value

39
Constructing a Hypothesis

Consider the following question
We wish to show that the hourly wages of
construction workers in California are larger than
the national average of $14
The hypothesis will be written down as
Ha: \(\mu > 14\)
Ho: \(\mu = 14\)
Test statistic: Z-value, \(Z = (\bar{X} - \mu_0) / (s / \sqrt{N})\),
where s is the sample standard deviation
Rejection region: set by \(\alpha = 0.05\)

40
Testing a Hypothesis

The average weekly earnings for men in managerial
and professional positions is $725. Do women in
the same position have average weekly earnings
that are less than those for men?
A random sample of N = 40 women in managerial
positions showed \(\bar{X} = 670\) and standard
deviation s = 102. Test the appropriate
hypothesis using \(\alpha = 0.01\)
Solution: Ho: \(\mu = 725\), Ha: \(\mu < 725\)
\(Z = (\bar{X} - \mu) / (s / \sqrt{N})\)
\(Z = (670 - 725) / (102 / \sqrt{40}) = -3.41\)
Since -3.41 is less than -2.33 (the critical Z
value for α = 0.01) we conclude that Ho is false
and the average weekly salary for women is
significantly less than for men, and the
probability that we have rejected a true Ho is
at most 0.01
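The same calculation can be reproduced in Python. This is a sketch of a one-tailed, large-sample Z test, treating 102 as the sample standard deviation (the arithmetic on the slide only gives -3.41 under that reading); the critical value and P-value come from the standard normal distribution in the statistics module, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, std_dev, n):
    """Z statistic for testing a sample mean against a hypothesized mean mu0."""
    return (sample_mean - mu0) / (std_dev / sqrt(n))


alpha = 0.01
z = one_sample_z(sample_mean=670, mu0=725, std_dev=102, n=40)
z_critical = NormalDist().inv_cdf(alpha)  # lower-tail cutoff for Ha: mean < 725
p_value = NormalDist().cdf(z)             # probability of a Z this low or lower
reject_null = z < z_critical

print(f"Z = {z:.2f}, critical value = {z_critical:.2f}, P = {p_value:.5f}")
print("Reject Ho" if reject_null else "Do not reject Ho")
```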

41
Confidence in our Test Result

So what is our confidence in our result?
Well, we can have 2 types of errors
Type I error: Rejecting Ho when Ho is true (probability α)
Type II error: Accepting Ho when Ho is false (probability β)
To compute a confidence value, we calculate the
Power of the Test which is the probability of
correctly rejecting the NULL hypothesis
Power = \(1 - \beta\)

42
Types of Tests

Given the kinds of data we have and the types of
information we seek there are different types of
tests available to us
Student's T-Test
Used to compare MEANS of two populations
Works for small samples (N < 30)
Chi-Square Test
Used to estimate a population's VARIANCE
F-Test
Used to compare the VARIANCES of 2 populations

43
Types of Tests

We can do these tests in different ways
We can have one-tailed and two-tailed tests
A One-tailed test occurs when our hypothesis mean
is on one side (either less or greater) than the
null hypothesis mean
A Two-tailed test occurs when we can say that the
hypothesis mean can be on either side of the null
value
We can also do Paired Tests, where the 2 sets of
measurements come from the same subjects or from
matched pairs (e.g. before and after treatment)

44
T-tests Small Sample Testing

Up to now we have assumed the sample size to be
large (N > 30) in order to achieve good power. But
what happens when the sample size is small
(N < 30)?
Well, in this case the shape of the normal
distribution looks somewhat different: it is
shorter and wider and is called the
T-Distribution
Every T-distribution has an associated Degree of
Freedom (df) which is equal to N-1
A T-Table is consulted to get the appropriate
values of the T-statistic when doing a T-test.
You need the df and the significance level to
look up the T-values.

45
Chi-Square Distribution

Remember that the T-test compares population
Means. What if we want to estimate a population
variance?
In this case, we would use a Chi-Square
distribution and our test statistic will be a
chi-square value
\(\chi^2 = (n-1) s^2 / \sigma_0^2\)
where n = sample size
\(s^2\) = sample variance
\(\sigma_0^2\) = Population Variance that we are trying to
estimate
A variant of the Chi-Square Test is
called the Mantel-Haenszel Test
It is a test of association between 2 ordinal
variables (frequency data)

46
F-Distribution

What if we want to compare the population
variances of two different populations?
In this case we use an F-Distribution and an
F-statistic
\(F = s_1^2 / s_2^2\), where \(s_1^2\) and \(s_2^2\) are variances of
Samples 1 and 2
Typically we will have 2 degrees of freedom (v1
and v2) with F-tests

47
Linear Regression and Correlation
48
Linear Regression and Correlation

In many situations in clinical studies we wish to
attempt to answer the question "How is the
random variable X related to the random variable
Y?"
Ex: How is smoking related to lung cancer?
Ex: How is age related to development of
Alzheimer's Disease?
Ex: How is hypertriglyceridemia related to
metabolic syndrome?
Such questions are answered statistically using
the concepts of Regression Analysis which looks
for relationships among different variables
(either negatively or positively) and
Correlations, the strengths of the relationships
Relationships may have many forms
Related linearly
Related curvilinearly
Related colinearly
Associations but not Correlations

49
Linear Regression

The Linear Regression model postulates that two
random variables X and Y are related by a
straight line as follows
\(Y = a + bX + e\)
Where
Y is the dependent variable
X is the independent variable
a is the intercept
b is the slope
e is the residual value

50
Linear Regression

Residual Value (e)
Given that the regression analysis procedure is
itself a statistical approach, it is expected to
have some degree of error associated with it
Thus we add a value called the residual value (e)
to any regression equation to account for random
errors in the process
Scatterplots
In order to perform regression analysis visually,
it helps to graph the points on a scatterplot
A visual relationship can often be observed when
looking at these plots

51
Method of Least Squares

So, assuming that 2 variables are linearly
related, how do we best fit a line through a
series of points on a scatterplot (the
regression line)?
One way is to use a goodness of fit estimator
called the Sum of Squares for Error (S) which we
want to minimize
\(S = \sum_i (y_i - f(x_i))^2\), where \(f(x_i)\) is the
value the line predicts at \(x_i\) and \(y_i\) is the
observed value
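A least-squares line can be fitted by hand with the usual closed-form expressions for the slope and intercept. The sketch below is illustrative only: the five data points are made up, and the function name least_squares is hypothetical.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b, sum of squares for error S) for y = a + b*x."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx             # slope that minimizes S
    a = y_mean - b * x_mean   # the fitted line passes through the point of means
    s = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, s


xs = [1, 2, 3, 4, 5]  # made-up data
ys = [2, 4, 5, 4, 5]
a, b, s = least_squares(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X, S = {s:.2f}")
```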
52
Inferences Concerning Slopes

The initial question once we have a regression
line is whether the data present sufficient
evidence to indicate that Y increases or
decreases linearly as X increases over the
observed region?
So we use the variability of the points about the
line to estimate this
Variance: \(s^2 = S / (n - 2)\)
S = Sum of squares for error
n = Sample size

53
Inferences Concerning Slopes

Given that we can use S for estimating the
population variance, we can formulate our
hypothesis about the slope using a T-test as
follows
Null Hypothesis Ho: \(b = b_0\)
Alternate Hypothesis Ha: \(b < b_0\) or \(b > b_0\)
Test Statistic: t-value, \(t = (b - b_0) / (s / \sqrt{S_{xx}})\)
b = regression line slope
\(b_0\) = slope to test with
s = square root of the variance \(s^2\)
\(S_{xx} = \sum_i (X_i - \bar{X})^2\), the sum of squared
deviations of the \(X_i\) about their mean
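The slope test can be sketched directly from these definitions. The example below is an illustration with made-up data: it takes s as the square root of S / (n - 2) and tests against a slope of 0 by default. The function name is hypothetical, and the resulting t-value would still need to be compared against a T-table with n - 2 degrees of freedom.

```python
from math import sqrt


def slope_t_statistic(xs, ys, b0=0.0):
    """Return (t, degrees of freedom) for testing Ho: slope = b0."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = y_mean - b * x_mean
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))  # standard deviation of the points about the line
    t = (b - b0) / (s / sqrt(sxx))
    return t, n - 2


t_value, df = slope_t_statistic([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"t = {t_value:.3f} with {df} degrees of freedom")
```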

54
Inferences Concerning Slopes

So how do we do the T-test and reach a conclusion
or calculate a P-value?
Well, the T-table has several features
Df = Degrees of Freedom (n - 1 for a test of a
single mean; n - 2 for the test of a regression
slope)
T-values listed for various significance levels
The procedure for using a T-Table is as follows
Compute the T-value using the statistic in your
test
Look up the row of the table for your degrees of
freedom
Then find the column whose tabled T-value your
computed value exceeds; the P will be less than
that column's significance level

55
Linear Regression

So, graphically what does it look like?

56
Other Regressions

Given the types of data you have, there are other
methods for fitting the data to a geometric
shape
For example, there is Curvilinear Regression
Cubic Spline Interpolation
Quadratic Interpolation
Higher Order Interpolation
Logistic Regression
This is useful when you have categorical data
(non-numeric)
For example, when you have a categorical random
variable such as HTN (Y/N), Gender (M/F) or Race

57
Correlation

As opposed to finding the best fit line through
a set of data points, Correlation seeks to
understand the strength of the relationship.
(Example scatter plots with R = 0.17, R = 0.85 and R = -0.94)

58
Correlation

We compute the Pearson Product Moment Coefficient
of Correlation (R) as follows
\(R = S_{xy} / \sqrt{S_{xx} \times S_{yy}}\)
where
\(S_{xy} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y})\)
\(S_{xx} = \sum_i (X_i - \bar{X})^2\)
\(S_{yy} = \sum_i (Y_i - \bar{Y})^2\)
\(-1 \le R \le 1\); the larger the absolute value of
R, the stronger the correlation
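The correlation coefficient follows directly from the three sums. In this sketch Sxy is the sum of the products of the X and Y deviations from their means, and the data are made up for illustration; the function name pearson_r is hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"R = {r:.4f}")
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 a weak one.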

59
Multiple Linear Regression

So far we discussed how one variable is related
to another in a study.
But in real life, a study typically has many
variables that it is trying to compare as they
relate to an outcome
Ex: CAD as f(HTN, DM, Smoking, Hyperchol.,
Obesity, Age)
In order to do this type of analysis, we can
extend the general notion of linear regression to
multiple variables.
We have an intercept as usual but partial slopes
(or partial regression coefficients), each one
representing a different variable

60
Multiple Linear Regression

The General Linear Model (GLM) is then stated as
follows
\(Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots + b_n x_n + e\)
With the following assumptions
1. Y is the response variable you wish to
predict
2. \(b_0, b_1, \ldots, b_n\) are unknown constants
3. \(x_1, x_2, \ldots, x_n\) are independent predictor
variables that are measured without error
4. e is a random error that for any set of
predictors is normally distributed
5. The random errors associated with any pair of
Y values are independent

61
Multiple Linear Regression

Note that you can use qualitative (categorical)
and quantitative variables in a GLM.
Categorical Variables look like
\(X_1 = 1\) if Group A, 0 if not Group A
Typically computing p-values and regression
equations in a GLM is hard to do by hand so most
people will do it using computer software
SAS has a procedure called Proc GLM
SPSS/PC
StatSoft
HyperStat

62
Multiple Linear Regression

Problems that can occur when using GLM
Multicollinearity
This happens when 2 of the independent variables
xi, xj are themselves related, so that their joint
occurrence in a model distorts the estimated
effect sizes
Closely related to the problem of Covariates or
Confounding Factors
Interaction Terms
When the effect of one variable on the outcome
depends on the value of another variable, then
we must add an interaction term to the model
For example, suppose you want to study the salary
of a professor with respect to number of years of
service. Well, this may differ slightly whether
you are a male or female.
Thus, the salary slope for males may be slightly
higher than the salary slope for females despite
the same number of years of service.
This type of relationship is called an
Interaction (between gender and years of service
because the slope varies depending on whether a
male or female is selected) and we must add a
term of the type
\(Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2\)

63
Logistic Regression

What happens when you have data in the form of
proportions (or frequency data) of categorical
variables?
The tool for analysis of this type of data is
called a Logistic Regression
It is based on the Chi-Square Distribution and
the model is described as follows
\(\ln[p/(1-p)] = a + BX + e\) or \(p/(1-p) = \exp(a) \exp(BX) \exp(e)\)
where
ln is the natural logarithm, \(\log_{\exp}\), where
exp = 2.71828...
p is the probability that the event Y occurs,
P(Y = 1)
p/(1-p) is the "odds"
ln[p/(1-p)] is the log odds, or "logit"
all other components of the model are the same.
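The relationship between a probability, its odds and its logit can be made concrete with a few helper functions. This is a sketch only: the function names are hypothetical, and the coefficients in the last line (a = -2, B = 0.5) are made-up values rather than fitted estimates.

```python
from math import exp, log


def odds(p):
    """Odds of an event with probability p."""
    return p / (1 - p)


def logit(p):
    """Log odds of an event with probability p."""
    return log(odds(p))


def inverse_logit(value):
    """Convert a log odds value back into a probability."""
    return 1 / (1 + exp(-value))


def predicted_probability(a, b, x):
    """Probability implied by the model ln[p/(1-p)] = a + b*x (error term omitted)."""
    return inverse_logit(a + b * x)


print(odds(0.8), logit(0.8))
print(predicted_probability(a=-2.0, b=0.5, x=6))  # made-up coefficients
```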

64
The ANalysis Of VAriance

Also known as ANOVA

65
ANOVA

Suppose you want to compare the mean
reimbursement rates from 5 different health plans
You could do t-tests among all combinations of
the 5 plans, or 10 t-tests
Suppose all the means are equal. When this
procedure is repeated 10 times, the probability
of incorrectly concluding that at least one pair
of means differ is quite high and you reach an
erroneous decision
Thus we want one test which could compare means
for all 5 groups at the same time
This is exactly what ANOVA provides
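How high is "quite high"? A rough calculation is sketched below, under the simplifying assumption that the 10 t-tests are independent and each is run at a significance level of 0.05 (in practice pairwise tests on the same groups are not fully independent, so this is only an approximation).

```python
from math import comb

n_groups = 5
alpha = 0.05

n_tests = comb(n_groups, 2)  # number of distinct pairs of health plans
# Chance of at least one false positive across all the tests (independence assumed)
family_wise_error = 1 - (1 - alpha) ** n_tests

print(f"{n_tests} tests, chance of at least one false positive = {family_wise_error:.3f}")
```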

66
ANOVA

ANOVA is a powerful procedure which allows you to
do 2 things
Compare the variance between the means of 2 or
more groups
Compare the variance in data values within each
group

67
ANOVA

ANOVA procedures can be done with different study
designs
Completely Randomized Design
Random samples are independently selected from
each of k populations.
Assumes that the data is homogeneously
distributed with a fixed variation
Randomized Block Design
Assumes that subsets of the population have
different variances
Within each subset, however, the variability is
the same
Each subset is called a block.
Random samples are then taken from each block

68
ANOVA for Completely Randomized Designs

Suppose we want to compare k population means
\(\mu_1, \ldots, \mu_k\) based on random independent samples of
\(n_1, \ldots, n_k\) observations selected from populations
1..k respectively
Ex: Suppose we have 10 observations of
reimbursement figures from each of 5 health plans;
then we will have 50 total values
Then let
\(x_{ij}\) represent the jth measurement in the ith
group
We define an entity called the Total Sum of
Squares (SS) as follows
\(\text{Total SS} = S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2\)

69
ANOVA for Completely Randomized Designs

It can be shown that the sum of squares of
deviations of all values about the overall mean
(the Total SS of all 50 values) can be
partitioned into 2 components
SST = Sum of Squares for Treatments
SSE = Sum of Squares for Error (measures
variation within samples)
We have
\(\text{Total SS} = \text{SST} + \text{SSE}\)

70
ANOVA for Completely Randomized Designs

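Now, we can also compute SSE readily and it is
\(\text{SSE} = \sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2 + \ldots + \sum_{j=1}^{n_k} (x_{kj} - \bar{x}_k)^2\)
Knowing SSE and Total SS, we can compute SST
We then compute the Mean Squares of these as
follows
\(\text{MST} = \text{SST} / (k-1)\)
\(\text{MSE} = \text{SSE} / (n-k)\)
The final step is to compute an F-statistic as
follows
\(F = \text{MST} / \text{MSE}\)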

71
ANOVA for Completely Randomized Designs

Now, F-tests have 2 degrees of freedom v1 and v2
In the case of ANOVA,
\(v_1 = k - 1\)
\(v_2 = n - k\)
We can then do our usual hypothesis testing using
this F-statistic as our test
Ho: \(\mu_1 = \mu_2 = \mu_3 = \ldots = \mu_k\)
Ha: One or more pairs of population means differ
F-Statistic: MST/MSE with df \(v_1 = k-1\), \(v_2 = n-k\)
Rejection Region: Reject Ho if \(F > F_\alpha\) (found
from the table using v1, v2 and α)
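The full calculation for a completely randomized design can be sketched as follows. The three small groups are made-up numbers chosen so the arithmetic is easy to follow, and the function name one_way_anova is hypothetical. The resulting F would still be compared with the tabled value for v1 and v2.

```python
def one_way_anova(groups):
    """Return (F, v1, v2) for a completely randomized design."""
    values = [x for group in groups for x in group]
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    total_ss = sum((x - grand_mean) ** 2 for x in values)
    # SSE: squared deviations of each value about its own group mean
    sse = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        sse += sum((x - group_mean) ** 2 for x in group)
    sst = total_ss - sse  # Total SS = SST + SSE
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return mst / mse, k - 1, n - k


plans = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # made-up data, 3 groups of 3
f_stat, v1, v2 = one_way_anova(plans)
print(f"F = {f_stat:.2f} with v1 = {v1}, v2 = {v2}")
```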

72
ANOVA for Randomized Block Designs

The computational steps are very similar to those
of a completely randomized design except that we
add a third term, the sum of squares for BLOCKS
(with b blocks)
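\(\text{Total SS} = \text{SST} + \text{SSB} + \text{SSE}\)
We then perform 2 different hypothesis tests
(1) For comparing Treatment Means
\(F = \text{MST}/\text{MSE}\), \(v_1 = k-1\), \(v_2 = n-b-k+1\)
(2) For comparing BLOCK Means
\(F = \text{MSB}/\text{MSE}\), \(v_1 = b-1\), \(v_2 = n-b-k+1\)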

73
Nonparametric Statistics

Analysis of Ranked Data

74
Nonparametric Statistics

What do we do when we have opinion data?
For example, suppose a judge is employed to
evaluate and rank the sales abilities of 4
salesmen, the edibility of 5 brands of Corn
Flakes or the relative appeal of 5 brands of
automobiles
Clearly it is impossible to give an exact measure
of sales competence, the palatability of food or
design appeal
But, it is possible to rank the salespeople, food
or design choices based on our own opinions.
Many, Many types of studies in medicine use this
kind of data gathering (patient satisfaction is
one example)

75
Nonparametric Statistics

There are many tests available for studying this
kind of data
The Sign Test
The Mann-Whitney U Test
The Wilcoxon Signed-Rank Test for a Paired
Experiment
The Kruskal-Wallis H Test for Completely
Randomized Designs
The Friedman Fr Test for Randomized Block Designs
Spearman's Rank Correlation Test

76
The Sign Test

Compares 2 populations with respect to how they
differ in the responses to qualitative questions
Compute the number of responses that were the
same
Then compute the number of responses that
differed
Finally compute X, the number of times responses
from population A were greater than responses from
population B
This gives you the number of times (A-B) is
positive (i.e. has a positive sign hence the
name)
This is your test statistic
You then use a Binomial Probability Distribution
to do a hypothesis test
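The sign test steps can be sketched as below. The paired ratings are made up, tied pairs are dropped before the test (a common convention that the slide does not spell out), and the P-value shown is one-tailed, assuming that under Ho a positive or negative sign is equally likely. The function name is hypothetical.

```python
from math import comb


def sign_test_p_value(x, n):
    """One-tailed P(X >= x) when X follows a Binomial(n, 0.5) distribution."""
    return sum(comb(n, k) for k in range(x, n + 1)) / 2 ** n


# Made-up paired ratings from populations A and B
a = [4, 5, 3, 4, 5, 4, 2, 5, 4, 3, 5, 4]
b = [3, 4, 3, 2, 4, 3, 3, 4, 3, 3, 4, 5]

n_differing = sum(ra != rb for ra, rb in zip(a, b))  # ties are dropped
x_positive = sum(ra > rb for ra, rb in zip(a, b))    # times (A - B) is positive
p_value = sign_test_p_value(x_positive, n_differing)

print(f"X = {x_positive} of {n_differing} differing pairs, P = {p_value:.4f}")
```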

77
Mann-Whitney U Test

Analogous to the T-test for nonparametric data
Suppose you have 2 populations from which 2
samples of sizes n1 and n2 are obtained
You should rank all n1 + n2 observations in
ascending order, assigning rank values 1, 2, 3, ... to
all observations
Tied observations are handled by averaging the
ranks assigned to both of the tied observations
Then calculate the sum of the ranks T1 and T2 for
both of the samples

78
Mann-Whitney U Test

Now compute the U statistic as follows
\(U_1 = n_1 n_2 + n_1(n_1+1)/2 - T_1\)
\(U_2 = n_1 n_2 + n_2(n_2+1)/2 - T_2\)
Look up the appropriate α value in the table
given n2
The Table will give you a value for Uo on the
left hand side corresponding to your n1
Your computed U (smaller of U1 or U2) should be
less than the Uo stated in the table in order to
reject the Null hypothesis (that the population
relative frequency distributions are identical)
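The ranking and U calculations can be sketched as follows. The formulas used are U1 = n1 n2 + n1(n1 + 1)/2 - T1 and U2 = n1 n2 + n2(n2 + 1)/2 - T2, with tied observations given the average of their ranks. The two small samples are made up, and the function names are hypothetical.

```python
def average_ranks(values):
    """Map each distinct value to the average of the 1-based ranks it occupies."""
    positions = {}
    for rank, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(rank)
    return {value: sum(ranks) / len(ranks) for value, ranks in positions.items()}


def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) computed from the rank sums T1 and T2."""
    ranks = average_ranks(list(sample1) + list(sample2))
    n1, n2 = len(sample1), len(sample2)
    t1 = sum(ranks[v] for v in sample1)
    t2 = sum(ranks[v] for v in sample2)
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2


u1, u2 = mann_whitney_u([1, 3], [3, 5])  # made-up samples with one tie
print(f"U1 = {u1}, U2 = {u2}, test statistic U = {min(u1, u2)}")
```

A handy check is that U1 + U2 always equals n1 x n2.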

79
Wilcoxon Signed Rank Test

Similar to the Mann-Whitney U Test
Allows you to compare paired differences
Given n pairs of observations from populations A
and B, compute the paired differences (xA-xB) for
each pair of values
Rank the absolute values of the differences from
smallest to largest
Compute the sums T+ and T- of the ranks of the
positive and the negative differences
For a one tailed test, use T- and for a two
tailed test, use the smaller of T+ or T-
Reject Ho if T < To (critical value) obtained
from the Wilcoxon Table, given n and α values
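A sketch of the rank sums is given below. It ranks the absolute values of the paired differences together, gives tied absolute differences their average rank, and drops pairs with a zero difference; the last two choices are common conventions that are assumed here rather than taken from the slide. The data and the function name are made up.

```python
def wilcoxon_signed_rank(xa, xb):
    """Return (T+, T-): rank sums of the positive and negative paired differences."""
    diffs = [a - b for a, b in zip(xa, xb) if a != b]  # zero differences dropped
    magnitudes = sorted(abs(d) for d in diffs)
    rank_of = {}
    for value in set(magnitudes):
        first = magnitudes.index(value) + 1  # first 1-based rank of this value
        count = magnitudes.count(value)
        rank_of[value] = first + (count - 1) / 2  # average rank for ties
    t_plus = sum(rank_of[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus


t_plus, t_minus = wilcoxon_signed_rank([10, 12, 9, 15, 11], [8, 13, 6, 10, 11])
print(f"T+ = {t_plus}, T- = {t_minus}, two-tailed T = {min(t_plus, t_minus)}")
```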

80
Other Nonparametric Tests

Kruskal-Wallis H Test
Just as the Mann-Whitney U Test is the
nonparametric alternative to the Student's T-Test
for comparing population means, the
Kruskal-Wallis H Test is the nonparametric
alternative to ANOVA for a completely randomized
design and is used to detect differences in
location among more than 2 population
distributions based on independent random
sampling
Friedman Fr Test for Randomized Block Designs
Is a nonparametric test for comparing the
distributions of measurements for k treatments
laid out in b blocks using a randomized block
design

81
Test of Association

Spearman's Rank Correlation Test
Tests whether there is an association between 2
populations
Assume n pairs (xi, yi) of observations from 2
populations X, Y
Rank each of the xi and yi in ascending order
Compute, using the ranks in place of the raw
values,
\(R_s = S_{xy} / \sqrt{S_{xx} S_{yy}}\)
Then given n and α, look up Ro in the Spearman
Table
Reject Ho (no association) if \(R_s > R_0\) or \(R_s < -R_0\)
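In other words, the Spearman coefficient is the Pearson formula applied to the ranks rather than to the raw values. A minimal sketch with made-up data follows; ties receive the average of their ranks, and the function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values in ascending order; tied values share the average rank."""
    positions = {}
    for rank, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(rank)
    mean_rank = {v: sum(r) / len(r) for v, r in positions.items()}
    return [mean_rank[v] for v in values]


def pearson_r(xs, ys):
    """Sxy / sqrt(Sxx * Syy) on whatever numbers are passed in."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def spearman_rs(xs, ys):
    """Spearman rank correlation: Pearson's formula applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


rs = spearman_rs([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"Rs = {rs:.4f}")
```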

82
Survival Analysis
83
Introduction

There are many clinical studies that address the
question of time to an event
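For example, we often want to know given risk
factors, what is a patient's chance for an MI?
(i.e. time to MI)
This type of data is called time-to-event data,
and it is often censored (the event has not yet
been observed for some subjects when the study
ends)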
Survival Analysis seeks to study this type of
question

84
Life Tables

The most straightforward way to describe survival
is to compute a data structure known as a Life
Table
The entire lifetime of a study object is divided
into intervals of specified length
For each interval, the number of subjects who
survived or died within that interval is
determined and plotted
Based on this number, we can compute several
types of statistics
Numbers of cases at risk
Proportion Failing or Proportion Surviving
Probability Density or Hazard Rate
Median Survival Time
Required Sample Sizes

85
Survival Analysis

Although life tables give us a good estimate of
the risk of adverse events, it is desirable to
understand the underlying survival function
algorithmically for prediction purposes
The three distributions proposed for this are
the
Exponential (linear exponential) distribution
Weibull Distribution
Gompertz Distribution
The parameter estimation procedure is then a
modified version of the least-squares model
And the statistic used to study it is an
incremental Chi-Square Statistic

86
Kaplan-Meier Product Limit Estimator

Rather than classify the survival into a
life-table, the KM estimator computes a survival
function directly from continuous survival or
failure times
Imagine creating a life table with exactly one
observation for each interval
Then we avoid the effect of grouping
observations together into interval categories
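Then \(S(t) = \prod_{j} [(n-j)/(n-j+1)]^{\delta(j)}\), taken over the
ordered observations j up to time t
n = total observations
\(\delta(j)\) = 1 if the jth observation is uncensored
(the event was observed), 0 if it is censored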
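The product-limit calculation can be sketched as below. The convention used in the code is that the exponent is 1 when the event was actually observed for the jth ordered subject and 0 when that subject was censored, so censored subjects leave the curve unchanged but still shrink the number at risk. The follow-up times are made up, all times are assumed to be distinct, and the function name is hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, S(time))] for distinct follow-up times.

    events[i] is 1 if the event was observed for subject i, 0 if censored.
    """
    ordered = sorted(zip(times, events))
    n = len(ordered)
    survival = 1.0
    curve = []
    for j, (time, observed) in enumerate(ordered, start=1):
        # Factor (n - j) / (n - j + 1) is applied only when the event was observed
        survival *= ((n - j) / (n - j + 1)) ** observed
        curve.append((time, survival))
    return curve


curve = kaplan_meier(times=[2, 3, 5, 7, 8], events=[1, 0, 1, 1, 0])
for time, s in curve:
    print(f"t = {time}: S(t) = {s:.3f}")
```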

87
Comparing Survival Times

Often we wish to compare survival times in 2 or
more populations
There are several tests available for this
purpose
Gehan's Generalized Wilcoxon Test
Cox-Mantel Test
Cox's F-Test
Log-Rank Test
Peto and Peto's Wilcoxon Test
These are mostly nonparametric tests that
generate Z-values for comparing means

88
Regression Models

We also want to be able to predict survival time
given some independent risk factors
This is very common in the medical literature
The regression test of choice is the
Cox Proportional Hazards Model
The model is written as
\(h\{(t), (z_1, z_2, \ldots, z_m)\} = h_0(t) \exp(b_1 z_1 + \ldots + b_m z_m)\)
where h(t,...) denotes the resultant hazard,
given the values of the m covariates for the
respective case (z1, z2, ..., zm) and the
respective survival time (t). The term h0(t) is
called the baseline hazard; it is the hazard for
the respective individual when all independent
variable values are equal to zero. We can
linearize this model by dividing both sides of
the equation by h0(t) and then taking the natural
logarithm of both sides
\(\log[h\{(t), (z \ldots)\} / h_0(t)] = b_1 z_1 + \ldots + b_m z_m\)
We now have a fairly "simple" linear model that
can be readily estimated

89
Useful Links

http://hesweb1.med.virginia.edu/biostat/teaching/handouts.html
http://stat.tamu.edu/stat30x/notes/trydouble2.html
http://www.statsoft.com/textbook/stathome.html
http://davidmlane.com/hyperstat/index.html
http://members.aol.com/johnp71/javastat.html
http://www.helsinki.fi/jpuranen/links.html
http://ubmail.ubalt.edu/harsham/statistics/REFSTAT.HTMrgenRes
http://trochim.human.cornell.edu/kb/index.htm
