Title: Introduction to Clinical Biostatistics for Medical Students
 1Introduction to Clinical Biostatistics for 
Medical Students
- Atif Zafar, MD 
 - Department of Medicine
 
  2Overview of Presentation
- Introductory Concepts (Review) 
 - Hypothesis Testing 
 - Linear Regression and Correlation 
 - Analysis of Variance (ANOVA) 
 - Nonparametric Statistics 
 - Survival Analysis 
 
  3Introductory Concepts 
 4Introductory Concepts
- Types of Data 
 - Presenting Data 
 - Descriptive Measures 
 - Probability and Distributions 
 - Estimation Techniques
 
  5Types of Data
- Data are usually Discrete or Continuous 
 - Discrete Variables take on a finite set of values 
that can be counted  - Race, Gender, Year in School etc. 
 - Continuous Variables take on an infinite set of 
values  - Age, Height/Weight, Blood Pressure
 
  6Types of Data
- A Special type of Discrete Variable is the Binary 
Variable which takes on exactly 2 possible values  - Gender (M/F) 
 - Pregnant? (Y/N) 
 - Hypertensive? (Y/N) 
 
  7Types of Data
- Sometimes, discrete variables have a natural ordering to them; these are called Ordinal Variables
 - For example, names of consecutive days in a week (M, Tu, Wed, Thurs, Fri, Sat, Sun)
 - Other types of discrete variables do not have a natural order and are called Nominal Variables
 - Race (African American, Caucasian, Asian, Hispanic etc.)
  8Types of Data
- If in an experiment you measure a single 
variable, it is called a Univariate experiment  - If you measure 2 variables, it is called a 
Bivariate experiment  - And if you measure multiple variables, it is 
called a Multivariate experiment 
  9Types of Data
- A Random variable is one whose value is 
determined by chance or random event  - Typically, a variable X is random if it is the 
outcome of an experiment where results can occur 
by chance or are not completely predictable 
  10Types of Data
- Nonparametric Variables 
 - Many times in clinical studies, we seek opinion data (i.e. patient satisfaction scores, relative value scales etc.)
 - The data can be ranked but has no absolute scale that is comparable
 - This type of data is called nonparametric data
 
  11Presenting Data
- There are many ways to present data 
 - Frequency Tables 
 - Pie Charts 
 - Bar Graphs (Histograms) 
 - Line Graphs 
 - Scatter Plots (Scattergrams) 
 - Stem and Leaf Displays 
 - Box Plots
 
  12Presenting Data
- Scatter Plots (Plot of a Bivariate experiment) 
 
  13Presenting Data
- Stem and Leaf Displays 
 - Presents a histogram-like picture of the data, while retaining the original data values
 - Dataset: 8520, 9274, 8142, 11298, 10624, 7987, 11172, 12899, 10737, 9198, 13625, 9462, 11847, 10178, 12240, 11690, 10069, 11240, 12745, 12995
  14Presenting Data
- Boxplots 
 - Complex visual data structures that combine various measures
 - Maximum and Minimum Data Points
 - 1st and 3rd Quartile Points
 - Sort the data points from lowest to highest
 - Divide the data points into 2 halves
 - Take the Median value of each half; those are the 1st and 3rd quartiles (Q1, Q3)
 - Compute the Inter Quartile Range (IQR)
 - IQR = Q3 - Q1
 - Compute 1.5 x IQR. Compute Q3 + 1.5 x IQR and Q1 - 1.5 x IQR
 - Data points lying outside this range are called Outliers
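The quartile and outlier steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `box_plot_summary` is hypothetical, and it assumes the "median of each half" convention described on the slide, leaving the middle value out of both halves when the number of points is odd. The data are the 20 values from the stem and leaf slide.

```python
from statistics import median

def box_plot_summary(values):
    """Quartiles, IQR and outliers using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    q1, q3 = median(lower), median(upper)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {"min": data[0], "q1": q1, "q3": q3, "max": data[-1],
            "iqr": iqr, "fences": (low_fence, high_fence),
            "outliers": outliers}

dataset = [8520, 9274, 8142, 11298, 10624, 7987, 11172, 12899, 10737, 9198,
           13625, 9462, 11847, 10178, 12240, 11690, 10069, 11240, 12745, 12995]
summary = box_plot_summary(dataset)
print(summary)
```

Statistical packages use several slightly different quartile conventions, so their Q1 and Q3 may differ a little from this hand method.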
  15Presenting Data
  16Descriptive Measures
- Now that we have displayed our data, we want to 
be able to characterize it quantitatively  - Measures of Central Tendency 
 - Mean, Median, Mode 
 - Measures of Variability 
 - Range, Variance, Standard Deviation 
 - Measures of Relative Standing 
 - Z-Scores, Percentiles, Quartiles
 
  17Measures of Central Tendency
- Mean 
 - Arithmetic Average of a sample of data 
 - Median 
 - If you order the data from smallest to highest, 
the median is the middle value, assuming an odd 
number of data elements  - If you have an even number of elements, it is the 
average of the 2 middle numbers.  - Mode 
 - The most common value in a set of values
 
  18Measures of Variability
- Once we have located the center of a set of data 
points, we want to know how dispersed they are  
  19Measures of Variability
- Range 
 - This is the difference between the highest and 
lowest value  - Variance 
 - Defined to be the average of the square of the 
deviations of the individual data points about 
their mean  - Standard Deviation 
 - This is defined as the square root of the 
variance  
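As a rough illustration of these measures, the sketch below uses Python's standard `statistics` module on the ten exam scores that appear in this deck's percentile example. The helper name `describe` is hypothetical. It uses `pvariance` and `pstdev`, which divide by n and so match the "average of the squared deviations" definition above; the sample versions (`variance`, `stdev`) divide by n - 1 instead.

```python
from statistics import mean, median, mode, pvariance, pstdev

scores = [20, 30, 50, 60, 67, 67, 70, 80, 90, 95]

def describe(values):
    """Central tendency and variability for a list of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "variance": pvariance(values),   # average squared deviation (divides by n)
        "std_dev": pstdev(values),       # square root of that variance
    }

summary = describe(scores)
for name, value in summary.items():
    print(name, value)
```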
  20Measures of Relative Standing
- Sometimes we want to know the position of a particular observation relative to others in a data set
 - Ex: How you performed with respect to your classmates on an exam
 - The Z-Score measures this as follows
 -  $Z = (x - \bar{x}) / s$, where $\bar{x}$ is the Mean and s is the Standard Deviation
 
  21Measures of Relative Standing
- Percentiles and Quartiles also indicate relative standing but in terms of the categories of scores from lowest to highest
 - Given a set of n measurements x1, ..., xn the pth percentile is defined to be the value of x that exceeds p% of the measurements and is less than (100-p)% of the values.
 - Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
 - The score 50 is in the 30th percentile, meaning that (counting your own score) 30% of the scores were at or below yours and 70% were higher than yours.
 - Quartiles similarly reflect in which quarter of the set of values a particular observation lies
 - Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
 - 1st Quartile = 50, 3rd Quartile = 80
 
  22Probability
- Suppose you do an experiment with a finite number of possible outcomes (ex: coin toss)
 - The Probability of an event E (H/T) is the chance (%) that the event will turn out in a given way in the next repetition of the experiment
 - Probability values are always between 0 and 1
 - The notation for probabilities is as follows
 - Given our coin toss experiment,
 - P(H) = Probability that a Head will be tossed in the next round
 - P(T) = Probability that a Tail will be tossed in the next round
 - One can estimate probabilities by repeating the event many times and observing the outcomes
  23Probabilities Some Simple Rules
- Arithmetically, one can combine probabilities of simple and sequential events
 - Given a complex event composed of N mutually exclusive simple events, the probability of the complex event is equal to the sum of the probabilities of each of the simple events
 - Ex: Coin toss 1 and Coin toss 2
 -  Event   First Coin   Second Coin   P(Ei)
 -  E1      Heads        Heads         ¼
 -  E2      Heads        Tails         ¼
 -  E3      Tails        Heads         ¼
 -  E4      Tails        Tails         ¼
 -  Let A = {E2, E3}. Then P(A) = P(E2) + P(E3) = ½
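The two-coin table can be reproduced by enumerating the simple events. The sketch below is illustrative only; it assumes a fair coin, so each of the four simple events has probability 1/4, and the helper name `probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product

# The four simple events E1..E4: (first coin, second coin).
outcomes = list(product(["Heads", "Tails"], repeat=2))
p_simple = Fraction(1, len(outcomes))  # assumption: fair coin, equally likely events

def probability(event):
    """Sum the probabilities of the simple events that make up `event`."""
    return sum(p_simple for outcome in outcomes if event(outcome))

# A = {E2, E3}: exactly one Head and one Tail.
p_a = probability(lambda outcome: outcome[0] != outcome[1])
print(p_a)
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared directly with the ½ in the table.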
 
  24Probability Distributions
- Given a random variable X (either discrete or continuous), the Probability Distribution gives a table or formula or graph of the probabilities of each potential value of X
 - For a Probability Distribution P(x) the following must hold
 - $0 \le P(x) \le 1$
 - Sum (all P(x) over all x) = 1
 
  25Probability Distributions
- There are many kinds of probability distributions
 - Binomial Distribution
 - Applies to binary variable experiments where only 2 outcomes are possible
 - Poisson Distribution
 - Applies to variables that represent the number of occurrences of a specified event in a given unit of time or space
 - Hypergeometric Distribution
 - Applies to experiments where the sample is drawn without replacement from a population that is small relative to the sample size, so the success of a trial depends on the outcomes of preceding trials
  26Probability Distributions
- Normal Distribution (N) 
 - Applies to continuous random variables 
 - Standard Normal Distribution (Z) 
 - A Normal Distribution with 
 - Mean of 0 
 - Standard Deviation of 1
 
  27Estimation Techniques
- So now that we know that certain experiments 
can have results distributed in certain ways, how 
can we predict the result of this experiment?  - This process is called Statistical Inference, 
where we can estimate the quality of a larger 
population by analyzing a small sample 
  28Estimation Techniques
- Populations and Samples 
 - A Population is the larger set of objects we wish 
to study  - Ex The number of democrats in the country 
 - A Sample is a set of representative objects we 
choose in order to estimate the characteristics 
of the larger set of objects  - Ex Take 100 people from each state and determine 
whether they are democrats 
  29Estimation Techniques
- Parameters and Statistics 
 - A Parameter is the quality of the population we 
are trying to estimate  - In order to estimate the parameter we measure the 
quality in a sample. This sample quality is 
called its statistic 
  30Estimation Techniques
- Many types of samples can be taken 
 - Completely Random Sample 
 - Stratified Random Sample 
 - Divide the population into strata (groups) 
 - Take a sample from each group 
 - Ex Party loyalties of teenagers, adults and 
elderly  - Cluster Sample 
 - Take a simple random sample of clusters from the 
available clusters in a population  - Ex Urban vs. Rural sampling
 
  31Hypothesis Testing
- Large Sample Estimation Techniques
 
  32Introduction to Estimating Techniques
- Before we begin, let's review some common terms
 - Point Estimate: When we do an experiment and generate a result, the single value computed from one run of the experiment (a mean, etc.) is called a point estimate. Since each experiment has some error, there is a margin of error for every point estimate
 - Interval Estimate: If we repeat the experiment many times over, we get a sense of how much the point estimate varies from run to run. A range of values built around the point estimate to reflect this uncertainty is called an interval estimate or a confidence interval.
  33Confidence Intervals
- Typically, the 95% confidence interval is defined as follows
 -  $\mathrm{CI} = \bar{x} \pm 1.96 \times s / \sqrt{N}$, where $\bar{x}$ is the sample Mean and s is the Standard Deviation
 -  It tells us that if we repeat the experiment many times over, 95% of the intervals computed this way will contain the true population Mean
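A large-sample confidence interval can be sketched as follows. This is a minimal example under stated assumptions: the margin is built from the standard deviation divided by sqrt(N) (the standard error), the multiplier comes from the standard normal distribution (about 1.96 for 95%), and the function name `confidence_interval` is hypothetical. The numbers are the sample values used in this deck's weekly earnings example (mean 670, standard deviation 102, N = 40).

```python
from math import sqrt
from statistics import NormalDist

def confidence_interval(mean, sd, n, level=0.95):
    """Large-sample CI: mean +/- z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 when level = 0.95
    margin = z * sd / sqrt(n)
    return mean - margin, mean + margin

low, high = confidence_interval(mean=670, sd=102, n=40)
print(round(low, 2), round(high, 2))
```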
  34Significance Value (α)
- Statisticians conventionally (and somewhat arbitrarily) choose a value of 5% as the cutoff for results that can be attributed to chance alone
 - So if a result would occur by chance less than 5% of the time, it is considered statistically significant
 - The 5% value is called a significance value, or α
 
  35P-Values
- A P-value is a useful way to represent the probability of a certain event and is seen extensively in the medical literature
 - Definition
 - The P-Value is the probability that a result at least as extreme as the one observed would occur by chance alone, assuming the NULL hypothesis is true
 - Given our significance level of 5% for chance, we want P-values to be less than 5% or .05 to be considered statistically significant
  36Comparing Means
- Many times we wish to compare the means of two subsets of a population
 - Ex: MCAT scores for Biology vs. Chemistry majors
 - To do this we would sample MCAT scores from random samples of biology and chemistry majors across the country
 - We would compute the mean of each of these samples
 - We would compare the means to determine if they are significantly different.
 - This kind of analysis is exactly what is done by Hypothesis Testing (we hypothesize there is no difference and then refute this hypothesis)
  37Hypothesis Testing
- A statistical test of hypothesis consists of 4 parts
 - A NULL Hypothesis, termed Ho
 - An Alternate Hypothesis, termed Ha
 - A test statistic
 - A rejection region
 - The NULL hypothesis is what we want to refute
 - The Alternate hypothesis is what we want to support
 - The test statistic is what we will use to decide between the NULL and the Alternate Hypotheses
 - The Rejection Region is the set of values of the test statistic for which Ho will be rejected
  38Hypothesis Testing
- So what does this all mean in layman's terms?
 - Basically we are asking: given a test statistic we specify, what is the probability that a result supporting the hypothesis in question (Ha) arose by chance alone, even though Ho is true?
 - We convert the test statistic into a probability value by looking it up in a table that specifies the respective probabilities associated with that particular statistic value
  39Constructing a Hypothesis
- Consider the following question
 - We wish to show that the hourly wages of construction workers in California are larger than the national average of $14
 - The hypothesis will be written down as
 -  Ha: $\mu > 14$
 -  Ho: $\mu = 14$
 -  Test statistic = Z-value = $(\bar{x} - \mu_0) / (s / \sqrt{N})$, where $\bar{x}$ is the sample mean and s the sample standard deviation
 -  Rejection region: $Z > 1.645$ (α = 0.05, one-tailed)
 
  40Testing a Hypothesis
- The average weekly earnings for men in managerial and professional positions is $725. Do women in the same position have average weekly earnings that are less than those for men?
 - A random sample of N = 40 women in managerial positions showed a sample mean of $670 and a standard deviation of $102. Test the appropriate hypothesis using α = 0.01
 -  Solution: Ho: $\mu = 725$, Ha: $\mu < 725$
 -  $Z = (\bar{x} - \mu) / (s / \sqrt{N})$
 -  $Z = (670 - 725) / (102 / \sqrt{40}) = -3.41$
 -  Since -3.41 is less than -2.33 (the critical Z value for α = 0.01, one-tailed), we conclude that Ho is false and the average weekly salary for women is significantly less than for men; the probability that we have made an incorrect decision (a Type I error) is at most 0.01
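The arithmetic of this worked example can be checked with a short script. This is a sketch under the assumptions of the large-sample Z-test: 102 is treated as the sample standard deviation, the test is lower-tailed, and the function name `one_sample_z_test` is hypothetical. The decision is made by comparing Z with the critical Z value for the chosen significance level, or equivalently the P-value with that level.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z_test(sample_mean, null_mean, sd, n, alpha):
    """Lower-tailed large-sample Z-test of Ho: mean = null_mean."""
    z = (sample_mean - null_mean) / (sd / sqrt(n))
    z_critical = NormalDist().inv_cdf(alpha)   # about -2.33 for alpha = 0.01
    p_value = NormalDist().cdf(z)              # area in the lower tail
    return z, z_critical, p_value

z, z_critical, p_value = one_sample_z_test(670, 725, 102, 40, alpha=0.01)
reject_null = z < z_critical
print(round(z, 2), round(z_critical, 2), reject_null)
```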
  41Confidence in our Test Result
- So what is our confidence in our result?
 - Well, we can have 2 types of errors
 - Type I error = Rejecting Ho when Ho is true; its probability is α
 - Type II error = Accepting Ho when Ho is false; its probability is β
 - To compute a confidence value, we calculate the Power of the Test which is the probability of correctly rejecting the NULL hypothesis
 - Power = (1 - β)
 
  42Types of Tests
- Given the kinds of data we have and the types of information we seek there are different types of tests available to us
 - Student's T-Test
 - Used to compare MEANS of two populations
 - Works for small samples (N < 30)
 - Chi-Square Test
 - Used to estimate a population's VARIANCE
 - F-Test
 - Used to compare the VARIANCES of 2 populations
 
  43Types of Tests
- We can do these tests in different ways
 - We can have one-tailed and two-tailed tests
 - A One-tailed test occurs when our alternate hypothesis places the mean on one specific side (either less or greater) of the null hypothesis mean
 - A Two-tailed test occurs when the alternate hypothesis allows the mean to be on either side of the null value
 - We can also do Paired Tests, where the 2 sets of measurements are matched (for example, the same subjects measured before and after a treatment) and we test the paired differences
  44T-tests Small Sample Testing
- Up to now we have assumed the sample size to be large (N > 30) in order to achieve good power. But what happens when the sample size is small (N < 30)?
 - Well, in this case the sampling distribution looks somewhat different from the normal distribution: it is shorter and wider and is called the T-Distribution
 - Every T-distribution has an associated Degree of Freedom (df) which is equal to N-1
 - A T-Table is consulted to get the appropriate values of the T-statistic when doing a T-test. You need the df and the significance level to look up the T-values.
  45Chi-Square Distribution
- Remember that the T-test compares population Means. What if we want to estimate a population variance?
 - In this case, we would use a Chi-Square distribution and our test statistic will be a chi-square value
 - $\chi^2 = (n-1)s^2 / \sigma^2$
 - where n = sample size
 - $s^2$ = sample variance
 - $\sigma^2$ = Population Variance that we are trying to estimate
 - A related chi-square-based procedure is the Mantel-Haenszel Test
 - It is a test of association between 2 ordinal variables (frequency data)
  46F-Distribution
- What if we want to compare the population variances of two different populations?
 - In this case we use an F-Distribution and an F-statistic
 - $F = s_1^2 / s_2^2$, where $s_1^2$ and $s_2^2$ are variances of Samples 1 and 2
 - Typically we will have 2 degrees of freedom (v1 and v2) with F-tests
  47Linear Regression and Correlation 
 48Linear Regression and Correlation
- In many situations in clinical studies we wish to attempt to answer the question "How is the random variable X related to the random variable Y?"
 - Ex: How is smoking related to lung cancer?
 - Ex: How is age related to development of Alzheimer's Disease?
 - Ex: How is hypertriglyceridemia related to metabolic syndrome?
 - Such questions are answered statistically using the concepts of Regression Analysis, which looks for relationships among different variables (either negative or positive), and Correlation, which measures the strengths of the relationships
 - Relationships may have many forms
 - Related linearly
 - Related curvilinearly
 - Related colinearly
 - Associations but not Correlations
 
  49Linear Regression
- The Linear Regression model postulates that two random variables X and Y are related by a straight line as follows
 - $Y = a + bX + e$
 - Where
 -  Y is the dependent variable
 -  X is the independent variable
 -  a is the intercept
 -  b is the slope
 -  e is the residual value
 
  50Linear Regression
- Residual Value (e) 
 - Given that the regression analysis procedure is 
itself a statistical approach, it is expected to 
have some degree of error associated with it  - Thus we add a value called the residual value (e) 
to any regression equation to account for random 
errors in the process  - Scatterplots 
 - In order to perform regression analysis visually, 
it helps to graph the points on a scatterplot  - A visual relationship can often be observed when 
looking at these plots 
  51Method of Least Squares
- So, assuming that 2 variables are linearly related, how do we best fit a line (the regression line) through a series of points on a scatterplot?
 - One way is to use a goodness of fit estimator called the Sum of Squares for Error (S) which we want to minimize
 -  $S = \sum_i (y_i - f(x_i))^2$, where $y_i$ is the observed value and $f(x_i) = a + b x_i$ is the value the line predicts at $x_i$
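To make the least squares idea concrete, the sketch below fits a line by the usual closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared X deviations, and the intercept makes the line pass through the point of means. The function name `least_squares_line` and the small data set are hypothetical, chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Return intercept a, slope b and sum of squares for error S."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    b = sxy / sxx                 # slope that minimizes S
    a = y_mean - b * x_mean       # line passes through (x_mean, y_mean)
    s = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return a, b, s

# Hypothetical measurements, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, s = least_squares_line(xs, ys)
print(round(a, 3), round(b, 3), round(s, 3))
```

Any other line through these points would give a larger S than the one returned here.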
 52Inferences Concerning Slopes
- The initial question once we have a regression line is whether the data present sufficient evidence to indicate that Y increases or decreases linearly as X increases over the observed region
 - So we use the variability of the points about the line to estimate this
 -  Variance = $s^2 = S / (n - 2)$
 -  S = Sum of squares for error
 -  n = Sample size
 
  53Inferences Concerning Slopes
- Given that we can use S for estimating the population variance, we can formulate our hypothesis about the slope using a T-test as follows
 -  Null Hypothesis Ho: $b = b_0$
 -  Alternate Hypothesis Ha: $b < b_0$ or $b > b_0$
 -  Test Statistic = t-value = $(b - b_0) / (s / \sqrt{S_{xx}})$
 -  b = regression line slope
 -  $b_0$ = slope to test with
 -  s = standard deviation about the line (the square root of the variance $s^2$)
 -  $S_{xx}$ = Sum of squares for the Xi = $\sum_i (X_i - \bar{X})^2$
  54Inferences Concerning Slopes
- So how do we do the T-test and reach a conclusion or calculate a P-value?
 - Well, the T-table has several features
 - df = Degrees of Freedom = n - 2 for the slope test (the variance estimate divides S by n - 2); a one-sample T-test uses n - 1
 - T-values listed for various significance levels
 - The procedure for using a T-Table is as follows
 - Compute the T-value using the statistic in your test
 - Look up the row of the table for your degrees of freedom
 - Then find the largest tabulated T-value in that row that your T-value exceeds; its column gives a significance level, and the P will be less than that significance level
  55Linear Regression
- So, graphically what does it look like?
 
  56Other Regressions
- Given the types of data you have, there are other methods for fitting the data to a geometric shape
 - For example, there is Curvilinear Regression
 - Cubic Spline Interpolation
 - Quadratic Interpolation
 - Higher Order Interpolation
 - Logistic Regression
 - This is useful when you have categorical data (non-numeric)
 - For example, when you have a categorical variable such as HTN (Y/N), Gender (M/F) or Race
  57Correlation
- As opposed to finding the best fit line through a set of data points, Correlation seeks to understand the strength of the relationship.
 -  [Figure: three example scatterplots with R = 0.17, R = 0.85 and R = -0.94]
 
  58Correlation
- We compute the Pearson Product Moment Coefficient of Correlation (R) as follows
 -  $R = S_{xy} / \sqrt{S_{xx} S_{yy}}$
 -  where
 -  $S_{xy} = \sum_i (X_i - \bar{X})(Y_i - \bar{Y})$
 -  $S_{xx} = \sum_i (X_i - \bar{X})^2$
 -  $S_{yy} = \sum_i (Y_i - \bar{Y})^2$
 -  $-1 \le R \le 1$; the larger the absolute value of R, the stronger the correlation
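A direct translation of the correlation coefficient into code might look like the sketch below. It assumes the standard definition, in which Sxy is the sum of cross-products of the deviations of X and Y from their means, so that R ranges from -1 to 1. The function name `pearson_r` and the sample data are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product moment correlation: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired measurements.
r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(round(r, 3))
```

The sign of R gives the direction of the relationship and its absolute value gives the strength.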
  59Multiple Linear Regression
- So far we discussed how one variable is related to another in a study.
 - But in real life, a study typically has many variables that it is trying to compare as they relate to an outcome
 - Ex: CAD as f(HTN, DM, Smoking, Hyperchol., Obesity, Age)
 - In order to do this type of analysis, we can extend the general notion of linear regression to multiple variables.
 - We have an intercept as usual but partial slopes (or partial regression coefficients), each one representing a different variable
  60Multiple Linear Regression
- The General Linear Model (GLM) is then stated as follows
 -  $Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n + e$
 - With the following assumptions
 -  1. Y is the response variable you wish to predict
 -  2. $b_0, b_1, \dots, b_n$ are unknown constants
 -  3. $x_1, x_2, \dots, x_n$ are independent predictor variables that are measured without error
 -  4. e is a random error that for any set of predictors is normally distributed
 -  5. The random errors associated with any pair of Y values are independent
  61Multiple Linear Regression
- Note that you can use qualitative (categorical) and quantitative variables in a GLM.
 - Categorical Variables look like
 - X1 = 1 if Group A, 0 if not Group A
 - Typically computing p-values and regression equations in a GLM is hard to do by hand so most people will do it using computer software
 - SAS has a procedure called Proc GLM
 - SPSS/PC
 - StatSoft
 - HyperStat
 
  62Multiple Linear Regression
- Problems that can occur when using GLM
 - Multicollinearity
 - This happens when 2 of the independent variables xi, xj are themselves related; including both in a model makes their individual coefficient estimates unstable and can misstate the true effect size
 - Related to the notions of Covariates and Confounding Factors
 - Interaction Terms
 - When the effect of one variable on the outcome depends on the value of another variable, we must add an interaction term to the model
 - For example, suppose you want to study the salary of a professor with respect to # of years of service. Well, this may differ slightly whether you are a male or female.
 - Thus, the salary slope for males may be slightly higher than the salary slope for females despite the same number of years of service.
 - This type of relationship is called an Interaction (between gender and years of service, because the slope varies depending on whether a male or female is selected) and we must add a term of the type
 -  $Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2$
 
  63Logistic Regression
- What happens when you have data in the form of proportions (or frequency data) of categorical variables?
 - The tool for analysis of this type of data is called a Logistic Regression
 - Its significance tests are based on the Chi-Square Distribution and the model is described as follows
 -  $\ln[p/(1-p)] = a + BX + e$ or $p/(1-p) = \exp(a) \exp(BX) \exp(e)$
 -  where
 -  ln is the natural logarithm (the logarithm to base 2.71828...)
 -  p is the probability that the event Y occurs, p(Y=1)
 -  p/(1-p) is the odds
 -  ln[p/(1-p)] is the log odds, or "logit"
 -  all other components of the model are the same.
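The link between probability, odds and the logit can be illustrated numerically. The sketch below is not a fitting routine; it only shows the transformation, ignoring the error term. The function names `logit` and `predicted_probability` and the coefficient values are hypothetical.

```python
from math import exp, log

def logit(p):
    """Log odds: ln[p / (1 - p)]."""
    return log(p / (1 - p))

def predicted_probability(a, b, x):
    """Invert the logit: the p for which ln[p / (1 - p)] = a + b * x."""
    return 1 / (1 + exp(-(a + b * x)))

# Hypothetical coefficients: intercept a = -1.0, slope B = 0.5.
p = predicted_probability(a=-1.0, b=0.5, x=3.0)
odds = p / (1 - p)
print(round(p, 4), round(odds, 4), round(logit(p), 4))
```

With these coefficients, each unit increase in X multiplies the odds by exp(0.5), which is how logistic regression coefficients are usually interpreted.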
 
  64The ANalysis Of VAriance
  65ANOVA
- Suppose you want to compare the mean 
reimbursement rates from 5 different health plans  - You could do t-tests among all combinations of 
the 5 plans, or 10 t-tests  - Suppose all the means are equal. When this 
procedure is repeated 10 times, the probability 
of incorrectly concluding that at least one pair 
of means differ is quite high and you reach an 
erroneous decision  - Thus we want one test which could compare means 
for all 5 groups at the same time  - This is exactly what ANOVA provides
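The size of this problem is easy to estimate. The sketch below counts the pairwise comparisons and computes the chance of at least one false positive, under the simplifying assumption that the tests are independent and each uses a 5% significance level. Pairwise t-tests on shared groups are not truly independent, so the figure is only approximate. The function name `familywise_error` is hypothetical.

```python
from math import comb

def familywise_error(groups, alpha=0.05):
    """Number of pairwise tests and chance of at least one false positive."""
    tests = comb(groups, 2)              # 5 plans give 10 pairs
    # Assumption: independent tests, each with Type I error rate alpha.
    return tests, 1 - (1 - alpha) ** tests

tests, error_rate = familywise_error(groups=5)
print(tests, round(error_rate, 3))
```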
 
  66ANOVA
- ANOVA is a powerful procedure which allows you to 
do 2 things  - Compare the variance between the means of 2 or 
more groups  - Compare the variance in data values within each 
group  
  67ANOVA
- ANOVA procedures can be done with different study 
designs  - Completely Randomized Design 
 - Random samples are independently selected from 
each of k populations.  - Assumes that the data is homogeneously 
distributed with a fixed variation  - Randomized Block Design 
 - Assumes that subsets of the population have 
different variances  - Within each subset, however, the variability is 
the same  - Each subset is called a block. 
 - Random samples are then taken from each block
 
  68ANOVA for Completely Randomized Designs
- Suppose we want to compare k population means $\mu_1, \dots, \mu_k$ based on random independent samples of $n_1, \dots, n_k$ observations selected from populations 1..k respectively
 - Ex: Suppose we have 10 observations of reimbursement figures from each of 5 health plans; then we will have 50 total values
 - Then let
 - $x_{ij}$ represent the jth measurement in the ith group
 - We define an entity called the Total Sum of Squares (SS) as follows
 -  Total SS = $S_{xx} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^2$, where $\bar{x}$ is the mean of all n values
 
  69ANOVA for Completely Randomized Designs
- It can be shown that the sum of squares of deviations of all values about the overall mean, the Total SS (of all 50 values), can be partitioned into 2 components
 - SST = Sum of Squares for Treatments (measures variation between the sample means)
 - SSE = Sum of Squares for Error (measures variation within samples)
 - We have
 - Total SS = SST + SSE
 
  70ANOVA for Completely Randomized Designs
- Now, we can also compute SSE readily and it is
 - SSE = $\sum_{j=1}^{n_1} (x_{1j} - \bar{x}_1)^2 + \sum_{j=1}^{n_2} (x_{2j} - \bar{x}_2)^2 + \dots + \sum_{j=1}^{n_k} (x_{kj} - \bar{x}_k)^2$
 -  Knowing SSE and Total SS, we can compute SST
 -  We then compute the Mean Squares of these as follows
 -  MST = SST / (k-1)
 -  MSE = SSE / (n-k)
 -  The final step is to compute an F-statistic as follows
 -  F = MST / MSE
 
  71ANOVA for Completely Randomized Designs
- Now, F-tests have 2 degrees of freedom v1 and v2
 - In the case of ANOVA,
 - v1 = k - 1
 - v2 = n - k
 - We can then do our usual hypothesis testing using this F-statistic as our test
 -  Ho: $\mu_1 = \mu_2 = \mu_3 = \dots = \mu_k$
 -  Ha: One or more pairs of population means differ
 -  F-Statistic = MST/MSE with df v1 = (k-1), v2 = (n-k)
 -  Rejection Region: Reject Ho if F > Fα (found from the table using v1, v2 and α)
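The sums of squares and the F-statistic for a completely randomized design can be computed as in the sketch below. It follows the partition Total SS = SST + SSE, obtaining SST by subtraction. The function name `one_way_anova` and the three tiny groups are hypothetical; the critical value Fα still has to come from an F-table.

```python
def one_way_anova(groups):
    """Sums of squares, mean squares and F for k independent groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    # Total SS: squared deviations of every value about the overall mean.
    total_ss = sum((x - grand_mean) ** 2 for g in groups for x in g)
    # SSE: squared deviations of each value about its own group mean.
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    sst = total_ss - sse
    mst = sst / (k - 1)
    mse = sse / (n - k)
    return {"total_ss": total_ss, "sst": sst, "sse": sse,
            "f": mst / mse, "df": (k - 1, n - k)}

# Hypothetical data: 3 groups of 3 observations each.
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```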
  72ANOVA for Randomized Block Designs
- The computational steps are very similar to those of a completely randomized design except that we add a third term, the sum of squares for BLOCKS (with b blocks)
 - Total SS = SST + SSE + SSB
 - We then perform 2 different hypothesis tests
 -  (1) For comparing Treatment Means
 -  F = MST/MSE, v1 = k-1, v2 = n-b-k+1
 -  (2) For comparing BLOCK Means
 -  F = MSB/MSE, v1 = b-1, v2 = n-b-k+1
 
  73Nonparametric Statistics
  74Nonparametric Statistics
- What do we do when we have opinion data?
 - For example, suppose a judge is employed to evaluate and rank the sales abilities of 4 salesmen, the edibility of 5 brands of Corn Flakes or the relative appeal of 5 brands of automobiles
 - Clearly it is impossible to give an exact measure of sales competence, the palatability of food or design appeal
 - But, it is possible to rank the salespeople, food or design choices based on our own opinions.
 - Many types of studies in medicine use this kind of data gathering (patient satisfaction is one example)
  75Nonparametric Statistics
- There are many tests available for studying this kind of data
 - The Sign Test
 - The Mann-Whitney U Test
 - The Wilcoxon Signed-Rank Test for a Paired Experiment
 - The Kruskal-Wallis H Test for Completely Randomized Designs
 - The Friedman Fr Test for Randomized Block Designs
 - Spearman's Rank Correlation Test
 
  76The Sign Test
- Compares 2 populations with respect to how they differ in the responses to qualitative questions
 - Compute the number of responses that were the same
 - Then compute the number of responses that differed
 - Finally compute X, the number of times responses from population A were greater than responses from population B
 - This gives you the number of times (A-B) is positive (i.e. has a positive sign, hence the name)
 - This is your test statistic
 - You then use a Binomial Probability Distribution (with p = 1/2 under Ho, over the responses that differed) to do a hypothesis test
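The sign test steps can be sketched as follows. This is an illustration under stated assumptions: tied pairs are dropped, the null hypothesis is that a positive and a negative sign are equally likely (binomial with p = 1/2), and the P-value is two-sided. The function name `sign_test` and the paired ratings are hypothetical.

```python
from math import comb

def sign_test(a, b):
    """Two-sided sign test on paired responses a[i], b[i]."""
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop tied pairs
    n = len(diffs)
    x = sum(1 for d in diffs if d > 0)                # times (A - B) is positive
    tail = min(x, n - x)
    # Binomial(n, 1/2) probability of a split at least this lopsided, both tails.
    p_value = min(1.0, 2 * sum(comb(n, i) for i in range(tail + 1)) / 2 ** n)
    return x, n, p_value

# Hypothetical satisfaction ratings from 8 matched pairs.
group_a = [4, 5, 3, 4, 5, 4, 5, 3]
group_b = [3, 4, 3, 2, 4, 3, 4, 4]
x, n, p_value = sign_test(group_a, group_b)
print(x, n, p_value)
```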
  77Mann-Whitney U Test
- Analogous to the T-test for nonparametric data
 - Suppose you have 2 populations from which 2 samples of sizes n1 and n2 are obtained
 - You should rank all observations (n1 + n2) in ascending order, assigning rank values 1, 2, 3, ... to all observations
 - Tied observations are handled by averaging the ranks assigned to the tied observations
 - Then calculate the sum of the ranks T1 and T2 for both of the samples
  78Mann-Whitney U Test
- Now compute the U statistic as follows
 - $U_1 = n_1 n_2 + n_1(n_1+1)/2 - T_1$
 - $U_2 = n_1 n_2 + n_2(n_2+1)/2 - T_2$
 - Look up the table for the appropriate α value and your n2
 - The Table will give you a value for Uo on the left hand side corresponding to your n1
 - Your computed U (smaller of U1 or U2) should be less than the Uo stated in the table in order to reject the Null hypothesis (that the population relative frequency distributions are identical)
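The ranking and U computation can be sketched in code. This is a minimal version under stated assumptions: ranks are assigned over the pooled samples, tied values share the average of the ranks they occupy, and U1 and U2 are each n1 x n2 plus n(n+1)/2 for that sample minus its rank sum. The function names and sample values are hypothetical, and the critical value Uo must still be read from a table.

```python
def average_ranks(values):
    """Map each distinct value to its rank, averaging ranks across ties."""
    ordered = sorted(values)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        ranks[ordered[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    return ranks

def mann_whitney_u(sample1, sample2):
    """Return U1, U2 and the smaller of the two."""
    ranks = average_ranks(list(sample1) + list(sample2))
    n1, n2 = len(sample1), len(sample2)
    t1 = sum(ranks[x] for x in sample1)       # rank sum for sample 1
    t2 = sum(ranks[x] for x in sample2)       # rank sum for sample 2
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - t1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - t2
    return u1, u2, min(u1, u2)

# Hypothetical scores from two small samples.
u1, u2, u = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u1, u2, u)
```

A useful check is that U1 + U2 always equals n1 x n2.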
  79Wilcoxon Signed Rank Test
- Similar to the Mann-Whitney U Test
 - Allows you to compare paired differences
 - Given n pairs of observations from populations A and B, compute the paired differences (xA-xB) for each pair of values
 - Rank the absolute values of the differences from smallest to largest, keeping track of which differences are positive and which are negative
 - Compute the sums T+ and T- of the ranks of the positive and the negative differences
 - For a one tailed test, use T- and for a two tailed test, use the smaller of T+ or T-
 - Reject Ho if T < To (critical value) obtained from the Wilcoxon Table, given n and α values
  80Other Nonparametric Tests
- Kruskal-Wallis H Test
 - Just as the Mann-Whitney U Test is the nonparametric alternative to the Student's T-Test for comparing population means, the Kruskal-Wallis H Test is the nonparametric alternative to ANOVA for a completely randomized design and is used to detect differences in location among more than 2 population distributions based on independent random sampling
 - Friedman Fr Test for Randomized Block Designs
 - Is a nonparametric test for comparing the distributions of measurements for k treatments laid out in b blocks using a randomized block design
  81Test of Association
- Spearman's Rank Correlation Test
 - Tests whether there is an association between 2 populations
 - Assume n pairs (xi, yi) of observations from 2 populations X, Y
 - Rank each of the xi and yi in ascending order
 - Compute
 - $R_s = S_{xy} / \sqrt{S_{xx} S_{yy}}$, with the sums computed on the ranks rather than the raw values
 - Then given n and α, look up Ro in the Spearman Table
 - Reject Ho (no association) if $R_s > R_0$ or $R_s < -R_0$
  82Survival Analysis 
 83Introduction
- There are many clinical studies that address the question of time to an event
 - For example, we often want to know given risk factors, what is a patient's chance for an MI? (i.e. time to MI)
 - Some subjects will not have had the event by the end of the study, so their true time to event is unknown; such observations are called censored data
 - Survival Analysis seeks to study this type of question
  84Life Tables
- The most straightforward way to describe survival in a sample is to compute a data structure known as a Life Table
 - The entire lifetime of a study object is divided into intervals of specified length
 - For each interval, the number of subjects surviving or dying within that interval is determined and plotted
 - Based on this number, we can compute several types of statistics
 - Numbers of cases at risk
 - Proportion Failing or Proportion Surviving
 - Probability Density or Hazard Rate
 - Median Survival Time
 - Required Sample Sizes
 
  85Survival Analysis
- Although life tables give us a good estimate of 
the risk of adverse events, it is desirable to 
understand the underlying survival function 
algorithmically for prediction purposes  - The three distributions proposed for this are 
the  - Exponential (linear exponential) distribution 
 - Weibull Distribution 
 - Gompertz Distribution 
 - The parameter estimation procedure is then a 
modified version of the least-squares model  - And the statistic used to study it is an 
incremental Chi-Square Statistic 
  86Kaplan-Meier Product Limit Estimator
- Rather than classify the survival times into a life-table, the KM estimator computes a survival function directly from continuous survival or failure times
 - Imagine creating a life table with exactly one observation for each interval
 - Then we avoid the effect of grouping observations together into interval categories
 - Then $S(t) = \prod_{j=1}^{t} \left[ (n-j)/(n-j+1) \right]^{\delta(j)}$, the product running over the ordered observations j up to time t
 -  n = total # observations
 -  $\delta(j)$ = 1 if observation j is uncensored (the event was observed), 0 if it is censored
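The product limit formula can be sketched in code. This is a minimal version under stated assumptions: observations are sorted by time, there are no tied times, and the exponent is taken to be 1 when the event was observed and 0 when the observation was censored, so that a censored case leaves the survival estimate unchanged. The function name `kaplan_meier` and the follow-up data are hypothetical.

```python
def kaplan_meier(times, event_observed):
    """Product limit estimate of S(t) after each ordered observation.

    Assumes distinct times; event_observed[i] is True if subject i had the
    event and False if the observation was censored.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    n = len(times)
    survival = 1.0
    curve = []
    for j, i in enumerate(order, start=1):
        delta = 1 if event_observed[i] else 0
        survival *= ((n - j) / (n - j + 1)) ** delta
        curve.append((times[i], survival))
    return curve

# Hypothetical follow-up times in months; the second subject was censored.
curve = kaplan_meier([2, 3, 5, 8], [True, False, True, True])
print(curve)
```

Real data sets usually contain tied event times, which standard survival software handles by grouping the events at each distinct time.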
 
  87Comparing Survival Times
- Often we wish to compare survival times in 2 or more populations
 - There are several tests available for this purpose
 - Gehan's Generalized Wilcoxon Test
 - Cox-Mantel Test
 - Cox's F-Test
 - Log-Rank Test
 - Peto and Peto's Wilcoxon Test
 - These are mostly nonparametric tests that generate Z-values for comparing the groups
  88Regression Models
- We also want to be able to predict survival time given some independent risk factors
 - This is very common in the medical literature
 - The regression test of choice is the Cox Proportional Hazards Model
 - The model is written as
 -  $h(t \mid z_1, z_2, \dots, z_m) = h_0(t) \exp(b_1 z_1 + \dots + b_m z_m)$
 -  where $h(t \mid \dots)$ denotes the resultant hazard, given the values of the m covariates for the respective case ($z_1, z_2, \dots, z_m$) and the respective survival time (t). The term $h_0(t)$ is called the baseline hazard; it is the hazard for the respective individual when all independent variable values are equal to zero. We can linearize this model by dividing both sides of the equation by $h_0(t)$ and then taking the natural logarithm of both sides
 -  $\log[h(t \mid z_1, \dots, z_m) / h_0(t)] = b_1 z_1 + \dots + b_m z_m$
 -  We now have a fairly "simple" linear model that can be readily estimated
  89Useful Links
- http://hesweb1.med.virginia.edu/biostat/teaching/handouts.html
 - http://stat.tamu.edu/stat30x/notes/trydouble2.html
 - http://www.statsoft.com/textbook/stathome.html
 - http://davidmlane.com/hyperstat/index.html
 - http://members.aol.com/johnp71/javastat.html
 - http://www.helsinki.fi/jpuranen/links.html
 - http://ubmail.ubalt.edu/harsham/statistics/REFSTAT.HTMrgenRes
 - http://trochim.human.cornell.edu/kb/index.htm
 
  90Questions