1
Introduction to Clinical Biostatistics for
Medical Students
  • Atif Zafar, MD
  • Department of Medicine

2
Overview of Presentation
  • Introductory Concepts (Review)
  • Hypothesis Testing
  • Linear Regression and Correlation
  • Analysis of Variance (ANOVA)
  • Nonparametric Statistics
  • Survival Analysis

3
Introductory Concepts
4
Introductory Concepts
  • Types of Data
  • Presenting Data
  • Descriptive Measures
  • Probability and Distributions
  • Estimation Techniques

5
Types of Data
  • Data are usually Discrete or Continuous
  • Discrete Variables take on a finite set of values
    that can be counted
  • Race, Gender, Year in School etc.
  • Continuous Variables take on an infinite set of
    values
  • Age, Height/Weight, Blood Pressure

6
Types of Data
  • A Special type of Discrete Variable is the Binary
    Variable which takes on exactly 2 possible values
  • Gender (M/F)
  • Pregnant? (Y/N)
  • Hypertensive? (Y/N)

7
Types of Data
  • Sometimes, discrete variables have a natural ordering to them; these are called Ordinal Variables
  • For example, names of consecutive days in a week (M, Tu, Wed, Thurs, Fri, Sat, Sun)
  • Other types of discrete variables do not have a natural order and are called Nominal Variables
  • Race (African American, Caucasian, Asian,
    Hispanic etc.)

8
Types of Data
  • If in an experiment you measure a single
    variable, it is called a Univariate experiment
  • If you measure 2 variables, it is called a
    Bivariate experiment
  • And if you measure multiple variables, it is
    called a Multivariate experiment

9
Types of Data
  • A Random variable is one whose value is
    determined by chance or random event
  • Typically, a variable X is random if it is the
    outcome of an experiment where results can occur
    by chance or are not completely predictable

10
Types of Data
  • Nonparametric Variables
  • Many times in clinical studies, we seek opinion data (e.g. patient satisfaction scores, relative value scales, etc.)
  • The data can be ranked but have no absolute scale that is comparable
  • This type of data is called nonparametric data

11
Presenting Data
  • There are many ways to present data
  • Frequency Tables
  • Pie Charts
  • Bar Graphs (Histograms)
  • Line Graphs
  • Scatter Plots (Scattergrams)
  • Stem and Leaf Displays
  • Box Plots

12
Presenting Data
  • Scatter Plots (Plot of a Bivariate experiment)

13
Presenting Data
  • Stem and Leaf Displays
  • Presents a histogram-like picture of the data, while retaining the original data values
  • Dataset: 8520, 9274, 8142, 11298, 10624, 7987, 11172, 12899, 10737, 9198, 13625, 9462, 11847, 10178, 12240, 11690, 10069, 11240, 12745, 12995

14
Presenting Data
  • Boxplots
  • Complex visual data structures that combine
    various measures
  • Maximum and Minimum Data Points
  • 1st and 3rd Quartile Points
  • Sort the data points from lowest to highest
  • Divide the number of data points into 2 halves
  • Take the Median value of each half and those are
    the 1st and 3rd quartiles (Q1,
    Q3)
  • Compute the Inter Quartile Range (IQR)
  • IQR = Q3 - Q1
  • Compute 1.5 x IQR, then compute Q3 + 1.5 x IQR and Q1 - 1.5 x IQR
  • Data points lying outside this range are called
    Outliers
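  • A minimal Python sketch of this fence rule (hypothetical data; note that NumPy's default percentile interpolation can differ slightly from the median-of-halves method described above):

    import numpy as np

    data = np.array([12, 15, 14, 10, 8, 13, 9, 11, 40])  # hypothetical sample
    q1, q3 = np.percentile(data, [25, 75])                # 1st and 3rd quartiles
    iqr = q3 - q1                                         # interquartile range
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr         # the fences
    outliers = data[(data < lower) | (data > upper)]
    print(iqr, outliers)                                  # 40 falls outside the fences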

15
Presenting Data
  • Boxplots

16
Descriptive Measures
  • Now that we have displayed our data, we want to
    be able to characterize it quantitatively
  • Measures of Central Tendency
  • Mean, Median, Mode
  • Measures of Variability
  • Range, Variance, Standard Deviation
  • Measures of Relative Standing
  • Z-Scores, Percentiles, Quartiles

17
Measures of Central Tendency
  • Mean
  • Arithmetic Average of a sample of data
  • Median
  • If you order the data from smallest to highest,
    the median is the middle value, assuming an odd
    number of data elements
  • If you have an even number of elements, it is the
    average of the 2 middle numbers.
  • Mode
  • The most common value in a set of values
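  • A quick sketch of all three measures using Python's standard library (hypothetical scores):

    import statistics

    scores = [70, 80, 80, 90, 100]     # hypothetical exam scores
    print(statistics.mean(scores))     # arithmetic average -> 84
    print(statistics.median(scores))   # middle value -> 80
    print(statistics.mode(scores))     # most common value -> 80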

18
Measures of Variability
  • Once we have located the center of a set of data
    points, we want to know how dispersed they are

19
Measures of Variability
  • Range
  • This is the difference between the highest and
    lowest value
  • Variance
  • Defined to be the average of the square of the
    deviations of the individual data points about
    their mean
  • Standard Deviation
  • This is defined as the square root of the
    variance
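  • A sketch of these measures in Python (hypothetical data; statistics.pvariance matches the average-of-squared-deviations definition above, while statistics.variance uses the n - 1 sample form):

    import statistics

    data = [4, 8, 6, 5, 3, 7]            # hypothetical sample
    print(max(data) - min(data))         # range -> 5
    print(statistics.pvariance(data))    # average squared deviation about the mean
    print(statistics.pstdev(data))       # standard deviation = sqrt(variance)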

20
Measures of Relative Standing
  • Sometimes we want to know the position of a
    particular observation relative to others in a
    data set
  • Ex: How you performed with respect to your classmates on an exam
  • The Z-Score measures this as follows
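  • Z = (x - mean) / standard deviation, i.e. the number of standard deviations by which an observation x lies above or below the mean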

21
Measures of Relative Standing
  • Percentiles and Quartiles also indicate relative
    standing but in terms of the categories of scores
    from lowest to highest
  • Given a set of n measurements x1, …, xn, the pth percentile is defined to be the value of x that exceeds p% of the measurements and is less than (100 - p)% of the values
  • Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
  • The score 50 is at the 30th percentile, meaning that 30% of the scores were at or below yours and 70% were higher than yours
  • Quartiles similarly reflect in which quarter of the set of values a particular observation lies
  • Ex: Scores of 20, 30, 50, 60, 67, 67, 70, 80, 90, 95
  • 1st Quartile = 50, 3rd Quartile = 80

22
Probability
  • Suppose you do an experiment with a finite number of possible outcomes (ex: coin toss)
  • The Probability of an event E (H/T) is the chance (%) that the event will turn out in a given way in the next repetition of the experiment
  • Probability values are always between 0 and 1
  • The notation for probabilities is as follows
  • Given our coin toss experiment,
  • P(H) = Probability that a Head will be tossed in the next round
  • P(T) = Probability that a Tail will be tossed in the next round
  • One can estimate probabilities by repeating the event many times and observing the outcomes

23
Probabilities: Some Simple Rules
  • Arithmetically, one can combine probabilities of simple and sequential events
  • Given a complex event composed of N simple events, the probability of the complex event is equal to the sum of the probabilities of each of the simple events
  • Ex: Coin toss 1 and Coin toss 2
  • Event  First Coin  Second Coin  P(Ei)
  • E1     Heads       Heads        ¼
  • E2     Heads       Tails        ¼
  • E3     Tails       Heads        ¼
  • E4     Tails       Tails        ¼
  • Let A = {E2, E3}. Then P(A) = P(E2) + P(E3) = ½

24
Probability Distributions
  • Given a random variable X (either discrete or
    continuous), the Probability Distribution gives a
    table or formula or graph of the probabilities of
    each potential value of X
  • For a Probability Distribution P(x) the following must hold
  • 0 ≤ P(x) ≤ 1
  • Σ P(x) = 1, summed over all values of x

25
Probability Distributions
  • There are many kinds of probability
    distributions
  • Binomial Distribution
  • Applies to binary variable experiments where only
    2 outcomes are possible
  • Poisson Distribution
  • Applies to variables that represent the number of
    occurrences of a specified event in a given unit
    of time or space
  • Hypergeometric Distribution
  • Applies to experiments where the number of elements in the population is small in comparison to the sample size, so that the success of a trial depends on the outcomes of preceding trials

26
Probability Distributions
  • Normal Distribution (N)
  • Applies to continuous random variables
  • Standard Normal Distribution (Z)
  • A Normal Distribution with
  • Mean of 0
  • Standard Deviation of 1

27
Estimation Techniques
  • So now that we know that certain experiments
    can have results distributed in certain ways, how
    can we predict the result of this experiment?
  • This process is called Statistical Inference,
    where we can estimate the quality of a larger
    population by analyzing a small sample

28
Estimation Techniques
  • Populations and Samples
  • A Population is the larger set of objects we wish
    to study
  • Ex: The number of Democrats in the country
  • A Sample is a set of representative objects we
    choose in order to estimate the characteristics
    of the larger set of objects
  • Ex: Take 100 people from each state and determine whether they are Democrats

29
Estimation Techniques
  • Parameters and Statistics
  • A Parameter is the quality of the population we
    are trying to estimate
  • In order to estimate the parameter we measure the
    quality in a sample. This sample quality is
    called its statistic

30
Estimation Techniques
  • Many types of samples can be taken
  • Completely Random Sample
  • Stratified Random Sample
  • Divide the population into strata (groups)
  • Take a sample from each group
  • Ex: Party loyalties of teenagers, adults and the elderly
  • Cluster Sample
  • Take a simple random sample of clusters from the
    available clusters in a population
  • Ex: Urban vs. Rural sampling

31
Hypothesis Testing
  • Large Sample Estimation Techniques

32
Introduction to Estimating Techniques
  • Before we begin, let's review some common terms
  • Point Estimate: When we do an experiment and generate a result, the result at one point in time for one run of the experiment is called a point estimate (mean, etc.). Since each experiment has some error, there is a margin of error for every point estimate
  • Interval Estimate: Now if we repeat the experiment many times over, we will get a sense of how far off we are from running a perfect experiment. This sense of confidence in our experimental ability is called an interval estimate or a confidence interval

33
Confidence Intervals
  • Typically, the 95% confidence interval is defined as follows
  • CI = Mean ± 1.96 x SD / sqrt(N)
  • (SD / sqrt(N) is the standard error of the mean)
  • It tells us that if we repeat the experiment many times over, 95% of the time our values for the Mean will lie in the limits specified here
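  • A minimal Python sketch of this calculation (hypothetical sample; for small N a t-multiplier would replace 1.96):

    import numpy as np

    sample = np.array([5.1, 4.9, 5.6, 5.2, 4.8, 5.4, 5.0, 5.3])  # hypothetical
    mean = sample.mean()
    sem = sample.std(ddof=1) / np.sqrt(len(sample))  # SD / sqrt(N), the standard error
    lo, hi = mean - 1.96 * sem, mean + 1.96 * sem    # 95% confidence interval
    print(round(lo, 2), round(hi, 2))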

34
Significance Value (α)
  • Statisticians arbitrarily choose a value of 5% to represent events that can occur by chance alone
  • So if an event would occur by chance less than 5% of the time, it is considered statistically significant
  • The 5% value is called the significance value, or α
35
P-Values
  • A P-value is a useful way to represent the
    probability of a certain event and is seen
    extensively in the medical literature
  • Definition
  • The P-Value is simply the probability that an
    event occurs by chance alone
  • Given our significance level of 5% for chance, we want P-values to be less than 5% (0.05) for a result to be considered statistically significant

36
Comparing Means
  • Many times we wish to compare the means of two
    subsets of a population
  • Ex: MCAT scores for Biology vs. Chemistry majors
  • To do this we would sample MCAT scores from
    random samples of biology and chemistry majors
    across the country
  • We would compute the mean of all these samples
  • We would compare the means to determine if they
    are significantly different.
  • This kind of analysis is exactly what is done by
    Hypothesis Testing (we hypothesize there is no
    difference and then refute this hypothesis)

37
Hypothesis Testing
  • A statistical test of hypothesis consists of 4
    parts
  • A NULL Hypothesis, termed Ho
  • An Alternate Hypothesis, termed Ha
  • A test statistic
  • A rejection region
  • The NULL hypothesis is what we want to refute
  • The Alternate hypothesis is what we want to
    support
  • The test statistic is what we will use to compare
    the NULL and the Alternate Hypotheses
  • The Rejection Region is the value of the test
    statistic for which Ho will be rejected

38
Hypothesis Testing
  • So what does this all mean IN LAYMAN'S TERMS?
  • Basically we are asking: given the test statistic we specify, what is the probability that the observed result is due to chance alone?
  • We convert the test statistic into a probability value by looking it up in a table that specifies the probabilities associated with each value of that particular statistic

39
Constructing a Hypothesis
  • Consider the following question
  • We wish to show that the hourly wages of construction workers in California are larger than the national average of $14
  • The hypothesis will be written down as
  • Ha: μ > 14 (one-tailed)
  • Ho: μ = 14
  • Test statistic: Z = (X̄ - μo) / (s / sqrt(N))
  • Rejection region: α = 0.05

40
Testing a Hypothesis
  • The average weekly earnings for men in managerial and professional positions is $725. Do women in the same positions have average weekly earnings that are less than those for men?
  • A random sample of N = 40 women in managerial positions showed X̄ = 670 and s = 102. Test the appropriate hypothesis using α = 0.01
  • Solution: Ho: μ = 725, Ha: μ < 725
  • Z = (X̄ - μ) / (s / sqrt(N))
  • Z = (670 - 725) / (102 / sqrt(40)) = -3.41
  • Since -3.41 < -2.33, the critical z-value for α = 0.01, we reject Ho and conclude that the average weekly salary for women is significantly less than for men; the probability that we have made an incorrect decision is less than 0.01
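  • A sketch reproducing this calculation in Python (assuming SciPy is available):

    from math import sqrt
    from scipy import stats

    n, xbar, s, mu0 = 40, 670, 102, 725
    z = (xbar - mu0) / (s / sqrt(n))   # -3.41, as above
    p = stats.norm.cdf(z)              # one-tailed P(Z <= z)
    print(round(z, 2), p)              # z = -3.41, p ≈ 0.0003 < 0.01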

41
Confidence in our Test Result
  • So what is our confidence in our result?
  • Well, we can have 2 types of errors
  • Type I error: Rejecting Ho when Ho is true (α)
  • Type II error: Accepting Ho when Ho is false (β)
  • To compute a confidence value, we calculate the Power of the Test, which is the probability of correctly rejecting the NULL hypothesis
  • Power = (1 - β)

42
Types of Tests
  • Given the kinds of data we have and the types of
    information we seek there are different types of
    tests available to us
  • Student's T-Test
  • Used to compare MEANS of two populations
  • Works for small samples (N < 30)
  • Chi-Square Test
  • Used to estimate a population's VARIANCE
  • F-Test
  • Used to compare the VARIANCES of 2 populations

43
Types of Tests
  • We can do these tests in different ways
  • We can have one-tailed and two-tailed tests
  • A One-tailed test occurs when our hypothesis states the mean lies on one specified side (either less than or greater than) of the null hypothesis mean
  • A Two-tailed test occurs when we can say that the hypothesis mean can be on either side of the null value
  • We can also do Paired Tests, where the observations come in matched pairs and we test the differences within pairs

44
T-tests Small Sample Testing
  • Up to now we have assumed the sample size to be large (N > 30) in order to achieve good power. But what happens when the sample size is small (N < 30)?
  • Well, in this case the shape of the normal distribution looks somewhat different: it is shorter and wider, and is called the T-Distribution
  • Every T-distribution has an associated Degree of
    Freedom (df) which is equal to N-1
  • A T-Table is consulted to get the appropriate
    values of the T-statistic when doing a T-test.
    You need the df and the significance level to
    look up the T-values.
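  • A sketch of a two-sample T-test in Python with hypothetical small samples (scipy.stats.ttest_ind does the table lookup internally; stats.t.ppf retrieves a critical T-value directly):

    from scipy import stats

    group_a = [23, 25, 28, 30, 22]            # hypothetical samples, N < 30
    group_b = [27, 29, 31, 33, 30]
    t, p = stats.ttest_ind(group_a, group_b)  # two-sample t-test, df = n1 + n2 - 2
    print(t, p)
    print(stats.t.ppf(1 - 0.025, df=8))       # critical value for α = 0.05, two-tailed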

45
Chi-Square Distribution
  • Remember that the T-test compares population
    Means. What if we want to estimate a population
    variance?
  • In this case, we would use a Chi-Square distribution and our test statistic will be a chi-square value
  • X² = (n - 1)s² / σo²
  • where n = sample size
  • s² = sample variance
  • σo² = population variance that we are trying to estimate
  • A variant of the Chi-Square Distribution is
    called the Mantel-Haenszel Test
  • It is a test of association between 2 ordinal
    variables (frequency data)

46
F-Distribution
  • What if we want to compare the population
    variances of two different populations?
  • In this case we use an F-Distribution and an
    F-statistic
  • F = s1² / s2², where s1² and s2² are the variances of Samples 1 and 2
  • Typically we will have 2 degrees of freedom (v1 and v2) with F-tests

47
Linear Regression and Correlation
48
Linear Regression and Correlation
  • In many situations in clinical studies we wish to answer the question: How is the random variable X related to the random variable Y?
  • Ex: How is smoking related to lung cancer?
  • Ex: How is age related to development of Alzheimer's Disease?
  • Ex: How is hypertriglyceridemia related to metabolic syndrome?
  • Such questions are answered statistically using the concepts of Regression Analysis, which looks for relationships among different variables (either negative or positive), and Correlation, the strength of those relationships
  • Relationships may have many forms
  • Related linearly
  • Related curvilinearly
  • Related colinearly
  • Associations but not Correlations

49
Linear Regression
  • The Linear Regression model postulates that two
    random variables X and Y are related by a
    straight line as follows
  • Y = a + bX + e
  • Where
  • Y is the dependent variable
  • X is the independent variable
  • a is the intercept
  • b is the slope
  • e is the residual value

50
Linear Regression
  • Residual Value (e)
  • Given that the regression analysis procedure is
    itself a statistical approach, it is expected to
    have some degree of error associated with it
  • Thus we add a value called the residual value (e)
    to any regression equation to account for random
    errors in the process
  • Scatterplots
  • In order to perform regression analysis visually,
    it helps to graph the points on a scatterplot
  • A visual relationship can often be observed when
    looking at these plots

51
Method of Least Squares
  • So, assuming that 2 variables are linearly related, how do we best fit a line (the regression line) through a series of points on a scatterplot?
  • One way is to use a goodness-of-fit estimator called the Sum of Squares for Error (S), which we want to minimize

S = Σ over all i (yi - f(xi))², the sum of squared vertical deviations between each observed value yi and the fitted value f(xi)
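  • A minimal least-squares fit in Python (hypothetical points; np.polyfit minimizes S for a degree-1 polynomial):

    import numpy as np

    x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b, a = np.polyfit(x, y, deg=1)               # slope b and intercept a
    S = np.sum((y - (a + b * x)) ** 2)           # sum of squares for error
    print(a, b, S)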
52
Inferences Concerning Slopes
  • The initial question once we have a regression
    line is whether the data present sufficient
    evidence to indicate that Y increases or
    decreases linearly as X increases over the
    observed region?
  • So we use the variability of the points about the
    line to estimate this
  • Variance s² = S / (n - 2)
  • S = Sum of squares for error
  • n = Sample size

53
Inferences Concerning Slopes
  • Given that we can use S for estimating the variance about the line, we can formulate our hypothesis as a T-test on the slope as follows
  • Null Hypothesis: Ho: b = bo
  • Alternate Hypothesis: Ha: b < bo or b > bo
  • Test Statistic: t = (b - bo) / (s / sqrt(Sxx))
  • b = regression line slope
  • bo = slope to test against
  • s = estimated standard deviation about the line (sqrt of s²)
  • Sxx = Σ over all i (Xi - X̄)²
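  • A sketch of this slope test in Python, with bo = 0 and hypothetical data (scipy.stats.linregress returns the same slope and a two-tailed p-value directly):

    import numpy as np
    from scipy import stats

    x = np.array([1, 2, 3, 4, 5], dtype=float)   # hypothetical data
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    n = len(x)
    b, a = np.polyfit(x, y, 1)
    S = np.sum((y - (a + b * x)) ** 2)           # sum of squares for error
    s = np.sqrt(S / (n - 2))                     # estimated sd about the line
    Sxx = np.sum((x - x.mean()) ** 2)
    t = (b - 0) / (s / np.sqrt(Sxx))             # test Ho: b = 0
    p = 2 * (1 - stats.t.cdf(abs(t), df=n - 2))
    print(t, p)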

54
Inferences Concerning Slopes
  • So how do we do the T-test and reach a conclusion
    or calculate a P-value?
  • Well, the T-table has several features
  • df = Degrees of Freedom (n - 2 for a simple regression slope)
  • T-values listed for various significance levels
  • The procedure for using a T-Table is as follows
  • Compute the T-value using the statistic in your test
  • Look up the appropriate T-value in the table given your degrees of freedom
  • Then look up the column for whichever significance level it falls beyond; the P-value will be less than that significance level

55
Linear Regression
  • So, graphically what does it look like?

56
Other Regressions
  • Given the types of data you have, there are other
    methods for fitting the data to a geometric
    shape
  • For example, there is Curvilinear Regression
  • Cubic Spline Interpolation
  • Quadratic Interpolation
  • Higher Order Interpolation
  • Logistic Regression
  • This is useful when you have categorical (non-numeric) outcome data
  • For example, when you have a binomial random variable such as HTN (y/n), Gender (M/F) or Race

57
Correlation
  • As opposed to finding the best fit line through
    a set of data points, Correlation seeks to
    understand the strength of the relationship.
  • (Example scatterplots: R = 0.17, R = 0.85, R = -0.94)

58
Correlation
  • We compute the Pearson Product Moment Coefficient of Correlation (R) as follows
  • R = Sxy / sqrt(Sxx x Syy)
  • where
  • Sxy = Σ over all i (Xi - X̄)(Yi - Ȳ)
  • Sxx = Σ over all i (Xi - X̄)²
  • Syy = Σ over all i (Yi - Ȳ)²
  • -1 ≤ R ≤ 1; the larger |R| is, the stronger the correlation
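  • A sketch of this computation in Python with hypothetical data (np.corrcoef gives the same value directly):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # hypothetical pairs
    y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))
    Sxx = np.sum((x - x.mean()) ** 2)
    Syy = np.sum((y - y.mean()) ** 2)
    R = Sxy / np.sqrt(Sxx * Syy)
    print(R, np.corrcoef(x, y)[0, 1])            # the two values agree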

59
Multiple Linear Regression
  • So far we discussed how one variable is related
    to another in a study.
  • But in real life, a study typically has many variables that it is trying to compare as they relate to an outcome
  • Ex: CAD as f(HTN, DM, Smoking, Hyperchol., Obesity, Age)
  • In order to do this type of analysis, we can
    extend the general notion of linear regression to
    multiple variables.
  • We have an intercept as usual but partial slopes
    (or partial regression coefficients), each one
    representing a different variable

60
Multiple Linear Regression
  • The General Linear Model (GLM) is then stated as
    follows
  • Y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn + e
  • With the following assumptions
  • 1. Y is the response variable you wish to predict
  • 2. b0, b1, …, bn are unknown constants
  • 3. x1, x2, …, xn are independent predictor variables that are measured without error
  • 4. e is a random error that for any set of
    predictors is normally distributed
  • 5. The random errors associated with any pair of
    Y values are independent
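  • A minimal sketch of fitting such a model by least squares in Python (simulated, hypothetical data; dedicated statistical software adds inference on top of this):

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])  # intercept + 3 predictors
    true_b = np.array([1.0, 0.5, -2.0, 0.0])
    y = X @ true_b + rng.normal(scale=0.1, size=50)               # Y = Xb + e
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)                  # estimates of b0..b3
    print(coef)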

61
Multiple Linear Regression
  • Note that you can use qualitative (categorical)
    and quantitative variables in a GLM.
  • Categorical Variables look like
  • X1 = 1 if Group A, 0 if not Group A
  • Typically computing p-values and regression equations in a GLM is hard to do by hand, so most people will do it using computer software
  • SAS has a procedure called Proc GLM
  • SPSS/PC
  • StatSoft
  • HyperStat

62
Multiple Linear Regression
  • Problems that can occur when using GLM
  • Multicollinearity
  • This happens when 2 of the independent variables xi, xj are themselves related, and their joint occurrence in a model overestimates the true effect size
  • Such related variables are also known as covariates or Confounding Factors
  • Interaction Terms
  • When 2 variables in a model are co-related then
    we must add an interaction term to the model
  • For example, suppose you want to study the salary of a professor with respect to the number of years of service. Well, this may differ slightly depending on whether the professor is male or female.
  • Thus, the salary slope for males may be slightly higher than the salary slope for females despite the same number of years of service.
  • This type of relationship is called an Interaction (between gender and years of service, because the slope varies depending on whether a male or female is selected) and we must add a term of the type
  • Y = b0 + b1x1 + b2x2 + b3x1x2

63
Logistic Regression
  • What happens when you have data in the form of
    proportions (or frequency data) of categorical
    variables?
  • The tool for analysis of this type of data is
    called a Logistic Regression
  • It is based on the Chi-Square Distribution and
    the model is described as follows
  • ln[p/(1-p)] = a + BX + e, or equivalently p/(1-p) = exp(a) x exp(BX) x exp(e)
  • where
  • ln is the natural logarithm (log base e, where e ≈ 2.71828)
  • p is the probability that the event Y occurs, P(Y = 1)
  • p/(1-p) is the "odds"
  • ln[p/(1-p)] is the log odds, or "logit"
  • all other components of the model are the same
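  • A sketch of fitting a logistic model in Python on simulated data (scikit-learn is assumed here, one of several tools that will do this):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    x = rng.normal(size=(100, 1))                        # hypothetical predictor
    p_true = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))    # true logit model
    y = rng.binomial(1, p_true)                          # binary outcome, e.g. HTN (y/n)
    model = LogisticRegression().fit(x, y)
    print(model.intercept_, model.coef_)                 # estimates of a and B on the logit scale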

64
The ANalysis Of VAriance
  • Also known as ANOVA

65
ANOVA
  • Suppose you want to compare the mean
    reimbursement rates from 5 different health plans
  • You could do t-tests among all combinations of
    the 5 plans, or 10 t-tests
  • Suppose all the means are in fact equal. When this procedure is repeated 10 times, the probability of incorrectly concluding that at least one pair of means differ is quite high, and you may reach an erroneous decision
  • Thus we want one test which could compare means
    for all 5 groups at the same time
  • This is exactly what ANOVA provides

66
ANOVA
  • ANOVA is a powerful procedure which allows you to
    do 2 things
  • Compare the variance between the means of 2 or
    more groups
  • Compare the variance in data values within each
    group

67
ANOVA
  • ANOVA procedures can be done with different study
    designs
  • Completely Randomized Design
  • Random samples are independently selected from
    each of k populations.
  • Assumes that the data is homogeneously
    distributed with a fixed variation
  • Randomized Block Design
  • Assumes that subsets of the population have
    different variances
  • Within each subset, however, the variability is
    the same
  • Each subset is called a block.
  • Random samples are then taken from each block

68
ANOVA for Completely Randomized Designs
  • Suppose we want to compare k population means μ1, …, μk based on random independent samples of n1, …, nk observations selected from populations 1, …, k respectively
  • Ex: Suppose we have 10 observations of reimbursement figures from each of 5 health plans; then we will have 50 total values
  • Then let
  • Xij represent the jth measurement in the ith group
  • We define an entity called the Total Sum of Squares (SS) as follows
  • Total SS = Sxx = Σ (i = 1 to k) Σ (j = 1 to ni) (xij - x̄)², where x̄ is the overall mean

69
ANOVA for Completely Randomized Designs
  • It can be shown that the Total SS (the sum of squares of deviations of all 50 values about the overall mean) can be partitioned into 2 components
  • SST = Sum of Squares for Treatments (measures variation among the sample means)
  • SSE = Sum of Squares for Error (measures variation within samples)
  • We have
  • Total SS = SST + SSE

70
ANOVA for Completely Randomized Designs
  • Now, we can also compute SSE readily and it is
  • SSE = Σ (j = 1 to n1) (x1j - x̄1)² + Σ (j = 1 to n2) (x2j - x̄2)² + … + Σ (j = 1 to nk) (xkj - x̄k)²
  • Knowing SSE and Total SS, we can compute SST
  • We then compute the Mean Squares of these as follows
  • MST = SST / (k - 1)
  • MSE = SSE / (n - k)
  • The final step is to compute an F-statistic as follows
  • F = MST / MSE
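  • A sketch of these computations in Python with hypothetical groups (scipy.stats.f_oneway returns the same F):

    import numpy as np
    from scipy import stats

    groups = [np.array([10.0, 12, 11, 13]),      # hypothetical reimbursement data
              np.array([14.0, 15, 13, 16]),
              np.array([9.0, 8, 10, 11])]
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = np.concatenate(groups).mean()
    SST = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    SSE = sum(((g - g.mean()) ** 2).sum() for g in groups)
    F = (SST / (k - 1)) / (SSE / (n - k))        # MST / MSE
    print(F, stats.f_oneway(*groups).statistic)  # the two values agree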

71
ANOVA for Completely Randomized Designs
  • Now, F-tests have 2 degrees of freedom, v1 and v2
  • In the case of ANOVA,
  • v1 = k - 1
  • v2 = n - k
  • We can then do our usual hypothesis testing using this F-statistic as our test
  • Ho: μ1 = μ2 = μ3 = … = μk
  • Ha: One or more pairs of population means differ
  • F-Statistic: MST/MSE with df v1 = k - 1, v2 = n - k
  • Rejection Region: Reject Ho if F > Fα (found from the table using v1, v2 and α)

72
ANOVA for Randomized Block Designs
  • The computational steps are very similar to those of a completely randomized design except that we add a third term, the sum of squares for BLOCKS (with b blocks)
  • Total SS = SST + SSE + SSB
  • We then perform 2 different hypothesis tests
  • (1) For comparing Treatment Means
  • F = MST/MSE, v1 = k - 1, v2 = n - b - k + 1
  • (2) For comparing BLOCK Means
  • F = MSB/MSE, v1 = b - 1, v2 = n - b - k + 1

73
Nonparametric Statistics
  • Analysis of Ranked Data

74
Nonparametric Statistics
  • What do we do when we have opinion data?
  • For example, suppose a judge is employed to evaluate and rank the sales abilities of 4 salesmen, the edibility of 5 brands of Corn Flakes, or the relative appeal of 5 brands of automobiles
  • Clearly it is impossible to give an exact measure of sales competence, the palatability of food or design appeal
  • But it is possible to rank the salespeople, food or design choices based on our own opinions
  • Many, many types of studies in medicine use this kind of data gathering (patient satisfaction is one example)

75
Nonparametric Statistics
  • There are many tests available for studying this
    kind of data
  • The Sign Test
  • The Mann-Whitney U Test
  • The Wilcoxon Signed-Rank Test for a Paired
    Experiment
  • The Kruskal-Wallis H Test for Completely
    Randomized Designs
  • The Friedman Fr Test for Randomized Block Designs
  • Spearman's Rank Correlation Test

76
The Sign Test
  • Compares 2 populations with respect to how they
    differ in the responses to qualitative questions
  • Compute the number of responses that were the
    same
  • Then compute the number of responses that
    differed
  • Finally compute X, the number of times a response from population A was greater than the response from population B
  • This gives you the number of times (A - B) is positive (i.e. has a positive sign, hence the name)
  • This is your test statistic
  • You then use a Binomial Probability Distribution
    to do a hypothesis test
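  • A sketch of the binomial step in Python (hypothetical counts; scipy.stats.binomtest requires SciPy 1.7 or later):

    from scipy import stats

    x, n = 9, 12                            # hypothetical: A > B in 9 of 12 non-tied pairs
    result = stats.binomtest(x, n, p=0.5)   # Ho: P(A > B) = 0.5
    print(result.pvalue)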

77
Mann-Whitney U Test
  • Analogous to the T-test, for nonparametric data
  • Suppose you have 2 populations from which 2 samples of sizes n1 and n2 are obtained
  • You rank all (n1 + n2) observations together in ascending order, assigning rank values 1, 2, 3, …
  • Tied observations are handled by averaging the ranks that would have been assigned to the tied observations
  • Then calculate the sum of the ranks T1 and T2 for
    both of the samples

78
Mann-Whitney U Test
  • Now compute the U statistic as follows
  • U1 = n1n2 + n1(n1 + 1)/2 - T1
  • U2 = n1n2 + n2(n2 + 1)/2 - T2
  • Look up the appropriate α value in the table given n2
  • The Table will give you a value for Uo on the left hand side corresponding to your n1
  • Your computed U (the smaller of U1 and U2) should be less than the Uo stated in the table in order to reject the Null hypothesis (that the population relative frequency distributions are identical)
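  • In practice the ranking, the U statistic and the p-value can be obtained in one call; a sketch with hypothetical samples:

    from scipy import stats

    sample1 = [3.1, 4.5, 2.8, 5.0, 3.9]    # hypothetical ratings
    sample2 = [4.8, 5.5, 6.1, 5.9, 4.9]
    u, p = stats.mannwhitneyu(sample1, sample2, alternative='two-sided')
    print(u, p)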

79
Wilcoxon Signed Rank Test
  • Similar to the Mann-Whitney U Test
  • Allows you to compare paired differences
  • Given n pairs of observations from populations A and B, compute the paired differences (xA - xB) for each pair of values
  • Rank the positive differences and the negative differences separately
  • Compute the sums T+ and T- of these rankings
  • For a one-tailed test, use T- (or T+, depending on the direction of Ha); for a two-tailed test, use the smaller of T+ and T-
  • Reject Ho if T ≤ To (the critical value obtained from the Wilcoxon Table, given n and α)
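  • A sketch with hypothetical paired measurements (scipy.stats.wilcoxon ranks the differences and computes the test internally):

    from scipy import stats

    before = [140, 150, 138, 160, 155, 148]   # hypothetical paired data
    after  = [135, 145, 140, 150, 148, 142]
    w, p = stats.wilcoxon(before, after)      # signed-rank test on the differences
    print(w, p)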

80
Other Nonparametric Tests
  • Kruskal-Wallis H Test
  • Just as the Mann-Whitney U Test is the nonparametric alternative to the Student's T-Test for comparing population means, the Kruskal-Wallis H Test is the nonparametric alternative to ANOVA for a completely randomized design, and is used to detect differences in location among more than 2 population distributions based on independent random sampling
  • Friedman Fr Test for Randomized Block Designs
  • Is a nonparametric test for comparing the
    distributions of measurements for k treatments
    laid out in b blocks using a randomized block
    design

81
Test of Association
  • Spearman's Rank Correlation Test
  • Tests whether there is an association between 2 populations
  • Assume n pairs (xi, yi) of observations from 2 populations X, Y
  • Rank each of the xi and yi in ascending order
  • Compute, using the ranks,
  • Rs = Sxy / sqrt(Sxx x Syy)
  • Then given n and α, look up Ro in the Spearman Table
  • Reject Ho (no association) if Rs > Ro or Rs < -Ro
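  • A sketch with hypothetical pairs (scipy.stats.spearmanr ranks the data and computes Rs and a p-value):

    from scipy import stats

    x = [86, 97, 99, 100, 101, 103, 106, 110, 112, 113]   # hypothetical observations
    y = [2, 20, 28, 27, 50, 29, 7, 17, 6, 12]
    rs, p = stats.spearmanr(x, y)
    print(rs, p)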

82
Survival Analysis
83
Introduction
  • There are many clinical studies that address the
    question of time to an event
  • For example, we often want to know: given risk factors, what is a patient's chance of an MI? (i.e. time to MI)
  • This type of time-to-event data is typically censored (the event has not yet occurred for some subjects by the end of the study)
  • Survival Analysis seeks to study this type of
    question

84
Life Tables
  • The most straightforward way is to compute a data structure known as a Life Table
  • The entire lifetime of a study object is divided into intervals of specified length
  • For each interval, the number of subjects surviving or dying within that interval is determined and plotted
  • Based on this number, we can compute several
    types of statistics
  • Numbers of cases at risk
  • Proportion Failing or Proportion Surviving
  • Probability Density or Hazard Rate
  • Median Survival Time
  • Required Sample Sizes

85
Survival Analysis
  • Although life tables give us a good estimate of
    the risk of adverse events, it is desirable to
    understand the underlying survival function
    algorithmically for prediction purposes
  • The three distributions proposed for this are
    the
  • Exponential (linear exponential) distribution
  • Weibull Distribution
  • Gompertz Distribution
  • The parameter estimation procedure is then a
    modified version of the least-squares model
  • And the statistic used to study it is an
    incremental Chi-Square Statistic

86
Kaplan-Meier Product Limit Estimator
  • Rather than classify the survival into a
    life-table, the KM estimator computes a survival
    function directly from continuous survival or
    failure times
  • Imagine creating a life table with exactly one
    observation for each interval
  • Then we avoid the effect of grouping
    observations together into interval categories
  • Then S(t) = Π over j of ((n - j)/(n - j + 1))^δ(j)
  • n = total number of observations
  • δ(j) = 1 if the jth ordered case is an event (uncensored), 0 if censored
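  • A minimal sketch of the product-limit computation in Python (hypothetical data; events are placed before censored cases at tied times):

    import numpy as np

    times = np.array([2, 3, 3, 5, 6, 8, 9, 12])    # hypothetical survival times
    event = np.array([1, 1, 0, 1, 0, 1, 1, 0])     # 1 = event observed, 0 = censored
    n = len(times)
    order = sorted(zip(times, event), key=lambda te: (te[0], -te[1]))
    S = 1.0
    for j, (t, d) in enumerate(order, start=1):
        if d == 1:                                 # S(t) drops only at event times
            S *= (n - j) / (n - j + 1)
        print(t, round(S, 3))                      # estimated S(t) after each ordered case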

87
Comparing Survival Times
  • Often we wish to compare survival times in 2 or
    more populations
  • There are several tests available for this
    purpose
  • Gehan's Generalized Wilcoxon Test
  • Cox-Mantel Test
  • Cox's F-Test
  • Log-Rank Test
  • Peto and Peto's Wilcoxon Test
  • These are mostly nonparametric tests that
    generate Z-values for comparing means

88
Regression Models
  • We also want to be able to predict survival time
    given some independent risk factors
  • This is very common in the medical literature
  • The regression test of choice is the
    Cox-Proportional Hazards Model
  • The model is written as
  • h(t, z1, z2, …, zm) = h0(t) x exp(b1z1 + … + bmzm)
  • where h(t, …) denotes the resultant hazard, given the values of the m covariates for the respective case (z1, z2, …, zm) and the respective survival time (t). The term h0(t) is called the baseline hazard; it is the hazard for the respective individual when all independent variable values are equal to zero. We can linearize this model by dividing both sides of the equation by h0(t) and then taking the natural logarithm of both sides:
  • log[h(t, z…)/h0(t)] = b1z1 + … + bmzm
  • We now have a fairly "simple" linear model that can be readily estimated
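  • A sketch of fitting this model in Python, assuming the third-party lifelines package and hypothetical data:

    import pandas as pd
    from lifelines import CoxPHFitter

    df = pd.DataFrame({
        'time':  [5, 6, 6, 2, 4, 4, 9, 11],        # hypothetical survival times
        'event': [1, 0, 1, 1, 1, 0, 1, 1],         # 1 = event observed, 0 = censored
        'age':   [60, 65, 70, 55, 62, 58, 71, 66], # a single covariate z1
    })
    cph = CoxPHFitter()
    cph.fit(df, duration_col='time', event_col='event')
    print(cph.params_)                             # estimated b coefficients (log hazard ratios)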

89
Useful Links
  • http://hesweb1.med.virginia.edu/biostat/teaching/handouts.html
  • http://stat.tamu.edu/stat30x/notes/trydouble2.html
  • http://www.statsoft.com/textbook/stathome.html
  • http://davidmlane.com/hyperstat/index.html
  • http://members.aol.com/johnp71/javastat.html
  • http://www.helsinki.fi/~jpuranen/links.html
  • http://ubmail.ubalt.edu/~harsham/statistics/REFSTAT.HTM#rgenRes
  • http://trochim.human.cornell.edu/kb/index.htm

90
Questions
  • Thank You