Title: Design and Analysis of Clinical Study 9. Analysis of Cross-sectional Study
- Dr. Tuan V. Nguyen
- Garvan Institute of Medical Research
- Sydney, Australia
2Overview
- Estimate of prevalence
- Analysis of difference between two proportions
- Analysis of difference among proportions: Chi-square
- Analysis of difference between two means
- Analysis of association I: simple linear regression analysis
- Analysis of association II: multiple regression analysis
3Prevalence of Disease
- Prevalence is NOT incidence
- Measures the number of people in a population who have the disease at a given point in time.
- This measure has been called point prevalence, in contrast to period prevalence (infrequently used), which adds the cases existing at the start of a time period to the new cases that occur during the period.
- A measure of disease status, disease burden
- In contrast to incidence, which measures disease onset events
4
[Figure: subjects 1 to 5 plotted against a time axis, with time point T marked.]
Prevalence: at time T, 2 out of 5 subjects had the disease, so P = 2/5 = 0.4
5Sampling Variability in Prevalence
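- Prevalence in the population ($\pi$) is UNKNOWN
- Sample prevalence ($p$) is an unbiased estimate of $\pi$.
- $x$: number of diseased individuals in the sample
- $p$: sample prevalence
- $N$: sample size
- Estimates
- $p = x/N$
- variance of $p$: $\mathrm{var}(p) = p(1-p)/N$
- standard error of $p$: $\mathrm{SE}(p) = \sqrt{p(1-p)/N}$
- 95% CI of $p$: $p \pm 1.96 \times \mathrm{SE}(p)$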
6An Example of Calculation of Prevalence
- The prevalence of ABO hemolytic disease in a population is 43 out of 3584 subjects.
- So, the estimated prevalence is
- $p = 43/3584 = 0.0120$
- Standard error of the prevalence
- $\mathrm{SE}(p) = \sqrt{0.0120 \times (1 - 0.0120)/3584} = 0.0018$
- 95% confidence interval
- $0.0120 \pm (1.96 \times 0.0018)$, i.e. approximately 0.008 to 0.016.
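The calculation on this slide can be reproduced in a few lines of R. This is a minimal sketch assuming the usual normal-approximation (Wald) interval; the variable names are illustrative and not taken from the slides.

```r
# Prevalence, standard error and Wald 95% CI from a count (sketch)
x <- 43      # diseased individuals in the sample
N <- 3584    # sample size

p  <- x / N                                # sample prevalence
se <- sqrt(p * (1 - p) / N)                # standard error of p
ci <- p + c(-1, 1) * qnorm(0.975) * se     # p +/- 1.96 * SE

print(round(c(prevalence = p, se = se, lower = ci[1], upper = ci[2]), 4))
```

For a prevalence this close to zero an exact or score interval (for example from `binom.test(x, N)` or `prop.test(x, N)`) may be preferable; the Wald form is used here only because it matches the slide.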
7Test for Difference Between Two Proportions
Vietnam Australia
N 700 1287
Osteoporosis 148 345
Prevalence 0.211 0.268
Variance (s2) 0.000238 0.000152
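- $p_1$: proportion for group 1
- $p_2$: proportion for group 2
- $N_1$: sample size for group 1
- $N_2$: sample size for group 2
- $d = p_1 - p_2$
- variance of $d$
- $s_d^2 = s_1^2 + s_2^2 = \dfrac{p_1(1-p_1)}{N_1} + \dfrac{p_2(1-p_2)}{N_2}$
- z-test: $z = d / s_d$
Worked example (Australia minus Vietnam):
d = 0.268 - 0.211 = 0.057
variance of d: s2 = 0.000238 + 0.000152 = 0.000391 (from unrounded variances)
z-test: z = 0.057 / sqrt(0.000391) = 2.87 (from unrounded inputs). Significant!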
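The same z-test can be sketched in R from the counts in the table. The unpooled variance (sum of the two binomial variances) is used because that is what the slide adds up; the object names are illustrative.

```r
# Two-proportion z-test with unpooled variance (sketch)
x <- c(Vietnam = 148, Australia = 345)    # osteoporosis cases
n <- c(Vietnam = 700, Australia = 1287)   # sample sizes

p <- x / n                                # prevalence in each group
v <- p * (1 - p) / n                      # variance of each prevalence

d <- unname(p["Australia"] - p["Vietnam"])  # difference in prevalence
z <- d / sqrt(sum(v))                       # z statistic
p_value <- 2 * pnorm(-abs(z))               # two-sided p-value

print(round(c(d = d, var_d = sum(v), z = z, p_value = p_value), 5))
```

`prop.test(x, n)` gives a closely related test, but it pools the variance under the null hypothesis and applies a continuity correction by default, so its statistic will not match the slide exactly.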
8Test for Difference Among Proportions
Observed frequencies: marital status (rows) by caffeine consumption (columns)

Marital status    None   1-150   151-300   300-900   Total
Married            652    1537       598       242    3029
Divorced            36      46        38        21     141
Single             218     327       106        67     718
Total              906    1910       742       330    3888

Row proportions (each count divided by its row total):
Married:  652/3029 = 0.22, 1537/3029 = 0.51, 598/3029 = 0.20, 242/3029 = 0.08
Divorced: 36/141 = 0.26, 46/141 = 0.33, 38/141 = 0.27, 21/141 = 0.15
Single:   218/718 = 0.30, 327/718 = 0.46, 106/718 = 0.15, 67/718 = 0.09
Total:    906/3888 = 0.23, 1910/3888 = 0.49, 742/3888 = 0.19, 330/3888 = 0.08

In percent (row)

Marital status    None   1-150   151-300   300-900   Total
Married             22      51        20         8     100
Divorced            26      33        27        15     100
Single              30      46        15         9     100
Total               23      49        19         8     100
9Test for Difference Among Proportions
Observed frequencies: marital status (rows) by caffeine consumption (columns)

Marital status    None   1-150   151-300   300-900   Total
Married            652    1537       598       242    3029
Divorced            36      46        38        21     141
Single             218     327       106        67     718
Total              906    1910       742       330    3888

Expected frequency = row total x column total / grand total:
Married:  3029/3888 x 906 = 705.8, 3029/3888 x 1910 = 1488, 3029/3888 x 742 = 578.1, 3029/3888 x 330 = 257.1
Divorced: 141/3888 x 906 = 32.9, 141/3888 x 1910 = 69.3, 141/3888 x 742 = 26.9, 141/3888 x 330 = 12.0
Single:   718/3888 x 906 = 167.3, 718/3888 x 1910 = 352.7, 718/3888 x 742 = 137.0, 718/3888 x 330 = 60.9

Expected frequencies

Marital status    None    1-150   151-300   300-900   Total
Married          705.8   1488.0     578.1     257.1    3029
Divorced          32.9     69.3      26.9      12.0     141
Single           167.3    352.7     137.0      60.9     718
Total              906     1910       742       330    3888
10Test for Difference Among Proportions
Observed (O) and expected (E) frequencies by caffeine consumption

Marital status         None     1-150   151-300   300-900
Married     O           652      1537       598       242
            E       (705.8)    (1488)   (578.1)   (257.1)
Divorced    O            36        46        38        21
            E        (32.9)    (69.3)    (26.9)    (12.0)
Single      O           218       327       106        67
            E       (167.3)   (352.7)   (137.0)    (60.9)

(652 - 705.8)^2 / 705.8 = 4.11, (1537 - 1488)^2 / 1488 = 1.61, and so on for each cell.

(O - E)^2 / E

Marital status    None   1-150   151-300   300-900   Total
Married           4.11    1.61      0.69      0.89    7.30
Divorced          0.30    7.82      4.57      6.82   19.51
Single           15.36    1.88      7.02      0.60   24.86
Total            19.77   11.31     12.28      8.31   51.66

Chisq = 51.66; df = (4 - 1) x (3 - 1) = 3 x 2 = 6; critical X2 = 12.59 for alpha = 0.05.
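The expected frequencies and the chi-square statistic can be checked in R. The sketch below builds the statistic by hand, in the same order as the slides, and then cross-checks it against the built-in `chisq.test()`. The column labels are copied from the slide header; the object names are illustrative.

```r
# Chi-square test of marital status by caffeine consumption (sketch)
obs <- matrix(c(652, 1537, 598, 242,
                 36,   46,  38,  21,
                218,  327, 106,  67),
              nrow = 3, byrow = TRUE,
              dimnames = list(
                marital  = c("Married", "Divorced", "Single"),
                caffeine = c("None", "1-150", "151-300", "300-900")))

# Expected frequency = row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

chisq    <- sum((obs - expected)^2 / expected)    # sum of (O - E)^2 / E
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)     # (rows - 1) * (cols - 1)
critical <- qchisq(0.95, df)                      # critical value, alpha = 0.05
p_value  <- pchisq(chisq, df, lower.tail = FALSE)

print(round(expected, 1))
print(c(chisq = chisq, df = df, critical = critical, p_value = p_value))

# Cross-check with the built-in test
builtin <- chisq.test(obs)
print(builtin)
```

Because the observed statistic is far above the critical value, the hand calculation and `chisq.test()` lead to the same conclusion: caffeine consumption differs across marital status groups.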
11Normal Distribution
Distribution of height in Vietnamese women, with a mean of 156 cm and a standard deviation of 4.6 cm. The horizontal axis is height and the vertical axis is the probability for each height.
12Application of the Normal Distribution
- The serum cholesterol levels of Californian children have a mean of 175 mg/100ml and a standard deviation of 30 mg/100ml. The distribution of the cholesterol levels is normal.
- 95% of the children should have cholesterol levels between 175 ± (1.96 x 30), i.e. between 116 and 234 mg/100ml.
- If we let X be the cholesterol level for any child, then X can be converted to a variable with mean = 0 and SD = 1:
- $Z = (X - 175) / 30$
[Figure: normal curve on the mg/100ml scale marked at 116, 175 and 234, with the matching Z scale marked at -1.96, 0 and 1.96; both tails are labelled "Abnormal?".]
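The 95% range and the Z conversion can be verified with R's normal distribution functions. This is a small sketch; the helper `to_z()` is a hypothetical name introduced only for illustration.

```r
# Normal range for cholesterol and the Z transformation (sketch)
mu    <- 175   # mean, mg/100ml
sigma <- 30    # standard deviation, mg/100ml

# Central 95% range: mu +/- 1.96 * sigma
limits <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)

# Hypothetical helper: convert a cholesterol value to a standard normal score
to_z <- function(x) (x - mu) / sigma

z_limits <- to_z(limits)                            # about -1.96 and 1.96
coverage <- pnorm(z_limits[2]) - pnorm(z_limits[1]) # probability inside the range

print(round(limits, 1))
print(round(z_limits, 2))
print(coverage)
```

Values outside `limits` correspond to |Z| > 1.96, which is the region the figure flags as possibly abnormal.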
13Two-group comparison unpaired t-test
               Group 1   Group 2
               x11       x21
               x12       x22
               x13       x23
               x14       x24
               x15       x25
               ...       ...
               x1n       x2n
Sample size    n1        n2
Mean           x̄1        x̄2
SD             s1        s2

Mean difference: $D = \bar{x}_1 - \bar{x}_2$
Variance of D: $s_D^2$
T-statistic: $t = D / s_D$
95% confidence interval: $D \pm t_{0.975} \times s_D$
14Two-group comparison an example
         A       B
         100     122
         108     130
         119     138
         127     142
         132     152
         135     154
         136     176
         164
N        8       7
Mean     127.6   144.9
SD       19.6    17.8

Mean difference: d = 127.6 - 144.9 = -17.3
Variance of D
T-statistic
95% confidence interval
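The remaining quantities for this example can be computed in R. The slide's own formula for the variance of D did not survive extraction, so the sketch below assumes the unpooled form, var(D) = s1^2/n1 + s2^2/n2, which is what `t.test()` uses by default (Welch's test). The pooled-variance version is shown alongside for comparison.

```r
# Unpaired two-group comparison for groups A and B (sketch)
a <- c(100, 108, 119, 127, 132, 135, 136, 164)
b <- c(122, 130, 138, 142, 152, 154, 176)

d <- mean(a) - mean(b)                             # mean difference
# ASSUMPTION: unpooled variance of the difference (Welch form)
var_d  <- var(a) / length(a) + var(b) / length(b)
t_stat <- d / sqrt(var_d)

welch  <- t.test(a, b)                    # unequal variances (R default)
pooled <- t.test(a, b, var.equal = TRUE)  # classical pooled-variance t-test

print(round(c(mean_a = mean(a), mean_b = mean(b),
              sd_a = sd(a), sd_b = sd(b), d = d, t = t_stat), 2))
print(welch$conf.int)    # 95% CI, Welch
print(pooled$conf.int)   # 95% CI, pooled
```

With group sizes and standard deviations this similar, the two versions give nearly the same t-statistic and interval, so the choice matters little here.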
15Analysis of Correlation
ID   Age   Chol (mg/ml)
1    46    3.5
2    20    1.9
3    52    4.0
4    30    2.6
5    57    4.5
6    25    3.0
7    28    2.9
8    36    3.8
9    22    2.1
10   43    3.8
11   57    4.1
12   33    3.0
13   22    2.5
14   63    4.6
15   40    3.2
16   48    4.2
17   28    2.3
18   49    4.0
16Variance, Covariance and Correlation Theory
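- Let x and y be two random variables from a sample of n observations.
- Measure of variability of x and y: variance
- $\mathrm{var}(x) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$, and likewise for $\mathrm{var}(y)$
- Measure of covariation between x and y: covariance
- $\mathrm{cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$
- Coefficient of correlation (r)
- $r = \dfrac{\mathrm{cov}(x, y)}{s_x \, s_y}$, where $s_x$ and $s_y$ are the standard deviations of x and y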
17Positive and Negative Correlation
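[Figure: two scatter plots, one with r = 0.9 (positive correlation) and one with r = -0.9 (negative correlation).]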
18Test of Hypothesis of Correlation
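- Hypothesis: Ho: r = 0 versus Ha: r not equal to 0.
- Step 1: Fisher's z-transformation: $z = \dfrac{1}{2} \ln\dfrac{1+r}{1-r}$
- Step 2: calculate the standard error of z: $\mathrm{SE}(z) = \dfrac{1}{\sqrt{n-3}}$
- Step 3: calculate the t-statistic: $t = z / \mathrm{SE}(z)$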
19An Example of Correlation Analysis
- ID Age Cholesterol
- (x) (y mg/100ml)
- 1 46 3.5
- 2 20 1.9
- 3 52 4.0
- 4 30 2.6
- 5 57 4.5
- 6 25 3.0
- 7 28 2.9
- 8 36 3.8
- 9 22 2.1
- 10 43 3.8
- 11 57 4.1
- 12 33 3.0
- 13 22 2.5
- 14 63 4.6
- 15 40 3.2
- 16 48 4.2
- 17 28 2.3
- 18 49 4.0
Cov(x, y) = 10.68
t-statistic = 1.71 / 0.26 = 6.6
Critical t-value with 17 df and alpha = 5% is 2.11
Conclusion: There is a significant association between age and cholesterol.
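The steps of the correlation analysis can be traced in R using the 18 age and cholesterol pairs listed earlier in the deck. This is a sketch with illustrative object names; `cor.test()` is included as a cross-check, and it uses the exact t-test on n - 2 degrees of freedom rather than Fisher's transformation.

```r
# Correlation between age and cholesterol via Fisher's z (sketch)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
n <- length(age)

cov_xy <- cov(age, chol)                   # sample covariance
r      <- cov_xy / (sd(age) * sd(chol))    # correlation coefficient

z_r    <- 0.5 * log((1 + r) / (1 - r))     # Step 1: Fisher's z (same as atanh(r))
se_z   <- 1 / sqrt(n - 3)                  # Step 2: standard error of z
t_stat <- z_r / se_z                       # Step 3: test statistic

print(round(c(cov = cov_xy, r = r, z = z_r, se = se_z, t = t_stat), 3))

# Cross-check: exact t-test for a correlation coefficient
print(cor.test(age, chol))
```

Both routes compare the statistic against a critical value near 2.1, so they agree on whether the association is significant.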
20Simple Linear Regression Analysis
- Only two variables are of interest: one response variable and one predictor variable
- No adjustment is needed for confounding or covariates
- Assessment
- Quantify the relationship between two variables
- Prediction
- Make predictions and validate a test
- Control
- Adjust for confounding effects (in the case of multiple variables)
21Linear Regression Model
- Y: random variable representing a response
- X: random variable representing a predictor variable (predictor, risk factor)
- Both Y and X can be a categorical variable (e.g., yes / no) or a continuous variable (e.g., age).
- If Y is categorical, the model is a logistic regression model; if Y is continuous, a simple linear regression model.
- Model
- $Y = \alpha + \beta X + \varepsilon$
- $\alpha$: intercept
- $\beta$: slope / gradient
- $\varepsilon$: random error (variation between subjects in y even if x is constant, e.g., variation in cholesterol for patients of the same age.)
22Linear Regression Assumptions
- The relationship is linear in terms of the parameters
- X is measured without error
- The values of Y are independent of each other (e.g., Y1 is not correlated with Y2)
- The random error term ($\varepsilon$) is normally distributed with mean 0 and constant variance.
- If the assumptions are tenable, then
- The expected value of Y is $E(Y \mid x) = \alpha + \beta x$
- The variance of Y is $\mathrm{var}(Y) = \mathrm{var}(\varepsilon) = \sigma^2$
23Estimation of Model Parameters
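- Given two points A(x1, y1) and B(x2, y2) in a two-dimensional space, we can derive an equation connecting the points.
- Gradient: $m = \dfrac{dy}{dx} = \dfrac{y_2 - y_1}{x_2 - x_1}$
- Equation: $y = mx + a$
- What happens if we have more than 2 points?
[Figure: straight line through A(x1, y1) and B(x2, y2) on x-y axes, showing the rise dy, the run dx and the intercept a.]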
24Method of Least Squares
- For a series of pairs (x1, y1), (x2, y2), (x3, y3), ..., (xn, yn)
- Let a and b be sample estimates for the parameters $\alpha$ and $\beta$,
- We have a sample equation $\hat{Y} = a + bx$
- Aim: find the values of a and b so that $(Y - \hat{Y})$ is minimal.
- Let SSE = sum of $(Y_i - a - b x_i)^2$.
- Values of a and b that minimise SSE are called least squares estimates.
25Criteria of Estimation
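[Figure: scatter plot of Chol against Age with a fitted line; d is the distance between an observed yi and the line.]
The goal of the least squares estimator (LSE) is to find a and b such that the sum of d^2 is minimal.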
26Least squares Estimates
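- After some calculus operations, the results can be shown to be
- $b = \dfrac{S_{xy}}{S_{xx}}$ and $a = \bar{y} - b\bar{x}$
- where $S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})$ and $S_{xx} = \sum (x_i - \bar{x})^2$
- When the regression assumptions are valid, the estimators of $\alpha$ and $\beta$ have the following properties:
- Unbiased
- Uniformly minimal variance (i.e., efficient)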
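The closed-form least squares estimates can be checked against `lm()` using the age and cholesterol data from the correlation example. This is a sketch; `sxx` and `sxy` are illustrative names for the usual corrected sums of squares and cross-products.

```r
# Least squares estimates by hand versus lm() (sketch)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

sxx <- sum((age - mean(age))^2)                    # sum of squares of x
sxy <- sum((age - mean(age)) * (chol - mean(chol)))  # sum of cross-products

b <- sxy / sxx                      # slope
a <- mean(chol) - b * mean(age)     # intercept

# Sum of squared errors at the least squares solution
sse <- sum((chol - a - b * age)^2)

fit <- lm(chol ~ age)               # same estimates from R's fitting routine
print(c(a = a, b = b, sse = sse))
print(coef(fit))
```

Any other choice of a and b gives a larger SSE, which is what "least squares" means in practice.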
27Goodness-of-fit
- Now, we have the equation $\hat{Y} = a + bX$
- Question: how well does the regression equation describe the actual data?
- Answer: the coefficient of determination (R2), the proportion of the variation in Y that is explained by the variation in X.
28Partitioning of variations geometry
[Figure: scatter plot of Chol (Y) against Age (X) with the regression line and the mean of Y, showing the distances that make up SST, SSR and SSE.]
- SST: sum of squared differences between yi and the mean of y.
- SSR: sum of squared differences between the predicted value of y and the mean of y.
- SSE: sum of squared differences between the observed and predicted values of y.
- SST = SSR + SSE
- The coefficient of determination is R2 = SSR / SST
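The partition of variation can be verified numerically on the age and cholesterol data. The sketch below computes the three sums of squares from a fitted model and confirms that SSR / SST matches the R-squared reported by `summary()`.

```r
# Partition of variation: SST = SSR + SSE (sketch)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)

fit <- lm(chol ~ age)

sst <- sum((chol - mean(chol))^2)         # total variation
ssr <- sum((fitted(fit) - mean(chol))^2)  # variation explained by the regression
sse <- sum(residuals(fit)^2)              # unexplained (residual) variation

r2 <- ssr / sst                           # coefficient of determination

print(round(c(SST = sst, SSR = ssr, SSE = sse, R2 = r2), 4))
print(summary(fit)$r.squared)
```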
29Linear Regression Analysis by R
age <- c(46,20,52,30,57,25,28,36,22,43,57,33,22,63,40,48,28,49)
chol <- c(3.5,1.9,4.0,2.6,4.5,3.0,2.9,3.8,2.1,3.8,4.1,3.0,2.5,4.6,3.2,4.2,2.3,4.0)
lipid <- data.frame(age, chol)
# Fit chol as a linear function of age; data = lipid replaces attach(lipid)
results <- lm(chol ~ age, data = lipid)
summary(results)

Residuals:
     Min       1Q   Median       3Q      Max
-0.40729 -0.24133 -0.04522  0.17939  0.63040

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154
age         0.057788   0.005399  10.704 1.06e-08
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3027 on 16 degrees of freedom
Multiple R-Squared: 0.8775, Adjusted R-squared: 0.8698
F-statistic: 114.6 on 1 and 16 DF, p-value: 1.058e-08
30Interpretation of Model Estimates
- Cholesterol = 1.089 + 0.0578(Age)

            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.089218   0.221466   4.918 0.000154
age         0.057788   0.005399  10.704 1.06e-08

- Interpretation: Cholesterol increases by 0.0578 mg/ml for each one-year increase in age. The association between age and cholesterol is statistically significant (p = 1.06e-08).
- Adjusted R-squared = 0.8698
- Interpretation: Variation in age explained about 87% of the variation in cholesterol.
31Prediction
plot(chol ~ age)
abline(results)
Regression line: Chol = 1.089 + 0.0578(Age)
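Beyond drawing the line, the fitted model can be used to predict cholesterol at specific ages. The sketch below uses `predict()` with a prediction interval; the two ages (30 and 50) are arbitrary illustrative choices, not values from the slides.

```r
# Predicting cholesterol at new ages from the fitted line (sketch)
age  <- c(46, 20, 52, 30, 57, 25, 28, 36, 22, 43, 57, 33, 22, 63, 40, 48, 28, 49)
chol <- c(3.5, 1.9, 4.0, 2.6, 4.5, 3.0, 2.9, 3.8, 2.1, 3.8,
          4.1, 3.0, 2.5, 4.6, 3.2, 4.2, 2.3, 4.0)
lipid <- data.frame(age, chol)

fit <- lm(chol ~ age, data = lipid)

# Illustrative ages at which to predict (an assumption for this example)
new_ages <- data.frame(age = c(30, 50))

# Point predictions with 95% prediction intervals for an individual
pred <- predict(fit, newdata = new_ages, interval = "prediction")
print(cbind(new_ages, round(pred, 2)))
```

Predictions are only trustworthy within the observed age range (20 to 63 here); extrapolating beyond it relies on the linear form holding where there are no data.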
32Checking Assumptions
par(mfrow = c(2, 2))   # 2 x 2 grid for the four diagnostic plots
plot(results)
33The Importance of Assumption: BMI and Sexual Attractiveness
bmi <- c(11.0, 12.0, 12.5, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0,
         14.8, 15.0, 15.0, 15.5, 16.0, 16.5, 17.0, 17.0, 18.0,
         18.0, 19.0, 19.0, 20.0, 20.0, 20.0, 20.5, 22.0, 23.0,
         23.0, 24.0, 24.5, 25.0, 25.0, 26.0, 26.0, 26.5, 28.0,
         29.0, 31.0, 32.0, 33.0, 34.0, 35.5, 36.0, 36.0)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
beauty <- data.frame(bmi, sa)
# Straight-line fit of attractiveness score on BMI; data = beauty replaces attach(beauty)
results <- lm(sa ~ bmi, data = beauty)
summary(results)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.92512    0.64489   7.637 1.81e-09
bmi         -0.05967    0.02862  -2.084   0.0432
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.354 on 42 degrees of freedom
Multiple R-Squared: 0.09376, Adjusted R-squared: 0.07218
F-statistic: 4.345 on 1 and 42 DF, p-value: 0.04323
34Incorrect Functional Form
35Cubic Regression
results <- lm(sa ~ poly(bmi, 3))
summary(results)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)     3.6500     0.1193  30.587  < 2e-16
poly(bmi, 3)1  -2.8228     0.7915  -3.566 0.000957
poly(bmi, 3)2  -5.9749     0.7915  -7.548 3.27e-09
poly(bmi, 3)3   4.0324     0.7915   5.094 8.76e-06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7915 on 40 degrees of freedom
Multiple R-Squared: 0.7051, Adjusted R-squared: 0.683
F-statistic: 31.88 on 3 and 40 DF, p-value: 1.077e-10

SA = 3.65 - 2.82 P1(BMI) - 5.97 P2(BMI) + 4.03 P3(BMI), where P1, P2 and P3 are the orthogonal linear, quadratic and cubic terms produced by poly(bmi, 3). They are not the raw powers BMI, BMI^2 and BMI^3, so the coefficients cannot be applied to raw BMI values directly.
36Sexual Attractiveness and BMI Cubic Function
bmi.new <- 10:40                                           # grid of BMI values for the fitted curve
sa.pred <- predict(results, data.frame(bmi = bmi.new))     # predictions from the cubic model
plot(sa ~ bmi)
lines(bmi.new, sa.pred, col = "blue", lwd = 3)
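One way to confirm that the cubic form is an improvement, rather than relying on the plot alone, is to compare the two nested models directly. The sketch below refits both models on the BMI data listed earlier and uses an F-test via `anova()`; the object names `linear` and `cubic` are illustrative.

```r
# Comparing the straight-line and cubic fits (sketch)
bmi <- c(11.0, 12.0, 12.5, 14.0, 14.0, 14.0, 14.0, 14.0, 14.0,
         14.8, 15.0, 15.0, 15.5, 16.0, 16.5, 17.0, 17.0, 18.0,
         18.0, 19.0, 19.0, 20.0, 20.0, 20.0, 20.5, 22.0, 23.0,
         23.0, 24.0, 24.5, 25.0, 25.0, 26.0, 26.0, 26.5, 28.0,
         29.0, 31.0, 32.0, 33.0, 34.0, 35.5, 36.0, 36.0)
sa <- c(2.0, 2.8, 1.8, 1.8, 2.0, 2.8, 3.2, 3.1, 4.0, 1.5, 3.2,
        3.7, 5.5, 5.2, 5.1, 5.7, 5.6, 4.8, 5.4, 6.3, 6.5, 4.9,
        5.0, 5.3, 5.0, 4.2, 4.1, 4.7, 3.5, 3.7, 3.5, 4.0, 3.7,
        3.6, 3.4, 3.3, 2.9, 2.1, 2.0, 2.1, 2.1, 2.0, 1.8, 1.7)
beauty <- data.frame(bmi, sa)

linear <- lm(sa ~ bmi, data = beauty)            # straight-line model
cubic  <- lm(sa ~ poly(bmi, 3), data = beauty)   # cubic model (orthogonal terms)

r2_linear <- summary(linear)$r.squared
r2_cubic  <- summary(cubic)$r.squared

# F-test: do the quadratic and cubic terms add explanatory power?
comparison <- anova(linear, cubic)
p_value <- comparison$`Pr(>F)`[2]

print(round(c(R2_linear = r2_linear, R2_cubic = r2_cubic), 4))
print(comparison)
```

A small p-value in the comparison indicates that the straight-line model is missing curvature that the higher-order terms capture, which is the point of the "incorrect functional form" slide.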