Title: Help! My mentor gave me data and asked me to analyze it
Slide 1: Help! My mentor gave me data and asked me to analyze it.
- Pathways to Careers in Clinical and Translational Research (PACCTR) Curriculum Core
Slide 2: Help! My mentor gave me data and asked me to analyze it...
- This is a very common project for a mentor to give to a student.
- BUT it may not be an appropriate project if you don't have experience in statistical analysis or access to a statistician.
- It is also known as secondary data analysis.
Slide 3: Secondary Data Analysis
- Often there is extra data left over that the mentor thinks could be interesting.
- Example: An RCT has been completed (and published) examining the efficacy of a medication for some disease. At baseline, subjects were asked a lot of questions about quality of life (QOL) and sexual function. The researcher wants you to analyze this data.
Slide 4: Step 1: Define what you have
- Sometimes this is not clear.
- Get a list of variables, the questionnaire/survey, and the data abstraction instrument.
- Read the research protocol.
- Read any papers/posters already published.
Slide 5: Step 2: Research question (RQ)
- Always start with a research question.
- Ask your mentor what the research question is.
- If there isn't a clear research question, proceed with caution. There may not be an interesting project here.
Slide 6: Research question example
- Possible RQs from the RCT example:
- 1. How does QOL change in the treated vs. placebo group? (An RCT)
- 2. What is the QOL and sexual functioning of people with this disease at baseline (i.e., before treatment)? (A descriptive study, i.e., a cross-sectional study)
- 3. What are the determinants of low QOL in people with this disease? (Similar to the above, but you determine whether certain groups have lower QOL than others, e.g., by race, education, comorbidities, etc. This could be done with multivariate analysis.)
Slide 7: Step 2: Novel?
- Is this novel?
- Again, often there is leftover data that the researcher thinks might be interesting.
- Your job is to figure out if it would be interesting!
- Do a lit search to see what's been done, and talk to clinicians to see if it is interesting.
- If not, consider choosing a different project.
Slide 8: Step 3: Initial data analysis
- If there is a research question and it is interesting, proceed with initial data analysis.
- The type of analysis depends on the type of data:
- Continuous outcomes: means, t-tests, linear regression
- Dichotomous or categorical outcomes: proportions (%), chi-square tests, logistic regression
Slide 9: Step 3: Initial data analysis
- Do you have a programmer or statistician?
- NO. If you don't have data analysis experience, consider a different project unless you have a lot of time to teach yourself or take a class. See Hulley's Designing Clinical Research.
- YES. You have a programmer or statistician to help you.
- Your job is to communicate with him/her in order to get the info you need. Ask for:
- A list of variables
- A list of means and proportions for the variables you are interested in
- A compilation of cross-tabs and/or t-tests for selected variables to see if there are differences between groups (see next slide)
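Slide 10: Cross-tabs?
- "Cross-tabs" is shorthand for chi-square tests (or Fisher's exact test).
- If you ask for sex by high vs. low QOL, you would get the following (each cell shows the count, the row %, and the column %):

         |   Low QOL   High QOL      Total
---------+--------------------------------
    Male |       100         44        144
         |     69.44      30.56     100.00
         |     67.11      31.88      50.17
---------+--------------------------------
  Female |        49         94        143
         |     34.27      65.73     100.00
         |     32.89      68.12      49.83
---------+--------------------------------
   Total |       149        138        287
         |     51.92      48.08     100.00
         |    100.00     100.00     100.00

          Fisher's exact = 0.000
  1-sided Fisher's exact = 0.000
  Risk ratio = 2.026644   (95% confidence interval: 1.575926 to 2.606269)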
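The row and column percentages in a cross-tab can be recomputed from the four cell counts, which is a quick way to check that you are reading the table correctly. The following is a minimal sketch in Python (standard library only); the function name `crosstab_percentages` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
# Cell counts copied from the cross-tab above: sex by QOL group.
counts = {
    "Male": {"Low QOL": 100, "High QOL": 44},
    "Female": {"Low QOL": 49, "High QOL": 94},
}


def crosstab_percentages(table):
    """Return {(row, col): (count, row_pct, col_pct)} for a two-way table."""
    row_totals = {row: sum(cols.values()) for row, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for col, n in cols.items():
            col_totals[col] = col_totals.get(col, 0) + n
    result = {}
    for row, cols in table.items():
        for col, n in cols.items():
            row_pct = 100 * n / row_totals[row]  # share of this sex in the QOL group
            col_pct = 100 * n / col_totals[col]  # share of this QOL group that is this sex
            result[(row, col)] = (n, row_pct, col_pct)
    return result


cells = crosstab_percentages(counts)
for (row, col), (n, row_pct, col_pct) in cells.items():
    print(f"{row:6s} {col:8s} n={n:3d} row%={row_pct:5.2f} col%={col_pct:5.2f}")
```

The row % answers "what proportion of men have low QOL?", which is the comparison the interpretation slides rely on; the column % answers "what proportion of the low QOL group is male?".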
Slides 11-13: How to interpret?
- There are 287 subjects in total, with about half male (144) and half female (143).
- Men are more likely to have low QOL (100 of 144, or 69.44%) than women (49 of 143, or 34.27%).
- This difference is significant, with p < 0.001.
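To see where a Fisher's exact p-value comes from, it can be recomputed from the four cell counts. The sketch below assumes the common two-sided definition (sum the probabilities of all tables with the same margins that are no more likely than the observed one); the statistics package that produced the slide output may differ in small details, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Male: 100 low QOL, 44 high QOL; Female: 49 low QOL, 94 high QOL.
p_value = fisher_exact_two_sided(100, 44, 49, 94)
print(f"Fisher's exact p = {p_value:.2e}")
```

A p-value this small is displayed as 0.000 when output is rounded to three decimals, which is why it is reported as p < 0.001 rather than p = 0.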
Slide 14: How to interpret?
Risk ratio = 2.026644   (95% confidence interval: 1.575926 to 2.606269)
- Sometimes the output will instead come to you as a risk ratio (also called relative risk) or an odds ratio.
- Interpretation: Men are 2-fold more likely to have low QOL (RR = 2.03).
- This difference is significant because the 95% confidence interval does not include 1.0 (i.e., 1.58 to 2.61).
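The risk ratio and its confidence interval can be reproduced by hand from the cross-tab counts. This sketch assumes the usual large-sample (Wald) interval computed on the log scale with z = 1.96; that assumption reproduces the interval shown above to about four decimal places. The function name `risk_ratio` is illustrative.

```python
from math import exp, log, sqrt


def risk_ratio(events1, n1, events2, n2, z=1.96):
    """Risk ratio of group 1 vs. group 2 with a Wald CI on the log scale."""
    rr = (events1 / n1) / (events2 / n2)
    se_log_rr = sqrt(1 / events1 - 1 / n1 + 1 / events2 - 1 / n2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# Men: 100 of 144 with low QOL; women: 49 of 143 with low QOL.
rr, lower, upper = risk_ratio(100, 144, 49, 143)
significant = not (lower <= 1.0 <= upper)  # CI excludes 1.0
print(f"RR = {rr:.2f}, 95% CI {lower:.2f} to {upper:.2f}, significant: {significant}")
```

Because a risk ratio of 1.0 means "no difference between groups", checking whether the interval contains 1.0 is equivalent to checking significance at the 5% level.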
Slide 15: What about t-tests?
- If you asked for BMI vs. high/low QOL, you would get this:

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
 Low QOL |      64    24.74505    .8713092    6.970474    23.00388    26.48622
High QOL |      62    25.43989      1.0174    8.011014    23.40548    27.47431
---------+--------------------------------------------------------------------
combined |     126    25.08696    .6662367    7.478489    23.76839    26.40552
---------+--------------------------------------------------------------------
    diff |           -.6948402    1.336548               -3.340244    1.950563
------------------------------------------------------------------------------
Degrees of freedom: 124

              Ho: mean(Low QOL) - mean(High QOL) = diff = 0

     Ha: diff < 0              Ha: diff != 0              Ha: diff > 0
       t = -0.5199                t = -0.5199               t = -0.5199
   P < t =  0.3020          P > |t| =  0.6041           P > t =  0.6980
Slides 16-18: How to interpret?
- The low QOL subjects (n = 64) have a mean BMI of 24.7, with a std. dev. of 7.0 and a 95% CI of 23.0 to 26.5.
- The high QOL subjects (n = 62) have a mean BMI of 25.4.
- Is this significantly different? No. Look at the middle column of the t-test output (Ha: diff != 0): p = 0.6041.
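The t statistic in the output above can be rebuilt from the group sizes, means, and standard deviations alone. The sketch below computes the equal-variance (pooled) t statistic; for the p-value it uses a normal approximation, which is an assumption made here to stay within the Python standard library and is only reasonable because there are 124 degrees of freedom. The exact p-value comes from the t distribution.

```python
from math import sqrt
from statistics import NormalDist


def pooled_t_test(n1, mean1, sd1, n2, mean2, sd2):
    """Equal-variance two-sample t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se_diff = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / se_diff
    return t_stat, df, se_diff


# Low QOL group first, high QOL group second (BMI values from the output above).
t_stat, df, se_diff = pooled_t_test(64, 24.74505, 6.970474, 62, 25.43989, 8.011014)

# Normal approximation to the two-sided p-value (not the exact t-based value).
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
print(f"t = {t_stat:.4f}, df = {df}, approximate two-sided p = {p_approx:.2f}")
```

The approximate p-value lands within about 0.001 of the reported 0.6041, and either way it is far above 0.05, so the BMI difference between the QOL groups is not significant.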
Slide 19: What about multivariate analysis?
- Predictors of low QOL: low QOL is the outcome, so this is a logistic regression because the outcome is dichotomous.
- Choose variables to place in your model. The choice depends on both biologic plausibility and the results of the bivariate analysis (the cross-tabs and t-tests you did above).
Slide 20: Model selection, multivariate analysis
- You may choose to put into the model all variables that were significant in bivariate analysis at p < 0.10. (Usually you choose a threshold of p < 0.10 to 0.20, because if you limit it to p < 0.05 you may miss some variables that become significant in a multivariate model due to confounding by other variables.)
- And, even if not significant in the bivariate analysis, you may choose to include variables that you think are important biologically or because others have reported an association (e.g., co-morbid conditions).
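The screening rule above can be written down as a small filter. In this sketch, only the sex and BMI p-values come from the earlier slides; every other p-value, the variable names, and the "forced-in" variable are made-up placeholders for illustration, and `select_candidates` is a hypothetical helper rather than a standard routine.

```python
# Bivariate p-values for each candidate predictor of low QOL.
bivariate_p = {
    "male": 0.0005,        # stand-in for the reported p < 0.001
    "bmi": 0.6041,         # from the t-test on the earlier slide
    "race": 0.03,          # hypothetical placeholder
    "age_category": 0.15,  # hypothetical placeholder
    "comorbidity": 0.45,   # hypothetical placeholder
}

# Variables kept regardless of p-value for biologic plausibility or because
# others have reported an association (hypothetical choice here).
forced_in = {"comorbidity"}


def select_candidates(p_values, threshold=0.10, always_include=()):
    """Return variables passing the bivariate screen, plus forced-in ones."""
    screened = {name for name, p in p_values.items() if p < threshold}
    return sorted(screened | set(always_include))


strict = select_candidates(bivariate_p, threshold=0.10, always_include=forced_in)
lenient = select_candidates(bivariate_p, threshold=0.20, always_include=forced_in)
print("p < 0.10:", strict)
print("p < 0.20:", lenient)
```

Raising the threshold from 0.10 to 0.20 lets in the borderline variable, which illustrates why a looser screen is often preferred before fitting the multivariate model.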
Slide 21: Results: Multivariate analysis
You ask for the model to be run and get this:

. xi: logistic lowqol i.trirace i.agecat2 male private q33job lesshs married

Logit Estimates                                   Number of obs =     371
                                                  chi2(16)      =   79.29
                                                  Prob > chi2   =  0.0000
Log Likelihood = -202.4476                        Pseudo R2     =  0.1638

------------------------------------------------------------------------------
  lowqol | Odds Ratio   Std. Err.       z     P>|z|      [95% Conf. Interval]
---------+--------------------------------------------------------------------
Itrira_1 |   .9543597   .3067677     -0.145   0.884      .5082804    1.791929
Itrira_2 |    .404713   .1310575     -2.793   0.005      .2145379     .763467
Iageca_1 |   2.149653    .749715      2.194   0.028      1.085182     4.25828
Iageca_2 |   2.007573   .6533771      2.141   0.032      1.060822     3.79927
    male |   2.227047   .9420758      1.893   0.058      .9719808    5.102711
 private |   1.085656   .8550493      0.104   0.917      .2318977    5.082625
  q33job |   .8852046   .2355718     -0.458   0.647      .5254371    1.491305
  lesshs |   .8078212   .2238751     -0.770   0.441      .4692648    1.390633
 married |   .9584556    .268145     -0.152   0.879      .5539024    1.658482
------------------------------------------------------------------------------
Slide 22: Interpretation?
Outcome variable: low QOL
Variables in model: race (3 categories, ref = white), age (3 categories, ref = <30), male (vs. female), private insurance (vs. Medicaid), employed (vs. unemployed), education less than high school (vs. more), married (vs. unmarried).
Note that BMI is not in the model because it wasn't significant in bivariate analysis (t-test).
Slide 23: Interpretation?
Outcome variable: low QOL
Look at the P>|z| column of the logistic regression output to see which variables are significantly associated with low QOL after adjustment for the other variables in the model.
Odds ratios > 1.0 indicate a higher risk of low QOL; odds ratios < 1.0 indicate a lower risk of low QOL.
Slide 24: Interpretation?
Variables associated with increased risk of low QOL:
1. Age 40-50: 2-fold increased risk
2. Age >50: 2-fold increased risk
3. Male: trend toward significance, with p = 0.06
Variables associated with decreased risk of low QOL:
1. Asian (race category 2): 60% decreased risk
All other variables are no longer significantly associated with the outcome.
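The reading of the odds ratio table can be checked numerically. This sketch takes a few odds ratios and their standard errors from the logistic regression output, rebuilds the 95% confidence intervals on the log-odds scale (assuming a Wald interval with z = 1.96 and the delta-method relation SE(ln OR) = SE(OR) / OR), and flags a variable as significant when its interval excludes 1.0. The `summarize` helper and the dictionary keys are illustrative names.

```python
from math import exp, log

# (odds ratio, standard error of the odds ratio) copied from the output.
estimates = {
    "Itrira_2": (0.404713, 0.1310575),
    "Iageca_1": (2.149653, 0.749715),
    "Iageca_2": (2.007573, 0.6533771),
    "male": (2.227047, 0.9420758),
    "married": (0.9584556, 0.268145),
}


def summarize(odds_ratio, se, z=1.96):
    """Rebuild the Wald CI and describe the direction of the association."""
    se_log = se / odds_ratio  # delta method: SE(ln OR) = SE(OR) / OR
    lower = exp(log(odds_ratio) - z * se_log)
    upper = exp(log(odds_ratio) + z * se_log)
    return {
        "pct_change_in_odds": 100 * (odds_ratio - 1),
        "ci": (lower, upper),
        "significant": lower > 1 or upper < 1,  # CI excludes 1.0
    }


results = {name: summarize(*values) for name, values in estimates.items()}
for name, res in results.items():
    low, high = res["ci"]
    print(
        f"{name:9s} change in odds {res['pct_change_in_odds']:+6.1f}%  "
        f"95% CI {low:.2f} to {high:.2f}  significant: {res['significant']}"
    )
```

Strictly, these percentages describe changes in the odds of low QOL; the slides use "risk" as shorthand. The interval for `male` just barely includes 1.0, which matches the "trend toward significance" reading at p = 0.058.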
Slide 25: Summary: Data analysis
- Clearly define the research question and ensure it is novel.
- Understand the data: get the variable list, read the questionnaire, and read the research proposal and already published posters/papers.
- Preliminary analysis: bivariate (t-test, chi-square)
- Advanced analysis: multivariate
Slide 26: PACCTR Curriculum Core
- Rebecca Jackson MD, School of Medicine
- Roberta Oka RN, ANP, DNSc, School of Nursing
- George Sawaya MD, School of Medicine
- Susan Hyde DDS, MPH, PhD, School of Dentistry
- Jennifer Cocohoba PharmD, School of Pharmacy
- Joel Palefsky MD, School of Medicine