Appropriate Use of Statistics for BigData Projects - PowerPoint PPT Presentation

1
Musings by a Statistician
Laura Lee Johnson, PhD September 11, 2008 NIH
Biomedical Computing Interest Group
2
Topics on My Mind
  • Study and experimental design
  • 20 people sending me the Wired (23 JUN 2008)
    article on the obsolescence of the scientific method
  • Large (vs. small?) data
  • Repeating the same mistakes
  • Variance and independent measurements
  • Analyses and sample size
  • It is just a pilot (the 4th pilot)

3
May 2006 Talk at BCIG: Outline
  • What is your question?
  • What is your design?
  • What does your data look like?
  • Lots of measures, few people
  • Lots of people, lots of zeros/NAs
  • No data, large parameter and value space

4
Outline
  • Lots of numbers, few people
  • fMRI, Microarray, Proteomics
  • Lots of people, lots of zeros/NAs
  • sparse data
  • Multidimensional, hierarchical
  • No data, large parameter and value space
  • Data farming
  • The Sims, Project Albert

5
The Key
  • Leverage what you have
  • Do not oversell it
  • Add to this: all data has a shape.
  • Can you use that shape to answer a question or to
    generate hypotheses to develop a study to test a
    question of interest?
  • Sometimes it cannot be done well

6
What is the Question?
7
Not that different today
  • Petabytes are great
  • But do not fool yourself
  • It might fool whoever keeps demanding answers
    from you

8
Progression and Treatment of Periodontal Disease
  • Two studies
  • Subject level
  • Each tooth of each subject
  • Multiple locations on each tooth of each subject
  • Longitudinal structure of the study
  • Make that studIES
  • The data contain multiple levels of correlation

9
Variation and Correlation
10
Variance and Correlation: Laboratory(ies)
11
Variation and Correlation: Data!
12
Variation and Correlation
13
Generating More Questions
14
Not What You Want
15
Can you hit the broad side of
  • A barn?
  • Did you paint the bull's-eye before or after you
    shot?
  • Hypothesis generation is great
  • Do not forget that is what you did
  • Do not forget 99.9% can be insufficient
  • Do not forget 90% could be a winner

16
Statistical Inference
  • Inferences about a population are made on the
    basis of results obtained from a sample drawn
    from that population
  • Want to talk about the larger population from
    which the subjects are drawn, not the particular
    subjects!

17
Linear Regression
  • Model for simple linear regression
  • Yi = β0 + β1x1i + εi
  • β0 = intercept
  • β1 = slope
  • Assumptions
  • Observations are independent
  • Normally distributed with constant variance
  • Hypothesis testing
  • H0: β1 = 0 vs. HA: β1 ≠ 0
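The model and test on this slide can be sketched numerically. A minimal ordinary-least-squares fit in plain Python; the toy data are illustrative, not from the talk, and the t statistic relies on the independence, normality, and constant-variance assumptions listed above:

```python
import math

def fit_simple_ols(x, y):
    """OLS for the model y_i = b0 + b1 * x_i + e_i."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # slope estimate
    b0 = my - b1 * mx              # intercept estimate
    # t statistic for H0: b1 = 0 (assumes independent, normal
    # errors with constant variance, as the slide notes)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    return b0, b1, b1 / se_b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]      # roughly y = 2x, made-up data
b0, b1, t = fit_simple_ols(x, y)   # slope comes out near 1.95
```

A large |t| leads to rejecting H0: β1 = 0; with correlated observations (the next slide) this standard error is wrong and the test is anticonservative.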

18
Stretching Assumptions: Independent in any
direction?
  • fMRI, microarrays
  • Voxels and genes associated with others
  • Repeated measures on the same person EVEN IF
    DIFFERENT SAMPLES should be considered
    non-independent
  • Are all outcomes measured with equivalent
    sensitivity?
  • Probably not
  • Variances will not be the same

19
Analyses Make More Datasets
  • Permutation/randomization test
  • Rearrange the current data → new dataset
  • Calculate the test statistic
  • Repeat many times
  • Compare results to the original data's test
    statistic
  • Bootstrap
  • Sample records with replacement → new dataset
  • The rest is the same as above
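Both recipes on this slide fit in a few lines of plain Python. A sketch under illustrative data and defaults (2000 resamples, fixed seeds): a permutation test for a difference in group means, and a bootstrap percentile interval for a mean:

```python
import random

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    """Rearrange the data, recompute the statistic, compare
    to the original data's test statistic."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = a + b
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= abs(observed):
            extreme += 1
    return extreme / n_perm

def bootstrap_ci(data, n_boot=2000, seed=0):
    """Sample records with replacement; 95% percentile
    interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choice(data) for _ in data) / len(data)
        for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

p = permutation_pvalue([5.1, 6.2, 7.0, 8.3], [1.2, 2.1, 2.4, 3.0])
lo, hi = bootstrap_ci([5.1, 6.2, 7.0, 8.3, 4.9, 5.5])
```

Note that naive resampling like this assumes exchangeable, independent records; hierarchical data (slide 8) needs block or cluster-level resampling instead.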

20
What is your design?
21
When do I Cringe?
  • Small sample size
  • Biased sample
  • Convenience samples
  • Median? Mean? Range? Standard deviation?
  • Interviews: Who did it, of whom, with what,
    where, and how?
  • Completely impractical in real world setting

22
Design
  • Produces/uses data to answer the question(s)
  • If the question needs hierarchical data to answer
    it
  • Needs a hierarchical design
  • Needs to have a hierarchical data analysis
  • Not sparse

23
Causation
  • Biological plausibility
  • Temporal relationship
  • Dose response
  • Reproducibility
  • Strength of association
  • Coherence with established facts
  • Specificity of association

24
National Health Interview Survey
  • State level stratification
  • Black and Hispanic populations oversampled; 2006
    added Asians to that
  • Area frame based on previous decennial census;
    changes every 10 years
  • Family
  • One sample adult and one sample child (if
    children under 18 present)
  • Household, family, and person level files

25
fMRI
  • Lots of data/scan
  • Voxels
  • Not independent
  • Correlation between voxels not uniform

26
What is the question? What is the design?
  • Inter- and intra-subject correlation
  • Complex dimensions of brain features
  • Supervised learning
  • Activity patterns predictive of XXX
  • Spatial resolution

27
Power
  • How likely are you to see a difference if there
    is one there?
  • Look for a big difference!
  • More subjects, more runs, longer runs
  • Improved signal to noise ratio
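"How likely are you to see a difference if there is one" can be estimated by simulation. A rough sketch, assuming two normal groups and a |t| > 2 rejection rule as an approximation to the usual two-sided 5% cutoff (all numbers are illustrative):

```python
import math
import random

def simulated_power(delta, sigma, n, sims=500, seed=1):
    """Fraction of simulated experiments in which a true mean
    difference of size delta is detected by a two-sample t statistic."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        g1 = [rng.gauss(0.0, sigma) for _ in range(n)]
        g2 = [rng.gauss(delta, sigma) for _ in range(n)]
        m1, m2 = sum(g1) / n, sum(g2) / n
        v1 = sum((x - m1) ** 2 for x in g1) / (n - 1)
        v2 = sum((x - m2) ** 2 for x in g2) / (n - 1)
        t = (m2 - m1) / math.sqrt(v1 / n + v2 / n)
        if abs(t) > 2.0:            # approx. 5% two-sided cutoff
            hits += 1
    return hits / sims

big = simulated_power(delta=1.0, sigma=1.0, n=20)    # a big difference
small = simulated_power(delta=0.2, sigma=1.0, n=20)  # a small one
```

The simulation makes the slide's point directly: a big difference is detected most of the time, a small one rarely, and more subjects or less noise (better signal-to-noise) move `small` upward.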

28
Who? Structure of Design
  • Cross-sectional study design
  • Compare groups at a single time point
  • Pre-post or longitudinal designs
  • Look at how one group changes during an
    intervention compared to another group
  • If the design lets you stay simple, do
  • Add hierarchy onto all this? If you need to

29
Data Sweet Data
  • Thank you, Chris Anderson, and everyone who sent
    me that article. I already said it.

30
Terabyte, meet Petabyte
  • Need to turn one huge matrix, all at once
  • MATLAB is your friend (say some)
  • Random effects (linear mixed) models are your
    friends
  • 64 x 64 voxels x 10 slices per image
  • Or 128x128 or 15 or 20 slices
  • At least 10 people/group
  • Preferably more
  • Do you have any confounders to adjust for?

31
HAHAH 10 per group!
  • Yeah, I said it, which was a high number back
    then
  • I thought it was too low, but everyone thought 10
    was too high
  • So what should the number be?
  • 10, 12, 15, 36, 50, 100, i2b2

32
What is the measure?
  • ADNI had a nice presentation July 2008
  • Formal comparisons of MRI, PET summary measures,
    association of MRI, PET with cognitive change
  • Can sample sizes of 400-1000 or 800-8500 per
    study arm/group be reduced?
  • In normal subjects, do the measures help us? Can
    we look earlier into the disease process?

33
Missing Data
  • Data we know is missing
  • Imputation
  • Are the zeros real?
  • Categorize continuous responses
  • Data that might exist
  • Not measured on anyone
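For data we know is missing, the simplest form of imputation is to fill in the observed mean. A minimal sketch with made-up values; real projects should first ask whether zeros are true zeros or missing, and will usually prefer model-based or multiple imputation over this single-fill approach:

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the
    observed entries. Simplest possible single imputation."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]

cleaned = impute_mean([1.0, None, 3.0, None, 5.0])  # fills with 3.0
```

Mean imputation keeps the sample mean but shrinks the variance, so any downstream test that ignores the imputation will be overconfident.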

34
How Clean is the Data?
  • Measurement error
  • Sample quality
  • Laboratory quality assurance
  • Consistency
  • Are time and condition confounded?
  • It might be that a numerical answer should not be
    sought at this time
  • Pattern recognition

35
Class Comparison: Things to be/not to be worried
about
  • Bias
  • Systematic
  • Not systematic
  • Replication
  • One array per specimen
  • More?

36
Class Comparison
  • Time of day/temperature (if known)
  • Serum sample age, preservation, storage
  • Uniformity of sample collection
  • Change in machine or technician
  • May hide the truth
  • May be the truth
  • Try to clean up the noise
  • Overprocess the data

37
Class Comparison
  • Fixable, but not by regression
  • Computer buffer
  • Machine problems: power cords and other
    electrical interference
  • Not fixable
  • Matrix artifacts
  • Fingerprints on scanned chip

38
QC/QA that will help?
  • Take samples at the same time under the same
    conditions
  • RCT or observational study
  • If you cannot collect samples in the same way
  • Differences may come from collection, not the
    class difference you are hoping to measure
  • Will your findings hold up to other data sets?

39
QC/QA that will help?
  • Randomize specimens into the lab!
  • Do not process all controls, then all cases
  • If there is a non-biological artifact lurking
    randomization of samples might help avoid it
    hurting your outcome
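The randomization this slide calls for is easy to implement. A sketch (names and seed are illustrative) that interleaves cases and controls into one random processing order, instead of running all controls and then all cases:

```python
import random

def randomized_run_order(cases, controls, seed=42):
    """Shuffle specimens so cases and controls are interleaved
    in the lab run order, breaking the link between class and
    processing time."""
    order = [("case", s) for s in cases] + [("control", s) for s in controls]
    random.Random(seed).shuffle(order)
    return order

order = randomized_run_order(["c1", "c2", "c3"], ["n1", "n2", "n3"])
```

With a randomized order, any non-biological artifact that drifts over the run (temperature, technician fatigue, buffer age) is spread across both classes instead of masquerading as the class difference.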

40
It is all about the N
  • If you cannot enroll 10 per group and analyze
    that data in some manner
  • Why should I think you could get 50 or 600 per
    group?
  • As many questions as you ask
  • Google and Amazon likely have more and better
    data on a person than the research participants
    themselves
  • Maybe

41
Call it a Pattern or a Model
  • Figure something out using analytic tools
  • Call it anything you please
  • I think most are saying we do not need
    theoretical models, or preconceived models
  • Although generally you need some idea of what
    data you are looking at, general as that may be

42
Good Models from Good Data
  • Wide range of people contributing data to the
    model
  • Training set, test set, validation set
  • Validation set
  • Locked away
  • Preferably from another lab
  • Run through the final model, get the calls, give
    to a person who has the truth and see
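The training/test/validation discipline above can be sketched as a simple random three-way split, with the validation set locked away until the final model is fixed (the fractions and seed are illustrative; ideally the validation data come from another lab entirely):

```python
import random

def three_way_split(records, frac_train=0.6, frac_test=0.2, seed=0):
    """Random training/test/validation split. The validation
    slice is held out and only touched once, at the very end."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(frac_train * len(shuffled))
    n_test = int(frac_test * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # locked away
    return train, test, validation

train, test, validation = three_way_split(range(100))
```

The point of the locked-away set is exactly the slide's: run the final model once, hand the calls to someone who holds the truth, and see.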

43
Analysis follows Design
  • Questions → Hypotheses →
  • Experimental Design → Samples →
  • Data → Analyses → Conclusions

44
But Google Tells Me I Do Not Need Models or
Hypotheses or
  • Yeah, you need applied math
  • Guess what stats is
  • You need better data with better analytical tools
  • You can track and measure what people do with
    unprecedented fidelity, if you have the data.
    Good data.

45
It Works
  • Business does this all the time
  • Analytic tools tell you something; you try it
  • It fails, you realize it, and you change quickly
  • If I give a drug to all type II diabetics
  • Hard to change course when 40 of them die
  • At least for those 40
  • Numbers speak for themselves when you have enough
    of them
  • Hypothesis testing does not miss penicillin.
    When you need an antibiotic.

46
And PS
  • If you are only looking for correlation, you are
    probably wrong
  • Associations, associations, associations
  • May not be linear
  • May not be bivariate
  • Some things are simple, though
  • But that is not why most people want big data
  • Even though they tend to analyze it that way

47
I Agree
  • We can stop looking for models. We can analyze
    the data without hypotheses about what it might
    show. We can throw the numbers into the biggest
    computing clusters the world has ever seen and
    let statistical algorithms find patterns where
    science cannot.
  • And then we have to figure out what to do with
    them. You can make some good guesses.

48
Analysis follows Design
  • Questions → Hypotheses →
  • Experimental Design → Samples →
  • Data → Analyses → Conclusions

49
Step Forward
  • Replication is finding the pattern again
  • Science perhaps is figuring out why we care about
    the pattern
  • Translational research (medical or otherwise) is
    figuring out how the use of this pattern can
    change the health and well-being of people (or
    animals, the planet, or whatever else you care to
    look at)

50
Lies, Damn Lies, and Statistics!
  • Go astray from the statistical and study design
    assumptions
  • Same applies for analytic tools and
    interpretations made from the use
  • Impede accurate interpretation of resulting data
  • Simulations, permutations, bootstrap
  • Outliers are interesting, not a pain
  • Many, many runs are needed

51
Goals
  • How can you ensure the numerical processing of
    your data does not hurt the interpretability of
    its final outcome?
  • Big-data projects
  • fMRI
  • Proteomics
  • Microarray

52
Just fix it
  • Statistician ≠ Miracle Worker
  • If the new latest greatest technology provides
    data with serious numerical bias
  • A statistician's job may involve working with
    bias, BUT
  • New better machines with less measurement error
    and bias should be built

53
Class Comparison Mistakes: Interactions and
Covariance
  • What goes into the models?
  • Regression models
  • Linear Mixed models
  • Variance structure is not simple
  • Variance structure is not known
  • Big complex data - big complex structure
  • Explain in 1 sentence in the methods section

54
Prediction Mistakes: Eyes on the Prize
  • Want the prediction itself to remain stable
  • Do not care about the features that get you to
    the prediction remaining the same across various
    prediction models
  • Are you looking for a clinically useful model?
    Is this a step in trying to find a simple test?
  • In 15 years, will you be running the same lab
    method with specimens, or something else?

55
Class Discovery
  • Do not have pre-defined classes
  • Unsupervised
  • Objective: Discovering new groupings or
    taxonomies of specimens based on expression
    profiles; discovering classes of co-expressed
    and co-regulated genes

56
Class Discovery Mistakes: If you look, you will
find something
  • Cluster analyses lead to a cluster structure
  • What clusters do you believe?
  • Where do you stop?
  • Same data, different algorithms, different
    structure
  • Texas Sharpshooter Fallacy
  • Reproducibility?
  • Data perturbation methods
  • Estimate of clusters
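One of the data-perturbation checks this slide mentions can be sketched for a toy one-dimensional, two-cluster case: add noise to the data, re-cluster, and see how often the original labels come back. All names, data, and the noise level are illustrative:

```python
import random

def two_means_1d(xs, iters=25):
    """Tiny 1-D k-means with k=2; returns the threshold
    separating the two clusters."""
    c1, c2 = min(xs), max(xs)
    for _ in range(iters):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        if not g1 or not g2:
            break
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return (c1 + c2) / 2

def label_stability(xs, n_pert=200, noise=0.1, seed=0):
    """Average agreement between the original cluster labels and
    labels from thresholds refit on noise-perturbed copies."""
    rng = random.Random(seed)
    base = two_means_1d(xs)
    base_labels = [x > base for x in xs]
    agree = 0.0
    for _ in range(n_pert):
        pert = [x + rng.gauss(0.0, noise) for x in xs]
        t = two_means_1d(pert)
        labels = [x > t for x in xs]
        agree += sum(l == b for l, b in zip(labels, base_labels)) / len(xs)
    return agree / n_pert

s = label_stability([0.1, 0.2, 0.3, 5.0, 5.1, 5.2])  # well separated
```

Well-separated clusters survive perturbation with stability near 1; clusters that dissolve under small noise are exactly the ones you found because a clustering algorithm always finds something.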

57
Data Summary Remarks
  • Enormous chances for spurious findings
  • Knowing everything about 1 specimen, or 3, does
    not hold all the answers
  • But sometimes you hit gold
  • Validation on larger independent collections of
    specimens is essential
  • Yes, business does this too

58
Let's say the data is perfect
  • New technologies → many, many different measures
    from the same sample
  • Measures, even if of different items, may be
    associated inside the same person
  • Known or unknown pathways or associations
  • Variance and covariance
  • Ignore it or use it

59
System
  • Almost everything is part of a system
  • Evaluating more complex systems
  • Not easy, not common
  • If you care about the system
  • Ask the question that way
  • Design the study that way
  • Collect enough data to analyze it that way
  • Simple is OK; answer YOUR question

60
Outline
  • What is your question?
  • What is your design?
  • What does your data look like?
  • Lots of measures, few subjects
  • Lots of subjects, lots of zeros/NAs
  • No data, large parameter and value space

61
Topics on My Mind
  • Study and experimental design
  • 20 people sending me the Wired (23 JUN 2008)
    article on the obsolescence of the scientific method
  • Large (vs. small?) data
  • Repeating the same mistakes
  • Variance and independent measurements
  • Analyses and sample size
  • It is just a pilot (the 4th pilot)

62
What is the question?
  • If you make an inference, do real inference
  • What is the difference of interest?
  • What is compared to what?
  • A difference or activation or down-regulation
    means TWO or more items were compared. What is
    the basis of the comparison?
  • What is the baseline or control condition and how
    many are there?

63
Summary Remarks
  • Analysis tools cannot compensate for poorly
    designed experiments or studies
  • Fancy analysis tools do not necessarily
    outperform simple ones
  • Even the best analysis tools, if applied
    inappropriately, can produce incorrect or
    misleading results

64
Summary Remarks
  • Have a statistician who has thought about high
    dimensional data help design the study and
    experiment, do the analysis, and interpret the
    results
  • Not the only person, but an often forgotten one
  • Skilled programmers who understand the question
  • Good computing support, space, and speed

65
Question → Design → Analysis
  • Garbage in
  • Miracle occurs
  • All our fault
  • The postdoc worked very hard
  • We know already! Garbage out.
  • If you really do, then stop
  • Use correlation?
  • Stop talking about testing correlations. That is
    bogus.
