Title: Challenges In Progressing Biomarkers To Clinical Use: Proteomic Experiences
1 Challenges In Progressing Biomarkers To Clinical Use: Proteomic Experiences
- Chris Harbron
- Technical Lead For High Dimensional Data
- AstraZeneca
- FDA Industry Statistics Workshop
- September 2006
2 Gap Between Published Biomarkers And Biomarkers Being Approved For Use
3 Why Might This Be? Challenges
- Pressures from the contextual environment
- High quality data is essential
- These are new technologies - not simple to use or analyse
- Robust study design, including:
- Consistent sample collection and processing
- Need to understand reproducibility between labs, within labs and within subjects
- Failure leads to poor data quality, frequently dominated by nuisance factors
- Rigorous validation is also essential
- Occurs at many levels
- Avoid overfitting data
- Omics may not do it alone
- Applications will require combining -omics with other data types
4 Example Case-Control Study
- Interest in identifying a peptidomic profile that could predict an adverse event
- Potential use as a personalised medicine predictive marker
- Blood samples taken from subjects at start of treatment
- Subjects monitored for adverse event using a rigorous definition
- Subjects entered in cohorts
- Samples processed in batches within cohorts
- Analysed on an LC-MS/MS platform
5 LC-MS/MS Proteomics
[Figure: LC-MS/MS workflow. Clinical plasma samples undergo preparation and digestion into peptides. Liquid chromatography separates the peptides by retention time; mass spectrometry separates them by mass/charge ratio and measures ion intensity. An MS/MS fragment ion spectrum (fragment ion intensity against mass/charge ratio) is used for protein identification.]
6 Distribution Of Average Intensities
- 5,500,000 RT / MZ / Intensity Measurements Per Sample
- Pre-Processing
- Alignment Of Retention Times
- Scaling
- Binning
- 25,000 Common Peaks Per Sample
[Figure: Distribution Of Average Intensities, plotted by retention time and mass-charge ratio, shaded from low to high intensity.]
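The scaling and binning steps named on this slide can be illustrated with a minimal sketch. It assumes each measurement is an (RT, MZ, intensity) triple, that binning means summing intensities on a fixed rectangular grid, and that scaling means dividing by the sample total. The function names, bin widths and toy values are hypothetical and are not taken from the study; retention time alignment is not covered.

```python
from collections import defaultdict


def bin_measurements(measurements, rt_width=1.0, mz_width=1.0):
    """Sum the intensities of (rt, mz, intensity) triples that share a grid cell."""
    grid = defaultdict(float)
    for rt, mz, intensity in measurements:
        cell = (int(rt // rt_width), int(mz // mz_width))
        grid[cell] += intensity
    return dict(grid)


def scale_to_total(grid):
    """Divide each cell by the sample total so samples are on a common scale."""
    total = sum(grid.values())
    return {cell: value / total for cell, value in grid.items()}


# Toy measurements for one sample (made-up numbers, not study data).
measurements = [
    (10.2, 690.8, 1500.0),
    (10.7, 690.4, 500.0),   # same cell as the first measurement
    (11.3, 690.8, 250.0),   # next retention-time bin
    (10.4, 1027.9, 750.0),  # different m/z bin
]

binned = bin_measurements(measurements)
scaled = scale_to_total(binned)
```

A real pipeline would choose bin widths to match instrument resolution, and other scalings (for example to a median or to reference peaks) are equally plausible readings of the slide.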
7 Proteomic Data: Exploratory Analysis - PCA
Considerable batch to batch variation
[Figure: PCA plot. Legend: Control, Case, Non-Index Case; Cohort 1, Cohort 2, Cohort 3, Cohort 4.]
8 Proteomic Data: Exploratory Analysis - PCA
Within all batches with both cases and controls,
there is separation of cases and controls
9 Univariate Analyses Within Batches: Histograms Of t-Test p-Values
10 Global Test Of Agreement Between Batches Using A Permutation Test
- Identify peaks where direction of effect agrees in all 3 batches
- Summarise by maximum p-value
- Global test against the level expected from multiple testing, estimated by permutation
[Figure: observed and permuted results compared.]
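The agreement test on this slide can be sketched as follows. This is an illustration under stated assumptions, not the analysis that was run: it uses the smallest within-batch |t| as a stand-in for the maximum p-value (with equal group sizes the two rank peaks the same way), counts peaks passing a hypothetical threshold of 2.0, and permutes case/control labels within each batch to see how many such peaks arise by chance. All data are simulated.

```python
import random
from statistics import mean, variance


def welch_t(cases, controls):
    """Welch two-sample t statistic (cases minus controls)."""
    se2 = variance(cases) / len(cases) + variance(controls) / len(controls)
    return (mean(cases) - mean(controls)) / se2 ** 0.5


def batch_t_stats(values, labels):
    """values[peak][subject]; labels[subject] is 1 for a case, 0 for a control."""
    stats = []
    for peak in values:
        cases = [v for v, lab in zip(peak, labels) if lab == 1]
        controls = [v for v, lab in zip(peak, labels) if lab == 0]
        stats.append(welch_t(cases, controls))
    return stats


def count_consistent(batches, threshold):
    """Count peaks with the same sign of effect in every batch whose
    weakest batch still reaches |t| >= threshold."""
    per_batch = [batch_t_stats(values, labels) for values, labels in batches]
    count = 0
    for ts in zip(*per_batch):
        same_sign = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        if same_sign and min(abs(t) for t in ts) >= threshold:
            count += 1
    return count


def permutation_test(batches, threshold=2.0, n_perm=200, seed=0):
    """Permute labels within each batch and compare counts with the observed one."""
    rng = random.Random(seed)
    observed = count_consistent(batches, threshold)
    exceed = 0
    for _ in range(n_perm):
        permuted = []
        for values, labels in batches:
            shuffled = labels[:]
            rng.shuffle(shuffled)
            permuted.append((values, shuffled))
        if count_consistent(permuted, threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def simulate(seed=1, n_batches=3, n_per_group=6, n_peaks=30, n_true=3):
    """Simulated data: a large batch offset plus a case effect on the first peaks."""
    rng = random.Random(seed)
    batches = []
    for b in range(n_batches):
        labels = [1] * n_per_group + [0] * n_per_group
        values = [[5.0 * b + rng.gauss(0, 1) + (5.0 if lab == 1 and p < n_true else 0.0)
                   for lab in labels] for p in range(n_peaks)]
        batches.append((values, labels))
    return batches


observed, p_value = permutation_test(simulate())
```

Permuting within batches keeps the batch structure intact, so the reference distribution reflects only chance agreement between batches and not the batch offsets themselves.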
11 Typical Highly Significant Peak
- Within each batch, cases are highly expressed compared to controls
- Not possible to define a global cut-off between cases and controls
[Figure: intensity by batch for CASE, CONTROL and NIC (non-index case) subjects.]
12 Multivariate Analyses
- Identified consistent effect
- BUT, may be difficult to use as a predictive biomarker in a clinical setting due to batch variation
- Would a combination of markers, a peptidomic profile, work as a predictive biomarker?
- Use Random Forests to generate multivariate predictive models
- Assess predictive power using a nested cross-validation
- Within and between batch prediction
13 Modelling Process
[Flowchart, reconstructed from the slide labels]
- Data: control-only batches and mixed case-control batches
- Mixed case-control batches: exclude batches in turn, and exclude observations by leave-one-out (LOO)
- The excluded batch or excluded observation forms the test set; the rest forms the training set
- Analyse each peak within each batch
- Take maximum p-value for each peak
- Rank peaks by p-value
- Choose number of peaks: build model with top n peaks
- Test model in test set
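The key point of this process, that peaks are ranked and selected using the training set only, can be shown in a minimal sketch. Several substitutions are made and should be read as assumptions: a nearest-centroid classifier stands in for Random Forests to keep the example dependency-free, the smallest within-batch |t| stands in for the maximum p-value, only the leave-one-batch-out loop is shown (the leave-one-out loop would follow the same pattern), and the simulated data have no batch effect.

```python
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t statistic (a minus b)."""
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / se2 ** 0.5


def column(rows, labels, peak, label):
    """Values of one peak for the subjects carrying the given label."""
    return [row[peak] for row, lab in zip(rows, labels) if lab == label]


def rank_peaks(train_batches):
    """Score each peak by its weakest within-batch |t| (0 if the direction of
    effect differs between batches); return peak indices, best first."""
    n_peaks = len(train_batches[0][0][0])
    scores = []
    for p in range(n_peaks):
        ts = [welch_t(column(rows, labels, p, 1), column(rows, labels, p, 0))
              for rows, labels in train_batches]
        agree = all(t > 0 for t in ts) or all(t < 0 for t in ts)
        scores.append(min(abs(t) for t in ts) if agree else 0.0)
    return sorted(range(n_peaks), key=lambda p: scores[p], reverse=True)


def fit_centroids(train_batches, peaks):
    """Mean of each selected peak per class, pooled over the training batches."""
    return {label: [mean(v for rows, labels in train_batches
                         for v in column(rows, labels, p, label))
                    for p in peaks]
            for label in (0, 1)}


def predict(centroids, row, peaks):
    """Assign the class whose centroid is closest on the selected peaks."""
    def dist(label):
        return sum((row[p] - c) ** 2 for p, c in zip(peaks, centroids[label]))
    return min((0, 1), key=dist)


def leave_one_batch_out(batches, n_top):
    """Hold out each batch in turn; select peaks and fit on the others only."""
    correct = total = 0
    for held_out in range(len(batches)):
        train = [b for i, b in enumerate(batches) if i != held_out]
        peaks = rank_peaks(train)[:n_top]  # selection never sees the test batch
        centroids = fit_centroids(train, peaks)
        rows, labels = batches[held_out]
        for row, label in zip(rows, labels):
            correct += predict(centroids, row, peaks) == label
            total += 1
    return correct / total


def simulate(seed=2, n_batches=3, n_per_group=6, n_peaks=20, n_true=2, shift=4.0):
    """Simulated batches: rows[subject][peak], with a case effect on the first peaks."""
    rng = random.Random(seed)
    batches = []
    for _ in range(n_batches):
        labels = [1] * n_per_group + [0] * n_per_group
        rows = [[rng.gauss(0, 1) + (shift if lab == 1 and p < n_true else 0.0)
                 for p in range(n_peaks)] for lab in labels]
        batches.append((rows, labels))
    return batches


accuracy = leave_one_batch_out(simulate(), n_top=2)
```

Ranking peaks on the full data set before splitting would leak information from the test set into the model and overstate predictive power, which is why the selection step sits inside the loop.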
14 Leave One Out Cross Validation: Proteomic Model Predictions
[Figure: model predictions by group. Legend: Leave One Out Training Set Batches - Cases; Leave One Out Training Set Batches - Controls; Other Mixed Batch - Cases; Other Mixed Batch - Controls; Other Batches - Controls.]
15 Mask Data By Restricting To High Quality Regions Of Proteomic Space
- TECHNICALLY
- Region of focus for instrument
- EMPIRICALLY
- Lowest residual variability
- Highest average intensity
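The empirical criteria above can be sketched as a simple filter. This assumes that "residual variability" means the pooled within-batch standard deviation (variation left after removing batch means) and that both criteria are applied as fixed thresholds. The thresholds, peak names and values are hypothetical.

```python
from statistics import mean


def pooled_within_batch_sd(values_by_batch):
    """Standard deviation remaining after removing each batch's own mean."""
    ss = 0.0
    df = 0
    for values in values_by_batch:
        m = mean(values)
        ss += sum((v - m) ** 2 for v in values)
        df += len(values) - 1
    return (ss / df) ** 0.5


def empirical_mask(peaks, min_mean, max_sd):
    """Keep peaks with high average intensity and low residual variability."""
    keep = []
    for peak_id, values_by_batch in peaks.items():
        average = mean(v for values in values_by_batch for v in values)
        if average >= min_mean and pooled_within_batch_sd(values_by_batch) <= max_sd:
            keep.append(peak_id)
    return keep


# Toy intensities for three peaks, two batches each (made-up numbers).
peaks = {
    "peak_a": [[10.0, 10.2, 9.8], [12.0, 12.1, 11.9]],  # intense, stable within batch
    "peak_b": [[10.0, 14.0, 6.0], [12.0, 16.0, 8.0]],   # intense but noisy
    "peak_c": [[1.0, 1.1, 0.9], [1.2, 1.3, 1.1]],       # stable but low intensity
}

kept = empirical_mask(peaks, min_mean=5.0, max_sd=1.0)
```

Because neither criterion uses the case/control labels, a mask of this kind can be applied before modelling without biasing the later cross-validation.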
16 Analysis Of Unmasked Peaks
- Batch Effects Still Dominate
- Consistent Case-Control Effect
- Can Identify Peaks Separating Cases And Controls Across Batches
17 Cross-Validation Predictions: Unmasked Peaks
[Figure: predictions by group. Legend: Leave One Out Same Batch - Cases; Leave One Out Same Batch - Controls; Other Mixed Batch - Cases; Other Mixed Batch - Controls; Other Batches - Controls.]
- Good Predictions Within Same Batch
- Prediction Rate Falls When Extrapolated To Other Batches
- Need To Prospectively Test In Another Set Of Patients
18 How To Combine Other Non-omic Information Into A Biomarker?
- Combining different data types is challenging
- The bigger data type will dominate the modelling
- Greater signal in data, but doesn't extrapolate as well
- Exploring options turning the random part of random forests to our advantage
[Figure labels: Known Clinical Prognostic; Proteomic Peaks]
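The slide does not say how the random part of random forests would be used. One possible reading, offered only as an assumption, is to replace the uniform random draw of candidate variables at each split with a draw that reserves a quota for each data type, so the small clinical block is not swamped by the proteomic peaks. The sketch below shows that sampling step alone; the variable names, quota sizes and the total of 11 candidates are hypothetical, while the 25,000 peaks figure comes from the earlier slide.

```python
import random
from math import comb


def sample_split_candidates(clinical, proteomic, n_clinical, n_proteomic, rng):
    """Draw candidate variables for one split with a fixed quota per data type."""
    return rng.sample(clinical, n_clinical) + rng.sample(proteomic, n_proteomic)


rng = random.Random(0)
clinical = ["age", "baseline_score"]              # hypothetical clinical variables
proteomic = [f"peak_{i}" for i in range(25000)]   # 25,000 common peaks per sample

candidates = sample_split_candidates(clinical, proteomic, 1, 10, rng)

# For comparison: the chance that a uniform draw of 11 candidates from all
# variables contains at least one clinical variable.
n_total = len(clinical) + len(proteomic)
p_uniform = 1 - comb(len(proteomic), 11) / comb(n_total, 11)
```

Under uniform sampling the clinical variables would be offered at fewer than one split in a thousand, which illustrates why the bigger data type dominates the modelling.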
19 Proteomic Quality Control Consortium?
- MAQC recently reported a reproducibility study for microarrays
- Wealth of valuable information
- Mammoth effort
- Could we do the same for proteomics?
- Less mature technology
- Greater diversity of platforms
- Diversity of pre-processing methodologies
- Issues of identification making large scale comparisons challenging
20 Conclusions
- Complicated new technologies
- Many challenges
- Technical, Data Quality, Data Analysis, Practical
- Essential role for statistics
- Need to integrate statistical approaches with understanding of technologies and biology
- Great potential
- Better treatments for patients
- Improved use of compounds
- Greater biological understanding