Title: BioDCV: a grid-enabled complete validation setup for functional profiling
Slide 1: BioDCV: a grid-enabled complete validation setup for functional profiling
Cesare Furlanello
with Silvano Paoli, Davide Albanese, Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor
http://mpa.itc.it
Wannsee Retreat, October 2005
Slide 3: Predictive classification and functional profiling
- Algorithms and software systems for predictive classification, feature selection, discovery
- Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM)
- Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures
- Subtype identification for studies of disease evolution and response to treatment
Slide 4: Selection bias
John P. A. Ioannidis, February 5, 2005:
"In conclusion, the list of genes included in a molecular signature (based on one training set and the proportion of misclassifications seen in one validation set) depends greatly on the selection of the patients in training sets. Five of the seven largest published studies addressing cancer prognosis did not classify patients better than chance. This result suggests that these publications were overoptimistic."
Slide 5: Michiels et al., Lancet 2005
"...the 95% CI for the proportion of misclassifications fell to below 50% for some training-set sizes in only two of the studies... We noted unstable molecular signatures and misclassification rates (with minimum rates between 31% and 49%)."
Slide 6: M. Ruschhaupt et al. (2004), "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology 3 (1), Article 37.
A NEW PAPER ON PREDICTIVE GENE PROFILING. INGREDIENTS: DATA + METHODS (+ EXPERIMENTAL SETUP?)
- The authors present a novel algorithm for classification, preprocessing, feature selection, ...
- A description is available in XXX and the algorithm is publicly available as a Windows program / website / R package YYY.
- BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
  1. Which classification result could be achieved using standard algorithms, and is there a difference in classification quality between a standard algorithm and the proposed one?
  2. If there is a substantial difference, what is the reason?
Slide 7: Reanalysis of the dataset of Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. The Lancet 361:1590-1596 (2003).
M. Ruschhaupt et al. (2004), Article 37 (cited above).
ERRORS FOR DIFFERENT CLASSIFIERS, WITH/WITHOUT METAGENES, IN COMPLETE VALIDATION:
- RF: random forest
- PAM: class prediction by nearest shrunken centroids
- PLR: penalized logistic regression
- SVM: Support Vector Machines
- BBT: Bayesian Binary Prediction Tree Models
- Metagenes: new variables from linear combinations
Slide 8: M. Ruschhaupt et al. (2004), Article 37 (cited above).
RESULTS
- Misclassification rates of around 25% with all eight methods.
- The use of metagenes did not seem to make a big difference either way.
- Most of the misclassified samples come from the group of patients with recurrence, which is the smaller group. Possibly, this could be explained by a preference of the classification algorithms to favour the larger group.
THEN HOW TO EVALUATE ACCURACY IN PREDICTION?
Slide 9: The BioDCV setup (E-RFE SVM)
- To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME
  - externally: a stratified random partitioning
  - internally: model selection based on a K-fold cross-validation
- ≈ 3 × 10^5 SVM models (with random labels: ≈ 2 × 10^6)
- Binary classification on a 20 000 genes × 45 samples cDNA array, 400 loops
- Ambroise & McLachlan 2002; Simon et al., 2003; Furlanello et al., 2003
- OFS-M: model tuning and feature ranking
- ONF: optimal gene panel estimator
- ATE: Average Test Error
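The scheme above can be sketched as follows. This is a minimal illustration, not the actual BioDCV C code: `fit_and_score` is a placeholder for the whole E-RFE SVM procedure (feature ranking and model selection by internal K-fold cross-validation), which must use only the training part of each outer split so that the test error is unbiased.

```python
import random

def stratified_split(labels, test_frac=0.3, rng=random):
    """Stratified random partitioning: sample test_frac of each class."""
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        k = max(1, int(round(test_frac * len(idx))))
        test += idx[:k]
        train += idx[k:]
    return sorted(train), sorted(test)

def complete_validation(labels, n_loops, fit_and_score, seed=0):
    """Outer loop of the complete validation scheme.

    fit_and_score(train_idx, test_idx) must rank features and select the
    model (e.g. by internal K-fold cross-validation) using ONLY train_idx,
    then return the error measured on test_idx.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_loops):
        train, test = stratified_split(labels, rng=rng)
        errors.append(fit_and_score(train, test))
    return sum(errors) / len(errors)  # ATE: average test error
```

With 400 outer loops, each inner cross-validation multiplying the model count further, totals of the order quoted above arise naturally.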
Slide 10: Tasks for BioDCV (E-RFE SVM)
- Colon cancer: 62 samples (40 tumoral / 22 control) described by 2000 genes (Alon et al., 1999)
- Lymphoma: 96 samples (74 tumoral / 24 control) described by 4096 genes (Alizadeh et al., 2000)
- Tumor vs. metastases: 76 samples (64 primary tumoral / 12 metastatic) described by 16063 genes (Ramaswamy et al., 2001)
- High Sezary CTCL: 30 samples (18 disease / 12 control) described by 6660 genes (Kari et al., 2003); coll. Wistar Inst.
- Glioma: 50 samples (28 glioblastoma / 22 oligodendroglioma) described by 12 627 genes (Nutt et al., 2003)
- Breast cancer: 37 samples (18 high risk / 19 low risk) described by 12 625 genes (Huang et al., 2003)
- Mouse model of myocardial infarction: 36 samples (18 infarcted / 18 control) described by 12 488 genes (Cardiogenomics PGA, http://cardiogenomics.med.harvard.edu, 2003)
Slide 11: Tasks for BioDCV (E-RFE SVM), continued
- Liver cancer: 213 samples (107 tumors from liver cancer / 106 non-tumoral/normal), 1993 genes, ATAC-PCR (Sese et al., 2000)
- Breast cancer: Wang et al., 2005: 238 samples (286 lymph-node-negative), Affymetrix, 17819 genes; Chang et al., 2005: 295 samples (151 lymph-node-negative / 144 positive), cDNA, 25000 genes
- IFOM: 62 BRCA (4 subclasses)
- Pediatric leukemia: 327 samples, 12626 genes (7 classes, binary 28443), Yeoh et al., 2002
- plus the tasks listed on Slide 10 (Tumor vs. metastases, High Sezary CTCL, Glioma, Breast cancer, Mouse model of myocardial infarction)
Slide 12: 1. With a Linux OpenMosix HPC facility
Our HPC resource, MpaCluster: 6 Xeon + 40 Pentium CPUs, OpenMosix, 3 TB central storage. Upgraded in 2005 (in production on GRID 2003-2004 with P3 biprocessors). → production in GRID
Slide 13: Roadmap for a new grid application
- Starting point: a suite of C modules and Perl/shell scripts running on a local HPC resource
- Optimize modules and scripts
  - database management of data, of model structures, of system outputs; scripts for OpenMosix Linux clusters
- Wrap BioDCV into a grid application
  - Learn about grid computing
  - Port the serial version on a computational grid testbed
  - Analyze/verify results; identify needs/problems
- Wrap with C + MPI scripts
  - Build the MPI mechanism
  - Experiment on the testbed
- Submit on the production grid
  - Test scalability
  - Production
Slide 14: 1. Optimize modules and scripts
- Rewrite shell/Perl scripts in the C language
  - control I/O costs
  - a process granularity optimal for temporary data allocation without tmp files
  - convenient for migrations
- SQLite interface (database engine library)
  - SQLite is small, self-contained, embeddable
  - It provides relational access to model and data structures (inputs, outputs, diagnostics)
  - It supports transactions and multiple connections, and databases up to 2 terabytes in size
- Local copy (db file): model definitions, a copy of the data, indexes defining the partition of the replicate sample(s)
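As an illustration of this design (the table and column names here are hypothetical, not the real BioDCV schema), a per-run local database can be driven through SQLite roughly like this, with one transaction per replicate so that no temporary files are needed:

```python
import sqlite3

def open_run_db(path=":memory:"):
    """Create a local per-run database (illustrative schema)."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS split(run_id INTEGER, sample_id INTEGER,
                                         role TEXT);  -- 'train' or 'test'
        CREATE TABLE IF NOT EXISTS ranking(run_id INTEGER, gene_id INTEGER,
                                           rank INTEGER);
        CREATE TABLE IF NOT EXISTS result(run_id INTEGER PRIMARY KEY,
                                          test_error REAL);
    """)
    return db

def store_run(db, run_id, train, test, ranking, test_error):
    """Store one replicate atomically: commit all rows or none."""
    with db:  # the sqlite3 context manager wraps this in one transaction
        db.executemany("INSERT INTO split VALUES (?,?,?)",
                       [(run_id, s, "train") for s in train]
                       + [(run_id, s, "test") for s in test])
        db.executemany("INSERT INTO ranking VALUES (?,?,?)",
                       [(run_id, g, r) for r, g in enumerate(ranking)])
        db.execute("INSERT INTO result VALUES (?,?)", (run_id, test_error))
```

The same file then travels with the job when a process migrates, which is the point of avoiding scattered tmp files.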
Slide 15: BioDCV
(1) exp: experiment design through configuration of the setup database
(2) scheduler: script submitting jobs (run) on each available processor; platform dependent
(3) run: performs fractions of the complete validation procedure on several data splits; a local db is created
(4) unify: the local datasets are merged with the setup after completing the validation tasks; a complete dataset collecting all the relevant parameters is created
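The unify step can be sketched with SQLite's ATTACH mechanism; again, the `result` table is illustrative rather than the actual BioDCV schema:

```python
import sqlite3

def unify(local_paths, out_path):
    """Merge per-run local databases into one complete dataset.

    Assumes every local db contains a result(run_id, test_error) table
    (an illustrative schema, not the real one).
    """
    out = sqlite3.connect(out_path)
    out.execute("CREATE TABLE IF NOT EXISTS result(run_id INTEGER, test_error REAL)")
    for path in local_paths:
        # Attach each local db in turn and copy its rows over.
        out.execute("ATTACH DATABASE ? AS loc", (path,))
        with out:
            out.execute("INSERT INTO result SELECT run_id, test_error FROM loc.result")
        out.execute("DETACH DATABASE loc")
    return out
```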
Slide 16: 2. Wrapping into a grid application
- Why port to the grid? Because we need enough computational resources.
- How to port BioDCV to the grid?
- PRELIMINARY
  - Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP, http://www.egrid.it)
  - Train human resources (SP → Trieste)
  - Join the Egrid testbed (installing a supernode in Trento)
- HANDS-ON
  - Port the serial application on the testbed
  - Patch code as needed; code portability is mandatory to make life easier
  - Identify requirements/problems
Slide 17: A few EDG definitions
- User Interface (UI): machine used to access the GRID site
- Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration
- Computing Element (CE): where the grid user programs are delivered for elaboration; usually a front-end to several elementary Worker Node machines
- Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs
Slide 18: The ICTP Egrid project infrastructures
- The local testbed in Trieste
  - Small computational grid based on EDG middleware + Egrid add-ons
  - Designed for testing/training/porting of applications
  - Full compatibility with Grid.it middleware
- The production infrastructure
  - A Virtual Organization within Grid.it, with its own services
  - Star topology with central node in Padova
[Diagram: Trieste (CE, SE, WN) and Trento (CE, SE, WN) connect to the central node in Padova (CE; SE: 2.8 TB; WNs: 100 CPUs).]
Slide 19: Hands on
- Porting the serial application
  - Easy task due to portability (no actual work needed)
  - No software/library dependencies
- Testing/evaluation: problems identified
  - Job submission overhead due to EDG mechanisms
  - Managing multiple (hundreds/thousands of) jobs is difficult and cumbersome
- Answer: parallelize jobs on the GRID via MPI
  - Single submission
  - Multiple executions
Slide 20: 3. Wrap with C + MPI scripts
- How can we use C + MPI? Prepare two wrappers and a unifier:
  - one shell script to submit jobs (BioDCV.sh)
  - one C MPI program (Mpba-mpi)
  - one shell script to integrate results (BioDCV-union.sh)
- BioDCV.sh in action
  - copies files from and to the Storage Element (SE) and distributes the microarray dataset to all WNs
  - then starts the C MPI wrapper, which spawns several runs of the BioDCV program (optimized for resources)
  - when all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE
- Mpba-mpi executes the BioDCV runs in parallel
- BioDCV-union.sh collates results in one SQLite file (→ R)
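How an MPI wrapper can divide the replicate runs among worker processes is simple arithmetic. This sketch is hypothetical (it is not the Mpba-mpi source): it gives each rank a near-even contiguous block of the validation runs, so a single submission keeps all CPUs busy.

```python
def runs_for_rank(n_runs, n_ranks, rank):
    """Near-even block partition of replicate runs across MPI ranks.

    The first (n_runs % n_ranks) ranks take one extra run each, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n_runs, n_ranks)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))
```

For the 400-loop, 64-node job described above, ranks 0-15 would each execute 7 runs and ranks 16-63 would each execute 6.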
Slide 21: Using BioDCV in Egrid
edg-job-submit bioDCV.jdl
- UI: Egrid Live CD, a bootable Linux live-CD distribution with a complete suite of GRID tools by Egrid (ICTP Trieste)
- Resource broker (PD-TN)
[Diagram: the job goes from the UI through the resource broker to the sites Padova (CE; SE: 2.8 TB; WNs: 100 CPUs), Trento, Trieste and Palermo, each with CE, SE and WNs.]
Slide 22: A job description (BioDCV.jdl)

Type = "Job";
JobType = "MPICH";
NodeNumber = 64;
Executable = "BioDCV.sh";
Arguments = "Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"BioDCV.sh", "Mpba-mpi", "run", "run.sh"};
OutputSandbox = {"test.err", "test.out", "executable.out"};
Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";
Slide 23: Using BioDCV in Egrid (II)
- First step: BioDCV.sh runs on WN 1; it requests the file and copies the data (sarcoma.db) from the SE to the WN.
- Second step: Mpba-mpi and sarcoma.db are distributed to all the involved WNs (WN 1, WN 2, ..., WN n).
Slide 24: Using BioDCV in Egrid (III)
- Third step: BioDCV is executed on all involved WNs by MPI (Mpba-mpi runs on WN 2, WN 3, ..., WN n).
- Fourth step: when the jobs are completed, BioDCV.sh copies all the results (SQLite files) from the WNs back to the starting SE.
Slide 25: Running on the testbed (egrid.it): scaling-up tests
[Plots a and b: scaling results; CPUs: Intel Xeon @ 2.80 GHz.]
Slide 26: Usage (examples)
- Outlier detection: BioDCV on the complete dataset (213 samples) → shaved dataset (198 samples); BioDCV is then run again on the shaved dataset
- Compare subgroups and pathological features
Slide 27: Example of semisupervised analysis (Sese)
Slide 28: Results
- The pros
  - MPI execution on the GRID in a few days
  - The tests showed scalable behavior of our grid application for increasing numbers of CPUs
  - Grid computing significantly reduces production times and allows us to tackle larger problems (see next slide)
- The cons
  - Data movements limit the scalability for a large number of CPUs
  - Note: this is a Grid.it limitation; there is no shared file system between the WNs, so each file needs to be copied everywhere!
- To hide the latency (ideas)
  - Smart data distribution from the MWN to the WNs
  - Reduce the amount of data to be moved
  - Proportion BioDCV subtasks to the local cache
  - Data transferred via MPI communication (requires MPI coding and some MPI adaptation of the code)
Slide 29: Move no. 2: improving the system
- Reduce the amount of data to be moved
- Redesign per run:
  - SVM models (about 200) and results, evaluation
  - variables for semisupervised analysis
  - all managed within one data structure
- A large part of the sample-tracking semisupervised analysis is now managed within BioDCV (about 2000 files, 300 MB), i.e. stored through SQLite
- Randomization of labels is fully automated
- The SVM library is now an external library
  - Modular use of machine learning methods
  - Now adding a PDA module
- BioDCV is now under GPL (code curation)
  - Distributed via a Subversion server since September 2005
Slide 30: (October 2005) Cluster and grid issues
- At work on several clusters
  - MPBA-old: 50 P3 CPUs, 1 GHz
  - MPBA-new: 6 Xeon CPUs, 2.8 GHz
  - ECT (BEN): up to 32 (of 100) Xeon CPUs, 2.8 GHz
  - SISSA (Cozzini): up to 32 (of 60) P4, 2 GHz, Myrinet
- GRID experiences
  - Egrid production grid (INFN Padua): up to 64 (of 100) Xeon CPUs, 2-3 GHz
  - Microarray data: Sarcoma, HS random, Morishita, Wang
- LESSONS LEARNED
  - The latest version reduces latencies (system times) due to file copying and management → CPU saturation
  - Life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities
  - Forgetting the severe limitation of the file system (AFS, ...)
  - Now installing LCG2 2.6 (CERN release, September 2005)
Slide 31: Challenges for predictive profiling
- INFRASTRUCTURE
  - MPACluster → available for batch jobs
  - Connecting with IFOM → 2005
  - Running at IFOM → 2005/2006
  - Production on GRID resources (spring 2005)
- ALGORITHMS II
  - Gene list fusion: suite of algebraic/statistical methods
  - Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
  - New SVM kernels for prediction on spectrometry data within complete validation
Slide 32: Challenges (AIRC-BICG)
HPC interaction: access through web front-ends to GRID + HPC
- Basic classification models, lists, additional tools
- Tools for researchers: subtype discovery, outlier detection
- Connection to data (DB, MIAME)
Slide 34: Acknowledgments
- ITC-irst, Trento: Davide Albanese, Giuseppe Jurman, Stefano Merler, Roberto Flor, Alessandro Soraruf
- ICTP E-GRID Project, Trieste: Angelo Leto, Cristian Zoicas, Riccardo Murri, Ezio Corso, Alessio Terpin
- IFOM-FIRC and INT, Milano: James Reid, Manuela Gariboldi, Marco A. Pierotti
- Grants: BICG (AIRC), Democritos
- Data: ShoweLab (Wistar), Cardiogenomics PGA
Slide 35: DTW-based clustering
[Dendrogram of the positive samples. Curves were clustered with Dynamic Time Warping as distance, with weight configuration (1,2,1); heights range from 0.000 to 0.008.]
Slide 36: Subgroup and pathological features
- DTW-based clustering of the sample-tracking profiles of the cluster subgroup.
- All samples have virus type B (V-B).
- Bottom lines: incidence in the cluster subgroup and in all the positive samples.
Slide 37: Predictive error
[Plot: Average Test Error (ATE, %) vs. number of features (1, 5, 50, 150, 500, 1993) for the complete dataset and the shaved dataset, with confidence intervals at the .95 level; ATE axis from 0% to 30%.]
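For reference, a .95-level confidence interval like the ones shown with the predictive-error curves can be attached to a misclassification estimate via the normal approximation to the binomial; this is a generic sketch, and the intervals in the actual plot may instead be derived from the spread of test errors over the 400 replicates.

```python
import math

def error_rate_ci(n_errors, n_tests, z=1.96):
    """Misclassification rate with a normal-approximation (Wald) CI.

    z = 1.96 gives the .95 level; the interval is clipped to [0, 1].
    """
    p = n_errors / n_tests
    half = z * math.sqrt(p * (1 - p) / n_tests)
    return p, max(0.0, p - half), min(1.0, p + half)
```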