Title: BioDCV: a grid-enabled complete validation setup for functional profiling
Slide 1: BioDCV: a grid-enabled complete validation setup for functional profiling
Cesare Furlanello
with Silvano Paoli, Davide Albanese, Giuseppe Jurman, Annalisa Barla, Stefano Merler, Roberto Flor
http://mpa.itc.it
Wannsee Retreat, October 2005
Slide 3: Predictive classification and functional profiling
- Algorithms and software systems for predictive classification, feature selection, discovery
- Our BioDCV system: a set-up based on the E-RFE algorithm for Support Vector Machines (SVM)
- Control of selection bias, a serious experimental design issue in the use of prognostic molecular signatures
- Subtype identification for studies of disease evolution and response to treatment
Slide 4: Selection bias
John P. A. Ioannidis, February 5, 2005:
"In conclusion, the list of genes included in a molecular signature (based on one training set and the proportion of misclassifications seen in one validation set) depends greatly on the selection of the patients in training sets. Five of the seven largest published studies addressing cancer prognosis did not classify patients better than chance. This result suggests that these publications were overoptimistic."
Slide 5: Michiels et al., Lancet 2005
"...the 95% CI for the proportion of misclassifications fell to below 50% for some training-set sizes in only two of the studies... We noted unstable molecular signatures and misclassification rates (with minimum rates between 31% and 49%)."
Slide 6: M. Ruschhaupt et al. (2004), "A Compendium to Ensure Computational Reproducibility in High-Dimensional Classification Tasks", Statistical Applications in Genetics and Molecular Biology 3 (1), Article 37.
A NEW PAPER ON PREDICTIVE GENE PROFILING. INGREDIENTS: DATA + METHODS (+ EXPERIMENTAL SETUP?)
- The authors present a novel algorithm for classification, preprocessing, feature selection, ...
- A description is available in XXX and the algorithm is publicly available as a Windows program / website / R package YYY.
- BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
  1. Which classification result could be achieved using standard algorithms, and is there a difference in classification quality between a standard algorithm and the proposed one?
  2. If there is a substantial difference, what is the reason?
Slide 7: Reanalysis of the dataset of Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, Bild A, Iversen ES, Liao M, Chen CM, West M, Nevins JR, Huang AT. Gene expression predictors of breast cancer outcomes. The Lancet 361:1590-1596 (2003).
M. Ruschhaupt et al. (2004), Article 37 (cited above).
ERRORS FOR DIFFERENT CLASSIFIERS, WITH/WITHOUT METAGENES, IN COMPLETE VALIDATION:
- RF: random forest
- PAM: class prediction by nearest shrunken centroids
- PLR: penalized logistic regression
- SVM: Support Vector Machines
- BBT: Bayesian Binary Prediction Tree Models
- Metagenes: new variables from linear combinations
Slide 8: M. Ruschhaupt et al. (2004), Article 37 (cited above).
RESULTS
- Misclassification rates of around 25% with all eight methods.
- The use of metagenes did not seem to make a big difference either way.
- Most of the misclassified samples come from the group of patients with recurrence, which is the smaller group. Possibly, this could be explained by a preference of the classification algorithms to favour the larger group.
THEN HOW TO EVALUATE ACCURACY IN PREDICTION?
Slide 9: The BioDCV setup (E-RFE SVM)
- To avoid selection bias (p >> n): a COMPLETE VALIDATION SCHEME
  - externally: a stratified random partitioning
  - internally: model selection based on a K-fold cross-validation
- ≈ 3 × 10^5 SVM models (with random labels: ≈ 2 × 10^6)
- Binary classification on a 20 000 genes × 45 samples cDNA array, 400 loops
- Ambroise & McLachlan 2002; Simon et al., 2003; Furlanello et al., 2003
- OFS-M: model tuning and feature ranking
- ONF: optimal gene panel estimator
- ATE: Average Test Error
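The scheme above can be sketched as follows. This is a minimal illustration, not the actual BioDCV C code: `fit_and_score` is a placeholder for the whole E-RFE SVM procedure (feature ranking and model selection by internal K-fold cross-validation), which must use only the training part of each outer split so that the test error is unbiased.

```python
import random

def stratified_split(labels, test_frac=0.3, rng=random):
    """Stratified random partitioning: sample test_frac of each class."""
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        k = max(1, int(round(test_frac * len(idx))))
        test += idx[:k]
        train += idx[k:]
    return sorted(train), sorted(test)

def complete_validation(labels, n_loops, fit_and_score, seed=0):
    """Outer loop of the complete validation scheme.

    fit_and_score(train_idx, test_idx) must rank features and select the
    model (e.g. by internal K-fold cross-validation) using ONLY train_idx,
    then return the error measured on test_idx.
    """
    rng = random.Random(seed)
    errors = []
    for _ in range(n_loops):
        train, test = stratified_split(labels, rng=rng)
        errors.append(fit_and_score(train, test))
    return sum(errors) / len(errors)  # ATE: average test error
```

With 400 outer loops, each inner cross-validation multiplying the model count further, totals of the order quoted above arise naturally.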
Slide 10: Tasks for BioDCV (E-RFE SVM)
- Colon cancer: 62 samples (40 tumoral / 22 control) described by 2000 genes (Alon et al., 1999)
- Lymphoma: 96 samples (74 tumoral / 24 control) described by 4096 genes (Alizadeh et al., 2000)
- Tumor vs. metastases: 76 samples (64 primary tumoral / 12 metastatic) described by 16063 genes (Ramaswamy et al., 2001)
- High Sezary CTCL: 30 samples (18 disease / 12 control) described by 6660 genes (Kari et al., 2003); coll. Wistar Inst.
- Glioma: 50 samples (28 glioblastoma / 22 oligodendroglioma) described by 12 627 genes (Nutt et al., 2003)
- Breast cancer: 37 samples (18 high risk / 19 low risk) described by 12 625 genes (Huang et al., 2003)
- Mouse model of myocardial infarction: 36 samples (18 infarcted / 18 control) described by 12 488 genes (Cardiogenomics PGA, http://cardiogenomics.med.harvard.edu, 2003)
Slide 11: Tasks for BioDCV (E-RFE SVM), continued
- Liver cancer: 213 samples (107 tumors from liver cancer / 106 non-tumoral/normal), 1993 genes, ATAC-PCR (Sese et al., 2000)
- Breast cancer: Wang et al., 2005: 238 samples (286 lymph-node-negative), Affymetrix, 17819 genes; Chang et al., 2005: 295 samples (151 lymph-node-negative / 144 positive), cDNA, 25000 genes
- IFOM: 62 BRCA (4 subclasses)
- Pediatric leukemia: 327 samples, 12626 genes (7 classes, binary 28443), Yeoh et al., 2002
- plus the tasks listed on Slide 10 (Tumor vs. metastases, High Sezary CTCL, Glioma, Breast cancer, Mouse model of myocardial infarction)
Slide 12: 1. With a Linux OpenMosix HPC facility
Our HPC resource, MpaCluster: 6 Xeon + 40 Pentium CPUs, OpenMosix, 3 TB central storage. Upgraded in 2005 (in production on GRID 2003-2004 with P3 biprocessors). → production in GRID
Slide 13: Roadmap for a new grid application
- Starting point: a suite of C modules and Perl/shell scripts running on a local HPC resource
- Optimize modules and scripts
  - database management of data, of model structures, of system outputs; scripts for OpenMosix Linux clusters
- Wrap BioDCV into a grid application
  - Learn about grid computing
  - Port the serial version on a computational grid testbed
  - Analyze/verify results; identify needs/problems
- Wrap with C + MPI scripts
  - Build the MPI mechanism
  - Experiment on the testbed
- Submit on the production grid
  - Test scalability
  - Production
Slide 14: 1. Optimize modules and scripts
- Rewrite shell/Perl scripts in the C language
  - control I/O costs
  - a process granularity optimal for temporary data allocation without tmp files
  - convenient for migrations
- SQLite interface (database engine library)
  - SQLite is small, self-contained, embeddable
  - It provides relational access to model and data structures (inputs, outputs, diagnostics)
  - It supports transactions and multiple connections, and databases up to 2 terabytes in size
- Local copy (db file): model definitions, a copy of the data, indexes defining the partition of the replicate sample(s)
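As an illustration of this design (the table and column names here are hypothetical, not the real BioDCV schema), a per-run local database can be driven through SQLite roughly like this, with one transaction per replicate so that no temporary files are needed:

```python
import sqlite3

def open_run_db(path=":memory:"):
    """Create a local per-run database (illustrative schema)."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS split(run_id INTEGER, sample_id INTEGER,
                                         role TEXT);  -- 'train' or 'test'
        CREATE TABLE IF NOT EXISTS ranking(run_id INTEGER, gene_id INTEGER,
                                           rank INTEGER);
        CREATE TABLE IF NOT EXISTS result(run_id INTEGER PRIMARY KEY,
                                          test_error REAL);
    """)
    return db

def store_run(db, run_id, train, test, ranking, test_error):
    """Store one replicate atomically: commit all rows or none."""
    with db:  # the sqlite3 context manager wraps this in one transaction
        db.executemany("INSERT INTO split VALUES (?,?,?)",
                       [(run_id, s, "train") for s in train]
                       + [(run_id, s, "test") for s in test])
        db.executemany("INSERT INTO ranking VALUES (?,?,?)",
                       [(run_id, g, r) for r, g in enumerate(ranking)])
        db.execute("INSERT INTO result VALUES (?,?)", (run_id, test_error))
```

The same file then travels with the job when a process migrates, which is the point of avoiding scattered tmp files.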
Slide 15: BioDCV
(1) exp: experiment design through configuration of the setup database
(2) scheduler: script submitting jobs (run) on each available processor; platform dependent
(3) run: performs fractions of the complete validation procedure on several data splits; a local db is created
(4) unify: the local datasets are merged with the setup after completing the validation tasks; a complete dataset collecting all the relevant parameters is created
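The unify step can be sketched with SQLite's ATTACH mechanism; again, the `result` table is illustrative rather than the actual BioDCV schema:

```python
import sqlite3

def unify(local_paths, out_path):
    """Merge per-run local databases into one complete dataset.

    Assumes every local db contains a result(run_id, test_error) table
    (an illustrative schema, not the real one).
    """
    out = sqlite3.connect(out_path)
    out.execute("CREATE TABLE IF NOT EXISTS result(run_id INTEGER, test_error REAL)")
    for path in local_paths:
        # Attach each local db in turn and copy its rows over.
        out.execute("ATTACH DATABASE ? AS loc", (path,))
        with out:
            out.execute("INSERT INTO result SELECT run_id, test_error FROM loc.result")
        out.execute("DETACH DATABASE loc")
    return out
```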
Slide 16: 2. Wrapping into a grid application
- Why port to the grid? Because we need enough computational resources.
- How to port BioDCV to the grid?
- PRELIMINARY
  - Identify a collaborator with experience in grid computing (e.g. the Egrid Project hosted at ICTP, http://www.egrid.it)
  - Train human resources (SP → Trieste)
  - Join the Egrid testbed (installing a supernode in Trento)
- HANDS-ON
  - Port the serial application on the testbed
  - Patch code as needed; code portability is mandatory to make life easier
  - Identify requirements/problems
Slide 17: A few EDG definitions
- User Interface (UI): machine used to access the GRID site
- Storage Element (SE): stores the user data in the grid and makes it available for subsequent elaboration
- Computing Element (CE): where the grid user programs are delivered for elaboration; usually a front-end to several elementary Worker Node machines
- Worker Node (WN): machines where the user programs are actually executed, possibly with multiple CPUs
Slide 18: The ICTP Egrid project infrastructures
- The local testbed in Trieste
  - Small computational grid based on EDG middleware + Egrid add-ons
  - Designed for testing/training/porting of applications
  - Full compatibility with Grid.it middleware
- The production infrastructure
  - A Virtual Organization within Grid.it, with its own services
  - Star topology with central node in Padova
[Diagram: Trieste (CE, SE, WN) and Trento (CE, SE, WN) connect to the central node in Padova (CE; SE: 2.8 TB; WNs: 100 CPUs).]
Slide 19: Hands on
- Porting the serial application
  - Easy task due to portability (no actual work needed)
  - No software/library dependencies
- Testing/evaluation: problems identified
  - Job submission overhead due to EDG mechanisms
  - Managing multiple (hundreds/thousands of) jobs is difficult and cumbersome
- Answer: parallelize jobs on the GRID via MPI
  - Single submission
  - Multiple executions
Slide 20: 3. Wrap with C + MPI scripts
- How can we use C + MPI? Prepare two wrappers and a unifier:
  - one shell script to submit jobs (BioDCV.sh)
  - one C MPI program (Mpba-mpi)
  - one shell script to integrate results (BioDCV-union.sh)
- BioDCV.sh in action
  - copies files from and to the Storage Element (SE) and distributes the microarray dataset to all WNs
  - then starts the C MPI wrapper, which spawns several runs of the BioDCV program (optimized for resources)
  - when all BioDCV runs are completed, the wrapper copies all the results (SQLite files) from the WNs to the starting SE
- Mpba-mpi executes the BioDCV runs in parallel
- BioDCV-union.sh collates results in one SQLite file (→ R)
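How an MPI wrapper can divide the replicate runs among worker processes is simple arithmetic. This sketch is hypothetical (it is not the Mpba-mpi source): it gives each rank a near-even contiguous block of the validation runs, so a single submission keeps all CPUs busy.

```python
def runs_for_rank(n_runs, n_ranks, rank):
    """Near-even block partition of replicate runs across MPI ranks.

    The first (n_runs % n_ranks) ranks take one extra run each, so block
    sizes differ by at most one.
    """
    base, extra = divmod(n_runs, n_ranks)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))
```

For the 400-loop, 64-node job described above, ranks 0-15 would each execute 7 runs and ranks 16-63 would each execute 6.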
Slide 21: Using BioDCV in Egrid
edg-job-submit bioDCV.jdl
- UI: Egrid Live CD, a bootable Linux live-CD distribution with a complete suite of GRID tools by Egrid (ICTP Trieste)
- Resource broker (PD-TN)
[Diagram: the job goes from the UI through the resource broker to the sites Padova (CE; SE: 2.8 TB; WNs: 100 CPUs), Trento, Trieste and Palermo, each with CE, SE and WNs.]
Slide 22: A job description (BioDCV.jdl)

Type = "Job";
JobType = "MPICH";
NodeNumber = 64;
Executable = "BioDCV.sh";
Arguments = "Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"BioDCV.sh", "Mpba-mpi", "run", "run.sh"};
OutputSandbox = {"test.err", "test.out", "executable.out"};
Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";
Slide 23: Using BioDCV in Egrid (II)
- First step: BioDCV.sh runs on WN 1; it requests the file and copies the data (sarcoma.db) from the SE to the WN.
- Second step: Mpba-mpi and sarcoma.db are distributed to all the involved WNs (WN 1, WN 2, ..., WN n).
Slide 24: Using BioDCV in Egrid (III)
- Third step: BioDCV is executed on all involved WNs by MPI (Mpba-mpi runs on WN 2, WN 3, ..., WN n).
- Fourth step: when the jobs are completed, BioDCV.sh copies all the results (SQLite files) from the WNs back to the starting SE.
Slide 25: Running on the testbed (egrid.it): scaling-up tests
[Plots a and b: scaling results; CPUs: Intel Xeon @ 2.80 GHz.]
Slide 26: Usage (examples)
- Outlier detection: BioDCV on the complete dataset (213 samples) → shaved dataset (198 samples); BioDCV is then run again on the shaved dataset
- Compare subgroups and pathological features
Slide 27: Example of semisupervised analysis (Sese)
Slide 28: Results
- The pros
  - MPI execution on the GRID in a few days
  - The tests showed scalable behavior of our grid application for increasing numbers of CPUs
  - Grid computing significantly reduces production times and allows us to tackle larger problems (see next slide)
- The cons
  - Data movements limit the scalability for a large number of CPUs
  - Note: this is a Grid.it limitation; there is no shared file system between the WNs, so each file needs to be copied everywhere!
- To hide the latency (ideas)
  - Smart data distribution from the MWN to the WNs
  - Reduce the amount of data to be moved
  - Proportion BioDCV subtasks to the local cache
  - Data transferred via MPI communication (requires MPI coding and some MPI adaptation of the code)
Slide 29: Move no. 2: improving the system
- Reduce the amount of data to be moved
- Redesign per run:
  - SVM models (about 200) and results, evaluation
  - variables for semisupervised analysis
  - all managed within one data structure
- A large part of the sample-tracking semisupervised analysis is now managed within BioDCV (about 2000 files, 300 MB), i.e. stored through SQLite
- Randomization of labels is fully automated
- The SVM library is now an external library
  - Modular use of machine learning methods
  - Now adding a PDA module
- BioDCV is now under GPL (code curation)
  - Distributed via a Subversion server since September 2005
Slide 30: (October 2005) Cluster and grid issues
- At work on several clusters
  - MPBA-old: 50 P3 CPUs, 1 GHz
  - MPBA-new: 6 Xeon CPUs, 2.8 GHz
  - ECT (BEN): up to 32 (of 100) Xeon CPUs, 2.8 GHz
  - SISSA (Cozzini): up to 32 (of 60) P4, 2 GHz, Myrinet
- GRID experiences
  - Egrid production grid (INFN Padua): up to 64 (of 100) Xeon CPUs, 2-3 GHz
  - Microarray data: Sarcoma, HS random, Morishita, Wang
- LESSONS LEARNED
  - The latest version reduces latencies (system times) due to file copying and management → CPU saturation
  - Life quality (and more studies): huge reduction of file installing and retrieving from facilities and WITHIN facilities
  - Forgetting the severe limitation of the file system (AFS, ...)
  - Now installing LCG2 2.6 (CERN release, September 2005)
Slide 31: Challenges for predictive profiling
- INFRASTRUCTURE
  - MPACluster → available for batch jobs
  - Connecting with IFOM → 2005
  - Running at IFOM → 2005/2006
  - Production on GRID resources (spring 2005)
- ALGORITHMS II
  - Gene list fusion: suite of algebraic/statistical methods
  - Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
  - New SVM kernels for prediction on spectrometry data within complete validation
Slide 32: Challenges (AIRC-BICG)
HPC interaction: access through web front-ends to GRID + HPC
- Basic classification models, lists, additional tools
- Tools for researchers: subtype discovery, outlier detection
- Connection to data (DB, MIAME)
Slide 34: Acknowledgments
- ITC-irst, Trento: Davide Albanese, Giuseppe Jurman, Stefano Merler, Roberto Flor, Alessandro Soraruf
- ICTP E-GRID Project, Trieste: Angelo Leto, Cristian Zoicas, Riccardo Murri, Ezio Corso, Alessio Terpin
- IFOM-FIRC and INT, Milano: James Reid, Manuela Gariboldi, Marco A. Pierotti
- Grants: BICG (AIRC), Democritos
- Data: ShoweLab (Wistar), Cardiogenomics PGA
Slide 35: DTW-based clustering
[Dendrogram of the positive samples. Curves were clustered with Dynamic Time Warping as distance, with weight configuration (1,2,1); heights range from 0.000 to 0.008.]
Slide 36: Subgroup and pathological features
- DTW-based clustering of the sample-tracking profiles of the cluster subgroup.
- All samples have virus type B (V-B).
- Bottom lines: incidence in the cluster subgroup and in all the positive samples.
Slide 37: Predictive error
[Plot: Average Test Error (ATE, %) vs. number of features (1, 5, 50, 150, 500, 1993) for the complete dataset and the shaved dataset, with confidence intervals at the .95 level; ATE axis from 0% to 30%.]
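For reference, a .95-level confidence interval like the ones shown with the predictive-error curves can be attached to a misclassification estimate via the normal approximation to the binomial; this is a generic sketch, and the intervals in the actual plot may instead be derived from the spread of test errors over the 400 replicates.

```python
import math

def error_rate_ci(n_errors, n_tests, z=1.96):
    """Misclassification rate with a normal-approximation (Wald) CI.

    z = 1.96 gives the .95 level; the interval is clipped to [0, 1].
    """
    p = n_errors / n_tests
    half = z * math.sqrt(p * (1 - p) / n_tests)
    return p, max(0.0, p - half), min(1.0, p + half)
```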