BioDCV: a grid-enabled complete validation setup for functional profiling

1
BioDCV: a grid-enabled complete validation setup
for functional profiling
  • Cesare Furlanello
  • with
  • Silvano Paoli, Davide Albanese, Giuseppe Jurman,
    Annalisa Barla, Stefano Merler, Roberto Flor

http://mpa.itc.it
Wannsee Retreat, October 2005
3
Predictive classification and functional profiling
  • Algorithms and software systems for
  • Predictive classification, feature selection,
    discovery
  • Our BioDCV system: a set-up based on the E-RFE
    algorithm for Support Vector Machines (SVM)
  • Control of selection bias, a serious
    experimental design issue in the use of
    prognostic molecular signatures
  • Subtype identification for studies of disease
    evolution and response to treatment

4
Selection bias
John P. A. Ioannidis, February 5, 2005
In conclusion, the list of genes included in a
molecular signature (based on one training set
and the proportion of misclassifications seen in
one validation set) depends greatly on the
selection of the patients in training sets.
Five of the seven largest published studies
addressing cancer prognosis did not classify
patients better than chance. This result suggests
that these publications were overoptimistic.
5
Michiels et al., Lancet 2005
"... the 95% CI for the proportion of
misclassifications fell to below 50% for some
training-set sizes in only two of the
studies ... We noted unstable molecular signatures
and misclassification rates (with minimum rates
between 31% and 49%)."
6
M. Ruschhaupt et al (2004) "A Compendium to
Ensure Computational Reproducibility in
High-Dimensional Classification Tasks",
Statistical Applications in Genetics and
Molecular Biology 3 (1), Article 37.
A NEW PAPER ON PREDICTIVE GENE PROFILING
INGREDIENTS: DATA + METHODS (- EXPERIMENTAL
SETUP?)
  • The authors present a novel algorithm for
    classification, preprocessing, feature
    selection, ...
  • A description is available in XXX and the
    algorithm is publicly available as a Windows
    program / website / R package YYY.
  • BUT, HAVE THEY ANSWERED THE FOLLOWING QUESTIONS?
  • 1. Which classification result could be achieved
    using standard algorithms and is there a
    difference in classification quality between a
    standard algorithm and the proposed one?
  • 2. If there is a substantial difference, what is
    the reason?

7
REANALYSIS OF DATASET: Huang E, Cheng SH,
Dressman H, Pittman J, Tsou MH, Horng CF, Bild A,
Iversen ES, Liao M, Chen CM, West M, Nevins JR,
Huang AT. Gene expression predictors of breast
cancer outcomes. The Lancet 361:1590-1596 (2003).
Reanalysis in M. Ruschhaupt et al. (2004).
ERRORS FOR DIFFERENT CLASSIFIERS AND WITH/WITHOUT
METAGENES IN COMPLETE VALIDATION
  • RF: random forest
  • PAM: class prediction by nearest shrunken
    centroids
  • PLR: penalized logistic regression
  • SVM: Support Vector Machines
  • BBT: Bayesian Binary Prediction Tree Models
  • Metagenes: new variables from linear combinations
8
M. Ruschhaupt et al. (2004)
RESULTS
  • Misclassification rates of around 25% with all
    eight methods.
  • The use of metagenes did not seem to make a big
    difference either way.
  • Most of the misclassified samples come from the
    group of patients with recurrence, which is the
    smaller group. Possibly, this could be explained
    by a preference of the classification algorithms
    to favour the larger group.
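
One way to check the last point is to report the error separately for each class rather than as a single overall rate. The sketch below is a minimal illustration in Python: the labels and predictions are made-up toy values (not data from Ruschhaupt et al.), and `per_class_error` is a hypothetical helper name.

```python
def per_class_error(y_true, y_pred):
    """Return {class: fraction of that class's samples that were misclassified}."""
    errors = {}
    for cls in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        wrong = sum(1 for i in idx if y_pred[i] != cls)
        errors[cls] = wrong / len(idx)
    return errors


# Toy example (hypothetical values): 4 recurrence samples, 12 without recurrence.
y_true = ["recurrence"] * 4 + ["no_recurrence"] * 12
# A classifier that almost always predicts the larger group.
y_pred = ["no_recurrence"] * 3 + ["recurrence"] + ["no_recurrence"] * 12

overall = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
by_class = per_class_error(y_true, y_pred)
```

With these toy values the overall error is 3/16, while three of the four samples in the smaller group are misclassified: a low overall rate can hide a poor rate on the minority class.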

THEN HOW TO EVALUATE ACCURACY IN PREDICTION?
9
The BioDCV setup (E-RFE SVM)
  • To avoid selection bias (p >> n): a COMPLETE
    VALIDATION SCHEME
  • externally, a stratified random partitioning,
  • internally, a model selection based on a K-fold
    cross-validation
  • → 3 x 10^5 SVM models (+ random labels → 2 x 10^6)


Binary classification, on a 20 000 genes x 45
cDNA array, 400 loops
Ambroise & McLachlan, 2002; Simon et al., 2003;
Furlanello et al., 2003
OFS-M: Model tuning and Feature ranking
ONF: Optimal gene panel estimator
ATE: Average Test Error
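
The two nested levels of the scheme can be sketched as index bookkeeping: an external stratified random split for each replicate, and an internal K-fold split of the training part only. The Python sketch below is an illustration under stated assumptions, not the BioDCV implementation: the function names, the test fraction (0.25) and K (3) are hypothetical choices; only the 400 replicates and the 40 + 22 class sizes come from the slides.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """External level: random train/test split that keeps class proportions."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)


def k_fold(indices, k, rng):
    """Internal level: K-fold split of the training indices only."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        pairs.append((sorted(fit), sorted(folds[i])))
    return pairs


def complete_validation_plan(labels, n_replicates, test_fraction=0.25, k=3, seed=0):
    """One entry per replicate; test indices never appear in the inner folds."""
    rng = random.Random(seed)
    plan = []
    for _ in range(n_replicates):
        train, test = stratified_split(labels, test_fraction, rng)
        plan.append({"train": train, "test": test, "inner": k_fold(train, k, rng)})
    return plan


# Colon cancer task sizes from the slides: 40 tumoral + 22 control, 400 loops.
labels = [1] * 40 + [0] * 22
plan = complete_validation_plan(labels, n_replicates=400)
```

Feature ranking and model tuning would use only the `inner` folds of each replicate, and the `test` indices would be touched once, to compute that replicate's contribution to the average test error.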
10
Tasks for BioDCV (E-RFE SVM)
  • Colon cancer: 62 samples (40 tumoral + 22
    control) described by 2000 genes (Alon et al.,
    1999)
  • Lymphoma: 96 samples (74 tumoral + 24 control)
    described by 4096 genes (Alizadeh et al., 2000)
  • Tumor vs. Metastases: 76 samples (64 primary
    tumoral + 12 metastatic) described by 16063 genes
    (Ramaswamy et al., 2001)
  • High Sezary CTCL: 30 samples (18 disease + 12
    control) described by 6660 genes (Kari et al.,
    2003), coll. Wistar Inst.
  • Glioma: 50 samples (28 glioblastoma + 22
    oligodendroglioma) described by 12 627 genes
    (Nutt et al., 2003)
  • Breast cancer: 37 samples (18 high risk + 19 low
    risk) described by 12 625 genes (Huang et al.,
    2003)
  • Mouse Model of Myocardial Infarction: 36 samples
    (18 infarcted + 18 control) described by 12488
    genes (Cardiogenomics PGA,
    http://cardiogenomics.med.harvard.edu, 2003)
11
Tasks for BioDCV (E-RFE SVM)
  • Liver cancer: 213 samples (107 tumors from liver
    cancer + 106 non tumoral/normal), 1993 genes,
    ATAC-PCR (Sese et al., 2000)
  • Breast cancer:
    Wang et al. 2005: 238 samples
    (286 lymph-node-negative), Affymetrix, 17819
    genes
    Chang et al. 2005: 295 samples (151
    lymph-node-negative, 144 pos), cDNA, 25000 genes
    IFOM: 62 BRCA (4 subclasses)
  • Pediatric Leukemia: 327 samples, 12626 genes (7
    classes, binary 284 + 43), Yeoh et al. 2002
12
1. With a Linux OpenMosix HPC facility
Our HPC resource: MpaCluster, 6 Xeon + 40 Pentium
CPUs, OpenMosix, 3 TB central storage.
Upgraded in 2005. → production in GRID
2003-2004, P3 biproc.
13
Roadmap for a new grid application
  • Starting from a suite of C modules and Perl/shell
    scripts running on a local HPC resource
  • Optimize modules and scripts
  • database management of data, of model structures,
    of system outputs, scripts for OpenMosix Linux
    Clusters
  • Wrap BioDCV into a grid application
  • Learn about grid computing
  • Port the serial version on a computational grid
    testbed
  • Analyze/verify results, identify needs/problems
  • Wrap with C MPI scripts
  • Build the MPI mechanism
  • Experiment on the testbed
  • Submit on production grid
  • Test scalability
  • Production

14
1. Optimize modules and scripts
  • Rewrite shell/Perl scripts in C language
  • control I/O costs,
  • a process granularity optimal for temporary data
    allocation without tmp files
  • convenient for migrations
  • SQLite interface (Database engine library)
  • SQLite is small, self-contained, embeddable
  • It provides a relational access to model and data
    structures (inputs, outputs, diagnostics)
  • It supports transactions and multiple
    connections, databases up to 2 terabytes in size

Local copy (db file): model definitions, a
copy of the data, indexes defining the partition
of the replicate sample(s)
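
As a rough illustration of keeping model definitions, partition indexes and outputs in a single SQLite file, the sketch below uses Python's standard `sqlite3` module. The schema, table names and column names are assumptions made for this example; the slides do not show the actual BioDCV schema, and the stored values are made up.

```python
import sqlite3

# Hypothetical schema: one table each for models, partition indexes and outputs.
SCHEMA = """
CREATE TABLE model (
    id INTEGER PRIMARY KEY,
    replicate INTEGER NOT NULL,
    n_features INTEGER NOT NULL,
    svm_c REAL NOT NULL
);
CREATE TABLE partition_index (
    replicate INTEGER NOT NULL,
    sample INTEGER NOT NULL,
    role TEXT NOT NULL CHECK (role IN ('train', 'test'))
);
CREATE TABLE output (
    model_id INTEGER NOT NULL REFERENCES model(id),
    test_error REAL NOT NULL
);
"""


def open_run_db(path=":memory:"):
    """Create the local db; ':memory:' avoids temporary files in this demo."""
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    return con


def store_replicate(con, replicate, train, test, n_features, svm_c, test_error):
    """Write one replicate in a single transaction (all rows or none)."""
    with con:  # commits on success, rolls back if an exception is raised
        con.executemany(
            "INSERT INTO partition_index VALUES (?, ?, ?)",
            [(replicate, s, "train") for s in train]
            + [(replicate, s, "test") for s in test],
        )
        cur = con.execute(
            "INSERT INTO model (replicate, n_features, svm_c) VALUES (?, ?, ?)",
            (replicate, n_features, svm_c),
        )
        con.execute("INSERT INTO output VALUES (?, ?)", (cur.lastrowid, test_error))


con = open_run_db()
store_replicate(con, 0, train=[0, 1, 2, 3], test=[4, 5],
                n_features=50, svm_c=1.0, test_error=0.5)
n_rows = con.execute("SELECT COUNT(*) FROM partition_index").fetchone()[0]
```

Wrapping each replicate in one transaction means an interrupted run leaves no half-written replicate behind in the file.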
15
BioDCV
(1) exp: experiment design through configuration
of the setup database
(2) scheduler: script submitting jobs (run) on
each available processor. Platform dependent.
(3) run: performs fractions of the complete
validation procedures on several data splits.
A local db is created.
(4) unify: the local datasets are merged with the
setup after completing the validation tasks. A
complete dataset collecting all the relevant
parameters is created.
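
The unify step can be sketched as merging several local SQLite files into one with `ATTACH DATABASE`. This is a minimal Python illustration, not the BioDCV code: the `result` table, its columns, the file names and the error values are all hypothetical.

```python
import os
import sqlite3
import tempfile

RESULT_DDL = "CREATE TABLE IF NOT EXISTS result (run_id INTEGER, replicate INTEGER, test_error REAL)"


def make_local_db(path, run_id, errors):
    """Stand-in for one 'run': writes its results to its own local db file."""
    con = sqlite3.connect(path)
    con.execute(RESULT_DDL)
    con.executemany("INSERT INTO result VALUES (?, ?, ?)",
                    [(run_id, rep, err) for rep, err in errors])
    con.commit()
    con.close()


def unify(local_paths, merged_path):
    """Copy the rows of every local db into a single merged db."""
    con = sqlite3.connect(merged_path)
    con.execute(RESULT_DDL)
    for path in local_paths:
        con.execute("ATTACH DATABASE ? AS local", (path,))
        con.execute("INSERT INTO result SELECT * FROM local.result")
        con.commit()  # a database cannot be detached inside an open transaction
        con.execute("DETACH DATABASE local")
    return con


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for run_id in range(2):
        p = os.path.join(tmp, f"run{run_id}.db")
        make_local_db(p, run_id, [(0, 0.25), (1, 0.5)])  # made-up values
        paths.append(p)
    merged = unify(paths, os.path.join(tmp, "merged.db"))
    merged_count = merged.execute("SELECT COUNT(*) FROM result").fetchone()[0]
    merged.close()
```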
16
2. Wrapping into a grid application
  • Why port to the grid?
  • Because we need enough computational
    resources
  • How to port BioDCV to the grid?
  • PRELIMINARY
  • Identify a collaborator with experience in grid
    computing (e.g. the Egrid Project hosted at ICTP,
    http://www.egrid.it)
  • Train human resources (SP → Trieste)
  • Join the Egrid testbed (installing a supernode in
    Trento)
  • HANDS-ON
  • Porting of the serial application on the testbed
  • patch code as needed: code portability is
    mandatory to make life easier
  • Identify requirements/problems

17
A few EDG definitions
User Interface (UI): machine to access the GRID
site
Storage Element (SE): stores the user data in the
grid and makes it available for subsequent
processing
Computing Element (CE): where the grid user
programs are delivered for processing; this is
usually a front-end to several elementary Worker
Node machines
Worker Node (WN): machines where the user
programs are actually executed, possibly with
multiple CPUs
18
The ICTP Egrid project infrastructures
  • The local testbed in Trieste
  • Small computational grid based on EDG middleware
    + Egrid add-ons
  • Designed for testing/training/porting of
    applications
  • Full compatibility with Grid.it middleware
  • The production infrastructure
  • A Virtual Organization within Grid.it, with its
    own services
  • Star topology with central node in Padova

Trieste: CE+SE+WN
Trento: CE+SE+WN
Padova: CE, SE 2.8 TByte, WNs 100 CPUs
19
Hands on
  • Porting the serial application
  • Easy task due to portability (no actual work
    needed)
  • No software/library dependencies
  • Testing/Evaluation
  • Problems identified
  • Job submission overhead due to EDG mechanisms
  • Managing multiple (hundreds/thousands) jobs is
    difficult and cumbersome
  • Answer: parallelize jobs on the GRID via MPI
  • Single submission
  • Multiple executions

20
3. Wrap with C MPI scripts
  • How can we use C MPI?
  • Prepare two wrappers and a unifier
  • one shell script to submit jobs (BioDCV.sh)
  • one C MPI program (Mpba-mpi)
  • one shell script to integrate results
    (BioDCV-union.sh)
  • BioDCV.sh in action
  • It copies files from and to the Storage Element
    (SE) and distributes the microarray dataset to
    all WNs.
  • It then starts the C MPI wrapper, which spawns
    several runs of the BioDCV program (optimized for
    resources)
  • When all BioDCV runs are completed, the wrapper
    copies all the results (SQLite files) from the
    WNs to the starting SE.
  • Mpba-mpi executes the BioDCV runs in parallel
  • BioDCV-union.sh collates results in one SQLite
    file (→ R)
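
The slides do not say how the MPI wrapper divides the runs among processes. One common choice, shown here as an assumption, is a block distribution in which every process derives its own share from its rank. The sketch is plain Python with a hypothetical helper `runs_for_rank`; in a C MPI program the `rank` and `size` values would come from `MPI_Comm_rank` and `MPI_Comm_size`.

```python
def runs_for_rank(rank, size, total_runs):
    """Return the contiguous block of run indices assigned to one process.

    The first (total_runs % size) ranks get one extra run, so the load
    differs by at most one run between any two processes.
    """
    base, extra = divmod(total_runs, size)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return range(start, start + count)


# Example sizes: 400 runs on 64 processes (assumed reading of the slide numbers).
assignment = [runs_for_rank(r, 64, 400) for r in range(64)]
covered = sorted(i for block in assignment for i in block)
```

With 400 runs on 64 processes, ranks 0 to 15 receive 7 runs each and the remaining ranks receive 6, and no communication is needed to agree on the split.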

21
Using BioDCV in Egrid
edg-job-submit BioDCV.jdl
UI: Egrid Live CD, a bootable Linux live-CD
distribution with a complete suite of GRID tools
by Egrid (ICTP Trieste)
Resource broker (PD-TN)
Padova: CE, SE 2.8 TByte, WNs 100 CPUs
Trento site: CE+SE+WN
Trieste: CE+SE+WN
Palermo: CE+SE+WN
. . . .
22
A Job Description (BioDCV.jdl)
Type = "Job";
JobType = "MPICH";
NodeNumber = 64;
Executable = "BioDCV.sh";
Arguments = "Mpba-mpi 64 lfn:/utenti/spaoli/sarcoma.db 400";
StdOutput = "test.out";
StdError = "test.err";
InputSandbox = {"BioDCV.sh", "Mpba-mpi", "run", "run.sh"};
OutputSandbox = {"test.err", "test.out", "executable.out"};
Requirements = other.GlueCEInfoLRMSType == "PBS" || other.GlueCEInfoLRMSType == "LSF";
23
Using BioDCV in Egrid (II)
First step (BioDCV.sh runs on WN 1)
  • Request file: BioDCV.sh copies data (sarcoma.db)
    from the SE to the WN
Second step
  • Mpba-mpi and sarcoma.db are distributed to all
    the involved WNs (WN 2, WN 3, ..., WN n)
24
Using BioDCV in Egrid (III)
Third step
  • Mpba-mpi runs on the involved WNs (WN 2, WN 3,
    ..., WN n)
  • BioDCV is executed on all involved WNs by MPI
Fourth step
  • Each WN reports "Job completed" and its output
  • BioDCV.sh copies all results (SQLite files) from
    the WNs to the starting SE
25
RUNNING ON THE TESTBED (EGRID.IT)
SCALING UP TESTS
(Figure: two panels, a. and b.)
CPUs: Intel Xeon @ 2.80 GHz
26
Usage (examples)

  • Outlier detection: complete dataset (213
    samples) → BioDCV → shaved dataset (198 samples)
  • Compare subgroups and pathological features
    (BioDCV)
27
Example of Semisupervised analysis (Sese)


28
Results
  • The pros
  • MPI execution on the GRID in a few days
  • The tests showed scalable behavior of our grid
    application for increasing numbers of CPUs
  • Grid computing significantly reduces production
    times and allows larger problems to be tackled
    (see next slide)
  • The cons
  • Data movements limit the scalability for a large
    number of CPUs
  • Note: this is a GRID.it limitation. There is no
    shared filesystem between the WNs, so each file
    needs to be copied everywhere!
  • To hide the latency (ideas)
  • Smart data distribution from MWN to WNs
  • Reduce the amount of data to be moved
  • Proportionate BioDCV subtasks to local cache
  • Data transferred via MPI communication
  • (requires MPI coding and some MPI adaptation of
    the code)
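
Why copying each file to every node limits scalability can be seen with a toy cost model. The model below is an assumption made for illustration, not a measurement from these tests: the compute work is split evenly across n CPUs, while the dataset is copied to the nodes one after another, so the copy time grows linearly with n. All numbers are made up.

```python
def wall_time(n_cpus, compute_seconds, copy_seconds_per_node):
    """Toy model: parallel compute time plus sequential per-node copy time."""
    return compute_seconds / n_cpus + copy_seconds_per_node * n_cpus


def best_cpu_count(compute_seconds, copy_seconds_per_node, max_cpus):
    """CPU count in 1..max_cpus that minimises the modelled wall time."""
    return min(
        range(1, max_cpus + 1),
        key=lambda n: wall_time(n, compute_seconds, copy_seconds_per_node),
    )


# Hypothetical numbers: 10 hours of serial compute, 10 s to copy data to one node.
optimum = best_cpu_count(36000, 10, max_cpus=100)
time_at_optimum = wall_time(optimum, 36000, 10)
```

Under these assumptions the modelled wall time is lowest at 60 CPUs (1200 s) and rises again beyond that. Reducing the amount of data moved lowers the per-node copy time and pushes the optimum towards more CPUs.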

29
MOVE no. 2: Improving the system
  • Reduce the amount of data to be moved
  • Redesign: per run,
  • SVM models (about 200) and
  • results, evaluation
  • Variables for semisupervised analysis
  • all managed within one data structure
  • A large part of the sample tracking and
    semisupervised analysis is now managed within
    BioDCV (about 2000 files, 300 MB), i.e. stored
    through SQLite.
  • Randomization of labels is fully automated
  • The SVM library is now an external library
  • Modular use of machine learning methods
  • Now adding a PDA module
  • BioDCV now under GPL (code curation)
  • Distributed at BioDCV with a Subversion server
    since September 2005

30
(OCTOBER 2005) CLUSTER AND GRID ISSUES
  • At work on several clusters
  • MPBA-old: 50 P3 CPUs, 1 GHz
  • MPBA-new: 6 Xeon CPUs, 2.8 GHz
  • ECT (BEN): up to 32 (of 100) Xeon CPUs, 2.8 GHz
  • SISSA (Cozzini): up to 32 (of 60) P4, 2 GHz,
    Myrinet
  • GRID experiences
  • Egrid production grid (INFN Padua): up to 64
    (of 100) Xeon CPUs, 2-3 GHz
  • Microarray data: Sarcoma, HS random, Morishita,
    Wang, ...
  • LESSONS LEARNED
  • the latest version reduces latencies (system
    times) due to file copying and management → CPU
    saturation
  • Life quality (and more studies): huge reduction
    of file installing and retrieving from facilities
    and WITHIN facilities
  • Forgetting the severe limitation of file system
    (AFS, ...)
  • Now installing LCG2 2.6 (CERN release, September
    2005)

31
Challenges for predictive profiling
  • INFRASTRUCTURE
  • MPACluster -> available for batch jobs
  • Connecting with IFOM -> 2005
  • Running at IFOM -> 2005/2006
  • Production on GRID resources (spring 2005)
  • ALGORITHMS II
  • Gene list fusion: suite of algebraic/statistical
    methods
  • Prediction over multi-platform gene expression
    datasets (sarcoma, breast cancer): large scale
    semi-supervised analysis
  • New SVM kernels for prediction on spectrometry
    data within complete validation

32
Challenges (AIRC-BICG)
HPC interaction: access through web front-ends to
GRID HPC
  • BASIC CLASSIFICATION MODELS, lists, additional
    tools
  • Tools for researchers: subtype discovery,
    outlier detection
  • Connection to data (DBMIAME)

34
Acknowledgments
  • ITC-irst, Trento
  • Davide Albanese
  • Giuseppe Jurman
  • Stefano Merler
  • Roberto Flor
  • Alessandro Soraruf
  • ICTP E-GRID Project, Trieste
  • Angelo Leto
  • Cristian Zoicas
  • Riccardo Murri
  • Ezio Corso
  • Alessio Terpin
  • IFOM-FIRC and INT, Milano
  • James Reid
  • Manuela Gariboldi
  • Marco A. Pierotti
  • Grants
  • BICG (AIRC)
  • Democritos
  • Data
  • ShoweLab (Wistar)
  • Cardiogenomics PGA

35
DTW based clustering

Curves were clustered with Dynamic Time Warping
as the distance, with weight configuration (1,2,1).
(Figure: dendrogram of the positive samples; vertical axis: Height, from 0.000 to 0.008; leaves are sample identifiers.)
36

Subgroup and pathological features

  • All samples have virus type B (V-B).
  • Bottom lines: incidence in the cluster subgroup
    and in all the positive samples.
  • DTW based clustering of sample tracking profiles
    of cluster subgroup.
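
A Dynamic Time Warping distance between two sample tracking curves can be sketched as follows. The reading of the weight configuration (1,2,1) as step weights for the vertical, diagonal and horizontal moves is an assumption (it matches the common symmetric step pattern), and so is the normalisation by the summed curve lengths; the slides do not spell out either. The function name and test curves are hypothetical.

```python
def dtw_distance(a, b, weights=(1.0, 2.0, 1.0)):
    """Weighted DTW between two numeric sequences.

    weights = (vertical, diagonal, horizontal) multipliers applied to the
    local cost |a[i] - b[j]| of each step; assumed meaning of "(1,2,1)".
    """
    w_v, w_d, w_h = weights
    inf = float("inf")
    n, m = len(a), len(b)
    # D[i][j] = cheapest accumulated cost aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = min(
                D[i - 1][j] + w_v * cost,      # vertical step
                D[i - 1][j - 1] + w_d * cost,  # diagonal step
                D[i][j - 1] + w_h * cost,      # horizontal step
            )
    return D[n][m] / (n + m)  # normalise by total length (assumption)


same = dtw_distance([1, 2, 3], [1, 2, 3])
shifted = dtw_distance([0, 1, 1], [0, 0, 1])
offset = dtw_distance([0, 0], [1, 1])
```

Two curves with the same shape but a shift in time get distance 0, while a constant offset of 1 gives distance 1. A matrix of such pairwise distances could then be passed to a hierarchical clustering routine to build a dendrogram.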

37
Predictive error

(Figure: ATE versus number of features (1, 5, 50, 150, 500, 1993) for the complete dataset and the shaved dataset, with confidence intervals at the .95 level; the ATE axis runs from 0 to 30.)