1. Grid Enabled High Throughput Virtual Screening Against Four Different Targets Implicated in Malaria
HealthGrid Conference, April 2007, Geneva
2. Outline
- WISDOM introduction
- Biological targets
- Resources used in WISDOM
- Production environment
- Results
- Issues
- Conclusions
3. WISDOM: Wide In Silico Docking On Malaria
- Biological goal
  - Proposition of new inhibitors for a family of proteins produced by Plasmodium
- Biomedical informatics goal
  - Deployment of in silico virtual docking on the grid
- Grid goal
  - Deployment of a CPU-consuming application generating large data flows to test grid operation and services -> data challenge
4. Objective of the WISDOM development
- Objective
  - Produce a large amount of data in a limited time, at minimal human cost, during the data challenge
- Need an optimized environment
  - Limited time, performance goals
- Need a fault-tolerant environment
  - Stress usage of the grid during the DC
- Need an automatic production environment
  - Grid APIs are not fully adapted for bulk use at large scale
5. Introduction to the disease: malaria
- 300 million people worldwide are affected
- 1-1.5 million people die every year
- Widely spread
- Caused by protozoan parasites of the genus Plasmodium
- Complex life cycle with multiple stages
6. WISDOM-II: second large-scale docking deployment against malaria

Malaria target                       | Involved in               | Biology partners
GST from Plasmodium falciparum       | Parasite detoxification   | U. of Pretoria, South Africa
DHFR from Plasmodium vivax           | Parasite DNA synthesis    | U. of Los Andes, Venezuela; U. of Modena, Italy
DHFR from Plasmodium falciparum      | Parasite DNA synthesis    | U. of Modena, Italy
Tubulin from Plasmodium/plant/mammal | Parasite cell replication | CEA, Acamba project, France
7. High Throughput Virtual Docking
- Millions of chemical compounds available in laboratories
- High Throughput Screening at 1-10 per compound: nearly impossible at this scale
- Virtual screening pipeline:
  - Chemical compounds (ZINC): 4.3 million
  - Targets (PDB): Plm, PvDHFR, PfDHFR, GST, tubulin
  - Molecular docking (FlexX): 413 CPU years, 1.738 TB of data
  - Data challenge on EGEE: 90 days on 5000 computers
  - Hits screening using assays performed on living cells
  - Leads
  - Clinical testing
  - Drug
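The scale figures quoted above can be cross-checked with back-of-the-envelope arithmetic; this sketch only uses the numbers on the slide (4.3 million compounds, 413 CPU years, 5000 machines):

```python
# Back-of-the-envelope check of the WISDOM-II scale figures.
SECONDS_PER_YEAR = 365 * 24 * 3600

compounds = 4_300_000   # ZINC subset size
cpu_years = 413         # total docking CPU time
cpus = 5000             # peak number of machines during the data challenge

total_cpu_seconds = cpu_years * SECONDS_PER_YEAR
seconds_per_compound = total_cpu_seconds / compounds       # ~3000 s, i.e. ~50 min
ideal_wall_clock_days = total_cpu_seconds / cpus / 86400   # ~30 days at 100% use

print(f"{seconds_per_compound / 60:.0f} min per docking")
print(f"{ideal_wall_clock_days:.0f} days at full utilisation")
```

At perfect utilisation of 5000 CPUs the challenge would take about a month; the 90 days actually reported is consistent with real-world grid utilisation well below 100%.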
8. Use of a production system
- Managing thousands of jobs and files is a manually labor-intensive task
  - Job preparation, submission and monitoring, output retrieval, failure identification and resolution, job resubmission
- In order to use the resources efficiently
  - The amount of transferred data impacts grid performance
  - The data must be installed on the grid
  - The database is split into subsets
  - Grid processing introduces significant delays
  - Submitted jobs must be sufficiently long to reduce the impact of this middleware overhead
- The production system provides automated and fault-tolerant job and file management
- The system requires tools providing global statistics and figures
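The trade-off between job length and middleware overhead can be made concrete. A minimal sketch (not the actual WISDOM code) of how one might size the compound sub-lists, assuming roughly 50 minutes of docking per compound (from the figures above) and a hypothetical fixed overhead of 10 minutes per grid job:

```python
import math

# Sketch: pick a sub-list size so that each grid job's useful compute time
# dwarfs the fixed middleware overhead per job. The overhead figure below
# is an illustrative assumption, not a measured WISDOM value.
def sublist_size(seconds_per_compound: float,
                 overhead_seconds: float,
                 max_overhead_fraction: float = 0.05) -> int:
    """Smallest number of compounds per job keeping scheduling overhead
    below `max_overhead_fraction` of the total job time."""
    # overhead / (overhead + n * t) <= f  =>  n >= overhead * (1 - f) / (f * t)
    n = overhead_seconds * (1 - max_overhead_fraction) / (
        max_overhead_fraction * seconds_per_compound)
    return math.ceil(n)

# ~50 min per docking, ~10 min assumed overhead per job:
print(sublist_size(3000, 600))  # -> 4 compounds minimum per job
```

In practice sub-lists would be made much larger than this minimum, so that each job runs for hours and the per-job overhead becomes negligible.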
9. Simplified grid workflow
[Workflow diagram: from the user interface, parameter settings, target structures and compound sub-lists are submitted through the Resource Broker to grid sites; each site retrieves the software and the compounds database from a Storage Element and returns results and statistics.]
- FlexX license server
  - 6000 floating licenses offered by BioSolveIT to SCAI
  - Maximum number of concurrently used licenses was 5000
10. Schema of the current WISDOM production environment
[Diagram: the WISDOM production system, driven from a User Interface, submits FlexX jobs through the WMS to CEs/WNs, checks job status and resubmits failures; structure and compounds files are staged on SEs (DMS/GridFTP); FlexX jobs obtain licenses from a FLEXlm server; output files and statistics flow back into the WISDOM DB on the HealthGrid server and are published on a web site.]
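The "checks job status / resubmits" loop in the diagram is the heart of the fault-tolerant production system. A minimal sketch of such a loop; `query_status` and `resubmit` stand in for the real WMS calls and are not actual API names:

```python
import time

# Sketch of the check-status / resubmit loop the production system automates.
def babysit(tasks, query_status, resubmit, max_retries=3, poll_seconds=300):
    """Poll each task; resubmit aborted ones up to max_retries.
    Returns the list of tasks that exhausted their retries."""
    retries = {t: 0 for t in tasks}
    pending = set(tasks)
    failed = []
    while pending:
        for task in list(pending):
            state = query_status(task)   # e.g. "Running", "Done", "Aborted"
            if state == "Done":
                pending.discard(task)
            elif state == "Aborted":
                if retries[task] < max_retries:
                    retries[task] += 1
                    resubmit(task)       # task stays pending after resubmission
                else:
                    pending.discard(task)
                    failed.append(task)
        if pending:
            time.sleep(poll_seconds)
    return failed
```

The real system additionally retrieves outputs, records per-site statistics and handles file management, but the retry bookkeeping follows this pattern.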
11. Grid infrastructures and projects contributing to WISDOM-II
- European grid projects: EMBRACE, BioinfoGrid, SHARE
- European grid infrastructure: EGEE
- Regional/national grid infrastructures: Auvergrid, EUMedGrid, EUChinaGrid, TWGrid, EELA
12. Instances on different infrastructures
[Figure: instances deployed on the different infrastructures during the WISDOM-II data challenge]
13. Deployment on different infrastructures
- Up to 5000 computers in more than 17 countries mobilized in Autumn 2006 to provide CPU
- 1.738 TB of data produced
[Figure: distribution of jobs]
14. Statistics of deployment
- First DC
  - 80 CPU years
  - 1 TB
  - 1700 CPUs used in parallel
  - July 1st - August 15th 2005
- Second DC
  - 100 CPU years
  - 800 GB
  - 1700 CPUs used in parallel
  - May 1st - April 15th 2006
- Third DC
  - 413 CPU years
  - 1.7 TB
  - Up to 5000 CPUs in parallel
  - October 1st 2006 - January 31st 2007
15. Biological results
- Distribution of the docking energies of the ZINC database against the GST A structure
- The red column marks a score of -24 kJ/mol, the docking score of a co-crystallized ligand (GTX) of the GST A chain
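Such a reference score gives a natural cut-off for hit selection: compounds predicted to bind at least as well as the co-crystallized ligand (in FlexX, a more negative score means better predicted binding). An illustrative sketch on toy data, not real WISDOM output:

```python
# Illustrative hit selection: keep compounds whose docking score is at least
# as good as (i.e. no higher than) the co-crystallized reference score.
# Scores are in kJ/mol; more negative means better predicted binding.
REFERENCE_SCORE = -24.0  # GTX ligand against the GST A chain (from the slide)

def select_hits(scores, reference=REFERENCE_SCORE):
    """Return (compound id, score) pairs at or below the reference, best first."""
    hits = [(cid, s) for cid, s in scores.items() if s <= reference]
    return sorted(hits, key=lambda pair: pair[1])

# Toy scores for illustration:
demo = {"ZINC001": -27.3, "ZINC002": -19.8, "ZINC003": -24.5}
print(select_hits(demo))  # -> [('ZINC001', -27.3), ('ZINC003', -24.5)]
```

In the actual workflow the score distribution of the whole database is inspected first, precisely as in the histogram above, before fixing a cut-off.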
16. Issues
- Scheduling efficiency of the grid is still a major issue
- The resource broker is still the main bottleneck
- This deployment also showed that naive blacklisting of failing resources is not possible: virtually all grid resources produced some aborted jobs, and any blacklisting must also take site scheduled downtimes into account
- Store and treat the data in a relational database
17. Conclusions
- Demonstrated the relevance of computational grids for life science applications
- Manual intervention is reduced
- Future steps
  - Analysis of the biological results
  - Address the resource broker issue
18. The long-term vision: a grid for malaria
- Use grid technology to foster research and development on malaria and other neglected diseases
  - Provide services to research laboratories
  - Collect and analyze epidemiological data
  - Build a chemogenomic knowledge space
19. Acknowledgments
Projects and funders: Auvergrid, Accamba, BioInfoGRID, EGEE, EMBRACE, EUChinaGRID, EUMedGRID, SHARE, TWGrid, Conseil Régional d'Auvergne, European Union
Partners: Academia Sinica, BioSolveIT, CNR-ITB, CNRS, CEA, HealthGrid, IN2P3 LPC, SCAI Fraunhofer, Università di Modena e Reggio Emilia, Université Blaise Pascal, University of Pretoria, University of Los Andes
wisdom.healthgrid.org