Grid Enabled High Throughput Virtual Screening Against Four Different Targets Implicated in Malaria

Slides: 24
Provided by: Bre9160
Learn more at: https://users.sdsc.edu
1
Grid Enabled High Throughput Virtual Screening
Against Four Different Targets Implicated in
Malaria
  • Presented by
  • Vinod Kasam

CLADE workshop, HPDC conference, June 25, 2007,
Monterey Bay
2
Outline
  • WISDOM introduction
  • Biological targets
  • Resources used in WISDOM
  • Production environment
  • Results
  • Issues
  • Conclusions

3
Introduction to the disease malaria
  • 300 million people worldwide are affected
  • 1-1.5 million people die every year
  • Widely spread
  • Caused by protozoan parasites of the genus
    Plasmodium

Complex life cycle with multiple stages
4
WISDOM-II, second large scale docking deployment
against malaria
Involved in | Biology partners | Malaria target
Parasite detoxification | U. of Pretoria, South Africa | GST from Plasmodium falciparum
Parasite DNA synthesis | U. of Los Andes, Venezuela; U. of Modena, Italy | DHFR from Plasmodium vivax
Parasite DNA synthesis | U. of Modena, Italy | DHFR from Plasmodium falciparum
Parasite cell replication | CEA, Acamba project, France | Tubulin from Plasmodium/plant/mammal
5
WISDOM: Wide In Silico Docking On Malaria
  • Biological goal
  • Proposition of new inhibitors for a family of
    proteins produced by Plasmodium
  • Biomedical informatics goal
  • Deployment of in silico virtual docking on the
    grid
  • Grid goal
  • Deployment of a CPU-consuming application
    generating large data flows to test the grid
    operation and services: a data challenge

6
High Throughput Virtual Docking
Millions of chemical compounds available
High Throughput Screening: 1-10/compound, several
hours
  • Compounds
  • ZINC: 4.3 million
  • ChemBridge: 500,000

Molecular docking (FlexX, AutoDock): 20
cents/compound, 1 minute
Data challenge on EGEE: 3 months on 2000
computers
Hits: screening using assays performed on living
cells
Leads
Targets: 3D structures in PDB, one homology model
Clinical testing
Drug
7
Objective of the WISDOM development
  • Objective
  • Dock a whole compound database in a limited time
    with a minimal human involvement during the data
    challenge.
  • Need an optimized environment
  • Production in limited time
  • Performance is important
  • Need a fault-tolerant environment
  • Stress usage of the grid during the data
    challenge (DC)
  • Grid is heterogeneous and dynamic
  • Data produced are important and cannot be easily
    reproduced
  • Need an automatic production environment
  • Grid APIs are not fully adapted for bulk use at
    a large scale
  • Ease the execution
  • User-friendly high-level services

8
Use of a production system
  • Managing thousands of jobs and files by hand is
    a labor-intensive task
  • Job preparation, submission and monitoring,
    output retrieval, failure identification and
    resolution, job resubmission
  • In order to use the resources efficiently
  • The amount of transferred data affects grid
    performance
  • The data must be installed on the grid
  • The database is split into subsets
  • Grid processing introduces significant delays
  • The submitted jobs must be sufficiently long in
    order to reduce the impact of this middleware
    overhead
  • The production system will provide automated and
    fault-tolerant job and file management
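
The trade-off between middleware overhead and job length noted on this slide can be sketched in a few lines of Python. This is an illustration only, not the WISDOM production code: the function names, the 30-minute overhead and the 10-hour target job length are assumptions made for the example; only the figure of roughly one minute per docking comes from the presentation.

```python
from typing import List, Sequence


def overhead_fraction(job_minutes: float, overhead_minutes: float) -> float:
    """Share of a job's wall-clock time spent on grid middleware overhead."""
    return overhead_minutes / (overhead_minutes + job_minutes)


def split_into_subsets(compounds: Sequence[str],
                       target_job_minutes: float,
                       minutes_per_docking: float = 1.0) -> List[List[str]]:
    """Split a compound list into subsets sized for a target job length."""
    per_job = max(1, int(target_job_minutes // minutes_per_docking))
    return [list(compounds[i:i + per_job])
            for i in range(0, len(compounds), per_job)]


# Hypothetical compound identifiers standing in for a database such as ZINC.
compounds = [f"compound_{i}" for i in range(10_000)]

# Assumed target: about 10 hours of docking per job.
subsets = split_into_subsets(compounds, target_job_minutes=600)
print(len(subsets), "jobs, first subset holds", len(subsets[0]), "compounds")

# Assumed 30 minutes of overhead per job: short jobs waste most of their time.
print("10-minute job:", round(overhead_fraction(10, 30), 2))
print("600-minute job:", round(overhead_fraction(600, 30), 2))
```

Longer jobs amortize the fixed overhead, but they also lose more work when a job aborts, so the subset size is a compromise rather than a value to maximize.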

9
Grid added value for international collaboration
on neglected and emerging diseases
  • Grids offer unprecedented opportunities for
    sharing information and resources world wide
  • Grids are unique tools for
  • Collecting and sharing information (Epidemiology,
    Genomics)
  • Networking experts
  • Mobilizing resources routinely or in emergency
    (vaccine drug discovery)

10
Grid added value of EGEE for a large scale in
silico experimentation
  • Large computing and storage resources
  • 24 hours a day availability of resources, user
    support
  • Workload Management Service
  • Information and Monitoring Services
  • Data Management Services
  • Security
  • Reliability of services

11
Simplified grid workflow
[Diagram labels: User interface, Resource Broker, Site1, Site2, Storage Element (compounds database, software); inputs: compounds list, parameter settings, target structures, compounds sub lists; outputs: results, statistics]
  • FlexX license server
  • 6000 floating licenses offered by BioSolveIT to
    SCAI
  • Maximum number of licenses in concurrent use was
    5000

12
Schema of the current WISDOM production
environment
[Diagram labels: User Interface, WISDOM production system (submits the jobs, checks job status, resubmits), WMS, CEs / WNs, SEs, DMS / GridFTP, FlexX job (structure file, compounds file, output file, statistics), FlexLM license server, local server, HealthGrid server, web site, WISDOM DB (inputs, outputs)]
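
The "checks job status, resubmits" role of the production system in this diagram can be sketched as a polling loop. This is a minimal sketch under stated assumptions, not the actual WISDOM code: `submit` and `status` are hypothetical callables standing in for whatever grid middleware commands are used, and the status strings and the limit of three attempts are invented for the example.

```python
import time
from typing import Callable, Dict, Iterable, Set, Tuple


def run_with_resubmission(job_ids: Iterable[str],
                          submit: Callable[[str], None],
                          status: Callable[[str], str],
                          max_attempts: int = 3,
                          poll_seconds: float = 0.0
                          ) -> Tuple[Set[str], Set[str], Dict[str, int]]:
    """Submit jobs, poll their status, and resubmit aborted ones."""
    attempts: Dict[str, int] = {}
    pending: Set[str] = set()
    done: Set[str] = set()
    failed: Set[str] = set()

    for job_id in job_ids:
        submit(job_id)
        attempts[job_id] = 1
        pending.add(job_id)

    while pending:
        for job_id in list(pending):
            state = status(job_id)  # assumed values: running, done, aborted
            if state == "done":
                pending.remove(job_id)
                done.add(job_id)
            elif state == "aborted":
                if attempts[job_id] < max_attempts:
                    attempts[job_id] += 1
                    submit(job_id)  # automatic resubmission
                else:
                    pending.remove(job_id)
                    failed.add(job_id)  # give up, report to the operator
        time.sleep(poll_seconds)
    return done, failed, attempts


# Simulated grid: each job has a scripted sequence of status answers.
outcomes = {"a": ["done"],
            "b": ["aborted", "done"],
            "c": ["aborted", "aborted", "aborted"]}


def fake_submit(job_id: str) -> None:
    pass  # a real system would hand the job to the workload manager


def fake_status(job_id: str) -> str:
    return outcomes[job_id].pop(0)


done, failed, attempts = run_with_resubmission(["a", "b", "c"],
                                               fake_submit, fake_status)
print(sorted(done), sorted(failed), attempts)
```

Passing the middleware calls in as functions keeps the loop testable without a grid and avoids tying the sketch to any particular command-line tool.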
13
Grid infrastructures and projects contributing to
WISDOM-II
EMBRACE, BioinfoGrid, SHARE, EGEE, Auvergrid, EUMedGrid, EUChinaGrid, TWGrid, EELA
Legend: European grid project, European grid infrastructure, regional/national grid infrastructure
14
Instances on different infrastructures
Instances deployed on the different
infrastructures during the WISDOM-II data
challenge
15
Deployment on different infrastructures
  • Up to 5000 computers in more than 17 countries
    mobilized from October 2006 to January 2007 to
    provide CPU
  • 1.738 TB of data produced

Distribution of jobs
16
Statistics of deployment
  • First DC
  • 80 CPU years
  • 1 TB
  • 1700 CPUs used in parallel
  • July 1st - August 15th 2005
  • 2nd DC
  • 100 CPU years
  • 800 GB
  • 1700 CPUs used in parallel
  • May 1st -April 15th 2006
  • 3rd DC
  • 413 CPU years
  • 1.7 TB
  • Up to 5000 CPUs in parallel
  • 1st October 2006 - 31 January 2007

Number of jobs: 77,504
Total number of completed dockings: 156,407,400
Estimated duration on 1 CPU: 413 years
Duration of the experiment: 76 days
Average throughput: 78,400 dockings/hour
Maximum number of loaded licences (concurrent running jobs): 5,000
Number of used computing elements: 98
Average duration of a job: 41 hours
Average crunching factor: 1,986
Volume of output results: 1.738 TB
The crunching factor is the ratio of the total
CPU time over the duration of the experiment. It
represents the average number of CPUs used
simultaneously all along the data challenge and
is a metric of the parallelization gain.
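
The crunching factor definition can be checked against the slide's own figures. The sketch below is illustrative: the function name and the 365.25-day year are assumptions, while the inputs (413 CPU years, 76 days) are taken from the statistics above.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years


def crunching_factor(cpu_years: float, duration_days: float) -> float:
    """Total CPU time divided by the wall-clock duration of the experiment."""
    cpu_hours = cpu_years * DAYS_PER_YEAR * HOURS_PER_DAY
    wall_hours = duration_days * HOURS_PER_DAY
    return cpu_hours / wall_hours


cf = crunching_factor(cpu_years=413, duration_days=76)
print(f"Average number of CPUs in simultaneous use: {cf:.0f}")
```

The result comes out within a fraction of a percent of the reported 1,986; a gap of that size is what one would expect when the CPU time and the duration are themselves rounded.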
17
Biological results
  • The distribution of docking energies of the ZINC
    database against the GST A structure.
  • The red column represents a score of -24 kJ/mol,
    the docking score of a co-crystallized ligand
    (GTX) of the GST A chain.

18
Issues
  • Scheduling efficiency of the grid is still a
    major issue
  • The resource broker is still the main bottleneck
  • This deployment also shows that naive
    blacklisting of failing resources is not
    possible, because virtually all grid resources
    have produced aborted jobs; blacklisting should
    also take scheduled site downtimes into account.
  • Store and process the data in a relational
    database
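
The blacklisting problem raised on this slide can be illustrated with a small sketch. It contrasts a naive rule (exclude any site that ever aborted a job) with a rule based on abort rate and scheduled downtime. Everything here is hypothetical: the `SiteRecord` structure, the site names, the 50% threshold and the minimum sample of 20 jobs are assumptions for the example, not values from the WISDOM deployment.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class SiteRecord:
    """Per-site job statistics and announced downtime windows."""
    name: str
    submitted: int = 0
    aborted: int = 0
    downtimes: List[Tuple[datetime, datetime]] = field(default_factory=list)

    def in_downtime(self, when: datetime) -> bool:
        return any(start <= when < end for start, end in self.downtimes)

    def abort_rate(self) -> float:
        return self.aborted / self.submitted if self.submitted else 0.0


def usable(site: SiteRecord, when: datetime,
           max_abort_rate: float = 0.5, min_jobs: int = 20) -> bool:
    """Accept a site unless it is down or its abort rate is too high."""
    if site.in_downtime(when):
        return False
    if site.submitted < min_jobs:
        return True  # too little evidence to judge the site
    return site.abort_rate() <= max_abort_rate


sites = [
    SiteRecord("site_a", submitted=100, aborted=5),
    SiteRecord("site_b", submitted=100, aborted=80),
    SiteRecord("site_c", submitted=100, aborted=2,
               downtimes=[(datetime(2006, 11, 1), datetime(2006, 11, 3))]),
]
now = datetime(2006, 11, 2)

# Naive rule: every site has aborted at least one job, so none is left.
naive = [s.name for s in sites if s.aborted == 0]
selected = [s.name for s in sites if usable(s, now)]
print("naive:", naive)
print("rate and downtime aware:", selected)
```

Because downtime is checked separately from the abort rate, a reliable site is skipped only while its maintenance window is open and becomes eligible again afterwards.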

19
Interactive Web Portal
  • User Friendly Interface for biologists
  • Real Time output of the results
  • 3D views of the docking poses and structures
  • Resubmission of docking jobs

20
Conclusion
  • Take advantage of the EGEE services, APIs and
    resources.
  • Demonstrated the relevance of computational grids
    in life science applications
  • Manual intervention is reduced (automatic
    resubmission of jobs)
  • Use of AMGA to store results and statistics
    immediately.
  • Interoperable Web Service Interface
  • WSDL following the WS-I profile
  • Improved flexibility to deploy other
    bioinformatics applications.

21
The next steps
  • To address the issue of resource brokers, we are
    trying to submit the jobs by bypassing resource
    brokers
  • Docking step still requires a lot of manual
    intervention
  • Task: improve output data collection and
    post-docking analysis
  • The next step after docking is Molecular Dynamics
  • Task: deploy Molecular Dynamics computations on
    grid infrastructures (successfully deployed
    already on one target, plasmepsin)
  • Contribution from CNRS-IN2P3, within the
    framework of BioinfoGRID, Fraunhofer SCAI and
    University of Modena
  • Beyond virtual screening, the long-term vision:
    building a grid for malaria
  • To provide services to research labs working on
    malaria
  • To collect and analyze epidemiological data

22
Long-term vision: a grid for malaria
LPC Clermont-Ferrand: Biomedical grid
SCAI Fraunhofer: Knowledge extraction, Chemoinformatics
CEA, Acamba project: Biological targets, Chemogenomics
Univ. Modena: Biological targets, Molecular Dynamics
HealthGrid: Biomedical grid, Dissemination
ITB CNR: Bioinformatics, Molecular modelling
Academia Sinica: Grid user interface
Univ. Los Andes: Biological targets, Malaria biology
Univ. Pretoria: Bioinformatics, Malaria biology
Use grid technology to foster research and
development on malaria and other neglected
diseases
Contacts also established with WHO, Microsoft,
TATRC, Argonne, SDSC, SERONO, NOVARTIS,
Sanofi-Aventis, and hospitals in sub-Saharan Africa
23
Acknowledgments
Auvergrid, Accamba, BioInfoGRID, EGEE, EMBRACE,
EUChinaGRID, EUMedGRID, SHARE, TWGrid, Conseil
Régional d'Auvergne, European Union
Academia Sinica, BioSolveIT, CNR-ITB, CNRS, CEA,
HealthGrid, IN2P3, LPC, SCAI Fraunhofer, Università
di Modena e Reggio Emilia, Université Blaise Pascal,
University of Pretoria, University of Los Andes
wisdom.healthgrid.org