Grid enabled e-Research in - PowerPoint PPT Presentation

About This Presentation
Title:

Grid enabled e-Research in

Description:

Grid enabled e-Research in the Life Sciences Prof Richard Sinnott Technical Director National e-Science Centre Anthony Stell University of Glasgow, Scotland, UK – PowerPoint PPT presentation

Date added: 27 October 2019
Slides: 51
Provided by: nescAcUkt
Learn more at: http://www.nesc.ac.uk
Transcript and Presenter's Notes



1
Grid enabled e-Research in the Life Sciences
Prof Richard Sinnott, Technical Director, National e-Science Centre
Anthony Stell
University of Glasgow, Scotland, UK
26th October 2006
2
Life Sciences
  • Some of the Big Questions
  • How does a cell/brain work?
  • Which genes/pathways are involved in which
    diseases and can we develop drugs to target them?
  • Why do people who eat less tend to live longer?
  • Is this drug effective (for these individuals)?
  • How important are genetic / social /
    environmental factors to specific diseases?
  • how clinically significant is the consumption of
    deep fried Mars Bar and Pizza Crunch in Scotland?

3
Life Science Grids
  • Extensive Research Community
  • > 4000 at Glasgow
  • Extensive Applications
  • Many people care about them
  • Health, Food, Environment
  • Interacts with virtually every discipline
  • Physics, Chemistry, Maths/Stats,
    Nano-engineering,
  • MANY databases relevant to bioinformatics (and
    growing!)
  • Heterogeneity, Interdependence, Complexity,
    Change,

4
Database Growth

PDB Content Growth
Yesterday EMBL Database contained 147,881,486,173
nucleotides in 81,229,974 entries.
[Chart legend: Homo sapiens, Mus musculus, Rattus norvegicus, Bos taurus, Pan troglodytes, Canis familiaris, Monodelphis domestica, Macaca mulatta, Danio rerio, Aedes aegypti, Other]
  • DBs growing exponentially!!!
  • Bibliographic (MedLine, PubMed)
  • Protein Seq (UniProt, ...)
  • 3D Molecular Structure (PDB, ...)
  • Nucleotide Seq (GenBank, EMBL)
  • Pathways (KEGG, WIT)
  • Molecular Classifications (SCOP, ...)
  • Motif Libraries (PROSITE, Blocks, ...)

5
More genomes ...
Arabidopsis thaliana
Buchnera sp. APS
Mycobacterium tuberculosis
Borrelia burgdorferi
Archaeoglobus fulgidus
Aquifex aeolicus
Yersinia pestis
Vibrio cholerae
Caenorhabditis elegans
Drosophila melanogaster
Campylobacter jejuni
Chlamydia pneumoniae
Escherichia coli
Thermoplasma acidophilum
Neisseria meningitidis Z2491
Mycobacterium leprae
Plasmodium falciparum
Helicobacter pylori
Pseudomonas aeruginosa
Ureaplasma urealyticum
mouse
Bacillus subtilis
Xylella fastidiosa
Thermotoga maritima
Rickettsia prowazekii
Saccharomyces cerevisiae
Salmonella enterica
rat
6
Distributed and Heterogeneous data
Structure
Function
Sequence
LPSYVDWRSAGAVVDIKSQG ECGGCWAFSAIATVEGINKI
TSGSLISLSEQELIDCGRTQQD NTRGCDGGYI TDGFQFIIND
GGINTEENYPYTAQDGDCDV AGGTATAGCGCGCGCGATATATA AA
ATGTACGTACGGGCCCTTATA CGCGCGCGATATATAGCGCGCG
Morphology
Gene expression
Pathways
7
Translational Research
Just one example!
8
Systems-Biology
Tissues
Cell
Protein functions
Organs
Protein Structures
Organisms
Gene expressions
Physiology
Populations
Nucleotide structures
Cell signalling
Nucleotide sequences
Protein-protein interaction (pathways)
9
Is Grid the Answer?
  • Key problems to be addressed
  • Tools that simplify access to and usage of data
  • Internet hopping is not ideal!
  • Tools that simplify access to and usage of large
    scale HPC facilities
  • qsub [-a date_time] [-A account_string] [-c
    interval] [-C directive_prefix] [-e path] [-h]
    [-I] [-j join] [-k keep] [-l resource_list] [-m
    mail_options] [-M user_list] [-N name] [-o path]
    [-p priority] [-q destination] [-r c] [-S
    path_list] [-u user_list] [-v variable_list]
    [-V] [-W additional_attributes] [-z] [script]

10
Is Grid the Answer ctd?
  • Key problems ctd
  • Tools designed to aid understanding of complex
    data sets and relationships between them
  • e.g. through visualisation
  • Support different kinds of collaborative research
  • break down the silos
  • be multi-discipline
  • support the research process
  • Provide access to many more computational
    resources
  • to expedite scientific process (or to make it
    feasible!)

11
Access to and Usage of Data
  • Grid technology should make it possible to
  • hide heterogeneity,
  • deal with location transparency,
  • address security concerns,
  • support data provenance
  • Data Access and Integration Specification (DAIS)
    being defined by GGF
  • OGSA-DAI/DAIT projects play a key role in shaping
    these standards
  • Other commercial solutions
  • IBM Information Integrator, SRS,

12
Access to and Usage of HPC facilities
  • Consider whole genome-genome comparisons between
    two species  
  • Current strategy essentially chops up one genome
    and fires searches for those fragments in the
    other then re-assembles results  
  • messy approximate matching - re-assembly
    difficult
  • important correlations can be lost
  • to make this tractable, so-called junk DNA is ignored
  • chopping may introduce artefacts or hide phenomena
  • Better to put both full genomes in memory and
    perform a useful complete comparison
  • Only possible with very high-end machines
    (available via grids)
  • Users should not have to be a script writer or
    Linux sys-admin to use these facilities
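The chop-and-search strategy described on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: exact substring matching stands in for the approximate matching a real search tool would perform, and the function names, fragment size and step are hypothetical choices, not anything taken from the talk.

```python
from typing import Iterator, List, Tuple


def fragments(sequence: str, size: int, step: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, fragment) windows of `size` bases, one every `step` bases."""
    for start in range(0, len(sequence) - size + 1, step):
        yield start, sequence[start:start + size]


def find_fragment_hits(query_genome: str, target_genome: str,
                       size: int = 8, step: int = 4) -> List[Tuple[int, int]]:
    """Chop the query genome into fragments and search for each in the target.

    Returns (query_offset, target_offset) pairs. Exact matching is an
    assumption made to keep the sketch short.
    """
    hits = []
    for offset, frag in fragments(query_genome, size, step):
        pos = target_genome.find(frag)
        while pos != -1:
            hits.append((offset, pos))
            pos = target_genome.find(frag, pos + 1)
    return hits


if __name__ == "__main__":
    # Toy sequences, far smaller than any real genome.
    print(find_fragment_hits("ACGTACGGTT", "TTACGTACGGAA", size=4, step=4))
```

Even in this toy form, the hits come back as disconnected pairs that still have to be re-assembled, and any match that straddles a fragment boundary is missed, which is the kind of artefact the slide warns about.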

13
Cognitive aspects of Data
they are!!!
  • Life science data can be ugly
  • Raw data sets messy
  • Requires significant effort to understand
  • Schemas/data models evolving
  • Tools needed to
  • Simplify understanding
  • Improve analysis
  • Navigate through potentially huge data sets
  • e.g. to find genes of interest in chromosomes of
    different species,

14
Collaborative Aspects
  • Should provide tools that automate the way
    researchers wish to work
  • User driven workflows
  • Linking compute and data resources on the fly
  • Where is the best place to submit these jobs
    right now?
  • MyGrid workbench gaining widespread acceptance
  • 20,000 downloads
  • 3000 bio-services

15
Collaborative Aspects ctd
  • Break down the silos and multi-disciplinary
  • We are all looking at possible genetic factors in
    cancer, mental health, cardiovascular
  • so we should co-ordinate our efforts and share
    data, knowledge,
  • Has anyone generated results like these?
  • Can I see them now
  • rather than waiting 2 years for the Nature /
    Science publication
  • I need input from a physicist, chemist, a
    statistician, a
  • to explain this,
  • to process these results,
  • to simulate this phenomenon,
  • to verify these results

16
Tissues
Cell
Protein functions
Organs
Protein Structures
Physiology
Organisms
Gene expressions
Populations
Nucleotide structures
Cell signalling
Nucleotide sequences
Protein-protein interaction (pathways)
17
BRIDGES Project
18
Bridges Portal
19
MagnaVista
www.nesc.ac.uk
20
MagnaVista
21
GeneVista
22
Grid Blast Interface
  • Allows genome scale blasting
  • Transparently uses NGS, ScotGrid, other GU
    clusters, Condor pools
  • Many databases already deployed across nodes
  • No user certificates
  • Fine grained security at
  • back-end

27
Grid Enabled Microarray Expression Profile Search
(GEMEPS)
  • 1 year BBSRC project started 1st March 2006
  • Involves Glasgow, Cornell University, US, Riken
    Institute, Japan
  • Aim to provide tools for discovery, comparison
    and analysis of microarray data sets
  • How does my data compare to others?
  • Species, disease, platform, results,
  • How do these experiments compare?
  • Can we improve the way we establish how genes in
    different species are linked?
  • Requires data access, integration and move
    towards data mining
  • Built upon fine grained security
  • Microarrays expensive and contain potentially
    important (valuable) data sets

28
Experiences
  • Currently exploring microarray data sets in
    detail
  • GEO, ArrayExpress, local in-house microarray
    storage solutions at Riken, Cornell, SHWFGF
  • Investigating/Grid enabling CellMontage software
    (http://cellmontage.cbrc.jp/)
  • system for searching gene expression databases
    for cells or tissues similar to a query gene
    expression profile
  • similarity of two profiles computed by comparing
    the order of genes ranked by expression (Spearman
    Rank)
  • simple measure but sufficient to characterize
    cell types across different microarray platforms
  • gene sets/expression value ranges differ between
    platforms, making direct comparison
    difficult/impossible
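The rank-based similarity described above can be sketched as follows. This is a minimal, self-contained illustration of Spearman rank correlation, not the CellMontage implementation: the function names and the choice to compare only genes present in both profiles are assumptions made for the example.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(profile_a: Dict[str, float], profile_b: Dict[str, float]) -> float:
    """Spearman rank correlation over the genes present in both profiles."""
    genes = sorted(set(profile_a) & set(profile_b))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    ra = ranks([profile_a[g] for g in genes])
    rb = ranks([profile_b[g] for g in genes])
    mean_a = sum(ra) / len(ra)
    mean_b = sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        raise ValueError("constant profile: rank correlation is undefined")
    return cov / (var_a * var_b) ** 0.5


if __name__ == "__main__":
    # Hypothetical profiles from two platforms with very different value ranges.
    platform_x = {"geneA": 10.0, "geneB": 200.0, "geneC": 3000.0}
    platform_y = {"geneA": 0.1, "geneB": 0.5, "geneC": 0.9, "geneD": 2.0}
    print(spearman(platform_x, platform_y))
```

Because only the ordering of genes matters, the two toy profiles compare cleanly even though their expression values are on different scales, which is the property the slide relies on for cross-platform comparison.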

29
Microarray Data Resources
  • Various standards and interoperability issues
  • MIAME
  • MAGE-ML
  • MINiML
  • SOFTtext
  • SOFTmatrix
  • What's in a name?
  • Gene names, probe names, platforms, species names
    in experiments,
  • Life Science Identifiers

30
Grid Enabled Microarray Expression Profile Search
(GEMEPS)
31
Overview of VOTES
  • Grids
  • Compute and Data Grids
  • Accessing Grids
  • Grid Security
  • Clinical Trials and VOTES
  • Clinical Trials
  • VOTES Goals
  • Security Issues
  • Classification Issues
  • Project so far
  • Implementation and Technologies
  • Conclusion
  • Application to life sciences
  • The main challenge: the human one

32
Grids: what are they?
  • Use existing resources to solve large-scale
    compute or data problems more efficiently
  • Rather than throwing money at hardware solutions
  • Develop applications that intelligently use
    available resources whilst maintaining security
    between all parties involved
  • Compute grids
  • Aggregation of CPU cycles and storage for better
    performance
  • Data grids
  • Enhance quality and value of distributed
    information
  • Virtual Organisations
  • Where parties share data and resources but in a
    limited sense, so who can access what must be
    strictly controlled.

33
Soundbites
  • "Next generation of the Internet" (various)
  • Knitting together network infrastructures
  • "Internet on steroids" (me, at social events)
  • Getting better performance with what you've got
  • "More bang for your buck" (BBC Magazine)
  • Do the same as could be achieved with lots of
    hardware, but doing it more efficiently
  • "Co-ordinated resource sharing and problem
    solving in dynamic, multi-institutional virtual
    organizations" (Anatomy of the Grid, Ian
    Foster)

34
Accessing Grids
  • Want an open, usable interface to access grid
    applications
  • As intuitive and easy to use as browsers to
    access applications on the Internet
  • Portal technology is one possible way forward
  • Developed as stateful web applications.
  • Communicate to middleware solutions which do
    their magic to allocate and use underlying grid
    infrastructures.

35
Grid Security - 1
  • Security is often classified as AAA
  • Authentication
  • Who are you?
  • Authorization
  • What are you allowed to do?
  • Accounting
  • Where were you on the night of?
  • But there are other aspects to be considered
  • Anonymisation
  • Confidentiality
  • Non-repudiation

36
Grid Security - 2
  • Not just server checking on client, but vice
    versa
  • Because a server might be a client to another
    process
  • PKI: digital keys/certificates for
    authentication
  • Clever mathematics provide useful encryption and
    signature tools
  • The Code Book by Simon Singh
  • Proxy certificates
  • Pushing your credentials further down a path of
    trust than your immediate neighbour
  • Delegation of trust

37
Grid Technologies
  • Range of middleware solutions
  • Globus Toolkit
  • Open Middleware Infrastructure Institute (OMII)
  • Shibboleth
  • GridSphere
  • Not mature
  • Difficult to implement
  • No clear leaders
  • Our job is to pick, choose and develop

38
Clinical Trials
  • Research studies into new drugs, medical devices
    or other interventions on patients in
    scientifically-controlled environment.
  • Required for regulatory authority approval of new
    therapies.
  • Generally speaking they help improve quality of
    life.

39
VOTES
  • Virtual Organisations for Trials and
    Epidemiological Studies
  • 3 year (2.8 million) MRC funded project started
    in October 2005
  • Collaboration between various UK universities
  • Glasgow, Oxford, Nottingham/Leicester,
    Manchester, Imperial College London
  • Focuses on three key areas of clinical trials
  • Patient Recruitment
  • Data Collection
  • Study Management

40
Key Areas
  • Patient Recruitment
  • How many men aged between 45 and 65 had a heart
    attack last year? How many of them would be
    willing to participate in the trial of a new
    drug?
  • Data Collection
  • Are the participants taking their drug/placebo on
    a regular basis? Have there been any incidents
    relating to the trial?
  • Study Management
  • Who can see the trial data (e.g. consultants,
    nurses)? Who ensures the trial is in the
    patients' interest? Can we simplify the ethical
    review process?

41
Data Grids
  • Falls into the remit of the Information Grid,
  • which "provides a way for information resources
    to be joined with related information resources
    to greater exploit the value of the inherent
    relationships among information, then for new
    connections to be made as situations change"
    (Grid Computing with Oracle, technical white
    paper, 2005)
  • Two main challenges
  • Security
  • Data classification
  • And we want to plug in to the existing NHS IT
    infrastructure

42
Additional Clinical Security
  • Anonymisation
  • De-identifying data
  • Only interested in the statistical data, so we
    don't need to know the patient's identity
  • So the identifying data is encrypted
  • Statistical Inference
  • Joining two bits of seemingly innocuous data
    can result in identification
  • E.g. an unusual condition in a particular
    postcode
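The two concerns on this slide can be illustrated with a short sketch. It is not the VOTES implementation. The slide says identifying data is encrypted; a keyed hash (HMAC-SHA256) is swapped in here as a simpler de-identification stand-in. The field names, the key and the threshold `k` are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash (a stand-in for encryption)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def risky_groups(records: List[Dict[str, str]],
                 quasi_identifiers: Sequence[str],
                 k: int = 5) -> Dict[Tuple[str, ...], int]:
    """Return attribute combinations shared by fewer than `k` records.

    Small groups are where joining innocuous fields can single someone out,
    e.g. an unusual condition in a particular postcode.
    """
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return {combo: n for combo, n in counts.items() if n < k}


if __name__ == "__main__":
    # Entirely made-up records for illustration.
    records = [
        {"postcode": "G12", "condition": "asthma"},
        {"postcode": "G12", "condition": "asthma"},
        {"postcode": "G12", "condition": "asthma"},
        {"postcode": "G12", "condition": "rare-condition"},
    ]
    print(risky_groups(records, ["postcode", "condition"], k=2))
    print(pseudonymise("patient-0001", b"demo-key-not-for-real-use"))
```

A check like `risky_groups` could be run on a query result before release, so that combinations matching very few patients are suppressed or coarsened rather than returned as-is.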

43
Data Classification
  • Main problem here is one of language and
    definition across domains.
  • Solutions proposed include
  • Global schema
  • Essentially an overall description of data that
    all parties must subscribe to.
  • Ontology
  • Methods of translating the idiosyncratic
    description to a common description used by all
    parties.
  • No clear solution to this yet
  • Current method is to join distributed databases
    on CHI number (Community Health Index).
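Joining distributed records on a shared CHI number can be sketched as a simple hash join. The record layout, the `chi` field name and the sample values are assumptions made for illustration. In the project itself the sources are remote databases, not in-memory lists.

```python
from typing import Dict, List

Record = Dict[str, str]


def join_on_chi(left: List[Record], right: List[Record]) -> List[Record]:
    """Inner-join two record sets on their shared "chi" field.

    Where both sides carry the same field name, the left-hand value wins.
    """
    right_by_chi: Dict[str, List[Record]] = {}
    for row in right:
        right_by_chi.setdefault(row["chi"], []).append(row)

    joined = []
    for row in left:
        for match in right_by_chi.get(row["chi"], []):
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    # Placeholder CHI values and fields, not real patient data.
    gp_records = [{"chi": "CHI-0001", "age": "52"}, {"chi": "CHI-0002", "age": "61"}]
    lab_records = [{"chi": "CHI-0001", "cholesterol": "6.1"}]
    print(join_on_chi(gp_records, lab_records))
```

The join works only because both sources happen to share one identifier. It does nothing about differing field names or vocabularies, which is the open classification problem the slide describes.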

44
VOTES Portal Overview
  • Developed on local test-bed of distributed
    servers and databases
  • Users log in and are assigned privileges based on role
  • Select clinical trial
  • Select parameters to view and apply conditions
    (if desired)
  • Results of this query are brought back from the
    databases distributed over the test-bed (or VO if
    you will), joined and presented as a unified
    resource.
  • Demo available at break

45
VOTES Portal Snapshots
46
Architecture
47
Technologies
  • Technologies
  • GridSphere (2.1)
  • Globus Toolkit (4.0)
  • OGSA-DAI (2.2)
  • Security Framework
  • Database user management (Resource-level)
  • Local restrictions on local resources
  • Access Control matrix (VO-level)
  • A bit-wise privilege matrix that will be
    available to the whole VO
  • Representative NHS Databases
  • GPASS
  • SCI Store
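A bit-wise privilege matrix of the kind mentioned above might look roughly like this. The roles, resources and privilege bits are hypothetical, chosen only to show the mechanism, and do not describe the actual VOTES matrix.

```python
from enum import IntFlag
from typing import Dict


class Privilege(IntFlag):
    """Each privilege occupies one bit, so a set of privileges is one integer."""
    NONE = 0
    READ_DEMOGRAPHICS = 1
    READ_CLINICAL = 2
    READ_IDENTIFIERS = 4


# Rows are roles, columns are resources, each cell is a privilege bitmask.
ACCESS_MATRIX: Dict[str, Dict[str, Privilege]] = {
    "nurse": {"trial_a": Privilege.READ_DEMOGRAPHICS},
    "consultant": {
        "trial_a": Privilege.READ_DEMOGRAPHICS | Privilege.READ_CLINICAL,
    },
}


def is_allowed(role: str, resource: str, needed: Privilege) -> bool:
    """True only if every bit in `needed` is set for this role and resource."""
    granted = ACCESS_MATRIX.get(role, {}).get(resource, Privilege.NONE)
    return (granted & needed) == needed


if __name__ == "__main__":
    print(is_allowed("consultant", "trial_a", Privilege.READ_CLINICAL))
    print(is_allowed("nurse", "trial_a", Privilege.READ_CLINICAL))
```

Storing each cell as a bitmask keeps the VO-wide matrix compact and makes a check a single AND operation. Local database user management can still impose its own restrictions underneath.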

48
Conclusions
  • Grid Computing is a challenging field
  • We provide one possible solution to applying
    the technology paradigms to clinical trials and
    studies.
  • And it is hopefully a worthwhile effort, as it
    potentially brings
  • Efficient use of distributed resources and data.
  • Enhanced analysis and understanding of said data.
  • Closer collaboration between participants.
  • Peace, prosperity and general happiness to
    human-kind...
  • Maybe

49
The main challenge
  • is the human one.
  • Encouraging technological uptake
  • Challenging techno-phobic attitudes

50
Further Information
  • Website: http://www.nesc.ac.uk/hub/projects/votes
  • Portal: http://labpc-12.nesc.gla.ac.uk:18080/gridsphere
  • Contact
  • Prof. Richard Sinnott r.sinnott_at_nesc.gla.ac.uk
  • Anthony Stell a.stell_at_nesc.gla.ac.uk
  • Oluwafemi Ajayi o.ajayi_at_nesc.gla.ac.uk