Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases

1
Using Metacomputing Tools to Facilitate Large
Scale Analyses of Biological Databases
  • Vinay D. Shet
  • CMSC 838 Presentation
  • Authors: Allison Waugh, Glenn A. Williams, Liping
    Wei, and Russ B. Altman

2
Motivation
  • Biological databases are growing at a very high
    rate
  • Protein Data Bank (PDB) increased from 5811
    entries to 12110 in three years
  • Computational tools required to efficiently
    access and analyze this data
  • Typical data analyses:
  • Linear scans across database looking for
    something
  • all-versus-all comparisons within database
  • High performance distributed computing resources
    can play important role in these analyses
  • Authors use a distributed computing environment,
    LEGION, to enable large scale analysis on PDB

3
Motivation
  • Similar to the evaluation in the threaded BLAST
    project
  • We run threaded BLAST on a Sun SMP with 24
    processors
  • Authors run program called FEATURE over LEGION
    framework
  • Can access hundreds of CPUs worldwide
  • Can spawn sequential versions of FEATURE on all
    of them
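
The "spawn sequential versions" pattern on this slide is a task farm: each PDB entry is scanned independently, so entries can be handed to whichever worker is free. A minimal Python sketch follows. It assumes a hypothetical `scan_entry` function standing in for one sequential FEATURE run, uses a local thread pool purely for illustration, and does not reproduce LEGION's actual interface, which the slides do not show. The entry IDs are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_entry(pdb_id: str) -> tuple[str, int]:
    """Hypothetical stand-in for one sequential FEATURE run on one entry.

    A real run would read the structure file and return scored sites;
    here a dummy number is returned so the sketch is runnable.
    """
    return pdb_id, len(pdb_id)


def scan_database(pdb_ids: list[str], n_workers: int) -> dict[str, int]:
    """Farm independent per-entry scans out to n_workers workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves input order; each call is independent of the others.
        return dict(pool.map(scan_entry, pdb_ids))


if __name__ == "__main__":
    # Placeholder entry IDs, not taken from the slides.
    print(scan_database(["1abc", "2xyz", "3def"], n_workers=2))
```

Because no entry depends on another, the only coordination needed is handing out work and collecting results, which is what makes this kind of scan a natural fit for a metacomputing environment.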

4
Talk Overview
  • Overview of talk
  • Motivation
  • Background
  • LEGION
  • FEATURE
  • Methods
  • Experiments
  • Results
  • Discussions
  • Related work
  • Observations

5
Background
  • LEGION (Worldwide Virtual Computer)
  • Metacomputing environment comprised of
    geographically distributed, heterogeneous
    collections of workstations and supercomputers
  • Connects resources to make up a single,
    worldwide, virtual computer
  • Coordinates a large number of parallel jobs on a
    mixture of processors (SMPs, MPPs, PCs) on any
    network
  • Legion provides the software infrastructure so
    that a system of heterogeneous, geographically
    distributed, high performance machines can
    interact seamlessly.
  • No manual installation of binaries over multiple
    platforms (LEGION does it automatically)

6
Background
  • LEGION
  • LAM - MPI implementation for workstation clusters
  • Legion supports transparent scheduling, data
    management, fault tolerance, site autonomy, a
    single file name space, efficient scheduling,
    comprehensive resource management, and a wide
    range of security options.

7
Background
  • FEATURE
  • Site characterization and recognition system
  • Site is a microenvironment distinguished by some
    structural or functional role
  • Identifies functional or structural sites of
    interest in query protein

8
Background
  • FEATURE
  • Measures spatial distributions of chemical and
    physical properties to create statistical model
    of microenvironment
  • Compares regions of query protein with known
    sites and control non-sites and assigns scores
    indicating likelihood of region being site
  • Produces a list of potential site locations with
    corresponding scores
  • Has been used to recognize ion, ligand and enzyme
    binding sites
  • FEATURE is a typical data-driven algorithm
    requiring large data storage and efficient data
    analysis
  • Requires 12 hours on a single processor to
    evaluate 580 non-redundant PDB entries
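
One common way to turn site versus non-site statistics into a score is a log-odds sum over properties. The slides do not spell out FEATURE's exact scoring function, so the sketch below is an illustration of the general idea only. The property names and probabilities in `MODEL` are invented for the example.

```python
import math

# Hypothetical statistical model: for each property, the probability that the
# property is enriched around a known site versus around a control non-site.
# All names and numbers here are invented for illustration.
MODEL = {
    "oxygen_density": (0.8, 0.3),
    "negative_charge": (0.7, 0.2),
    "hydrophobicity": (0.1, 0.5),
}


def log_odds_score(observed: dict[str, bool]) -> float:
    """Sum per-property log-likelihood ratios (site vs. non-site)."""
    score = 0.0
    for prop, (p_site, p_nonsite) in MODEL.items():
        if observed[prop]:
            score += math.log(p_site / p_nonsite)
        else:
            score += math.log((1 - p_site) / (1 - p_nonsite))
    return score


if __name__ == "__main__":
    region = {"oxygen_density": True, "negative_charge": True,
              "hydrophobicity": False}
    print(log_odds_score(region))
```

A positive score means the region looks more like the known sites than the control non-sites; ranking regions by this score yields the kind of scored list of candidate locations the slide describes.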

9
Methods
  • FEATURE run on all protein entries in May 2000
    PDB
  • Searched for potential Calcium binding sites
  • FEATURE has 90% sensitivity and 100% specificity
    for these sites
  • Three experiments conducted
  • Sequential scan of PDB subset using single
    processor
  • Comprehensive scan of PDB using LEGION system
    using 50 processors
  • Set of LEGION runs using a constant PDB subset
    but a varying number of processors
  • Input parameters to FEATURE and statistical model
    for Ca remained constant
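
For reference, sensitivity and specificity follow the standard definitions: sensitivity is the fraction of true sites that are recognized, and specificity is the fraction of non-sites that are correctly rejected. A small Python sketch, with counts invented purely to illustrate the arithmetic (they are not from the paper):

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real sites that the method recognizes."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-sites that the method correctly rejects."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # Invented counts: 9 of 10 sites found, 0 of 20 non-sites misreported.
    print(sensitivity(true_pos=9, false_neg=1))
    print(specificity(true_neg=20, false_pos=0))
```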

10
Methods
  • Experiments
  • Sequentially scanned 726 arbitrarily chosen
    proteins from PDB
  • Runs made on single processor Sun E450 machine
    with 300 MHz Ultra-Sparc CPU
  • Comprehensive scan of all proteins (10,996 total)
    in PDB
  • Maximum number of processors: 50
  • FEATURE code compiled for various platforms so
    binaries can be run on different machines across
    LEGION
  • Scanned subset of proteins with varying number of
    processors
  • Arbitrarily selected 4997 proteins for each run
  • Varied number of processors using values 20, 40,
    60, and 80
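
A rough ideal-scaling baseline helps in reading the varying-processor experiment. The sketch below takes the earlier figure of 12 hours for 580 entries on one processor and assumes, purely for estimation, that every processor runs at that same rate and that work divides perfectly. Neither assumption is claimed by the authors, and the machines behind the two figures may differ, so the output is a back-of-the-envelope estimate rather than a reported result.

```python
# Per-entry cost derived from the 12 h / 580 entries figure.
SECONDS_PER_ENTRY = 12 * 3600 / 580


def ideal_hours(n_entries: int, n_processors: int) -> float:
    """Wall-clock hours assuming perfect scaling on identical processors."""
    return n_entries * SECONDS_PER_ENTRY / n_processors / 3600


if __name__ == "__main__":
    for p in (20, 40, 60, 80):
        print(p, "processors:", round(ideal_hours(4997, p), 2), "hours (ideal)")
```

Measured run times can then be compared against this baseline; any gap reflects scheduling overhead, slower nodes, or other costs that perfect scaling ignores.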

11
Results
  • FEATURE reported six run-time failures due to
    non-standard PDB file formats during the
    sequential run
  • FEATURE also encountered run-time assertion
    failures, illegal instructions, or segmentation
    faults during the second experiment

12
Results

13
Discussion
  • FEATURE performance deteriorates after the number
    of processors exceeds 60
  • Optimal maximum number is constrained by:
  • the client's process table, which keeps track of
    each LEGION process spawned
  • the amount of memory available to support spawned
    processes
  • Thus even if LEGION contains hundreds of nodes,
    users cannot use them all at once
  • Also, LEGION provides minimal fault tolerance (if
    any instance fails, the user must wait until
    everything has finished to re-spawn)
  • Authors maintained a local copy of the database
    but concede that this is not a realistic
    situation because:
  • updates to PDB occur frequently
  • a local copy consumes a lot of disk space
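
The re-spawn workflow described above (let the whole batch finish, then rerun only the instances that failed) can be sketched as follows. This is a generic Python illustration with a local thread pool, not LEGION's mechanism; `run_batch` and `run_with_respawn` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(task, items, n_workers):
    """Run task on every item; return (results, failed_items)."""
    def safe(item):
        try:
            return item, task(item), None
        except Exception as exc:  # one bad entry must not abort the batch
            return item, None, exc

    results, failed = {}, []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for item, value, exc in pool.map(safe, items):
            if exc is None:
                results[item] = value
            else:
                failed.append(item)
    return results, failed


def run_with_respawn(task, items, n_workers, max_passes=2):
    """Re-run failed items only after the whole pass has finished."""
    results, pending = {}, list(items)
    for _ in range(max_passes):
        done, pending = run_batch(task, pending, n_workers)
        results.update(done)
        if not pending:
            break
    return results, pending
```

Items still listed as pending after the last pass are the ones that fail deterministically, such as entries with malformed input files, and would need manual inspection rather than another retry.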

14
Related Work
  • Threaded BLAST and MPI BLAST
  • Authors' work is similar to threaded BLAST
  • MPI BLAST is a parallelized version of BLAST, so
    a single query can be split across multiple
    processors
  • FEATURE is not truly parallelized

15
Observations
  • Running CPU intensive tasks over many processors
    is definitely useful
  • However, LEGION does not scale well as there is
    performance degradation after 60 processors
  • They have not utilized true parallelism in
    FEATURE
  • It seems to me that there is a lot of potential
    to parallelize FEATURE, given that many potential
    sites can be examined simultaneously
  • What would the performance gain be in a
    parallelized version?

16
Questions