TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub

1
TurboBLAST: A Parallel Implementation of BLAST
Built on the TurboHub
  • Bin Gan
  • CMSC 838 Presentation

2
Motivation
  • NCBI BLAST on a single processor has become too
    costly, inefficient, and time-consuming
  • Sequence databases are exploding in size
  • Growing at an exponential rate
  • Growth exceeds the rate of increase in hardware
    capabilities (Moore's Law)
  • Large databases cause thrashing and buffer
    management problems
  • Goals
  • Faster results for life science laboratories
  • Do not change the BLAST algorithm
  • Avoid using costly multiprocessor machines
  • Use cheap clusters of machines as an alternative

3
Talk Overview
  • Overview of talk
  • Motivation
  • Techniques
  • Database partition
  • Use the sequential BLAST
  • Merge results
  • TurboHub infrastructure
  • Evaluation
  • 3 test runs and analysis
  • Related work
  • Powerblast
  • Paracel's BLAST Machine
  • mpiBLAST
  • Words of Bill Pearson (author of FASTA)
  • Observations

4
Techniques
  • Approach
  • Main intuition
  • Implementation
  • Clients, master, and workers
  • TurboHub System
  • Load balance
  • Fault recovery
  • Dynamic database partitioning
  • Binary tree analogy

5
Techniques
  • Approach
  • Split databases instead of query sequences, in
    binary tree fashion
  • Algorithms decide how to split, with the goal of
    balancing load and overhead
  • Each processor runs the complete, unmodified
    sequential BLAST on its database subset
  • Merge the results into XML format
  • Adjust BLAST statistics for database sizes
  • TurboHub provides backend support for
    scheduling, fault recovery, etc.
  • Main Intuitions
  • Divide and conquer
  • BLAST compares the query sequence with each
    sequence in the database individually
  • Very little communication is needed, and the
    communication is not order-dependent
  • Easy to achieve parallelism by splitting the
    database and assembling the results
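The main intuition can be sketched as a toy Python program (the names and the scoring are illustrative only; run_blast below is a stand-in 3-mer counter, not real BLAST): split the database, search each subset independently, and merge the hits by score.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    subject_id: str
    score: int

def split_database(db, n_parts):
    """Partition a list of (id, sequence) pairs into n_parts subsets."""
    return [db[i::n_parts] for i in range(n_parts)]

def run_blast(query, db_subset):
    """Stand-in for an unmodified sequential BLAST run on one subset."""
    def kmers(s):
        return {s[i:i + 3] for i in range(len(s) - 2)}
    return [Hit(sid, len(kmers(query) & kmers(seq)))
            for sid, seq in db_subset]

def parallel_blast(query, db, n_workers):
    # Because BLAST scores the query against each database sequence
    # independently, subset results can be merged in any order.
    hits = []
    for subset in split_database(db, n_workers):
        hits.extend(run_blast(query, subset))
    return sorted(hits, key=lambda h: h.score, reverse=True)
```

Splitting by database rather than by query keeps each worker running the stock sequential program, which is the paper's stated constraint of not changing the BLAST algorithm.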

6
Techniques
  • Implementation
  • A 3-tier system
  • Client
  • End user submitting a job to the system
  • Master
  • A Java application that accepts the job
  • Sets up for processing
  • Uses TurboHub to
  • Manage task execution
  • Coordinate the workers
  • Support dynamic changes in the set of workers,
    fault tolerance, etc.
  • Workers
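The three-tier flow can be sketched with a simple shared task queue (an assumption for illustration; TurboHub's real scheduling, fault tolerance, and virtual shared memory go well beyond this): the client's job arrives at the master, which enqueues tasks for a pool of workers.

```python
import queue
import threading

def master(job, task_queue, n_workers):
    """Accept a job, break it into tasks, and hand them to workers."""
    for task in job:                  # e.g. one task per database chunk
        task_queue.put(task)
    for _ in range(n_workers):
        task_queue.put(None)          # sentinel: tells a worker to stop

def worker(task_queue, results, lock):
    """Consume tasks until the sentinel; a stand-in for running blastall."""
    while (task := task_queue.get()) is not None:
        with lock:
            results.append(f"blast:{task}")

tasks, results, lock = queue.Queue(), [], threading.Lock()
threads = [threading.Thread(target=worker, args=(tasks, results, lock))
           for _ in range(3)]
for t in threads:
    t.start()
master(["chunk0", "chunk1", "chunk2", "chunk3"], tasks, 3)
for t in threads:
    t.join()
```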

7
Techniques
  • Implementation, cont.
  • Workers
  • Each has a local copy of NCBI blastall
  • Partition the database so that each resulting
    portion fits into available physical memory
  • Initial tasks group 10-20 query sequences
    against all the databases to amortize startup
    cost
  • Some worker processes merge the results
  • Parse the output (stored in XML format)
  • Adjust BLAST statistics for database size
  • Scheduling uses the Piranha model
  • Not discussed in the paper, but very important
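On adjusting BLAST statistics for database size: one standard correction (a sketch of the idea, not necessarily TurboBLAST's exact method) exploits the fact that the BLAST E-value scales linearly with the size of the search space, so a hit found against one partition can be rescaled to reflect the full database.

```python
def rescale_evalue(e_partition, partition_length, full_db_length):
    """Rescale an E-value computed against one database partition so it
    matches what a search of the full database would report.
    (E-values grow in proportion to the search space searched.)"""
    return e_partition * (full_db_length / partition_length)

# A hit with E = 1e-5 against a quarter of the database corresponds to
# E = 4e-5 against the whole database.
full_e = rescale_evalue(1e-5, 1_000_000, 4_000_000)
```

In practice the same effect can also be achieved by telling blastall the effective database size up front rather than rescaling in post-processing.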

8
Techniques
  • TurboHub System
  • Developed by Scientific Computing Associates
  • Capabilities
  • Pipelining
  • Component Replication
  • Parallel Components in combination with tools
    from SCA, MPI, PVM, OpenMP
  • Application to TurboBLAST
  • Each worker is a wrapped-up blastall component
  • Component scheduling
  • Fault recovery

9
Techniques
  • Task/Database Splitting
  • 2 options
  • Large Task
  • Advantage
  • Maximize resource utilization
  • Minimize task startup overhead
  • Disadvantage
  • Load imbalance
  • Limit the performance gain
  • Small Task
  • Advantages and disadvantages are the reverse of
    the above

10
Techniques
  • Task/Database Splitting, cont.
  • The paper's intermediate approach
  • Create a large initial task, sized by experience
  • Communication and program startup costs are
    negligible for tasks of at least 10-20 input
    query sequences on machines with 256 MB of
    memory
  • If the task is too large, split the databases
  • For multiple databases, place roughly half of
    the databases in each sub-database
  • For a single database, split the database in
    half
  • Uses virtual shared memory
  • Database files are never sent to a worker until
    it actually requires them
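The splitting rule above can be sketched as a short recursion (sizes and the memory bound are illustrative assumptions): keep a task whole if it fits, otherwise halve the set of databases, or halve a single database.

```python
def split_task(databases, max_size):
    """databases: list of (name, size) pairs. Returns a list of tasks,
    each a list of database pieces whose total size fits max_size."""
    total = sum(size for _, size in databases)
    if total <= max_size:
        return [databases]            # fits in memory: one task
    if len(databases) > 1:
        # Multiple databases: put roughly half in each sub-database.
        mid = len(databases) // 2
        left, right = databases[:mid], databases[mid:]
    else:
        # Single database: split it in half (formatdb in practice).
        name, size = databases[0]
        left = [(name + ".0", size // 2)]
        right = [(name + ".1", size - size // 2)]
    return split_task(left, max_size) + split_task(right, max_size)
```

For example, one 800-unit database with a 250-unit memory bound splits twice, yielding four quarter-size pieces.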

11
Techniques
  • Database Splitting
  • Split using the NCBI database formatting
    program, formatdb
  • Binary tree analogy
  • The leaves, combined, form the complete database
  • The portion of the database a worker accesses
    depends on which tree node it is assigned
  • It uses all the leaves under the chosen node
  • Advantages
  • Flexibility
  • Delivers exactly as much data as needed
  • Single copy of the database
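The binary-tree analogy can be sketched with heap-style node numbering (root = 1, children of node k are 2k and 2k+1 — an assumption for illustration; the paper does not specify a numbering). The leaves are the formatdb output fragments, and a worker assigned to node k searches every leaf in k's subtree.

```python
def leaves_under(node, n_leaves):
    """Leaf indices (0-based) in the subtree of `node`, for a complete
    binary tree with n_leaves leaves (n_leaves a power of two).
    Nodes n_leaves .. 2*n_leaves - 1 are the leaves themselves."""
    if node >= n_leaves:
        return [node - n_leaves]
    return (leaves_under(2 * node, n_leaves)
            + leaves_under(2 * node + 1, n_leaves))

# The root covers the whole database; node 3 covers the right half.
```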

12
Evaluation
  • Experimental environment for test one
  • Input data set: 50 Expressed Sequence Tags
    (ESTs)
  • Databases used:
  • Drosophila (1,170 sequences, 123 million
    nucleotides)
  • GSS Division of GenBank (1.27 million sequences,
    651 million nucleotides)
  • E. coli (400 sequences, 4.6 million nucleotides)
  • A group of 500 MHz PIIIs with 512 KB cache,
    256 MB memory, 100 Mb Ethernet
  • Performance results for test one
  • Serial version: 2131.8 seconds (wall-clock time)
  • Parallel version with 11 workers: 130.0 seconds
    (speedup 16)

13
Evaluation
  • Experimental environment for test two
  • Input data set: chromosomes 1, 2, and 4 from the
    Arabidopsis genome
  • Database used:
  • Swiss-Prot protein database (12.8 million
    peptides)
  • A group of 500 MHz PIIIs with 512 KB cache,
    256 MB memory, 100 Mb Ethernet
  • Performance results for test two
  • Serial version: 5 days, 19 hours, 13 minutes
  • Parallel version with 11 workers: 12 hours, 54
    minutes (speedup 10.8)

14
Evaluation
  • Experimental environment for test three
  • Input data set: 500 mouse ESTs of 200-400
    nucleotides each
  • Database used:
  • The NT database from NCBI (1,681,522,266
    nucleotides)
  • An IBM Linux cluster of 8 dual-processor
    workstations
  • Each workstation contains two 996 MHz PIIIs with
    2 GB memory, 100 Mb Ethernet
  • Performance results for test three
  • Serial version: 4945 seconds
  • Parallel version with 8 workstations (16
    workers): 357.03 seconds (speedup 13.85)
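The speedup figures on the three evaluation slides can be re-derived from the reported wall-clock times:

```python
def speedup(serial_seconds, parallel_seconds):
    return serial_seconds / parallel_seconds

test1 = speedup(2131.8, 130.0)                   # 11 workers -> ~16.4x
test2 = speedup((5 * 24 + 19) * 3600 + 13 * 60,  # 5 d 19 h 13 m serial
                12 * 3600 + 54 * 60)             # 12 h 54 m parallel
test3 = speedup(4945, 357.03)                    # 16 workers -> ~13.85x
```

Note that test one's roughly 16.4x on 11 workers is superlinear; the analysis on the next slide attributes this to thrashing avoidance once each database portion fits in a worker's physical memory.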

15
Evaluation Analysis
  • Memory size vs. database size
  • Thrashing avoidance for superlinear speedups
  • Single query at a time
  • Single query at each node
  • Overhead
  • Need to combine results
  • TurboHub overhead
  • Database transmission overhead

16
Related Work
  • Other parallel BLAST
  • Blackstone's PowerBLAST (part of PowerCloud)
  • Automates the splitting of query databases into
    smaller chunks
  • Spreads them out over the cluster nodes' local
    disks for querying
  • Automates the merging of BLAST results
  • Uses disk caching and scheduling techniques to
    speed up future queries of the same datasets
  • Paracel's BLAST Machine
  • Paracel actually got inside BLAST and
    parallelized the code
  • Posts impressive speedup numbers, and the
    statistics are the same as for an unaltered
    BLAST query
  • mpiBLAST
  • Splits the database across each node in the
    cluster, so it can usually reside in the
    buffer-cache

17
Related Work
  • Words of Bill Pearson (FASTA), in response to
    why there are no MPI- or PVM-parallelized
    versions of BLAST
  • (Note Paracel's type of parallelization)
  • It is too fast, and there is not much demand for
    it
  • 95% of the time, BLAST is almost an in-memory
    grep
  • Sequence comparison is embarrassingly parallel,
    and very easily threaded
  • Distributing the sequence databases and
    collecting results has more overhead
  • FASTA is 5-10X slower than BLAST
  • Smith-Waterman is 5-20X slower than FASTA
  • The communications overhead is low, and
    distributed systems work OK for FASTA, and great
    for Smith-Waterman

18
Observations
  • Observation
  • Efficient due to the parallelism inherent in the
    BLAST algorithm
  • Different database splitting techniques
  • Feasible in practice (in computing power, user
    effort, etc.)
  • Similar results to previous work
  • Improvements
  • Due to the requirement of not changing the BLAST
    code, superlinear speedup is only possible if
    existing thrashing is avoided
  • Larger memory and cache sizes
  • Better load balancing techniques
  • Overhead reduction; flexibility vs. performance