High-throughput sequence alignment using Graphics Processing Units

1
High-throughput sequence alignment using Graphics
Processing Units
  • Michael C Schatz, Cole Trapnell, Arthur L
    Delcher, Amitabh Varshney
  • UMD
  • Presented by Steve Rumble

2
Motivation
  • NGS technologies produce a ton of data
  • AB SOLiD: 22e6 25-mers
  • Others are even worse
  • How do 200e6 50-mers sound?
  • Algorithms have been pushed hard, but typically
    assume the same workstation CPU
  • Wozniak and others showed Smith-Waterman (S-W) could be
    well-parallelised on special hardware.
  • What of other algorithms/hardware?

3
Motivation
  • GPUs have recently evolved general purpose
    programmability (GPGPU)
  • E.g. nVidia 8800 GTX
  • 16 multiprocessors
  • 8 processors each
  • => 128 stream processors
  • 768MB onboard
  • 1.35GHz clock
  • Almost a year old now

4
Short GPU Overview
  • Highly parallel execution (hundreds of
    simultaneous operations)
  • Hundreds of gigaflops per chip!
  • Large on-board memories (up to 2GB)
  • Limitations
  • No recursion (no stacks)
  • Each multiprocessor's constituent processors
    execute the same instruction
  • Thread divergence due to conditionals hurts
  • No direct host memory access
  • Small caches (locality is key)
  • High memory latency
  • No dynamic memory allocation (why one would ever
    do that, I don't know)

5
Short GPU Overview
  • GPGPU environments
  • Previously had to reduce problems to graphics
    primitives; no more
  • Simplified C-like programming
  • Paper has very little detail, but they make it
    sound enticingly simple
  • Each processor runs the same kernel

6
Muh-muh-muh MUMmer!
  • Maximal Unique Match
  • Find longest match for each subsequence of a read
    (of reasonable length)
  • Employs Suffix Trees

7
MUMmerGPU
  • Plug-and-play replacement for MUMmer
  • MUMmer is not arithmetic intensive
  • Is the GPU a good fit?
  • Six-step process
  • 1) Build suffix tree of reference genome
    (Ukkonen's algorithm, O(n)) on host CPU
  • 2) Suffix tree -> GPU memory
  • 3) Queries -> GPU memory
  • 4) Kick off the GPU
  • 5) Results -> host memory
  • 6) Final processing on Host CPU

8
Suffix Trees
  • We want to find the longest match for a
    string (the query) quickly
  • Suffix trees permit O(m) string search, where m =
    query string length
  • Space complexity is O(n) for an indexed string of
    length n
  • But constants are apparently pretty big

9
Suffix Trees
  • Definition (for a string S with n suffixes)
  • Each edge has an edge label
  • A substring of S
  • Non-empty (but can be just the terminating character)
  • A path label is the sequence formed by traversing
    from root to leaf
  • 1-1 correspondence of suffixes of S to path
    labels
  • Internal nodes have at least 2 children
  • n leaf nodes, one for each suffix of S

10
Suffix Trees
  • O(n) space
  • n leaf nodes
  • => at most n - 1 internal nodes
  • => n + (n - 1) + 1 = 2n nodes (worst case)

Example: n = 3, n - 1 = 2, so 3 + 2 + root = 6 nodes

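The 2n bound can be checked on a small string. The sketch below is only an illustration: it builds a naive, uncompressed suffix trie in O(n^2) time (not the Ukkonen construction mentioned earlier) and counts the nodes the compressed suffix tree would keep, namely the root, every branching node, and every leaf. The names build_suffix_trie and count_nodes are hypothetical, and the input is assumed to end in a unique terminator.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert every suffix of s (O(n^2), for illustration only)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start  # suffix starting at `start` ends here
    return root


def count_nodes(s):
    """Count leaves and non-root branching nodes, i.e. the suffix-tree nodes."""
    root = build_suffix_trie(s)
    leaves = internal = 0
    stack = [root]  # explicit stack, no recursion
    while stack:
        node = stack.pop()
        children = [child for key, child in node.items() if key != "leaf"]
        if "leaf" in node:
            leaves += 1
        elif node is not root and len(children) >= 2:
            internal += 1
        stack.extend(children)
    return leaves, internal


leaves, internal = count_nodes("BANANA$")
total = leaves + internal + 1  # +1 for the root
print(leaves, internal, total)  # 7 3 11, within the 2n = 14 bound
```
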
11
Suffix Trees
  • Example TORONTO
  • $ is the terminating character

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

12
Suffix Trees
  • Example TORONTO
  • Searching for ONT

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

13
Suffix Trees
  • Example TORONTO
  • Searching for ONT

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

14
Suffix Trees
  • Example TORONTO
  • Searching for ONT

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

15
Suffix Trees
  • Example TORONTO
  • Searching for ONT

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

ONT at position 3 in S

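The ONT lookup can be reproduced in a few lines of Python. This is a minimal sketch, assuming a naive uncompressed suffix trie built in O(n^2) rather than a true suffix tree; the lookup itself still consumes one query character per step, which is where the O(m) search time comes from. The names build_suffix_trie and find are hypothetical.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert every suffix of s (O(n^2), for illustration only)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start  # suffix starting at `start` ends here
    return root


def find(root, query):
    """Return the start position of one occurrence of query, or None."""
    node = root
    for ch in query:  # one step per query character: O(m)
        if ch not in node:
            return None
        node = node[ch]
    # Every leaf below this node is a suffix that begins with the query.
    while "leaf" not in node:
        node = next(iter(node.values()))
    return node["leaf"]


trie = build_suffix_trie("TORONTO$")
print(find(trie, "ONT"))  # 3
print(find(trie, "XYZ"))  # None
```
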
16
Suffix Trees
  • MUMmer wants to find all maximal unique matches
    for all suffixes
  • E.g., for query ACCGTGCGTC, we want
  • ACCGTGCGTC
  • CCGTGCGTC
  • CGTGCGTC
  • GTGCGTC
  • Up to some reasonable limit
  • Don't want to go back to root of tree each time
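To make the goal concrete, here is a naive baseline in Python that reports the longest exact match for every suffix of a query. It is a sketch under stated assumptions: it uses an uncompressed suffix trie, it does restart from the root for each suffix (exactly the cost the last bullet wants to avoid), and it does not check uniqueness of matches. The names build_suffix_trie, longest_matches and min_len are hypothetical.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert every suffix of s (O(n^2), for illustration only)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start
    return root


def longest_matches(reference, query, min_len=1):
    """For each query suffix, return (query offset, longest match length)."""
    trie = build_suffix_trie(reference)
    results = []
    for offset in range(len(query)):
        node, length = trie, 0  # naive: back to the root for every suffix
        for ch in query[offset:]:
            if ch not in node:
                break
            node = node[ch]
            length += 1
        if length >= min_len:  # the "reasonable limit" on match length
            results.append((offset, length))
    return results


print(longest_matches("BANANA$", "ANANA", min_len=3))
# [(0, 5), (1, 4), (2, 3)]
```
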

17
Suffix Trees
  • Suffix Links
  • All internal, non-root nodes have a suffix link
    to another node
  • If x is a single character and α is a (possibly
    empty) string, then the node v whose path-label
    is xα has a suffix link to the node v', whose
    path-label is α.
  • Got that?
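A small Python sketch may help with the definition. It lists the internal (branching) nodes of the suffix tree of BANANA$ by path label and maps each label xα to α, the label of its suffix-link target, with the empty label standing for the root. It assumes a naive uncompressed suffix trie as the underlying structure, and the names build_suffix_trie and suffix_links are hypothetical.

```python
def build_suffix_trie(s):
    """Naive suffix trie: insert every suffix of s (O(n^2), for illustration only)."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start
    return root


def suffix_links(s):
    """Map each internal node's path label (x + alpha) to alpha."""
    trie = build_suffix_trie(s)
    internal_labels = []
    stack = [("", trie)]  # (path label, node)
    while stack:
        label, node = stack.pop()
        children = [(k, v) for k, v in node.items() if k != "leaf"]
        if label and len(children) >= 2:  # branching and not the root
            internal_labels.append(label)
        stack.extend((label + k, v) for k, v in children)
    # Dropping the first character x gives the suffix-link target's label.
    return {label: label[1:] for label in internal_labels}


print(suffix_links("BANANA$"))
# maps ANA -> NA, NA -> A, and A -> "" (the root)
```

Following these links after a mismatch moves from the match for one suffix to a starting point for the next suffix without returning to the root.
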

18
Suffix Trees
  • Example TORONTO
  • Suffix links: don't backtrack (bad example)

[Figure: suffix tree for TORONTO$; leaves numbered 0-6 by suffix start position]

19
Suffix Trees
  • Example BANANA
  • Better example of Suffix Links

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

20
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

21
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

22
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

23
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

24
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

25
Suffix Trees
  • Example BANANA
  • Searching for suffixes of ANANA

[Figure: suffix tree for BANANA$ with suffix links; leaves numbered 0-5 by suffix start position]

26
Memory Limitations
  • Suffix trees take up a fair bit of memory
  • GPUs have 100s of MBs, but this is still small
  • Divide the target sequence into k segments with
    overlaps
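Splitting with overlaps can be sketched as follows. This is an illustrative Python helper, not the paper's code: split_with_overlap, k and overlap are hypothetical names, and each segment is returned with its start offset so that hits can be mapped back to reference coordinates. An overlap of at least (longest match length - 1) characters keeps any match that straddles a boundary whole inside one segment.

```python
def split_with_overlap(seq, k, overlap):
    """Split seq into k segments; each is extended by `overlap` characters
    borrowed from the start of the following segment."""
    n = len(seq)
    bounds = [i * n // k for i in range(k + 1)]  # k near-equal pieces
    return [
        (bounds[i], seq[bounds[i]:bounds[i + 1] + overlap])
        for i in range(k)
    ]


segments = split_with_overlap("ACGTACGTAC", k=2, overlap=3)
print(segments)  # [(0, 'ACGTACGT'), (5, 'CGTAC')]
```
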

27
Cache Optimisation
  • Memory latency high, cache performance crucial
  • We're walking a tree here, not crunching numbers
    down an array
  • Can store read-only data in 2D textures; nVidia
    caching scheme optimises access
  • Re-order and squish tree nodes into texel
    blocks such that
  • Nodes near root are level-ordered (BFS)
  • Nodes further down are ordered with descendants
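The level-ordered part of this layout can be sketched in Python. The example only shows the BFS numbering for nodes near the root; the descendant-grouped ordering for deeper nodes and the packing into texel blocks are not reproduced, since the slides give no detail on them. The function name bfs_layout and the toy tree are hypothetical.

```python
from collections import deque


def bfs_layout(children, root):
    """Return node ids in level order, so siblings sit next to each other."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, []))  # enqueue children left to right
    return order


toy_tree = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
layout = bfs_layout(toy_tree, "root")
print(layout)  # ['root', 'A', 'B', 'A1', 'A2', 'B1']
```

Storing nodes in this order means a node's children occupy adjacent slots, which is the property a block-oriented cache can exploit.
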

28
Cache Optimisation
  • Texture cache organized in 2x2 blocks.
  • Try to place all children of a node in the
    same cache block

Shamelessly cribbed from http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

29
Cache Optimisation
  • Reference Sequence stored in 4x216 blocks of a 2D
    array
  • Sequence: A B C D E F G H
  • Stored as: A E B F C G D H
  • Why? It worked well.

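The slide gives only the eight-letter example. One reading that reproduces it is a two-row, column-major interleave: cut the sequence into equal rows and store it column by column. The Python sketch below implements that reading; it is an assumption about the layout, not the documented scheme, and the names interleave and rows are hypothetical.

```python
def interleave(seq, rows):
    """Cut seq into `rows` equal chunks and read them column by column."""
    if len(seq) % rows:
        raise ValueError("sequence length must be a multiple of rows")
    width = len(seq) // rows
    chunks = [seq[r * width:(r + 1) * width] for r in range(rows)]
    return "".join(chunks[r][c] for c in range(width) for r in range(rows))


print(interleave("ABCDEFGH", rows=2))  # AEBFCGDH
```
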
30
Cache Optimisation
  • Memory layouts heuristically determined
  • nVidia cache details not public
  • Cache optimisation improves execution speed by
    several fold.

31
Conclusions
  • GPGPU isn't just good for arithmetic-intensive
    applications
  • 5-11x speed-up for NGS data

32
Conclusions
  • Fine Print
  • 5-11x is for the Suffix Tree kernel on the GPU
  • Reality is different!
  • 3.5x speed-up for real data in terms of total
    application runtime.
  • Pretty constant across read lengths (35-700 bp)
  • Careful management of memory layout is crucial
  • Authors claim several-fold performance increase
    (could be difference between some improvement and
    none)

33
Conclusions
  • Runtime dominated by serial parts of MUMmer

34
Food for Thought
  • 8800 GTX costs $400, uses 100-150 watts
  • Quad Core 2 chip runs $250, uses 100-130 watts
  • Each core approx. 2x faster than their test CPU
  • MUMmerGPU maximally 3.5x faster than test CPU
  • What have we won here?
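The comparison can be put into numbers with a few lines of Python. This is back-of-the-envelope arithmetic using the figures on this slide, and it rests on two loud assumptions: the prices are in dollars, and the workload scales perfectly across all four CPU cores, which the slides do not demonstrate. The function name speedup_per_dollar is hypothetical.

```python
def speedup_per_dollar(speedup, price):
    """Speed-up over the test CPU bought per dollar of hardware."""
    return speedup / price


# GPU: at most 3.5x over the test CPU for a $400 card.
gpu = speedup_per_dollar(3.5, 400)

# Quad-core CPU: 4 cores, each about 2x the test CPU, for $250.
# Assumption: perfect scaling across the four cores.
cpu = speedup_per_dollar(4 * 2.0, 250)

print(gpu, cpu)             # 0.00875 0.032
print(round(cpu / gpu, 2))  # 3.66
```
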

35
Food for Thought
  • Confusing reports
  • Fast Exact String Matching on the GPU (Schatz,
    Trapnell) claims up to 35x improvement
  • Earlier course paper (early/mid-2007)
  • Why from 35x down to 5-11x with MUMmerGPU?

36
My Impressions
  • (whatever they're worth)
  • GPU is not a clear win (in this case)
  • Suffix trees seem unsuited
  • Cache locality trouble
  • O(n) footprint, but multiplicative constants are
    still substantial
  • Host CPUs seem to be as good or better (in $ and
    watts)

37
My Impressions
  • GPGPUs aren't a great fit here
  • At least for this algorithm
  • MUMmerGPU isn't the order-of-magnitude win it
    claims to be
  • But this is a first-generation, general-purpose
    chip
  • geared toward number-crunching, not
    pointer-traversing
  • I don't think we've seen the last (nor the best)
    of GPUs