Computational Thinking for Biology - PowerPoint PPT Presentation

About This Presentation
Title:

Computational Thinking for Biology

Description:

Computational Thinking for Biology & Sequence Alignment ... Slide courtesy: Jeannette M. Wing. CT-biology & Seq. Alignment, SC|09 Education, Nov 14, 2009 ... – PowerPoint PPT presentation

Slides: 74
Provided by: Ana137
Transcript and Presenter's Notes

Title: Computational Thinking for Biology


1
Computational Thinking for Biology & Sequence
Alignment
  • Ananth Kalyanaraman
  • Asst. Professor, School of EECS
  • Washington State University

2
About the Instructor
  • Ananth Kalyanaraman
  • Assistant Professor School of Electrical
    Engineering and Computer Science Washington
    State University Pullman, WA
  • Presenter Contact
  • Email: ananth_at_eecs.wsu.edu
  • Website: http://www.eecs.wsu.edu/ananth
  • Research Interests
  • Computational Biology and Bioinformatics
  • Parallel Algorithms and Applications
  • String Algorithms and Combinatorial Pattern
    Matching

3
Computing: The Art of Problem Solving
4
What is Computing?
  • An art of problem solving
  • Algorithm
  • A systematic step-by-step approach to solve a
    problem

[Figure: Input, Algorithm, Output. The algorithm is supported by computing tools (e.g., computer, notebook, calculator, bins, pen, pencil) and data structures.]
5
So, Can We Compute Without a Computer?
Computer Science is no more about computers
than the music industry is about microphones.
Sherlock Holmes in the 21st Century
6
Computing in All Walks of Life!
7
Dictionary Lookup
  • Input: a word to search
  • Output: the meaning
  • Idea
  • Use the alphabetical listing within a dictionary
  • Linear search
  • Start at the first page
  • Binary search
  • Start at the middle page
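
The two lookup strategies above can be compared with a small Python sketch. The word list and the function names are illustrative assumptions, not part of the slides; each function also reports how many entries it inspected.

```python
def linear_search(words, target):
    """Start at the first entry and scan forward.
    Returns (index, entries inspected); index is -1 if the word is absent."""
    for inspected, word in enumerate(words, start=1):
        if word == target:
            return inspected - 1, inspected
    return -1, len(words)


def binary_search(words, target):
    """Start at the middle entry and halve the range each step.
    Requires `words` to be sorted. Returns (index, entries inspected)."""
    lo, hi, inspected = 0, len(words) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        inspected += 1
        if words[mid] == target:
            return mid, inspected
        if words[mid] < target:
            lo = mid + 1      # target is in the upper half
        else:
            hi = mid - 1      # target is in the lower half
    return -1, inspected


# Hypothetical seven-word "dictionary", kept in alphabetical order.
words = sorted(["allele", "codon", "exon", "gene", "genome", "intron", "protein"])
print(linear_search(words, "protein"))   # inspects every entry
print(binary_search(words, "protein"))   # inspects about log2(n) entries
```

Binary search only works because the dictionary is alphabetically ordered; on an unordered list, linear search is the fallback.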

8
Order of Processing
Stack: first in, last out
Queue: first in, first out
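
A minimal Python sketch of the two processing orders, using a plain list as the stack and `collections.deque` as the queue. The function names are illustrative assumptions.

```python
from collections import deque


def process_as_stack(items):
    """First in, last out: the most recently added item is handled first."""
    stack = list(items)
    order = []
    while stack:
        order.append(stack.pop())        # remove from the top
    return order


def process_as_queue(items):
    """First in, first out: items are handled in arrival order."""
    queue = deque(items)
    order = []
    while queue:
        order.append(queue.popleft())    # remove from the front
    return order


print(process_as_stack(["a", "b", "c"]))
print(process_as_queue(["a", "b", "c"]))
```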
9
Sorting
  • Sorting mail
  • Idea
  • Use alphabetically or geographically sorted bins
    to sort the mail

Slide courtesy Jeannette M. Wing
10
Traveling Salesman Problem
Slide courtesy Jeannette M. Wing
11
Pipelining: Doing Laundry
Linear processing: 6 hours to do 4 loads
Pipelined processing: 3.3 hours to do 4 loads
Slide courtesy Jeannette M. Wing
12
Finding the Shortest Route
  • How do Google Maps, MapQuest, and your GPS work?
  • Input
  • A map
  • Source
  • Destination
  • Output
  • Directions to go from source to destination in
  • Shortest possible time, or
  • Shortest possible distance

Cost
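
One common way to solve this task is Dijkstra's algorithm. The slides do not name a specific method, so the sketch below is an assumption about how such a route finder could work, and the tiny `road_map` is made up for illustration. Edge costs can stand for either travel time or distance.

```python
import heapq


def shortest_route(graph, source, destination):
    """graph maps each node to a list of (neighbor, cost) pairs with cost >= 0.
    Returns (total cost, list of nodes on the route), or (inf, []) if unreachable."""
    best = {source: 0}          # cheapest known cost to reach each node
    previous = {}               # back-pointers used to rebuild the route
    heap = [(0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == destination:
            break
        if cost > best.get(node, float("inf")):
            continue            # stale heap entry, a cheaper one was already handled
        for neighbor, edge_cost in graph.get(node, []):
            new_cost = cost + edge_cost
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if destination not in best:
        return float("inf"), []
    route = [destination]
    while route[-1] != source:
        route.append(previous[route[-1]])
    return best[destination], route[::-1]


# Hypothetical map: intersections A-D with travel costs on the roads.
road_map = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
print(shortest_route(road_map, "A", "D"))
```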
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Tower of Hanoi
  • Goal: Move all disks from peg A to peg C using
    peg B
  • Rules
  • Move one disk at a time
  • A larger disk cannot be placed on top of a smaller disk

Invented by the French mathematician Édouard Lucas,
1883
[Figure: three pegs labeled A, B, and C]
Question: What is the minimum number of moves
necessary to solve the problem?
17
Try it out!
  • http://www.mazeworks.com/hanoi/index.htm
  • Towers of Hanoi Strategy
  • Solve smaller problems first, and then
    incrementally build up to the final solution
  • A nice way to understand RECURSION
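
The recursive strategy can be written in a few lines of Python. This is a sketch with peg names taken from the slide; the function name is an assumption. Moving n disks means moving n-1 disks out of the way, moving the largest disk, then moving the n-1 disks back on top, which gives the well-known minimum of 2^n - 1 moves.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of (from_peg, to_peg) moves that transfers n disks."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, source, spare, target)    # clear the way: n-1 disks to the spare peg
        + [(source, target)]                   # move the largest disk
        + hanoi(n - 1, spare, target, source)  # bring the n-1 disks back on top
    )


for disks in range(1, 5):
    print(disks, "disks:", len(hanoi(disks)), "moves")
```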

18
Computational Thinking for Biology
19
Basic Molecular Entities
  • DNA
  • Double-stranded molecule
  • Computationally, each strand can be viewed as a
    string over the alphabet {a, c, g, t}
  • Genome
  • Collection of all DNA in a cell
  • Gene
  • Encodes the recipe for producing RNAs and
    proteins
  • RNA
  • Single-stranded molecule derived from genes (with
    alphabet {a, c, g, u})
  • Protein
  • A sequence of amino acids

All of these are string data
Source: http://rex.nci.nih.gov/behindthenews/ugt/05ugt/ugt05.htm
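
Treating these molecules as strings makes simple operations one-liners. The Python sketch below is a simplification under stated assumptions: transcription is modeled as replacing t with u on the coding strand, and the function names are illustrative.

```python
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(dna):
    """The opposite strand of a double-stranded DNA string, read in its own direction."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def transcribe(dna):
    """Rewrite a DNA string over {a,c,g,t} as an RNA string over {a,c,g,u}
    (simplified: coding strand in, same sequence out with t replaced by u)."""
    return dna.replace("t", "u")


print(reverse_complement("accaga"))
print(transcribe("accagagat"))
```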
20
Computers
0101000101010100100010100 1010101010100001010101010 0010101010101010101010000
Cells
accagagatataagacccagagagat acacccagagagagataaccaaaga cccagaggtttaaaccagagattacca
21
Combinatorics of the genetic code
Logic
  • $4^2 < 20 < 4^3$: two nucleotides give only 16 combinations, too few
    for the 20 amino acids, while three give 64
  • Hence 3 nucleotides per codon
  • no less, no more
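
The counting argument can be checked with a short Python sketch. The function name and default arguments are illustrative assumptions.

```python
def min_codon_length(alphabet_size=4, num_amino_acids=20):
    """Smallest k such that alphabet_size**k distinct codons can cover all amino acids."""
    k = 1
    while alphabet_size ** k < num_amino_acids:
        k += 1
    return k


print(4 ** 2, 4 ** 3)          # number of distinct 2-letter and 3-letter codons
print(min_codon_length())      # shortest codon length that can encode 20 amino acids
```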

22
The Molecular Language vs. Prog. Language
[Figure: DNA, transcription, mRNA, translation, Protein; the diagram is annotated with "Statics" and "Dynamics"]
23
Languages of many kinds
  • If there is a language, should there not be a
    grammar?
  • If there is a grammar, can we build a machine to
    answer membership questions?

24
Example Application: RNA Secondary Structure
25
Information Flow During Protein Synthesis
[Figure: a gene on the DNA of the nuclear genome, with the 5' and 3' ends of each strand labeled]
One gene can code for many proteins!
(alternative splicing in eukaryotes)
26
How to understand biological networks?
  • Graph
  • directed, undirected, weighted
  • Vertices
  • points
  • Edges
  • relations

27
Bioinformatics/Computational Biology: An
Interdisciplinary Field
  • General Schema in BCB Research
  • Ask questions of biological importance
  • Answer them through the development and
    application of computational tools
  • Validate/Verify the answers through biological
    and/or computational means

Antedisciplinary yet? Reading: S.R. Eddy,
Antedisciplinary science. PLoS Computational
Biology 1(1):e6, 2005.
28
Bioinformatics vs. Computational Biology
  • Computational Biology
  • Hypothesis-driven
  • Definition by NIH: The development and
    application of data-analytical and theoretical
    methods, mathematical modeling and computational
    simulation techniques to the study of biological,
    behavioral, and social systems
  • E.g., model gene regulatory networks
  • Bioinformatics
  • Data-driven
  • Definition by NIH: Research, development, or
    application of computational tools and approaches
    for expanding the use of biological, medical,
    behavioral or health data, including those to
    acquire, store, organize, archive, analyze, or
    visualize such data
  • E.g., protein database search

29
Sequence → Structure → Function
Sequence Discovery: Genome; Gene; Regulatory elements; RNA products; Proteins
Structure: Gene structure prediction; RNA structure prediction; Protein structure prediction
Function: Gene to protein annotation; Gene expression analysis; Microarray experiments; RNA interference; Metabolic networks/pathway
Evolutionary Studies: Tree of life; Speciation
Population Genetics: Haplotype analysis; Nucleotide polymorphism
[Figure: Protein Synthesis in a Eukaryotic Cell, starting from DNA] Source:
Science Primer, NCBI, NIH. http://en.wikipedia.org/wiki/Image:Proteinsynthesis.png
30
An example of where Biology can help answer a CS
question
  • Genetic Algorithms
  • A search technique for solving optimization
    problems whose solution space is
    combinatorially explosive
  • E.g., Traveling salesman problem
  • Main idea
  • Start with a population of chromosomes (or
    possible solutions)
  • At every step, a new generation (or offspring) of
    chromosomes is formed by random cross-over
    events
  • Select the fittest offspring and carry them over to the next
    generation
  • Keep iterating until some score condition is met
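
The main idea above can be sketched in Python on a toy problem. Everything specific here is an assumption made for illustration: chromosomes are bit strings, fitness is the number of 1s, cross-over uses a single random cut point, and the fittest of parents plus offspring survive. A real application such as the traveling salesman problem would need its own encoding and fitness function.

```python
import random


def fitness(chromosome):
    """Toy score (assumed for illustration): the number of 1s in the bit string."""
    return sum(chromosome)


def crossover(a, b, rng):
    """Single random cut point: the child takes a's prefix and b's suffix."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def evolve(pop_size=20, length=16, generations=50, seed=0):
    rng = random.Random(seed)
    # Start with a population of random chromosomes (possible solutions).
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = [max(fitness(c) for c in population)]
    for _ in range(generations):
        if history[-1] == length:            # score condition met
            break
        offspring = []
        for _ in range(pop_size):            # new generation by random cross-over
            parent_a, parent_b = rng.sample(population, 2)
            offspring.append(crossover(parent_a, parent_b, rng))
        # Select the fittest among parents and offspring for the next generation.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
        history.append(fitness(population[0]))
    return max(population, key=fitness), history


best, history = evolve()
print(fitness(best), history)
```

Because parents compete with their offspring, the best score in `history` never decreases from one generation to the next.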

31
Biology > Engineering: DNA Nanotechnology
Rothemund, Nature, 2006
DNA origami
  • DNA computing
  • DNA robots
  • nanomedicine, drug delivery

32
CS and Biology: A Symbiosis
  • CS way of thinking helps to
  • Solve biological problems
  • Even understand biological concepts from a
    different perspective
  • Biological way of thinking helps to
  • Create new problem solving techniques
  • Engineer new devices from biological
    elements
  • Both disciplines need each other

33
Some Interesting Problems
34
21st Century
  • Technology drives an information revolution

DNA sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGC
TGTACTGTCGTGTGTGTGTACTCTCCTCTCTCTAGTCTACGTGCTGTATG
CGTTAGTGTCGTCGTCTAGTAGTCGCGATGCTCTGATGTTAGAGGATGCA
CGATGCTGCTGCTACTAGCGTGCTGCTGCGATGTAGCTGTCGTACGTGTA
GTGTGCTGTAAGTCGAGTGTAGCTGGCGATGTATCGTGGT
Gene expression studies
Slide courtesy S. Batzoglou _at_ Stanford
35
Old vs. Next-generation DNA Sequencing
Technologies
  • Sanger sequencing (1981-)
  • 700-1000 bp
  • Next-generation sequencing (2007-)
  • 454/pyrosequencing
  • 400 bp reads, 1 Terabyte/year
  • Illumina/Solexa
  • 70 bp reads, 11 Terabytes/year
  • ABI/SOLiD
  • 50 bp reads, 5 Terabytes/year
  • Gigabytes in a matter of hours!

36
Genome Assembly
Input: Multiple copies of the same genome
Randomly cut each copy and generate copies
Output: Unordered genome fragments
Next Task: Assemble the genome from its fragments
37
Like a Jigsaw Puzzle (not a perfect analogy but
close enough)
[Figure: a genome and the DNA fragments derived from it, with two overlapping fragments f1 and f2]
Primary source of information for
assembly: Overlapping Fragments
Overlaps should account for errors and mutations
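
A minimal Python sketch of overlap detection between two fragments. It only finds exact suffix-prefix matches, which is a deliberate simplification: since real overlaps must tolerate errors and mutations, an assembler would use alignment instead. The function names, the `min_length` parameter, and the example fragments are illustrative assumptions.

```python
def overlap(f1, f2, min_length=3):
    """Length of the longest suffix of f1 that exactly equals a prefix of f2.
    Returns 0 if no overlap of at least min_length (>= 1) exists."""
    for length in range(min(len(f1), len(f2)), min_length - 1, -1):
        if f1[-length:] == f2[:length]:
            return length
    return 0


def merge(f1, f2, min_length=3):
    """Join two fragments along their overlap, or return None if they do not overlap."""
    n = overlap(f1, f2, min_length)
    return f1 + f2[n:] if n else None


f1 = "AGTAGCACAG"
f2 = "CACAGACTAC"
print(overlap(f1, f2), merge(f1, f2))
```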
38
MetaGenomics
  • Application of genomics to the study of microbial
    communities in their natural environment
  • Without necessitating the lab cultivation and
    culture of individual genomes

39
Source: J. Handelsman, Microbiology and Molecular
Biology Reviews, 669-685 (2004)
40
A Recurring Situation
Sequence Analysis Required!
41
Genomics is becoming a data-intensive field
  • Human Genome Project, 2001
  • 27 million sequences
  • 20,000 CPU hours
  • Sorcerer II Global Ocean Sampling (Yooseph et
    al., 2007)
  • 28.6 million ORFs
  • 1,000,000 CPU hours!
  • And even more data to come

42
Good news: Biological databases are growing
exponentially
And that is why we need Supercomputing!
43
Genome Assembly using Supercomputers
  • Time to solution can be reduced from months to
    days to even hours!
  • Solve bigger problems

44
Fine-grain Parallelism using Hardware Accelerators
These hardware accelerators could give anywhere
between 100x and 1000x speedup
IBM Cell Processor
GPUs (graphics processors)
Traditional CPUs (multi-core)
  • Algorithmic mapping onto special devices
  • Code optimization
  • Chip placements

45
Computer Scientists vs Biologists
46
Computer scientists vs Biologists
  • (almost) Nothing is ever true or false in Biology
  • Everything is true or false in computer science

Slide courtesy S. Batzoglou _at_ Stanford
47
Computer scientists vs Biologists
  • Biologists strive to understand the complicated,
    messy natural world
  • Computer scientists seek to build their own clean
    and organized virtual worlds

Slide courtesy S. Batzoglou _at_ Stanford
48
Computer scientists vs Biologists
  • Biologists are obsessed with being the first to
    discover something
  • Computer scientists are obsessed with being the
    first to invent or prove something

Slide courtesy S. Batzoglou _at_ Stanford
49
Computer scientists vs Biologists
  • Biologists are comfortable with the idea that all
    data have errors
  • Computer scientists are not

Slide courtesy S. Batzoglou _at_ Stanford
50
Computer scientists vs Biologists
  • Computer scientists get high-paid jobs sooner
    after graduation
  • Biologists typically have to complete one or more
    5-year post-docs...

Slide courtesy S. Batzoglou _at_ Stanford
51
Sequence Alignment
52
Sequenced Genomes
The NCBI Genome Project Report
As of November, 2009
Eukaryotes
  • 304 completed or assembled
  • 353 in progress

Prokaryotes
  • 1,000 completed
  • 2,091 in progress

53
Evolution
Slide courtesy S. Batzoglou _at_ Stanford
54
Evolution at the DNA level
SEQUENCE EDITS: Mutation, Deletion
ACGGTGCAGTTACCA
AC----CAGTCCACCA
REARRANGEMENTS: Inversion, Translocation, Duplication
Slide courtesy S. Batzoglou _at_ Stanford
55
Evolutionary Rates
[Figure: sequence changes passed on to the next generation, labeled OK, OK, OK, X, X, and "Still OK?"]
Slide courtesy S. Batzoglou _at_ Stanford
56
Sequence conservation typically points to
functionally similar regions
  • Alignment is the key to
  • Finding important regions
  • Determining function
  • Uncovering the evolutionary forces

Slide courtesy S. Batzoglou _at_ Stanford
57
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Definition: Given two strings $x = x_1 x_2 \ldots x_M$ and
$y = y_1 y_2 \ldots y_N$, an alignment is an assignment of
gaps to positions $0, \ldots, M$ in $x$, and $0, \ldots, N$ in $y$,
so as to line up each letter in one sequence
with either a letter, or a gap in the other
sequence
Slide courtesy S. Batzoglou _at_ Stanford
58
What is a good alignment?
  • AGGCTAGTT, AGCGAAGTTT
  • AGGCTAGTT-      (6 matches, 3 mismatches, 1 gap)
  • AGCGAAGTTT
  • AGGCTA-GTT-     (7 matches, 1 mismatch, 3 gaps)
  • AG-CGAAGTTT
  • AGGC-TA-GTT-    (7 matches, 0 mismatches, 5 gaps)
  • AG-CG-AAGTTT

Slide courtesy S. Batzoglou _at_ Stanford
59
Scoring Function
Alternative definition: minimal edit
distance. Given two strings x, y, find the minimum
number of edits (insertions, deletions, mutations) to
transform one string into the other
  • Sequence edits
  • AGGCCTC
  • Mutations: AGGACTC
  • Insertions: AGGGCCTC
  • Deletions: AGG.CTC
  • Scoring Function
  • Match: m
  • Mismatch: -s
  • Gap: -d

Challenge: Too many possible alignments, $\gg 2^N$
Slide courtesy S. Batzoglou _at_ Stanford
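
The scoring function can be applied to the three candidate alignments of AGGCTAGTT and AGCGAAGTTT from the "What is a good alignment?" slide. This Python sketch assumes m = s = d = 1 purely for illustration; the slides leave the values open, and the function names are not from the slides.

```python
def column_counts(row_x, row_y):
    """Count (matches, mismatches, gaps) over the columns of a gapped alignment."""
    assert len(row_x) == len(row_y), "both rows must have the same number of columns"
    matches = mismatches = gaps = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches, mismatches, gaps


def alignment_score(row_x, row_y, m=1, s=1, d=1):
    """Score = m per match, -s per mismatch, -d per gap (values assumed)."""
    matches, mismatches, gaps = column_counts(row_x, row_y)
    return m * matches - s * mismatches - d * gaps


candidates = [
    ("AGGCTAGTT-", "AGCGAAGTTT"),
    ("AGGCTA-GTT-", "AG-CGAAGTTT"),
    ("AGGC-TA-GTT-", "AG-CG-AAGTTT"),
]
for row_x, row_y in candidates:
    print(column_counts(row_x, row_y), alignment_score(row_x, row_y))
```

Which candidate wins depends on the parameters: a larger gap penalty d favors the first alignment, a larger mismatch penalty s favors the third.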
60
How do we compute the best alignment?
Sequence 1 (length M): AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
Sequence 2 (length N): AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
Cell (i,j) stores the optimal score of aligning
Seq1[1..i] with Seq2[1..j]
Cell (M,N) stores the score of an optimal alignment
Slide courtesy S. Batzoglou _at_ Stanford
61
Dynamic Programming
  • There are only a polynomial number of subproblems
  • Align $x_1 \ldots x_i$ to $y_1 \ldots y_j$
  • Original problem is one of the subproblems
  • Align $x_1 \ldots x_M$ to $y_1 \ldots y_N$
  • Each subproblem is easily solved from smaller
    subproblems
  • Then, we can apply Dynamic Programming!!!
  • Let
  • $F(i,j)$ = optimal score of aligning
  • $x_1 \ldots x_i$
  • $y_1 \ldots y_j$

Slide courtesy S. Batzoglou _at_ Stanford
62
Dynamic Programming (contd)
  • Notice three possible cases
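  • 1. $x_i$ aligns to $y_j$ (substitution)
  •    $x_1 \ldots x_{i-1} \;\; x_i$
  •    $y_1 \ldots y_{j-1} \;\; y_j$
  •    $F(i,j) = F(i-1, j-1) + m$ if $x_i = y_j$, and $F(i-1, j-1) - s$ if not
  • 2. $x_i$ aligns to a gap (deletion)
  •    $x_1 \ldots x_{i-1} \;\; x_i$
  •    $y_1 \ldots y_j \;\; -$
  •    $F(i,j) = F(i-1, j) - d$
  • 3. $y_j$ aligns to a gap (insertion)
  •    $x_1 \ldots x_i \;\; -$
  •    $y_1 \ldots y_{j-1} \;\; y_j$
  •    $F(i,j) = F(i, j-1) - d$
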
Slide courtesy S. Batzoglou _at_ Stanford
63
Dynamic Programming (contd)
  • How do we know which case is optimal?
  • Inductive assumption
  • F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
  • Then,

$$F(i,j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) - d \\ F(i, j-1) - d \end{cases}$$

  • where $s(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ if not

Slide courtesy S. Batzoglou _at_ Stanford
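
The recurrence translates directly into a table-filling loop. In this Python sketch the boundary values F(i,0) = -i·d and F(0,j) = -j·d are an assumption; they are not stated on the slides but are consistent with a gap costing d everywhere. The function name and default parameters are illustrative.

```python
def nw_table(x, y, m=1, s=1, d=1):
    """Fill the table F where F[i][j] is the optimal score of aligning x[:i] with y[:j]."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d                  # x[:i] aligned entirely to gaps (assumed boundary)
    for j in range(1, N + 1):
        F[0][j] = -j * d                  # y[:j] aligned entirely to gaps (assumed boundary)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(
                F[i - 1][j - 1] + sub,    # x_i aligns to y_j
                F[i - 1][j] - d,          # x_i aligns to a gap
                F[i][j - 1] - d,          # y_j aligns to a gap
            )
    return F


table = nw_table("AGTA", "ATA")
for row in table:
    print(row)
print("optimal score:", table[-1][-1])
```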
64
Sequence Alignment Example
Input
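  • x = AGTA
  • y = ATA

Match score = 1, Mismatch score = -1, Gap score = -1
[Figure: dynamic programming matrix F, with index i running over 0..3 and index j running over 0..4]
$$F(1,1) = \max\{F(0,0) + s(A,A),\; F(0,1) - d,\; F(1,0) - d\} = \max\{0 + 1,\; -1 - 1,\; -1 - 1\} = 1$$
Resulting alignment:
AGTA
A-TA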
Slide courtesy S. Batzoglou _at_ Stanford
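
The example can be reproduced end to end with a Python sketch that fills the table and then traces back from the last cell to recover the alignment. The tie-breaking order in the traceback (diagonal first, then gap in y, then gap in x) and the boundary values F(i,0) = -i·d, F(0,j) = -j·d are assumptions; other choices can yield different but equally good alignments.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (optimal score, aligned x, aligned y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback: walk from (M, N) to (0, 0), re-deriving which case produced each cell.
    row_x, row_y = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s):
            row_x.append(x[i - 1]); row_y.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-"); i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(row_x)), "".join(reversed(row_y))


score, row_x, row_y = needleman_wunsch("AGTA", "ATA")
print(score)
print(row_x)
print(row_y)
```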
65
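The Needleman-Wunsch Matrix: Global Alignment
[Figure: dynamic programming matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]
Computing the alignment takes O(MN) time and O(MN)
space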
Slide courtesy S. Batzoglou _at_ Stanford
66
A variant of the basic algorithm
  • Maybe it is OK to have an unlimited number of gaps at
    the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
(zero penalty for the leading gaps and for the trailing gaps)
  • Then, we don't want to penalize gaps at the ends

Slide courtesy S. Batzoglou _at_ Stanford
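
One way to realize this variant, sketched in Python under stated assumptions: initialize the first row and column to 0 so leading gaps are free, and read the answer as the best value in the last row or last column so trailing gaps are free. The slides do not spell out these details, and the function name and example strings are illustrative.

```python
def end_gap_free_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at the beginning and end cost nothing."""
    M, N = len(x), len(y)
    # First row and first column stay 0: leading gaps are not penalized.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
    # Trailing gaps are not penalized: the alignment may stop anywhere
    # in the last row or the last column.
    return max(max(F[M]), max(row[N] for row in F))


# The end of "GGGACG" overlaps the start of "ACGTTT" by three letters.
print(end_gap_free_score("ACGTTT", "GGGACG"))
```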
67
Different types of overlaps
Example: 2 overlapping reads from a sequencing
project - useful in genome assembly
Example: Search for a mouse gene within a human
chromosome
Slide courtesy S. Batzoglou _at_ Stanford
68
Parallelizing Sequence Alignment
  • Challenges & Techniques

69
So the trouble with parallelizing the dynamic
programming algorithm is:
Cell (i,j) depends on values of cells in the previous
row and previous column
[Figure: matrix with Sequence 1 along the rows (i-1, i) and Sequence 2 along the columns (j-1, j)]
Computation at cell (i,j) has to wait
for its 3 neighboring cells to be
computed: F(i-1,j), F(i-1,j-1), F(i,j-1)
70
Parallelization Technique 1 (Edmiston & Wagner)
[Figure: dynamic programming matrix divided among processors P0, P1, and P2, with regions labeled Time step 1 through Time step 9]
Example: #procs = 3
Strategy: Compute one anti-diagonal at each
parallel time step
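
The anti-diagonal strategy can be illustrated with a sequential Python sketch that visits cells in wavefront order. All cells with the same i + j depend only on the two previous anti-diagonals, so the inner loop over `cells` is the part that could be distributed across processors. This sketch works cell by cell and does not reproduce the figure's processor assignment; the function name and the boundary values are assumptions.

```python
def nw_by_antidiagonals(x, y, m=1, s=1, d=1):
    """Global alignment score computed one anti-diagonal (i + j = k) at a time.
    Returns (score, list of the cells handled in each time step)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    steps = []
    for k in range(2, M + N + 1):
        # Interior cells on anti-diagonal k: 1 <= i <= M and 1 <= j = k - i <= N.
        cells = [(i, k - i) for i in range(max(1, k - N), min(M, k - 1) + 1)]
        for i, j in cells:   # independent of each other: could run concurrently
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub, F[i - 1][j] - d, F[i][j - 1] - d)
        steps.append(cells)
    return F[M][N], steps


score, steps = nw_by_antidiagonals("AGTA", "ATA")
print(score)
for t, cells in enumerate(steps, start=1):
    print("time step", t, cells)
```

The number of time steps is M + N - 1 instead of M·N, but the early and late anti-diagonals are short, so some processors sit idle at the start and end.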
71
Parallelization Technique 2 (Aluru et al.)
Modify recurrence to eliminate left dependency
Example: #procs = 3
[Figure: block decomposition of the matrix among processors P0, P1, and P2]
Strategy: Use parallel prefix to compute the
scores along each row within each time step
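
A sequential Python sketch of how the left dependency can be removed for a linear gap penalty. This is one reading of the idea and not necessarily the exact formulation in the cited paper. Let w(j) be the best of the diagonal and upper cases, which depend only on the previous row. Then F(i,j) = max(w(j), F(i,j-1) - d), and substituting X(j) = F(i,j) + j·d gives X(j) = max(w(j) + j·d, X(j-1)), a running maximum. A running maximum is a prefix computation, which is the step a parallel prefix would distribute. Function names and boundary values are assumptions.

```python
from itertools import accumulate


def next_row(prev_row, i, x_i, y, m=1, s=1, d=1):
    """Compute row i of F from row i-1 using a prefix maximum along the row."""
    N = len(y)
    # w[j-1] covers the diagonal and upper cases; it needs only the previous row,
    # so all entries can be computed independently.
    w = [
        max(prev_row[j - 1] + (m if x_i == y[j - 1] else -s), prev_row[j] - d)
        for j in range(1, N + 1)
    ]
    # Shifted values X(j) = F(i,j) + j*d, starting from the boundary F(i,0) = -i*d.
    shifted = [-i * d] + [w[j - 1] + j * d for j in range(1, N + 1)]
    running_max = list(accumulate(shifted, max))    # the prefix computation
    return [running_max[j] - j * d for j in range(N + 1)]


def nw_score_prefix(x, y, m=1, s=1, d=1):
    row = [-j * d for j in range(len(y) + 1)]       # row 0
    for i, x_i in enumerate(x, start=1):
        row = next_row(row, i, x_i, y, m, s, d)
    return row[-1]


print(next_row([0, -1, -2, -3], 1, "A", "ATA"))
print(nw_score_prefix("AGTA", "ATA"))
```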
72
References
  • Computational Thinking
  • J.M. Wing (2006), Computational thinking,
    Communications of the ACM, v. 49 n. 3.
  • P.J. Denning (2007), Computing is a natural
    science, Communications of the ACM, v. 50 n. 7.
  • Sequence Alignment
  • T.F. Smith and M.S. Waterman (1981),
    Identification of common molecular subsequences.
    Journal of Molecular Biology, 147:195-197.
  • S.F. Altschul et al. (1990), Basic local alignment
    search tool. Journal of Molecular Biology,
    215:403-410.
  • J. Setubal and J. Meidanis (1997), Introduction
    to computational molecular biology. PWS
    Publishing Company, Boston, MA.
  • D. Gusfield (1997), Algorithms on strings, trees
    and sequences: computer science and computational
    biology. Cambridge University Press, Cambridge,
    London.

73
More References
  • Parallel Sequence Alignment
  • E.W. Edmiston and R.A. Wagner (1987),
    Parallelization of the dynamic programming
    algorithm for comparison of sequences, Proc.
    International Conference on Parallel Processing,
    pp. 78-80.
  • X. Huang (1989), A space-efficient parallel
    sequence comparison algorithm for a
    message-passing multiprocessor, International
    Journal of Parallel Programming, 18(3),
    pp. 223-239.
  • S. Aluru et al. (2003), Parallel biological
    sequence comparison using prefix computations,
    Journal of Parallel and Distributed Computing,
    63, pp. 264-272.
  • S. Rajko and S. Aluru (2004), Space and Time
    Optimal Parallel Sequence Alignments, IEEE
    Transactions on Parallel and Distributed Systems,
    15(12), pp. 1070-1081.