Sequence Alignment - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence Alignment

Description:

fr D. Fernandez-Baca ISU. BCB 444/544 F06 ISU Terribilini #8 ... Local: Smith-Waterman. NW and SW use dynamic programming. Variations: Gap penalty functions ... – PowerPoint PPT presentation

Provided by: drena1
Transcript and Presenter's Notes

1
Sequence Alignment
BCB 444/544 - Introduction to Bioinformatics
  • Lecture 8
  • 8_Sept08

Slides adapted from D. Fernandez-Baca ISU
2
Assignments: Reading & Exercises (before lecture)
  • Fri Sept 8
  • CH Chp 2 pp 34-59
  • Also, DQs MMs
  • Re: Sequence alignment & Dynamic programming
  • http://en.wikipedia.org/wiki/Sequence_alignment
  • (read all except sections on Multiple sequence
    alignment, Structural alignment, Phylogenetic
    analysis)
  • Mon Sept 11
  • Re: Predicting Protein Function
  • Read: Friedberg I, Harder T, Godzik A. (2006)
    JAFA: a protein function annotation meta-server.
    Nucleic Acids Res. 34 (Web Server issue): W379-81.
    PMID: 16845030
  • http://nar.oxfordjournals.org/cgi/content/full/34/suppl_2/W379
  • Visit http://jafa.burnham.org/

3
Why compare sequences?
  • To determine whether two (or more) genes or
    proteins are evolutionarily related to each other
  • To identify structurally or functionally similar
    regions within proteins
  • Other?

4
Sequence Comparison Methods
  • Dot Matrix Analysis
  • Dynamic Programming
  • Word or k-tuple methods (BLAST and FASTA)

5
Dot matrices
[Figure: dot matrix; axis labels: c g g a c a c a c g]
6
Dot matrix comparison
7
Interpretation
  • Regions of similarity appear as diagonal runs of
    dots
  • Reverse diagonals (perpendicular to diagonal)
    indicate inversions
  • Reverse diagonals crossing diagonals (Xs)
    indicate palindromes
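
The interpretation rules above are easier to see on a small matrix. The following is a minimal sketch, not the lecture's own code: the function name `dot_matrix` and the `window` parameter are assumptions made for illustration. It marks a cell with `*` whenever a window of one sequence equals a window of the other.

```python
def dot_matrix(a, b, window=1):
    """Return one string per row; '*' marks a[i:i+window] == b[j:j+window]."""
    rows = []
    for i in range(len(a) - window + 1):
        row = ""
        for j in range(len(b) - window + 1):
            row += "*" if a[i:i + window] == b[j:j + window] else "."
        rows.append(row)
    return rows


# Comparing a sequence with itself: the main diagonal is always a run of dots,
# and repeats show up as additional, shorter diagonals.
for line in dot_matrix("gattacagatta", "gattacagatta", window=2):
    print(line)
```

Raising `window` filters out isolated chance matches, at the cost of hiding very short regions of similarity.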

8
Dynamic Programming
9
Pair-wise sequence alignments
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
  • A: C A T - T C A - C
  • B: C - T C G C A G C

10
Two types of alignment
S = CTGTCGCTGCACG
T = TGCCGTG
Global alignment
CTGTCG-CTGCACG
-TGC-CG-TG----
Local alignment
CTGTCGCTGCACG--
-------TGC-CGTG
Is this a better alignment?
CTGTCG-CTGCACG
-TGCCG--TG----
11
Global alignment Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: α; Mismatch penalty: β; Space penalty: γ
score(A) = αw - βx - γy
w = number of matches, x = number of mismatches, y = number of spaces
12
Global alignment Scoring
Reward for matches: 10; Mismatch penalty: 2; Space penalty: 5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total: 11
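
The column-by-column scoring on this slide can be checked mechanically. This is a small sketch under the slide's parameters (match 10, mismatch penalty 2, space penalty 5); the function name `score_alignment` is an assumption for illustration only.

```python
def score_alignment(row_s, row_t, match=10, mismatch=2, space=5):
    """Score a finished alignment given as two equal-length rows with '-' for spaces."""
    if len(row_s) != len(row_t):
        raise ValueError("alignment rows must have the same length")
    total = 0
    for a, b in zip(row_s, row_t):
        if a == "-" or b == "-":
            total -= space      # a character lined up with a space
        elif a == b:
            total += match      # matching column
        else:
            total -= mismatch   # mismatching column
    return total


print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 11
```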
13
Optimum Alignment
  • The score of an alignment is a measure of its
    quality
  • Optimum alignment problem: Given a pair of
    sequences X and Y, find an alignment (global or
    local) with maximum score

14
Alignment algorithms
  • Global: Needleman-Wunsch
  • Local: Smith-Waterman
  • NW and SW use dynamic programming
  • Variations:
  • Gap penalty functions
  • Scoring matrices

15
Global Alignment Algorithm
16
Theorem. C(i,j) satisfies the following
relationships:
Initial conditions:
Recurrence relation: For 1 ≤ i ≤ n, 1 ≤ j ≤ m
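
The slide's formulas are not present in this transcript. For reference, the standard Needleman-Wunsch relationships are sketched below, assuming C(i,j) denotes the optimum score for aligning the first i characters of S with the first j characters of T. The symbol \sigma for the column score is an assumption introduced here; with the values used later in the lecture it would be 10 for a match, -2 for a mismatch, and -5 for a space.

```latex
% Initial conditions
C(0,0) = 0, \qquad
C(i,0) = C(i-1,0) + \sigma(S_i, -), \qquad
C(0,j) = C(0,j-1) + \sigma(-, T_j)

% Recurrence relation, for 1 \le i \le n,\ 1 \le j \le m
C(i,j) = \max \begin{cases}
  C(i-1,j-1) + \sigma(S_i, T_j) & \text{line up } S_i \text{ with } T_j \\
  C(i-1,j)   + \sigma(S_i, -)   & \text{line up } S_i \text{ with a space} \\
  C(i,j-1)   + \sigma(-, T_j)   & \text{line up } T_j \text{ with a space}
\end{cases}
```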
17
Justification
18
Example
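Case 1: Line up S_i with T_j
S: C A T T C A C      (i-1 and i mark the last two characters)
T: C - T T C A G      (j-1 and j mark the last two characters)
Case 2: Line up S_i with space
S: C A T T C A - C    (i-1 marks the A before the space, i the final C)
T: C - T T C A G -    (j marks the G)
Case 3: Line up T_j with space
S: C A T T C A C -    (i marks the final C)
T: C - T T C A - G    (j-1 marks the A before the space, j the final G)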
19
Computation Procedure
[Figure: the table is filled from C(0,0) through C(i,j) to C(n,m)]
20
      λ    C    T    C    G    C    A    G    C
λ     0   -5  -10  -15  -20  -25  -30  -35  -40
C           10    5
A
T
T
C
A
C
10 for match, -2 for mismatch, -5 for space
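
Filling the rest of this table is mechanical. The sketch below is an illustrative implementation of the global-alignment table fill with the slide's scores, not the lecture's own code; the function name `nw_table` is an assumption.

```python
def nw_table(s, t, match=10, mismatch=-2, space=-5):
    """Fill the global-alignment table C, where C[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i * space          # s[:i] lined up entirely with spaces
    for j in range(1, m + 1):
        C[0][j] = j * space          # t[:j] lined up entirely with spaces
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + sub,   # S_i with T_j
                          C[i - 1][j] + space,     # S_i with a space
                          C[i][j - 1] + space)     # T_j with a space
    return C


C = nw_table("CATTCAC", "CTCGCAGC")
print(C[0])               # top row: 0, -5, -10, ..., -40
print(C[1][1], C[1][2])   # 10 5, the two entries shown on the slide
print(C[-1][-1])          # optimum global score, C(n,m)
```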
21
[Figure: completed table with traceback arrows; columns λ C T C G C A G C, rows λ C A T T C A C]
Traceback can yield both optimum alignments
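
A traceback that follows every tied choice recovers all optimum alignments. The following is a sketch under the same scores (10, -2, -5); the function name `all_optimal_alignments` is an assumption, and enumerating every optimum this way can be exponential for long sequences.

```python
def all_optimal_alignments(s, t, match=10, mismatch=-2, space=-5):
    """Return (optimum score, list of all optimum global alignments as row pairs)."""
    n, m = len(s), len(t)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        C[i][0] = i * space
    for j in range(1, m + 1):
        C[0][j] = j * space

    def sub(i, j):
        return match if s[i - 1] == t[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            C[i][j] = max(C[i - 1][j - 1] + sub(i, j),
                          C[i - 1][j] + space,
                          C[i][j - 1] + space)

    results = []

    def walk(i, j, top, bottom):
        # top/bottom are built back to front and reversed at the end
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + sub(i, j):
            walk(i - 1, j - 1, top + s[i - 1], bottom + t[j - 1])
        if i > 0 and C[i][j] == C[i - 1][j] + space:
            walk(i - 1, j, top + s[i - 1], bottom + "-")
        if j > 0 and C[i][j] == C[i][j - 1] + space:
            walk(i, j - 1, top + "-", bottom + t[j - 1])

    walk(n, m, "", "")
    return C[n][m], results


score, alignments = all_optimal_alignments("CATTCAC", "CTCGCAGC")
print(score)
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For these two sequences the walk branches at exactly one cell, which is why two alignments are reported.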
22
Local Alignment Motivation
  • Ignoring stretches of non-coding DNA
  • Non-coding (or "non-functional") regions may be
    more likely to contain mutations than coding
    regions.
  • Local alignment between two gene sequences is
    likely to be between two exons
  • Locating protein domains
  • Proteins of different kind and of different
    species often exhibit local similarities
  • Local similarities may indicate functional
    subunits

23
Local alignment Example
S = g g t c t g a g
T = a a a c g a
Match: 2; Mismatch and space: -1
Best local alignment (c t g a over c - g a):
g g t c t g a g
a a a c - g a -
Score: 5
24
Local Alignment Algorithm
C[i, j] = score of optimally aligning a
suffix of s[1..i] with a suffix of t[1..j].
  • Initialize top row and leftmost column to zero.

25
[Figure: local alignment table; columns λ C T C G C A G C, rows λ C A T T C A C]
1 for a match, -1 for a mismatch, -5 for a space
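
The local algorithm differs from the global one in three places: the borders start at zero, every cell is clamped at zero, and the traceback starts from the largest cell and stops at a zero. Below is a minimal sketch, not the lecture's own code; the function name `smith_waterman` is an assumption, and the defaults are the scores from the earlier local alignment example (match 2, mismatch and space -1).

```python
def smith_waterman(s, t, match=2, mismatch=-1, space=-1):
    """Return (best local score, aligned piece of s, aligned piece of t)."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # top row and left column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + space,
                          H[i][j - 1] + space)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + space:
            top.append(s[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("ggtctgag", "aaacga"))                 # (5, 'ctga', 'c-ga')
print(smith_waterman("CATTCAC", "CTCGCAGC", 1, -1, -5)[0])  # 2
```

With a space penalty as steep as -5 against a match reward of 1, the best local alignment of the second pair contains no spaces at all.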
26
Some Results
  • Most pairwise sequence alignment problems can be
    solved in O(mn) time.
  • Space requirement can be reduced to O(m + n), while
    keeping the run-time at O(mn) [Myers88].
  • Highly similar sequences can be aligned in O(dn)
    time, where d measures the distance between the
    sequences [Landau86].

27
Affine Gap Penalty Functions
  • Gap penalty = h + gk
  • where
  • k = length of gap
  • h = gap opening penalty
  • g = gap continuation penalty

Can also be solved in O(nm) time using dynamic
programming
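
One common way to reach O(nm) with affine gaps is to keep three tables, one per type of final column (often attributed to Gotoh). The sketch below assumes that approach, which the slide does not name; the function name `affine_global_score` and the default values of h and g are hypothetical. A gap of length k costs h + g*k, as defined above.

```python
NEG_INF = float("-inf")


def affine_global_score(s, t, match=10, mismatch=-2, h=5, g=1):
    """Optimum global score where a gap of length k costs h + g*k."""
    n, m = len(s), len(t)
    # M: last column pairs S_i with T_j; X: S_i with a space; Y: T_j with a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(h + g * i)
    for j in range(1, m + 1):
        Y[0][j] = -(h + g * j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = sub + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - (h + g),   # open a new gap
                          X[i - 1][j] - g,         # continue the current gap
                          Y[i - 1][j] - (h + g))
            Y[i][j] = max(M[i][j - 1] - (h + g),
                          Y[i][j - 1] - g,
                          X[i][j - 1] - (h + g))
    return max(M[n][m], X[n][m], Y[n][m])


# With h = 0 the penalty is linear again, so the score equals the plain global optimum.
print(affine_global_score("CATTCAC", "CTCGCAGC", h=0, g=5))
# One gap of length 2 (cost 3 + 2) beats two gaps of length 1 (cost 2 * 4).
print(affine_global_score("AAAA", "AA", h=3, g=1))
```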
28
Database Searches
  • BLAST

29
BLAST
  • Basic Local Alignment Search Tool
  • Altschul, Gish, Miller, Myers, Lipman, J. Mol.
    Biol. 215 (1990)
  • Altschul, Madden, Schaffer, Zhang, Zhang, Miller,
    Lipman, Nucleic Acids Res. 25 (1997)
  • Main ideas:
  • Increase search speed by finding fewer, but
    better, hot spots during initial screening phase
  • Uses longer word sizes
  • Integrate scoring matrix into first phase
  • Compare with FASTA, which requires exact matches

30
BLAST
31
Hits
  • For each word, evaluate score of match (exact or
    not) according to BLOSUM62 substitution matrix
  • e.g., for PQG exact match with PQG
  • score is 7 + 5 + 6 = 18
  • There are 20^w possible w-length words, but
    considering only those with score > t greatly
    reduces number of matches
  • e.g., there are 20^3 = 8000 possible matches to
    PQG,
  • but only 50 achieve score > t = 13
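
The thresholding idea can be sketched by enumerating every w-length word and keeping those that score above t. To stay self-contained, the example below does not use BLOSUM62: `toy_score` (5 for identity, -1 otherwise) is a hypothetical stand-in, so the counts it produces differ from the figure of 50 quoted on the slide. The function name `neighborhood` is also an assumption.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids


def toy_score(a, b):
    """Hypothetical stand-in for a substitution matrix such as BLOSUM62."""
    return 5 if a == b else -1


def neighborhood(word, t, score=toy_score):
    """All words of the same length whose score against `word` is greater than t."""
    hits = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total > t:
            hits.append("".join(cand))
    return hits


print(len(AMINO_ACIDS) ** 3)            # 8000 candidate 3-letter words
print(len(neighborhood("PQG", t=8)))    # 58 under the toy scores
```

With a real substitution matrix the surviving words are those built from conservative substitutions, which is what lets BLAST keep w large without requiring exact matches.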

32
BLAST Hits
  • A hit is a w-length word in the database that
    aligns with a word from the query sequence with
    score > t
  • BLAST looks for hits instead of exact matches
  • Allows word size to be kept high for speed,
    without sacrificing sensitivity
  • Typically, w = 3-5 for amino acids,
  • w = 11-12 for DNA
  • t is the most critical parameter
  • Higher t: fewer background hits (faster)
  • Lower t: greater ability to detect more distant
    relationships (at cost of increased noise)

33
Extending a hit
  • After locating a hit, BLAST attempts to extend
    the hit in both directions, until the score drops
    more than X below the maximum score yet attained.
  • Extension step typically accounts for > 90% of
    execution time.
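
The stopping rule can be sketched as an ungapped extension that remembers the best score seen so far. This is an illustration only, not BLAST's implementation: the function name `extend_hit`, its parameters, and the simple +1/-1 scoring are assumptions (BLAST scores protein columns with a substitution matrix).

```python
def extend_hit(q, s, qi, si, w, x_drop, match=1, mismatch=-1):
    """Extend a w-length hit at q[qi:], s[si:] in both directions without gaps.

    Returns (score, q_start, q_end) of the best extension found; each direction
    stops once the running score falls more than x_drop below its maximum.
    """
    def pair(a, b):
        return match if a == b else mismatch

    seed = sum(pair(q[qi + k], s[si + k]) for k in range(w))

    def extend(step, qpos, spos):
        best = cur = 0
        best_len = length = 0
        while 0 <= qpos < len(q) and 0 <= spos < len(s):
            cur += pair(q[qpos], s[spos])
            length += 1
            if cur > best:
                best, best_len = cur, length
            elif best - cur > x_drop:
                break
            qpos += step
            spos += step
        return best, best_len

    right_gain, right_len = extend(+1, qi + w, si + w)
    left_gain, left_len = extend(-1, qi - 1, si - 1)
    return seed + left_gain + right_gain, qi - left_len, qi + w + right_len


q = "GGGACGTACGCCC"
s = "TTTACGTACGAAA"
score, start, end = extend_hit(q, s, qi=3, si=3, w=3, x_drop=2)
print(score, q[start:end])   # 7 ACGTACG
```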

34
Extending a hit
35
Improvement: 2-hit method
  • Do extensions only when there are two hits on the
    same diagonal within some distance A of each
    other (e.g., A = 40)
  • Reduces sensitivity (ability to detect distantly
    related sequences)
  • To compensate, use lower t value (e.g., 11 rather
    than 13)
  • Because we only extend when there are two nearby
    hits, many fewer regions are extended
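
Two hits lie on the same diagonal when the difference between their database and query positions is equal. The sketch below groups hits that way and reports neighbouring pairs within distance A; it is a simplification (it does not check that the two words are non-overlapping), and the function name `two_hit_seeds` and the `(query_pos, subject_pos)` tuple layout are assumptions.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_dist=40):
    """Return pairs of neighbouring hits on one diagonal at most max_dist apart.

    `hits` is an iterable of (query_pos, subject_pos) word-start positions.
    """
    by_diag = defaultdict(list)
    for qpos, spos in hits:
        by_diag[spos - qpos].append((qpos, spos))   # same difference = same diagonal

    pairs = []
    for diag in sorted(by_diag):
        on_diag = sorted(by_diag[diag])
        for first, second in zip(on_diag, on_diag[1:]):
            if second[0] - first[0] <= max_dist:
                pairs.append((first, second))
    return pairs


hits = [(0, 5), (10, 15), (3, 30), (100, 105), (60, 87), (20, 47)]
for pair in two_hit_seeds(hits, max_dist=40):
    print(pair)
```

Only the reported pairs would be passed on to the extension step; the isolated hit at (100, 105) is never extended.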

36
BLAST Terminology
  • Segment pair: equal-length substrings of
    sequences S1 and S2
  • Locally maximal segment pair: segment pair whose
    alignment score cannot be improved by extending
    or shortening it
  • Maximum segment pair (MSP): segment pair with
    maximum score over all segment pairs in the
    sequences S1 and S2
  • High-scoring segment pair (HSP): a segment pair
    with score higher than some cutoff score, s.
  • w is the word-length parameter; t is the threshold
    parameter

37
Gapped BLAST
  • Allows local alignments with indels (similar to
    FASTA)
  • Local alignments from different diagonals are
    merged into a single alignment: a local alignment
    followed by some indels followed by a second local
    alignment, etc.
  • Equivalent to a path through the dynamic
    programming matrix composed of alternating
    diagonal sections and paths connecting them

38
Gapped BLAST
  • Original BLAST implicitly handled gaps by finding
    several distinct HSPs and calculating a
    statistical assessment of the combined result
  • Two or more HSPs each below the cutoff value
    might in combination rise to statistical
    significance
  • Gapped BLAST: extend hits by allowing gaps when
    hits are promising (score exceeds sg)
  • Advantage: we can afford to miss some HSPs as
    long as at least one is found
  • Use dynamic programming, starting from center of
    each high-scoring region if s > sg
  • sg is chosen such that gapped alignment is
    triggered in about 1/50 of the sequences compared

39
PSI-BLAST
  • Position-Specific Iterated BLAST
  • Generates a multiple alignment from statistically
    significant alignments produced by BLAST
  • Produces a Position-Specific Scoring Matrix
    (PSSM)
  • Can search the database using the PSSM
  • Match sequences to profile
  • Generate new profiles
  • Repeat (iteration)
  • Search gradually extends to increasingly
    divergent sequences

40
Flavors of BLAST
  • BLASTP - protein query against protein DB
  • BLASTN - DNA/RNA query against DNA (GenBank)
  • BLASTX - 6-frame translated DNA query against
    protein DB
  • TBLASTN - protein query against 6-frame DNA
    translation
  • TBLASTX - 6-frame DNA query to 6-frame DNA
    translation
  • PSI-BLAST - protein "profile" query against
    protein DB
  • PHI-BLAST - protein pattern against protein DB

41
Questions?
  • What are substitution matrices?
  • 2 major types: PAM & BLOSUM
  • PAM = Point Accepted Mutation: relies on
    "evolutionary model" based on observed
    differences in closely related proteins
  • Model includes defined rate for each type of
    sequence change
  • Suffix number (n) reflects amount of "time"
    passed: rate of expected mutation if n% of amino
    acids had changed
  • PAM1 - for less divergent sequences (shorter
    time)
  • PAM250 - for more divergent sequences (longer
    time)
  • BLOSUM = BLOck SUbstitution Matrix: based on
    % aa substitutions observed in evolutionarily
    divergent proteins
  • Doesn't rely on a specific evolutionary model
  • Suffix number (n) reflects expected similarity:
    average % aa identity in the MSA from which the
    matrix was generated
  • BLOSUM45 - for more divergent sequences
  • BLOSUM60 - for less divergent sequences

42
Questions?
  • What does 6-frame translation mean?
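
As one possible answer in code: a DNA sequence can be read in three reading frames on the given strand and three more on its reverse complement. The sketch below assumes the standard genetic code and ignores ambiguous bases; the function names `six_frames`, `translate` and `reverse_complement` are chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return translations for frames +1, +2, +3 (forward) and -1, -2, -3 (reverse)."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

This is why BLASTX, TBLASTN and TBLASTX each compare up to six protein translations per nucleotide sequence.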