Alignment of Pairs of Sequence - PowerPoint PPT Presentation

1
Alignment of Pairs of Sequence
Chapter-3
Luonan Chen
2
Synthesizing DNA Fragments by PCR (polymerase chain reaction)
(Figure labels: DNA, primer, heat, anneal, ddNTP A, ddNTPs)
3
Scanned Data from electrophoresed fragments
4
Shotgun Sequencing for DNA
Repetitive sequences ?
A large DNA molecule
5
Sequence Analysis
  • Homology search: similarity search, treated as
    combinatorial optimization (sequence alignment)
  • Motif search: machine learning from a motif library,
    collecting common properties and structures
  • (Motif, domain, fingerprint)

Sequence alignment methods:
  • Multiple sequence alignment
  • Alignment of pairs of sequences
6
Definition of Sequence Alignment
  • Sequence alignment is the procedure of comparing
    two or more sequences by searching for a series
    of individual characters or character patterns
    that are in the same order in the sequences.
    (bases or amino acids)
  • Global alignment:
      LGPSSKQTGKGS-SRIWDN
      LN-ITKSAGKGAIMRLFDA
  • Local alignment:
      -------TGKG--------
      -------AGKG--------
7
Methods for Searching Similarity
  • Dot matrix analysis (intuitive)
  • DP algorithm (exact)
  • Word or k-tuple methods (FASTA, BLAST) (heuristic)
  • Motivation: homology, motif, domain,
    classification, structure/function prediction,
    phylogenetic tree, interaction
  • Similarity is a measure of the matching
    characters in an alignment
  • Homology is a statement of common evolutionary
    origin. (genes are descended from a common
    ancestor)

8
Global vs. Local Alignments
  • Global alignment algorithms start at the
    beginning of two sequences and add gaps to each
    until the end of one is reached.
  • Local alignment algorithms find the region (or
    regions) of highest similarity between two
    sequences and build the alignment outward from
    there.

9
Dot Matrix Analysis
  • 1) two sequences on vertical and horizontal axes
    of graph
  • 2) put dots wherever there is a match
  • 3) diagonal line is region of identity (local
    alignment)
  • 4) apply a window filter: put a dot when at least n of
    m positions match
  • (window size m and stringency n; m = 15, n = 10
    for DNA, m = n = 1 for protein)
  • Applications: similarity between two different
    sequences,
  • direct and inverted repeats for a sequence
    compared with itself.
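The windowed dot matrix described on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `dot_matrix` and `render` and the text rendering are assumptions made for the example, not part of the presentation.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    positions starting at seq_a[i] and seq_b[j] are identical.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, ch in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "ACGTA", "ACTTA"
    # Window of 4 bases with stringency 3, as on the filtered example slide.
    print(render(a, b, dot_matrix(a, b, window=4, stringency=3)))
```

Comparing a sequence with itself (`dot_matrix(s, s, ...)`) gives the main diagonal plus off-diagonal runs wherever direct repeats occur.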

10
Simple Dot Matrix Analysis
11
Dot matrix filtered with 4 base window and 3
stringency
12
Dot matrix analysis for similarity
The amino acid sequences of the phage λ cI
(horizontal sequence) and phage P22 c2 (vertical
sequence) repressors. The window size and
stringency are both 1.
13
Sequence Repeats by Dot Matrix
Polymorphic, SNP
diathesis
14
Scoring Similarity
In practice, the mutations A↔G and T↔C (transitions) are more likely
than A↔T and G↔C (transversions)
  • 1) Can only score aligned sequences
  • 2) DNA is usually scored as identical or not
  • 3) modified scoring for gaps: single vs.
    multiple base gaps (gap extension, affine
    penalty)
  • 4) Amino acids have varying degrees of similarity
  • a. number of mutations needed to convert one into another
  • b. chemical similarity
  • c. observed mutation frequencies
  • 5) Scoring systems: PAM matrices are based on an
    evolutionary model of protein (or DNA) change
    (mutations), derived from a small data set
  • BLOSUM matrices are designed to identify members
    of the same family, derived from a large data set.
  • Log-odds score: a score $S_{nm}$ means the aligned pair is
    $2^{S_{nm}/2}$-fold more likely than expected by chance

15
The PAM 250 scoring matrix
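PAM = percent accepted mutation. PAM250 corresponds to 250%, i.e. 2.5
changes per position ($2.5 \times 10^{7}$ years of evolutionary distance).
M = $(p_{ij})$: transition matrix of PAM1.
Score: $\log \frac{M_{ij}}{\mathrm{prob}_j} = \log \frac{f_{ij}\, m_i}{f_i\, \mathrm{prob}_j} = \log \frac{f_{ij}}{100\, f\, \mathrm{prob}_i\, \mathrm{prob}_j}$
PAMn = $M^n$; f = number of mutations.
The shorter and more closely related the sequences, the smaller n.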
16
Example of Scoring a Sequence Alignment
  • DNA (gap penalty -2):
      A  T  G  G  -  T  A
      A  A  C  G  T  T  A
      score: 2 1 1 2 -2 2 2; Score = 2×4 + 2×1 - 2 = 8
  • (Scores are set artificially. Transitions between
    A and G or between C and T are more probable!)
  • Protein (gap opening penalty -10, gap extension penalty -8):
      V  D  S  -  -  C  Y
      V  E  S  L  D  C  Y
      score: 4 2 4 -10 -8 9 7; Score = 26 - 18 = 8
  • (Scores are based on the PAM250 matrix)
  • Results depend on the choice of a scoring
    system.
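The two worked examples on this slide can be reproduced with a small column-by-column scorer. This is a sketch under stated assumptions: `score_alignment`, `dna_score`, and `SLIDE_PROTEIN_SCORES` are names invented for the example, the pair scores are simply the values printed on the slide, and gap penalties are passed as positive numbers that get subtracted.

```python
def score_alignment(row_a, row_b, pair_score, gap_open, gap_extend):
    """Sum column scores of two equal-length aligned rows ('-' marks a gap).

    The first column of a gap run costs `gap_open`; each following column
    of the same run costs `gap_extend`.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    prev_gap_row = None  # which row held the gap in the previous column
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            total -= gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            total += pair_score(x, y)
            prev_gap_row = None
    return total


def dna_score(x, y):
    """Artificial DNA scores from the slide: 2 for a match, 1 otherwise."""
    return 2 if x == y else 1


# Pair scores exactly as printed on the slide (attributed there to PAM250).
SLIDE_PROTEIN_SCORES = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4, ("C", "C"): 9, ("Y", "Y"): 7,
}


def protein_score(x, y):
    return SLIDE_PROTEIN_SCORES[(x, y)]


if __name__ == "__main__":
    print(score_alignment("ATGG-TA", "AACGTTA", dna_score, 2, 2))
    print(score_alignment("VDS--CY", "VESLDCY", protein_score, 10, 8))
```

Changing `gap_open`, `gap_extend`, or the pair scores changes the totals, which is the point of the slide's last bullet.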

17
DP Algorithm for Global Alignment(exact,
handling gap)
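  • Sequences: $a = a_1 a_2 \dots a_m$, $b = b_1 b_2 \dots b_n$
  • Score: $S_{i,j} = S(a_1 a_2 \dots a_i,\ b_1 b_2 \dots b_j)$; $s(a_i, b_j)$ from PAM
  • $w_x$, $w_y$: the penalties for a gap of length x and y
    in a and b
  • $S_{i,j} = \max\{\, S_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}(S_{i-x,j} - w_x),\ \max_{y \ge 1}(S_{i,j-y} - w_y) \,\}$
  • The alignment is read from position (m, n) by tracing
    back to (1, 1)
  • Computational complexity: $O(nm)$ table cells, $O(nm^2 + n^2 m)$
    steps in total (for $n < m$)

Can the computational complexity be reduced to $O(nm)$?
Yes, for example with linear or affine gap penalties.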
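A minimal sketch of the global DP follows, assuming a linear gap penalty (a gap of length x costs `gap * x`) so that only the three neighbouring cells need to be inspected and the run time is O(nm). The function name and the simple match/mismatch scores (used in place of a PAM lookup) are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). `gap` is a positive cost
    charged per gap position.
    """
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = -gap * i          # a prefix aligned against gaps only
    for j in range(1, n + 1):
        S[0][j] = -gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,   # align a_i with b_j
                          S[i - 1][j] - gap,     # a_i against a gap
                          S[i][j - 1] - gap)     # b_j against a gap

    # Trace back from the last cell to the origin.
    row_a, row_b = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:
                row_a.append(a[i - 1]); row_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

With general length-dependent penalties $w_x$, each cell would instead scan a whole row and column, which is where the extra factor in the complexity comes from.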
18
Dynamic Programming
  • Dynamic programming is a very general programming
    technique.
  • It is applicable when a large search space can be
    structured into a succession of stages, such
    that:
  • the initial stage contains trivial solutions to
    sub-problems
  • each partial solution in a later stage can be
    calculated by recurring on a fixed number of partial
    solutions in an earlier stage
  • the final stage contains the overall solution

19
Global Alignment by Needleman-Wunsch Algorithm
20
(No Transcript)
21
(No Transcript)
22
(No Transcript)
23
DP Algorithm for Local Alignment(exact, handling
gap)
  • Sequences: $a = a_1 a_2 \dots a_m$, $b = b_1 b_2 \dots b_n$
  • Score: $H_{i,j} = H(a_1 a_2 \dots a_i,\ b_1 b_2 \dots b_j)$; $s(a_i, b_j)$ from PAM
  • $w_x$, $w_y$: the penalties for a gap of length x and y
    in a and b
  • $H_{i,j} = \max\{\, H_{i-1,j-1} + s(a_i, b_j),\ \max_{x \ge 1}(H_{i-x,j} - w_x),\ \max_{y \ge 1}(H_{i,j-y} - w_y),\ 0 \,\}$
  • The alignment starts from the highest-scoring position and
    is traced back to a zero
  • Mismatches need negative scores; $H_{i,j} \ge 0$;
    the penalty for initial and end gaps is 0
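The local recurrence differs from the global one only by the extra 0 and by where the traceback starts and stops. The sketch below assumes a linear gap penalty and simple match/mismatch scores instead of a PAM matrix; the function name and default values are illustrative choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=2):
    """Local alignment with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b) for one best-scoring
    local alignment.
    """
    m, n = len(a), len(b)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            row_a.append(a[i - 1]); row_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            row_a.append(a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


if __name__ == "__main__":
    print(smith_waterman("TTTTACGTTTTT", "GGGACGTGGG"))
```

Because every cell is clamped at 0, unrelated flanking regions cannot drag the score of a good internal match down, which is why mismatches must score negatively for the method to be meaningful.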

24
Local Alignment by the Smith-Waterman Algorithm
25
(No Transcript)
26
Improvement of Algorithm
  • Computational complexity and storage: $O(mn)$
  • Approximate algorithms, parallel computation
  • Substitution matrix (PAM, BLOSUM)
  • (PAM mutation matrix → substitution matrix)
  • Gap penalties
  • Bayes alignment
  • Assessing the significance of a sequence alignment
    (comparing S with the scores R of random
    sequences)
  • $P(S > R) = 1 - e^{-K m n e^{-\lambda R}}$: the Gumbel extreme value
    distribution, not the normal distribution
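The extreme-value formula on this slide is easy to evaluate numerically. In the sketch below, `gumbel_p_value` is a name chosen for the example, and the values of K and lambda in the demo are hypothetical placeholders; in practice both parameters depend on the scoring system and have to be estimated or looked up.

```python
import math


def gumbel_p_value(score, m, n, K, lam):
    """Evaluate 1 - exp(-K * m * n * exp(-lam * score)).

    m and n are the sequence lengths; K and lam are the parameters of the
    extreme value distribution for the scoring system in use.
    """
    expected_hits = K * m * n * math.exp(-lam * score)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the shape.
    for s in (20, 40, 60, 80):
        print(s, gumbel_p_value(s, m=250, n=250, K=0.1, lam=0.3))
```

The probability falls off exponentially in the score, much more slowly in the tail than a normal distribution fitted to the same random scores would suggest.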

27
What program to use for searching?
  • 1) BLAST is fastest and easily accessed on the
    Web
  • limited sets of databases
  • nice translation tools (BLASTX, TBLASTN)
  • 2) FASTA works best in GCG
  • integrated with GCG
  • precise choice of databases
  • more sensitive for DNA-DNA comparisons
  • FASTX and TFASTX can find similarities in
    sequences with frameshifts
  • 3) Smith-Waterman is slower, but more sensitive
  • known as a rigorous or exhaustive search
  • SSEARCH in GCG and standalone FASTA

28
FASTA
  • 1) Derived from logic of the dot plot
  • compute best diagonals from all frames of
    alignment
  • 2) Word method looks for exact matches between
    words in query and test sequence
  • hash tables (fast computer technique)
  • DNA words are usually 6 bases
  • protein words are 1 or 2 amino acids
  • only searches for diagonals in regions of word
    matches, which makes searching faster

29
Query and Hash Table
Query: A T G G G T C
Test sequence: T G G A T C G A
2-tuple (word length 2) hash table
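The word lookup behind this slide can be sketched with a dictionary used as the hash table. The function names `build_word_table` and `diagonal_hits` are assumptions for the example; the offset (query position minus test position) identifies the dot-plot diagonal on which a word match lies.

```python
from collections import Counter, defaultdict


def build_word_table(seq, k):
    """Map every k-letter word in `seq` to the list of its start positions."""
    table = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        table[seq[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, target, k):
    """Count exact word matches per diagonal (query pos - target pos)."""
    table = build_word_table(query, k)
    hits = Counter()
    for t_pos in range(len(target) - k + 1):
        word = target[t_pos:t_pos + k]
        for q_pos in table.get(word, ()):
            hits[q_pos - t_pos] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("ATGGGTC", "TGGATCGA", k=2)
    # Diagonals sorted by number of word hits, best first.
    print(hits.most_common())
```

For the two sequences on the slide, the diagonal with offset 1 collects the most word hits (TG, GG, and TC), so it would be the first candidate region for a more careful alignment.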
30
FASTA Algorithm
31
Makes Longest Diagonal
  • 3) after all diagonals are found, tries to join
    diagonals by adding gaps (diagonals with close
    offset values are connected by a restricted DP
    that allows gaps)
  • 4) computes alignments in regions of best
    diagonals

32
FASTA Alignments
33
FASTA on the Web
  • Many websites offer FASTA searches
  • Various databases and various other services
  • Be sure to use FASTA 3
  • Each server has its limits
  • Be aware that you are depending on the kindness
    of strangers.

34
BLAST
  • Uses word matching like FASTA
  • Similarity matching of words (3 aas, 11 bases)
  • does not require identical words.
  • If no words are similar, then no alignment
  • won't find matches for very short sequences
  • Does not handle gaps.
  • Lower sensitivity but faster than FASTA (10
    times)
  • (good for motifs and similar features, which have
    high consensus without gaps)
  • Uses a finite automaton for pattern recognition
  • Newer gapped BLAST and PSI-BLAST are better

35
BLAST Algorithm
Add similar words besides those in the query.
36
Extend hits one base at a time;
the extended hits are called HSPs
37
HSPs are Aligned Regions
  • The results of the word matching and attempts to
    extend the alignment are segments called HSPs
    (High-scoring Segment Pairs)
  • BLAST often produces several short HSPs rather
    than a single aligned region
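Ungapped extension of a word hit into an HSP can be sketched as follows. This is a simplified illustration, not BLAST's actual implementation: it scores identities only (no substitution matrix), and the stopping rule controlled by the hypothetical `x_drop` parameter simply abandons an extension once the running score falls more than `x_drop` below the best score seen so far.

```python
def _pair(x, y, match, mismatch):
    return match if x == y else mismatch


def extend_hit(a, b, i, j, k, match=1, mismatch=-1, x_drop=2):
    """Extend the word hit a[i:i+k] / b[j:j+k] in both directions, no gaps.

    Returns (score, a_start, b_start, length) of the resulting segment pair.
    """
    score = sum(_pair(a[i + t], b[j + t], match, mismatch) for t in range(k))

    # Extend to the right, remembering the best-scoring stopping point.
    best_gain, gain, right = 0, 0, 0
    t = k
    while i + t < len(a) and j + t < len(b):
        gain += _pair(a[i + t], b[j + t], match, mismatch)
        t += 1
        if gain > best_gain:
            best_gain, right = gain, t - k
        elif best_gain - gain > x_drop:
            break
    score += best_gain

    # Extend to the left in the same way.
    best_gain, gain, left = 0, 0, 0
    t = 1
    while i - t >= 0 and j - t >= 0:
        gain += _pair(a[i - t], b[j - t], match, mismatch)
        if gain > best_gain:
            best_gain, left = gain, t
        elif best_gain - gain > x_drop:
            break
        t += 1
    score += best_gain

    return score, i - left, j - left, left + k + right


if __name__ == "__main__":
    a, b = "CCATGGGTCAA", "TTATGGGTCGG"
    score, a_start, b_start, length = extend_hit(a, b, 4, 4, 3)
    print(score, a[a_start:a_start + length], b[b_start:b_start + length])
```

Because each hit is extended independently and without gaps, two well-matching regions separated by an insertion come out as two separate HSPs, which matches the observation in the last bullet.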

38
Gapped Blast and PSI-Blast
  • Ungapped extension for finding HSP
  • Using window (e.g. 11), let HSP with highest
    scores be a seed
  • Gapped extension for the seed by DP.
  • PSI-Blast can be used for multiple sequence
    alignment.

39
Genome Alignment
  • How to match a protein or mRNA to genomic
    sequence?
  • There is a Genome BLAST server at NCBI
  • Each of the Genome websites has a similar search
    function
  • What about introns?
  • An intron is penalized as a gap, or each exon is
    treated as a separate alignment with its own
    e-score
  • Need a search algorithm that looks for consensus
    intron splice sites and points in the alignment
    where similarity drops off.