CAP5510 - PowerPoint PPT Presentation

About This Presentation
Title:

CAP5510

Slides: 41
Provided by: tamerk
Learn more at: http://www.cise.ufl.edu

Transcript and Presenter's Notes

1
CAP5510 Bioinformatics: Database Searches for
Biological Sequences, or Imperfect Alignments
  • Tamer Kahveci
  • CISE Department
  • University of Florida

2
Goals
  • Understand how major heuristic methods for
    sequence comparison work
  • FASTA
  • BLAST
  • Understand how search results are evaluated

3
What is Database Search?
  • Find a particular (usually short) sequence in a
    database of sequences (or one huge sequence).
  • The problem is identical to local sequence alignment,
    but on a much larger scale.
  • We must also have some idea of the significance
    of a database hit.
  • Databases always return some kind of hit. How
    much attention should be paid to the result?
  • A similar problem is the global alignment of two
    large sequences.
  • General idea: good alignments contain high-scoring
    regions.

4
Imperfect Alignment
  • What is an imperfect alignment?
  • Why imperfect alignment?
  • The result may not be optimal.
  • Finding the optimal alignment is usually too costly in
    terms of time and memory.

5
Database Search Methods
  • Hash table based methods
      • FASTA family
          • FASTP, FASTA, TFASTA, FASTAX, FASTAY
      • BLAST family
          • BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ,
            MegaBLAST, PsiBLAST, PhiBLAST
      • Others
          • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
  • Suffix tree based methods
      • Mummer, AVID, Reputer, MGA, QUASAR

6
History of sequence searching
  • 1970: Needleman-Wunsch (NW)
  • 1980: Smith-Waterman (SW)
  • 1985: FASTA
  • 1990: BLAST

7
Hash Table
8
Hash Table
  • K-gram: subsequence of length K
  • A^K entries
  • A is the alphabet size
  • Linear time construction
  • Constant lookup time
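
The k-gram table described above can be sketched in a few lines of Python. This is a minimal illustration, not the data structure any particular tool uses: the function name `build_kgram_index` is made up for this example, and a dictionary stores only the k-grams that actually occur rather than reserving all A^K entries.

```python
from collections import defaultdict

def build_kgram_index(sequence, k):
    """Map every k-gram of `sequence` to the 0-based positions where it starts."""
    index = defaultdict(list)
    # One pass over the sequence: linear-time construction.
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

index = build_kgram_index("TGAGTGCGA", 2)
print(index["TG"])  # [0, 4]
print(index["GA"])  # [1, 7]
```

Looking up a k-gram is a single dictionary access, which matches the constant lookup time listed on the slide (on average, for a hash-based dictionary).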

9
FASTP
  • Lipman & Pearson, 1985

10
FASTP
  • Three phase algorithm
  • Find short good matches using k-grams
  • k = 1 or 2
  • Find start and end positions for good matches
  • Use DP to align good matches

11
FASTP Phase 1 (1)
position    1  2  3  4  5  6  7  8  9 10 11
protein 1   n  c  s  p  t  a  .  .  .  .  .
protein 2   .  .  .  .  .  a  c  s  p  r  k

amino acid   position in protein A   position in protein B   offset (pos A - pos B)
-----------------------------------------------------------------------------------
a            6                       6                         0
c            2                       7                        -5
k            -                       11
n            1                       -
p            4                       9                        -5
r            -                       10
s            3                       8                        -5
t            5                       -
-----------------------------------------------------------------------------------

Note the common offset for the 3 amino acids c, s and p.
A possible alignment can be quickly found:

protein 1   n c s p t a
protein 2   a c s p r k
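
The offset table can be reproduced with a short Python sketch. It assumes k = 1 and 1-based positions as on the slide; the function name `offset_counts` and the `start_a`/`start_b` parameters are invented for this illustration and are not part of FASTP itself.

```python
from collections import Counter, defaultdict

def offset_counts(seq_a, seq_b, k=1, start_a=1, start_b=1):
    """Count k-gram matches per offset (pos A - pos B).

    start_a and start_b give the position of the first letter of each
    sequence, so that the numbering can follow the slide.
    """
    positions_a = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        positions_a[seq_a[i:i + k]].append(start_a + i)

    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for pos_a in positions_a.get(seq_b[j:j + k], []):
            counts[pos_a - (start_b + j)] += 1
    return counts

# protein 1 occupies positions 1-6, protein 2 occupies positions 6-11.
counts = offset_counts("ncspta", "acsprk", start_b=6)
print(counts.most_common(1))  # [(-5, 3)]
```

The offset with the most matches (-5, shared by c, s and p) is the diagonal that suggests the alignment shown above.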
12
FASTP Phase 1 (2)
  • Similar to dot plot
  • Offsets range from 1-m to n-1
  • Each offset is scored as (matches - mismatches)
  • Diagonals (offsets) with large score show local
    similarities
  • How does it depend on k?

13
FASTP Phase 2
  • 5 best diagonal runs are found
  • Rescore these 5 regions using PAM250.
  • Initial score
  • Indels are not considered yet

14
FASTP Phase 3
  • Sort the aligned regions in descending score
  • Optimize these alignments using Needleman-Wunsch
  • Report the results

15
FASTP - Discussion
  • Results are not optimal. Why?
  • How does performance compare to Smith-Waterman?
  • What is the impact of k?
  • How does this idea work for DNA?
  • k = 4 or 6 for DNA

16
FASTA: Improvement Over FASTP
  • Pearson, 1995

17
FASTA (1)
  • Phase 2: choose the 10 best diagonal runs instead of 5

18
FASTA (2)
  • Phase 2.5
  • Eliminate diagonals that score less than some
    given threshold.
  • Combine matches to find longer matches. Each join
    incurs a join penalty, similar to a gap penalty.

19
FASTA Variations
  • TFASTAX and TFASTAY: query protein against a DNA
    library in all reading frames
  • FASTAX, FASTAY: DNA query in all reading frames
    against a protein database

20
BLAST
  • Altschul, Gish, Miller, Myers, Lipman, 1990

21
BLAST (or BLASTP)
  • BLAST: Basic Local Alignment Search Tool
  • An approximation of Smith-Waterman
  • Designed for database searches
  • Short query sequence against long database
    sequence or a database of many sequences
  • Sacrifices search sensitivity for speed

22
BLAST Algorithm (1)
  • Eliminate low complexity regions from the query
    sequence.
  • Replace them with X (protein) or N (DNA)
  • Hash table on query sequence.
  • k = 3 for proteins

23
BLAST Algorithm (2)
  • For each k-gram find all k-grams that align with
    score at least cutoff T using BLOSUM62
  • 20^k candidates
  • 50 on average per k-gram
  • 50n for the entire query
  • Build hash table

Example (T = 13):
Query: PQGMCGPFILGTYC
Query k-grams shown: PQG, QGM
Neighbors of PQG and their scores: PQG 18, PEG 15, PRG 14, PSG 13, PQA 12
(PQA scores below T, so it is not kept.)
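
The neighborhood step can be sketched as follows. To keep the example self-contained, it uses a hand-copied score table for only seven residues (A, R, Q, E, G, P, S); the values are intended to match BLOSUM62 but should be checked against the real matrix, and the real search runs over all 20 amino acids, so it finds more neighbors than this reduced alphabet does. The names `pair_score` and `neighborhood` are made up for this illustration.

```python
from itertools import product

# Assumption: a seven-residue excerpt, intended to match BLOSUM62.
RESIDUES = "ARQEGPS"
PAIR_SCORES = {
    "AA": 4, "AR": -1, "AQ": -1, "AE": -1, "AG": 0, "AP": -1, "AS": 1,
    "RR": 5, "RQ": 1, "RE": 0, "RG": -2, "RP": -2, "RS": -1,
    "QQ": 5, "QE": 2, "QG": -2, "QP": -1, "QS": 0,
    "EE": 5, "EG": -2, "EP": -1, "ES": 0,
    "GG": 6, "GP": -2, "GS": 0,
    "PP": 7, "PS": -1,
    "SS": 4,
}

def pair_score(a, b):
    """Symmetric lookup in the excerpt above."""
    key = a + b if a + b in PAIR_SCORES else b + a
    return PAIR_SCORES[key]

def neighborhood(word, threshold):
    """Return {neighbor: score} for all words scoring at least `threshold`."""
    result = {}
    for candidate in product(RESIDUES, repeat=len(word)):
        total = sum(pair_score(x, y) for x, y in zip(word, candidate))
        if total >= threshold:
            result["".join(candidate)] = total
    return result

print(neighborhood("PQG", 13))
# {'PRG': 14, 'PQG': 18, 'PEG': 15, 'PSG': 13}
```

Enumerating every candidate word costs |alphabet|^k per query k-gram, which is why the slide counts 20^k candidates for the full amino-acid alphabet.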
24
BLAST Algorithm (3)
  • Sequentially scan the database and locate each
    k-gram in the hash table
  • Each match is a seed for an ungapped alignment.

25
BLAST Algorithm (4)
  • HSP (High Scoring Pair): a match between a query
    word and the database
  • Find a hit: two non-overlapping HSPs on a
    diagonal within distance A
  • Extend the hit until the score falls more than X
    below the best score found so far
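
A minimal sketch of ungapped seed extension is given below. It assumes the common "X-drop" reading of the extension cutoff: stop once the running score has fallen more than X below the best score seen so far. The match/mismatch scores (+2/-1) and the function names are illustrative assumptions; BLASTP would score residue pairs with a substitution matrix instead.

```python
def pair_score(a, b, match=2, mismatch=-1):
    """Toy scoring scheme (assumption): +2 for a match, -1 for a mismatch."""
    return match if a == b else mismatch

def extend_one_way(query, target, qi, ti, step, x_drop):
    """Walk along the diagonal from (qi, ti); return (best gain, its length)."""
    best = running = 0
    best_len = length = 0
    while 0 <= qi < len(query) and 0 <= ti < len(target):
        running += pair_score(query[qi], target[ti])
        length += 1
        if running > best:
            best, best_len = running, length
        elif best - running > x_drop:
            break  # score dropped too far below the best: stop extending
        qi += step
        ti += step
    return best, best_len

def extend_seed(query, target, q_pos, t_pos, k, x_drop):
    """Extend a length-k seed in both directions without gaps.

    Returns (score, query_start, query_end) with query_end exclusive.
    """
    seed = sum(pair_score(query[q_pos + i], target[t_pos + i]) for i in range(k))
    right, right_len = extend_one_way(query, target, q_pos + k, t_pos + k, +1, x_drop)
    left, left_len = extend_one_way(query, target, q_pos - 1, t_pos - 1, -1, x_drop)
    return seed + left + right, q_pos - left_len, q_pos + k + right_len

score, start, end = extend_seed("GGACGTCC", "TTACGTAA", 3, 3, 2, x_drop=2)
print(score, "GGACGTCC"[start:end])  # 8 ACGT
```

A larger `x_drop` lets the extension push through longer runs of mismatches before giving up, at the cost of more work per seed.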

26
BLAST Algorithm (5)
  • Keep only the extended matches that have a score
    at least S.
  • Determine the statistical significance of the
    result

27
What is Statistical Significance?
  • Two one-on-one games, two scores.
  • Which result is more significant?
  • Expected: may be a random result.
  • Unexpected: significant, may carry real
    meaning.

[Figure: scores of the two games, 13 15 and 13 15]
28
Statistical Significance
  • E-value: the expected number of matches with
    score at least S
  • $E = K m n e^{-\lambda S}$
  • m, n: sequence lengths
  • S: alignment score
  • K, $\lambda$: normalization parameters
  • P-value: the probability of having at least one
    match with score at least S
  • $P = 1 - e^{-E}$
  • The smaller these values are, the more
    significant the result
  • http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/glossary2.html
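
The two formulas translate directly into code. The sketch below is only a numerical illustration: the values chosen for K and lambda are placeholders, not parameters of any real scoring system, and the function names are made up for this example.

```python
import math

def e_value(K, m, n, lam, score):
    """Expected number of matches with score at least `score`: K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * score)

def p_value(e):
    """Probability of at least one such match: 1 - exp(-E)."""
    # -expm1(-e) equals 1 - exp(-e) but stays accurate for very small e.
    return -math.expm1(-e)

# Placeholder parameters (assumptions for illustration only).
K, lam = 0.1, 0.3
m, n = 250, 1_000_000  # query length, database length

for s in (40, 60, 80):
    e = e_value(K, m, n, lam, s)
    print(s, e, p_value(e))
```

For small E the two values are nearly equal, since 1 - e^{-E} is approximately E; they diverge only when E approaches or exceeds 1.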

29
BLAST - Analysis
  • K (k-gram length)
  • Lower: more sensitive, but slower.
  • T (neighbor cutoff)
  • Lower: finds distant neighbors, but introduces noise.
  • X (extension cutoff)
  • Higher: lower chance of getting stuck in a local
    minimum, but slower.

30
Sample Query
  • http://www.ncbi.nlm.nih.gov/BLAST/

Dhal_ecoli
I D R A M S A A R G V F E R G D W S L S S P A K
R K A V L N K L A D L M E A H A E E L A L L E T L
D T G K P I R H S L R D D I P G A A R A I R W Y A
E A I D K V Y G E V A T T S S H E L A M I V R E P
V G V I A A I V P W N F P L L L T C W K L G P A L
A A G N S V I L K P S E K S P L S A I R L A G L A
K E A G L P D G V L N V V T G F G H E A G Q A L S
R H N D I D A I A F T G S T R T G K Q L L K D A G
D S N M K R V W L E A G G K S A N I V F A D C P D
L Q Q A A S A T A A G I F Y N Q G Q V C I A G T R
L L L E E S I A D E F L A L L K Q Q A Q N W Q P G
H P L D P A T T M G T L I D C A H A D S V H S F I
R E G E S K G Q L L L D G R N A G L A A A I G P T
I F V D V D P N A S L S R E E I F G P V L V V T R
F T S E E Q A L Q L A N D S Q Y G L G A A V W T R
D L S R A H R M S R R L K A G S V F V N N Y N D G
D M T V P F G G Y K Q S G N G R D K S L H A L E K
F T E L K T I W I
31
BLASTN
  • BLAST for nucleic acids
  • k = 11
  • Exact match instead of neighborhood search.

32
BLAST Variations
Program   Query                       Target                      Type
BLASTP    Protein                     Protein                     Gapped
BLASTN    Nucleic acid                Nucleic acid                Gapped
BLASTX    Nucleic acid (translated)   Protein                     Gapped
TBLASTN   Protein                     Nucleic acid (translated)   Gapped
TBLASTX   Nucleic acid (translated)   Nucleic acid (translated)   Ungapped
33
Even More Variations
  • PsiBLAST (iterative)
  • BLAT, BLASTZ, MegaBLAST
  • FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
  • Main differences are
  • Seed choice (k, gapped seeds)
  • Additional data structures

34
Suffix Trees
35
Suffix Tree
  • Tree structure that contains all suffixes of the
    input sequence
  • TGAGTGCGA
  • GAGTGCGA
  • AGTGCGA
  • GTGCGA
  • TGCGA
  • GCGA
  • CGA
  • GA
  • A
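
To make the idea concrete, the sketch below inserts every suffix of the example sequence into a plain (uncompressed) trie built from nested dictionaries. This is a simplification for illustration: a real suffix tree compresses single-child paths and is built in linear time, whereas this naive version can take quadratic time and space. The function names are made up for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Check whether `pattern` occurs in the text by walking down from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("TGAGTGCGA")
print(contains(trie, "GTGC"))  # True
print(contains(trie, "GTT"))   # False
```

Every substring of the text is a prefix of some suffix, so a lookup only follows one path of length m from the root, regardless of the text length.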

36
Suffix Tree Example
37
Suffix Tree Analysis
  • O(n) space and construction time
  • 10n to 70n space usage reported
  • O(m) search time for m-letter sequence
  • Good for
  • Small data
  • Exact matches

38
Suffix Array
  • 5 bytes per letter
  • O(m log n) search time
  • Better space usage
  • Slower search
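
A suffix array and its binary search can be sketched as below. The construction simply sorts the suffixes, which is easy to read but much slower than the specialized algorithms used in practice; the function names are made up for this illustration.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Return sorted start positions of `pattern`, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-letter prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)
print(sa)                                # [8, 2, 6, 7, 1, 5, 3, 0, 4]
print(find_occurrences(text, sa, "GA"))  # [1, 7]
```

Each binary search makes O(log n) comparisons of up to m letters, which gives the O(m log n) search time listed above.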

39
Mummer
40
Other Sequence Comparison Tools
  • Reputer, MGA, AVID
  • QUASAR (suffix array)