Sequence Comparison - PowerPoint PPT Presentation

About This Presentation
Title:

Sequence Comparison

Description:

Dotplot - visual alignment of two sequences. Multiple Sequence Alignment -Two or more sequences ... Rat and Drosophila Groucho Gene. Intergenic comparison ... – PowerPoint PPT presentation

Slides: 69
Provided by: carljs
Learn more at: http://udel.edu
Transcript and Presenter's Notes

Title: Sequence Comparison


1
Sequence Comparison
  • Intragenic: self to self; find internal repeating units.
  • Intergenic: compare two different sequences.
  • Dotplot: visual alignment of two sequences.
  • Multiple sequence alignment: two or more sequences.
2
Overview
  • Why compare sequences
  • Homology vs. identity/similarity
  • DotPlots
  • Scoring: match, mismatch, gap penalty
  • Global vs. local alignment
  • Do the results make biological sense?

8
Why Align Sequences
  • Identify conserved sequences
  • Identify elements that repeat in a single
    sequence.
  • Identify elements conserved between genes.
  • Identify elements conserved between species.
  • Regulatory elements
  • Functional elements

11
Underlying Hypothesis?
  • EVOLUTION
  • Based upon conservation of sequence during
    evolution we can infer function.

12
Basic terms
  • Similarity: a measurable quantity, usually expressed as a percentage.
    For proteins it also counts conservative substitutions.
  • Identity: the percentage of positions that match exactly.
  • Homology: a specific term indicating relationship by evolution.
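
The difference between identity and similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sequences are already aligned, equal in length, and contain no gaps, and the `CONSERVATIVE_GROUPS` table is a hypothetical grouping chosen for the example, not one given in the slides.

```python
# Illustrative amino-acid groups treated here as conservative substitutions.
# The grouping is an assumption for this sketch, not taken from the slides.
CONSERVATIVE_GROUPS = [set("ILVM"), set("FWY"), set("KRH"),
                       set("DE"), set("ST"), set("NQ")]


def percent_identity(a, b):
    """Percentage of aligned positions holding exactly the same residue."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def percent_similarity(a, b, groups=CONSERVATIVE_GROUPS):
    """Percentage of positions that are identical or fall in the same group."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, equal length")

    def similar(x, y):
        return x == y or any(x in g and y in g for g in groups)

    hits = sum(1 for x, y in zip(a, b) if similar(x, y))
    return 100.0 * hits / len(a)


print(percent_identity("KLDE", "RLDD"))    # identical positions only
print(percent_similarity("KLDE", "RLDD"))  # also counts K/R and E/D
```

Neither number says anything about homology by itself; homology is an inference about evolutionary relationship drawn from such measurements.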

14
Basic terms
  • Orthologs: homologous sequences found in two or more species that
    have the same function (e.g., alpha-hemoglobin).
  • Paralogs: homologous sequences found in the same species that arose
    by gene duplication (e.g., alpha- and beta-hemoglobin).

18
Pairwise comparison
  • Dotplot
  • All against all comparison.
  • Every position is compared with every other
    position.
  • Nucleic acids and proteins have polarity.
  • Typically only one direction makes biological
    sense.
  • 5' to 3', or amino terminus to carboxyl terminus.

25
DotPlot
  • Dotplot: a matrix with one sequence across the top and the other
    down the side. Put a dot, or a 1, wherever there is identity.

      G A T C T
    G .
    A   .
    T     .   .
    C       .
    T     .   .
27
Simple plot
  • Window: size of the sequence block used for comparison. In the
    previous example, window = 1.
  • Stringency: number of matches required to score positive. In the
    previous example, stringency = 1 (an exact match was required).
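
The window and stringency rule can be sketched as a small Python function. This is a minimal illustration, not the plotting program used for the slides; the names `dotplot` and `render` are made up for the example. With window = 1 and stringency = 1 it reduces to the simple identity plot of GATCT against itself.

```python
def dotplot(seq_x, seq_y, window=1, stringency=1):
    """Return a grid of booleans.

    grid[i][j] is True when the window starting at seq_y[i] and the window
    starting at seq_x[j] share at least `stringency` identical positions.
    seq_x runs across the top, seq_y down the side.
    """
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    grid = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(1 for k in range(window)
                          if seq_y[i + k] == seq_x[j + k])
            row.append(matches >= stringency)
        grid.append(row)
    return grid


def render(grid):
    """Turn the boolean grid into text rows, one '.' per positive cell."""
    return ["".join("." if cell else " " for cell in row) for row in grid]


# Window 1, stringency 1: every identical pair of positions gets a dot.
for line in render(dotplot("GATCT", "GATCT")):
    print(line)
```

Raising the window and stringency removes isolated dots that arise by chance and leaves the diagonals that mark runs of similarity.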

31
Dot Plot
  • Compare two sequences in every register.
  • Vary size of window and stringency depending upon
    sequences being compared.
  • For nucleotide sequences, typically start with window = 21 and
    stringency = 14.

32
DotPlot
WINDOW 4, STRINGENCY 2
GATCGTACCATGGAATCGTCCAGATCA
GATC  . (4/4)
GATC  - (0/4)
GATC  - (0/4)
GATC  . (2/4)
33
This match comes from the G and the C, two of the four positions in the window.
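
The per-register counts for this example can be reproduced with a short Python sketch. The function name `window_matches` is hypothetical, and offsets are counted from 0. It slides the four-letter block GATC along the longer sequence and counts identical positions in each register.

```python
def window_matches(probe, target):
    """Count identical positions between `probe` and each same-length
    block of `target`, one count per register (offset)."""
    w = len(probe)
    return [
        sum(1 for k in range(w) if probe[k] == target[offset + k])
        for offset in range(len(target) - w + 1)
    ]


counts = window_matches("GATC", "GATCGTACCATGGAATCGTCCAGATCA")
stringency = 2
for offset, n in enumerate(counts[:5]):
    mark = "." if n >= stringency else "-"
    print(offset, mark, f"({n}/4)")
```

Under these assumptions the 2/4 register is offset 4, where the block GTAC shares its G and C with GATC; offset 3 scores 1/4 and so falls below the stringency.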
34
Top 3 Rows
35
Intragenic Comparison
  • Rat Groucho Gene

39
Intergenic Comparison
  • Rat and Drosophila Groucho Gene

44
Intergenic comparison
  • Nucleotide sequence contains three domains.
  • 50-350: strong conservation
  • An indel places the comparison out of register
  • 450-1300: slightly weaker conservation
  • 1300-2400: strong conservation

45
Groucho
  • These three coding regions correspond to apparent
    functional domains of the encoded protein

47
Scoring Alignments
  • Quality score
  • Score x for a match, -y for a mismatch
  • Penalties for creating a gap and for extending a gap

51
Scoring Alignments
  • Quality Score
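  • $\text{Quality} = 10 \times (\text{matches}) - 1 \times (\text{mismatches}) - (\text{Gap Creation Penalty}) \times (\text{number of gaps}) - (\text{Gap Extension Penalty}) \times (\text{total length of gaps})$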
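
As a worked illustration, the quality score can be computed from an already aligned pair of sequences. This is a sketch only: `-` is assumed to mark a gap, and the default gap penalties (5 and 1) are arbitrary placeholders, not values from the slides. The match and mismatch defaults are the 10 and -1 shown above.

```python
def quality_score(aligned_a, aligned_b, match=10, mismatch=-1,
                  gap_creation=5, gap_extension=1):
    """Score an existing alignment; '-' marks a gap position."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_count = gap_length = 0
    prev_gap = None  # which sequence held a gap in the previous column
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both sequences")
        if x == "-" or y == "-":
            current = "a" if x == "-" else "b"
            if current != prev_gap:
                gap_count += 1      # a new gap is created
            gap_length += 1         # every gap position counts toward length
            prev_gap = current
        else:
            prev_gap = None
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return (match * matches + mismatch * mismatches
            - gap_creation * gap_count - gap_extension * gap_length)


# 5 matches, 1 mismatch, one gap of length 2: 50 - 1 - 5 - 2
print(quality_score("GATC--TA", "GATCGGTT"))
```

Separating the creation penalty from the extension penalty makes one long gap cheaper than several short ones of the same total length.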

52
Z Score (standardized score)
  • $Z = \frac{\text{Score}_{\text{alignment}} - \text{Average Score}_{\text{random}}}{\text{Standard Deviation}_{\text{random}}}$
53
Quality Score Randomization
  • The program takes a sequence and randomizes it X times (X is
    selected by the user).
  • It determines the average quality score and the standard deviation
    for the randomized sequences.
  • Compare the randomized scores with the quality score to help
    determine whether the alignment is potentially significant.
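
The randomization procedure can be sketched in Python as follows. Several things here are assumptions made for the example: `identity_score` is a stand-in for a real quality score (it counts identical positions with no gaps), the shuffle is letter by letter, and the sample standard deviation is used because the slides do not say which form the program applies.

```python
import random
import statistics


def identity_score(a, b):
    """Stand-in quality score: number of identical positions (no gaps)."""
    return sum(1 for x, y in zip(a, b) if x == y)


def randomized_scores(a, b, times=100, score=identity_score, rng=None):
    """Shuffle `b` `times` times and score each shuffle against `a`."""
    rng = rng or random.Random()
    scores = []
    for _ in range(times):
        shuffled = list(b)
        rng.shuffle(shuffled)
        scores.append(score(a, "".join(shuffled)))
    return scores


def z_score(alignment_score, random_scores):
    """Z = (alignment score - mean random score) / SD of random scores."""
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # raises if fewer than two scores
    return (alignment_score - mean) / sd


print(z_score(10, [2, 4, 6]))  # mean 4, SD 2
```

If every randomized score is identical the standard deviation is zero and Z is undefined, so a real program needs enough shuffles to produce a spread.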

54
Randomization
  • It has become clear that sequences appear to evolve in a word-like
    fashion.
  • The 26 letters of the alphabet are combined to make words, and it
    is the words that actually communicate information.
  • Randomization should therefore occur at the level of short strings
    of nucleotides (2-4) rather than single nucleotides.
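
One possible reading of word-level randomization is to cut the sequence into non-overlapping blocks and shuffle the blocks. The sketch below assumes that reading; the function name `shuffle_words` and the default word size of 3 are choices made for the example, and other schemes (for instance, shuffles that preserve overlapping dinucleotide counts) exist.

```python
import random


def shuffle_words(seq, word_size=3, rng=None):
    """Shuffle non-overlapping blocks of `word_size` residues.

    A trailing shorter block, if any, is shuffled along with the others.
    """
    rng = rng or random.Random()
    words = [seq[i:i + word_size] for i in range(0, len(seq), word_size)]
    rng.shuffle(words)
    return "".join(words)


print(shuffle_words("GATCGTACCATG", word_size=3, rng=random.Random(1)))
```

Each block survives intact, so short-range composition is preserved while the long-range order is destroyed.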

58
Global Alignment
  • Global: compares all possible alignments of two sequences and
    presents the one with the greatest number of matches and the
    fewest gaps.
  • The alignment runs from one end of the longer sequence to the
    other end.
  • Best for closely related sequences.
  • Can miss short regions of strongly conserved sequence.
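
A global alignment score can be sketched with the standard dynamic-programming recurrence usually called Needleman-Wunsch. The slides do not name the algorithm, so treat this as one common approach. The scoring values and the single linear gap penalty are assumptions for the example; separate gap creation and extension penalties are not modeled here.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best end-to-end alignment score with a linear gap penalty."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against nothing costs i gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# ACGT vs AGT: best alignment is A C G T / A - G T (3 matches, 1 gap).
print(global_alignment_score("ACGT", "AGT"))
```

Because both sequences must be consumed end to end, every unmatched position costs a gap, which is why a short conserved region inside otherwise unrelated sequences can be missed.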

61
Local Alignment
  • Identifies segments of alignment with the highest
    possible score.
  • Aligns the sequences, then extends aligned regions in both
    directions until the score falls to zero.
  • Best for comparing sequences whose relationship
    is unknown.
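
The idea of letting the score fall to zero can be sketched with the dynamic-programming recurrence usually called Smith-Waterman. The slides do not name it, so this is an assumption about a typical implementation rather than a description of the Bestfit program. Scoring values are placeholders.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Highest-scoring local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start or stop anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Only the shared GATC core contributes: 4 matches x 2.
print(local_alignment_score("TTTGATCAAA", "CCCGATCGGG"))
```

The unrelated flanks are simply left out of the alignment, which is why a local method suits sequences whose relationship is unknown.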

62
Global Alignment vs. Local Alignment
63
Blast 2
  • BLAST: Basic Local Alignment Search Tool
  • E (expect) value: the number of hits expected by random chance in
    a database of the same size.
  • The larger the numerical value, the lower the significance.
  • Example: HIV sequence
68
  • Both Global (Gap) and Local (Bestfit) tools will
    (almost) always give a match.
  • It is important to determine if the match is
    biologically relevant.
  • Not necessarily relevant: low-complexity regions.
  • Sequence repeats (glutamine runs)
  • Transmembrane regions (high in hydrophobic residues)
  • If working with coding regions, you are typically better off
    comparing protein sequences, which have greater information
    content.
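
One simple way to flag a low-complexity match before trusting it is to look for long runs of a single residue, such as glutamine (Q). The sketch below is a hypothetical helper, and the minimum run length of 5 is an arbitrary threshold chosen for illustration; dedicated low-complexity filters use more general measures.

```python
def find_runs(seq, residue, min_length=5):
    """Return (start, end) index pairs, end exclusive, for every run of
    `residue` that is at least `min_length` long."""
    runs = []
    start = None
    for i, ch in enumerate(seq):
        if ch == residue:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Handle a run that reaches the end of the sequence.
    if start is not None and len(seq) - start >= min_length:
        runs.append((start, len(seq)))
    return runs


protein = "MA" + "Q" * 8 + "LT"
print(find_runs(protein, "Q"))
```

A hit that lies mostly inside such a run says little about shared function, since glutamine runs occur in many unrelated proteins.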