Ortholog: Two proteins, each from a different species, that are unambiguously related

Slides: 51
Provided by: audrey80

Transcript and Presenter's Notes

1
Orthologs vs. Paralogs
Ortholog: Two proteins, each from a different species, that are unambiguously related by a single common ancestral protein.
-- For a query protein, there can be only one orthologous protein in each other species (by definition).
Paralog: Two or more proteins from the same genome that are related by a common ancestral protein that was duplicated within the genome.
2
Species Tree
[Figure: species tree for Species A and Species B, with gene trees for Proteins XA, YA, XB, and YB showing gene duplication after speciation and gene duplication before speciation]
Proteins XA and XB are orthologs of each other.
Proteins XA and YA are paralogs of each other.
3
Assigning orthologs as the reciprocal best-BLAST hits
[Figure: gene duplication before speciation. Proteins XA and XB are orthologs of each other; Proteins XA and YA are paralogs of each other. Species A has Proteins XA and YA; Species B has Proteins XB and YB.]
Here, Protein XA and Protein XB are assigned as orthologs.
4
Assigning orthologs as the reciprocal best-BLAST hits
[Figure: gene duplications after speciation. Proteins XA and YA are paralogs of each other. Species A has Proteins XA and YA; Species B has Proteins XB and YB.]
Here, Protein XA and Protein XB are NOT assigned as orthologs.
5
Assigning orthologs as the reciprocal best-BLAST hits
[Figure: gene duplications after speciation. Proteins XA and YA are paralogs of each other. Species A has Proteins XA and YA; Species B has Proteins XB and YB.]
Therefore, one must be careful about what one concludes from a lack of orthologs.
Many orthology assignments now incorporate the length of the BLAST hit, e.g., the top BLAST hit must cover >70% of the query protein.
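The reciprocal best-hit assignment described on the preceding slides can be sketched as follows. This is a minimal illustration, assuming the BLAST results have already been parsed into simple tuples; the function names, the tuple layout, and the example scores are hypothetical, and the 70% coverage cutoff follows the slide's example.

```python
def best_hits(hits, min_coverage=0.70):
    """Map each query to its top-scoring subject, ignoring short hits.

    `hits` is an iterable of (query, subject, score, query_coverage) tuples.
    This layout is an assumption for the sketch, not a BLAST output format.
    """
    best = {}
    for query, subject, score, coverage in hits:
        if coverage <= min_coverage:
            continue  # hit does not cover >70% of the query protein
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, min_coverage=0.70):
    """Return (protein_A, protein_B) pairs that are each other's best hit."""
    a_best = best_hits(a_vs_b, min_coverage)
    b_best = best_hits(b_vs_a, min_coverage)
    return sorted((a, b) for a, b in a_best.items() if b_best.get(b) == a)


# Hypothetical scores: XA and XB pick each other, as do YA and YB.
a_vs_b = [("XA", "XB", 300, 0.95), ("XA", "YB", 120, 0.90),
          ("YA", "YB", 280, 0.92), ("YA", "XB", 110, 0.88)]
b_vs_a = [("XB", "XA", 298, 0.94), ("XB", "YA", 115, 0.85),
          ("YB", "YA", 275, 0.93), ("YB", "XA", 118, 0.90)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

If XB's best hit in Species A were YA instead of XA, the pair (XA, XB) would be dropped, which is the "NOT assigned" case from the slides.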
6
Homework Questions
ClustalW
What is the effect of setting the gap-open penalty to the minimum? To the maximum?
Minimum gap-open penalty: lots of smaller gaps
Maximum gap-open penalty: fewer, longer gaps
This would most likely affect alignments of less similar proteins, because those alignments require more gaps.
7
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species

8
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species
Heuristic PAUP search:
1) Stepwise addition of species/sequences, chosen in random order
[Figure: trees built by stepwise addition of taxa A, B, C, and D]
10
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species
Heuristic PAUP search:
1) Stepwise addition of species/sequences, chosen in random order. Finds local optima.
2) Branch swapping to try to find the global optimum (TBR: tree bisection and reconnection)
[Figure: tree of taxa A, B, C, D, E, and F]
12
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species
Did you get the same tree running the identical method multiple times? Why or why not?
13
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species
In this heuristic method, finding the local optimum depends on the path through tree space. A random stepwise-addition process means that a different path will be taken each run; therefore, it's possible to get trapped in different locally optimal areas of tree space (i.e., to find a different locally optimal tree each run).
This is minimized by 1) ten trials of random tree finding each run and 2) branch swapping.
14
Homework Questions
Bootstrapping: A method to test the significance of your tree (not to find an optimal tree).
How much do you trust this tree?
[Figure: tree of taxa A, B, C, D, E, and F. This is the optimal tree found (lowest cost using ALL the data).]
15
Homework Questions
Bootstrapping: A method to test the significance of your tree (not to find an optimal tree).
How dependent is this tree on the data? Bootstrapping builds a tree using a subset of the aligned positions chosen randomly.
1) Build 100-1,000 trees based on randomly chosen subsets of the aligned positions (with replacement).
2) Calculate a consensus tree and show the frequency of each node in the 1,000 trees.
[Figure: the optimal tree found (lowest cost using ALL the data) for taxa A-F, next to the bootstrap consensus tree with node frequencies of 70, 96, 100, and 50]
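The resampling step of bootstrapping can be sketched in a few lines. This is a minimal illustration rather than what PAUP does internally: the function names and the toy alignment are made up, each replicate draws as many columns as the original alignment (the usual convention), and building a tree from each replicate is left to the tree-search program.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement (one bootstrap replicate).

    `alignment` maps a sequence name to its aligned sequence; all sequences
    are assumed to have the same length.
    """
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


def clade_support(clade, replicate_trees):
    """Percent of replicate trees that contain `clade`.

    Each tree is represented here as a set of clades (frozensets of taxon
    names); this representation is an assumption made for the sketch.
    """
    hits = sum(1 for tree in replicate_trees if frozenset(clade) in tree)
    return 100.0 * hits / len(replicate_trees)


toy_alignment = {"A": "MKTAYI", "B": "MKSAYL", "C": "MRTAFI"}  # made-up data
replicate = bootstrap_columns(toy_alignment, random.Random(0))
```

A node that appears in nearly every replicate tree is well supported by the data; a node near 50% depends heavily on which columns happened to be sampled.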
16
Homework Questions
Bootstrapping: A method to test the significance of your tree (not to find an optimal tree).
[Figure: the same optimal tree and bootstrap consensus tree for taxa A-F, with node frequencies of 70, 100, and 50; one relationship is labeled "Ambiguous relationship"]
17
Homework Questions
PAUP: Use multiple ortholog comparison to infer the evolutionary history of the species
If you are interested in building a species tree, which tree will you trust more?
-- one that's based on one protein alignment?
-- one that's based on 3 protein alignments?
-- one that's based on 350 protein alignments?
18
Using protein evolution to infer functional
importance
Sequences can evolve by:
Neutral drift: randomly acquiring mutations over time. If the mutation rate is constant, the # of substitutions correlates roughly with time.
Positive/negative selection: certain mutations cause a significant increase or decrease in organismal fitness. Pressure to either maintain or remove those mutations in the population.
Methods to identify positive selection are a big focus of molecular evolution. One method is based on Ka/Ks.
19
Ka/Ks ratio to measure evolutionary rate
Ka/Ks ratio is the # of amino acid changes normalized by the expected rate of mutation
Ka = # of non-synonymous codon changes (= amino acid changes)
Ks = # of synonymous codon changes (= same amino acid, different codon)
Ka/Ks = 1 corresponds to expected (neutral) evolution
Ka/Ks > 1 corresponds to positive selection: more amino acid changes than expected because advantageous changes become fixed in the population
Ka/Ks < 1 corresponds to negative (purifying) selection: fewer amino acid changes than expected because deleterious changes are removed from the population
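The interpretation rules above can be written as a small helper. This is only a sketch: it assumes Ka and Ks have already been estimated by some other tool, the function names are hypothetical, and the tolerance band around 1 is an arbitrary choice for illustration, not a statistical test.

```python
def ka_ks_ratio(ka, ks):
    """Ratio of non-synonymous to synonymous change."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    return ka / ks


def classify_selection(ka, ks, tolerance=0.05):
    """Label the selection regime implied by Ka/Ks.

    `tolerance` (an assumption of this sketch) defines how close to 1 the
    ratio must be to be called neutral.
    """
    ratio = ka_ks_ratio(ka, ks)
    if ratio > 1 + tolerance:
        return "positive selection"
    if ratio < 1 - tolerance:
        return "negative (purifying) selection"
    return "neutral evolution"


# Hypothetical values: few amino acid changes relative to silent changes.
label = classify_selection(ka=0.02, ks=0.10)
```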
20
Ka/Ks ratio to measure evolutionary rate
Problems with Ka/Ks ratio:
1) Different selective pressures act on different regions of a protein. Would like to find selection acting on subdomains or even individual amino acids.
2) Ka/Ks ratio only works for closely related species, for which you can get DNA alignments.
21
Ahna's section: protein domains, domain searches, etc.
Now, switch gears to gene expression, transcription, and DNA sequence comparisons
22
DNA microarray analysis
Part 1 (this week)
  • I. Brief overview of arrays and uses
    • Types of arrays
    • Uses of arrays
    • Gene expression applications
  • II. Microarray Data Analysis
    • A. Initial data extraction
    • B. Assessing quality of replicates
    • C. Selecting interesting genes
Part 2 (next week)
    • D. Organizing microarray data
      • 1. Hierarchical clustering
      • 2. K-means clustering
    • E. Assigning p-values
  • III. Types of information that can be extracted from the data

23
Two general types of DNA arrays
Spotted arrays: Preexisting nucleic acid spotted onto a solid support (usually glass)
Synthesized arrays: Nucleic acid sequences synthesized directly onto a solid support
[Figure: a solid or reflective mask lets UV light through at specific places; the UV light chemically activates oligos to accept a nucleotide]
24
Two general types of DNA arrays
Spotted arrays: Typically work with two-color fluorescence; two samples are hybridized to one microarray
Synthesized arrays: Usually operate with one sample hybridized to each array (ratio comparisons done mathematically)
25
Two general types of DNA arrays
Can have many spots/sequences representing the same gene on both types of arrays
27
Using microarrays for gene expression analysis
With both types of arrays, want to detect differences in transcript abundance
Different types of sample comparisons:
1. Changes in expression over time -- time 0 mRNA vs. time >0 mRNA in the same tissue sample. Example: developmental time courses
2. Comparison of wild-type mRNA vs. mutant mRNA -- identify aberrant expression in mutant cells
3. Comparison of different tissues. Example: liver mRNA vs. brain mRNA
4. Comparison of many different samples from a population. Example: compare tissue samples from many patients
29
Want to compare the red signal vs. green signal
in each spot on the array
P.O. Brown
30
Methods of signal detection: local background subtraction
On spotted DNA arrays: measure Red signal and Green signal from within each spot vs. just outside each spot
Red/green = 10,312 / 2,855 = 3.612
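The arithmetic on this slide can be reproduced directly. The sketch below assumes the slide's two values are already background-subtracted spot intensities; the function names are hypothetical, and the log2 conversion is added because later slides work with log2 ratios.

```python
import math


def subtract_background(spot_signal, local_background):
    """Signal inside the spot minus signal just outside the spot."""
    return spot_signal - local_background


def red_green_ratio(red, green):
    """Expression ratio for one spot from background-corrected signals."""
    return red / green


# Values from the slide, taken here to be background-corrected already.
ratio = red_green_ratio(10312, 2855)
log2_ratio = math.log2(ratio)
print(f"red/green = {ratio:.3f}, log2 ratio = {log2_ratio:.2f}")
```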
31
On Affymetrix-type arrays it can be more complicated
PM: perfect match oligo
MM: mismatch oligo (central nucleotide is mutated)
32
On Affymetrix-type arrays it can be more complicated
  • Robust Multiarray Analysis (RMA; Irizarry et al. (2003)) is commonly used to extract data
  • 1. Throw out elements where MM signal > PM signal
  • 2. Local background subtraction
  • 3. Compare directly to a second array to get ratios (and normalize array signals; see next)
  • 4. Convert to log2 scale
  • 5. Combine data from replicate probes

33
When comparing two different RNA samples, the signal from the two samples needs to be normalized
On spotted arrays: The red and green channels are scanned and detected separately, with independent scan parameters. Example: Imagine the red signal is detected with much higher laser intensity and PMT settings. The array would look artifactually red.
On Affymetrix-type arrays: Same principle, one sample per array. Need to adjust for overall array intensity.
34
center the distribution
Yang et al. 2002 NAR
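One simple way to "center the distribution" is to subtract the median log2 ratio from every spot, so that a typical gene ends up with a log2 ratio of 0. The sketch below shows that global median centering only; it is an assumption chosen for brevity, not necessarily the procedure in the cited paper, and the function name and example ratios are made up.

```python
import math
import statistics


def median_center(log2_ratios):
    """Shift log2 ratios so that their median is 0."""
    center = statistics.median(log2_ratios)
    return [value - center for value in log2_ratios]


# Hypothetical red/green ratios from an array that looks artifactually red.
raw_ratios = [4.0, 2.0, 8.0, 1.0, 2.0]
log2_ratios = [math.log2(r) for r in raw_ratios]
centered = median_center(log2_ratios)
```

After centering, a spot that was at the array-wide median reads as unchanged, and the remaining values are expressed relative to it.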
36
Now each array = list of background-corrected, normalized relative transcript values
[Figure: value lists for Array 1 and Array 2]
37
Assessing replicates: how well do the data agree overall? Use linear regression.
Where does the noise come from?
-- can be biological variation
-- can be array artifacts
Should define both types of variation.
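Agreement between two replicate arrays can be summarized with an ordinary least-squares fit of one array's values against the other's. The sketch below is one way to do that with the standard library only; the function name and the replicate values are hypothetical.

```python
def linear_fit(x, y):
    """Least-squares slope, intercept, and R^2 for paired values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up log2 ratios for the same five genes on two replicate arrays.
array1 = [0.1, 1.2, -0.8, 2.0, 0.4]
array2 = [0.2, 1.0, -0.9, 2.1, 0.5]
slope, intercept, r_squared = linear_fit(array1, array2)
```

Good replicates give a slope near 1, an intercept near 0, and R^2 near 1; departures from that pattern point to biological variation or array artifacts.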
38
Now you have your data, in the form of
background-subtracted expression ratios,
extracted from the arrays. Now what?
39
Select differentially expressed genes to focus on
Methods of gene selection:
-- arbitrary fold-expression-change cutoff. Example: genes that change >3X in expression between samples
-- statistically significant change in expression (requires replicates)
[Figure: Gene X expression under condition 1 vs. Gene X expression under condition 2, showing the expression difference]
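The fold-change cutoff is straightforward to apply once ratios are on a log2 scale, because a >3X change in either direction corresponds to an absolute log2 ratio above log2(3). A minimal sketch, with hypothetical gene names and values:

```python
import math


def select_by_fold_change(log2_ratios, fold_cutoff=3.0):
    """Return genes whose expression changes more than `fold_cutoff`-fold.

    `log2_ratios` maps gene name -> log2(condition 2 / condition 1).
    Both increases and decreases are selected.
    """
    threshold = math.log2(fold_cutoff)
    return sorted(gene for gene, value in log2_ratios.items()
                  if abs(value) > threshold)


# Made-up log2 ratios; log2(3) is about 1.58.
example = {"geneA": 2.0, "geneB": -1.8, "geneC": 0.4, "geneD": 1.5}
selected = select_by_fold_change(example)
```

Here geneD (about 2.8-fold) just misses the cutoff, which illustrates why the slide calls the threshold arbitrary.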
41
Select differentially expressed genes to focus on
Use statistics to compare the mean and variation of 2 (or more) populations
[Figure: distributions of expression under two conditions, showing the expression difference]
42
Test if the means of 2 (or more) groups are the
same or statistically different
The null hypothesis H0 says that the two groups are statistically the same -- you will either accept or reject the null hypothesis
Choosing the right test:
-- parametric test if your data are normally distributed with equal variance
-- nonparametric test if those assumptions do not hold
[Figure: example distributions of normal data and not normal data]
43
Test if the means of 2 groups are the same or
statistically different
The null hypothesis H0 says that the two groups are statistically the same -- you will either accept or reject the null hypothesis
If your two samples are normally distributed with equal variance, use the t-test
$$t = \frac{\bar{X}_1 - \bar{X}_2}{SED} = \frac{\text{difference in the means}}{\text{standard error of the difference in the means}}$$
If t > tc, where tc is the critical value for the degrees of freedom and confidence level, then reject H0
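The t statistic above can be computed by hand for two small samples. The sketch below uses the pooled-variance form of SED, which matches the equal-variance assumption on the slide; the function name and the expression values are hypothetical, and the critical value still has to be looked up in a t table.

```python
import math
import statistics


def two_sample_t(sample1, sample2):
    """Equal-variance (pooled) two-sample t statistic and degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(sample1)
                  + (n2 - 1) * statistics.variance(sample2)) / df
    sed = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    t = (statistics.mean(sample1) - statistics.mean(sample2)) / sed
    return t, df


# Made-up replicate measurements of Gene X under two conditions.
condition1 = [5.1, 4.9, 5.3, 5.0]
condition2 = [6.2, 6.0, 6.4, 6.1]
t, df = two_sample_t(condition1, condition2)

# 2.447 is the two-tailed critical value for df = 6 at the 0.05 level.
reject_h0 = abs(t) > 2.447
```

Comparing the absolute value of t against the critical value corresponds to a two-tailed test, where a change in either direction counts.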
44
Test if the means of 2 groups are the same or
statistically different
[Figure: two-tailed t-test vs. one-tailed t-test]
45
Test if the means of 2 groups are the same or
statistically different
The null hypothesis H0 says that the two groups are statistically the same -- you will either accept or reject the null hypothesis
If your two samples are NOT normally distributed with equal variance, use the Mann-Whitney test (Wilcoxon Rank Sum test)
  • Combine data from sample 1 and sample 2
  • Rank each data point in the pooled dataset
  • Compare the average rank for sample 1 and sample 2 values
  • Calculate U

$$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - \sum \mathrm{Rank}_1$$
If U > Uc, where Uc is the critical value from the U table, then reject H0
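The ranking steps and the U formula can be sketched as follows. The function name is hypothetical, tied values receive the average of the ranks they span (a common convention that the slide does not spell out), and the critical value still comes from a U table.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U, U') using U = n1*n2 + n1*(n1+1)/2 - (rank sum of sample 1)."""
    pooled = sorted([(v, 1) for v in sample1] + [(v, 2) for v in sample2])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    rank_sum1 = sum(r for r, (_, group) in zip(ranks, pooled) if group == 1)
    n1, n2 = len(sample1), len(sample2)
    u = n1 * n2 + n1 * (n1 + 1) / 2 - rank_sum1
    return u, n1 * n2 - u


# Made-up values in which every sample 1 value is below every sample 2 value.
u, u_prime = mann_whitney_u([1.2, 2.5, 3.1], [4.0, 5.6, 6.3])
```

Complete separation of the two samples gives the most extreme result possible, U = n1*n2 with U' = 0.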
46
The paired t-test for gene expression ratios
If your two samples are normally distributed with equal variance AND your data were paired before collection, use the paired t-test
Example: Tumor sample before and after treatment
Gene expression differences expressed as ratios, e.g., mutant vs. wt log2 ratios: 5.0, 4.3, 6.7
$$t = \frac{\bar{D}}{SEM} = \frac{\text{average difference in expression}}{\text{standard error of the mean difference}}$$
If t > tc, where tc is the critical value for the degrees of freedom (n-1) and confidence level, then reject H0
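Applied to the three log2 ratios in the example, the paired t statistic works out as below. A log2 ratio is already a paired difference on the log scale, so the values are used directly as D; the function name is hypothetical.

```python
import math
import statistics


def paired_t(differences):
    """Paired t statistic: mean difference divided by its standard error."""
    n = len(differences)
    sem = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / sem, n - 1


t, df = paired_t([5.0, 4.3, 6.7])

# 4.303 is the two-tailed critical value for df = 2 at the 0.05 level.
reject_h0 = abs(t) > 4.303
print(f"t = {t:.2f} with {df} degrees of freedom")
```

With only three pairs there are just 2 degrees of freedom, so the critical value is large and a consistent change across replicates is needed to reject H0.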
47
Test if the means of 2 (or more) groups are the
same or statistically different
The null hypothesis H0 says that the two groups are statistically the same -- you will either accept or reject the null hypothesis
ANOVA (ANalysis Of VAriance) for comparing 2 or more means
$$F = \frac{\text{variation between samples}}{\text{variation within samples}}$$
If F > Fc, where Fc is the critical value for the degrees of freedom (n-1) and confidence level, then reject H0
ANOVA only tells you that at least one of your samples is different; you may need to identify which is different for >2 sample comparisons
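A one-way F statistic can be computed from the between-group and within-group sums of squares. In this sketch the function name and data are hypothetical; it returns both degrees of freedom (number of groups minus 1, and total observations minus number of groups), since an F table is indexed by the pair.

```python
import statistics


def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within


# Made-up expression values for one gene in three samples, three replicates each.
f, df_between, df_within = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F says the group means differ more than the replicate scatter would explain, but not which group is responsible.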
48
Example uses
You have 6 patients and 5 replicate liver biopsies from each patient. The F-statistic (and corresponding p-value) will tell you which genes are differentially expressed in any of the 6 patients (but won't tell you which patient).
There is also a two-way ANOVA for multiple variables
You have 6 patients, half of whom smoke, and 5 replicate liver biopsies from each patient.
49
Assessing and minimizing error in calls
Type I error: false positives (FDR: False Discovery Rate)
Type II error: false negatives
Balance between minimizing false positives vs. false negatives
Assessing false positives vs. false negatives: sensitivity vs. specificity
Sensitivity (how well did you find what you want) = # of true positives / # of total positives (= true positives + false negatives)
Specificity (how well did you discriminate) = # of true negatives / # of total negatives (= true negatives + false positives)
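These definitions translate directly into code. The sketch below also includes the false discovery rate as the fraction of positive calls that are false, which is the standard meaning of the term; the function names and counts are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """True positives as a fraction of all real positives."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """True negatives as a fraction of all real negatives."""
    return true_neg / (true_neg + false_pos)


def false_discovery_rate(false_pos, true_pos):
    """False positives as a fraction of everything called positive."""
    return false_pos / (false_pos + true_pos)


# Made-up counts: 100 truly changed genes and 900 unchanged genes.
sens = sensitivity(true_pos=80, false_neg=20)
spec = specificity(true_neg=855, false_pos=45)
fdr = false_discovery_rate(false_pos=45, true_pos=80)
```

In this example the specificity looks high (0.95), yet over a third of the called genes are false positives, because unchanged genes greatly outnumber changed ones.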
50
When working with many genes, must correct for multiple testing
p < 0.01 means that there is a 1 in 100 chance that the observation is H0
But if you have 30,000 genes, with a 0.01 chance that each conclusion is wrong, then you will get 300 false positives!
Adjust the p-value cutoff such that there is a 1 in 100 chance of false identification
For each gene: p = 0.01 / 30,000 trials, so p < 3 x 10^-7 is significant
(this is also known as Bonferroni correction)
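Both numbers on this slide follow from one multiplication and one division, shown here as a short sketch with hypothetical function names.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false calls if every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff that keeps the overall error rate at alpha."""
    return alpha / n_tests


n_genes = 30_000
uncorrected = expected_false_positives(n_genes, 0.01)
cutoff = bonferroni_cutoff(0.01, n_genes)
print(f"expected false positives at p < 0.01: {uncorrected:.0f}")
print(f"Bonferroni cutoff: p < {cutoff:.1e}")
```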