Title: Finding functional DNA sequences from whole genome human vs. mouse alignments despite variation in conservation
1Finding functional DNA sequences from whole
genome human vs. mouse alignments despite
variation in conservation
- Penn State Univ.: Ross Hardison, Webb Miller, Laura Elnitski, Scott Schwartz, Shan Yang, Jia Li, Francesca Chiaromonte
- Univ. California at Santa Cruz: David Haussler, Krishna Roskin, Mark Diekhans, Robert Baertsch, Jim Kent
- Cambridge Univ.: Nick Goldman, Simon Whelan
- Institute for Systems Biology: Arian Smit
2Human and mouse genomes have been aligned
- Human: Dec. 2001 assembly
  - About 96% coverage of euchromatic portion
- Mouse: Arachne assembly of Feb. 2002 sequence
  - 40 million reads, 7x redundancy
  - Assembled into a few supercontigs per chromosome
  - About 96% coverage of euchromatic portion
- Aligned with blastz (PSU: Miller, Zhang and Schwartz)
- Used computer cluster at UCSC
  - 1024 cpus
  - Job takes 2 days
3Whole genome human vs. mouse alignments can be
obtained from PipDispenser
http://bio.cse.psu.edu
4Alignments are tracks at the UCSC Human Genome
Browser
Tracks shown are under development: http://hgwdev-baertsch.cse.ucsc.edu
Public browser: http://genome.ucsc.edu
5Whole genome alignments reveal variation in
conservation between and within chromosomes
6Coverage of human DNA by alignments with mouse
Ratios plotted: repetitive DNA / total DNA; aligned, nonrepetitive DNA / nonrepetitive DNA; aligned DNA / total DNA
Blastz, Dec 2001 human vs Feb 2002 mouse
7Alignment coverage varies inversely with number
of breaks in conserved synteny
8Conservation varies along chromosomes
aln_nrc: fraction of nonrepetitive, noncoding
DNA that aligns, measured in 10 kb windows
Chr22
Chr7
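A windowed fraction of this kind can be computed roughly as follows. This is only a minimal sketch: the per-base boolean lists `aligned` and `masked` and the function name `window_fractions` are hypothetical, chosen for illustration, and are not the representation used to produce the plots. Here `masked` stands for bases excluded from the denominator (for aln_nrc, repetitive or coding bases).

```python
def window_fractions(aligned, masked, chrom_len, window=10_000):
    """For each non-overlapping window, return the fraction of unmasked
    bases that are aligned, or None if every base in the window is masked.

    aligned[i] / masked[i] are hypothetical per-base flags (assumption).
    """
    fractions = []
    for start in range(0, chrom_len, window):
        end = min(start + window, chrom_len)
        usable = 0  # bases counted in the denominator
        hits = 0    # usable bases that fall inside an alignment
        for pos in range(start, end):
            if masked[pos]:
                continue
            usable += 1
            if aligned[pos]:
                hits += 1
        fractions.append(hits / usable if usable else None)
    return fractions


# Toy example with 10-base windows instead of 10 kb windows.
toy_aligned = [True] * 5 + [False] * 5 + [True] * 10
toy_masked = [False] * 10 + [True] * 5 + [False] * 5
toy_result = window_fractions(toy_aligned, toy_masked, 20, window=10)
```

In the toy example the second window has half of its bases masked, so its fraction is computed over the remaining five bases only.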
9Autocorrelation of fraction of DNA that aligns
Although the correlation between values of aln
for 10 kb windows falls off rapidly with distance, a
substantial correlation is retained out to about 400
kb. For rep and exon the correlation falls below
significance rapidly, but for GC it does not.
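As an illustration of the quantity being described, the autocorrelation of a per-window series at a given lag can be sketched as below. The function name and the estimator (lagged covariance divided by the overall variance) are assumptions; the slide does not say which estimator or significance test was used. With 10 kb windows, a lag of 40 windows corresponds to 400 kb.

```python
def autocorrelation(values, lag):
    """Sample autocorrelation of a per-window series at the given lag
    (lag is counted in windows)."""
    n = len(values)
    if not 0 <= lag < n:
        raise ValueError("lag must be between 0 and len(values) - 1")
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        raise ValueError("series is constant; autocorrelation undefined")
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# Toy series: strictly alternating values are negatively correlated at lag 1.
toy_series = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
lag0 = autocorrelation(toy_series, 0)
lag1 = autocorrelation(toy_series, 1)
```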
10Variation in evolutionary rates revealed in
ancient repeats
K. Roskin, D. Haussler (UCSC)
11Neighboring bases affect frequency of
substitutions in ancient repeats
Graph from UCSC; similar results from A. Smit on
repeats and N. Goldman on 4-fold degenerate
sites. However, this effect was not seen in
noncoding, nonrepetitive DNA (Miller).
12p-values reflecting different divergence rates
reveal more significant alignments
Jia Li and Webb Miller: HMMs model local rate
variation; a Markov model then assigns a
p-value to an alignment given that local rate.
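The first step, labeling each region with a local rate state, can be illustrated with a generic two-state HMM decoded by the Viterbi algorithm. This is a sketch under stated assumptions, not the model of Li and Miller: the states ("slow", "fast"), the match/mismatch observation alphabet, and every probability below are invented for illustration, and the p-value step is not shown.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable state path for an observation sequence (log space)."""
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states
    }
    backpointers = []
    for symbol in obs[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for landing in state s at this column.
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (
                score[prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical parameters: "M" = matching column, "X" = mismatching column.
STATES = ("slow", "fast")
START = {"slow": 0.5, "fast": 0.5}
TRANS = {"slow": {"slow": 0.9, "fast": 0.1}, "fast": {"slow": 0.1, "fast": 0.9}}
EMIT = {"slow": {"M": 0.9, "X": 0.1}, "fast": {"M": 0.6, "X": 0.4}}

toy_columns = "MMMMMMMMMM" + "XMXXMXXMXX"
toy_path = viterbi(toy_columns, STATES, START, TRANS, EMIT)
```

With these invented parameters, the well-conserved first half of the toy sequence is labeled "slow" and the mismatch-rich second half "fast".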
13What factors account for the variation in
conservation of noncoding DNA?
- Multivariate analysis of alignments on chr22.
- For non-overlapping 10 kb windows, measure
fraction of DNA that aligns and other genomic
parameters.
- Analyze for single and multiple parameters that
predict the variation in conservation.
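The single-parameter part of such an analysis amounts to correlating one per-window measurement against another. A minimal sketch using the Pearson correlation coefficient is shown here; the function name and the toy values are assumptions, and the multivariate methods used in the actual analysis are not reproduced.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two per-window series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return sxy / math.sqrt(sxx * syy)


# Toy windows (invented values): fraction aligning vs. repeat density.
toy_aln = [0.1, 0.2, 0.3, 0.4]
toy_rep = [0.8, 0.6, 0.4, 0.2]
toy_r = pearson(toy_aln, toy_rep)
```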
14Fraction of sequence aligning is associated with
fewer repeats and more GC and exons
15Negative correlation between aln and rep is
highly significant
Chromosome 22q (33.4 Mb)
16Multivariate analysis also shows that aligning
genomic DNA is associated with fewer repeats and
more GC
Correlations with the fraction of aligning nonrepetitive sequence (aln_nr), with GC, and with Exon:

Parameter              aln_nr    GC       Exon
GC content              0.222
Exon density            0.278    0.268
SNP density             0.000    0.174    0.125
6-mer exact matches     0.062    0.013    0.044
Repeat density         -0.327   -0.518   -0.266

Results of multivariate analysis of 3329
non-overlapping 10 kb windows comprising chr22.
17Sliced inverse regression finds two combinations
of parameters that explain only 16% of the
variation in fraction aligning (aln)
18Alu repeats insert randomly with respect to
fraction of a segment that aligns, but are
retained in regions of limited alignment
Panels: young, old (Chr7, 10 kb windows)
19Nucleotide level alignment scores (ASPC, id) do
not correlate well with coverage by alignments
(aln)
Blue line is the lowess fit (smoothing parameter 0.5).
aspc: alignment score per column
aln_nr: fraction of nonrepetitive DNA that aligns
20Use measures of alignment quality to discriminate
between functional classes of DNA
- Types of quality scores
- Percent identity
- Context-dependent I-score
- Principal component analysis
- Frequency of exactly matching hexamers
- Datasets
- Chr22 alignments and annotations
- Whole genome alignments and annotations
- 95 known regulatory regions
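Of the quality scores listed above, the frequency of exactly matching hexamers is the easiest to illustrate. The sketch below counts overlapping 6-column windows of a pairwise alignment in which every column is an identical, ungapped base. The function name, the gap character "-", and the choice to count overlapping windows are assumptions; the slides do not define the measure precisely.

```python
def exact_kmer_matches(seq_a, seq_b, k=6):
    """Count overlapping k-column windows of a pairwise alignment in which
    every column is an identical, ungapped base."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    count = 0
    run = 0  # length of the current run of identical, ungapped columns
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b and a != "-":
            run += 1
        else:
            run = 0
        if run >= k:
            count += 1  # a k-column window ending at this column matches
    return count


# Toy alignment: a run of 7 matching columns contains two matching hexamers.
toy_top = "ACGTACGTAC"
toy_bottom = "ACGTACGAAC"
toy_hexamers = exact_kmer_matches(toy_top, toy_bottom)
```

Dividing the count by the number of windows, `len(seq_a) - k + 1`, would turn it into a frequency.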
21Fraction matching in alignments distinguishes
exons and regulatory regions (partially) from
neutrally evolving sequences
Denominator excludes gaps
Denominator includes gaps
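The two denominators can be made concrete with a short sketch that returns both versions of the fraction matching for one pairwise alignment. The function name and the gap character "-" are assumptions.

```python
def fraction_matching(seq_a, seq_b):
    """Return (fraction excluding gap columns, fraction including gap columns)
    of alignment columns that are identical bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = gaps = columns = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        columns += 1
        if a == "-" or b == "-":
            gaps += 1
        elif a == b:
            matches += 1
    ungapped = columns - gaps
    excluding = matches / ungapped if ungapped else 0.0
    including = matches / columns if columns else 0.0
    return excluding, including


# Toy alignment: 8 columns, 2 of them gaps, 5 matches.
toy_excl, toy_incl = fraction_matching("ACGT-ACG", "ACCTTA-G")
```

Including gap columns in the denominator penalizes indel-rich alignments, so the second value is never larger than the first.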
22Principal component analysis of alignment
parameters for different classes of DNA on chr22
Two principal components (PCA1 and PCA2) account
for 99% of the variability in match (M),
transition (S), transversion (V) and gap (G)
space.
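The four-dimensional input to such a PCA can be obtained by classifying each alignment column as a match, transition, transversion, or gap. The sketch below computes those fractions for one aligned segment. It assumes sequences contain only A, C, G, T and the gap character "-"; the function name is hypothetical, and the PCA step itself is not shown.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def column_profile(seq_a, seq_b):
    """Fractions of match (M), transition (S), transversion (V) and gap (G)
    columns in a pairwise alignment. Assumes only A, C, G, T and '-'."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = {"M": 0, "S": 0, "V": 0, "G": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            counts["G"] += 1
        elif a == b:
            counts["M"] += 1
        elif {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            counts["S"] += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            counts["V"] += 1  # purine<->pyrimidine
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: value / total for key, value in counts.items()}


# Toy alignment: 3 matches, 1 transition (C/T), 1 transversion (T/A), 1 gap.
toy_profile = column_profile("ACGT-A", "ATGAGA")
```

Each segment then becomes one point in (M, S, V, G) space, which is the space the principal components are computed in.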
23Distinguishing functional segments from
nonfunctional DNA after PCA
The data are distinguished along two orthogonal
directions, D1 dominated by gaps and D2
dominated by matches. A combination of directions
is more effective than one direction.
24Alignment scores derived from PCA do not (yet)
improve discrimination among classes of DNA
Values assigned to matches, transitions,
transversions and gaps were derived from
coefficients in directions in the PCA1-PCA2 plane
that discriminate among the classes of DNA.