Title: RNA-Seq technology and its application to dosage compensation between the X chromosome and autosomes in mammals
1RNA-Seq technology and its application to dosage
compensation between the X chromosome and
autosomes in mammals
2011-12-05
2Outline
- RNA-Seq technologies and their methodologies
- Application to the dosage compensation model
3RNA-Seq technologies and their methodologies
4Transcriptomics methods before RNA-Seq
- Hybridization-based approaches
  - Genomic tiling microarrays
  - Fluorescently labelled cDNA with microarrays
- Sequence-based approaches
  - Sanger sequencing of cDNA or EST libraries
  - Serial analysis of gene expression (SAGE)
  - Cap analysis of gene expression (CAGE)
  - Massively parallel signature sequencing (MPSS)
5A typical RNA-Seq experiment
6Sequencer used for RNA-Seq
- Illumina IG
- Applied Biosystems SOLiD
- Roche 454 Life Science
- Helicos Biosciences tSMS (has not yet been used
for published RNA-Seq studies, data from Jan.
2009)
7Direct RNA sequencing using the Helicos approach
(a) RNA that is polyadenylated and 3'
deoxy-blocked with poly(A) polymerase is captured
on poly(dT)-coated surfaces. A 'fill-and-lock'
step is performed, in which the 'fill' step is
performed with natural thymidine and polymerase,
and the 'lock' step is performed with
fluorescently labelled A, C and G Virtual
Terminator (VT) nucleotides and polymerase. This
step corrects for any misalignments that may be
present in poly(A) and poly(T) duplexes, and
ensures that the sequencing starts in the RNA
template rather than the polyadenylated tail. (b)
Imaging is performed to locate the positions of
the templates. Then, chemical cleavage of the
dye-nucleotide linker is performed to release the
dye and prepare the templates for nucleotide
incorporation. (c) Incubation of this surface
with one labelled nucleotide (C-VT is shown as an
example) and a polymerase mixture is carried out.
After this step, imaging is performed to locate
the templates that have incorporated the
nucleotide. Chemical cleavage of the dye allows
the surface and DNA templates to be ready for the
next nucleotide-addition cycle. Nucleotides are
added in the C, T, A, G order for 120 total
cycles (30 additions of each nucleotide).
8Advantages of RNA-Seq compared with other
transcriptomics methods
9Quantifying expression levels: RNA-Seq and
microarray compared
10Challenges for RNA-Seq
- Library construction
  - Bias in the results from different library construction methods (RNA fragmentation and cDNA fragmentation) for large RNAs
  - Strand-specific libraries are currently laborious to produce
- Bioinformatic challenges
  - The development of efficient methods to store, retrieve and process large amounts of data
  - Mapping reads to the genome
- Coverage versus cost
11DNA library preparation: RNA fragmentation and
DNA fragmentation compared
(a) Fragmentation of oligo-dT primed cDNA (blue
line) is more biased towards the 3' end of the
transcript. RNA fragmentation (red line) provides
more even coverage along the gene body, but is
relatively depleted for both the 5' and 3' ends.
Note that the ratio between the maximum and
minimum expression level (the dynamic range) is
44 for microarrays and 9,560 for RNA-Seq.
The tag count is the average sequencing coverage
for 5,000 yeast ORFs. (b) A specific yeast gene,
SES1 (seryl-tRNA synthetase), is shown.
12Coverage versus depth
13Methodologies for RNA-Seq studies
- Mapping transcription start sites
- Strand-specific RNA-Seq
- Characterization of alternative splicing patterns
- Gene fusion detection
- Targeted approaches using RNA-Seq
- Small RNA profiling
- Direct RNA sequencing
- Profiling low-quantity RNA samples
14Mapping transcription start sites (TSSs)
15Mapping transcription start sites (TSSs)
- Advantages
  - Requires only low quantities of input RNA
  - Paired-end sequencing enables identified TSSs to be assigned to specific transcripts
  - Paired-end sequencing alleviates the difficulty of aligning single short reads to repeat regions
- Disadvantages
  - Primer dimers can dominate sequencing data sets
  - Dependent on cDNA synthesis or hybridization steps
  - Challenging for short-lived transcripts
16Strand-specific RNA-Seq
- Adaptors with known orientations are ligated to the ends of RNAs or to first-strand cDNA molecules
- Direct sequencing of the first-strand cDNA products
- Selective chemical marking of the second-strand cDNA synthesis products or RNA
17Characterization of alternative splicing patterns
(a) Sequence reads are mapped to genomic DNA or
to a transcriptome reference to detect
alternative isoforms of an RNA transcript.
Mapping is based simply on read counts to each
exon and reads that span the exonic boundaries.
One infers the absence of the genomic exon in the
transcript by virtue of no reads mapping to the
genomic location. (b) Paired sequence reads
provide additional information about exonic
splicing events, as demonstrated by matching the
first read in one exon and placing the second
read in the downstream exon, creating a map of
the transcript structure.
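The inference described in this caption can be sketched in a few lines. This is only a minimal illustration, not the method of any particular aligner: the function name `find_skipped_exons`, the exon indexing and the toy read counts are all assumptions made for this example.

```python
def find_skipped_exons(exon_read_counts, junction_read_counts):
    """Infer exon skipping from per-exon and exon-exon junction read counts.

    exon_read_counts: list of read counts, one per exon in genomic order.
    junction_read_counts: dict mapping (upstream_exon, downstream_exon)
        index pairs to the number of reads spanning that boundary.
    Returns (exons bypassed by a supported junction, exons with no reads).
    """
    bypassed = set()
    for (upstream, downstream), n_reads in junction_read_counts.items():
        # A supported junction joining non-adjacent exons skips those between.
        if n_reads > 0 and downstream - upstream > 1:
            bypassed.update(range(upstream + 1, downstream))
    # Exons with no mapped reads are inferred to be absent from the transcript.
    uncovered = {i for i, n in enumerate(exon_read_counts) if n == 0}
    return sorted(bypassed), sorted(uncovered)


# Toy data (hypothetical): three exons, the middle one has no reads,
# and 12 reads join exon 0 directly to exon 2.
bypassed, uncovered = find_skipped_exons([40, 0, 35], {(0, 2): 12})
```

When both lists name the same exon, the zero read count and the spanning junction reads agree that the exon is missing from the isoform.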
18Gene fusion detection
19Targeted approaches using RNA-Seq
20Targeted approaches using RNA-Seq
21Small RNA profiling
22Direct RNA sequencing
23Profiling low-quantity RNA samples
(a) Single-molecule DNA and RNA sequencing
technologies could be modified for single-cell
applications. Cells can be delivered to flow
cells using fluidics systems, followed by cell
lysis and capture of mRNA species on the
poly(dT)-coated sequencing surfaces by
hybridization. Standard sequencing runs could
take place on channels with a 127.5 mm² surface
area, requiring 2,750 images to be taken per
cycle to image the entire channel area. The
surface area needed to accommodate the 350,000 mRNA
molecules contained in a single cell is 0.4 mm²;
thus, only eight images per cycle would be
needed. Sequence analysis can be done with direct
RNA sequencing (DRS) or on-surface cDNA synthesis
followed by single-molecule DNA sequencing. (b)
nCounter system workflow. Two probes are used for
each target site: the capture probe (shown in
red) contains a target-specific sequence and a
modification that allows the immobilization of
the molecules on a surface; the reporter probe
contains a different target-specific sequence
(shown in blue) and a fluorescent barcode (shown
by a green circle) that is unique to each target
being examined. After hybridization of the
capture and reporter probe mixture to RNA samples
in solution, excess probes are removed. The
hybridized RNA duplexes are then immobilized on a
surface and imaged to identify and count each
transcript with the unique fluorescent signals on
the capture and reporter probes.
24References
- Wang, Z. et al. RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10, 57 (2009).
- Ozsolak, F. et al. RNA sequencing: advances, challenges and opportunities. Nature Reviews Genetics 12, 87 (2011).
- Martin, J. A. et al. Next-generation transcriptome assembly. Nature Reviews Genetics 12, 671 (2011).
- Kapranov, P. et al. New class of gene-termini-associated human RNAs suggests a novel RNA copying mechanism. Nature 466, 642 (2010).
25Application to the dosage compensation model
28Background
- Ohno's hypothesis
- X-linked genes are expressed at twice the level of autosomal genes per active allele to balance the gene dose between the X chromosome and autosomes.
- Microarray data (X:AA ≈ 1)
29Abstract from Xiong et al
- Mammalian cells from both sexes typically contain
one active X chromosome but two sets of
autosomes. It has previously been hypothesized
that X-linked genes are expressed at twice the
level of autosomal genes per active allele to
balance the gene dose between the X chromosome
and autosomes (termed 'Ohno's hypothesis'). This
hypothesis was supported by the observation that
microarray-based gene expression levels were
indistinguishable between one X chromosome and
two autosomes (the X to two autosomes ratio
(X:AA) ≈ 1). Here we show that RNA sequencing
(RNA-Seq) is more sensitive than microarray and
that RNA-Seq data reveal an X:AA ratio of ≈0.5 in
human and mouse. In Caenorhabditis elegans
hermaphrodites, the X:AA ratio reduces
progressively from ≈1 in larvae to ≈0.5 in
adults. Proteomic data are consistent with the
RNA-Seq results and further suggest the lack of X
upregulation at the protein level. Together, our
findings reject Ohno's hypothesis, necessitating
a major revision of the current model of dosage
compensation in the evolution of sex chromosomes.
30Expression level definition
- Taking mouse as an example, we mapped all 25-mer
RNA-Seq reads to the genome sequence. Only those
reads uniquely mapped to exons were considered as
valid hits for a given gene. The expression level
of a gene is defined by the number of valid hits
to the gene divided by the effective length of
the gene, which is the total number of 25-mers in
the DNA sequences of the exons of the gene that
have no other matches anywhere in the genome. For
comparisons between tissues or developmental
stages, expression levels were normalized by
dividing by the total number of valid hits in the
sample.
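The definition above can be written out as a short calculation. The sketch below is an illustration only: the function name `expression_levels`, the dictionary inputs and the toy numbers are assumptions, and the counting of uniquely mappable 25-mers (the effective length) is taken as already done.

```python
def expression_levels(valid_hits, effective_lengths):
    """Per-gene expression: valid hits / effective length / total valid hits.

    valid_hits: dict of gene -> number of reads uniquely mapped to its exons.
    effective_lengths: dict of gene -> number of exonic 25-mers that have no
        other match in the genome (assumed to be precomputed).
    """
    total_hits = sum(valid_hits.values())
    if total_hits == 0:
        raise ValueError("sample has no valid hits")
    levels = {}
    for gene, hits in valid_hits.items():
        length = effective_lengths[gene]
        if length <= 0:
            # No uniquely mappable 25-mers: expression cannot be estimated.
            continue
        # Dividing by total_hits makes samples of different depth comparable.
        levels[gene] = hits / length / total_hits
    return levels


# Toy data (hypothetical counts and lengths).
hits = {"geneA": 200, "geneB": 50, "geneC": 0}
lengths = {"geneA": 1000, "geneB": 500, "geneC": 800}
levels = expression_levels(hits, lengths)
```

In this toy sample geneA has four times the reads of geneB but only twice the expression level, because its effective length is twice as long.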
31Comparison of gene expressions measured by
microarray and RNA-Seq
Human liver is considered unless otherwise noted.
(a) Estimation variation measured by the fold
difference of microarray intensities of two
same-target probesets or of RNA-Seq signals from
two halves of the same gene. (b) Identical to a,
except that mouse liver is considered here. (c)
Comparison of the internal consistency of RNA-Seq
data and microarray data. The expression
differences from one-half of the nucleotides
(RNA-Seq) or a probeset (microarray) are shown
for 1,000 randomly picked gene pairs, each with a
twofold ± 0.01-fold expression difference from
the other half of nucleotides (RNA-Seq) or from
the other probeset (microarray). The central bold
line shows the median, the box encompasses 50% of
data points and the error bars include 90% of
data points. (d) Pearson's correlation (r) of
microarray and RNA-Seq expression signals (gray)
and of RNA-Seq signals from two independent
experiments (black). A certain fraction of genes
(x axis) with the highest expression according to
one of the RNA-Seq datasets are examined. Error
bars show 95% confidence intervals estimated by
bootstrapping. (e) Microarray consistently
underestimates expression differences between
genes. The microarray expression differences of
1,000 randomly picked gene pairs, each with an x-fold
(x = 2 ± 0.01, 4 ± 0.02, 8 ± 0.04, 16 ± 0.08,
32 ± 0.16 and 64 ± 0.32) RNA-Seq expression
difference, are shown. The central bold line shows
the median, the box encompasses 50% of data
points and the error bars include 90% of data
points. (f) Relative liver expressions of 55
mouse genes, measured by RNA-Seq, microarray and
qRT-PCR.
32Comparisons of RNA-Seq gene expression levels
between the X chromosome and autosomes in 12
human tissues and 3 mouse tissues
(a) The median expression levels of X-linked
genes (closed diamonds) and autosomal genes (open
circles) are compared. Median expressions of
autosomal genes were normalized to 1. Error bars
show 95% bootstrap confidence intervals. Sex
information is listed in the parentheses after
the tissue names (M, male; F, female; NA,
unknown). (b) X:AA ratios of median expressions
from the human liver when X is compared to
individual autosomes. Error bars show 95%
bootstrap confidence intervals.
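A ratio of medians with a bootstrap confidence interval, as plotted in this figure, can be sketched as follows. This is a minimal percentile bootstrap under the assumption that genes are resampled independently; the function names, the number of resamples and the toy expression values are illustrative choices, not details taken from the study.

```python
import random
import statistics


def median_ratio(x_levels, autosomal_levels):
    """X-to-autosome (X:AA) ratio of median expression levels."""
    return statistics.median(x_levels) / statistics.median(autosomal_levels)


def bootstrap_ci(x_levels, autosomal_levels, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the ratio of medians.

    Genes are resampled with replacement, X-linked and autosomal separately.
    Assumes the resampled autosomal median is never zero.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        xs = rng.choices(x_levels, k=len(x_levels))
        autos = rng.choices(autosomal_levels, k=len(autosomal_levels))
        ratios.append(median_ratio(xs, autos))
    ratios.sort()
    lower = ratios[int(n_boot * alpha / 2)]
    upper = ratios[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy expression levels (hypothetical).
x_levels = [1.0, 2.0, 3.0, 4.0, 5.0]
autosomal_levels = [2.0, 4.0, 6.0, 8.0, 10.0]
ratio = median_ratio(x_levels, autosomal_levels)
lower, upper = bootstrap_ci(x_levels, autosomal_levels)
```

Resampling genes independently ignores any correlation between neighboring genes, so intervals obtained this way may be narrower than the true uncertainty.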
33Test upregulation in Ohno's hypothesis
- Upregulation in Ohno's hypothesis
- In Ohno's hypothesis, upregulation is needed for
those X-linked genes that had existed in the
genome before the emergence of the X chromosome;
X-linked genes that originated de novo on X
presumably do not require upregulation.
34Test upregulation in Ohno's hypothesis
35Comparison of RNA-Seq gene expression levels of
the X chromosome and autosomes in C. elegans
36Caveats in this RNA-Seq analysis
- The Illumina sequencing used here may be biased toward certain sequences or nucleotides.
- Reverse transcription during cDNA library preparation is likely to be less efficient for longer transcripts.
- GC content may affect RNA-Seq results.
- A recent study using time-course microarray data excluded lowly expressed genes, which is inappropriate for measuring the absolute value of the X:AA ratio.
38Main idea
- Here we contend that the low estimate of the X:AA
ratio by Xiong et al. stems from the
disproportionate contribution of
transcriptionally inactive genes, which are not
relevant for the evaluation of dosage
compensation mechanisms, to the X chromosome
average. We show that when only active genes are
considered, the RNA-seq data give X:AA ratios
closer to 1, and the observed minor deviation of
the X:AA ratio from 1 is within the range
expected when taking into account
chromosome-to-chromosome variability.
39Key notes
- RPKM: the number of associated reads per kilobase of exonic sequence per million of total reads sequenced.
- We assert that the effect of a mechanism that regulates transcriptional dosage compensation pertains only to the expression magnitude of transcriptionally active genes.
- The fraction of undetected (RPKM = 0) genes is substantially higher on the X chromosome than on autosomes, accounting for as much as 40% of all the X-linked genes.
- Threshold in the analysis: RPKM > 1 with at least 3 reads.
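The RPKM definition and the activity threshold from these notes can be expressed as a short sketch. The function names `rpkm` and `is_active` are hypothetical, and treating the threshold as a strict "greater than 1" comparison is an assumption based on the wording of this slide.

```python
def rpkm(read_count, exonic_length_bp, total_reads):
    """Reads per kilobase of exonic sequence per million total reads."""
    kilobases = exonic_length_bp / 1_000
    millions_of_reads = total_reads / 1_000_000
    return read_count / kilobases / millions_of_reads


def is_active(read_count, exonic_length_bp, total_reads,
              min_rpkm=1.0, min_reads=3):
    """Apply the slide's threshold: RPKM > 1 with at least 3 reads.

    The strict comparison on RPKM is an assumption of this sketch.
    """
    if read_count < min_reads:
        return False
    return rpkm(read_count, exonic_length_bp, total_reads) > min_rpkm


# Toy values (hypothetical): 100 reads on 2 kb of exons, 10 million reads.
example_rpkm = rpkm(100, 2_000, 10_000_000)
# A very short gene can exceed the RPKM cut-off with only 2 reads,
# which is why the read-count floor is applied as well.
short_gene_active = is_active(2, 100, 1_000_000)
```

The minimum read count guards against short genes whose RPKM is inflated by one or two reads.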
40Fraction of transcriptionally inactive genes on
autosomes and X chromosome
41The ratio of the median transcription magnitudes
of X-linked and autosomal genes
The X:AA ratio estimates are shown based on the
set of genes with minimal transcription (RPKM
threshold of 1 and at least 3 associated reads).
Black error bars show the 95% confidence interval
(CI) based on bootstrap estimates incorrectly
assuming independence of expression levels for
neighboring genes (plotted here for reference; not
used to make inferences). Red bars show the range
around 1 into which the X:AA ratio is expected to
fall (95% CI) in the presence of twofold
upregulation of the X chromosome, taking into
account interchromosomal variation (sampling of
contiguous blocks of X-chromosome size from the
autosomal portion of the genome). The observed
X:AA values (black dots) in all tissues fall
within this range, indicating that the observed
transcriptional magnitude of X-linked genes is
compatible with the presence of twofold
upregulation. The blue bars show the range around
0.5 into which the X:AA ratio is expected to fall
in the absence of X-chromosome upregulation (50%
of the autosomal expression level). The X:AA
estimates for the first five samples fall outside
of this range, indicating that the X-linked
expression magnitude is significantly higher than
that expected in the absence of dosage
compensation. The X:AA values for other samples
are within both the red and blue ranges,
indicating that the two hypotheses (X:AA of 1 and
X:AA of 0.5) cannot be clearly distinguished based
on these individual data sets.
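The idea of sampling contiguous autosomal blocks to estimate the expected range of the ratio can be sketched as follows. This is a simplified illustration and not the authors' procedure: it treats the autosomal genes as one ordered list, ignores chromosome boundaries, and uses hypothetical names (`block_ratio_range`, `scale`) and toy data.

```python
import random
import statistics


def block_ratio_range(autosomal_levels, block_size, scale=1.0,
                      n_samples=1000, alpha=0.05, seed=0):
    """Range of block-to-autosome median ratios for random contiguous blocks.

    autosomal_levels: expression levels of autosomal genes in genomic order.
    block_size: number of genes per block (for example, the X-linked count).
    scale: 1.0 models twofold upregulation (expected ratio near 1);
           0.5 models no upregulation (expected ratio near 0.5).
    """
    if not 0 < block_size <= len(autosomal_levels):
        raise ValueError("block_size must fit within the autosomal gene list")
    rng = random.Random(seed)
    autosomal_median = statistics.median(autosomal_levels)
    ratios = []
    for _ in range(n_samples):
        start = rng.randrange(len(autosomal_levels) - block_size + 1)
        block = autosomal_levels[start:start + block_size]
        ratios.append(scale * statistics.median(block) / autosomal_median)
    ratios.sort()
    lower = ratios[int(n_samples * alpha / 2)]
    upper = ratios[int(n_samples * (1 - alpha / 2)) - 1]
    return lower, upper


# Toy genome (hypothetical): 200 ordered genes with varying expression.
toy_rng = random.Random(1)
toy_levels = [toy_rng.uniform(1.0, 10.0) for _ in range(200)]
range_with_upregulation = block_ratio_range(toy_levels, block_size=20)
range_without_upregulation = block_ratio_range(toy_levels, block_size=20,
                                               scale=0.5)
```

Because whole blocks are sampled rather than individual genes, any correlation between neighboring genes is carried into the resulting range.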
42The chr. 10:A and chr. 11:A ratios illustrating
chromosome-to-chromosome variability
43Mouse RNA-seq data shows a lack of dosage
compensation
44Dependence of the X:AA estimates on the RPKM
threshold
The tissue-averaged X:AA estimates are
shown (black) as a function of the minimal RPKM
threshold, from 0 (all genes, including those
with undetected expression) to RPKM 2. The error
bars correspond to the s.e.m. between different
tissues. The largest change in the ratio is
observed after exclusion of genes with undetected
expression (RPKM > 0). As the RPKM thresholds
increase, the X:AA ratio largely stabilizes above
RPKM 1. The application of an RPKM threshold
increases the median expression level and can
artificially shift the X:AA ratio closer to 1.
The shaded gray region shows the 95% confidence
envelope for the hypothetical X chromosome that
is expressed at 50% of the autosomal level (see
Supplementary Methods). For non-zero RPKM
thresholds, the observed X:AA ratios lie outside
of this 95% confidence interval, showing that the
X:AA ratios are increased more than is
expected from only setting an RPKM threshold.
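How the ratio of medians responds to the threshold can be illustrated with a small sweep. The sketch below uses hypothetical names and toy RPKM values; `None` stands for "all genes, including undetected ones", and each numeric threshold keeps genes with RPKM strictly above it, which is an assumption of this example.

```python
import statistics


def ratio_by_threshold(x_rpkm, autosomal_rpkm, thresholds):
    """X-to-autosome (X:AA) median ratio at each minimal RPKM threshold."""
    result = {}
    for threshold in thresholds:
        if threshold is None:
            # Keep every gene, including those with undetected expression.
            xs, autos = list(x_rpkm), list(autosomal_rpkm)
        else:
            xs = [v for v in x_rpkm if v > threshold]
            autos = [v for v in autosomal_rpkm if v > threshold]
        result[threshold] = statistics.median(xs) / statistics.median(autos)
    return result


# Toy RPKM values (hypothetical): half of the X-linked genes are undetected.
x_rpkm = [0, 0, 0, 1, 2, 4]
autosomal_rpkm = [0, 2, 4, 6, 8, 10]
ratios = ratio_by_threshold(x_rpkm, autosomal_rpkm, [None, 0, 1])
```

In this toy data the largest jump in the ratio comes from dropping the undetected genes, and a further threshold raises it again.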
45Discussion