Title: Semester project: Microarrays and Statistics Part 2 of 2: Introduction to Microarrays
1 Semester project: Microarrays and Statistics, Part 2 of 2: Introduction to Microarrays
2 Alberto Macias-Duarte, BioME-iPC Fellow, Graduate Student, School of Natural Resources, University of Arizona
3 Definitions
- GENETICS: the study of heritable traits (genes).
- GENES: the information to make proteins.
- PROTEINS: building blocks for the machines that perform life's functions within all cells.
- GENOME: contains all the genes and more; a blueprint for the individual.
- GENOMICS: tools to study ALL the genes of an individual (humans, animals, plants, fungi, bacteria).
4 Genes and chromosomes
6 Genes, Proteins and Molecular Machines
- Our DNA contains our genes
- DNA is composed of four chemicals symbolized by A, T, C and G, which pair A-T and C-G
- The order of the letters determines the protein each gene makes
- The types of proteins, and where and when they are made, make a person a person and a plant a plant
- BUT proteins are not made directly from DNA
- In many organisms (us and plants) most of the DNA is not genes.
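The A-T and C-G pairing rule can be illustrated with a few lines of Python. This is only a sketch: the function names and the example strands are made up for illustration and do not come from the slides.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
PAIRING = str.maketrans("ATCG", "TAGC")


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return strand.upper().translate(PAIRING)


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATCG"))
    print(reverse_complement("AACG"))
```

Because each letter has exactly one partner, knowing one strand is enough to reconstruct the other.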
7 The Central Dogma of Molecular Biology
- DNA codes for the production of RNA, and RNA codes for the production of protein (polypeptides)
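A minimal Python sketch of the two steps, DNA to RNA and RNA to protein, might look like the following. It assumes the DNA is given as the coding (sense) strand, so transcription amounts to replacing T with U, and it uses a deliberately partial codon table (the real genetic code has 64 codons). The function names and the example sequence are hypothetical.

```python
# Partial codon table for illustration only; the full genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "Met", "GCU": "Ala", "GGC": "Gly", "UUU": "Phe",
    "AAA": "Lys", "UGG": "Trp", "UAA": "Stop",
}


def transcribe(coding_dna: str) -> str:
    """DNA -> mRNA, assuming the input is the coding strand (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list[str]:
    """mRNA -> list of amino acids, reading three letters at a time."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" = codon not in the partial table
        if residue == "Stop":
            break
        protein.append(residue)
    return protein


if __name__ == "__main__":
    mrna = transcribe("ATGGCTTAA")
    print(mrna, translate(mrna))
```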
8 The Central Dogma of Molecular Biology
9 Five stages to understanding genomes
1. Identify the order of the A's, T's, C's and G's
2. Determine where the genes are
3. Determine what each gene does
4. Determine where each gene is expressed and how it is controlled
5. Determine the nature and function of the non-genic DNA
10 Why is genome and gene information so important?
- Provides information and tools to address a large number of biological questions, from ecosystems to molecular networks
- Questions can be tackled now that could not be tackled before
- Information can be used to improve human health, agriculture and the environment
11 Five stages to understanding genomes
1. Identify the order of the A's, T's, C's and G's
The completed human genome sequence was announced in 2003; the first plant genome sequence (Arabidopsis thaliana) was published in 2000.
Explosion of genome sequences since that time!
12 Five stages to understanding genomes
2. Determine where the genes are
13 These are often much larger than genes
14 What kinds of sequences are in non-coding regions?
- Simple sequence repeats, e.g. (CAA)n
- Transposable elements
  - DNA to DNA
  - DNA to RNA to DNA (retrotransposons)
- Pseudogenes (pieces of genes that are non-functional)
- Other
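As an illustration of what a simple sequence repeat such as (CAA)n looks like in practice, the sketch below scans a sequence for runs of a repeated unit. The function name, the minimum of three repeats and the example sequence are assumptions chosen for the example, not values from the slides.

```python
import re


def find_ssr(sequence, unit="CAA", min_repeats=3):
    """Return (start_position, repeat_count) for each run of `unit` repeated at least `min_repeats` times."""
    pattern = re.compile(f"(?:{re.escape(unit)}){{{min_repeats},}}")
    return [
        (match.start(), len(match.group()) // len(unit))
        for match in pattern.finditer(sequence.upper())
    ]


if __name__ == "__main__":
    # Hypothetical sequence containing (CAA)4 starting at position 2.
    print(find_ssr("GGCAACAACAACAATT"))
```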
15 In Eukaryotes RNA from Genes Is Extensively Processed to Form mRNA
- mRNAs are products of splicing of the primary transcript
Figure labels: exon, intron
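Splicing can be pictured as cutting the introns out of the primary transcript and joining the exons in order. The toy sketch below assumes the exon coordinates are already known; the sequences and the function name are invented for illustration.

```python
def splice(primary_transcript, exons):
    """Join the exon segments, given as (start, end) pairs, in transcript order."""
    return "".join(primary_transcript[start:end] for start, end in sorted(exons))


# Hypothetical primary transcript: exon 1, one intron, exon 2.
EXON_1 = "AUGGCU"
INTRON = "GUAAGUUUCAG"
EXON_2 = "UUUUAA"
PRIMARY = EXON_1 + INTRON + EXON_2
EXONS = [
    (0, len(EXON_1)),
    (len(EXON_1) + len(INTRON), len(PRIMARY)),
]

if __name__ == "__main__":
    print(splice(PRIMARY, EXONS))
```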
16 Five stages to understanding genomes
3. Determine what each gene does
17 Findings from comparing genomes
1. Many genes (50%) are very similar between diverse species: highly conserved domains.
2. The genes that are shared often carry out key functions that are similar in different species, e.g. transcription and protein synthesis.
3. We can hypothesize the function of a gene in one species if it shares domains with a gene of known function from another species.
4. The DNA sequences that are not genes are rarely shared between species, but there are common classes of sequences in many organisms, e.g. transposons, simple sequence repeats, etc.
5. Sizes of intergenic regions vary dramatically even between closely related species; this is a major contributor to differences in genome sizes.
18 Five stages to understanding genomes
4. Determine where each gene is expressed and how it is controlled
All the parts of an animal's body have the same DNA, yet different tissues have distinct functions.
Only a subset of genes is functioning in each tissue or organ.
How do different cells know which genes should function and which should not, and how are only the correct genes expressed?
This is the field of gene regulation.
19 A powerful tool to investigate gene regulation
- Microarrays
- Monitor the whole genome in a single experiment
- Identify genes that are differentially expressed
between two treatments, tissues, developmental
stages or genotypes
20 cDNA
Expressed Sequence Tag
Sequence the cDNAs made from different tissues,
developmental stages or disease states, etc.
21 Microarrays
Each spot is a different gene immobilized onto a slide. One experiment can look at 50,000 genes. We can ask which genes are active and which are not under different conditions, in different tissues, in healthy and diseased individuals, or in wild-type and mutant plants.
22 Genes more active in Tissue 1
Genes more active in Tissue 2
Genes active to the same extent in both tissues
Courtesy of R. Elumalai
23 Monogenic Traits
- Usually on a discrete scale (absence vs. presence; O, A, B blood types; etc.)
- Determined by a single gene
24 Quantitative Traits
- Usually on a continuous scale (weight, yield, height, drought tolerance, etc.)
- Determined by many genes (polygenic)
25 Microarrays
26 Why do we need statistical inference in microarray data analysis?
1. mRNA content is a variable trait
- among individuals
- among tissue types within individuals
- among cells within a tissue type
2. It is not practical to sample every individual, tissue, or cell in the population of interest
27 Why do we need statistical inference in microarray data analysis?
- Statistics allow us to make statements about an entire population from a sample of that population
- The uncertainty about the statement is quantified (the famous P-value)
28 Simple example of a microarray experiment and data analysis
Gene expression in wheat (Triticum) seedlings exposed to high salinity
29 Microarray experiment: 2 treatments, 4 genes and 5 replicates
Treatment 1: Seedlings growing in saline soil
Treatment 2: Seedlings growing in normal soil
30 Microarray chip with 4 probes
Gene 1
Gene 2
Gene 3
Gene 4
31 Two treatments and 6 replicates
32 Assignment of dyes to each treatment
Treatment 1: Seedlings growing in saline soil (red channel)
Treatment 2: Seedlings growing in normal soil (green channel)
33 Expression of genes under salinity stress
Treatment 1: Seedlings growing in saline soil (red channel)
Microarray
Treatment 2: Seedlings growing in normal soil (green channel)
36 Results for Gene 1
37 Data for Gene 1
Fluorescence intensity units
38 Data for Gene 1
Do replicates look comparable?
39 Data for Gene 1
Normalization makes every replicate comparable
40 Data for Gene 1
One simple normalization is the ratio R/G (red channel intensity divided by green channel intensity)
43 A twofold induction or repression in an experimental sample, relative to the reference sample, is commonly taken to indicate a meaningful change in gene expression
44 Data for Gene 1
Further normalization: log2-transformation
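The ratio and log2 steps can be sketched in Python as follows. The intensity values are hypothetical (the slide's data table did not survive in this transcript), and the function names are illustrative.

```python
import math


def log2_ratios(red, green):
    """log2(R/G) for each replicate; red and green are paired intensity lists."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def is_twofold(log2_ratio):
    """True when the change is at least twofold up (>= +1) or down (<= -1)."""
    return abs(log2_ratio) >= 1.0


# Hypothetical fluorescence intensities for one gene across 5 replicates.
RED = [1200, 850, 2300, 640, 1500]      # saline soil
GREEN = [1000, 1000, 2000, 800, 1500]   # normal soil

if __name__ == "__main__":
    for value in log2_ratios(RED, GREEN):
        print(f"{value:+.3f}  twofold: {is_twofold(value)}")
```

On the log2 scale induction and repression are symmetric: a ratio of 2 becomes +1, a ratio of 0.5 becomes -1, and no change becomes 0.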
45 Hypothesis test
Let M be the mean log2(Ratio) for the population of all seedlings growing in saline soils compared to seedlings growing in normal soils
How can you state the hypothesis of a difference in gene expression between the treatments in terms of M?
46 Hypothesis test
How can you state the null hypothesis of no
difference in gene expression between the
treatments in terms of M?
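One way to write the two hypotheses, consistent with the later slides that test M = 0 against M ≠ 0, is shown below. The labels H_0 and H_A are conventional notation assumed here, not taken from the slides.

```latex
H_0 : M = 0 \quad \text{(no difference in expression between treatments)}
\qquad
H_A : M \neq 0 \quad \text{(expression differs between treatments)}
```

M = 0 corresponds to a ratio R/G of 1, because log2(1) = 0.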
47 Data for Gene 1
Does the M from our sample support the no-effect hypothesis or the effect hypothesis?
Msample = -0.019
48 Data for Gene 1
How certain are we that it is or is not supportive?
Msample = -0.019
49 The famous P-value
- The P-value is the probability that chance alone would produce a result at least as extreme as the one observed, assuming the no-effect (null) hypothesis is true
- A small P-value means that declaring an effect is unlikely to be a false alarm produced by chance when the null hypothesis is true
- The P-value threshold accepted in science is usually < 0.05
50 Hypothesis testing
We cannot take many samples and calculate Msample from each to see how likely or unlikely Msample = -0.019 is to occur if M = 0 is true
51 Hypothesis testing
Instead, mathematics shows that the t-statistic has a very well-known distribution, i.e., we can know the probability of any range of values of t
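The slides do not spell out the formula. Assuming a one-sample t-test of the log2 ratios against M = 0, the statistic would be as follows, where s (the sample standard deviation of the log2 ratios) and n (the number of replicates) are notation introduced here:

```latex
t = \frac{M_{\text{sample}} - 0}{s / \sqrt{n}},
\qquad
\text{degrees of freedom} = n - 1
```

A value of t near 0 means the sample mean is small relative to its standard error, which is what we expect when M = 0 is true.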
52 Hypothesis testing
Let's do a little bit of work in MS Excel to calculate the t-statistic and the P-value
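For readers without Excel, the same calculation can be sketched in Python using only the standard library. This assumes a one-sample, two-sided t-test of the log2 ratios against 0; the five log2 ratios below are hypothetical stand-ins, because the slide's data did not survive in this transcript. The P-value is obtained by numerically integrating the t density with Simpson's rule.

```python
import math
import statistics


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided P-value: probability of |T| >= |t| when the null hypothesis is true."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # probability of -|t| < T < |t|
    return max(0.0, 1.0 - central_area)


def one_sample_t(log2_ratios, null_mean=0.0):
    """Return (t statistic, degrees of freedom) for testing mean == null_mean."""
    n = len(log2_ratios)
    mean = statistics.mean(log2_ratios)
    std_error = statistics.stdev(log2_ratios) / math.sqrt(n)
    return (mean - null_mean) / std_error, n - 1


# Hypothetical log2(R/G) values for one gene, 5 replicates.
HYPOTHETICAL_LOG2_RATIOS = [0.2, -0.3, 0.1, -0.2, 0.1]

if __name__ == "__main__":
    t, df = one_sample_t(HYPOTHETICAL_LOG2_RATIOS)
    print(f"t = {t:.3f}, df = {df}, P = {two_sided_p(t, df):.2f}")
```

As a cross-check against the slides, `two_sided_p(-0.065, 4)` gives about 0.95, which matches the P-value reported on the next slide if 5 replicates (4 degrees of freedom) are assumed.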
53 Hypothesis testing
It turns out that t = -0.065 and P = 0.95, meaning that if M = 0 is true, a sample mean at least this far from 0 would occur by chance 95% of the time; i.e., the data are fully consistent with the hypothesis that M = 0
54 Hypothesis testing
and therefore we find no evidence that the expression level of gene 1 differs between the two treatments. Gene 1 may not be involved in the response of wheat seedlings to salinity
Treatment 1: Seedlings growing in saline soil
Treatment 2: Seedlings growing in normal soil
55 Thanks!