Title: Discovery of higher-order functional domains in the human genome
1. Discovery of higher-order functional domains in
the human genome
- Bob Thurman
- Noble Lab, Department of Genome Sciences
- University of Washington
- Seattle
- 11 October, 2006
- ASHG 2006, New Orleans
2. Idea
- The idea of functional domains has been around for a
long time.
- Now a variety of functional datasets are
available in nearly continuous fashion across
the genome (ENCODE).
- Apply modern computational techniques to these
datasets to delineate large-scale functionally
active and inactive regions of the genome.
[Figure: ENm005 (1.7 Mb); tracks RNA, H3K27me3, H3ac, TR50; domain labels A and I, presumably active and inactive]
3. The tools
- Use wavelets to smooth disparate datasets out to
a common scale.
[Figure: smoothed data at the 1.6 kb scale and the 6.4 kb scale]
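To illustrate smoothing out to a common scale, here is a minimal sketch using the Haar wavelet, where keeping only the level-j approximation replaces each block of 2^j samples with its mean. The wavelet family, the function name `haar_smooth`, and the 50 bp sampling interval are assumptions for illustration; the talk does not specify them.

```python
def haar_smooth(signal, level):
    """Haar approximation at `level`: each block of 2**level samples
    is replaced by its mean (a short final block is averaged as-is)."""
    block = 2 ** level
    smoothed = []
    for start in range(0, len(signal), block):
        chunk = signal[start:start + block]
        mean = sum(chunk) / len(chunk)
        smoothed.extend([mean] * len(chunk))
    return smoothed

# Toy track (made-up values, not real data).
track = [1, 3, 5, 7, 2, 2, 4, 4]
level_1 = haar_smooth(track, 1)   # blocks of 2 samples
level_2 = haar_smooth(track, 2)   # blocks of 4 samples

# Assuming a track sampled every 50 bp, level 5 spans 32 * 50 bp = 1.6 kb
# and level 7 spans 128 * 50 bp = 6.4 kb.
```

Two tracks collected at different resolutions can be compared once each has been smoothed to the same physical scale.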
4. The tools
- Use Hidden Markov Models (HMMs) to segment
regions based on data.
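As a rough sketch of HMM-based segmentation, the example below decodes a two-state HMM with Gaussian emissions using the Viterbi algorithm. The state names, transition probabilities, and emission parameters are hypothetical; in practice they would be fitted to the data, and this is not the HMMseg implementation.

```python
import math

def gaussian_logpdf(x, mean, sd):
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)

def viterbi(obs, states, log_start, log_trans, emit):
    """Most likely state path for an HMM with Gaussian emissions."""
    # score[s]: best log-probability of any path ending in state s
    score = {s: log_start[s] + gaussian_logpdf(obs[0], *emit[s]) for s in states}
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + log_trans[r][s])
            pointers[s] = best
            score[s] = prev[best] + log_trans[best][s] + gaussian_logpdf(x, *emit[s])
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical model: "A" (active) has a high mean signal, "I" (inactive) a low one.
states = ("A", "I")
log_start = {"A": math.log(0.5), "I": math.log(0.5)}
stay, switch = math.log(0.9), math.log(0.1)
log_trans = {"A": {"A": stay, "I": switch}, "I": {"A": switch, "I": stay}}
emit = {"A": (2.0, 1.0), "I": (0.0, 1.0)}   # (mean, sd)

signal = [2.1, 1.8, 2.4, 0.9, 2.2, 0.1, -0.3, 0.2, 0.0]   # made-up smoothed values
segmentation = viterbi(signal, states, log_start, log_trans, emit)
```

The high self-transition probability discourages short segments, so the single dip to 0.9 stays inside the active segment rather than triggering a switch.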
5. The procedure
6. Results: single-track segmentations
[Figure: H3ac segmentation, ENm005 (1.7 Mb); domain labels A and I]
7. Concordance of segmentations
- TR50 generally concordant with everything except
RNA
- H3ac generally concordant with everything except
H3K27me3
- H3K27me3 not very concordant with anything except
TR50
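One simple way to quantify concordance between two segmentations is the fraction of positions that receive the same label. The sketch below assumes that definition and uses made-up label strings; the talk does not state which measure was used.

```python
def concordance(seg_a, seg_b):
    """Fraction of positions at which two segmentations assign the same label."""
    if len(seg_a) != len(seg_b):
        raise ValueError("segmentations must cover the same positions")
    matches = sum(a == b for a, b in zip(seg_a, seg_b))
    return matches / len(seg_a)

# Hypothetical per-bin labels (A = active, I = inactive); not real data.
track_1 = "AAAAIIIIAA"
track_2 = "AAAIIIIIAA"
agreement = concordance(track_1, track_2)   # the two differ at one bin out of ten
```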
8. Simultaneous four-track segmentation
[Figure: four-track segmentation with domain labels A and I; panel: concordance with single-track]
9. Enrichment of annotated genomic features
[Figure: enrichment of annotated genomic features. Labels: CpG Islands; Gencode Tx Starts; mRNA Tx Starts; Spliced EST Tx Starts; EST overlap; Repeats; LTR; SINE Alus; LINEs (L1); LINEs (L2); DNA transposons; Simple repeats; Conserved Elements (ENCODE MSA); strict; moderate; loose; All; non-exonic]
10. Conserved non-coding sequence
- Against expectations, active domains are somewhat
depleted in CNS (18% depleted relative to random
expectation).
- Does adding a CNS track add anything? No. Even in
terms of CN elements, there is little difference.
In addition, there is very poor single-track
concordance with other data types.
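Enrichment or depletion relative to random expectation can be expressed as an observed/expected ratio. The sketch below assumes features are counted in base pairs and that random expectation means coverage proportional to domain size; the function name and all numbers are made up for illustration.

```python
def enrichment(feature_bp_in_domains, domain_bp, feature_bp_total, region_bp):
    """Observed/expected coverage of a feature inside a set of domains.
    Expected coverage assumes the feature is spread uniformly over the region."""
    expected = feature_bp_total * domain_bp / region_bp
    return feature_bp_in_domains / expected

# Hypothetical numbers: domains cover 10 Mb of a 30 Mb region that holds
# 1.5 Mb of the feature, so 0.5 Mb is expected inside the domains by chance.
ratio = enrichment(
    feature_bp_in_domains=410_000,
    domain_bp=10_000_000,
    feature_bp_total=1_500_000,
    region_bp=30_000_000,
)
# A ratio of 0.82 is 18% below random expectation; a ratio above 1 is enrichment.
```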
11. Cell-type differences
- Data for two different cell types (GM06990 and
HeLaS3) is available for H3ac and RNA expression.
- Single-track segmentation concordance between
cell types
12. Future work
- Scale up!
- More data types
- More cell types
- Model organisms: whole-genome functional data
already available
13. Acknowledgments
- John Stamatoyannopoulos
- Bill Noble and the Noble lab
- Nathan Day
- Andrew Hemmaplardh
- HMMseg, a Java program for multi-variate HMM
segmentation with optional wavelet smoothing, to
be released soon
14. FIN
15. The data
- ENCODE project: identify all functional elements
of the genome.
- Pilot phase looks at 1% (30 Mb) of the genome,
comprising the ENCODE regions.
- Functional datasets collected in a common cell line,
HeLaS3.
- Histone modifications: H3ac (Sanger) and H3K27me3
(UCSD)
- RNA transcription levels (Affymetrix)
- DNA replication timing, TR50 (University of
Virginia)
16. 4-state segmentation stats
17. Gene Ontology (GO) analysis
- Any classes of genes over-represented in
active/inactive states?
- Over-representation of genes involved in signal
transduction (particularly olfactory
G-protein-coupled receptors) within repressed
domains
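A common way to test for over-representation of a gene class is a one-sided hypergeometric test. The sketch below assumes that test, which the talk does not name, and uses hypothetical gene counts.

```python
from math import comb

def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when `draws` genes are sampled without replacement from
    `population` genes, `successes` of which carry the annotation."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)

# Hypothetical counts: 400 genes in the region, 40 with the GO term,
# 120 genes in repressed domains, 25 of which carry the term.
p_value = hypergeom_sf(k=25, population=400, successes=40, draws=120)
```

A small `p_value` would suggest the term appears in repressed domains more often than chance alone predicts. A real analysis would also correct for testing many GO terms.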
18. Enrichment of annotated genomic features
[Figure panels: Affy RNA, H3K27me3, TR50]
19. Enrichment of annotated genomic features
[Figure panel: H3ac; domain labels A and I]
20. High-confidence regions
- Intersection of 4 individual segmentations.
- 10.3Mb total (over 1/3 of ENCODE), and almost all
of that (except for 1.5kb) is concordant with the
4-track segmentation.
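The intersection of individual segmentations can be sketched as keeping only the positions where every track assigns the same label, then merging adjacent positions into intervals. The bin size, label strings, and function name below are assumptions for illustration.

```python
def consensus_regions(segmentations, bin_bp):
    """Return (start_bp, end_bp, label) intervals where all segmentations agree."""
    regions = []
    for i, labels in enumerate(zip(*segmentations)):
        if len(set(labels)) != 1:
            continue                      # tracks disagree: not high-confidence
        label, start = labels[0], i * bin_bp
        if regions and regions[-1][2] == label and regions[-1][1] == start:
            # Same label as the interval ending here: extend it.
            regions[-1] = (regions[-1][0], start + bin_bp, label)
        else:
            regions.append((start, start + bin_bp, label))
    return regions

# Four hypothetical single-track segmentations over six 1 kb bins.
tracks = ["AAAIIA", "AAIIIA", "AAAIIA", "AAAIAA"]
regions = consensus_regions(tracks, bin_bp=1000)
total_bp = sum(end - start for start, end, _ in regions)
```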
21. Outline
- Continuous genome-wide data and scale
- Tools for analysis
- Wavelets and HMMs
- Results
- Definition of functional domains
22. Continuous genome-wide data and scale
- A wide variety of nearly continuous genomic data
is now available in 30 Mb (1%) of the human
genome comprising the ENCODE regions.
- We focus on the issue of scale. Use wavelets to:
- normalize datasets collected on widely disparate
scales and
- elucidate trends in single and combined datasets
at large scales.
23. Segment regions using wavelets and Hidden Markov
Models (HMMs)
Approximate scale of 60kb gives minimum segment
size of 10kb.