Synopsis - PowerPoint PPT Presentation


Provided by: tra51

Learn more at: https://fire.biol.wwu.edu


Transcript and Presenter's Notes

Title: Synopsis

2
Synopsis

It has been estimated that at least 40% of the
total human genome sequence contains the
integrated fragments of genomic parasites
Retroviruses, Retrotransposons, DNA transposons,
and parvoviruses can efficiently insert new
sequence into the human genome
These integrating elements can be powerful tools
for discovering . . .

3
What genomic features affect integration?

Each element shows a different pattern of
favorable integration sites
Favored specific nucleotide sequences can be
detected in the target DNA at the point of
integration for most of these elements
Post-integration genomic DNA is harvested, and
the DNA flanking the integrated element is cloned
and sequenced

4
Intention

Present a comprehensive statistical comparison
of the factors influencing integration frequency
by annotating each base pair in the human genome
for its likelihood of hosting integration events

5
Framework

7 types of integrating elements
17 different integration complexes (datasets)
200 variables (genomic features)
10,000 integration sites

6
Previous research provided extensive insertion
site data

HIV favors integration in active transcription
units (TUs)
MLV favors integration near gene 5' ends
ASLV integration is mostly random, but TUs seem
to be favored slightly

TUs are defined as regions of transcribed DNA
7
Previous research had provided extensive
insertion site data

SFV integration is mostly random, but is favored
slightly near CpG islands
SB favors integration in transcription units.
AAV-based vectors show a modest preference for
regions near transcription start sites
Experiments concerning whether LINEs prefer to
integrate within TUs have been inconclusive

8
Some Variables (Genomic Features)

Genes and Exons: Indicator variables for whether
the site falls into a gene or an exon
Gene or Expression Density: The number of genes
or expressed genes per base pair in the region
surrounding the integration site
DNase I Site Density: The number or density of
DNase I sites in regions surrounding the
integration site

9
Some Variables (Genomic Features)

GC Content: The GC percent in the 5 kb region
containing the site
CpG Islands: The site is in a CpG island
CpG Island Density: The number or density of CpG
islands in the region surrounding the site
Transcription Start/Stop Features: The relation
of the site to transcription start/stop position
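
The GC Content feature above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names `gc_percent` and `window_around`, the use of fixed non-overlapping bins, and the decision to ignore ambiguous bases such as N are all assumptions.

```python
def gc_percent(sequence: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / total


def window_around(genome: str, site: int, width: int = 5000) -> str:
    """Return the fixed-width bin that contains `site` (0-based position).

    Assumption: the genome is cut into consecutive bins of `width` bases,
    and the feature is computed on the bin the site falls into.
    """
    start = (site // width) * width
    return genome[start:start + width]


if __name__ == "__main__":
    toy_genome = "AAAACCCCGGGG"  # toy sequence, not real data
    region = window_around(toy_genome, site=5, width=4)
    print(region, gc_percent(region))
```

A window centered on the site would be an equally reasonable reading of "the region containing the site"; the slides do not say which was used.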

10
Some Variables (Genomic Features)

Positional Weight in Flanking Sequence: The
log-likelihood for integration versus control site
at each position in twenty bases of flanking
sequence (10 upstream and 10 downstream) and
their sum
Log-likelihood is defined as the log ratio of the
frequency of each of the four bases at each
position to the frequency in the controls
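
As a rough sketch of the positional log-likelihood described above: for each position, take the log of the base frequency among integration sites divided by the base frequency among controls, then sum the per-position values for a given flanking sequence. The function names and the pseudocount (added so that an unseen base does not give a log of zero) are assumptions for illustration, and the sketch assumes equal-length sequences made up only of A, C, G, and T.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(seqs, pseudocount=1.0):
    """Per-position base frequencies for a list of equal-length sequences."""
    length = len(seqs[0])
    total = len(seqs) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def log_likelihood_weights(integration_seqs, control_seqs, pseudocount=1.0):
    """Log ratio of integration to control frequency, per position and base."""
    f_int = position_frequencies(integration_seqs, pseudocount)
    f_ctl = position_frequencies(control_seqs, pseudocount)
    return [{b: math.log(p[b] / q[b]) for b in BASES}
            for p, q in zip(f_int, f_ctl)]


def flank_score(seq, weights):
    """Sum of the per-position log-likelihoods for one flanking sequence."""
    return sum(w[base] for w, base in zip(weights, seq))


if __name__ == "__main__":
    # Toy 4-base flanks; the slides use 20 bases (10 upstream, 10 downstream).
    weights = log_likelihood_weights(["AAAA", "AAAT"], ["CCCC", "CCCG"])
    print(flank_score("AAAA", weights), flank_score("CCCC", weights))
```

A positive sum means the flanking sequence looks more like the integration sites than the controls; a negative sum means the opposite.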

11
Integration Complexes (Datasets)
12
Control Site Generation

Each dataset has one of two types of control
Matched (preferred): the integration sites were
created using a restriction enzyme. The control
site matches the distance from the nearest
restriction site in the direction of
transcription
Random: The control site is merely a random
sequence from the genome

13
The ROC Curve

Used to analyze the effects of genomic features
on integration
Provides a measure of a predictor variable's
ability to discriminate between two classes of
events
This measure can be interpreted as the
probability that a randomly drawn integration
site will have a value for its genomic feature
that exceeds that of a control

14
The ROC Curve

The area under the ROC curve is taken as a
measure of the association between genomic
feature and the likelihood of an integration event

15
The ROC Curve

The area under the curve is 1.0 when all
integration events have higher values for the
feature than any control event, and 0.0 for the
opposite case.

16
The ROC Curve

Values very near 1.0 occur when higher values of
the feature predict integration, and values very
near 0.0 occur when lower values of the feature
predict integration

17
The ROC Curve

When the area is 0.50, it is equally likely that
either has a higher value
Values near 0.50 are consistent with having no
predictive value
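
The probability interpretation of the ROC area can be checked directly by comparing every integration site with every control. The sketch below is an illustration under assumptions: the name `roc_area` is hypothetical, and counting a tie as half a win is a common convention that the slides do not state.

```python
def roc_area(integration_values, control_values):
    """Probability that a random integration site's feature value
    exceeds that of a random control (ties count as one half)."""
    wins = 0.0
    for x in integration_values:
        for y in control_values:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(integration_values) * len(control_values))


if __name__ == "__main__":
    print(roc_area([3, 4], [1, 2]))  # every integration value is higher
    print(roc_area([1, 2], [3, 4]))  # every integration value is lower
    print(roc_area([1, 2], [1, 2]))  # identical distributions
```

The three toy calls correspond to the three cases on the slides: an area of 1.0, an area of 0.0, and an area of 0.50.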

18
ROC Curve Construction

Values for the integration sites are tallied to
create the histogram and the upper tail areas of
the histogram, which shows the fraction of
integration sites (vertical axis) that have
values for the feature that exceed a given value
(horizontal axis)

19
ROC Curve Construction

Repeat this same procedure using data from the
control sites
Rotate this histogram and upper tail areas graph
90° clockwise
The ROC curve is constructed from the collection
of true and false positive rates

20
ROC Curve Construction

For every possible cutpoint, plot the True
Positive Rate on the y-axis and the False
Positive Rate on the x-axis
A cutpoint is defined as any value of a predictor
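
The construction above can be sketched as follows. For each cutpoint, the True Positive Rate is the fraction of integration sites at or above the cutpoint and the False Positive Rate is the same fraction for the controls. The function names are hypothetical, and two details are assumptions: the comparison uses "at or above" rather than strictly "exceeds" so that tied values are handled symmetrically, and the area is computed with the trapezoid rule.

```python
def roc_points(integration_values, control_values):
    """Return (FPR, TPR) pairs, one per cutpoint, starting at (0, 0)."""
    cutpoints = sorted(set(integration_values) | set(control_values),
                       reverse=True)
    n_int = len(integration_values)
    n_ctl = len(control_values)
    points = [(0.0, 0.0)]
    for cut in cutpoints:
        tpr = sum(v >= cut for v in integration_values) / n_int
        fpr = sum(v >= cut for v in control_values) / n_ctl
        points.append((fpr, tpr))
    return points


def trapezoid_area(points):
    """Area under the piecewise-linear curve through the given points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


if __name__ == "__main__":
    pts = roc_points([3, 4], [1, 2])
    print(pts, trapezoid_area(pts))
```

Because the lowest cutpoint admits every site, the curve always ends at (1, 1).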

21
A Compact Representation of these Associations

The absolute difference between the area and 0.50
is plotted
Values around 0.0 indicate no useful predictive
information in the feature
Values near 0.50 indicate that the feature is
nearly perfect in separating integration sites
from the controls

22
Color-coded Heat Maps

Color-coded heat maps are matrices displaying
associations for each type of genomic feature
using rows of the matrix for features and columns
for data sets

23
Color-coded Heat Maps

Bright green represents ROC curve areas near 0.0
Black represents ROC curve areas of 0.50
Bright red represents ROC curve areas near 1.0
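
One possible way to turn an ROC area into the color scheme above is sketched here. The linear scaling of brightness with the absolute difference from 0.50, the RGB encoding, and the function names are assumptions; the slides give only the three anchor colors.

```python
def association_strength(area: float) -> float:
    """Absolute difference between the ROC area and 0.50 (range 0 to 0.5)."""
    return abs(area - 0.5)


def heatmap_rgb(area: float):
    """Map an ROC area to (red, green, blue) with 0-255 channels.

    Areas near 0.0 are bright green, 0.50 is black, near 1.0 is bright red.
    """
    if not 0.0 <= area <= 1.0:
        raise ValueError("ROC area must lie between 0 and 1")
    intensity = round(255 * 2 * association_strength(area))
    if area > 0.5:
        return (intensity, 0, 0)
    if area < 0.5:
        return (0, intensity, 0)
    return (0, 0, 0)


if __name__ == "__main__":
    for a in (0.0, 0.5, 1.0):
        print(a, heatmap_rgb(a))
```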

24
Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration

To determine how important different features are
in directing integration towards a region, each
base in the interval is treated as the edge of an
integration site

25
Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration

Each region is then scored for the expected
number of integration events over the interval,
and these interval scores are summed

26
Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration

The summed values are then tested for their
ability to sort experimental integration sites
from controls

27
Effects of Nucleotide Sequence of the 20 Base
Pairs Surrounding the Point of Integration
(Figure axes: Interval Size, Integrating Elements)

Results are presented as areas under the ROC
curve for this variable

28
Integration in Transcription Units and the Effect
of Gene Activity

Analysis of DNA integration within TUs and exons

HIV (red) is positively correlated with TUs
Others varied from a slight negative correlation
(green) to indistinguishable from controls (black)

This figure summarizes the effects of gene
density in differently sized genomic intervals,
100 kb-4 Mb
Affymetrix arrays were used for transcriptional
profiling
Expression density: the expression scores for all
genes in an interval, divided by interval width
All datasets showed a weakly positive association
with insertion in at least one interval, and
"there was no clear pattern of interval size,
type of gene call, or expression level."
Suggests that gene density features were the most
significant
Strong effects seen in the HIV and MLV datasets
Weakest response from non-dividing cells or
macrophages

31
How Do G/C Content and Proximity to CpG Islands
Affect Integration?

On average, high G/C content implies
Gene-rich regions
Short introns
High frequencies of Alu repeats
Low frequencies of LINEs
High frequency of CpGs

2 MLV datasets where the correlation with
integration was positive
3 HIV datasets that were negatively correlated
(A/T preference)
Other datasets showed weaker and less consistent
responses

33
Whoa!? I Thought HIV Integrated in Gene-Enriched
Regions?
34
Fig. 3 A
Fig. 4 A
A/T preference of HIV integrase-binding protein
35

CpG island density
Increasing interval length, 1 kb-32 Mb
Correlates with gene density
Within short regions, proximity to CpG islands
correlates with proximity to regulatory regions
Long intervals span many genes

36
DNase I Cleavage Sites

DNase I preferentially cleaves chromatin at sites
where transcription factors bind; these sites are
associated with CpG islands and gene control
regions.

37
Integration Near Transcription Factor Binding
Motifs

Summarizes how integration is affected by
proximity to transcription factor binding sites
TRANSFAC PWMs: each integration site or control is
scored for how well it matches a position weight
matrix (PWM), and this score generates an ROC
curve describing the effect of that PWM
Lacks strength when analyzed with other factors

38
Proximity to Transcription Start and Stop Features

Compares integration frequency relative to
transcription start and stop features for
experimental sites and matched random controls,
expressed as ROC areas (Fig. 4C)

Boundary.dx: Distance from the 5' or 3' end
Start.dx: Distance to the nearest gene start
site
Closer to the start (green)
Signed.dx: High probability at the start sites
(red)
General.width: Length of introns
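
A distance feature such as Start.dx could be computed along the following lines. This is a sketch under assumptions, not the authors' method: the function name is hypothetical, all coordinates are taken to lie on a single chromosome, and the gene start positions are assumed to be pre-sorted in ascending order.

```python
import bisect


def distance_to_nearest_start(site: int, sorted_starts) -> int:
    """Unsigned distance from `site` to the closest gene start position.

    `sorted_starts` must be a non-empty ascending list of coordinates.
    """
    if not sorted_starts:
        raise ValueError("need at least one gene start position")
    i = bisect.bisect_left(sorted_starts, site)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = sorted_starts[max(i - 1, 0):i + 1]
    return min(abs(site - s) for s in candidates)


if __name__ == "__main__":
    starts = [100, 500, 900]  # toy coordinates, not real annotation
    print(distance_to_nearest_start(450, starts))
```

The binary search keeps each lookup logarithmic in the number of gene starts, which matters when many sites are scored against a full annotation.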

40
Improved Models Incorporating Score.20 Together
with Other Genomic Features

Score.20 was the most effective method for
differentiating between site selection of the
different vehicles
Addition of other variables to accentuate our
results.
Non-redundant
Lack of correlation

41
Increase in ROC Area by the Addition of a Genomic
Feature

Histogram: Found little correlation of Score.20
with other features
Predictors of integration targeting can be
constructed based on Score.20 and another feature
The fitting process leads to values that rank
higher than matched random controls

42
Fig. 5 D
43
A Single Model!

Regression models would be too complex
Want to analyze various features
Bayesian Model Averaging (BMA)
Reinforces that Score.20 and other features are
independent
Models with high posterior probability were
collected and used to evaluate the importance of
various features
Random sites are scored for the logarithmic odds
of integration with BMA models

44
Hierarchical clustering

Major grouping of the retrovirus HIV
Among our 17 datasets, the different element
types were resolved onto separate branches
Verifies that integration site selection is
dominated by element-encoded recombination enzymes

45
What genomic features influence integration of
new DNA?

What we've learned about each integrating element

HIV favors integration in active transcription
units (TUs)
MLV favors integration near gene 5' ends
ASLV integration is mostly random, but TUs seem
to be favored slightly

HIV: Found to be weakly attracted to integration
sites near DNase I cleavage sites over long
intervals, probably because both HIV insertion
sites and DNase I cut sites correlate with
gene-dense regions. Also revealed a strong
preference for integration in A/T-rich sequences,
contrary to previous presumptions correlating
insertion with G/C-dense areas.
MLV: Integration associations with CpG islands
and DNase I hypersensitive sites were found to be
amplified when a larger scale of interest is
used. The influence of the local nucleotide
sequence also increased with a larger interval.
Strong correlation for integration near areas of
gene expression.
ASLV: Integration near DNase I sites over long
genomic intervals is favored.

46
What genomic features influence integration of
new DNA?

What we've learned about each integrating element

SFV integration is mostly random, but is favored
slightly near CpG islands
SB favors integration in transcription units.
AAV-based vectors show a modest preference for
regions near transcription start sites
Experiments concerning whether LINEs prefer to
integrate within TUs have been inconclusive.
Specific sequence known to have effect on
integration.

SFV: Cell-specific integration influences.
Integration near CpG islands and proximity to
DNase I cut sites are more evident in stem cells
than in fibroblasts.
SB: Contradictory results with regard to proximity
to CpG islands and gene density, possibly because
of cell-type-specific integration influences.
AAV: Of all vectors, integration was found least
favorable into TUs, contrary to previous
mouse liver studies.
L1: Supports previous studies suggesting strong
integration site nucleotide relationships.

47
What genomic features influence integration of
new DNA?
When asking this question, the scale of interest
is very important because it can influence the
results.
For example: you use a vector that you think
integrates near the sequence GATTACA. When you
focus on a 20 bp segment, it can be very easy to
predict where the vector will integrate.
Conversely, if that same vector is integrated
into a 1 kb segment, a 20 kb segment, or a
3-billion-base-pair segment, the integration site
is going to be harder to predict, especially if
there are other, less understood influences acting
in concert, as seen in our case. Other factors
were seen to increase their influence with
increased interval size, as seen in MLV and ASLV.
48
Future Studies
With this catalog of vector-feature interactions,
we can better understand novel insertion
influences as they're identified. They can be
studied and compared against the
current comprehensive predictive models
incorporating all currently known genomic
features. In doing so, we will gain better
insertion prediction abilities with each new
independent genomic feature discovered.
One such new feature could be the relative
locations of nucleosomes, or other epigenetic
factors, like methylation of the DNA or
acetylation of histones.
http://en.wikipedia.org/wiki/Nucleosome
49
Future Studies
This paper mentioned many potential future
studies surrounding each individual
insertion vector, for example, SB cell-specific
integration and the likelihood of AAV insertion
into TUs.
Many other areas of research could build
upon the findings presented in this article.
Stronger mathematical modeling systems could be
of great value.
http://www.bioscience.heacademy.ac.uk/network/sigs/numeracy/
50
Future Studies
A different approach, utilizing
advances in proteomics to isolate and identify
some of the functional proteins used by these
potential insertion vectors, could also expand our
understanding of the mechanisms used. A
bioinformatics database could then be used to
see whether any DNA-binding proteins,
chromatin-related proteins, DNases, DNA ligases,
etc. were found.
http://www.dartmouth.edu/toxmetal/TXQAas.shtml
51
Future Studies
A second novel use of the vector-feature
interaction library is as a reference organized by
the feature in question. If you were working
with CpG islands, you could look up which
insertion vectors are likely to insert
near your CpG island of interest.
http://www.pb.ethz.ch/research/chromatin_technics/TDI.jpg/image
52
Big Future Studies
The purpose of this research was to better
understand the factors influencing various vector
insertions. This is useful in the hope of
creating a reliable, predictable vehicle for
integrating DNA elements into humans. This
innovation could turn gene therapy into a
plausible reality. We need to be able to insert
desired segments with pinpoint accuracy, as
illustrated at the beginning of this paper. A
previous study successfully treated human
X-SCID while also indirectly causing leukemia in
three of the patients. Unlike in mice, it has to
work the first try, every try.
53
Gene Therapy
Typically, gene therapy is most successful when
used to treat a single-gene (monogenic) genetic
disorder

Cystic Fibrosis
Sickle Cell Anemia
Marfan Syndrome
Huntington's Disease
Hereditary Hemochromatosis
Ornithine Transcarbamylase Deficiency (OTCD)
X-linked Severe Combined Immunodeficiency Disease
(X-SCID) "bubble baby syndrome."

http://www.annasslant.com/doctor-shot.jpg
For more information about gene therapy, visit
http://www.ornl.gov/sci/techresources/Human_Genome/medicine/assist.shtml
