Sequence features of DNA binding sites reveal structural class of associated transcription factor - PowerPoint PPT Presentation

Slides: 45
Provided by: c3110
Transcript and Presenter's Notes

Title: Sequence features of DNA binding sites reveal structural class of associated transcription factor


1
Sequence features of DNA binding sites reveal
structural class of associated transcription
factor
  • Narlikar L and Hartemink AJ. Bioinformatics. 2006
    Jan 15;22(2):157-63.
  • Carol Sniegoski

2
The Central Dogma of Molecular Biology
DNA: double-stranded chain of nucleotide bases (A-T,
C-G)
RNA: single-stranded chain of nucleotide bases
(A, U, C, G)
Protein: polypeptide chain
3
DNA Basics
  • Two chains form a double helix
  • Chains have orientation
  • 5' end is upstream; 3' end is downstream
  • Sugar-phosphate backbone provides framework for
    bases (A,C,G,T)
  • Hydrogen bonds between complementary base pairs
    hold chains together
  • A pairs with T, C pairs with G

4
Protein Basics
  • Proteins are folded up polypeptide strings
  • Sequence determines form; form determines
    function
  • Function is focused at key domains (active sites,
    binding sites)
  • Predicting form from sequence is an unsolved
    problem
  • Experimental methods: NMR, X-ray crystallography
  • Computational methods: predicting de novo, or
    predicting based on sequence similarity to other
    known proteins

Figure: protein shown as a ball-and-stick model, a space-filling model, and a cartoon model.
5
Protein Structure
Primary protein structure: the order of amino acids
Secondary protein structure: common repeating
structures, often formed by hydrogen bonds
Tertiary protein structure: the full 3-dimensional
folded structure
Quaternary protein structure: proteins composed
of multiple polypeptide chains
6
Protein Domains
  • Structural domains
  • Elements of tertiary structure
  • May be composed of one or more motifs (secondary
    structure)
  • Many domains appear in a variety of protein
    families
  • Domains are important to a protein's biological
    function

7
Proteins Do (Almost) Everything
8
Gene Expression Control Points
  • Activating the gene structure
  • Initiating transcription of mRNA from DNA
  • Processing the mRNA transcript
  • Transporting the processed transcript from
    nucleus to cytoplasm
  • Translating mRNA into protein
  • Controlling mRNA degradation
9
Components Needed for Transcription
  • RNA polymerase (RNAP)
  • Enzyme that transcribes DNA into RNA.
  • DNA
  • Accessible DNA sequence to be transcribed (gene).
  • Various cis-acting DNA regulatory sequences
    located near the sequence to be transcribed.
  • (Cis-acting: part of the DNA sequence; affects
    one copy of a gene.)
  • The regulatory sequences serve as binding sites
    recognized by transcription factors.
  • Transcription factors (TFs)
  • Set of trans-acting accessory proteins required
    to initiate transcription.
  • (Trans-acting: freely diffusible; affects both
    copies of a gene.)
  • TFs have binding domains that recognize and bind
    to specific DNA sequences.

10
RNA Polymerase
  • The RNA polymerase protein transcribes DNA into
    RNA.
  • It is not responsible for knowing when or where
    to start transcription.

Figure labels: RNA polymerase; new RNA transcript; DNA double helix.
11
DNA Regulatory Sequences
  • Characteristic regulatory sequences in DNA are
    bound by specific transcription factors.
  • Complexes of bound factors both locate and
    promote gene transcription.

Transcription startpoint
  • Promoter regions are usually located within 200
    bp upstream of the startpoint.
  • Initiator (Inr): consensus sequence
    YYAN(T/A)YY, within 5 bp of the startpoint
  • TATA box: consensus sequence TATAAAA, about 25 bp
    upstream of the startpoint
  • GC box: consensus sequence GGGCGG
  • CAAT box: consensus sequence CCAAT
  • Enhancer regions (not shown) are located farther
    upstream or downstream.

12
DNA Regulatory Sequences
  • Modular
  • Specific to a gene or a set of genes
  • Specific to a condition or range of conditions
  • Support complex control of gene transcription

Figure: example DNA sequences for several genes, with upstream and downstream directions marked relative to the transcription startpoint. Elements shown: TATA box, CAAT box, octamer motif, GC box.
13
Transcription Factors
  • Any factor that is needed for the initiation of
    transcription but is not part of RNA polymerase
  • Three operationally defined classes of
    transcription factors
  • General factors
  • Form an initiation complex with RNA polymerase
    around the transcription startpoint
  • Always required for initiation of transcription
  • Unregulated
  • Upstream factors
  • Bind to specific DNA consensus sequences
    (promoters and enhancers) upstream of the
    startpoint
  • Required for adequately efficient initiation of
    transcription
  • Unregulated
  • Inducible factors
  • Operate like upstream factors
  • Highly regulated
  • Responsible for controlling transcription
    patterns in time and space

14
Activating Inducible TFs (1)
15
Activating Inducible TFs (2)
16
Transcription Factors
  • Transcription factors bind to DNA and to each
    other to form complexes that initiate
    transcription

  • TFIIIA binds to a site within the promoter region
  • TFIIIC binds to form a stable complex
  • TFIIIB (with 3 subunits) now binds to its binding
    site near the startpoint of transcription
  • Finally RNA polymerase binds and begins
    transcribing the gene
17
Transcription Factors
  • Even factors bound to remote enhancers can
    contribute to the initiation complex

Figure labels: enhancer; gene; basal transcription complex; enhancer-bound complex.
18
Binding Site Specificity
  • Many TFs' DNA-binding domains use similar types
    of mechanisms.
  • Binding domain structures can be grouped into
    classes.
  • Each class binds particular sets of DNA sequences
    (binding sites).
  • Binding sites are usually somewhat degenerate
    (variable).
  • Two common models for characterizing binding
    sites:
  • Regular expressions
  • Construct a regular expression that matches only
    the sequences at known binding sites.
  • Can match variable-length sequences.
  • Does not provide information about probability or
    binding affinity.
  • PSSM (Position-Specific Scoring Matrix)
  • Covered on the next slide.

19
PSSM
Position-Specific Scoring Matrix
  • Align known binding sites for the TF, all of
    length n.
  • Create a 4xn matrix showing the number of times
    each base appears at each position.
  • To determine the TF's binding affinity for
    sequence S, calculate the log-odds score

$$\log\left(\frac{P_M(S)}{P_B(S)}\right)$$

where $P_M(S)$ is the probability of seeing S in the motif and $P_B(S)$ is the probability of seeing S outside the motif.

Position   1   2   3   4   5   6   7   8   9  10
A          3   2   0  12   0   0   0   0   1   3
C          5   2  12   0  12   0   1   0   2   1
G          3   7   0   0   0  12   0   7   5   4
T          1   1   0   0   0   0  11   5   4   4

PSSM matrix built from an alignment of 12 binding
sites of length 10 bp for yeast TF Pho4p
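The log-odds calculation can be sketched in Python using the Pho4p count matrix from this slide. This is only an illustrative sketch: the pseudocount of 1 per base and the uniform background of 0.25 per base are assumptions added here (the slide does not say how zero counts or the background are handled), and `log_odds_score` is a hypothetical helper name.

```python
import math

# Count matrix from the slide: 12 aligned Pho4p sites, 10 positions each.
COUNTS = {
    "A": [3, 2, 0, 12, 0, 0, 0, 0, 1, 3],
    "C": [5, 2, 12, 0, 12, 0, 1, 0, 2, 1],
    "G": [3, 7, 0, 0, 0, 12, 0, 7, 5, 4],
    "T": [1, 1, 0, 0, 0, 0, 11, 5, 4, 4],
}


def log_odds_score(seq, counts=COUNTS, pseudocount=1.0, background=0.25):
    """Return log(P_M(S) / P_B(S)) for a sequence of the motif's length.

    Assumptions (not from the slide): a pseudocount is added to every cell
    so that unseen bases do not give log(0), and the background model is a
    fixed probability per base.
    """
    width = len(counts["A"])
    if len(seq) != width:
        raise ValueError(f"sequence must have length {width}")
    score = 0.0
    for pos, base in enumerate(seq):
        column_total = sum(counts[b][pos] for b in "ACGT") + 4 * pseudocount
        p_motif = (counts[base][pos] + pseudocount) / column_total
        score += math.log(p_motif / background)
    return score
```

Under these assumptions, a sequence containing the conserved CACGTG core at positions 3 to 8 (for example `CGCACGTGGG`) scores above zero, while a sequence such as `TTTTTTTTTT` scores below zero.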
20
The Experiment
Goal: Predict the type of DNA-binding domain
that a TF has, based on features of the DNA
sequences to which it binds.
Data: Encoded data about TFs' classes and
the sequences to which they bind, taken from
the TRANSFAC database.
21
TRANSFAC Database
TRANSFAC is a database on eukaryotic cis-acting
regulatory DNA elements and trans-acting factors.
It covers the whole range from yeast to human. It
started 1988 with a printed compilation and was
transferred into computer-readable format in
1990.
The FACTOR table contains 6133 entries in 50
classes, but this figure does not reflect the
number of independent transcription factors.
Homologous factors from different species such as
human and mouse SRF are given different entries
since they may differ in some molecular aspects.
Factors originally described by different
research groups as binding to different genes may
turn out identical when cloned. Also, more
factors are recognized as representatives of
whole TF families that are products of distinct
but similar genes or alternative splice products.
We have in general not entered proteins just
because of the presence of a putative DNA-binding
motif. Thus there are many more zinc finger or
homeo domain proteins known than are included in
FACTOR, but for many no data about DNA-binding
specificity or other gene regulatory features are
available.
The SITE table gives information on individual
(putatively) regulatory protein binding sites. It
contains 7915 entries. 6360 of them refer to
sites within 1504 eukaryotic genes. 1295 are
artificial sequences. 260 have consensus binding
sequences given in the IUPAC code.
22
TRANSFAC Classes
1 Superclass: Basic Domains
  1.1 Class: Leucine zipper factors (bZIP). (IV)
  1.2 Class: Helix-loop-helix factors (bHLH). (III)
  1.3 Class: Helix-loop-helix / leucine zipper factors (bHLH-ZIP).
  1.4 Class: NF-1
  1.5 Class: RF-X
  1.6 Class: bHSH
2 Superclass: Zinc-coordinating DNA-binding domains
  2.1 Class: Cys4 zinc finger of nuclear receptor type. (II)
  2.2 Class: diverse Cys4 zinc fingers.
  2.3 Class: Cys2His2 zinc finger domain. (I)
  2.4 Class: Cys6 cysteine-zinc cluster.
  2.5 Class: Zinc fingers of alternating composition
3 Superclass: Helix-turn-helix
  3.1 Class: Homeo domain. (VI)
  3.2 Class: Paired box.
  3.3 Class: Fork head / winged helix. (V)
  3.4 Class: Heat shock factors
  3.5 Class: Tryptophan clusters.
  3.6 Class: TEA domain.
4 Superclass: beta-Scaffold Factors with Minor Groove Contacts
  4.1 Class: RHR (Rel homology region).
  4.2 Class: STAT
  4.3 Class: p53
  4.4 Class: MADS box.
  4.5 Class: beta-Barrel alpha-helix transcription factors
  4.6 Class: TATA-binding proteins
etc.
23
TRANSFAC Class Hierarchy
Transcription Factor Classification
Last modified 2002-10-01
1 Superclass: Basic Domains
  1.1 Class: Leucine zipper factors (bZIP).
    1.1.1 Family: AP-1(-like) components
      1.1.1.1 Subfamily: Jun
        1.1.1.1.1 XBP-1 (human).
        1.1.1.1.2 v-Jun (ASV).
        1.1.1.1.3 c-Jun (mouse); c-Jun (rat); c-Jun (human); c-Jun (chick).
        1.1.1.1.4 JunB (mouse).
        1.1.1.1.5 JunD (mouse).
        1.1.1.1.6 dJRA
      1.1.1.2 Subfamily: Fos
        1.1.1.2.1 v-Fos (FBR MuLV); v-Fos (FBJ MuLV); v-Fos (NK24).
        1.1.1.2.2 c-Fos (mouse); c-Fos (human); c-Fos (rat); c-Fos (chick).
        1.1.1.2.3 FosB (mouse).
          1.1.1.2.3.1 FosB1
          1.1.1.2.3.2 FosB2
        1.1.1.2.4 Fra-1 (mouse); Fra-1 (rat).
        1.1.1.2.5 Fra-2 (chick); Fra-2 (human).
etc.
24
TRANSFAC Factors
Drilldown on 1.1 Class: Leucine zipper factors
(bZIP) lists factors in the class

CL basic region leucine zipper 1.1.
CC A DNA-binding basic region is followed by a leucine
zipper. The leucine zipper consists of repeated
leucine residues at every seventh position and
mediates protein dimerization as a prerequisite
for DNA-binding. The leucines are directed
towards one side of an alpha-helix. The leucine
side chains of two polypeptides are thought to
interdigitate upon dimerization (knobs-into-holes
model). The leucine zipper dictates dimerization
specificity. Upon DNA-binding of the dimer, the
basic regions adopt alpha-helical conformation as
well. Possibly, a sharp angulation point
separates two alpha-helices of the subregions A
and B, leading to the scissors grip model for the
bZIP-DNA complex. The DNA is contacted through
the major groove over a whole turn.
BF T03820 ABF1; Species: thale cress, Arabidopsis thaliana.
BF T03823 ABF2; Species: thale cress, Arabidopsis thaliana.
BF T03824 ABF3; Species: thale cress, Arabidopsis thaliana.
BF T03825 ABF4; Species: thale cress, Arabidopsis thaliana.
BF T04543 ABI5; Species: thale cress, Arabidopsis thaliana.
BF T04565 ACA1; Species: yeast, Saccharomyces cerevisiae.
BF T00027 AP-1; Species: clawed frog, Xenopus.
BF T00029 AP-1; Species: human, Homo sapiens.
BF T00030 AP-1; Species: monkey, Cercopithecus aethiops.
BF T00031 AP-1; Species: rat, Rattus norvegicus.
BF T00032 AP-1; Species: mouse, Mus musculus.
BF T03199 ARR1; Species: yeast, Saccharomyces cerevisiae.
BF T02783 ATB-2; Species: thale cress, Arabidopsis thaliana.
etc.
25
TRANSFAC Sites
Drilldown on factor ABF1 lists the sequences to
which it binds
SQ GGACGCGTGGC.
SQ TGTCGTGGGGACACGTGGCATACGAGGC.
SQ TGTCGGGGACACGTGGCGCTAACGAGGC.
SQ TGTCGGGACACGTGGCGCAACACGAGGC.
SQ TGTCGGGACACGTGGCCCACCCGGAGGC.
SQ TGTCGGGACACGTGGCACAAATAGAGGC.
SQ TGTCGTCAATGGACACGTGGCTAGAGGC.
SQ TGTCGTCGGACACGTGGCACGAAGAGGC.
SQ GCCTCGACAGGACACGTGGCACGCGACA.
SQ TGTCGATCAATGGACACGTGGCAGAGGC.
SQ GCCTCGGTGACACGTGGCTTGACCGACA.
SQ TGTCGGAAGTGGTGACACGTGGCGAGGC.
etc.
26
Feature Encoding (1)
  • Encode each TF as a 1390-length feature vector.
  • Don't worry about having too many features; the
    classifier will identify the important ones.
  • For 1387 features, calculate the arithmetic mean
    of the feature vectors for the sequences the TF
    binds.
  • Add 3 extra binary features indicating whether
    the TF is plant, animal, or fungus.
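The TF-level encoding described above can be sketched as follows. This is an illustrative sketch only: `encode_tf` is a hypothetical helper, the per-site vectors are assumed to be already computed, and the order of the three binary flags (plant, animal, fungus) is an assumption taken from the wording of the bullet.

```python
def encode_tf(site_vectors, kingdom):
    """Average the per-site feature vectors of one TF, then append 3 flags.

    site_vectors: list of equal-length numeric lists, one per binding site.
    kingdom: "plant", "animal", or "fungus" (assumed flag order).
    """
    if not site_vectors:
        raise ValueError("a TF needs at least one binding site")
    n_sites = len(site_vectors)
    # Arithmetic mean of each feature across the TF's binding sites.
    mean_vector = [sum(column) / n_sites for column in zip(*site_vectors)]
    # One binary indicator per kingdom.
    kingdom_flags = [int(kingdom == k) for k in ("plant", "animal", "fungus")]
    return mean_vector + kingdom_flags
```

With 1387 features per site, the returned vector has 1387 + 3 = 1390 entries, matching the feature count on this slide.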

27
Feature Encoding (2)
  • Encode each binding site as a 1387-length feature
    vector.

1364 integer features encoding subsequence
frequency for subsequences up to length 5:
  • 4^1 = 4 features for subsequences of length 1 (A, T, C,
    G)
  • 4^2 = 16 for subsequences of length 2 (AA,
    AT, AC, AG, TA, TT, TC, TG, ...)
  • 4^3 = 64 for subsequences of length 3
  • 4^4 = 256 for subsequences of length 4
  • 4^5 = 1024 for subsequences of length 5
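The subsequence features can be sketched in Python as below. This is a minimal illustration rather than the authors' code: `kmer_counts` is a hypothetical helper, overlapping occurrences are assumed to be counted, and the feature order (by length, then in A, T, C, G order) is an arbitrary choice made here.

```python
from itertools import product

BASES = "ATCG"  # base order as listed on the slide
# All subsequences of length 1 to 5: 4 + 16 + 64 + 256 + 1024 = 1364.
KMERS = ["".join(p) for k in range(1, 6) for p in product(BASES, repeat=k)]


def kmer_counts(site):
    """Count every subsequence of length 1-5 in one binding site.

    Assumption: overlapping occurrences are all counted.
    """
    counts = dict.fromkeys(KMERS, 0)
    for k in range(1, 6):
        for start in range(len(site) - k + 1):
            sub = site[start:start + k]
            if sub in counts:  # skips anything containing non-ACGT symbols
                counts[sub] += 1
    return [counts[kmer] for kmer in KMERS]
```

For the ABF1 site GGACGCGTGGC, this gives counts of 1 for A, 3 for C, 6 for G, and 1 for T among the length-1 features.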
28
Feature Encoding (3)
  • 8 binary features encoding the presence or
    absence of an ungapped palindrome of half-length
    3, 4, 5, or 6, either spanning the whole sequence
    or not.
  • A palindromic sequence is equal to its
    complementary sequence read backwards.
  • A and T, C and G are complementary bases.
  • 1 for a palindrome of half-length 3, spanning
    (e.g., ACG CGT)
  • 1 for a palindrome of half-length 3, not
    spanning (e.g., ACG CGT inside a longer site)
  • 1 for a palindrome of half-length 4, spanning
    (e.g., ACGC GCGT)
  • 1 for a palindrome of half-length 4, not
    spanning (e.g., ACGC GCGT inside a longer site)
  • etc.
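The ungapped palindrome features can be sketched as below. This is a hedged illustration: `palindrome_features` is a hypothetical helper, and the reading of "spanning" (the palindrome covers the entire site) versus "not spanning" (the palindrome is a proper substring of the site) is an assumption drawn from the examples on this slide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base (A<->T, C<->G) and read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def palindrome_features(site, half_lengths=(3, 4, 5, 6)):
    """Return 8 binary features: (spanning, not spanning) per half-length."""
    features = []
    for half in half_lengths:
        width = 2 * half
        windows = [site[i:i + width] for i in range(len(site) - width + 1)]
        found = any(w == reverse_complement(w) for w in windows)
        spanning = int(found and len(site) == width)
        not_spanning = int(found and len(site) > width)
        features += [spanning, not_spanning]
    return features
```

For example, `palindrome_features("ACGCGT")` sets only the half-length 3 spanning feature, while the ABF1 site GGACGCGTGGC sets only the half-length 3 non-spanning feature.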

29
Feature Encoding (4)
  • 8 binary features encoding the presence or
    absence of a gapped palindrome of half-length 3,
    4, 5, or 6, either spanning the whole sequence or
    not.
  • A gapped palindrome is a palindrome with a
    non-palindromic insertion in the exact middle.
  • 1 for a gapped palindrome of half-length 3,
    spanning (e.g., ACG ... CGT)
  • 1 for a gapped palindrome of half-length 3, not
    spanning (e.g., ACG ... CGT inside a longer site)
  • 1 for a gapped palindrome of half-length 4,
    spanning (e.g., ACGC ... GCGT)
  • 1 for a gapped palindrome of half-length 4, not
    spanning (e.g., ACGC ... GCGT inside a longer site)
  • etc.
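A gapped palindrome check can be sketched along the same lines. This is an illustrative sketch under stated assumptions: `gapped_palindrome_features` is a hypothetical helper, the gap is assumed to be at least one base long, the gap itself is not tested for being non-palindromic, and "spanning" is assumed to mean the two halves sit at the two ends of the site.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base (A<->T, C<->G) and read the result backwards."""
    return seq.translate(COMPLEMENT)[::-1]


def gapped_palindrome_features(site, half_lengths=(3, 4, 5, 6)):
    """Return 8 binary features: (spanning, not spanning) per half-length."""
    n = len(site)
    features = []
    for half in half_lengths:
        spanning = not_spanning = 0
        # The left half must leave room for a gap of >= 1 and a right half.
        for left in range(n - 2 * half):
            target = reverse_complement(site[left:left + half])
            for right in range(left + half + 1, n - half + 1):
                if site[right:right + half] == target:
                    if left == 0 and right + half == n:
                        spanning = 1
                    else:
                        not_spanning = 1
        features += [spanning, not_spanning]
    return features
```

For example, `gapped_palindrome_features("ACGTTCGT")` sets only the half-length 3 spanning feature (halves ACG and CGT around the gap TT).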

30
Feature Encoding (5)
7 binary features encoding the presence or
absence of a special sequence identified in the
literature as over-represented in the binding
sites of certain classes of TF.

Sequence         Class
G..G             Cys2His2 (I)
G..G..G          Cys2His2 (I)
GC..GC..GC       Cys2His2 (I)
AGGTCA|TGACCT    Cys4 (II)
CA..TG           bHLH (III)
TGA.*TCA         bZip (IV)
TAAT|ATTA        Homeodomain (VI)

Regular expression representation:
  .    Any single character.
  []   Any single character inside the brackets.
  |    Either the expression preceding or the expression following.
  *    Zero or more of the preceding expression.
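These special-sequence features map directly onto regular expression searches, sketched here with Python's `re` module. The pattern spellings are assumptions where the slide text lost its symbols: alternation (`|`) is assumed in the Cys4 and Homeodomain patterns, and the bZip pattern is read as `TGA.*TCA` because the legend mentions "zero or more of the preceding expression". `special_sequence_features` is a hypothetical helper name.

```python
import re

# (pattern, class) pairs in the order listed on the slide.
# Assumed spellings: "|" and ".*" are reconstructions, not confirmed.
SPECIAL_PATTERNS = [
    ("G..G", "Cys2His2 (I)"),
    ("G..G..G", "Cys2His2 (I)"),
    ("GC..GC..GC", "Cys2His2 (I)"),
    ("AGGTCA|TGACCT", "Cys4 (II)"),
    ("CA..TG", "bHLH (III)"),
    ("TGA.*TCA", "bZip (IV)"),
    ("TAAT|ATTA", "Homeodomain (VI)"),
]


def special_sequence_features(site):
    """Return 7 binary features: 1 if the pattern occurs anywhere in site."""
    return [int(re.search(pattern, site) is not None)
            for pattern, _tf_class in SPECIAL_PATTERNS]
```

For example, an E-box site such as CACGTG sets the bHLH feature (`CA..TG`), and a site containing TGACTCA sets the bZip feature.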
31
Encoding Example
Encode sequence GGACGCGTGGC.
  • Length 1 subsequences: A = 1, C = 3, G = 6, T = 1
  • Length 2 subsequences: 7 features = 1 or 2; 9
    features = 0
  • Length 3 subsequences: 9 features = 1; 55
    features = 0
  • Length 4 subsequences: 8 features = 1; 248
    features = 0
  • Length 5 subsequences: 7 features = 1; 1017
    features = 0
  • Palindromes: 1 feature = 1; 7 features = 0
  • Gapped palindromes: 8 features = 0
  • Special sequences: ?
At least 1344 of the 1387 features for this
binding sequence are zero-valued.
32
Dataset
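Feature matrix X: d = 1390 rows, one for each feature; n = 587 columns, one for each TF.

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,587} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,587} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1390,1} & x_{1390,2} & \cdots & x_{1390,587} \end{bmatrix}$$

Class labels: 1-of-m class encoding with m = 6, one column per TF; for the first TF the column is $[y_{1,1} \; \cdots \; y_{6,1}]^T$.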
33
SMLR Algorithm
Sparse Multinomial Logistic Regression
  • Learns a multi-class classifier
  • Simultaneously performs feature selection
  • Reports the probabilities of a sample belonging
    to each of the m classes, given m sets of feature
    weights, one for each class.

34
Linear Regression
Model/predict a dependent variable as a linear
function of independent variables:

$$y_i = b_1 x_{i1} + b_2 x_{i2} + \dots + b_n x_{in} + e_i$$

Find the best-fit line (i.e., estimate the $b_i$'s)
by minimizing the sum of the squares of the
vertical deviations from each data point to the
line:

$$R^2 = \sum_i \left[ y_i - f(x_i; b_1, b_2, \dots, b_n) \right]^2$$
35
Logistic Regression
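Used when dependent variable y is binary. The logit
function of p is expressed as a linear
combination of the $x_i$.

$$\mathrm{logit}(p) = \log\left(\frac{p}{1-p}\right) = w_0 + w_1 x_1 + \dots + w_n x_n = w^T x$$

$$p = P(y = 1 \mid x, w) = \frac{e^{w^T x}}{1 + e^{w^T x}}$$

Here p is the probability that x belongs to class y = 1, given x
and w. (Plot: p against x.)

$w = [w_0 \; w_1 \; \dots \; w_n]^T$: single weight vector of length d
$x = [x_0 \; x_1 \; \dots \; x_n]^T$: d feature values for one sample (with $x_0 = 1$, so that $w_0$ acts as the intercept)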
36
Multinomial Logistic Regression
Generalization of logistic regression. Used when
dependent variable y is multiclass.

$$P(y^{(i)} = 1 \mid x, w) = \frac{e^{w^{(i)T} x}}{\sum_{j=1}^{m} e^{w^{(j)T} x}}$$

This is the probability that x belongs to the class
encoded by $y^{(i)} = 1$, given w.

$w = [w^{(1)T} \; w^{(2)T} \; \dots \; w^{(m)T}]^T$: weight vectors of length d for each of m classes
$x = [x_0 \; x_1 \; \dots \; x_d]^T$: d feature values for one sample
$y = [y^{(1)} \; y^{(2)} \; \dots \; y^{(m)}]^T$: one-of-m class encoding
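The class probabilities of multinomial logistic regression can be computed as in this short sketch. `class_probabilities` is a hypothetical helper; subtracting the largest score before exponentiating is a standard numerical-stability step added here and does not change the result.

```python
import math


def class_probabilities(weights, x):
    """Return P(y^(i) = 1 | x, w) for each of the m classes.

    weights: list of m weight vectors, each the same length as x.
    x: feature values for one sample.
    """
    scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
    top = max(scores)  # shift scores so the largest exponent is 0
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With all weights set to zero, every class receives probability 1/m, and the probabilities always sum to 1.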
37
Estimating w
In logistic regression, w is usually estimated
using maximum likelihood (ML). We want to find the w
that maximizes the probability of classifying
samples correctly.

$P(y_j \mid x_j, w)$: probability of classifying
sample $x_j$ correctly, given the values of w.

Log-likelihood:

$$
\begin{aligned}
l(w) &= \sum_{j=1}^{n} \log P(y_j \mid x_j, w) \\
&= \sum_{j=1}^{n} \log \left( \frac{e^{w_j^T x_j}}{\sum_{i=1}^{m} e^{w^{(i)T} x_j}} \right) \\
&= \sum_{j=1}^{n} \left[ w_j^T x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right] \\
&= \sum_{j=1}^{n} \left[ \sum_{i=1}^{m} y_j^{(i)} w^{(i)T} x_j - \log \sum_{i=1}^{m} e^{w^{(i)T} x_j} \right]
\end{aligned}
$$

Here $w_j$ indicates the weight vector for the class to
which $x_j$ belongs, and $y_j^{(i)}$ is 1 only when $x_j$ is in class i, 0 otherwise.
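The log-likelihood can be evaluated numerically as in the sketch below. `log_likelihood` is a hypothetical helper; class labels are assumed to be given as integer indices (0 to m-1) rather than one-of-m vectors, and the log of the sum of exponentials is computed with the usual max-shift for numerical stability.

```python
import math


def log_likelihood(weights, samples, labels):
    """Sum over samples of: (true-class score) - log(sum of exp(scores)).

    weights: list of m weight vectors.
    samples: list of feature vectors x_j.
    labels: list of integer class indices, one per sample (assumed format).
    """
    total = 0.0
    for x, label in zip(samples, labels):
        scores = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        total += scores[label] - log_norm
    return total
```

With all weights at zero and m = 3 classes, each sample contributes log(1/3), which gives a quick sanity check on an implementation.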
38
Estimating a Sparse w
We want w to be sparse, with many zero values,
deselecting many features.
Use the maximum a posteriori (MAP) method:
penalize the ML estimate by placing a prior
p(w) on the parameters w. Choose a prior
distribution that induces sparsity: the Laplace
distribution.

$$\hat{w}_{MAP} = \arg\max_w L(w) = \arg\max_w \left[ l(w) + \log p(w) \right]$$

l(w): sum of log-likelihoods of the $x_i$ being classified
correctly, given $x_i$ and w
p(w): probability that w comes from a Laplace
distribution
39
Laplace Distribution
$$p(x) = \frac{1}{2b} e^{-|x - \mu|/b}$$

$$p(w) \propto e^{-\lambda \lVert w \rVert_1} = e^{-\lambda \sum_j |w_j|}$$

  • Remember ln p(w) is the MAP penalty function.

Larger $|w_j|$ → smaller p(w) → very negative ln p(w).
Smaller $|w_j|$ → larger p(w) → less negative ln p(w).
Ignoring the normalizing constant, ln p(w) is at its max at ln p(w) = 0,
where $p(w) = e^{-\lambda \sum_j |w_j|} = e^0 = 1$, that is, when every $w_j$ is 0.
  • The λ parameter needs to be set appropriately.
  • Larger λ → greater sparsity, fewer features
    selected.
  • Authors chose λ = 1 using cross-validation.
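The effect of the Laplace prior on the objective can be sketched as an L1 penalty subtracted from the log-likelihood. This is a minimal sketch of the objective only, not of the SMLR optimization procedure: `map_objective` is a hypothetical helper, the log-likelihood value is passed in as a number, and the prior's normalizing constant is dropped because it does not depend on w.

```python
def map_objective(log_lik, weights, lam=1.0):
    """Return l(w) + ln p(w) up to a constant: l(w) - lam * sum(|w_j|).

    log_lik: log-likelihood l(w) already computed for these weights.
    weights: list of weight vectors (one per class).
    lam: the Laplace parameter; larger values penalize non-zero weights more.
    """
    l1_norm = sum(abs(w_k) for w in weights for w_k in w)
    return log_lik - lam * l1_norm
```

For a fixed log-likelihood, setting more weights to exactly zero raises the objective, which is why maximizing it favors sparse weight vectors.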

40
Results
  • 77 TFs misclassified during LOOCV, for 87%
    accuracy.
  • 20% accuracy during LOOCV after permuting class
    labels
  • (28% accuracy expected).

(error)(TFs) = TFs misclassified
.23(97)  = 22.31
.09(97)  =  8.73
.11(61)  =  6.71
.08(165) = 13.2
.17(52)  =  8.84
.15(115) = 17.25
---------------------
.13(587) = 77.04
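The arithmetic in this table can be checked with a few lines of Python. The variable names are illustrative. The majority-class baseline computed at the end is an observation added here: the slide does not state how the 28% expected figure was derived, but it matches the share of the largest class (165 of 587).

```python
# Per-class error rates and class sizes, in the order listed on the slide.
ERROR_RATES = [0.23, 0.09, 0.11, 0.08, 0.17, 0.15]
CLASS_SIZES = [97, 97, 61, 165, 52, 115]

total_tfs = sum(CLASS_SIZES)
misclassified = sum(e * n for e, n in zip(ERROR_RATES, CLASS_SIZES))
accuracy = 1 - misclassified / total_tfs

# Accuracy of always guessing the largest class (assumed interpretation
# of the "expected" figure; not stated on the slide).
majority_baseline = max(CLASS_SIZES) / total_tfs

print(f"TFs: {total_tfs}, misclassified: {misclassified:.2f}")
print(f"accuracy: {accuracy:.0%}, majority baseline: {majority_baseline:.0%}")
```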
41
Results
  • Analyzed feature selection consistency across
    LOOCV trials.
  • Most features were selected either very
    infrequently (1047 features were selected in
    < 10% of trials)
  • or very frequently (290 features were selected
    in > 90% of trials). This leaves 53 features
    selected inconsistently.

42
Results
  • Used trained classifier to predict TF class based
    on experimentally determined binding site motifs.
  • Used 14 TFs in TRANSFAC but not in training set.
  • TF binding sites were experimentally determined.
  • Motifs were extracted from the binding sites
    using PSSM.
  • Other potential binding sites with the same
    motifs were located using PSSM methods.
  • These binding sites formed the input data.
  • Class was predicted correctly for 12 of 14 TFs.

43
Conclusions
  • The authors have developed a multiclass
    classifier that assigns TFs to DNA-binding domain
    classes based on features in their binding site
    sequences.
  • They argue that this capability demonstrates that
    DNA binding sites contain significant predictive
    information about TFs' binding mechanisms.
  • They note that their classifier consistently
    selects certain features and argue for their
    biological plausibility.
  • Nearly 1/3 of features are predictors of Class I,
    zinc finger proteins with poor sequence
    specificity.
  • Palindromic features are predictors of Class II,
    zinc finger proteins that form dimers.
  • They argue that their method has implications for
    how TF binding sites should be modeled.
  • Regular expression models are not probabilistic
  • PSSM models are fixed-length
  • They note that their classifier might be useful
    to biologists.
  • Help to engineer proteins that bind to specific
    DNA sequences
  • Predict which class of TF binds to sites found
    using conventional motif finding algorithms

44
Cell-Signaling Pathways