Bioinformatics: overview - PowerPoint PPT Presentation

Title:

Bioinformatics: overview

Slides: 163
Provided by: christians6


1
Bioinformatics overview
  • Handling a computer
  • Opening and saving of files
  • Starting programs
  • Navigating the WWW
  • FTP
  • Browsing
  • Sequence data
  • Primary data
  • Sequence formats

2
Bioinformatics overview (2)
  • Databases
  • Entrez
  • SRS
  • Manipulation of DNA sequences
  • Restriction analysis
  • in silico cloning
  • Translation of nt sequence into protein
  • PCR
  • Primer design

3
Bioinformatics overview (3)
  • Comparison of two sequences
  • Dot matrix
  • Pairwise alignment
  • Multiple alignments
  • Database searches for similar sequences
  • FASTA
  • BLAST

4
Bioinformatics overview (4)
  • Sequence annotation
  • Intron/Exon prediction
  • Identification of conserved motifs
  • Identification of regulatory sequences
  • Organismal databases
  • D. melanogaster
  • A. thaliana
  • Expression profiling (chips)

5
Copy and Paste
To transfer text/sequence files into programs use
copy/paste
  • Ctrl+C to copy, Ctrl+V to paste

6
Start a program
  • Double click on the program you wish to open
  • Microsoft Word can be found under
  • Start > Programs (Programme) > Microsoft Word

7
Create a Folder
  • Start - Programs (Programme) - Windows Explorer - click
  • Desktop - click
  • File (Datei) - New (Neu) - Folder (Ordner) - click
  • the new folder will appear on the screen
  • rename New Folder (Neuer Ordner) to EDV
  • click once on the icon and a second time on the
    text field to activate the editor, then type EDV
  • Save all your future files into this folder

8
Navigating the Internet
  • File transfer protocol (FTP)
  • Allows a person or computer to retrieve files from
    and send files to another computer. Only copies
    are moved; the original file remains untouched.
  • The network terminal protocol (TELNET)
  • Allows a user to log in on any other computer on
    the network, turning the local computer into a
    terminal.

9
Navigating the Internet
  • Every computer on the internet has its own unique
    IP address
  • e.g. 193.171.103.86
  • Because these numbers are not intuitive, they are
    often converted into a name
  • i122server.vu-wien.ac.at addresses the same
    computer
  • Subdirectories on this computer can be specified
  • i122server.vu-wien.ac.at/edv/start.html points to
    a page in the folder edv, which has been prepared
    for this course

10
Navigating the Internet
  • For internet access you need either a modem or a
    direct line
  • Once you are connected with one server you can
    access the full internet
  • For most purposes an internet browser is
    sufficient
  • Internet Explorer
  • Netscape Navigator
  • Omniweb

11
Getting started
  • Open your web browser
  • Type in the address
    http://i122server.vu-wien.ac.at/edv/start.html
  • Press return
  • You could make a bookmark of this page

12
Links
  • Rather than typing a new address each time, it is
    possible to click on specially marked text or
    symbols
  • After a single click you will connect to the
    address
  • To view the address before connecting, simply
    move your mouse over the link

13
Assignment 1
  • Open Netscape Communicator and go to
    http://i122server.vu-wien.ac.at/edv/start.html
  • Create a new folder on the desktop
  • Open Microsoft Word and type in a random DNA
    sequence; save this sequence as a text-only file
    in your folder.
  • Software: Windows Explorer and Word

14
Sequence data
  • Automated DNA sequencing heavily relies on the
    support of computer algorithms
  • Data collection

15
Sequence data
  • The use of 4 different dyes requires intensive
    computer calculations to extract sequence
    information
  • Electropherograms

16
Sequence formats
  • While electropherograms are useful during the
    sequencing project, after its completion
    sequences are stored as text.
  • Plain text contains only the sequence
    information
  • Fasta

17
Other sequence formats
  • GenBank

18
Manipulation of DNA Sequences I
Restriction endonucleases
  • Sticky ends: XhoI (c/tcgag), PstI (ctgca/g), ...
  • Blunt ends: SmaI (ccc/ggg), DraI (ttt/aaa), ...
  • Rare cutters: large recognition sites
  • Frequent cutters: small recognition sites
  • Multiple cloning site
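
A restriction analysis of this kind can be sketched in a few lines of Python. This is only an illustration: the recognition sites are the four listed on the slide, while the function name `find_cut_positions` and the demo sequence are made up for the example. Because all four sites are palindromic, searching one strand is enough here.

```python
import re

# Recognition sites as written on the slide; "/" marks the cut position.
ENZYMES = {
    "XhoI": "c/tcgag",
    "PstI": "ctgca/g",
    "SmaI": "ccc/ggg",
    "DraI": "ttt/aaa",
}


def find_cut_positions(sequence, site_with_cut):
    """Return the 0-based positions at which the enzyme cuts the given strand."""
    cut_offset = site_with_cut.index("/")
    site = site_with_cut.replace("/", "")
    seq = sequence.lower()
    # The lookahead reports overlapping occurrences of the site as well.
    return [m.start() + cut_offset for m in re.finditer(f"(?={site})", seq)]


# Hypothetical demo sequence containing one site for each enzyme.
demo = "aactcgagttcccgggaactgcagtttaaa"
for name, site in ENZYMES.items():
    print(name, find_cut_positions(demo, site))
```

A cut position p means the strand is cleaved between bases p-1 and p (0-based).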
19
Manipulation of DNA Sequences II
20
Assignment 2
1. Open the file pSKII.doc and try to find the
sequences for the sequencing primers M13-forward
(5' gtaaaacgacggccagt 3') and M13-reverse (5'
ggaaacagctatgaccatg 3') as well as the RNA
polymerase promoters T3 (5' aattaaccctcactaaaggg
3') and T7 (5' gtaatacgactcactatagggc 3').
2. Which primers match the single stranded SKII
sequence directly and which are complementary
to it?
Software: Word, JaMBW (Reverse, Complement, Inverse)
Download: pSKII.doc
21
Characteristics of cloning vectors I
  • Multiple cloning site
  • Region for universal (M13 +/-) sequencing primers
  • RNA polymerase (T7, T3, SP6) promoters
  • Genes for selection
22
Watson-Crick DNA strands
The upper strand of the dsDNA is called W
(Watson), the forward strand, and the lower strand
is called C (Crick), the reverse strand.
  • The C strand is the reverse complement of the
    W strand
  • C is antisense to W
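
The relationship between the two strands can be made concrete with a small Python sketch. The function names are chosen for this example only; the primer is the M13-forward sequence from Assignment 2.

```python
# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def complement(seq):
    """Complement each base; the result reads 3' -> 5'."""
    return seq.translate(COMPLEMENT)


def reverse(seq):
    """Reverse the reading order without complementing."""
    return seq[::-1]


def reverse_complement(seq):
    """Return the opposite strand, written 5' -> 3'."""
    return reverse(complement(seq))


m13_forward = "gtaaaacgacggccagt"
print(reverse_complement(m13_forward))
```

To decide whether a primer lies on the W or the C strand, search the W strand for both the primer and its reverse complement.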

23
DNA and RNA Polymerases
  • DNA polymerases need short primers to start DNA
    synthesis
  • RNA polymerases need short promoters
  • Polymerases synthesize DNA/RNA only in the
    5' → 3' direction
  • If an open reading frame (ORF) is encoded by the
    W strand, the C strand codes for the antisense
    gene
  • The C strand can also code for ORFs; then the W
    strand codes for the antisense gene
27
Assignment 3
You received a cDNA clone and the sequence of the
insert (prc1edvkurs.doc) from your colleague. He
told you that the start codon is the "atg" at
position 79. To synthesize an antisense RNA for
use as a Northern blot probe, you have to subclone
the insert into another vector. The vector you
have in the lab is Bluescript (pSKII.doc).
Bluescript contains a multiple cloning site
flanked by sequences for the sequencing primers
M13-forward and M13-reverse and the RNA
polymerase promoters T3 and T7.
a. Find the multiple cloning site in the vector.
b. Find the best cloning strategy using only one
restriction enzyme.
c. Use directed cloning to ensure that all clones
can be used to produce an antisense probe with
the RNA polymerase T3.
d. Define a strategy to modify the clone from (c)
for use with T7 RNA polymerase. Which enzymes
would you use?
Software: Word and Webcutter
Download: prc1edvkurs.doc
28
Manipulation of DNA Sequences III
Polymerase chain reaction (PCR)
http://bibiserv.techfak.uni-bielefeld.de/sadr/pcrtutor.html
29
Manipulation of DNA Sequences IV
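Primer design
  • Size: between 19 and 25 bases
  • Melting temperature: between 48 °C and 60 °C
  • Tm of the forward primer ≈ Tm of the reverse primer
  • $T_m = 2\,(A + T) + 4\,(G + C)$ in °C, where A, T,
    G and C are the base counts in the primer
  • Minimum number of G/Cs: 9 - 11 (40 - 50 %)
  • Distance between primer pairs: 10 bp - 40 kb
  • Annealing sites: unique (3' end)
  • Avoid mispriming, primer-primer interaction and
    hairpin structures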
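
The melting temperature rule on this slide is easy to check by hand or with a short script. The sketch below is a minimal illustration; the function names are invented for the example and the primer is M13-forward from Assignment 2.

```python
def wallace_tm(primer):
    """Melting temperature in °C from Tm = 2 x (A + T) + 4 x (G + C)."""
    p = primer.lower()
    at = p.count("a") + p.count("t")
    gc = p.count("g") + p.count("c")
    return 2 * at + 4 * gc


def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    p = primer.lower()
    return 100.0 * (p.count("g") + p.count("c")) / len(p)


m13_forward = "gtaaaacgacggccagt"
print(len(m13_forward), wallace_tm(m13_forward), round(gc_percent(m13_forward), 1))
```

This rule of thumb is intended for short oligonucleotides; dedicated tools such as Primer3 use more detailed models.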
32
Assignment 4
  • Microsatellites are highly polymorphic markers,
    which are extensively used for paternity testing,
    genome walking, provenance studies and analysis
    of population structures.
  • They consist of tandemly repeated simple
    sequences of di-, tri- and tetranucleotides such
    as (AT)n, (CT)n, (CA)n, (GA)n, (GT)n or (CCT)n, ...
  • Their length variation results from DNA slippage,
    a mechanism which increases and decreases their
    repeat number. The repeats are flanked by unique
    sequences, which allow the design of specific
    primers for the amplification of the
    microsatellite.
  • Please design primer pairs for the amplification
    of a microsatellite using the following criteria:
  • product length 100 - 300 bp
  • annealing temperature higher than 55 °C
  • primer length between 20 and 24 bp
  • Software: Word and Primer3
  • Download: microsatellite.doc
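
Before designing primers it helps to locate the repeat itself. The following sketch scans a sequence for tandem di-, tri- and tetranucleotide repeats with a regular expression. The minimum of five repeat units and the demo sequence are arbitrary assumptions for illustration, not values from the course material.

```python
import re


def find_microsatellites(sequence, min_repeats=5):
    """Return (start, motif, repeat_count) for tandem 2-4 bp repeats."""
    seq = sequence.lower()
    # A lazily matched 2-4 bp motif, followed by at least min_repeats - 1 copies.
    pattern = re.compile(r"([acgt]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for m in pattern.finditer(seq):
        motif = m.group(1)
        if len(set(motif)) == 1:
            continue  # skip mononucleotide runs such as "aaaaaaaaaa"
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


# Hypothetical sequence with a (CA)8 repeat between unique flanks.
demo = "ggatcc" + "ca" * 8 + "ttgacg"
print(find_microsatellites(demo))
```

The flanking sequence on either side of each hit is where the primers would be placed.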

33
The importance of centralized databanks
34
EMBL Databank
35
EMBL SRS
36
Entrez
  • is a search and retrieval system that integrates
    information from databases at NCBI

38
PopSet-prealigned multiple data sets
39
Taxonomy Browser
40
Online Mendelian Inheritance in Man
  • This database is a catalog of human genes and
    genetic disorders. The database contains textual
    information and references. It also contains
    copious links to MEDLINE and sequence records in
    Entrez

41
Objectives
  • What is the function of this gene?
  • Do other genes have this functional motif?
  • Can I predict the higher order structure of this
    protein?
  • Is this gene a member of a known gene family?
  • Do other organisms have this gene?

42
General Database Search Issues
  • Search using amino acid sequence if possible
  • Why? Protein evolution is slower than DNA
    sequence evolution
  • Ask the program to translate your query sequence
    in all 6 possible reading frames.
  • The statistical theory is based on unrealistic
    assumptions, so consider searches as exploratory
    analyses.
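
Translating a query in all six reading frames can be sketched as follows. The codon table is the standard genetic code written in the usual TCAG order; the frame labels (+1 to +3, -1 to -3) and function names are conventions chosen for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate complete codons; codons with unknown letters become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translation of the three forward and three reverse frames."""
    forward = dna.upper()
    reverse_strand = forward.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse_strand[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Search programs that accept a nucleotide query against a protein database perform this step internally.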

43
Similarity Search Jargon
A similarity search of a database is performed by
aligning a query sequence to each sequence in the
database. If good matches are found, the search
returns a list of HSPs (high-scoring segment
pairs).
44
Alignment Jargon
Evolutionarily related sequences differ from one
another because of several processes
  • Substitutions
  • Insertions
  • Deletions

[Figure: an ancestor sequence diverging into the observed sequences]
45
Alignment Jargon
GCG
ACG
Substitution (A→G)
  • 1 mismatch
  • 2 matches

46
Alignment Jargon
ATCG
A-CG
Insertion (T)
  • 0 mismatches
  • 3 matches
  • 1 gap

47
Alignment Jargon
Deletion
ATCG A-CG
  • 0 mismatches
  • 3 matches
  • 1 gap

48
Alignment Jargon
Results of insertion and deletion events can be
indistinguishable. Indel = INsertion or DELetion
49
Sequence Alignment
  • Sequence alignment is simply the optimal
    assignment of substitution and indel events to a
    pair of sequences.
  • Global alignment: align entire sequences
  • Local alignment: find best matching regions of
    sequences

50
Alignment of pairs of sequences
  • Dot matrix analysis
  • Dynamic programming
  • Word (k-tuple) methods

51
Dot matrix
  • Sequence A is compared against B
  • Matching bases are marked on an A × B grid

53
Dot matrix
  • The background could be adjusted by changing the
    window size

54
Dot matrix
  • The background could be adjusted by changing the
    window size
  • (phage lambda and P22 repressor proteins)
  • 1/1 7/11 15/23
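
A dot matrix with an adjustable window can be sketched as below. This is a minimal illustration rather than the method used by Dotlet: the parameter names `window` and `stringency` (the minimum number of matches required inside the window) are assumptions made for the example.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) window start positions that count as a dot.

    A dot is placed when the windows starting at position i in A and
    position j in B share at least `stringency` identical residues.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the grid as text: columns follow sequence A, rows follow sequence B."""
    lines = ["  " + seq_a]
    for j, residue in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "ACGTACGT", "ACGTTCGT"
print(render(a, b, dot_matrix(a, b)))
print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))
```

With a window of 1 every chance identity produces a dot; requiring three matches in a window of three removes most of that background and leaves the diagonals.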

55
Dot matrix
  • Search for conserved regions and domains
  • Identify repeated nucleic acid and protein
    domains
  • Determine introns and exons
  • Find inverted repeats and stem-loop structures
  • regions of low complexity
  • frameshifts

57
Assignments 5 and 6
  • You isolated a cDNA clone (PlecDNA.doc) and you
    would like to know how many introns are in the
    gene. Fortunately, you are working with a fully
    sequenced organism, so it is easy to retrieve
    the full genomic region (Plegenomic.doc).
  • a) How many introns does the gene contain?
  • b) What are the sequences (10 bp) around
    introns 2 and 3 and the corresponding exon
    borders?
  • Software: Word and Dotlet
  • Download: PlecDNA.doc, Plegenomic.doc

58
Assignment 6
  • The previous analysis showed that the dot matrix
    program allows some useful interpretation of DNA
    sequences. You have recently isolated a genomic
    fragment (Test.doc) and, encouraged by the
    earlier results, you analyze it with the dot
    matrix program.
  • How can you explain the pattern you see in the
    dot matrix?
  • Delete an internal portion of the sequence and
    compare the full sequence with the deleted one.
  • What is the pattern on the dot matrix?
  • Software: Word and Dotlet
  • Download: Test.doc

59
Dynamic programming
  • The dynamic programming algorithm provides a
    reliable computation method for aligning
    sequences
  • The method has been proven mathematically to
    yield the optimal alignment (note there may be
    more than a single optimal alignment)
  • Both local and global alignments can be produced

60
Problem of alignment
  • Roughly n x m comparisons need to be made for two
    sequences of length n and m.
  • If the alignment is to include gaps of any length
    at any position in either sequence, the number of
    comparisons that must be made becomes
    astronomical
  • Dynamic programming is a method of sequence
    alignment that can take gaps into account but
    requires only a moderate number of comparisons

61
The algorithm
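Sequence 1: V D S C Y
Sequence 2: V D S L C Y
Substitution scores (rows: sequence 2, columns: sequence 1)
        V    D    S    C    Y
   V    4   -3   -2   -1   -1
   D   -3    6    0   -3   -3
   S   -2    0    4   -1   -2
   L    1   -4    0   -2   -1
   C   -1   -3   -1    9   -2
   Y   -1   -3   -2   -2    7
Alignment (sequence 2, sequence 1, score)
   V   V     4
   D   D     6
   S   S     4
   L   -   -11
   C   C     9
   Y   Y     7
Total score: 4 + 6 + 4 - 11 + 9 + 7 = 19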
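
The dynamic programming idea can be sketched for this example. The code below is a minimal global alignment with a linear gap score, assuming the substitution scores and the gap score of -11 shown on the slide; the score table covers only the residue pairs that appear there, and all names are chosen for the illustration.

```python
COLUMNS = "VDSCY"  # residues of the first sequence, in table column order
# Substitution scores from the slide; rows are residues of the second sequence.
SCORES = {
    "V": [4, -3, -2, -1, -1],
    "D": [-3, 6, 0, -3, -3],
    "S": [-2, 0, 4, -1, -2],
    "L": [1, -4, 0, -2, -1],
    "C": [-1, -3, -1, 9, -2],
    "Y": [-1, -3, -2, -2, 7],
}
GAP = -11  # score of aligning a residue with a gap


def score(a, b):
    """Score for residue a (first sequence) against residue b (second sequence)."""
    return SCORES[b][COLUMNS.index(a)]


def global_alignment(seq_a, seq_b):
    """Return (score, aligned_a, aligned_b) of an optimal global alignment."""
    n, m = len(seq_a), len(seq_b)
    # F[i][j] = best score for aligning seq_a[:i] with seq_b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * GAP
    for j in range(1, m + 1):
        F[0][j] = j * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]),
                F[i - 1][j] + GAP,
                F[i][j - 1] + GAP,
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]):
            top.append(seq_a[i - 1]); bottom.append(seq_b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + GAP:
            top.append(seq_a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(seq_b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, aligned_a, aligned_b = global_alignment("VDSCY", "VDSLCY")
print(best)
print(aligned_a)
print(aligned_b)
```

The table F has (n + 1) × (m + 1) cells, which is why the method needs only on the order of n × m comparisons even though gaps are allowed anywhere.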
62
A sub-optimal alignment
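Sequence 1: V D S C Y
Sequence 2: V D S L C Y
Substitution scores (rows: sequence 2, columns: sequence 1)
        V    D    S    C    Y
   V    4   -3   -2   -1   -1
   D   -3    6    0   -3   -3
   S   -2    0    4   -1   -2
   L    1   -4    0   -2   -1
   C   -1   -3   -1    9   -2
   Y   -1   -3   -2   -2    7
Alignment (sequence 2, sequence 1, score)
   V   V     4
   D   D     6
   S   S     4
   L   C    -2
   C   Y    -2
   Y   -   -11
Total score: 4 + 6 + 4 - 2 - 2 - 11 = -1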
63
Measuring Alignment Quality (subjective criteria)
  • Good alignments should have
  • many exact matches
  • few mismatches
  • many of the mismatches should be similar
    residues
  • few gaps

64
Measuring Alignment Quality (objective criteria)
  • What is the expected number of HSPs with a score
    of at least S?
  • $E = K m n e^{-\lambda S}$
  • K: constant dependent on the frequency of
    nucleotides
  • m, n: lengths of the sequences
  • $\lambda = \ln(1/p)$, where p is the probability of
    a match of identical bases (1/4 for equal base
    frequencies)

65
Measuring Alignment Quality (objective criteria)
Bit scores: raw scores have little meaning without
detailed knowledge of the scoring system used. By
normalizing a raw score S using
$S' = \frac{\lambda S - \ln K}{\ln 2}$
one obtains a bit score S'.
67
Measuring Alignment Quality (objective criteria)
Bit scores: the E value for a given bit score S' is
$E = m n 2^{-S'}$
Bit scores subsume the statistical essence
of the scoring system; hence, to calculate
significance one needs to know only the size of
the search space.
68
Measuring Alignment Quality (objective criteria)
  • Significance of an HSP score
  • $P(S > x) = 1 - \exp(-K m n e^{-\lambda x})$
  • $P(S > x) = 1 - \exp(-E)$
  • m, n: effective lengths of query and databank
    sequence
  • E: number of expected HSPs with score at least S


69
Measuring Alignment Quality (objective criteria)
  • Significance
  • Some programs provide E-values rather than
    P-values, as E is easier to understand, e.g.
    E-values of 5 and 10 correspond to P-values of
    0.993 and 0.99995
  • The P-value is associated with the E-value, e.g.
    if one expects to find 3 HSPs with score > S, the
    probability of finding at least one is 0.95
  • When E < 0.01, P-values and E-values are nearly
    identical
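
The relationships between raw score, bit score, E-value and P-value can be checked numerically. The sketch below simply encodes the formulas $E = K m n e^{-\lambda S}$, $S' = (\lambda S - \ln K)/\ln 2$, $E = m n 2^{-S'}$ and $P = 1 - e^{-E}$; the function names are chosen for the example, and any values of K and λ passed in are the caller's assumptions.

```python
import math


def expected_hsps(raw_score, m, n, K, lam):
    """E-value from a raw score: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E-value from a bit score: E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


def p_value(e_value):
    """Probability of at least one HSP with this score: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-e_value)


for e in (0.01, 3, 5, 10):
    print(e, round(p_value(e), 5))
```

For small E the two quantities agree closely, because $1 - e^{-E} \approx E$ when E is near zero.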


70
Scoring matrices
  • Rationale
  • Certain aa replacements occur often in a protein.
    Because proteins still function despite these
    changes, the substituted aa are compatible with
    structure and function. Other substitutions
    are rare.
  • A scoring matrix accounts for these
    differences

71
Scoring matrices
  • Dayhoff, 1978
  • PAM (point accepted mutation) matrices
  • Henikoff & Henikoff, 1992
  • BLOSUM (blocks amino acid substitution
    matrices)

72
PAM matrices
  • This family of matrices lists the likelihood of
    change from one aa to another in homologous
    proteins during evolution
  • Each matrix gives the changes expected for a
    given period of evolutionary time
  • Assumption
  • Each change in the current aa is independent of
    previous mutation events at that site.
  • aa changes observed in short evolutionary times
    can be extrapolated to longer periods

73
PAM matrices
  • aa substitutions that occur in a group of
    evolving proteins were estimated.
  • Because these changes are observed in closely
    related proteins, they represent aa substitutions
    that do not change the function of the protein
    → accepted mutations
  • 1572 changes in 71 groups of protein sequences
    were observed
  • The number of changes at each aa was counted on
    a phylogenetic tree
  • and divided by the exposure to mutation (aa
    frequency × number of aa in that group) → PAM1
  • Asn, Ser, Asp, Glu (highly mutable) Cys, Trp
    (least mutable)

74
PAM matrices
  • The PAM 1 matrix gives the probability of a
    single change
  • To obtain PAM matrices for N mutations, the PAM1
    matrix is multiplied by itself N times
  • PAM250 represents a level of 250 % change
    (corresponds to 20 % similarity)
  • Computer simulations have shown that PAM250
    provides a better scoring alignment than lower
    numbered PAMs for distantly related (14-27 %
    similarity) proteins.

75
PAM log odds score
  • PAM matrices are usually converted into log odds
    matrices
  • The odds are the ratio of the likelihood that the
    change represents an authentic evolutionary
    variation to the likelihood that the change
    occurred because of random sequence variation
    (no biological significance)
  • Phe→Tyr
  • Phe-Tyr score in PAM250: 0.15
  • Frequency of Phe in data: 0.04
  • Log odds score: 10 × log10(0.15/0.04) = 5.7

76
PAM250
77
BLOSUM matrices
  • 500 families of related proteins
  • Search for ungapped aa blocks that were present

78
Gap scores
  • The cost of introducing a gap must be higher than
    the cost for extending it
  • $W_x = g + r x$
  • g gap opening penalty
  • x length of the gap
  • r gap extension penalty

80
Assignment 7
  • You have obtained a peptide sequence
    (ASFPCLNGGTCNDQVNGYVCVCAQDTSVSTCET) and
  • would like to find its position in the full
    length protein.
  • Software: Word and Blast2 Sequences
  • Download: UEGF1.doc

81
Multiple alignments
  • Problem
  • Alignment of
  • two sequences (length N): $N^2$ comparisons
  • 300 aa: $9 \times 10^4$
  • three sequences (length N): $N^3$ comparisons
  • 300 aa: $2.7 \times 10^7$
  • → exact multiple alignments are not feasible for
    most data sets; heuristic methods are required

82
Progressive methods for multiple alignment
  • PILEUP
  • Part of the GCG package
  • CLUSTALW
  • Available as local programs (Mac, PC, Unix)
  • Can also be run on remote computers

83
Progressive alignment algorithm
  • Produce a global pairwise alignment for all pairs
    of sequences
  • Full dynamic programming
  • K-tuple approach, similar to FASTA
  • Calculate the pairwise alignment scores
  • Build a tree based on the genetic distances
    derived from the alignment scores (NJ)
  • Align the sequences sequentially, guided by the
    phylogenetic relationships indicated by the tree

84
Progressive alignment weighting
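  • Problem: similar sequences will produce a bias in
    the alignment
  • Solution: weighting of sequences based on
    alignment scores

[Figure: guide tree in which A (branch length 0.2) and B (branch length 0.1) share an internal branch of length 0.3, and C has branch length 0.5]
  • Weight of A: 0.2 + 0.3/2 = 0.35
  • Weight of B: 0.1 + 0.3/2 = 0.25
  • Weight of C: 0.5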
85
Progressive alignment problems
  • Dependence on the initial pairwise alignments
  • No problem for closely related sequences
  • The more diverged the sequences are, the more
    problematic is the alignment
  • Choice of suitable scoring matrices and gap
    penalties that apply to the entire set of
    sequences
  • → Bayesian methods such as hidden Markov models
    (HMMs) may be preferable for distantly related
    sequences

86
Single sequence queries
  • Rationale a single sequence should be searched
    against a database to identify those sequences,
    which are most similar
  • Identification of a related gene in another
    organism
  • Identification of a related gene in the same
    organism
  • Similarity may provide clues about function

87
Data banks
  • Genomic sequences
  • Complete genomes
  • cDNA/proteins
  • ESTs (expressed sequence tags)

88
FASTA and BLAST rationale
  • Main idea: good alignments are expected to share
    several aa. Hence, consecutive shared aa (words,
    k-tuples) could serve as an indicator of quality.
  • Observation: HSPs of interest are usually longer
    than a single word, so look for multiple hits on
    the same diagonal, separated by a short distance

89
FASTA
  • FASTA3 is the latest version with increased
    ability to detect distantly related sequences
  • Input
  • k: size of matching sequence patterns or words,
    called k-tuples
  • Similarity matrix
  • Compares query sequence pairwise with each
    sequence in the database

90
FASTA hashing algorithm
  • Search for k consecutive matches
  • Use a precompiled table that lists where in the
    database each possible word occurs
  • Generating the table takes time on the order of
    L (size of databank)
  • Using it takes time on the order of N (size of
    query sequence)

91
FASTA hashing algorithm
  • word size 1 aa

92
FASTA algorithm
  • Hashing: build a library of words of k consecutive
    residues and search the database represented by
    such a library
  • Note: it is not the database itself that is
    searched, but the library
  • DNA: k = 4-6; protein: k = 1-2
  • Longer words result in a faster, but less
    sensitive search
  • Joining: matches within a certain distance
    of each other are joined, along with the region
    between them, into a longer matching region
    without gaps.
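
The hashing step can be illustrated with a small word lookup table. This is a sketch of the general k-tuple idea, not the FASTA source code: the function names and the demo sequences are made up, and only exact word matches are counted.

```python
from collections import defaultdict


def build_word_table(sequence, k):
    """Map every word of length k to the positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(sequence) - k + 1):
        table[sequence[pos:pos + k]].append(pos)
    return table


def diagonal_hits(query, subject, k):
    """Count exact k-tuple matches per diagonal (subject position - query position)."""
    table = build_word_table(subject, k)
    diagonals = defaultdict(int)
    for q_pos in range(len(query) - k + 1):
        for s_pos in table.get(query[q_pos:q_pos + k], []):
            diagonals[s_pos - q_pos] += 1
    return dict(diagonals)


print(diagonal_hits("ACGTAC", "TTACGTACGG", k=4))
```

Diagonals that collect many hits mark candidate regions, which are then rescored and joined in the following steps.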

93
FASTA algorithm
  • Filtering: the 10 best matching regions are
    rescored using a scoring matrix (BLOSUM or PAM)
  • Ends of the regions are trimmed to remove
    residues not contributing to the score
  • The best scoring region, INIT1, is reported
  • Joining: regions that are near enough are joined.
    The score of this larger region, including
    penalties for gaps needed to join the initial
    regions, is reported as INITN.
  • Distance for proteins: k = 1: 32; k = 2: 16

94
FASTA algorithm
  • Later versions of FASTA include an optimization
    step
  • When INITN reaches a certain threshold, the score
    of the region is recalculated to produce an OPT
    score by performing a full local alignment using
    dynamic programming.
  • This procedure increases sensitivity but
    decreases selectivity

95
Limitations of FASTA
  • FASTA can miss significant similarity since
  • For proteins, similar sequences do not have to
    share identical residues
  • Asp-Lys-Val is quite similar to Glu-Arg-Ile yet
    it is missed even with k-tuple size of 1 since no
    amino acid matches
  • For nucleic acids, due to codon wobble, DNA
    sequences may look like XXyXXyXXy where Xs are
    conserved and ys are not

96
BLAST (1): Basic Local Alignment Search Tool
  • Filter: low complexity regions are removed
  • Divide the query sequence into words (sliding by 1
    position)
  • Include imperfection based on a scoring matrix:
    similar words which produce a score higher than T
    are assembled into a list
  • This step is included to permit imperfect
    matches between subject and query sequence
  • Usually about 50 entries per word (rather than
    20 × 20 × 20 = 8000)


97
BLAST (2): Basic Local Alignment Search Tool
  • Approach: find segment pairs by first finding
    word pairs that score above a threshold, i.e.,
    find word pairs of fixed length w with a score of
    at least T
  • Key concept: this seems similar to FASTA, but we
    are searching for words which score above T
    rather than words that match exactly


98
BLAST (3): Basic Local Alignment Search Tool
  • Each database entry is scanned for a match to one
    of the list entries
  • Use the short matched regions (x) lying on the
    same diagonal and within distance A as starting
    points for a longer ungapped alignment between
    words


99
BLAST (4): Basic Local Alignment Search Tool
  • Extension of the alignment from the matching
    words in each direction along the sequences.
    Extension continues as long as the score
    increases. The extension is stopped when the
    accumulated score stops increasing and has just
    begun to fall a small amount below the best score
    found for a shorter extension.
  • The obtained segment is called a high scoring
    segment pair (HSP)
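
The extension step can be sketched as an ungapped extension that stops once the running score has dropped a fixed amount below the best score seen so far. The match and mismatch scores (+5 / -4), the drop-off value and all names are assumptions chosen for illustration; BLAST itself scores protein words with a substitution matrix.

```python
def simple_score(a, b):
    """Hypothetical scoring: +5 for a match, -4 for a mismatch."""
    return 5 if a == b else -4


def extend_one_direction(query, subject, q, s, step, x_drop, score):
    """Extend from (q, s) in direction `step`; return (best_score, best_length)."""
    total = best = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score(query[q], subject[s])
        length += 1
        if total > best:
            best, best_length = total, length
        elif best - total > x_drop:
            break  # the score has fallen too far below the best one
        q += step
        s += step
    return best, best_length


def extend_seed(query, subject, q_pos, s_pos, word_len, x_drop=10, score=simple_score):
    """Return (hsp_score, query_start, query_end) for a seed word match."""
    seed = sum(score(query[q_pos + i], subject[s_pos + i]) for i in range(word_len))
    right, r_len = extend_one_direction(
        query, subject, q_pos + word_len, s_pos + word_len, +1, x_drop, score)
    left, l_len = extend_one_direction(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop, score)
    return seed + left + right, q_pos - l_len, q_pos + word_len + r_len


query, subject = "GGACGTACGTTT", "CCACGTACGTAA"
print(extend_seed(query, subject, q_pos=4, s_pos=4, word_len=3))
```

The returned interval is half-open, so `query[start:end]` is the query side of the HSP.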


100
BLAST (5): Basic Local Alignment Search Tool
  • Determine whether the HSP has a score larger than
    a cutoff score S
  • S is determined by examining the range of scores
    found by comparing random sequences and by
    choosing a value that is significantly greater
  • Determine significance of each HSP score
  • $P(S > x) = 1 - \exp(-K m n e^{-\lambda x})$
  • $P(S > x) = 1 - \exp(-E)$
  • m, n: effective lengths of query and databank
    sequence
  • E: number of expected HSPs with score at least S


101
BLAST (6): Basic Local Alignment Search Tool
  • Significance
  • BLAST provides E-values rather than P-values, as
    E is easier to understand, e.g. E-values of 5 and
    10 correspond to P-values of 0.993 and 0.99995
  • The P-value is associated with the E-value, e.g.
    if one expects to find 3 HSPs with score > S, the
    probability of finding at least one is 0.95
  • When E < 0.01, P-values and E-values are nearly
    identical


102
Selecting the BLAST program
103
FASTA-BLAST comparison
104
Significance of database searches (1)
  • All previous theory referred to the comparison of
    two sequences. How should one consider the entire
    set of sequences?
  • 1. Significance is independent of the length of a
    sequence → multiply pairwise significance by the
    number of sequence entries (FASTA)
  • 2. Significance depends on length, as long
    sequences are composed of multiple distinct
    domains → treat the entire database as a single
    sequence for calculation of significance

105
Significance of database searches (2)
  • Until now, only ungapped sequences were
    considered.
  • Computational experiments and analytical results
    suggest that the same theory could be applied to
    gapped alignments
  • For ungapped alignments the statistical
    parameters (λ, K) can be calculated using analytic
    formulas
  • For gapped alignments these parameters must be
    estimated from a large-scale comparison of
    random sequences

106
Significance of database searches (3)
  • Gapped alignments
  • FASTA: local alignment scores are produced for
    the comparison of the query with every databank
    sequence. Most of these scores involve unrelated
    sequences; they could therefore be used to
    estimate λ and K. Problem: scores from pairs of
    related sequences should be excluded
  • BLAST: λ and K are estimated for a selected set
    of substitution matrices and gap costs. The
    estimation could be done with real sequences, but
    has instead relied on random sequences

107
Hidden Markov Model (HMM)
  • HMMs offer a more systematic approach to
    estimating model parameters
  • HMMs could be compared to a kind of dynamic
    statistical profile
  • Like an ordinary profile, it is built by
    analyzing the distribution of aa in a training
    set of related proteins
  • The topology of a HMM can be visualized as a
    finite state machine

108
Hidden Markov Model (HMM)
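[Figure: profile HMM drawn as a finite state machine from Begin to End, with match states (emitting e.g. A, C, C, G), insert states and delete states]
Movement from state n to n+1 occurs with a certain
transition probability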
109
Hidden Markov Model (HMM)
  • More than one path leads to the same result

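[Figure: profile HMM drawn as a finite state machine from Begin to End, with match states (emitting e.g. A, C, C, G), insert states and delete states]
Movement from state n to n+1 occurs with a certain
transition probability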
110
Hidden Markov Model (HMM)
  • The log probability of a given sequence along a
    path is the sum of ln(transition probabilities)
  • It is called a hidden Markov model because the
    path is hidden
  • Transition probabilities are obtained by training
    on a set of sequences
  • Initialization by estimated transition
    probabilities
  • All possible paths generating a given sequence
    are visited proportional to the estimated
    transition probabilities
  • Counting the number of times a given transition
    was visited during the above step provides
    improved transition probabilities
  • The Viterbi algorithm is used on a trained HMM to
    determine the best path
  • The Viterbi algorithm is similar to dynamic
    programming
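
The Viterbi step can be sketched for a generic HMM. The two-state model below (an "AT"-rich and a "GC"-rich state with made-up probabilities) is purely hypothetical and much simpler than the profile HMM in the figure; it only shows how summing log probabilities and keeping back-pointers yields the best path.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log_probability, best_path); all probabilities must be non-zero."""
    # v[s] = best log probability of any path that ends in state s.
    v = {s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]]) for s in states}
    back_pointers = []
    for obs in observations[1:]:
        previous, v, pointers = v, {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: previous[p] + math.log(trans_p[p][s]))
            v[s] = (previous[best_prev] + math.log(trans_p[best_prev][s])
                    + math.log(emit_p[s][obs]))
            pointers[s] = best_prev
        back_pointers.append(pointers)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: v[s])
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return v[last], path[::-1]


# Hypothetical toy model with invented probabilities.
STATES = ("AT", "GC")
START = {"AT": 0.5, "GC": 0.5}
TRANS = {"AT": {"AT": 0.9, "GC": 0.1}, "GC": {"AT": 0.1, "GC": 0.9}}
EMIT = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

log_p, path = viterbi("AATTGGCC", STATES, START, TRANS, EMIT)
print(round(log_p, 3), path)
```

As in sequence alignment by dynamic programming, each cell reuses the best result of the previous column instead of enumerating every path.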

111
Hidden Markov Model (HMM)
  • HMM is a general technique that can be applied to
    many different questions
  • Multiple sequence alignment
  • Identification of conserved domains
  • Gene prediction
  • Protein secondary structure prediction

112
Single aa sequence query programs
  • Sequence similarity with query sequence
  • FASTA, BLAST
  • Alignment search with profile (scoring matrix
    with gap penalties)
  • PROFILESEARCH
  • Search with position specific scoring matrix
    (PSSM) representing ungapped sequence alignment
    (BLOCK)
  • MAST
  • Iterative alignment search for similar sequences
    that starts with query sequence, builds a gapped
    multiple alignment, and then uses this to augment
    the search
  • PSI-BLAST
  • Search query sequence for patterns representative
    of protein families
  • PROSITE, INTERPRO, PFAM, CDD/IMPALA

117
Comparison of EMBL and NCBI
119
Assignments 8 to 10
  • You have isolated a number of proteins by their
    interaction with a protein known to interact
    with RING finger proteins. By sequencing the
    proteins you obtained the following peptides:
  • from human cell lines:
    msvdmnsqgsdsneedydpnceeeeeeeeddpgdie
  • from C. elegans: mnsddeiymegsasseddmddeclsd and
    mddedmsctsgddyagygdedyyneadv
  • from Drosophila melanogaster:
    mdsdndndfcdnvdsgnvssgddgdddfg and
    mdsdiemdmesdndgeydddydyyntgedcd
  • from Saccharomyces cerevisiae:
    mssgtendqfysfdesdsssielyeshntseftihglv
  • from Arabidopsis thaliana:
    mdnnsvigsevdaeadesyvnaaledgqtgkks and
    mddyfsaeeeacyyssdqdsldgidneeselqpl
  • a. Find the complete protein sequence for every
    given peptide and align the sequences to find
    out about their overall homology.
  • b. Are there RING finger motifs in your proteins
    and, if yes, how many and where?
  • c. RING finger proteins share a common protein
    motif of
    C-X2-C-X9-29-C-X1-3-H-X2-3-C/H-X2-C-X4-48-C-X2-C.
  • d. Are there other remarkable protein motifs?
  • Software: Word, BLAST, FastA and ClustalW
120
Assignment 9
  • You received a manuscript submitted for
    publication. The authors claim that they have
    discovered a gene involved in abnormal muscle
    growth in salmon (hs, heavy salmon). You should
    decide whether the paper should be published.
  • b. What gene is it? Is it really a novel gene?
  • c. Do you support the authors' claim that this
    is a salmon gene?
  • d. Could the authors' claim be true?
  • Software: Word, FastA, BLAST, Pubmed
  • Download: hs_gene.doc

121
Assignment 10
  • Inspired by the manuscripts you reviewed, you
    decide to look for the gene in whales.
  • a. Make a sequence alignment to design primers
    for cross species amplification
  • b. Design primers that have a fair chance to
    amplify the gene from whales
  • c. You know that human contamination is a
    problem in your lab. What would you do to
    minimize the risk of human contamination?
  • Software: Word, BLAST, FastA, ClustalW

122
Organismal databases
123
Arabidopsis thaliana
124
Drosophila BDGP (1)
125
Drosophila BDGP (2)
126
Drosophila Flybase
127
Drosophila NCBI
128
Assignments 11 to 13
  • In Drosophila microsatellites are very short. Try
    to find the longest dinucleotide microsatellite
    in D. melanogaster
  • Software FLYBASE, BDGP, BLAST,

129
Assignment 12
  • ITS sequences are widely employed to reconstruct
    the phylogeny of closely related species. The
    major advantage of ITS sequences is that you
    can use primers (located in the 18S and 28S
    rDNA) which are conserved across many species.
    You have used these conserved primers to amplify
    the complete ITS region from oaks. The PCR
    products were cloned and sequenced. In the folder
    oaks you find the results of your experiment.
  • Figure 1. Organization of the rDNA
  • a. Make a contig of your sequences
  • b. Define the boundaries of the genes with the
    spacers
  • c. Verify that your sequences originate from
    oaks.
  • Software: Word, JaMBW, ClustalW, BLAST, FastA
  • Download: oak1, oak2, oak3, oak4, oak5

130
Assignment 13
  • You received one pair of microsatellite primers,
    ran a PCR and found a highly interesting pattern
    in one population (no variability). Inspired by
    this result, you would like to know more
    about the locus. Unfortunately, you found only
    the sequence of one of the primers
    (ttttgtcgttttcgttatg) and your friend has gone
    on a six-month holiday. Fortunately, you are
    working with one of the best studied organisms,
    Drosophila melanogaster, so you have every
    opportunity to investigate!
  • a. What is the repeat motif of your
    microsatellite?
  • b. Which gene is in close proximity to the
    microsatellite?
  • c. On which chromosome is the gene located?
  • d. Determine the number of available transposon
    insertions in the gene
  • e. Where in the gene are the transposons
    inserted?
  • f. What would you do to obtain a fly stock
    with the gene deleted?
  • Software: FastA, BLAST, FlyBase, BDGP

131
Gene prediction
132
Gene prediction
  • Goal: identify those regions that code for
    proteins
  • Direct approach: look for stretches that can be
    interpreted as protein using the genetic code
  • Statistical approaches: use other knowledge about
    likely coding regions

[Figure: gene structure with 5' UTR, exons, introns and 3' UTR]
133
Gene prediction direct approach
  • Genetic code
  • The universal (standard) genetic code is common to
    almost all organisms
  • Prokaryotes, mitochondria and chloroplasts often
    use slightly different genetic codes
  • More than one tRNA may be present for a given
    codon, allowing more than one possible
    translation product
  • Most differences between genetic codes involve
    start and stop codons, although a few sense codons
    are also reassigned (e.g. AUA encodes Met in
    vertebrate mitochondria)
  • Alternate initiation codons: codons that encode
    amino acids but can also be used to start
    translation (GUG, UUG, AUA, UUA, CUG)
  • Suppressor tRNA codons: codons that normally stop
    translation but are translated as amino acids
    (UAG, UGA, UAA)
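
The effect of using a different genetic code can be seen by translating the same sequence with two codon tables. The sketch below is a minimal illustration; the two 64-letter strings follow the layout of the NCBI translation tables 1 (standard) and 2 (vertebrate mitochondrial), with codons ordered by T, C, A, G at each position. The helper names and the nine-base test sequence are made up for the example.

```python
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # TTT, TTC, TTA, ...

# Amino acids in TCAG codon order; '*' marks a stop codon.
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"

def make_table(amino_acids):
    """Map each codon to its one-letter amino acid."""
    return dict(zip(CODONS, amino_acids))

def translate(dna, table):
    """Translate in frame 1; unknown codons (e.g. containing N) become 'X'."""
    dna = dna.upper().replace("U", "T")
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

standard = translate("ATGATATGA", make_table(STANDARD))
mito = translate("ATGATATGA", make_table(VERTEBRATE_MITO))
print(standard, mito)
```

With the standard code the sequence reads Met-Ile-stop; with the vertebrate mitochondrial code ATA is read as Met and TGA as Trp, so the same nine bases give Met-Met-Trp.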

134
Gene prediction direct approach
  • Reading Frames
  • Since nucleotide sequences are read three bases
    at a time, there are three possible frames in
    which a given nucleotide sequence can be read
    (in the forward direction)
  • Taking the complement of the sequence and reading
    in the reverse direction gives a total of six
    reading frames
  • An open reading frame (ORF) is a run of
    consecutive codons not interrupted by a stop codon
  • Note: not all ORFs are actually used
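
The six reading frames and the stop-free stretches within them can be enumerated in a few lines. This is a minimal sketch, assuming the standard stop codons and a plain DNA string; the function names and the twelve-base toy sequence are illustrative only, and real ORF finders usually also require a start codon and a minimum length.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]

def six_frames(dna):
    """Return {frame_label: list_of_codons} for frames +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{sign}{offset + 1}"] = codons
    return frames

def longest_open_stretch(codons):
    """Longest run of consecutive codons not interrupted by a stop codon."""
    best, current = [], []
    for codon in codons:
        if codon in STOP_CODONS:
            current = []
        else:
            current.append(codon)
            if len(current) > len(best):
                best = list(current)
    return best

frames = six_frames("ATGAAATAGCCC")
orfs = {label: longest_open_stretch(codons) for label, codons in frames.items()}
for label in sorted(orfs):
    print(label, orfs[label])
```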

135
Gene prediction direct approach
  • Statistical support by Fickett's statistic:
    codon usage bias
  • Observation: in coding regions, every third base
    tends to be the same one much more often than
    expected by chance.
  • The reason for this is codon usage bias
  • Different levels of expression of different tRNAs
    for a given amino acid lead to pressure on coding
    regions to conform to the preferred codon usage
  • Non-coding regions, on the other hand, are under
    no such selective pressure and can drift
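
The "every third base" observation can be quantified by comparing how often each base occurs at the three codon positions. The sketch below computes only this position asymmetry, max count divided by (min count + 1), which is one ingredient of Fickett's test; the full statistic also uses base composition and empirically derived weights, which are not reproduced here. The function name and toy sequences are assumptions for illustration.

```python
def position_asymmetry(seq):
    """For each base, compare its counts at codon positions 1, 2 and 3.

    Returns {base: max_count / (min_count + 1)}. Values well above 1 mean
    the base prefers one codon position, as expected in coding sequence.
    """
    seq = seq.upper()
    result = {}
    for base in "ACGT":
        # seq[offset::3] picks every third base starting at `offset`.
        counts = [seq[offset::3].count(base) for offset in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result

periodic = position_asymmetry("ATGATGATGATG")  # strongly periodic toy sequence
mixed = position_asymmetry("AAACCCGGGTTT")     # no position preference
print(periodic)
print(mixed)
```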

136
Gene prediction direct approach
  • Statistical support by Fickett's statistic:
    codon usage bias
  • Example: glycine codon frequencies
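
Glycine codon frequencies of this kind can be tabulated from any in-frame coding sequence. The following is a minimal sketch; the function name and the short test sequence are invented, and the numbers it prints are not the values from the original slide.

```python
from collections import Counter

GLYCINE_CODONS = ("GGA", "GGC", "GGG", "GGT")

def glycine_codon_usage(coding_seq):
    """Relative frequency of each glycine codon in an in-frame coding sequence."""
    seq = coding_seq.upper().replace("U", "T")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in GLYCINE_CODONS)
    total = sum(counts.values())
    if total == 0:
        return {}  # no glycine codons in this sequence
    return {codon: counts[codon] / total for codon in GLYCINE_CODONS}

usage = glycine_codon_usage("ATGGGCGGCGGAGGCTAA")
print(usage)
```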

137
Gene prediction direct approach
exon
138
Gene prediction direct approach
  • Problem: the direct approach works well for
    prokaryotes but not for eukaryotes
  • Codon usage bias is not constant across genes
  • Introns in Eukaryotes

139
Gene prediction statistical approach
  • To discriminate between different regions of a
    gene, typical sequence elements are used as
    clues
  • Content sensor: a region of residues with similar
    properties (introns, exons)
  • Signal sensor: a specific signal sequence (may be
    a consensus)

[Figure: gene structure with 5' UTR, exons, introns and 3' UTR]
140
Pre-mRNA splicing
141
Gene Finding Software
  • GENSCAN (HMM)
  • HMMGENE (HMM)
  • GENMARK (HMM)
  • GRAIL (neural network)

142
Evaluation of gene predictions
  • One has to discriminate between
  • True positives (TP)
  • False positives (FP)
  • False negative (FN)
  • Sensitivity = TP / (TP + FN)
  • Specificity = TP / (TP + FP)
  • GRAIL was used for different human data sets
  • Sensitivity 0.48-0.65, specificity 0.61-0.72
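
The two measures are simple ratios of the counts listed above. A minimal sketch, reading the formulas as TP/(TP+FN) and TP/(TP+FP); note that this definition of specificity is the one customary in gene prediction and corresponds to what is elsewhere called precision. The counts are invented for illustration and are not GRAIL results.

```python
def sensitivity(tp, fn):
    """Fraction of true coding elements that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

# Invented counts, for illustration only.
tp, fp, fn = 60, 30, 40
sn = sensitivity(tp, fn)
sp = specificity(tp, fp)
print(f"sensitivity={sn:.2f} specificity={sp:.2f}")
```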

143
Promoter prediction
  • Similar to gene prediction, known regulatory
    signals could be used to make predictions
  • Algorithms
  • Neural networks
  • HMMs

147
Analyzing Gene Expression (Microarray) Data
148
Assignments 14 and 15
  • You have transformed an Arabidopsis thaliana
    mutant with a genomic sequence
    (Annotierungssequenz.doc) and the putative gene
    is sufficient to restore the function of the
    mutant gene.
  • a. Find the coding sequence
  • b. Find the polyA signal
  • c. Where is the TATA box motif located?
  • d. Locate the gene on the A. thaliana map
  • e. Are cDNA clones available for this gene?
  • f. Where is the gene expressed?
  • g. Predict the protein sequence
  • h. Does this protein share homology with other
    proteins?
  • i. Are there any related proteins in other
    plants/animals?
  • j. Do these homologies indicate a possible
    function?
  • k. Does the protein have any interesting domains?
  • l. Is there a transmembrane domain?
  • m. Predict the subcellular localization
  • Software: Arabidopsis database (TAIR),
    GENSCAN, Genfinder, MCB search, ExPASy, PLACE
  • Download: Annotierungssequenz.doc

149
Assignment 15
  • Based on sequence polymorphism data, your friend
    concluded that a given sequence has been the
    target of selection. He asked you for advice
    about the identified sequence. Characterize the
    sequence as fully as possible, without relying
    on a single source of information.
  • Download: Unknown.doc

150
Microarray Data
  • A snapshot of the amount of a particular gene
    being transcribed in a tissue
  • Measured for tens of thousands of genes
  • Use of multiple tissues on a single array allows
    for direct comparisons between tissues

151
Objectives of Microarray Studies
  • Gene discovery: which genes are affected when
    exposed to a treatment?
  • Hit it with a stick and see what happens
  • Disease diagnosis: given a profile of levels of
    expression for many genes, can the unknown
    treatment be predicted?
  • Tumor or disease classification
  • Time course experiments allow the study of
    co-regulation of genes, and for the
    reconstruction of regulatory networks
  • Pharmacogenomics
  • The goal of pharmacogenomics is to find
    correlations between therapeutic responses to
    drugs and the genetic profiles of patients.

152
Many computational and statistical problems
  • Image analysis (spot identification, background,
    etc.)
  • Data management and pipelining
  • Normalization of data
  • Clustering co-regulated genes
  • Classifying tissue types
  • Regulatory network inference
  • Promoter identification (when combined with
    genomic sequence data)

153
Microarray Technology
  • Spotted arrays
  • Attach entire sequence of genes to the array
  • Create cDNA from a tissue (expressed genes)
  • Wash the pool of cDNAs over the array
  • Complementary sequences bind
  • Oligonucleotide arrays (Affymetrix chips)
  • Attach short (25 bp) oligos instead of entire genes

154
GTTCGA.... The gene
CAAGCT.... cDNA
Via reverse transcription
GUUCGA.... mRNA
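
The relationship between the three strands on this slide can be reproduced with two small functions. This is a minimal sketch using the six bases shown on the slide; the function names are invented, and the cDNA is written base by base under the mRNA (so it reads 3' to 5' if the mRNA is written 5' to 3'), which is an assumption about how the slide was drawn.

```python
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}

def transcribe(coding_strand):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def first_strand_cdna(mrna):
    """Base-by-base DNA complement of the mRNA, in the same left-to-right order."""
    return "".join(DNA_COMPLEMENT[base] for base in mrna.upper())

mrna = transcribe("GTTCGA")
cdna = first_strand_cdna(mrna)
print(mrna, cdna)
```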
155
Spotted arrays are usually treated with samples
from two different tissues, each labeled with a
different color of dye (Red and Green)
Highly expressed in tissue A
Highly expressed in tissue B
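
Two-color spots are usually summarized as the log ratio of the two channel intensities. The sketch below is a minimal illustration with a simple median centering as normalization; which dye belongs to which tissue, the function names and all intensity values are assumptions made up for the example.

```python
import math
from statistics import median

def log_ratios(red, green):
    """log2(red / green) per spot; positive means more signal in the red channel."""
    return [math.log2(r / g) for r, g in zip(red, green)]

def median_center(values):
    """Subtract the median so that a typical spot has log ratio 0."""
    m = median(values)
    return [v - m for v in values]

# Invented background-corrected intensities for four spots.
red = [800.0, 200.0, 400.0, 1600.0]
green = [200.0, 200.0, 800.0, 400.0]
raw = log_ratios(red, green)
centered = median_center(raw)
print(raw)
print(centered)
```

Working on the log scale makes a twofold increase and a twofold decrease symmetric (+1 and -1).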
157
The Data
158
Goal: cluster genes that share a profile
[Figure; axis label: Experiment]
159
The approach is formally similar to
distance-based phylogenetic inference
  • Compute a matrix of pairwise profile similarity
    scores between genes
  • Use these scores in something like UPGMA
  • Eisen et al. 1998. Cluster analysis and display
    of genome-wide expression patterns. PNAS
    95:14863-14868
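
The two steps above (a pairwise score matrix, then UPGMA-like merging) can be sketched in plain Python. This is a minimal illustration, assuming 1 minus the Pearson correlation as the distance between two expression profiles; that choice, the function names and the three toy profiles are assumptions, not a reproduction of the published method.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_linkage(profiles):
    """Bottom-up clustering; returns merges as (members_a, members_b, distance)."""
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # UPGMA-style: mean of all pairwise distances between members.
        return sum(dist[a, b] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(cluster_distance(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        d, i, j = min(pairs)  # closest pair of clusters
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Invented expression profiles over four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
merges = average_linkage(profiles)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

geneA and geneB rise together and are merged first; geneC, which falls across the experiments, joins last at a distance close to 2.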

161
Clustering Techniques
  • Bottom-up techniques
  • Each gene starts in its own cluster, and genes
    are sequentially clustered in a hierarchical
    manner
  • Top-down techniques
  • Begin with an initial number of clusters and
    initial positions for the cluster centers (e.g.,
    averages). Genes are added to the clusters
    according to an optimality criterion.
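
The second family, starting from a fixed number of cluster centers, is typified by k-means. The sketch below is a minimal version, assuming squared Euclidean distance as the optimality criterion and a fixed number of iterations; the function name, the starting centers and the five toy profiles are invented for illustration.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to the nearest center, then move each
    center to the mean of its points; repeat a fixed number of times."""
    centers = [list(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((p - c) ** 2
                                  for p, c in zip(point, centers[k])))
            for point in points
        ]
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Invented two-condition expression profiles for five genes.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]]
assignment, centers = kmeans(points, centers=[[0.0, 0.0], [6.0, 6.0]])
print(assignment)
print(centers)
```

Unlike the hierarchical approach, the number of clusters has to be chosen in advance, and the result can depend on the starting centers.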

162
Clustering Techniques
  • Principal component techniques
  • Identify groups of genes that are highly
    correlated with some underlying factor
    (principal component).
  • Self-organizing maps
  • Similar to top-down clustering, with restrictions
    placed on the dimensionality of the final result.