Title: BS961
1BS961
2Session 4: Introduction to bioinformatics
- 1. Applications of sequence database searches to generate useful biological information.
- 2. Sequence motifs. RNA structures. Phylogenetic trees.
- 3. Molecular epidemiology and monitoring of therapy.
- 4. The transcriptome and applications of microarray technology.
- 5. Preparation for case studies on applications of sequence-based information on human health.
- 6. Worksheet distributed.
3Objectives
- Discuss the applications of sequence database searches to generate useful biological information.
- Explain what sequence motifs are.
4Sequence database searches (Strachan and Read pp 468-471)
- Vast amount of sequence information in databases, from cloning specific genes, generation of random ESTs and genome projects.
- Led to development of one aspect of bioinformatics: obtaining biological information on gene structure and function, and on proteins, from raw sequence data.
- Done by comparing the sequence under study (a gene or protein with unknown function) with databases, to find similarities with genes of known function. This can then give clues about the function of unknown genes.
5Main programs used are
- BLASTN compares a nucleotide sequence against a nucleotide sequence database
- BLASTP compares an amino acid sequence against a protein sequence database
- These and many other programs are available at http://www.ebi.ac.uk/Tools/index.html
- These programs find the sequences most closely related to the test sequence, defined either by the greatest number of matches or the least number of mismatches.
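The idea of ranking database sequences by number of matches can be illustrated with a toy search. This is only a sketch: it slides the query along each database sequence without gaps and counts identical positions, which is far simpler than what BLAST actually does. The function names and the two short database entries are hypothetical, made up for illustration.

```python
def best_ungapped_matches(query, subject):
    """Slide the query along the subject (no gaps allowed) and
    return the highest number of identical positions found."""
    best = 0
    for offset in range(-(len(query) - 1), len(subject)):
        matches = sum(
            1
            for i, q in enumerate(query)
            if 0 <= offset + i < len(subject) and subject[offset + i] == q
        )
        best = max(best, matches)
    return best


def rank_database(query, database):
    """Return (name, match count) pairs, best match first."""
    scores = {name: best_ungapped_matches(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database, not real entries.
toy_database = {"hit": "TTACGTTT", "miss": "GGGGGGGG"}
ranking = rank_database("ACGT", toy_database)
print(ranking)
```

Real search programs also allow gaps, weight substitutions by similarity and report a statistical significance, none of which is attempted here.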
6Typical scenario
- Full-length or partial amino acid sequence (usually predicted from the nucleotide sequence using a translation program such as the translate tool at http://www.expasy.ch/tools/dna.html) is compared against all protein sequences in the SWISSPROT database.
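The translation step can be sketched in a few lines. This is a minimal illustration using the standard genetic code, not the method used by the ExPASy tool. It reads a single frame from the first base, and the example DNA string is hypothetical (chosen only so that it encodes the residues DKKVL).

```python
# Standard genetic code, written compactly with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate one reading frame, starting at the first base.
    Stop codons are shown as '*', unrecognised codons as 'X'."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


# Hypothetical nucleotide sequence for illustration only.
print(translate("GATAAAAAAGTTCTG"))
```

A translation tool would normally report all six reading frames; only one is shown here.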
7Typical scenario
- This may reveal the presence of closely related proteins (homologs is the general term) in the same organism (paralogs, or paralogues) or the equivalent protein in other organisms (orthologs, or orthologues). If the function of these is already known, some idea may be derived about the function of the unknown protein.
8Simplified scheme for origin of paralogs and orthologs
[Figure: scheme built up step by step; gene labels A and B in one organism, C, D and D in the other]
- Gene duplication
- Divergence, possibly to related but different functions
- Evolution into 2 organisms
- Divergence
- Duplication
- A and B are paralogs. C and D/D are paralogs.
- A and C are orthologs. B is an ortholog of both D copies.
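The relationships in the scheme can be written down as a small lookup. This is a sketch under stated assumptions: the organism labels ("org1", "org2"), the lineage labels ("alpha", "beta") and the names D1 and D2 for the two D copies are all hypothetical, introduced here only to make the rule explicit.

```python
# Each gene is annotated with (organism, lineage).
# The lineage records which copy of the original gene duplication it descends from.
# All labels below are illustrative assumptions, not taken from the slide.
GENES = {
    "A": ("org1", "alpha"),
    "B": ("org1", "beta"),
    "C": ("org2", "alpha"),
    "D1": ("org2", "beta"),
    "D2": ("org2", "beta"),
}


def relationship(gene1, gene2):
    """Classify a pair of genes following the simplified scheme."""
    organism1, lineage1 = GENES[gene1]
    organism2, lineage2 = GENES[gene2]
    if organism1 == organism2:
        return "paralogs"      # related genes within the same organism
    if lineage1 == lineage2:
        return "orthologs"     # equivalent gene in the other organism
    return "homologs"          # related, but separated by the original duplication


for pair in [("A", "B"), ("C", "D1"), ("A", "C"), ("B", "D1"), ("B", "D2")]:
    print(pair, relationship(*pair))
```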
32Example 1.
- Neurofibromatosis 2
- Genetic disorder leading to cranial and peripheral nerve tumours.
- Inherited dominantly.
- Affected individuals generally develop symptoms of eighth-nerve dysfunction in early adulthood, including deafness and balance disorder.
33Example 1.
- Neurofibromatosis 2
- Location of the tumours means that disease symptoms are serious and can be fatal.
- Disease tracked down to a defect in one gene, the neurofibromatosis type 2 gene (NF2).
- What is the function of the protein encoded?
34Example 1.
- Neurofibromatosis 2
- Database searches using the protein sequence identified related proteins: moesin, ezrin and radixin.
- These proteins act as structural links between cell membrane proteins and the actin cytoskeleton.
- This gave some initial clues to the function of the NF2 gene product.
35Example 1.
- On the next slide we see the output from such a search: the match identified between the NF2 protein (Query) and one of the Subject sequences, i.e. one of the sequences in the database, which turns out to be moesin.
36 gi|16878176|gb|AAH17293.1|AAH17293 (BC017293) moesin [Homo sapiens]
Length = 577
Score = 380 bits (975), Expect = e-104
Identities = 258/588 (43%), Positives = 354/588 (59%), Gaps = 23/588 (3%)

Query  78  DKKVLDHDVSKEEPVTFHFLAKFYPENAEEELVQEITQHLFFLQVKKQILDEKIYCPPEA  137
           +KKV   DV KE P+ F F AKFYPE+  EEL+Q+ITQ LFFLQVK+ IL++ IYCPPE
Sbjct  62  NKKVTAQDVRKESPLLFKFRAKFYPEDVSEELIQDITQRLFFLQVKEGILNDDIYCPPET  121

Query 138  SVLLASYAVQAKYGDYDPSVHKRGFLAQEELLPKRVINLYQMTPEMWEERITAWYAEHRG  197
           +VLLASYAVQ+KYGD++  VHK G+LA ++LLP+RV+  +++  + WEERI  W+ EHRG
Sbjct 122  AVLLASYAVQSKYGDFNKEVHKSGYLAGDKLLPQRVLEQHKLNKDQWEERIQVWHEEHRG  181

Query 198  RARDEAEMEYLKIAQDLEMYGVNYFAIRNKKGTELLLGVDALGLHIYDPENRLTPKISFP  257
             R++A +EYLKIAQDLEMYGVNYF+I+NKKG+EL LGVDALGL+IY+  +RLTPKI FP
Sbjct 182  MLREDAVLEYLKIAQDLEMYGVNYFSIKNKKGSELWLGVDALGLNIYEQNDRLTPKIGFP  241
37Header
- gi|16878176|gb|AAH17293.1|AAH17293 (BC017293) moesin [Homo sapiens]
- Length = 577
- Score = 380 bits (975), Expect = e-104
- Identities = 258/588 (43%), Positives = 354/588 (59%), Gaps = 23/588 (3%)
38Alignment
Query  78  DKKVLDHDVSKEEPVTFHFL
           +KKV   DV KE P+ F F
Sbjct  62  NKKVTAQDVRKESPLLFKFR
- Letter in the middle line: identical in both sequences
- + in the middle line: similar type of amino acid in both sequences
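The "identical" part of the middle line can be reproduced directly from the two aligned sequences. The sketch below uses the 20 aligned residues shown on the slide. It marks identical positions only, because marking similar amino acids would need a substitution matrix, which is not included here. The function names are hypothetical.

```python
def identity_midline(query, subject):
    """Return the middle line of an alignment, showing a letter
    where both sequences have the same residue and a space otherwise."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return "".join(q if q == s else " " for q, s in zip(query, subject))


def percent_identity(query, subject):
    """Percentage of aligned positions that are identical."""
    midline = identity_midline(query, subject)
    identical = sum(1 for char in midline if char != " ")
    return 100.0 * identical / len(query)


query = "DKKVLDHDVSKEEPVTFHFL"    # NF2 protein, residues 78-97 of the query
subject = "NKKVTAQDVRKESPLLFKFR"  # moesin, residues 62-81 of the subject

print(query)
print(identity_midline(query, subject))
print(subject)
print(percent_identity(query, subject))
```

Over this short stretch half of the positions are identical, which is a little higher than the overall figure reported in the header for the whole alignment.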
42Header
- gi|16878176|gb|AAH17293.1|AAH17293 (BC017293) moesin [Homo sapiens]
- Length = 577
- Score = 380 bits (975), Expect = e-104
- Identities = 258/588 (43%), Positives = 354/588 (59%), Gaps = 23/588 (3%)
43Example 2.
- Angiotensin-converting enzyme (ACE)
- A dipeptidyl carboxypeptidase which cleaves two amino acids from the C-terminus of angiotensin I, giving angiotensin II.
- Interesting because there are two allelic forms of the gene, D and I. D gives higher levels of ACE, and sprinters and short-distance swimmers tend to have this allele, possibly because angiotensin regulates muscle size.
44Example 2.
- Angiotensin-converting enzyme (ACE)
- Here we see a search using an unknown protein from Anopheles gambiae which shows a match with human ACE.
- The known active-site E (glutamic acid) and the 2 histidines (H) used to bind a zinc atom needed for the activity are conserved.
- It shows that this protein from Anopheles is probably also a carboxypeptidase.
45Example 2.
Query  TTVNLEDLVVAHHEMGHIQYFMQYKD
       T V  +D +  HHE+ HIQYF+ Y++
Sbjct  TQVTHKDFITVHHELAHIQYFLNYRN
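The conserved zinc-binding residues can be located with a simple pattern search. This is a sketch which assumes the residues of interest form the pattern H-E-x-x-H (two histidines and a glutamic acid, with any two residues in between). That pattern name and the function name are not given on the slides and are introduced here as assumptions.

```python
import re

# Assumed pattern: His, Glu, any two residues, His.
ZINC_MOTIF = re.compile(r"HE..H")


def find_zinc_motif(sequence):
    """Return (0-based start position, matched residues) or None."""
    match = ZINC_MOTIF.search(sequence)
    return (match.start(), match.group()) if match else None


query = "TTVNLEDLVVAHHEMGHIQYFMQYKD"    # Anopheles gambiae protein fragment
subject = "TQVTHKDFITVHHELAHIQYFLNYRN"  # human ACE fragment

print(find_zinc_motif(query))
print(find_zinc_motif(subject))
```

Both fragments contain the pattern at the same position, in line with the conservation described on the previous slide.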
46Gene (protein) families (Strachan and Read, p 153)
- A number of proteins, or protein domains, clearly share the same evolutionary origin, as they have similar structures and functions, which is reflected in conserved amino acid sequences.
47Classical gene families
- Have members which exhibit a high degree of sequence identity over most of the gene length (or the protein-encoding region at least).
- This identifies the genes as being closely related in evolutionary terms and therefore in functional terms.
48Gene families encoding products with large, highly conserved domains
- Have members which show extensive identity within strongly conserved regions of the genes.
- The identity in the rest of the gene may be quite low; seen in transcription factors.
49Gene families encoding products with very short conserved amino acid motifs/patterns/profiles
- Have members not obviously related in sequence terms, but with a common general function (e.g. dehydrogenases, proteases).
- They share short amino acid motifs, patterns or profiles which can give a clue about general function.
50Examples of motifs/patterns
- RGD: a motif recognised by integrins, proteins involved in cell-cell interactions, and found in some viruses
- DEAD box: found in several proteins which appear to act as RNA helicases
- GXSG: found in trypsin-like proteases
- GDD: found in polymerases
- LIM domain: cysteine-rich 56 amino acid domain likely to be involved in protein-protein interactions
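Short motifs like these can be searched for with regular expressions, where X (any amino acid) becomes a dot. The sketch below covers the four short motifs in the list. The test sequence is hypothetical, invented only to show the output format, and the LIM domain is left out because it is a longer pattern that a simple string match cannot describe.

```python
import re

# Motif name -> regular expression ('.' stands for X, any amino acid).
MOTIFS = {
    "RGD": "RGD",
    "DEAD box": "DEAD",
    "GXSG": "G.SG",
    "GDD": "GDD",
}


def scan_motifs(sequence):
    """Return {motif name: [1-based start positions]} for motifs that occur."""
    hits = {}
    for name, pattern in MOTIFS.items():
        positions = [match.start() + 1 for match in re.finditer(pattern, sequence)]
        if positions:
            hits[name] = positions
    return hits


# Hypothetical sequence, not a real protein.
example = "MKRGDLLGDSGGPAVDEADTT"
print(scan_motifs(example))
```

Motifs this short will also turn up by chance in unrelated proteins, so a hit is only a clue about general function, as the previous slide says.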
51Gene superfamilies
- Have members which are functionally related in a
general sense and show only weak sequence
identity over a large segment, without
significant amino acid motifs. Share common
structural features e.g. immunoglobulin
superfamily.
52Alignments
- Following the identification of related proteins, they can be aligned using programs such as CLUSTALW. In this example, 4 human proteins have been found and aligned. They are clearly paralogs, as they are very similar and all come from the same organism. Absolutely conserved amino acids (*) would be expected to be the ones critical for function. Other positions require similar (: or .) amino acids.
53Example of a CLUSTAL analysis of human paralogs in a classical gene family
hrev5     MALARPRPRLGDLIEISRFGYAHWAIYVGDGYVVHLAPASEIAGAGAASVLSALTNKAIV 60
Hrev107   MRAPIPEPKPGDLIEIFRPFYRHWAIYVGDGYVVHLAPPSEVAGAGAASVMSALTDKAIV 60
TIG3      MASPHQEPKPGDLIEIFRLGYEHWALYIGDGYVIHLAPPSEYPGAGSSSVFSVLSNSAEV 60
hrev4     MAEGKPRPRPGDLIEIFRIGYEHWAIYVEDDCVVHLAPPSEESECG--SITSIFSNRAVV 58
          *     .*: ****** *  * ***:*: *. *:****.** . .*  *: * ::: * *

hrev5     KKELLSVVAGGDNYRVNNKHDDRYTPLPSNKIVKRAEELVGQELPYSLTSDNCEHFVNHL 120
Hrev107   KKELLYDVAGSDKYQVNNKHDDKYSPLPCTKIIQRAEELVGQEVLYKLTSENCEHFVNEL 120
TIG3      KRGRLEDVVGGCCYRVNNSLDHEYQPRPVEVIISSAKEMVGQKMKYSIVSRNCEHFVAQL 120
hrev4     KYSRLEDVLHAASWKVNNKLDGTYLPLPVDKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL 118
          *   *  *  .  ::***. *  * * *   *:. ::::*.: : *.: . ******  *
54
Hr5     NKIVKRAEELVGQELPYSLTSDNCEHFVNHL
Hr107   TKIIQRAEELVGQEVLYKLTSENCEHFVNEL
TIG3    EVIISSAKEMVGQKMKYSIVSRNCEHFVAQL
Hrev4   DKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL
          *:. ::::*.: : *.: . ******  *
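The absolutely conserved positions can be picked out directly from the aligned sequences. The sketch below uses the four aligned fragments from the slide above and marks a column with * when every sequence has the same residue. It does not attempt the : and . symbols, which depend on groupings of similar amino acids defined by the alignment program. The function name is hypothetical.

```python
def conserved_columns(aligned):
    """Return a line with '*' under columns where all sequences
    carry the same residue (gap characters never count)."""
    sequences = list(aligned.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have the same length")
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*sequences)
    )


paralogs = {
    "Hr5": "NKIVKRAEELVGQELPYSLTSDNCEHFVNHL",
    "Hr107": "TKIIQRAEELVGQEVLYKLTSENCEHFVNEL",
    "TIG3": "EVIISSAKEMVGQKMKYSIVSRNCEHFVAQL",
    "Hrev4": "DKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL",
}

stars = conserved_columns(paralogs)
for name, seq in paralogs.items():
    print(f"{name:<8}{seq}")
print(" " * 8 + stars)
```

The run of six stars picks out the NCEHFV stretch shared by all four proteins.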
55RNA structures
- RNA structures are vital to a number of processes in the cell and in viruses etc.
- Can be predicted through folding programs which look for the minimum free energy
- Suppression of codon variation can also point to a structure
- Covariance (paired changes which preserve base pairing) can be useful
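Covariance can be checked on an alignment of related RNA sequences: two columns support a base pair if the bases differ between sequences but can still pair in every sequence. The sketch below is illustrative only. The three short sequences are hypothetical, and allowing G-U pairs alongside the Watson-Crick pairs is an assumption made here.

```python
# Base pairs treated as compatible (Watson-Crick plus G-U wobble, an assumption).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def covarying(alignment, i, j):
    """True if columns i and j (0-based) can pair in every sequence
    and the pair is not the same in all sequences."""
    column_pairs = [(seq[i], seq[j]) for seq in alignment]
    all_can_pair = all(pair in PAIRS for pair in column_pairs)
    varies = len(set(column_pairs)) > 1
    return all_can_pair and varies


# Hypothetical aligned RNA sequences.
rna_alignment = [
    "GGCAAAGCC",
    "GACAAAGUC",
    "GCCAAAGGC",
]

print(covarying(rna_alignment, 1, 7))  # G-C, A-U, C-G: pairing kept while bases change
print(covarying(rna_alignment, 0, 8))  # always G-C: pairs, but no variation to learn from
```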
56 5' terminal region of type 2 and type 1 5' UTRs
Important in RNA replication
[Figure labels: Type 2, Type 1]
58RNA folding
59Enterovirus
[Figure labels: HEV-A, HEV-B, HEV-C; Inhibitor of RNase L?]
60[Figure labels: II, I, III]
61Phylogenetic trees
- Way of showing the relationship between proteins or nucleic acids
- Different methods for generating these, e.g. neighbor joining:
- Generate a distance matrix expressing the relationship between each pair of elements
- Find the closest neighbours and join them together
- Add on the next closest neighbours, keeping branch lengths to a minimum
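The first two steps, building a distance matrix and finding the closest neighbours, can be sketched as follows. The distance used here is simply the fraction of aligned positions that differ, applied to the four paralog fragments from slide 54. This is only the starting point: the full neighbor-joining method uses a corrected criterion to choose which pair to join and then repeats the process, which is not implemented here. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(seq1, seq2):
    """Fraction of aligned positions at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)


def distance_matrix(sequences):
    """Return {(name1, name2): distance} for every pair of sequences."""
    return {
        (name1, name2): p_distance(sequences[name1], sequences[name2])
        for name1, name2 in combinations(sequences, 2)
    }


def closest_pair(matrix):
    """Return the pair of names with the smallest distance."""
    return min(matrix, key=matrix.get)


fragments = {
    "Hr5": "NKIVKRAEELVGQELPYSLTSDNCEHFVNHL",
    "Hr107": "TKIIQRAEELVGQEVLYKLTSENCEHFVNEL",
    "TIG3": "EVIISSAKEMVGQKMKYSIVSRNCEHFVAQL",
    "Hrev4": "DKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL",
}

matrix = distance_matrix(fragments)
for pair, distance in matrix.items():
    print(pair, round(distance, 2))
print("closest:", closest_pair(matrix))
```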
62Molecular epidemiology
- Species origin of viruses
- Relationship between viruses and geography/time etc
63Origin
- Derived from a virus which infects other primates?
- Classification: Lentivirus genus of Retroviridae (lenti = slow in Latin)
65Monitoring treatment
Sequence of virus mutants linked to drug resistance can be used to
- design treatment strategies
- predict the outcome of treatment
- monitor treatment
66Conclusions
- Biological information can be obtained from analysing sequences, using programs such as BLAST and CLUSTAL
- There are copies of related genes in different organisms (orthologs) and in the same organism (paralogs)