BS961

1
BS961

Session 4
Bioinformatics

Session 4: Introduction to bioinformatics
1. Applications of sequence database searches to
generate useful biological information.
2. Sequence motifs. RNA structures. Phylogenetic
trees.
3. Molecular epidemiology and monitoring of
therapy.
4. The transcriptome and applications of
microarray technology.
5. Preparation for case studies on applications
of sequence-based information to human health.
6. Worksheet distributed.

3
Objectives

Discuss the applications of sequence database
searches to generate useful biological
information.
Explain what sequence motifs are.

4
Sequence database searches (Strachan and Read, pp
468-471)

There is a vast amount of sequence information in
databases, from cloning of specific genes,
generation of random ESTs and genome projects.
This led to the development of one aspect of
bioinformatics: obtaining biological information
on gene structure and function, and on proteins,
from raw sequence data.
This is done by comparing the sequence under study
(a gene or protein of unknown function) with
databases, to find similarities to genes of
known function. These similarities can then give
clues about the function of the unknown gene.

5
Main programs used are

BLASTN: compares a nucleotide sequence against a
nucleotide sequence database
BLASTP: compares an amino acid sequence against
a protein sequence database
These and many other programs are available at
http://www.ebi.ac.uk/Tools/index.html
These programs find the sequences most closely
related to the test sequence, defined either by
the greatest number of matches or the least
number of mismatches.
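
The idea of ranking database entries by number of matches can be sketched in a few lines of Python. This is only an illustration of the principle: it compares sequences position by position without gaps, whereas BLAST itself uses a heuristic local alignment with a scoring matrix. The function names and the toy sequences are hypothetical.

```python
def count_matches(a: str, b: str) -> int:
    """Count identical residues at the same positions (no gaps allowed)."""
    return sum(x == y for x, y in zip(a, b))


def rank_database(query: str, database: dict) -> list:
    """Return (matches, name) pairs, best match first."""
    scored = [(count_matches(query, seq), name) for name, seq in database.items()]
    return sorted(scored, reverse=True)


# Hypothetical toy database, for illustration only.
toy_db = {"seq1": "ACGTTGCA", "seq2": "ACGTAGCA", "seq3": "TTTTTTTT"}
ranking = rank_database("ACGTAGCA", toy_db)
for matches, name in ranking:
    print(name, matches)
```

The entry at the top of the ranking is the one a search would report as the most closely related sequence.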

6
Typical scenario

A full-length or partial amino acid sequence
(usually predicted from the nucleotide sequence
using a translation program such as the translate
tool at http://www.expasy.ch/tools/dna.html) is
compared against all protein sequences in the
SWISSPROT database.
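
What a translation program does can be sketched as follows. This is a minimal illustration using the standard genetic code and a single reading frame, not the code behind the tool named above. The function name and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate reading frame 1 of a nucleotide sequence up to the first stop."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an unrecognised codon
        if aa == "*":                            # stop codon
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAATAA"))
```

The predicted amino acid sequence would then be used as the query in a protein database search.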

7
Typical scenario

This may reveal the presence of closely related
proteins (homologs is the general term) in the
same organism (paralogs, also spelled paralogues)
or the equivalent protein in other organisms
(orthologs, also spelled orthologues). If the
function of these is already known, some idea may
be derived about the function of the unknown
protein.

8
Simplified scheme for origin of paralogs and
orthologs
Gene duplication
Divergence possibly to related but different
functions
Evolution into 2 organisms
Divergence
Duplication
A B
C D D
A and B are paralogs. C and D/D are paralogs.
A and C are orthologs. B is an ortholog of both
copies of D.
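
The relationships in the scheme can be written down as a small lookup. This is a sketch under stated assumptions: A and B are taken to be in one organism and C and the two D copies in the other, the labels D1 and D2 are invented to tell the two D copies apart, and "copy 1"/"copy 2" name the two products of the first gene duplication. The scheme does not label pairs such as A and D, so the code does not either.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    organism: str  # which of the two organisms carries the gene (assumed)
    lineage: str   # which product of the first duplication it descends from


def relationship(g1: Gene, g2: Gene) -> str:
    """Classify a pair of genes following the simplified scheme."""
    if g1.organism == g2.organism:
        return "paralogs"
    if g1.lineage == g2.lineage:
        return "orthologs"
    return "homologs (pair not labelled in the scheme)"


genes = {
    "A": Gene("A", "organism 1", "copy 1"),
    "B": Gene("B", "organism 1", "copy 2"),
    "C": Gene("C", "organism 2", "copy 1"),
    "D1": Gene("D1", "organism 2", "copy 2"),  # hypothetical label
    "D2": Gene("D2", "organism 2", "copy 2"),  # hypothetical label
}

for x, y in [("A", "B"), ("C", "D1"), ("A", "C"), ("B", "D1"), ("B", "D2")]:
    print(x, y, relationship(genes[x], genes[y]))
```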
32
Example 1.

Neurofibromatosis 2
Genetic disorder leading to cranial and
peripheral nerve tumours.
Inherited dominantly.
Affected individuals generally develop symptoms
of eighth-nerve dysfunction in early adulthood,
including deafness and balance disorder.

33
Example 1.

Neurofibromatosis 2
The location of the tumours means that disease
symptoms are serious and can be fatal.
The disease was tracked down to a defect in one
gene, the neurofibromatosis type 2 gene (NF2).
What is the function of the protein encoded?

34
Example 1.

Neurofibromatosis 2
Database searches using the protein sequence
identified related proteins: moesin, ezrin and
radixin.
These proteins act as structural links between
cell membrane proteins and the actin
cytoskeleton.
This gave some initial clues to the function of
the NF2 gene product.

35
Example 1.

In the next slide we see the output from such a
search: the match identified between the NF2
protein (Query) and one of the Subject sequences,
i.e. one of the sequences in the database, which
turns out to be moesin.

36
gi|16878176|gb|AAH17293.1|AAH17293 (BC017293)
moesin Homo sapiens
Length = 577
Score = 380 bits (975), Expect = e-104
Identities = 258/588 (43%), Positives = 354/588
(59%), Gaps = 23/588 (3%)

Query 78  DKKVLDHDVSKEEPVTFHFLAKFYPENAEEELVQEITQHLFFLQVKKQILDEKIYCPPEA 137
           KKV   DV KE P  F F AKFYPE   EEL Q ITQ LFFLQVK  IL   IYCPPE
Sbjct 62  NKKVTAQDVRKESPLLFKFRAKFYPEDVSEELIQDITQRLFFLQVKEGILNDDIYCPPET 121

Query 138 SVLLASYAVQAKYGDYDPSVHKRGFLAQEELLPKRVINLYQMTPEMWEERITAWYAEHRG 197
           VLLASYAVQ KYGD    VHK G LA   LLP RV          WEERI  W  EHRG
Sbjct 122 AVLLASYAVQSKYGDFNKEVHKSGYLAGDKLLPQRVLEQHKLNKDQWEERIQVWHEEHRG 181

Query 198 RARDEAEMEYLKIAQDLEMYGVNYFAIRNKKGTELLLGVDALGLHIYDPENRLTPKISFP 257
            R  A  EYLKIAQDLEMYGVNYF I NKKG EL LGVDALGL IY    RLTPKI FP
Sbjct 182 MLREDAVLEYLKIAQDLEMYGVNYFSIKNKKGSELWLGVDALGLNIYEQNDRLTPKIGFP 241
37
Header

gi|16878176|gb|AAH17293.1|AAH17293 (BC017293)
moesin Homo sapiens
Length = 577
Score = 380 bits (975), Expect = e-104
Identities = 258/588 (43%), Positives = 354/588
(59%), Gaps = 23/588 (3%)
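
The Score and Expect values in the header are linked: for a bit score S, the expected number of chance hits is E = m x n x 2^(-S), where m and n are the effective lengths of the query and of the database. The sketch below only illustrates the order of magnitude. Both lengths are assumptions chosen for the example and are not given on the slide.

```python
def expect_value(bit_score: float, query_len: float, db_len: float) -> float:
    """E = m * n * 2**(-S) for a bit score S and effective lengths m and n."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Assumed values for illustration: a query of about 600 residues searched
# against a protein database of about 30 million residues.
e = expect_value(380, 595, 3e7)
print(f"{e:.1e}")
```

Each extra bit of score halves the Expect value, which is why a score of 380 bits corresponds to a match that is essentially impossible by chance.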

38
Alignment

Query 78  DKKVLDHDVSKEEPVTFHFL
           KKV   DV KE P  F F
Sbjct 62  NKKVTAQDVRKESPLLFKFR
Letter on the middle line: identical in both sequences
Counted under Positives: similar type of amino acid in both sequences
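
The middle line and the identity count can be recomputed from the two aligned rows. This is a minimal sketch that marks identical residues only. Deciding which residues are a "similar type" needs a substitution matrix, which is left out here. The function name is hypothetical.

```python
def match_line(query: str, subject: str) -> str:
    """Show a residue where both aligned rows agree, a space elsewhere."""
    return "".join(
        q if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


query = "DKKVLDHDVSKEEPVTFHFL"  # Query residues 78-97 from the slide
sbjct = "NKKVTAQDVRKESPLLFKFR"  # Sbjct residues 62-81 from the slide

middle = match_line(query, sbjct)
identities = sum(c != " " for c in middle)
print(middle)
print(f"Identities = {identities}/{len(query)} ({100 * identities // len(query)}%)")
```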

43
Example 2.

Angiotensin-converting enzyme (ACE)
A dipeptidyl carboxypeptidase which cleaves two
amino acids from the C-terminus of angiotensin I,
giving angiotensin II.
Interesting because there are two allelic forms
of the gene, D and I. D gives higher levels of
ACE, and sprinters and short-distance swimmers
tend to have this allele, possibly because
angiotensin regulates muscle size.

44
Example 2.

Angiotensin-converting enzyme (ACE)
Here we see a search using an unknown protein
from Anopheles gambiae which shows a match with
human ACE.
The known active-site E (glutamic acid) and the 2
histidines (H) used to bind a zinc atom needed
for the activity are conserved.
This suggests that this protein from Anopheles is
also probably a carboxypeptidase.

45
Example 2.

Query  TTVNLEDLVVAHHEMGHIQYFMQYKD
       T V   D    HHE  HIQYF  Y
Sbjct  TQVTHKDFITVHHELAHIQYFLNYRN
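
Whether the zinc-binding residues are conserved can be checked directly on the two aligned rows. The sketch assumes that the E and the two H residues are arranged as H-E-x-x-H, the pattern commonly described for zinc metallopeptidases. The slide itself does not name a pattern. Because this stretch of the alignment has no gaps, the same column index can be used in both rows.

```python
import re

query = "TTVNLEDLVVAHHEMGHIQYFMQYKD"  # unknown Anopheles protein
sbjct = "TQVTHKDFITVHHELAHIQYFLNYRN"  # human ACE

# Assumed pattern: H, E, any two residues, H.
hit = re.search(r"HE..H", query)
columns = [hit.start(), hit.start() + 1, hit.start() + 4]

conserved = all(query[c] == sbjct[c] for c in columns)
print(hit.group(), sbjct[hit.start():hit.end()], conserved)
```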

46
Gene (protein) families (Strachan and Read, p 153)

A number of proteins, or protein domains,
clearly share the same evolutionary origin, as
they have similar structures and functions, which
is reflected in conserved amino acid sequences.

47
Classical gene families

Have members which exhibit a high degree of
sequence identity over most of the gene length
(or the protein-encoding region at least).
This identifies the genes as being closely
related in evolutionary terms and therefore in
functional terms.

48
Gene families encoding products with large,
highly conserved domains

Have members which show extensive identity within
strongly conserved regions of the genes.
The identity in the rest of the gene may be quite
low, as seen in transcription factors.

49
Gene families encoding products with very short
conserved amino acid motifs/patterns/profiles

Have members not obviously related in sequence
terms, but with a common general function (e.g.
dehydrogenases, proteases).
They share short amino acid motifs, patterns or
profiles which can give a clue about general
function.

50
Examples of motifs/patterns

RGD - a motif recognised by integrins, proteins
involved in cell-cell interactions; the motif is
also found in some viruses
DEAD box - found in several proteins which appear
to act as RNA helicases
GXSG - found in trypsin-like proteases
GDD - found in polymerases
LIM domain - cysteine-rich 56 amino acid domain
likely to be involved in protein-protein
interactions
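
Short motifs like these can be located with a simple pattern scan. This is a minimal sketch: X is treated as "any amino acid", positions are reported counting from 1, and the test protein is a made-up sequence. A hit on such a short pattern is only a clue, since three or four residues will often match by chance.

```python
import re

# Motifs from the list above; X stands for any amino acid.
MOTIFS = {"RGD": "RGD", "DEAD": "DEAD", "GXSG": "GXSG", "GDD": "GDD"}


def find_motifs(protein: str) -> dict:
    """Return {motif name: [1-based start positions]} for motifs present."""
    hits = {}
    for name, motif in MOTIFS.items():
        pattern = motif.replace("X", ".")
        # The lookahead lets overlapping occurrences be reported too.
        starts = [m.start() + 1 for m in re.finditer(f"(?={pattern})", protein)]
        if starts:
            hits[name] = starts
    return hits


toy_protein = "MKRGDLLGASGTTGDDA"  # hypothetical sequence
hits = find_motifs(toy_protein)
print(hits)
```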

51
Gene superfamilies

Have members which are functionally related in a
general sense and show only weak sequence
identity over a large segment, without
significant amino acid motifs. Share common
structural features e.g. immunoglobulin
superfamily.

52
Alignments

Following the identification of related proteins,
they can be aligned using programs such as
CLUSTALW. In this example, 4 human proteins have
been found and aligned. They are clearly
paralogs as they are very similar and come from
the same organism. Absolutely conserved amino
acids (*) would be expected to be the ones
critical for function. Other positions require
similar (: or .) amino acids.

53
Example of a CLUSTAL analysis of human paralogs
in a classical gene family

hrev5    MALARPRPRLGDLIEISRFGYAHWAIYVGDGYVVHLAPASEIAGAGAASVLSALTNKAIV 60
Hrev107  MRAPIPEPKPGDLIEIFRPFYRHWAIYVGDGYVVHLAPPSEVAGAGAASVMSALTDKAIV 60
TIG3     MASPHQEPKPGDLIEIFRLGYEHWALYIGDGYVIHLAPPSEYPGAGSSSVFSVLSNSAEV 60
hrev4    MAEGKPRPRPGDLIEIFRIGYEHWAIYVEDDCVVHLAPPSEESECG--SITSIFSNRAVV 58

hrev5    KKELLSVVAGGDNYRVNNKHDDRYTPLPSNKIVKRAEELVGQELPYSLTSDNCEHFVNHL 120
Hrev107  KKELLYDVAGSDKYQVNNKHDDKYSPLPCTKIIQRAEELVGQEVLYKLTSENCEHFVNEL 120
TIG3     KRGRLEDVVGGCCYRVNNSLDHEYQPRPVEVIISSAKEMVGQKMKYSIVSRNCEHFVAQL 120
hrev4    KYSRLEDVLHAASWKVNNKLDGTYLPLPVDKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL 118

Hr5      NKIVKRAEELVGQELPYSLTSDNCEHFVNHL
Hr107    TKIIQRAEELVGQEVLYKLTSENCEHFVNEL
TIG3     EVIISSAKEMVGQKMKYSIVSRNCEHFVAQL
Hrev4    DKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL
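
The absolutely conserved positions can be recomputed from the aligned rows. The sketch below marks only fully conserved columns with `*`, using the short excerpt of the four paralogs. The similar-residue markers that CLUSTAL also prints depend on residue groupings and are not reproduced. The function name is hypothetical.

```python
def conservation_line(aligned: list) -> str:
    """Mark columns where every aligned row has the same residue with '*'."""
    return "".join(
        "*" if len(set(column)) == 1 and column[0] != "-" else " "
        for column in zip(*aligned)
    )


excerpt = {
    "Hr5":   "NKIVKRAEELVGQELPYSLTSDNCEHFVNHL",
    "Hr107": "TKIIQRAEELVGQEVLYKLTSENCEHFVNEL",
    "TIG3":  "EVIISSAKEMVGQKMKYSIVSRNCEHFVAQL",
    "Hrev4": "DKIIQRTKKMVNKIVQYSLIEGNCEHFVNGL",
}

marks = conservation_line(list(excerpt.values()))
for name, seq in excerpt.items():
    print(f"{name:<6}{seq}")
print(f"{'':<6}{marks}")
```

In this excerpt the longest fully conserved run is NCEHFV, the kind of block that would be a first candidate for residues critical to function.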

55
RNA structures

RNA structures are vital to a number of processes
in the cell, in viruses, etc.
They can be predicted by folding programs which
look for the minimum free energy.
Suppression of codon variation can also indicate
a structure.
Covariance can be useful.
56
5' terminal region of type 2 and type 1 5' UTRs
Important in RNA replication
Type 2
Type 1
58
RNA folding
59
Enterovirus
HEV-A
HEV-B
Inhibitor of RNase L ?
HEV-C
60

[Figure with regions labelled I, II and III]

61
Phylogenetic trees
A way of showing the relationship between
proteins or nucleic acids.
Different methods for generating these, e.g.
neighbor joining:
Generate a distance matrix expressing the
relationship between each element.
Find the closest neighbours and join them
together.
Add on the next closest neighbours, keeping
branch lengths to a minimum.
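
The first joining step of neighbor joining can be sketched as below. The method does not simply pick the smallest distance: it picks the pair that minimises Q(i,j) = (n - 2) d(i,j) - sum of distances from i - sum of distances from j, which is what keeps the total branch length small. The five-taxon distance matrix is a made-up example, and the sketch stops after choosing the first pair instead of building the whole tree.

```python
def nj_pair_to_join(names: list, d: dict) -> tuple:
    """Return (Q, name_i, name_j) for the pair neighbor joining joins first."""
    n = len(names)
    row_sum = {a: sum(d[a][b] for b in names if b != a) for a in names}
    best = None
    for x in range(n):
        for y in range(x + 1, n):
            a, b = names[x], names[y]
            q = (n - 2) * d[a][b] - row_sum[a] - row_sum[b]
            if best is None or q < best[0]:
                best = (q, a, b)
    return best


# Hypothetical pairwise distances between five sequences.
pairs = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
names = ["a", "b", "c", "d", "e"]
dist = {x: {} for x in names}
for (x, y), value in pairs.items():
    dist[x][y] = value
    dist[y][x] = value

first_join = nj_pair_to_join(names, dist)
print(first_join)
```

In this example d and e have the smallest distance (3), yet a and b are joined first, which shows why the Q criterion is used and not the raw distance.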
62
Molecular epidemiology
Species origin of viruses
Relationship between viruses and geography/time
etc.
63
Origin

Derived from a virus which infects other
primates?
Classification: Lentivirus genus of Retroviridae
(lenti = slow in Latin)

65
Monitoring treatment
The sequences of virus mutants linked to drug
resistance can be used to:
design treatment strategies
predict the outcome of treatment
monitor treatment
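
In practice this means comparing the viral sequence from a patient with a reference sequence and looking up the differences in a list of known resistance mutations. The sketch below shows the idea only. The sequences, the mutation L4I and "drug X" are all invented for illustration and do not refer to any real virus or drug.

```python
def substitutions(reference: str, sample: str) -> list:
    """List amino acid changes as e.g. 'L4I' (reference, position, sample)."""
    return [
        f"{ref}{pos}{alt}"
        for pos, (ref, alt) in enumerate(zip(reference, sample), start=1)
        if ref != alt
    ]


# Hypothetical data, for illustration only.
reference = "MKVLAYTGE"
patient = "MKVIAYTGD"
RESISTANCE = {"L4I": "drug X"}  # made-up mutation-to-drug table

found = substitutions(reference, patient)
flagged = {m: RESISTANCE[m] for m in found if m in RESISTANCE}
print(found)
print(flagged)
```

Repeating the comparison on samples taken during therapy would show whether resistance mutations are appearing.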
66
Conclusions

Biological information can be obtained from
analysing sequences, using programs such as BLAST
and CLUSTAL.
There are copies of related genes in different
organisms (orthologs) and in the same organism
(paralogs).
