A. Auchincloss - PowerPoint PPT Presentation

1
Practical exercises
  • Answers

3
  • Nucleic acid database in Japan: DDBJ
    http://www.ddbj.nig.ac.jp/
  • Microarray data: ArrayExpress
    http://www.ebi.ac.uk/microarray/
  • Mass spectrometry data: PRIDE
    http://www.ebi.ac.uk/pride/
  • OPD: http://www.ebi.ac.uk/pride/
  • Protein-protein interaction: IntAct
    http://www.ebi.ac.uk/intact/site/
  • DIP: http://dip.doe-mbi.ucla.edu/ and JCB:
    http://www.imb-jena.de/jcb/ppi/
  • Rat enamel 2D gel electrophoresis:
    http://biocadmin.otago.ac.nz/fmi/xsl/toothprint/home.xsl
    (last revision August 2006)
  • CFTR mutation: http://www.genet.sickkids.on.ca/cftr/
    (this web site was last updated March 2007)

4
  • Exercise 2
  • E. coli K12 recombinase A (recA) in different
    protein sequence databases
  • Find, if it exists, the entry corresponding to
    the E. coli (strain K12) recA protein sequence in
    the following protein sequence databases:
  • - EMBL: http://www.ebi.ac.uk/embl/
  • - RefSeq: http://www.ncbi.nlm.nih.gov/RefSeq/
  • - UniProtKB: http://www.expasy.org/sprot/ or
    http://beta.uniprot.org/ (find sequence(s) in
    UniProtKB/Swiss-Prot and sequence(s) in
    UniProtKB/TrEMBL)
  • - PIR-PSD:
    http://pir.georgetown.edu/pirwww/dbinfo/pir_psd.shtml
  • - PDB: http://www.rcsb.org/pdb/home/home.do
  • - UniParc (use SRS or the UniParc query tool)
  • - EnsEMBL: http://www.ensembl.org/index.html
  • Find the UniProtKB/Swiss-Prot entry corresponding
    to the RefSeq entry NP_036231.
  • Hints:
  • You can use the query tool provided by each
    database.
  • You can use SRS.
  • You can use the crosslinks (if they exist) to go
    from one database to another...

5
  • EMBL: U00096
  • RefSeq: NC_000913
  • UniProtKB: P0A7G6, Swiss-Prot only (there are 2
    fragments in TrEMBL, but they are not from K12)
  • PIR-PSD: G65049, RQECA. Retrieved from UniProt.
  • UniParc: UPI0000112C1C
  • PDB: 1AA3, 1N03, 1REA, 1U94 etc., easily
    retrieved from UniProt.
  • EnsEMBL: not possible, bacteria are not in
    EnsEMBL!
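The crosslink hint from Exercise 2 can also be followed programmatically. The Python sketch below collects the DR (database cross-reference) lines of a UniProtKB flat-file entry and groups the primary identifiers by database. The function name parse_cross_references and the SAMPLE excerpt are illustrative assumptions: the excerpt was written by hand from the identifiers in the answer above, with secondary fields replaced by "-", and is not a copy of the real P0A7G6 entry.

```python
from collections import defaultdict


def parse_cross_references(flat_text):
    """Group primary identifiers of DR lines by database name."""
    xrefs = defaultdict(list)
    for line in flat_text.splitlines():
        # Cross-reference lines start with the two-letter code "DR".
        if not line.startswith("DR   "):
            continue
        # Fields are separated by ";" and the line ends with ".".
        fields = [f.strip() for f in line[5:].rstrip(".").split(";")]
        if len(fields) >= 2:
            database, primary_id = fields[0], fields[1]
            xrefs[database].append(primary_id)
    return dict(xrefs)


# Hand-written, simplified excerpt (assumption, not the real entry).
SAMPLE = """\
AC   P0A7G6;
DR   EMBL; U00096; -; -; Genomic_DNA.
DR   PIR; G65049; RQECA.
DR   PDB; 1AA3; -; -; -.
DR   PDB; 1N03; -; -; -.
"""

if __name__ == "__main__":
    for database, ids in parse_cross_references(SAMPLE).items():
        print(database, ", ".join(ids))
```

Grouping by database name makes it easy to see at a glance which resources an entry links to, which is the same information the cross-reference section of the entry view presents.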
7
  • Exercise 3
  • Find the human erythropoietin protein sequence in
    UniProt.
  • BLASTp it at ExPASy
    (http://www.expasy.org/tools/blast/); restrict
    the BLAST to human sequences (Homo sapiens).
  • Look at the BLAST results and guess from which
    database(s) the protein sequences are derived.
    How many distinct human erythropoietin protein
    sequences do you get?
  • Do the same, but at NCBI
    (http://www.ncbi.nlm.nih.gov/BLAST/). How many
    distinct human erythropoietin protein sequences
    do you get?

8
BLASTp at ExPASy against UniProtKB
Only 2 entries: one annotated in Swiss-Prot, the
other unannotated in TrEMBL. Looking at the
Swiss-Prot entry, you see a lot of rich
information.
9
BLASTp at NCBI against nr
At least 9 entries: RefSeq (ref, 1), GenPept
(embl, gb, 6) and PDB (pdb, 2). The Swiss-Prot
entry has most of these cross-references, and
more besides.
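Counting distinct sequences among hits that come from several source databases amounts to grouping records by their sequence. A minimal Python sketch follows, assuming the hits have been saved in FASTA format. The helper names read_fasta and group_identical, the headers and the short sequences in TOY_HITS are all made up for illustration and are not real erythropoietin records.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def group_identical(records):
    """Map each distinct sequence to the headers that carry it."""
    groups = {}
    for header, sequence in records:
        groups.setdefault(sequence.upper(), []).append(header)
    return groups


# Made-up toy hits: two headers share one sequence, one differs.
TOY_HITS = """\
>hit_from_database_A
MKTAYIAKQR
>hit_from_database_B
MKTAYIAKQR
>hit_from_database_C
MKTAYIAKQK
"""

if __name__ == "__main__":
    groups = group_identical(read_fasta(TOY_HITS))
    print(len(groups), "distinct sequences")
    for sequence, headers in groups.items():
        print(sequence, "<-", ", ".join(headers))
```

Records that differ only in their header collapse into one group, which is why a redundant collection can list many more entries than there are distinct sequences.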
10
Exercise 4: Understanding BLAST output
Compare the results of BLASTp for entry O05891:
- against UniProtKB
  (http://www.expasy.org/tools/blast/)
- against NCBI-nr
  (http://www.ncbi.nlm.nih.gov/BLAST/)
Look for the same best hits and compare the
scores. Why are they different? Keep the
UniProtKB output; we will use it again in a
minute.
11
BLASTp at ExPASy against UniProtKB
BLASTp at NCBI against nr
12
NCBI BLAST FAQ:
http://www.ncbi.nlm.nih.gov/blast/blast_FAQs.shtml
Q: What is the Expect (E) value? The Expect value
(E) is a parameter that describes the number of
hits one can "expect" to see just by chance when
searching a database of a particular size. It
decreases exponentially with the Score (S) that
is assigned to a match between two sequences.
Essentially, the E value describes the random
background noise that exists for matches between
sequences. For example, an E value of 1 assigned
to a hit can be interpreted as meaning that in a
database of the current size one might expect to
see 1 match with a similar score simply by
chance. This means that the lower the E-value, or
the closer it is to "0" the more "significant"
the match is.
13
The Statistics of Sequence Similarity Scores:
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
The E-value of equation (1) applies to the
comparison of two proteins of lengths m and n.
How does one assess the significance of an
alignment that arises from the comparison of a
protein of length m to a database containing many
different proteins, of varying lengths? One view
is that all proteins in the database are a priori
equally likely to be related to the query. This
implies that a low E-value for an alignment
involving a short database sequence should carry
the same weight as a low E-value for an alignment
involving a long database sequence. To calculate
a "database search" E-value, one simply
multiplies the pairwise-comparison E-value by the
number of sequences in the database.
The E value depends on the size of the database.
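The dependence on database size can be made concrete with a short Python sketch. It assumes the bit-score form of the E-value given in the same tutorial, E = m * n * 2^(-S'), where S' is the bit score, and applies the "multiply by the number of sequences" view quoted above. The function names and all numbers in the example are illustrative assumptions, not actual BLAST output.

```python
def pairwise_evalue(bit_score, m, n):
    """E-value for comparing two sequences of lengths m and n.

    Uses the bit-score form E = m * n * 2**(-S'), which already
    absorbs the scoring-system parameters K and lambda.
    """
    return m * n * 2.0 ** (-bit_score)


def database_evalue(bit_score, m, n, num_sequences):
    """Database-search E-value under the view quoted above:
    pairwise E-value multiplied by the number of database sequences."""
    return pairwise_evalue(bit_score, m, n) * num_sequences


if __name__ == "__main__":
    # Same hit (same bit score and lengths), two hypothetical databases.
    for size in (250_000, 5_000_000):
        e = database_evalue(bit_score=60.0, m=300, n=350, num_sequences=size)
        print(f"{size:>9} sequences: E = {e:.3g}")
```

With everything else held fixed, a database with twenty times as many sequences gives an E-value twenty times larger for the same alignment, which is one reason the same best hit is reported with different E-values by servers that search different databases.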
14
Exercise 5: Start site issues in bacteria
Take the UniProtKB BLASTp output for O05891. Align
the first 9 sequences using ClustalW (tool on the
BLAST output page). What do you see, and what is
one possible interpretation? Look at the entry in
UniProt: what can you see to strengthen this
interpretation?
16
MYCTF: Mycobacterium tuberculosis strain F11. It
is not clear whether it is a WGS or a fully
finished genome. In either case there has probably
been an error in the start codon prediction. In
bacteria, codons other than ATG can start a
protein, for example GTG (Val) and TTG (Leu). That
is probably what happened here: another potential
start a few residues upstream, which corresponds
to the predictions for other Mycobacteria, was not
noticed. O05891 has the keyword (KW) "Direct
protein sequencing", and one reference has the
descriptor "Protein sequence of N-terminus".
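The idea of a missed upstream start can be sketched in Python: walk upstream from the annotated start, one codon at a time in the same reading frame, and report ATG, GTG or TTG codons until a stop codon ends the open frame. The function name upstream_starts, its parameters and the toy DNA string are assumptions made for illustration; this is not the gene-prediction method used for any of the genomes discussed here.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def upstream_starts(dna, annotated_start, max_codons=50):
    """Return 0-based positions of in-frame candidate start codons
    located upstream of annotated_start (nearest first)."""
    dna = dna.upper()
    hits = []
    position = annotated_start - 3
    steps = 0
    while position >= 0 and steps < max_codons:
        codon = dna[position:position + 3]
        if codon in STOP_CODONS:
            # An in-frame stop closes the open reading frame.
            break
        if codon in START_CODONS:
            hits.append(position)
        position -= 3
        steps += 1
    return hits


if __name__ == "__main__":
    # Toy sequence: stop, GTG, two sense codons, annotated ATG, codon, stop.
    toy = "TAA" + "GTG" + "GCC" + "AAA" + "ATG" + "CCC" + "TAA"
    print(upstream_starts(toy, annotated_start=12))
```

A candidate found this way is only a hypothesis. As the slide notes, direct N-terminal protein sequencing is the kind of evidence that settles which start is used.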
17
Exercise 6: BLASTp and UniRef
Compare the results of BLASTing P04150 against
UniProtKB, UniRef100, UniRef90 and UniRef50 (use
BLAST at ExPASy). In which cluster(s) do you find
the alternatively spliced sequences (how many are
there)?
18
The UniProt Non-redundant Reference (UniRef)
databases combine closely related sequences
(including some from UniParc) into a single
record to speed searches.
One UniRef100 entry -> all identical sequences
(including fragments); reduction of 12% of DB.
One UniRef90 entry -> sequences that have at
least 90% identity; reduction of 45% of DB.
One UniRef50 entry -> sequences that are at
least 50% identical; reduction of 69% of DB.
Species independent!!
UniRef is useful for comprehensive BLAST sequence
searches by providing sets of representative
sequences.
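How an identity threshold decides which sequences share a record can be illustrated with a toy Python sketch. It performs greedy threshold clustering on ungapped, position-by-position identity over the shorter sequence, which is a deliberate simplification and not the procedure used to build UniRef. The function names, the sequence names and the short sequences are invented for the example.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (no gaps)."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length


def greedy_clusters(sequences, threshold):
    """Cluster sequences: longest first, each joins the first cluster
    whose representative it matches at or above the threshold."""
    clusters = []  # each cluster is a list of names, representative first
    ordered = sorted(sequences.items(), key=lambda item: -len(item[1]))
    for name, sequence in ordered:
        for cluster in clusters:
            representative = sequences[cluster[0]]
            if identity(representative, sequence) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Invented toy sequences: a full-length form, a fragment of it,
# and two variants at 90% and 50% identity to the full-length form.
TOY = {
    "full": "MDSKESLTPG",
    "fragment": "MDSKESL",
    "variant_a": "MDSKESLTPA",
    "variant_b": "MDSAAAATPA",
}

if __name__ == "__main__":
    for threshold in (1.0, 0.9, 0.5):
        print(threshold, greedy_clusters(TOY, threshold))
```

In this toy set the fragment joins the full-length sequence at every threshold, while the variants only merge as the threshold is lowered, mirroring the way related sequences end up in one cluster at the lower identity levels.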
19
First BLAST against UniProtKB
20
UniRef100: more further down the output. They
are not all in the same cluster (remember 12%
reduction in DB size).
21
UniRef90: still not all in the same
cluster (remember 45% reduction in DB size).
22
UniRef50: all in the same cluster (remember 69%
reduction in DB size).
23
  • Exercise 7
  • Different looks and tools for the same entry,
    depending on the server...
  • Starting with the new UniProt server
    (http://beta.uniprot.org/):
  • Look for the amino acid sequence of human
    carbonic anhydrase 2.
  • Get the corresponding nucleic acid entries in
    EMBL and GenBank: try to find a nucleic acid
    sequence derived from genomic DNA sequencing and
    another one derived from cDNA sequencing.
  • From the UniProtKB/Swiss-Prot entry, look at the
    data available for the variant Pro-92 and in
    particular its position in the 3D structure (use
    the Astex viewer).
  • Starting with the NCBI server
    (http://www.ncbi.nlm.nih.gov/):
  • a. Look for the amino acid sequence of human
    carbonic anhydrase 2 using Entrez Protein at the
    NCBI server.
  • b. Find the UniProtKB/Swiss-Prot entry and, as
    above: - Get the corresponding nucleic acid
    entries in EMBL and GenBank. - Find the data
    available for the variant Pro-92.

24
UniProt P00918, follow links
25
NCBI, Entrez protein, can also just type in P00918
26
Note differences in UniProt cross-reference
presentation, and in information present about a
given cross-reference
27
Feature table ordering is very different here,
numerical only
28
  • Exercise 8
  • Environmental sequences: how to check the quality
    of a protein sequence...
  • Look at DQ284920 at EMBL
    (http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-pagetop-newId).
    Where does the sequence come from? How reliable
    is the translated CoDing Sequence (CDS)?
  • How many environmental sequences are found in the
    nucleic acid databases (use SRS (ENV))?
  • Look at DQ380558: can you find the protein
    sequence in UniProtKB? Where does the annotation
    come from (from which type of analysis)?

29
a.
30
b.
(March 11, 2200)
31
c.
34
  • Exercise 9
  • Genomic databases (I)
  • Look for the Swiss-Prot entry of the E. coli gene
    gutQ (http://beta.uniprot.org/).
  • Follow the link to EcoGene (EcoGene: Database of
    Escherichia coli sequence and function) and find
    the chromosomal location.
  • Get the next E. coli gene on the same strand.
  • Follow the link to Swiss-Prot.
  • Find the subcellular localisation of the protein.
  • What regions and domains does the protein
    contain? Visualize them.
  • Have a look at the domain structure in the
    different domain databases. In PROSITE, get the
    list of proteins with at least one common domain.

36
EcoGene page
37
Or in this pull down list
38
Note: currently the NiceProt view
39
Flavodoxin-like
Zinc metallo-hydrolase
Rubredoxin-like
41
From InterPro
or
43
From PROSITE
44
  • Exercise 10
  • Protein domain / family databases
  • How many different databases are used by
    InterPro?
  • Do an InterPro scan with the sequence on the next
    page.
  • How many different domains does the protein
    contain?
  • How many phosphopantetheine-binding domains does
    the protein contain?
  • How many different protein domain databases have
    a discriminator for the phosphopantetheine-binding
    domain? Are they using patterns, profiles or
    HMMs?
  • What are the most frequent domains found in
    Mycobacterium tuberculosis H37Rv? (Go to the
    Integr8 site, complete proteome:
    http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=30)
  • What percentage of proteins in M. tuberculosis
    have a phosphopantetheine-binding domain?

45
MVHATACSEI IRAEVAELLG VRADALHPGA NLVGQGLDSI
RMMSLVGRWR RKGIAVDFAT LAATPTIEAW SQLVSAGTGV
APTAVAAPGD AGLSQEGEPF PLAPMQHAMW
VGRHDHQQLG GVAGHLYVEF DGARVDPDRL RAAATRLALR
HPMLRVQFLP DGTQRIPPAA GSRDFPISVA DLRHVAPDVV
DQRLAGIRDA KSHQQLDGAV FELALTLLPG ERTRLHVDLD
MQAADAMSYR ILLADLAALY DGREPPALGY TYREYRQAIE
AEETLPQPVR DADRDWWAQR IPQLPDPPAL PTRAGGERDR
RRSTRRWHWL DPQTRDALFA RARARGITPA MTLAAAFANV
LARWSASSRF LLNLPLFSRQ ALHPDVDLLV GDFTSSLLLD
VDLTGARTAA ARAQAVQEAL RSAAGHSAYP GLSVLRDLSR
HRGTQVLAPV VFTSALGLGD LFCPDVTEQF GTPGWIISQG
PQVLLDAQVT EFDGGVLVNW DVREGVFAPG VIDAMFTHQV
DELLRLAAGD DAWDAPSPSA LPAAQRAVRA ALNGRTAAPS
TEALHDGFFR QAQQQPDAPA VFASSGDLSY AQLRDQASAV
AAALRAAGLR VGDTVAVLGP KTGEQVAAVL GILAAGGVYL
PIGVDQPRDR AERILATGSV NLALVCGPPC QVRVPVPTLL
LADVLAAAPA EFVPGPSDPT ALAYVLFTSG STGEPKGVEV
AHDAAMNTVE TFIRHFELGA ADRWLALATL ECDMSVLDIF
AALRSGGAIV VVDEAQRRDP DAWARLIDTY EVTALNFMPG
WLDMLLEVGG GRLSSLRAVA VGGDWVRPDL ARRLQVQAPS
ARFAGLGGAT ETAVHATIFE VQDAANLPPD WASVPYGVPF
PNNACRVVAD SGDDCPDWVA GELWVSGRGI ARGYRGRPEL
TAERFVEHDG RTWYRTGDLA RYWHDGTLEF VGRADHRVKI
SGYRVELGEI EAALQRLPGV HAAAATVLPG GSDVLAAAVC
VDDAGVTAES IRQQLADLVP AHMIPRHVTL LDRIPFTDSG
KIDRAEVGAL LAAEVERSGD RSAPYAAPRT VLQRALRRIV
ADILGRANDA VGVHDDFFAL GGDSVLATQV VAGIRRWLDS
PSLMVADMFA ARTIAALAQL LTGREANADR LELVAEVYLE
IANMTSADVM AALDPIEQPA QPAFKPWVKR FTGTDKPGAV
LVFPHAGGAA AAYRWLAKSL VANDVDTFVV QYPQRADRRS
HPAADSIEAL ALELFEAGDW HLTAPLTLFG HCMGAIVAFE
FARLAERNGV PVRALWASSG QAPSTVAASG PLPTADRDVL
ADMVDLGGTD PVLLEDEEFV ELLVPAVKAD YRALSGYSCP
PDVRIRANIH AVGGNRDHRI SREMLTSWET HTSGRFTLSH
FDGGHFYLND HLDAVARMVS ADVR

46
c. 6 domains, 1 PTM, 1 family detected
d. 2 phosphopantetheine-binding domains
47
Pfam: HMM; PROSITE: profile
50
  • Exercise 11
  • Use of UniProtKB/Swiss-Prot for creating datasets
    and prediction tools.
  • Find proteins with the following EC numbers:
    3.5.1.1, 3.5.1.38
  • Look for proteins which have been experimentally
    proven to have an active site.
  • Align the sequences.
  • From the alignment, suggest a pattern based around
    the active threonine (do this manually).
  • Scan your pattern against UniProtKB/Swiss-Prot
    (http://expasy.org/tools/scanprosite/). How many
    matches do you find?
  • Compare your pattern with that found in the
    PROSITE database, PS00144
    (http://www.expasy.org/cgi-bin/prosite-search-ac?PDOC00132).
    How many matches in UniProtKB/Swiss-Prot are
    there with PS00144?
  • Can you do the same with the NCBI nr data?

51
PS00144, have to give the EC numbers
ATGGTIAG: scan against SP, get 13 hits
PROSITE pattern: gives 517 hits against UniProt,
45 against SP
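Scanning a manually derived pattern can be sketched in Python by translating PROSITE pattern syntax into a regular expression. The converter below covers only a subset of the syntax (residues, x, [..], {..}, repeat counts, and < or > anchors on an element). The function names, the toy sequence, and the generalized pattern in the example are assumptions for illustration. In particular the generalized pattern is hypothetical and is not the actual PS00144 pattern.

```python
import re

# One pattern element: optional "<", a residue / x / [set] / {excluded set},
# an optional repeat such as (2) or (2,4), and an optional ">".
_ELEMENT = re.compile(
    r"(<?)(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?(>?)"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset) into a regex string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        start, core, low, high, end = match.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        if low:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + core + ("$" if end else ""))
    return "".join(parts)


def scan(pattern, sequence):
    """Return (1-based start, matched text) for every match,
    including overlapping ones."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MKILATGGTIAGAA"  # made-up sequence
    print(scan("A-T-G-G-T-I-A-G", toy_sequence))
    # Hypothetical, more permissive pattern around the same threonine.
    print(scan("[LIVM]-x(2)-T-G-G-T-[IV]-[AGS]", toy_sequence))
```

A fully literal pattern such as ATGGTIAG only matches sequences carrying exactly those residues, whereas a pattern with alternatives and wildcards matches more sequences, which is consistent with the curated pattern finding more hits than the manual one.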
52
Done March 2007, ANA