NCBI Molecular Biology Resources - PowerPoint PPT Presentation

Slides: 75
Provided by: Pete1153
1
NCBI Molecular Biology Resources
  • A Field Guide

Nov. 6, 2001
2
NCBI Resources
  • About NCBI
  • NCBI Sequence Databases
  • Primary Database GenBank
  • Derivative Databases - RefSeq
  • Entrez Databases and Text Searching
  • BLAST Services
  • Genomic Resources

3
The National Center for Biotechnology Information
(NCBI)
  • Created as a part of the National Library of
    Medicine in 1988
  • Establish public databases
  • Research in computational biology
  • Develop software tools for sequence analysis
  • Disseminate biomedical information
  • Tools: BLAST (1990), Entrez (1992)
  • GenBank (1992)
  • Free MEDLINE (PubMed, 1997)
  • Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM,
    UniGene, GeneMap, Taxonomy, CGAP, SAGE,
    LocusLink, RefSeq

4
Molecular Databases
  • Primary Databases
  • Original submissions by experimentalists
  • Database staff organize but don't add additional
    information
  • Example: GenBank
  • Derivative Databases
  • Human curated
  • Compilation and correction of data
  • Example: SWISS-PROT, NCBI RefSeq mRNA
  • Computationally Derived
  • Example: UniGene
  • Combinations
  • Example: NCBI Genome Assembly

5
What is GenBank? NCBI's Primary Sequence Database
  • Nucleotide only sequence database
  • Archival in nature
  • GenBank Data
  • Direct submissions: individual records (BankIt,
    Sequin)
  • Batch submissions: via email (EST, GSS, STS)
  • FTP accounts: sequencing centers
  • Data shared nightly among three collaborating
    databases
  • GenBank
  • DNA Database of Japan (DDBJ).
  • European Molecular Biology Laboratory Database
    (EMBL) at EBI.

6
Entrez
NIH
NCBI
GenBank
  • Submissions
  • Updates
  • Submissions
  • Updates

EMBL
DDBJ
EBI
CIB
NIG
  • Submissions
  • Updates

SRS
EMBL
getentry
7
(No Transcript)
8
GenBank
9
GenBank on FTP site
ftp> open ftp.ncbi.nlm.nih.gov
. .
ftp> cd genbank
Release 125: 243 files, 55.23 gigabytes uncompressed
10
GenBank Divisions
Bulk Sequence Divisions
  • PAT Patent
  • EST Expressed Sequence Tags (133 files)
  • STS Sequence Tagged Site
  • GSS Genome Survey Sequence (41 files)
  • HTG High Throughput Genome (25 files)
  • HTC High Throughput cDNA
  • CON Contig
Traditional Divisions
  • BCT, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT
11
EST Division Expressed Sequence Tags
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGC
C TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCC
AGCAGAGAATGGAAAGTCAAAT TTCCTGAATTGCTATGTGTCTGGGTTT
CATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA GAATTG
AAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTAT
CTCTTGTACTACAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCT
GCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC AAGTTNAGTTTAAG
TGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGC
CGCNTT TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTT
TAATATTGGATATGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
CT TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGT
TTCATTCATTATAACAAATTTCC AATAATCCTGTCAATNATATTTCTAA
ATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA CTTAT
GCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTC
AAATCTGACCAAGAT GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCA
CCTCTANGTTGCCAGCCCTC
  • Isolate unique clones
  • Sequence once from each end

80-100,000 RNA gene products
12
STS Division Sequence Tagged Sites
  • Segment of gene, EST, mRNA or genomic DNA of
    known position (microsatellite)
  • PCR with STS primers gives unique product (one
    per genome)
  • Basis of Radiation Hybrid Mapping
  • UniGene
  • Genome Assembly
  • Related resource: Electronic PCR
  • http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi

13
RH mapping using STSs
14
ePCR Results: Hexokinase 1 EST
SHGC-35892
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: 30..160
Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation
Query sequence:
    1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
   61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
  121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
  181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
  241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
  301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
  361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
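The primer match reported on this slide can be illustrated with a minimal sketch. It assumes exact string matching only, which is a simplification: the real e-PCR program is not described in the slides, and the function names here are hypothetical. The forward primer is searched as written, and the reverse primer is searched as its reverse complement downstream of the forward primer.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_amplicon(template: str, forward: str, reverse: str):
    """Locate an exact-match amplicon for a primer pair.

    Returns (start, end, size) in 1-based inclusive coordinates,
    or None if either primer is not found.
    """
    start = template.find(forward)
    if start == -1:
        return None
    # The reverse primer anneals to the opposite strand, so look for its
    # reverse complement downstream of the forward primer.
    rc = reverse_complement(reverse)
    rc_start = template.find(rc, start + len(forward))
    if rc_start == -1:
        return None
    end = rc_start + len(rc)  # 0-based exclusive end == 1-based inclusive end
    return start + 1, end, end - start


# First 180 bases of the query sequence shown on the slide.
query = (
    "TTTTTGAATTGGTACAAAGTTTACTAGGTCATACGACACGGCTCACAAAGCGGTGGGAAA"
    "TTCCAGTGATGGCATTGTTTGTTGGTTGGTTCCTTTTATCCAAATGGAGACAAGACACAT"
    "TTCCGCAGACGTGTCCACCTCCCCCCACGAGACAAACAGAATGCAAGACTGTCACACGCG"
)
result = find_amplicon(query, "CATACGACACGGCTCACAAA", "CTGTTTGTCTCGTGGGGG")
print(result)
```

With the slide's primers, the forward primer starts at base 30 and the computed product is 130 bases long, which agrees with the observed amplicon size reported above.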
15
Genome Sequencing
16
GSS Division Genome Survey Sequences
  • Genomic equivalent of ESTs
  • BAC and other first pass surveys
  • BAC end sequences
  • Whole Genome Shotgun (some)
  • RAPIDS and other anonymous loci

SP6 end
T7 end
17
HTG Division High Throughput Genome Records
40,000 to > 350,000 bp
18
The GenBank Record
19
A Simple GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
20
GenBank Record, cont.
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
21
Sequence and Database Identifiers
  • Locus, accession, gi, version

Modification Date
Sequence length
mol-type: mRNA (cDNA), rRNA, snRNA, DNA

Locus Name
GB Division
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
Accession Number
VERSION     AF062069.2  GI:7144484
DEF line (Title)
Accession.version
gi number
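The identifiers labelled on this slide can be pulled out of the two header lines with a short sketch. It assumes the simple whitespace-separated LOCUS layout of the linear mRNA record shown here and a `GI:` prefix on the VERSION line; the function and class names are hypothetical, and records with extra LOCUS fields (for example `circular`) would need more handling.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str       # locus name
    length: int     # sequence length in bp
    mol_type: str   # molecule type, e.g. mRNA
    division: str   # GenBank division, e.g. INV
    date: str       # modification date


def parse_locus(line: str) -> Locus:
    """Split a simple LOCUS line into its whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 7 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    _, name, length, _, mol_type, division, date = fields
    return Locus(name, int(length), mol_type, division, date)


def parse_version(line: str):
    """Return (accession, version, gi) from a VERSION line."""
    _, acc_ver, gi_field = line.split()
    accession, version = acc_ver.split(".")
    gi = int(gi_field.split(":")[1])
    return accession, int(version), gi


locus = parse_locus(
    "LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000"
)
print(locus)
print(parse_version("VERSION     AF062069.2  GI:7144484"))
```

The split between accession and version matters in practice: the accession stays fixed across sequence updates, while the version suffix and gi number change.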
22
Keywords, Source-organism
  • Legacy field
  • exception
  • EST
  • GSS
  • HTG

Accepted common name
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
Scientific name
Taxonomic lineage according to GenBank
23
Citation
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.
Article
Submitter Block
Update history
Previous version
24
Feature Table
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL"
Biosource
Coding Sequence
Reading Frame
GenPept Protein Identifiers
25
Sequence
Indicates beginning of sequence data
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
          <sequence omitted>
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//
End of record
26
NCBI Derivative Sequence Databases: RefSeq
NCBI Reference Sequences
mRNAs and Proteins
  • NM_123456 Curated mRNA
  • NP_123456 Curated Protein
  • XM_123456 Predicted Transcript
  • XP_123456 Predicted Protein
Gene Records
  • NG_123456 Reference Genomic Sequence
Assemblies
  • NT_123456 Contig (Mouse and Human Genomes)
  • NC_123455 Chromosome (Microbial Genomes)
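The accession prefixes listed on this slide lend themselves to a simple lookup. The sketch below only covers the seven prefixes named here, and the function name is hypothetical; it is an illustration of the naming scheme, not an NCBI tool.

```python
# Prefix table taken from the slide; anything else is reported as unknown.
REFSEQ_PREFIXES = {
    "NM": "Curated mRNA",
    "NP": "Curated protein",
    "XM": "Predicted transcript",
    "XP": "Predicted protein",
    "NG": "Reference genomic sequence",
    "NT": "Contig",
    "NC": "Chromosome",
}


def classify_refseq(accession: str) -> str:
    """Describe a RefSeq accession by the two letters before the underscore."""
    prefix, separator, _ = accession.partition("_")
    if not separator or prefix not in REFSEQ_PREFIXES:
        return "not a RefSeq accession in this table"
    return REFSEQ_PREFIXES[prefix]


for acc in ("NM_000492", "XM_004980", "NT_007935", "AF062069"):
    print(acc, "->", classify_refseq(acc))
```

The underscore is the quick visual cue: GenBank accessions such as AF062069 never contain one, so the two record types are easy to tell apart.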
27
Curated RefSeq Records NM_, NP_
LOCUS       NM_000492    6159 bp    mRNA            PRI       26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator (CFTR) mRNA.
ACCESSION   NM_000492
RefSeq Nucleotide
LOCUS       NP_000483    1480 aa                    PRI       26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_000483
PID         g4502785
VERSION     NP_000483.1  GI:4502785
DBSOURCE    REFSEQ: accession NM_000492.1
RefSeq Protein
COMMENT     REFSEQ: This reference sequence was derived from M55131.
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different from
            this one.
28
Alignment Generated Transcripts XM_, XP_
LOCUS       XM_004980    6128 bp    mRNA            PRI       16-NOV-2000
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator, ATP-binding cassette (sub-family C, member 7)
            (CFTR), mRNA.
ACCESSION   XM_004980
VERSION     XM_004980.3  GI:13631444
mismatch
29
RefSeq Human Contig NT_
     mRNA            complement(join(1255889..1257642,1258986..1259091,
                     1259690..1259862,1271619..1271708,1281957..1282112,
                     1296780..1297028,1309837..1309937,1312742..1312969,
                     1313881..1314031,1317797..1317876,1320768..1321018,
                     1321687..1321724,1329492..1329620,1331893..1332616,
                     1334111..1334197,1336717..1336811,1364895..1365086,
                     1375727..1375909,1382442..1382534,1384204..1384450,
                     1387877..1388002,1389139..1389302,1390185..1390274,
                     1393436..1393651,1415408..1415516,1420187..1420297,
                     1444403..1444587))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance
                     regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_004980.1"
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis using
                     gene prediction method Acembly. Supporting evidence
                     includes similarity to 9 proteins, 1 mRNAs See details in
                     AceView"
     gene            complement(1255889..1444587)
                     /gene="CFTR"
                     /note="CF MRP7 ABC35 ABCC7"
                     /db_xref="LocusID:1080"
LOCUS       NT_007935  1888399 bp    DNA             CON       16-NOV-2000
DEFINITION  Homo sapiens chromosome 7 working draft sequence segment,
            complete sequence.
ACCESSION   NT_007935
VERSION     NT_007935.1  GI:11422165
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1  (bases 1 to 1888399)
  AUTHORS   International Human Genome Project collaborators.
  TITLE     Toward the complete sequence of the human genome
  JOURNAL   Unpublished
COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from
            assembled genomic sequence data. They may include both draft
            and finished sequence.
            COMPLETENESS: not full length.
CONTIG      join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445,
            gap(100),AC074390.2:1..5245,gap(100),
            complement(AC074390.2:17705..23645),gap(100),
            AC074390.2:97658..119425,AC073042.3:106479..121155,
            AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100),
            AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100),
            complement(AC073042.3:183627..209083),gap(100),
            AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437,
            gap(100),complement(AC073042.3:6483..8319),gap(100),
            complement(AC073042.3:39354..45372),gap(100),
            complement(AC073042.3:21461..24064),gap(100),
            AC074390.2:156347..160294,gap(100),
            complement(AC074390.2:5346..10750),gap(100),
            complement(AC074390.2:153911..156246),gap(100),
            complement(AC074390.2:23746..32402),gap(100),
            complement(AC074390.2:151546..153810),gap(100),
            complement(AC074390.2:57277..75275),gap(100),
            complement(AC074390.2:75376..97557),gap(100),
Reordering draft sequence
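The `complement(join(...))` location on the mRNA feature encodes the exon structure of the transcript on the contig. The following sketch, with hypothetical function names, shows one way to read such a string and add up the exon lengths. It assumes plain `start..end` spans only and does not handle partial markers, remote accessions, or the `gap(...)` elements of a CONTIG line.

```python
import re


def parse_location(location: str):
    """Return (is_complement, [(start, end), ...]) for a simple feature location."""
    text = "".join(location.split())  # drop spaces and line wraps
    is_complement = text.startswith("complement(")
    spans = [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]
    return is_complement, spans


def spliced_length(spans) -> int:
    """Total length of all spans, using inclusive coordinates."""
    return sum(end - start + 1 for start, end in spans)


# The first three exons of the CFTR mRNA feature shown on the slide.
strand, exons = parse_location(
    "complement(join(1255889..1257642,1258986..1259091,1259690..1259862))"
)
print(strand, exons, spliced_length(exons))
```

Because coordinates are inclusive, a span `a..b` covers `b - a + 1` bases; forgetting the `+ 1` is the usual off-by-one when converting to zero-based slices.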
30
Map View of RefSeqs
NT_
XM_
NM_
31
RefSeq Genome Records NG_
32
RefSeq Chromosomes NC_
LOCUS       NC_002695  5498450 bp    DNA   circular  BCT       02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H.,
            Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T.,
            Honda,T., Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia
            coli O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
33
Other NCBI Derivative Databases
  • UniGene - gene oriented expressed sequence clusters
  • LocusLink - central resource and interface for known genes
34
NCBI Homepage
35
NCBI Homepage
36
Using Entrez
  • An integrated database search and retrieval system

37
Entrez Neighboring and Hard Links
Word weight
3-D Structure
3 -D Structure
VAST
Phylogeny
(MMDB)
Protein sequences
BLAST
BLAST
38
WWW Entrez
  • All of MEDLINE plus others
  • Abstracts
  • Links to online Journals

GenBank, EMBL, DDBJ RefSeq, PDB
GenBank, DDBJ, EMBL translations PDB, PIR,
SWISS-PROT, PRF, RefSeq
Reference Genomes Graphical views, assembled
sequence and mapping data
NCBI's MMDB - derived from PDB
39
Database Searching with Entrez
  • Using limits and field restriction to find
    mouse GAPD
  • Linking and neighboring with mouse GAPD

40
Entrez Nucleotides
41
Document Summaries: Mouse[All Fields]
3 million records
Chicken not mouse !?
42
Entrez Nucleotides Limits Preview/Index
43
Entrez Nucleotides Limits
Mouse
Exclude unwanted categories of sequences
Gene Location: Genomic DNA/RNA, Mitochondrion, Chloroplast
Molecule: Genomic DNA/RNA, mRNA, rRNA
Only From: RefSeq, GenBank, EMBL, DDBJ
44
Entrez Nucleotides Limits Organism
Mouse
45
Document Summaries: Mouse[Organism]
46
Exclude Bulk Sequences, mRNA
47
Adding Terms Preview/Index
Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter,
Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism,
Page Number, Primary Accession, Properties, Protein Name,
Publication Date, SeqID String, Sequence Length, Substance Name,
Text Word, Title Word, Uid, Volume
Search History
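Field restriction in Entrez attaches a field name in square brackets to each search term, and terms are combined with Boolean operators. A minimal sketch of composing such a query from the field names on this slide follows; the helper name is hypothetical and the code only builds the string, it does not contact Entrez.

```python
def build_query(terms) -> str:
    """Join (term, field) pairs into an Entrez query string with AND.

    Each pair becomes term[Field]; field names are taken as given.
    """
    return " AND ".join(f"{term}[{field}]" for term, field in terms)


# Restrict "mouse" to the Organism field and "GAPD" to the Gene Name field.
query = build_query([("mouse", "Organism"), ("GAPD", "Gene Name")])
print(query)
```

Restricting `mouse` to the Organism field is what removes the unrelated hits that an All Fields search returns, such as records from other species that merely mention mouse in their text.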
48
Mouse GAPD Records
49
Displaying Mouse GAPD Records
Formats: Summary, Brief, GenBank, ASN.1, FASTA, GI list
Links and neighbors (related records): LinkOut, PubMed Links, Protein
Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome
Links, Taxonomy Links, OMIM Links
50
Entrez GenBank / GenPept
GenPept
51
FASTA Format
>gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCA
TGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACT
GTTGTCC AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCAT
GTCCATGCCCATGCCCTGTGATCAGACC ACCTCCACCCAAGCTTGAGGA
TCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGA
CAAGTTTGAAGAGGCTCCCC CTCCCCCTCCCCCTCCTCCTCCTCCTCCC
CCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT GACAGTGG
GTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATG
GAGAAGGGCATT AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAG
AATACATGGTTTACATGTTCAAATATGACTCCA CACATGGTAGATACAA
AGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGA
TCAA CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTAT
AGGGAATCCCTACGTGGTGGAGTGT ACAGGCGTCTATCTGTCCATCGAG
GCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA CTG
CACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGAC
TATAACCCTGGCTCTAT GACCATTGTCAGCAATGCATCCTGTACCACCA
ACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC TTCGGGATCGT
GGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAG
TGGATGGGC CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCA
AAACATCATCCCATCGTCCACTGGGGCTGC CAAGGCTGTAGGCAAAGTC
ATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAAC
C CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCT
TCTTACTCGGCTATCACGGAGG CTGTGAAAGCTGCAGCCAAGGGACCTT
TGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC GGACTT
TAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCC
TCAATGACAACTTC GTGAAGCTTGTTGCCTGGTACGACAACGAATATGG
CTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA TGTTTAGCCGAGAG
AAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTG
ACTTCG GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAA
AAACGAGAATGCGC
FASTA Definition Line: >gi|193425|gb|M60978.1|MUSGAPDS
>
gi number
Locus Name
Database Identifiers: gb GenBank, emb EMBL, dbj DDBJ, sp SWISS-PROT,
pdb Protein Databank, pir PIR, prf PRF, ref RefSeq
Accession number
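The pieces of the definition line labelled on this slide are separated by vertical bars, so they can be split apart mechanically. This is a minimal sketch for the five-part `gi|number|db|accession|locus` form shown here; the function name is hypothetical and other identifier layouts are rejected rather than guessed at.

```python
# Database identifier tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(line: str) -> dict:
    """Split a '>gi|...|db|accession|locus description' line into its parts."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifier, _, description = line[1:].partition(" ")
    parts = identifier.split("|")
    if len(parts) != 5 or parts[0] != "gi":
        raise ValueError(f"unsupported identifier layout: {identifier!r}")
    _, gi, tag, accession, locus = parts
    return {
        "gi": int(gi),
        "database": DB_TAGS.get(tag, tag),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


record = parse_defline(
    ">gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform"
)
print(record)
```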
52
Abstract Syntax Notation ASN.1
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde
 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products" ,
    update-date std {
      year 1994 ,
      month 11 ,
      day 9 } ,
    source {
      org {
        taxname "Mus musculus" ,
        common "house mouse" ,
        db {
          {
            db "taxon" ,
            tag id 10090 } } } ,
53
NCBI Toolbox
/*
 * asn2ff.c
 * convert an ASN.1 entry to flat file format,
 * using the FFPrintArrayPtrs.
 */
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>
#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input", "stdin", NULL, NULL, TRUE, 'a', ARG_FILE_IN, 0.0, 0, NULL},
    {"Input is a Seq-entry", "F", NULL, NULL, TRUE, 'e', ARG_BOOLEAN, 0.0, 0, NULL},
    {"Input asnfile in binary mode", "F", NULL, NULL, TRUE, 'b', ARG_BOOLEAN, 0.0, 0, NULL},
    {"Output Filename", "stdout", NULL, NULL, TRUE, 'o', ARG_FILE_OUT, 0.0, 0, NULL},
    {"Show Sequence?", "T", NULL, NULL, TRUE, 'h', ARG_BOOLEAN, 0.0, 0, NULL},
54
Protein Neighbors-Structure Links
55
Advanced Neighbors BLink
56
BLink
57
PubMed Link
58
Online Books
59
Entrez Structures
  • Molecular Modeling Database (MMDB) and Cn3D

60
MMDB Molecular Modeling Data Base
  • Derived from experimentally determined PDB
    records
  • Value added to PDB records including
  • Addition of explicit chemical graph information
  • Validation
  • Inclusion of Taxonomy, Citation, and other
    information
  • Conversion to parseable ASN.1 data description
    language
  • Structure neighbors determined by
  • Vector Alignment Search Tool (VAST)

61
Searching MMDB
62
Structure Summary
BLAST neighbors
VAST neighbors
Cn3D viewer
63
Cn3D Displaying Structures
Chloroquine
64
Structure Neighbors
65
Structural Alignments
Chloroquine
NADH
66
Why do we need similarity searching?
  • Identification and annotation
  • Incomplete or no annotations (GenBank)
  • Incorrectly annotated sequences
  • Evolutionary relationships
  • homologous molecules may
  • have similar functions

but it ain't necessarily so!
67
Basic Local Alignment Search Tool
  • Widely used similarity search tool
  • Heuristic approach based on the Smith-Waterman
    algorithm
  • Finds best local alignments
  • Provides statistical significance
  • All combinations (DNA/Protein) query and
    database.
  • DNA vs DNA
  • DNA translation vs Protein
  • Protein vs Protein
  • Protein vs DNA translation
  • DNA translation vs DNA translation
  • www, email server, standalone, and network clients

68
Local Alignment Statistics
High scores of local alignments between two
random sequences follow the Extreme Value Distribution
For ungapped alignments:
Expected number with score S or greater:
$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$
$K$ = scale for search space, $\lambda$ = scale for scoring system
$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
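The two forms of the expectation formula on this slide are algebraically the same once the raw score is converted to a bit score. The sketch below checks this numerically. Here m and n are taken to be the query and database lengths, and the values of lambda, K, m, n and the raw score are illustrative assumptions only, not parameters quoted in the slides.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumed, not taken from the slides).
lam, k = 0.318, 0.134
m, n = 250, 50_000_000   # assumed query length and database length
raw = 80

bits = bit_score(raw, lam, k)
print(bits, expect_from_raw(raw, m, n, lam, k), expect_from_bits(bits, m, n))
```

The practical point of the bit score is that it folds lambda and K into the score itself, so bit scores from searches with different scoring systems can be compared directly.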
69
Scoring Systems
  • Nucleic acids: identity matrix
  • Proteins
  • Position Independent Matrices
  • PAM Matrices (Percent Accepted Mutation)
  • Implicit model of evolution
  • Higher PAM numbers are all calculated from PAM1
  • PAM250 widely used
  • BLOSUM Matrices (BLOck SUbstitution Matrices)
  • Empirically determined from alignment
    of conserved blocks
  • Each includes information up to a certain level
    of identity
  • BLOSUM62 widely used
  • Position Specific Score Matrices (PSSM)
  • PSI and RPS BLAST

70
BLOSUM62
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
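A position-independent matrix such as BLOSUM62 scores an ungapped alignment by summing one table entry per aligned residue pair. The sketch below uses only a handful of entries copied from the matrix above, and the function names are hypothetical. Since the slide gives just the lower triangle, the lookup tries both orders of each pair.

```python
# A few BLOSUM62 entries copied from the slide's lower triangle (row, column).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("C", "C"): 9, ("D", "D"): 6, ("E", "D"): 2,
    ("E", "E"): 5, ("S", "S"): 4, ("T", "S"): 1, ("T", "T"): 5,
    ("W", "C"): -2, ("W", "W"): 11,
}


def pair_score(a: str, b: str) -> int:
    """Look up a substitution score; the matrix is symmetric."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    if (b, a) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in this subset of the matrix")


def ungapped_score(seq1: str, seq2: str) -> int:
    """Sum the pair scores of two equal-length, already aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs equal-length sequences")
    return sum(pair_score(a, b) for a, b in zip(seq1, seq2))


print(ungapped_score("WCSD", "WCTE"))
```

The identities W/W and C/C contribute far more than the conservative substitutions S/T and D/E, which reflects how rarely tryptophan and cysteine are replaced in the conserved blocks the matrix was built from.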
71
Position Specific Substitution Rates
Active site serine
Typical serine
72
Position Specific Score Matrix (PSSM)
        A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
206 D   0 -2  0  2 -4  2  4 -4 -3 -5 -4  0 -2 -6  1  0 -1 -6 -4 -1
207 G  -2 -1  0 -2 -4 -3 -3  6 -4 -5 -5  0 -2 -3 -2 -2 -1  0 -6 -5
208 V  -1  1 -3 -3 -5 -1 -2  6 -1 -4 -5  1 -5 -6 -4  0 -2 -6 -4 -2
209 I  -3  3 -3 -4 -6  0 -1 -4 -1  2 -4  6 -2 -5 -5 -3  0 -1 -4  0
210 S  -2 -5  0  8 -5 -3 -2 -1 -4 -7 -6 -4 -6 -7 -5  1 -3 -7 -5 -6
211 S   4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1  4  3 -6 -5 -3
212 C  -4 -7 -6 -7 12 -7 -7 -5 -6 -5 -5 -7 -5  0 -7 -4 -4 -5  0 -4
213 N  -2  0  2 -1 -6  7  0 -2  0 -6 -4  2  0 -2 -5 -1 -3 -3 -4 -3
214 G  -2 -3 -3 -4 -4 -4 -5  7 -4 -7 -7 -5 -4 -4 -6 -3 -5 -6 -6 -6
215 D  -5 -5 -2  9 -7 -4 -1 -5 -5 -7 -7 -4 -7 -7 -5 -4 -4 -8 -7 -7
216 S  -2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4  7 -2 -6 -5 -5
217 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7
218 G  -3 -6 -4 -5 -6 -5 -6  8 -6 -7 -7 -5 -6 -7 -6 -2 -4 -6 -7 -7
219 P  -2 -6 -6 -5 -6 -5 -5 -6 -6 -6 -7 -4 -6 -7  9 -4 -4 -7 -7 -6
220 L  -4 -6 -7 -7 -5 -5 -6 -7  0 -1  6 -6  1  0 -6 -6 -5 -5 -4  0
221 N  -1 -6  0 -6 -4 -4 -6 -6 -1  3  0 -5  4 -3 -6 -2 -1 -6 -1  6
222 C   0 -4 -5 -5 10 -2 -5 -5  1 -1 -1 -5  0 -1 -4 -1  0 -5  0  0
223 Q   0  1  4  2 -5  2  0  0  0 -4 -2  1  0  0  0 -1 -1 -3 -3 -4
224 A  -1 -1  1  3 -4 -1  1  4 -3 -4 -3 -1 -2 -2 -3  0 -2 -2 -2 -3
Serine scored differently in these two positions
Active site nucleophile
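The contrast between the two serine positions can be read directly from the matrix: unlike BLOSUM62, a PSSM is indexed by position first and residue second. This small sketch copies the rows for positions 211 and 216 from the table above and looks up a few residues; the names are hypothetical.

```python
COLUMNS = "ARNDCQEGHILKMFPSTWYV"  # column order used on the slide

# Rows 211 and 216 of the PSSM, copied from the slide.
PSSM_ROWS = {
    211: [4, -4, -4, -4, -4, -1, -4, -2, -3, -3, -5, -4, -4, -5, -1, 4, 3, -6, -5, -3],
    216: [-2, -4, -2, -4, -4, -3, -3, -3, -4, -6, -6, -3, -5, -6, -4, 7, -2, -6, -5, -5],
}


def pssm_score(position: int, residue: str) -> int:
    """Score for placing `residue` at `position` of the profile."""
    return PSSM_ROWS[position][COLUMNS.index(residue)]


for pos in (211, 216):
    print(pos, {res: pssm_score(pos, res) for res in "SAT"})
```

At position 211 alanine scores as well as serine and threonine is close behind, so the position tolerates substitution. At position 216, in the G-D-S-G-G-P stretch of the table, serine scores 7 while alanine and threonine are both negative, which is the pattern expected for a residue that cannot be replaced.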
73
Gapped Alignments
  • Gapping provides more biologically realistic
    alignments
  • Statistical behavior not completely understood
    for gapped alignments
  • Gapped BLAST parameters must be found by
    simulations for each matrix
  • Affine gap costs: $-(a + bk)$
  • $a$ = gap open penalty, $b$ = gap extend penalty
  • A gap of length 1 receives the score $-(a + b)$
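The affine gap cost is easy to tabulate. In this sketch k is the gap length, and the penalty values a = 11 and b = 1 are chosen only for illustration; the slides do not state which values were used.

```python
def affine_gap_score(k: int, a: int, b: int) -> int:
    """Score of a gap of length k: -(a + b*k), with open penalty a and extend penalty b."""
    if k < 1:
        raise ValueError("gap length must be at least 1")
    return -(a + b * k)


# Illustrative penalties (assumed): open = 11, extend = 1.
for k in (1, 2, 5, 10):
    print(k, affine_gap_score(k, a=11, b=1))
```

With these values one gap of length 10 scores -21 while ten separate gaps of length 1 score -120 in total, which is why affine costs favor a few long gaps over many short ones.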

74
Intermission