A Field Guide part 2

1
A Field Guide part 2
National Center for Biotechnology Information
UT-Health Science Center
February 14, 2006
2
GenBank Records
The Flatfile Format
3
A Typical GenBank Record
LOCUS       NM_019570   4279 bp   mRNA   linear   INV 28-OCT-2004
DEFINITION  Mus musculus REV1-like (S. cerevisiae) (Rev1l), mRNA.
ACCESSION   NM_019570
VERSION     NM_019570.3  GI:50811869
KEYWORDS    .
4
GenBank Record Feature Table
5
GenBank Record Feature Table, cont.
GenPept identifier
6
GenBank Record sequence
7
Indexing for Nucleotide UID 59958365
Field               Indexed Terms
primary accession   NM_001012399
title               Bos taurus hemochromatosis (hfe), mRNA.
organism            Bos taurus
sequence length     1168
modification date   2005/02/19
properties          biomol mrna
                    gbdiv mam
                    srcdb refseq
8
Global Entrez Search HFE
HFE
9
Entrez Nucleotide HFE
137 records
Not HFE
10
Smarter Query
hfe[title]
AND human[orgn]
11
hfe[title] AND human[orgn] (cont)
Primary data
12
Preview/Index: Gateway to Advanced Searches
13
Preview/Index
14
Preview/Index: Properties, srcdb
Properties
15
Preview/Index: Properties, srcdb
AND srcdb refseq[Properties]
16
Preview/Index: Properties, srcdb
AND srcdb ddbj/embl/genbank[Properties]
17
Database Queries
  • #1 hfe (137 records)
  • #2 hfe[title] AND human[orgn] (42 records)
  • #3 #2 AND srcdb refseq[prop] (11 records)
  • #4 #2 AND srcdb ddbj/embl/genbank[prop] (31 records)
  • #5 #4 AND gbdiv pri[prop] (29 records)
  • #6 #4 AND gbdiv est[prop] (2 records)
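
The way these queries are assembled from field-tagged terms can be sketched in a few lines of Python. This is only an illustration: the helper names `tag`, `all_of`, and `esearch_url` are hypothetical, and the E-utilities ESearch endpoint is an assumption that does not appear on the slides. The URL is built but never sent.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def tag(term, field):
    """Attach an Entrez field tag: tag("hfe", "title") -> "hfe[title]"."""
    return f"{term}[{field}]"

def all_of(*clauses):
    """Combine clauses with the Entrez AND operator."""
    return " AND ".join(clauses)

def esearch_url(db, term):
    """Build (but do not send) an ESearch request URL for a query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

# The title/organism query, then the same query limited to RefSeq records.
query2 = all_of(tag("hfe", "title"), tag("human", "orgn"))
query3 = all_of(query2, tag("srcdb refseq", "prop"))

print(query3)
print(esearch_url("nucleotide", query3))
```

Each history step on the slide narrows the previous result set, which is why every later query is written here as the earlier query plus one more AND clause.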
18
Molecule Queries
  • #1 hfe (116 records)
  • #2 hfe[title] AND human[orgn] (42 records)
  • #3 #2 AND biomol mrna[prop] (29 records)
  • #4 #2 AND biomol genomic[prop] (13 records)

19
More Queries
Fields are database-specific
20
More Queries
Fields are database-specific
21
Other Entrez Databases
UniGene: rat clusters that have at least one mRNA
    rat[organism] NOT 0[mrna count]
SNP: uniquely mapped microsatellites on human chr2
    (microsat[SNP Class] AND 1[Map Weight] AND 2[Chromosome]) AND human[orgn]
UniSTS: markers on the Genethon map of human chromosome 12
    Genethon[Map Name] AND human[organism] AND 12[chromosome]
Structure: structures of bacterial kinases with resolutions below 2 Å
    bacteria[organism] AND kinase AND 000.00:002.00[resolution]
22
Genome Resources
Genomic Biology
23
Genomic Biology
24
Gen Biol Gen Resources
25
Map Viewer Genome Annotation Updates
26
Gen Biol Gen Resources
27
Genome Projects microb
28
Genome Projects microb
13 Eukaryotic Genome Sequencing Projects Selected
Complete - 0, Assembly - 2, In Progress - 11
29
Genome Projects microb
13 Eukaryotic Genome Sequencing Projects Selected
Complete - 0, Assembly - 2, In Progress - 11
30
Gen Biol Gen Resources
31
Gen Biol Gen Resources
32
Gen Biol Gen Resources
33
Gen Biol Gen Resources
34
Gen Biol Gen Resources
35
Genome Resources
Genomic Biology
UniGene
36
UniGene
Gene-oriented clusters of expressed sequences
  • Automatic clustering using MegaBlast
  • Each cluster represents a unique gene
  • Informed by genome hits
  • Information on tissue types and map locations
  • Useful for gene discovery and selection of
    mapping reagents

37
A Cluster of ESTs
query
5 EST hits
3 EST hits
38
UniGene Collections
39
UniGene Collections
Species UniGene
40
UniGene Hs build 188
41
UniGene Cluster Hs.95351: Lipase, hormone-sensitive (LIPE)
42
UniGene Cluster Hs.95351
43
UniGene Cluster Hs.95351 expression
44
UniGene Cluster Hs.95351 seqs
45
Get Sequences
web page
46
Genome Resources
Genomic Biology
47
E-PCR
Genomic sequence here
48
Options
49
Results
50
reverse e-pcr
51
reverse e-pcr
52
reverse e-pcr
53
reverse e-pcr
Gene
STS
LY6G6D lymphocyte antigen 6 complex, locus G6D
54
Genome Resources
Genomic Biology
55
List View
56
Human MapViewer
57
MapViewer Human ADAR
58
MV Hs ADAR
59
Maps Options
--Sequence maps-- Ab initio, Assembly, Repeats, BES_Clone, Clone, NCI_Clone, Contig, Component, CpG island, dbSNP haplotype, Fosmid, GenBank_DNA, Gene, Phenotype, SAGE_Tag, STS, TCAG_RNA, Transcript (RNA), Hs_UniGene, Hs_EST
--Cytogenetic maps-- Ideogram, FISH Clone, Gene_Cytogenetic, Mitelman Breakpoint, Morbid/Disease
--Genetic Maps-- deCODE, Genethon, Marshfield
--RH maps-- GeneMap99-G3, GeneMap99-GB4, NCBI RH, Stanford-G3, TNG, Whitehead-RH, Whitehead-YAC
Mm_UniGene, Mm_EST, Rn_UniGene, Rn_EST, Ssc_UniGene, Ssc_EST, Bt_UniGene, Bt_EST, Gga_UniGene, Gga_EST, Variation
60
MapViewer
Component
Gene
UniGene
Repeats
61
Phenotype
Variation
Gene
62
Maps Options
63
Chimp ADAR
Human ADAR
Mouse ADAR
64
Genome Resources
Genomic Biology
Trace Archive
65
Trace Archive Page
66
Ciona savignyi Traces
67
(No Transcript)
68
Trace Archive BLAST Page
Potential access to sequences NOT yet in GenBank
69
Basic Local Alignment Search Tool
70
BLAST Web Searches, 2005
200,000
71
Precomputed BLAST Services
  • Nucleotide or protein: Related Sequences
  • BLAST link: BLink
  • Transcript clusters: UniGene
  • Protein homologs: HomoloGene

72
Link to Related Sequences
73
Related Sequences
Most similar
Least similar
74
BLink (BLAST Link)
75
BLink Output
76
Why Is BLAST So Popular?
  • Fast
    - heuristic approach based on Smith-Waterman
  • Local alignments
  • Statistical significance
    - Expect value
  • Versatile
    - blastn, blastp, blastx, tblastn, tblastx, rps-blast, psi-blast
    - www, standalone, and network clients

77
Global vs Local Alignment
78
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)
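
The difference between the two alignment modes can be made concrete with a small dynamic-programming sketch. The scoring scheme below (match +1, mismatch -1, linear gap cost -1) is an assumption chosen for illustration; the slide does not state one. With `local=True` the recurrence is Smith-Waterman style (scores are floored at zero and the best cell anywhere wins); with `local=False` it is Needleman-Wunsch style (end-to-end).

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=True):
    """Best alignment score by dynamic programming with a linear gap cost."""
    # First row: zeros for local alignment, accumulated gaps for global.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, cur[j - 1] + gap)
            if local:
                cell = max(cell, 0)      # a local alignment may restart anywhere
                best = max(best, cell)   # and may end anywhere
            cur.append(cell)
        prev = cur
    return best if local else prev[-1]

seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print("local :", align_score(seq1, seq2, local=True))
print("global:", align_score(seq1, seq2, local=False))
```

The local score rewards the best shared stretch (both sequences contain HERE), while the global score must also pay for every unmatched residue at both ends.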
79
How BLAST Works
  • Make lookup table of words for query
  • Scan database for hits
  • Extend alignment both directions
  • Ungapped extensions of hits (initial HSPs)
  • Gapped extensions (no traceback)
  • Gapped extensions (traceback - alignment details)

80
Protein Words
GTQ TQI QIT ITV TVE VED
EDL DLF ...
Make a lookup table of words
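
A minimal sketch of this step, assuming a word size of 3 and the query prefix GTQITVEDLF (inferred from the overlapping words listed on the slide, which continues past DLF). Note one simplification: BLASTP also adds similar "neighborhood" words that score above a threshold, which this sketch leaves out.

```python
from collections import defaultdict

def word_table(query, w=3):
    """Map every overlapping w-letter word of the query to its start offsets."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return dict(table)

table = word_table("GTQITVEDLF")
for word, offsets in table.items():
    print(word, offsets)
```

Scanning the database then reduces to looking up each database word in this table, so every hit immediately gives the query offset to extend from.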
81
BLASTP Summary
Drop-off score = highest score - current score
-X: X dropoff value for gapped alignment (in bits)
blastn 30, megablast 20, tblastx 0, all others 15
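
The drop-off rule can be illustrated with an ungapped extension to the right of a word hit. This is a simplified sketch, not BLAST's implementation: it works in raw scores rather than bits, uses an assumed +1/-3 nucleotide scoring, and the function name `extend_right` and the default `x_drop=5` are hypothetical.

```python
def extend_right(q, s, qi, si, x_drop=5, match=1, mismatch=-3):
    """Extend an ungapped hit rightwards from q[qi] / s[si].

    Stops once the running score has fallen more than x_drop below the
    highest score seen, and returns (best_score, length_of_best_extension).
    """
    score = best = 0
    best_len = 0
    k = 0
    while qi + k < len(q) and si + k < len(s):
        score += match if q[qi + k] == s[si + k] else mismatch
        k += 1
        if score > best:
            best, best_len = score, k
        elif best - score > x_drop:   # drop-off = highest score - current score
            break
    return best, best_len

print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=5))
print(extend_right("AAAATAAAA", "AAAACAAAA", 0, 0, x_drop=2))
```

A larger X lets the extension ride through a mismatch and recover; a smaller X stops earlier and runs faster.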
82
BLASTP Summary
High-scoring pair (HSP)
83
Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A    1  -3  -3  -3
G   -3   1  -3  -3
C   -3  -3   1  -3
T   -3  -3  -3   1
-r 1 -q -3
CAGGTAGCAAGCTTGCATGTCA
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
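
The arithmetic on this slide can be checked with a short sketch. The aligned pair has 19 identical columns, 2 mismatches, and 1 gap column. Charging the gap column 3, like a mismatch, is an assumption made here only because it reproduces the slide's 19 - 9 = 10; blastn itself uses separate gap open and extension costs.

```python
def raw_score(top, bottom, reward=1, penalty=-3, gap=-3):
    """Count matches, mismatches and gap columns; return them with the raw score.

    gap=-3 is an assumption that reproduces the slide's total of 10.
    """
    matches = mismatches = gaps = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            gaps += 1
        elif x == y:
            matches += 1
        else:
            mismatches += 1
    score = matches * reward + mismatches * penalty + gaps * gap
    return matches, mismatches, gaps, score

print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))
```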
84
Scoring Systems - Proteins
  • Position Independent Matrices
    - PAM Matrices (Percent Accepted Mutation)
      - Derived from observation of a small dataset of alignments
      - Implicit model of evolution
      - All calculated from PAM1
      - PAM250 widely used
    - BLOSUM Matrices (BLOck SUbstitution Matrices)
      - Derived from observation of a large dataset of highly conserved blocks
      - Each matrix derived separately from blocks with a defined percent identity cutoff
      - BLOSUM62 - default matrix for BLAST
  • Position Specific Score Matrices (PSSMs)
    - PSI- and RPS-BLAST

85
BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
86
Position-Specific Score Matrix
Serine/Threonine protein kinases catalytic loop
DAF-1
87
Position-Specific Score Matrix
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
435 K -1  0  0 -1 -2  3  0  3  0 -2 -2  1 -1 -1 -1 -1 -1 -1 -1 -2
436 E  0  1  0  2 -1  0  2 -1  0 -1 -1  0  0  0 -1  0  0 -1 -1 -1
437 S  0  0 -1  0  1  1  0  1  1  0 -1  0  0  0  2  0 -1 -1  0 -1
438 N -1  0 -1 -1  1  0 -1  3  3 -1 -1  1 -1  0  0 -1 -1  1  1 -1
439 K -2  1  1 -1 -2  0 -1 -2 -2 -1 -2  5  1 -2 -2 -1 -1 -2 -2 -1
440 P -2 -2 -2 -2 -3 -2 -2 -2 -2 -1 -2 -1  0 -3  7 -1 -2 -3 -1 -1
441 A  3 -2  1 -2  0 -1  0  1 -2 -2 -2  0 -1 -2  3  1  0 -3 -3  0
442 M -3 -4 -4 -4 -3 -4 -4 -5 -4  7  0 -4  1  0 -4 -4 -2 -4 -1  2
443 A  4 -4 -4 -4  0 -4 -4 -3 -4  4 -1 -4 -2 -3 -4 -1 -2 -4 -3  4
444 H -4 -2 -1 -3 -5 -2 -2 -4 10 -6 -5 -3 -4 -3 -2 -3 -4 -5  0 -5
445 R -4  8 -3 -4  0 -1 -2 -3 -2 -5 -4  0 -3 -2 -4 -3 -3  0 -4 -5
446 D -4 -4 -1  8 -6 -2  0 -3 -3 -5 -6 -3 -5 -6 -4 -2 -3 -7 -5 -5
447 I -4 -5 -6 -6 -3 -4 -5 -6 -5  3  5 -5  1  1 -5 -5 -3 -4 -3  1
448 K  0  0  1 -3 -5 -1 -1 -3 -3 -5 -5  7 -4 -5 -3 -1 -2 -5 -4 -4
449 S  0 -3 -2 -3  0 -2 -2 -3 -3 -4 -4 -2 -4 -5  2  6  2 -5 -4 -4
450 K  0  3  0  1 -5  0  0 -4 -1 -4 -3  4 -3 -2  2  1 -1 -5 -4 -4
451 N -4 -3  8 -1 -5 -2 -2 -3 -1 -6 -6 -2 -4 -5 -4 -1 -2 -6 -4 -5
452 I -3 -5 -5 -6  0 -5 -5 -6 -5  6  2 -5  2 -2 -5 -4 -3 -5 -3  3
453 M -4 -4 -6 -6 -3 -4 -5 -6 -5  0  6 -5  1  0 -5 -4 -3 -4 -3  0
454 V -3 -3 -5 -6 -3 -4 -5 -6 -5  3  3 -4  2 -2 -5 -4 -3 -5 -3  5
455 K -2  1  1  4 -5  0 -1 -2  1 -4 -2  4 -3 -2 -3  0 -1 -5 -2 -3
456 N  1  1  3  0 -4 -1  1  0 -3 -4 -4  3 -2 -5 -2  2 -2 -5 -4 -4
457 D -3 -2  5  5 -1 -1  1 -1  0 -5 -4  0 -2 -5 -1  0 -2 -6 -4 -5
458 L -3 -1  0 -3  0 -3 -2  3 -4 -2  3  0  1  1 -2 -2 -3  5 -1 -3
catalytic loop
88
Local Alignment Statistics
Expect Value
E = number of database hits you expect to find by chance with score ≥ S
More info: The Statistics of Sequence Similarity Scores
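
For reference, the expect value is usually written with the Karlin-Altschul formula E = K·m·n·e^(-λS), where m and n are the query and database lengths and S is the raw score. The sketch below evaluates it; the values λ = 0.267 and K = 0.041 are assumed, illustrative parameters (commonly quoted for gapped BLOSUM62 searches), and the lengths are made up for the example.

```python
import math

def expect(raw, m, n, lam, k):
    """Karlin-Altschul expect value: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)

def bit_score(raw, lam, k):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

# Illustrative, assumed parameters and lengths (not taken from the slides).
LAM, K = 0.267, 0.041
m, n = 250, 1_000_000_000

for raw in (40, 60, 80):
    print(raw, round(bit_score(raw, LAM, K), 1), expect(raw, m, n, LAM, K))
```

Written in bits, the same quantity is E = m·n·2^(-S'), which is why each additional bit of score halves the number of chance hits expected.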
89
An Alignment BLAST Cannot Make
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
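
Whether two nucleotide sequences share any exact word of a given size, which is the precondition for a blastn hit described here, can be checked with a set intersection. This is a sketch on toy sequences; the function name `shared_words` is hypothetical, and the slide's sequences can be passed in the same way.

```python
def shared_words(a, b, w=7):
    """Return the sorted exact w-letter words that occur in both sequences."""
    words_a = {a[i:i + w] for i in range(len(a) - w + 1)}
    words_b = {b[j:j + w] for j in range(len(b) - w + 1)}
    return sorted(words_a & words_b)

# Toy example: one shared 7-mer seeds an alignment, none means no hit.
print(shared_words("ACGTACGTAA", "TTACGTACGG"))
print(shared_words("AAAAAAAAAA", "CCCCCCCCCC"))
```

An empty result means the word scan never produces a hit to extend, however similar the sequences look when aligned by eye.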
90
An Alignment BLAST Can Make
Solution: compare protein sequences (BLASTX)
91
Other BLAST Algorithms
  • Megablast
  • Discontiguous Megablast
  • PSI-BLAST
  • PHI-BLAST

92
Megablast: NCBI's Genome Annotator
  • Long alignments of similar DNA sequences
  • Greedy algorithm
  • Concatenation of query sequences
  • Faster than blastn; less sensitive

93
MegaBLAST Word Size
Trade-off: sensitivity vs speed
94
Discontiguous Megablast
  • Uses discontiguous word matches
  • Better for cross-species comparisons

95
Templates for Discontiguous Words
W 11, t 16, coding        1101101101101101
W 11, t 16, non-coding    1110010110110111
W 12, t 16, coding        1111101101101101
W 12, t 16, non-coding    1110110110110111
W 11, t 18, coding        101101100101101101
W 11, t 18, non-coding    111010010110010111
W 12, t 18, coding        101101101101101101
W 12, t 18, non-coding    111010110010110111
W 11, t 21, coding        100101100101100101101
W 11, t 21, non-coding    111010010100010010111
W 12, t 21, coding        100101101101100101101
W 12, t 21, non-coding    111010010110010010111
W = word size (number of matches in template)
t = template length
Reference: Ma, B., Tromp, J., Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002, 18(3):440-5.
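
How a template turns into a word match can be sketched as follows: only positions marked 1 must agree, so positions marked 0 (every third base in the coding templates) are free to differ. The helper names `masked_word` and `is_hit` are hypothetical, and the example sequences are invented for illustration.

```python
CODING_W11_T16 = "1101101101101101"   # W 11, t 16, coding template from the slide

def masked_word(seq, start, template):
    """Keep only the bases that fall under a '1' in the template."""
    return "".join(seq[start + i] for i, bit in enumerate(template) if bit == "1")

def is_hit(q, s, qi, si, template):
    """True if q and s agree at every '1' position of the template."""
    return masked_word(q, qi, template) == masked_word(s, si, template)

query   = "ATGGCTGCTAAAGGTC"
subject = "ATAGCCGCCAAGGGCC"   # differs from query only at the template's 0 positions

print(CODING_W11_T16.count("1"), len(CODING_W11_T16))   # word size W, template length t
print(is_hit(query, subject, 0, 0, CODING_W11_T16))
print(query[:11] == subject[:11])                       # a contiguous 11-mer would miss
```

Ignoring the wobble position is what lets the discontiguous word tolerate the synonymous changes that accumulate between species.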
96
(No Transcript)
97
Discontiguous (Cross-species) MegaBLAST
98
Discontiguous Word Options
99
Disco. Megablast Example . . .
Query: NM_078651 Drosophila melanogaster CG18582-PA (mbt) mRNA (3244 bp)
/note="mushroom bodies tiny; synonyms: Pak2, STE20, dPAK2"
Database: nr (nt), Mammalia[orgn]
  • MegaBLAST: No significant similarity found.
  • Discontiguous megaBLAST: numerous hits . . .

100
Ex Discontiguous MegaBLAST
101
Ex BLASTN
102
PSI-BLAST
Position-specific Iterated BLAST
  • Example: Confirming relationships of purine nucleotide metabolism proteins

103
PSI-BLAST
E value cutoff for PSSM
104
RESULTS Initial BLASTP
Same results as protein-protein BLAST; different format
105
Results of First PSSM Search
Other purine nucleotide metabolizing enzymes not
found by ordinary BLAST
106
Tenth PSSM Search Convergence
107
PHI-BLAST
108
What's New?
109
BLAST Databases
  • Nucleotide
    - refseq_rna: NM_, XM_
    - refseq_genomic: NC_, NG_
    - env_nt
      - environmental sample[filter], e.g., 16S rRNA
  • Protein
    - refseq: NP_, XP_
    - env_nr

110
New Formatter
Select lower case
Select red
111
BLAST Output Alignments Filter
low complexity sequence filtered
112
BLAST Output CDS Feature
113
Advanced Options
Limit to Organism
all[filter] NOT ma
Example Entrez Queries
  all[Filter] NOT mammalia[Organism]
  ray finned fishes[Organism]
  srcdb refseq[Properties]
Nucleotide only
  biomol mrna[Properties]
  biomol genomic[Properties]
Other Advanced
  -e 10000 (expect value)
  -v 2000 (descriptions)
  -b 2000 (alignments)
-e 10000 -v 2000
114
Genome BLAST
115
Genome BLAST via Map Viewer
116
Example Human Genome BLAST
117
Human Genome BLAST Results
118
Human Genome BLAST MapViewer
119
Example Mapping Oligos Onto a Genome
>forward
CCATGGCGACCCTGGAAAAGC
>reverse
CAGCAGCGGCTGTGCCTGCGG
120
Map Oligos Onto Genome
>CCATGGCGACCCTGGAAAAGCNNNNNNNNNNCAGCAGCGGCTGTGCCTGCGG
-W 7 -e 1000
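
The query on this slide is the two primers joined by a run of ten N characters, so that both oligos are searched in a single submission. A minimal sketch of that construction follows; the function name `join_oligos` is hypothetical.

```python
def join_oligos(forward, reverse, spacer=10):
    """Concatenate two primers with a run of N so they form one query."""
    return forward + "N" * spacer + reverse

forward = "CCATGGCGACCCTGGAAAAGC"
reverse = "CAGCAGCGGCTGTGCCTGCGG"
query = join_oligos(forward, reverse)

print(">oligos")   # hypothetical FASTA title line
print(query)
```

Because each primer is only 21 bases long, the small word size and permissive expect value shown on the slide are what allow such short matches to be found and reported.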
121
Genome BLAST Results
122
Primer Alignments
reverse primer
forward primer
123
MapViewer
124
MapViewer
125
Sequence View (sv)
forward
reverse
126
Service Addresses
  • BLAST: blast-help@ncbi.nlm.nih.gov
  • General Help: info@ncbi.nlm.nih.gov
  • Wayne Matten: matten@ncbi.nlm.nih.gov