Databases, archives, search tools' - PowerPoint PPT Presentation

Slides: 33
Provided by: hech
Transcript and Presenter's Notes

1
Databases, archives, search tools.
Bioinformatics: the convergence of two historical trends in biological research:
- storage of molecular sequences in computer databases
- application of computational algorithms to the analysis of DNA and protein sequences
(Brown 2003, Biotechniques).
2
After the database lecture the student should:
- Understand the differences between primary and secondary databases.
- Understand the differences between sequence similarity search and structured data search.
- Understand the background for maintaining different versions of databases with nearly the same content.
- Understand the difference between curated and raw databases.
- Understand the difference between databases (GenBank non-redundant protein, Swiss-Prot), servers (NCBI, ExPASy) and search programs (BLAST, FASTA).
Why? Most information developed in bioinformatics is stored in databases. Often the same information exists in different formats in different databases, and different servers present the same data in different, more or less user-friendly ways. The choice of database depends on the problem and personal taste. The choice of server may even depend on the time of day and the load (number of users) at that time.
3
Databases (DB).
1. Primary databases: archives, repositories.
2. Secondary databases: specialized databases.
3. Parallel information: mainly American (U.S.A.) versus European (EU) databases.
All databases are listed in the first issue of Nucleic Acids Research (NAR) each year.
4
(No Transcript)
5
Main bioinformatic institutes hosting databases and servers (world map):
- NCBI: Bethesda, Maryland, USA
- EBI: Hinxton, England
- DDBJ: Japan
Together they form the International Nucleotide Sequence Database Collaboration.
6
Primary (repository, archive) DB.
Data derived from direct experimental characterization of DNA or protein. Authors submit their own material, which is curated by the database.
International public databases: all known nucleotide and protein sequences.
- GenBank (founded 1982), hosted at NCBI since 1992
- EMBL (founded ?), hosted at EBI since 1994
- DDBJ (DNA Data Bank of Japan) (founded 1986)
Local databases at institutions doing sequencing: TIGR (founded 1992), Sanger (founded 1992).
Other local databases of sequencing projects are linked to the 3-5 large primary databases.
Commercial DB are not accessible to the public.
7
Secondary DB (specialized DB, derived information resources).
Information curated by the DB. No direct submission from scientists. Further analysis by the database.
- Swiss-Prot (founded by Dr. Amos Bairoch in 1985, hosted at SIB since 1998): annotation, minimum redundancy, integration with other DB, documentation.
- PDB (Protein Data Bank) (founded in 1971): DB of experimentally determined three-dimensional structural information.
- PIR (Protein Information Resource) (founded in 1965 by Margaret O. Dayhoff): receives directly sequenced proteins.
Many, many more (see NAR).
8
Domain and motif specialized DB.
Domains: compact units of proteins behaving independently.
Motifs: conserved regions of proteins, which might be part of a domain.
- BLOCKS (USA) (founded by the Henikoffs): multiple alignments of conserved regions
- PRINTS (UK) (parallel to BLOCKS, based on the OWL DB): hierarchical gene family fingerprints
- PROSITE (associated with Swiss-Prot): biologically significant protein patterns and profiles
- ProDom (automatically created blocks, France)
- Pfam (manually defined domains): multiple sequence alignments and hidden Markov models of common protein domains
- CDD (Conserved Domain Database): alignment models for conserved protein domains
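To make the idea of a sequence pattern concrete, here is a minimal sketch that converts a PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. The pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern; the function names are made up for this example, only a subset of the pattern syntax is handled, and the test sequence is simply the first residues of the LktC translation shown in the GenBank record of this lecture.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern to a Python regex (subset of the syntax)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split off an optional repeat count such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {P} means any residue except P
        # [ST] and single letters are already valid regex syntax.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)

def find_sites(pattern: str, sequence: str):
    """Return (1-based position, matched residues) for non-overlapping hits."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]

# Hypothetical usage with the start of the LktC translation from the lecture.
sites = find_sites("N-{P}-[ST]-{P}", "MNQSYFNLMNSSLHK")
print(sites)
```

A match only says that the pattern occurs in the sequence, not that the site is biologically functional.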
9
Domain, motif specialized DB.
Domains: compact units of proteins behaving independently.
Motifs: conserved regions of proteins, which might be part of a domain.
Search tools for the domain DB:
- DART (Domain Architecture Retrieval Tool)
- SMART (Simple Modular Architecture Research Tool)
- InterPro: links information in PRINTS, PROSITE, ProDom and Pfam
10
Database category: Proteome Resources
- AAindex: physicochemical and biological properties of amino acids
- GELBANK: 2D gel electrophoresis patterns from completed genomes
- REBASE: restriction enzymes and associated methylases
- SWISS-2DPAGE: annotated two-dimensional polyacrylamide gel electrophoresis database
11
Database category: Varied Biomedical Content
- DBcat: catalog of databases
- DrugDB: pharmacologically active compounds, generic and trade names
- GlycoSuiteDB: N- and O-linked glycan structures and biological source information
- NCBI Taxonomy Browser: names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence
- probeBase: rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts, and associated information
- PubMed: MEDLINE and Pre-MEDLINE citations
- RefSeq: reference sequence standards for genomes, genes, transcripts, and proteins
- Tree of Life: information on phylogeny and biodiversity
- VirOligo: virus-specific oligonucleotides for PCR and hybridization
12
International bioinformatic resources (integrated databases, programs and servers).
NCBI (National Center for Biotechnology Information): division of NLM on the NIH campus. Web site: www.ncbi.nlm.nih.gov.
- Repository: GenBank
- Data retrieval: Entrez, PubMed, LocusLink. Entrez is an integrated database retrieval system giving access to all types of data.
- Data analysis: BLAST, Electronic PCR, ORF Finder, and more
13
International bioinformatic resources (integrated databases, programs and servers).
EBI (European Bioinformatics Institute).
- EMBL: repository, Europe's primary collection of nucleotide sequences
- UniProt Knowledgebase: a complete annotated protein sequence database
- Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures
- ArrayExpress: for gene expression data
- Ensembl: provides up-to-date completed metazoan genomes and the best possible automatic annotation
- Tools: ClustalW and many more
14
Example of repository: GenBank.
Submission: 35% by BankIt (individual submissions); the rest are bulk submissions from sequencing centres.
The GI (GenInfo Identifier) number changes with each update of a sequence.
The accession number is constant but is extended by a version number.
DNA sequences: two letters, six digits (old format: one letter, five digits).
Protein sequences: three letters, five digits (old format: one letter, five digits).
Non-redundant (nr)?
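The accession formats listed on this slide can be checked mechanically. The following is a minimal sketch, assuming only the formats named above (real databases use additional formats); the function name and the category labels are invented for this example, and the identifiers come from the leukotoxin record used in this lecture.

```python
import re

# Formats as described on the slide; this is not a complete list of real formats.
NEW_NUCLEOTIDE = re.compile(r"^[A-Z]{2}\d{6}$")   # two letters, six digits
NEW_PROTEIN = re.compile(r"^[A-Z]{3}\d{5}$")      # three letters, five digits
OLD_FORMAT = re.compile(r"^[A-Z]\d{5}$")          # one letter, five digits

def parse_accession(identifier: str):
    """Split 'M20730.1' into (accession, version, guessed sequence type)."""
    accession, _, version = identifier.strip().upper().partition(".")
    if NEW_NUCLEOTIDE.match(accession):
        kind = "nucleotide"
    elif NEW_PROTEIN.match(accession):
        kind = "protein"
    elif OLD_FORMAT.match(accession):
        # The old format is shared by DNA and protein entries.
        kind = "old format (nucleotide or protein)"
    else:
        kind = "unknown"
    return accession, int(version) if version.isdigit() else None, kind

for identifier in ("M20730.1", "AAA25528.1", "P16535"):
    print(parse_accession(identifier))
```

The old one-letter format cannot be assigned to DNA or protein from its shape alone, which is why the sketch reports it as ambiguous.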
15
Example of protein search in DB: leukotoxin
(Frey and Kuhnert 2002)
16
GenBank DNA-sequence format
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  Pasteurella haemolytica A1 leukotoxin gene, encoding LktA, LktB,
            LktC and LktD proteins, complete cds.
ACCESSION   M20730
VERSION     M20730.1  GI:150492
KEYWORDS    LktA protein; LktB protein; LktC protein; LktD protein.
SOURCE      Mannheimia haemolytica
  ORGANISM  Mannheimia haemolytica
            Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales;
            Pasteurellaceae; Mannheimia.
REFERENCE   1  (bases 1 to 7801)
  AUTHORS   Lo,R.Y., Strathdee,C.A. and Shewen,P.E.
  TITLE     Nucleotide sequence of the leukotoxin genes of Pasteurella
            haemolytica A1
  JOURNAL   Infect. Immun. 55 (9), 1987-1996 (1987)
  MEDLINE   87306837
   PUBMED   3040588
COMMENT     Original source text: P.haemolytica (serotype 1, biotype A) DNA.
            Submitted in computer readable form by C.Strathdee, 21-SEP-1988.
FEATURES             Location/Qualifiers
     source          1..7801
                     /organism="Mannheimia haemolytica"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:75985"
17
GenBank DNA-sequence format
     CDS             470..973
                     /note="LktC protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25528.1"
                     /db_xref="GI:150493"
                     /translation="MNQSYFNLMNSSLHK..
     CDS             989..3850
                     /note="LktA protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25529.1"
                     /db_xref="GI:150494"
                     /translation="MGTRLTTLSNGLKNTLTATKS..
ORIGIN      3 bp upstream of EcoRV site.
        1 gatatcttgt gcctgcgcag taaccacaca cccgaataaa agggtcaaaa gtgttttttt
       61 cataaaaagt ccctgtgttt tcattataag gattaccact ttaacgcagt tactttctta
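Because the GenBank flat file is a structured text format, its fields can be pulled out with a few lines of code. The sketch below uses only the Python standard library on a shortened excerpt of record M20730; the helper name is invented for this example, and it handles only simple start..end locations, not complement() or join() locations.

```python
import re

# Shortened excerpt of the M20730 record shown on the slides.
RECORD = """\
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
ACCESSION   M20730
VERSION     M20730.1  GI:150492
FEATURES             Location/Qualifiers
     CDS             470..973
                     /note="LktC protein"
                     /protein_id="AAA25528.1"
     CDS             989..3850
                     /note="LktA protein"
                     /protein_id="AAA25529.1"
"""

def parse_cds(record: str):
    """Return (start, end, note, protein_id) for each simple CDS feature."""
    features = []
    # Split the feature table at every line that starts a CDS feature.
    for block in re.split(r"^ {5}CDS +", record, flags=re.MULTILINE)[1:]:
        location = re.match(r"(\d+)\.\.(\d+)", block)
        note = re.search(r'/note="([^"]*)"', block)
        protein = re.search(r'/protein_id="([^"]*)"', block)
        if location:
            features.append((int(location.group(1)), int(location.group(2)),
                             note.group(1) if note else None,
                             protein.group(1) if protein else None))
    return features

version = re.search(r"^VERSION\s+(\S+)", RECORD, flags=re.MULTILINE).group(1)
for start, end, note, protein_id in parse_cds(RECORD):
    print(version, note, protein_id, f"{start}..{end}", f"{end - start + 1} bp")
```

For real work an established parser is a better choice than hand-written regular expressions; the point here is only that each identifier in the record (accession, version, protein_id) sits in a predictable place.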
18
Genome level comparison: COG (Clusters of Orthologous Groups)
A  RNA processing and modification
B  Chromatin structure and dynamics
C  Energy production and conversion
D  Cell cycle control and mitosis
E  Amino acid metabolism and transport
F  Nucleotide metabolism and transport
S  Function unknown
19
(No Transcript)
20
(No Transcript)
21
Example of search in specialized
database. Selection of DB. More versions of
the same acc. no. Different types of
identifiers. Links (or lack of these) to other
specialized databases
22
Example of protein search in DB: leukotoxin
(Frey and Kuhnert 2002)
23
Example of protein search in DB: leukotoxin
NCBI (http://www.ncbi.nlm.nih.gov/): Protein, keywords Mannheimia haemolytica leukotoxin: over 100 hits.
Swiss-Prot (http://expasy.org/sprot/): wrong name (Pasteurella), only one sequence (P16535). Swiss-Prot no. LKA1_PASHA.
NCBI search with P16535.
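The keyword search above can also be expressed as a query URL for the NCBI Entrez E-utilities (the esearch service). The sketch below only builds the URL and does not contact the server; the function name, the choice of field tags and the retmax value are assumptions made for this example.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities esearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(database: str, organism: str, keyword: str,
                      retmax: int = 20) -> str:
    """Build a structured (keyword) search URL; no network access is made."""
    term = f"{organism}[Organism] AND {keyword}[All Fields]"
    query = urlencode({"db": database, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"

url = build_esearch_url("protein", "Mannheimia haemolytica", "leukotoxin")
print(url)
```

Calling the same function with the older genus name Pasteurella gives a second URL, which makes it easy to compare how the organism name used in the query affects the hits.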
24
Example with 16S rRNA based identification of bacteria.
Relevant for food, veterinary and environmental microbiology.
16S rRNA sequence comparison is preferred for classification/identification:
- 16S rRNA genes are universally distributed
- There is only one type of ribosome
- No selection and no recombination (in theory)
- 16S rRNA gene sequence derived phylogeny reflects the natural relationship of bacteria
- Current framework for bacterial taxonomy
- Huge databases
25
Example of sequence submission to a primary database.
Isolate P876, 16S rRNA gene sequence. Length: 1449 bp.
TGCAAGTCGA ACGGTAGCAG
GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG
TGAGTAATGC TTGGGAATCT GGCTTATGGA
GGGGGATAAC TGTGGGAAAC TGCAGCTAAT ACCGCGTAAT
CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA
TAAGATGAGC CCAAGTGGGA TTAGGTAGTT GGTGGGGTAA
AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA
TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT
ACGGGAGGCA GCAGTGGGGA ATATTGCGCA ATGGGGGGAA
CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG
GGTTGTAAAG TTCTTTCGGT AATGAGGAAG GGGTGTTrTT
kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC
CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT
GCGAGCGTTA ATCGGAATAA CTGGGCGTAA AGGGCACGCA
GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC
TTGGGAATTG CATTTCAGAC TGGGAGTCTA GAGTACTTTA
GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG
AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG
AATGTACTGA CGCTCATGTG CGAAAGCGTG GGGAGCAAAC
AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC
GATTTGGGGA TTGGGCTTTA AGCTTGGTGC CCGAAGCTAA
CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT
AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG
AGCATGTGGT TTAATTCGAT GCAACGCGAA GAACCTTACC
TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT
GCCTTCGGGA ACTTAGAGAC AGGTGCTGCA TGGCTGTCGT
CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA
GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA
ACTCAAAGGA GACTGCCAGT GACAAACTGG AGGAAGGTGG
GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA
CACACGTGCT ACAATGGTGC ATACAGAGGG CAGCGAGAGT
GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG
ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT
AGTAATCGCA AATCAGAATG TTGCGGTGAA TACGTTCCCG
GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT
GTACCAGAAG TAGATAGCTT AACCTTCGGG AGGGCGTTTA
CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA
Submission to GenBank with BankIt
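The sequence above contains a few lower-case letters (y, r, k). These are IUPAC ambiguity codes for positions where the base could not be called unambiguously. As a minimal sketch of a check one might run before submission, the code below counts the bases and lists the ambiguous positions in a short excerpt of the sequence; the function name is invented for this example and this is not part of BankIt.

```python
# IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def check_sequence(raw: str):
    """Return (length, ambiguous positions, invalid characters)."""
    sequence = "".join(raw.split()).upper()   # drop spaces and line breaks
    ambiguous = [(i + 1, base, IUPAC_AMBIGUITY[base])
                 for i, base in enumerate(sequence) if base in IUPAC_AMBIGUITY]
    invalid = sorted({base for base in sequence
                      if base not in "ACGT" and base not in IUPAC_AMBIGUITY})
    return len(sequence), ambiguous, invalid

# Excerpt from the P876 sequence that contains the ambiguity code y.
length, ambiguous, invalid = check_sequence("TAAAGGGTGG GACyTTAGGG")
print(length, ambiguous, invalid)
```

Positions are reported 1-based, as in the GenBank flat file.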
26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Errors are detected during automatic translation of DNA to protein, when the sequence is curated at the database.
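A minimal sketch of the kind of check that automatic translation makes possible is shown below: the coding sequence is translated and problems such as an internal stop codon are reported. This is an illustration, not the procedure actually used by GenBank; the function names and test sequences are invented, and the codon assignments are those of the standard genetic code, which translation table 11 shares (table 11 differs in its permitted start codons).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become X, stops become *."""
    dna = "".join(dna.split()).upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def translation_problems(dna: str):
    """Return the translation and a list of suspicious findings."""
    cleaned = "".join(dna.split())
    protein = translate(cleaned)
    problems = []
    if len(cleaned) % 3:
        problems.append("length is not a multiple of three")
    if "*" in protein[:-1]:
        problems.append(f"internal stop codon at codon {protein.index('*') + 1}")
    if "X" in protein:
        problems.append("codon with ambiguous or invalid base")
    return protein, problems

print(translation_problems("ATGAATCAATCTTAA"))   # clean reading frame
print(translation_problems("ATGTAAAATCAA"))      # stop codon in the middle
```

A single inserted or deleted base in a submitted sequence typically shows up as a frameshift followed by an early stop codon, which is why translation is a useful sanity check.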
31
BLAST: 16S rRNA
TGCAAGTCGA ACGGTAGCAG
GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG
TGAGTAATGC TTGGGAATCT GGCTTATGGA
GGGGGATAAC TGTGGGAAAC TGCAGCTAAT ACCGCGTAAT
CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA
TAAGATGAGC CCAAGTGGGA TTAGGTAGTT GGTGGGGTAA
AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA
TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT
ACGGGAGGCA GCAGTGGGGA ATATTGCGCA ATGGGGGGAA
CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG
GGTTGTAAAG TTCTTTCGGT AATGAGGAAG GGGTGTTrTT
kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC
CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT
GCGAGCGTTA ATCGGAATAA CTGGGCGTAA AGGGCACGCA
GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC
TTGGGAATTG CATTTCAGAC TGGGAGTCTA GAGTACTTTA
GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG
AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG
AATGTACTGA CGCTCATGTG CGAAAGCGTG GGGAGCAAAC
AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC
GATTTGGGGA TTGGGCTTTA AGCTTGGTGC CCGAAGCTAA
CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT
AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG
AGCATGTGGT TTAATTCGAT GCAACGCGAA GAACCTTACC
TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT
GCCTTCGGGA ACTTAGAGAC AGGTGCTGCA TGGCTGTCGT
CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA
GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA
ACTCAAAGGA GACTGCCAGT GACAAACTGG AGGAAGGTGG
GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA
CACACGTGCT ACAATGGTGC ATACAGAGGG CAGCGAGAGT
GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG
ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT
AGTAATCGCA AATCAGAATG TTGCGGTGAA TACGTTCCCG
GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT
GTACCAGAAG TAGATAGCTT AACCTTCGGG AGGGCGTTTA
CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA
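BLAST accepts the query as a bare sequence or in FASTA format (a header line starting with ">" followed by the sequence). As a small sketch, the code below turns a sequence written in blocks of ten bases, as above, into a FASTA record; the function name and the record name P876_16S are chosen for this example only.

```python
def to_fasta(name: str, raw_sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with lines of at most `width` bases."""
    sequence = "".join(raw_sequence.split())   # remove spaces and line breaks
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"

# First three blocks of the P876 16S rRNA sequence from the slide.
record = to_fasta("P876_16S", "TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT", width=20)
print(record, end="")
```

The case of the letters is left unchanged, so the lower-case ambiguity codes in the sequence stay visible in the query.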
32
Four pieces of DB advice.
Start with NCBI and/or Swiss-Prot. Remember the differences between 1. repository, archive and 2. specialized databases.
Parallel resources often exist in Europe and the USA.
Find help in the scientific literature.
Be aware of errors in the DB.
Cite the databases correctly (see the first issue of NAR each year).