Biological Databases - PowerPoint PPT Presentation

1
Biological Databases
  • Bioinformatics Databases for the Molecular
    Biologist 8.9.2003
  • Lorenza Bordoli

2
Overview
  • Introduction to Biological Databases
  • Sequences: DNA and Protein
  • Species Specific (Genomics)
  • Protein families and domains
  • Mutation and polymorphisms
  • Proteomics
  • 3D structures
  • Conclusions

3
Introduction
4
What is a database ?
  • A collection of
  • structured
  • searchable (index) -> table of contents
  • updated periodically (release) -> new edition
  • cross-referenced (hyperlinks) -> links with
    other db
  • data
  • Includes also associated tools (software)
    necessary for db access/query, db updating, db
    information insertion, db information deletion.
  • Data storage/resource management
  • (flat files, relational databases, object-
    oriented, ...)

5
Why biological databases ?
  • Exponential growth in biological data.
  • Data (genomic sequences, 3D structures, 2D gel
    analysis, MS analysis, microarrays, ...) are no
    longer published in a conventional manner, but
    are submitted directly to databases.
  • Essential tools for biological research.

6
Some statistics
  • More than 1000 different biological databases
  • Variable size: <100 Kb to >10 Gb
  • DNA: > 10 Gb
  • Protein: 1 Gb
  • 3D structure: 5 Gb
  • Other: smaller
  • Update frequency: daily to annually
  • Usually accessible through the web (free!?)
  • Amos links: www.expasy.org/alinks.html
  • Biohunt: http://www.expasy.org/BioHunt/
  • Google: http://www.google.com/

7
ExPASy Server
ExPASy Web Server (ExPASy = Expert Protein
Analysis System)
http://www.expasy.org/
8
  • Some databases in the field of molecular
    biology
  • AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
    ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
    BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
    BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
    CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
    ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
    CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP,
    DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD,
    DPInteract, ECDC, ECGC, ECO2DBASE, EcoCyc,
    EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB,
    ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS,
    Genbank, GeneCards, Genline, GenLink, GENOTK,
    GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb,
    GRR, GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb,
    HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN,
    HPDB, HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT,
    Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb,
    MDB, ...

9
Categories of databases for Life Sciences
  • Sequences (DNA, protein) (primary db)
  • Genomics (Species Specific)
  • Mutation/polymorphism
  • Protein domain/family (-> tools)
  • Proteomics (2D gel, Mass Spectrometry)
  • 3D structure
  • Metabolism
  • Bibliography
  • Others (microarrays, protein-protein
    interaction)

10
Sequence Databases
  • DNA/RNA
  • Protein

11
Ideal minimal content of a sequence database entry
  • Sequences !!
  • Accession number (AC) (unique identifier)
  • Taxonomic data
  • References
  • ANNOTATION/CURATION (-> not always the case!)
  • Keywords
  • Cross-references
  • Documentation
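The minimal content listed above can be pictured as a simple record type. The following is only an illustrative sketch: the class name, field names and the placeholder sequence are assumptions made for this example and do not correspond to any real database schema. The accession number and cross-reference are taken from the EMBL erythropoietin entry used later in this presentation.

```python
from dataclasses import dataclass, field

@dataclass
class SequenceEntry:
    """Hypothetical container for the minimal content of a sequence entry."""
    accession: str                      # unique identifier (AC)
    sequence: str                       # the sequence itself
    taxonomy: list                      # lineage, most general first
    references: list = field(default_factory=list)
    annotation: str = ""                # may be empty: not always curated
    keywords: list = field(default_factory=list)
    cross_references: dict = field(default_factory=dict)

entry = SequenceEntry(
    accession="X02158",
    sequence="ACGT",                    # placeholder, not the real sequence
    taxonomy=["Eukaryota", "Metazoa", "Chordata"],
    keywords=["erythropoietin", "hormone"],
    cross_references={"SWISS-PROT": "P01588"},
)
print(entry.accession, entry.cross_references)
```

Using default factories for the optional fields reflects the point made on the slide: sequence and accession number are always present, while annotation may be missing.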

12
Sequence database example
SWISS-PROT (protein db) (flat file)
Accession number
Taxonomy
Reference
Annotations (comments)
Cross-references
Keywords
13
Sequence database example (cont.)
Annotations (features)
Sequence
14
Sequence Databases
  • DNA/RNA

15
Sequence Database 1. Nucleotide sequences
  • The 3 main nucleic acid sequence databases (DNA)
    are
  • EMBL (Europe) / GenBank (USA) / DDBJ (Japan)
  • EMBL since 1982
  • Specialized databases for the different types of
    RNAs (e.g. tRNA, rRNA, tmRNA, uRNA, ...)
  • 3D structure (DNA and RNA) -> PDB
  • Others: Aberrant splicing db, Eukaryotic promoter
    db (EPD), RNA editing sites, Multimedia Telomere
    Resource, ...

16
Nucleotides and associated topics databases
(Amos links)
  • EMBL - EMBL Nucleotide sequence db (EBI)
  • GenBank - GenBank Nucleotide Sequence db (NCBI)
  • DDBJ - DNA Data Bank of Japan
  • dbEST - dbEST (Expressed Sequence Tags) db (NCBI)
  • dbSTS - dbSTS (Sequence Tagged Sites) db (NCBI)
  • NDB - Nucleic Acid Databank (3D structures)
  • BNASDB - Nucleic acid structure db from
    University of Pune
  • AsDb - Aberrant Splicing db
  • ACUTS - Ancient conserved untranslated DNA
    sequences db
  • Codon Usage Db
  • EPD - Eukaryotic Promoter db
  • HOVERGEN - Homologous Vertebrate Genes db
  • IMGT - ImMunoGeneTics db (mirror at EBI)
  • ISIS - Intron Sequence and Information System
  • RDP - Ribosomal db Project
  • gRNAs db - Guide RNA db
  • PLACE - Plant cis-acting regulatory DNA
    elements db
  • PlantCARE - Plant cis-acting regulatory DNA
    elements db
  • sRNA db - Small RNA db
  • ssu rRNA - Small ribosomal subunit db
  • lsu rRNA - Large ribosomal subunit db
  • 5S rRNA - 5S ribosomal RNA db
  • tmRNA Website
  • tmRDB - tmRNA db
  • tRNA - tRNA compilation from the University of
    Bayreuth
  • uRNADB - uRNA db
  • RNA editing - RNA editing site
  • RNAmod db - RNA modification db
  • SOS-DGBD - Db of Drosophila DNA sequences
    annotated with regulatory binding sites
  • TelDB - Multimedia Telomere Resource
  • TRADAT - TRAnscription Databases and Analysis
    Tools
  • Subviral RNA db - Small circular RNAs db
    (viroid and viroid-like)
  • MPDB - Molecular probe db
  • OPD - Oligonucleotide probe db
  • VectorDB - Vector sequence db (seems dead!)
17
Sequence Database 1. DNA: EMBL/GenBank/DDBJ
  • These 3 db contain essentially the same
    information within 2-3 days of each other (with a
    few differences in format and syntax)
  • Contribution: EMBL 10%, GenBank 73%, DDBJ 17%
  • Serve as archives containing all sequences
    (single genes, ESTs, complete genomes, etc.)
    derived from
  • Genome projects (> 80% of entries)
  • Sequencing centers
  • Individual scientists (15% of entries)
  • Patent offices (e.g. European Patent Office, EPO)
  • Currently 18 x 10^6 sequences, 30 x 10^9 bp
  • Sequences from > 50,000 different species

18
The tremendous increase in nucleotide sequences
(Figure: growth of EMBL data. Annotations: first
increase in data due to the development of PCR;
high-throughput genomes (HTG); human, mouse, rat.)
1980: 80 genes fully sequenced!
19
EMBL/GenBank/DDBJ
  • Heterogeneous sequence qualities and lengths:
    ESTs, genomes, variants, fragments
  • Sequence sizes
  • max 350,000 bp/entry (! genomic sequences,
    overlapping)
  • min 10 bp/entry
  • Archive: nothing goes out -> highly redundant!
  • Full of errors in sequences, in annotations, in
    CDS attribution.
  • No consistency of annotations: most annotations
    are done by the submitters, so the quality,
    completeness and updating of the information are
    heterogeneous
  • Entries contain only the assembly data

21
EMBL/GenBank/DDBJ
  • Unexpected information you can find in these db
  • FT source 1..124
  • FT /db_xref="taxon:4097"
  • FT /organelle="plastid:chloroplast"
  • FT /organism="Nicotiana tabacum"
  • FT /isolate="Cuban cahibo cigar, gift from
    President Fidel Castro"
  • Or
  • FT source 1..17084
  • FT /chromosome="complete mitochondrial genome"
  • FT /db_xref="taxon:9267"
  • FT /organelle="mitochondrion"
  • FT /organism="Didelphis virginiana"
  • FT /dev_stage="adult"
  • FT /isolate="fresh road killed individual"
  • FT /tissue_type="liver"

22
EMBL entry example
  • ID HSERPG standard; DNA; HUM; 3398 BP.
  • XX
  • AC X02158;
  • XX
  • SV X02158.1
  • XX
  • DT 13-JUN-1985 (Rel. 06, Created)
  • DT 22-JUN-1993 (Rel. 36, Last updated, Version
    2)
  • XX
  • DE Human gene for erythropoietin
  • XX
  • KW erythropoietin; glycoprotein hormone;
    hormone; signal peptide.
  • XX
  • OS Homo sapiens (human)
  • OC Eukaryota; Metazoa; Chordata; Craniata;
    Vertebrata; Euteleostomi; Mammalia;
  • OC Eutheria; Primates; Catarrhini; Hominidae;
    Homo.
  • XX
  • RN [1]
  • RP 1-3398

keyword
taxonomy
references
Cross-references
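Because every line of an EMBL flat file starts with a two-letter line code (ID, AC, DE, KW, OS, ...), an entry can be read with very little code. The sketch below is a minimal illustration, not an official parser: the function names are made up for this example, and the sample lines assume the usual EMBL layout (line code, spaces, then content, with ";" separating list items).

```python
from collections import defaultdict

# Sample lines in EMBL flat-file style (layout assumed, see lead-in).
ENTRY = """\
ID   HSERPG     standard; DNA; HUM; 3398 BP.
XX
AC   X02158;
XX
DE   Human gene for erythropoietin
XX
KW   erythropoietin; glycoprotein hormone; hormone; signal peptide.
XX
OS   Homo sapiens (human)
"""

def parse_line_codes(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[2:].strip()
        if code == "XX" or not content:   # XX lines are only spacers
            continue
        fields[code].append(content)
    return dict(fields)

def split_list(value):
    """Split a ';'-separated field (such as AC or KW) into clean items."""
    return [item.strip() for item in value.rstrip(".").split(";") if item.strip()]

entry = parse_line_codes(ENTRY)
accessions = split_list(" ".join(entry["AC"]))
keywords = split_list(" ".join(entry["KW"]))
print(accessions)
print(keywords)
```

Keeping a list per line code matters because codes such as DT, OC or FT usually span several lines in a real entry.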
23
EMBL entry (cont.)
  • CC Data kindly reviewed (24-FEB-1986) by K.
    Jacobs
  • FH Key Location/Qualifiers
  • FH
  • FT source 1..3398
  • FT /db_xref="taxon:9606"
  • FT /organism="Homo sapiens"
  • FT mRNA join(397..627,1194..1339,1596..1682,
    2294..2473,2608..3327)
  • FT CDS join(615..627,1194..1339,1596..1682,
    2294..2473,2608..2763)
  • FT /db_xref="SWISS-PROT:P01588"
  • FT /product="erythropoietin"
  • FT /protein_id="CAA26095.1"
  • FT /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLLE
  • FT AKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVLRG
  • FT QALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTITAD
  • FT TFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
  • FT mat_peptide join(1262..1339,1596..1682,
    2294..2473,2608..2763)
  • FT /product="erythropoietin"
  • FT sig_peptide join(615..627,1194..1261)
  • FT exon 397..627

CDS Coding sequence
annotation
sequence
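The join(...) locations in the feature table describe a feature that is split over several segments (here, exons). As a small sketch of how such a location can be used, the code below adds up the segment lengths of the CDS and signal-peptide features of this entry. The function names are assumptions made for the example, and only the simple "start..end" form is handled (no complement(), no "<" or ">" markers).

```python
import re

def parse_join(location):
    """Return (start, end) pairs from 'join(1..5,9..12)' or a plain '3..8'."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]

def total_length(location):
    """Sum the segment lengths (coordinates are 1-based and inclusive)."""
    return sum(end - start + 1 for start, end in parse_join(location))

CDS = "join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)"
SIG = "join(615..627,1194..1261)"

cds_nt = total_length(CDS)
sig_nt = total_length(SIG)
print(cds_nt, cds_nt // 3 - 1)   # 582 nucleotides, 193 residues without the stop codon
print(sig_nt, sig_nt // 3)       # 81 nucleotides, 27 residues
```

The 193 residues computed from the CDS location agree with the length of the /translation qualifier of the entry.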
24
EMBL: The Genome divisions
http://www.ebi.ac.uk/genomes/
25
Nucleotide databases and "associated" genomic
projects/databases
  • Problem
  • Redundancy makes BLAST searches of the complete
    databases useless for detecting anything beyond
    the closest homologs.
  • Solutions
  • Assemblies of genomic sequence data (contigs)
    and corresponding RNA and protein sequences
    -> dataset of genomic contigs, RNAs and proteins
  • Annotation of genes, RNAs, proteins, variation
    (SNPs), STS markers, gene prediction,
    nomenclature and chromosomal location.
  • Computed connections to other resources
    (cross-references)
  • Examples: RefSeq/LocusLink (Drosophila, human,
    mouse, rat and zebrafish), TIGR (bacteria and
    plants), Ensembl (Eukaryota)

26
Nucleotide databases and "associated" genomic
projects/databases
  • LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink/)
  • Focal point for genes and associated
    information (C. elegans, cow, fruit fly, human,
    human HIV type 1, mouse, rat and zebrafish)
  • RefSeq
  • Reference mRNAs and proteins for
    human, mouse, rat
  • UniGene
  • UniGene clusters, expression data
  • Ensembl
  • Provides a bioinformatics framework to organise
    biology around the sequences of large genomes.
    Available now are human, mouse, rat, fugu,
    zebrafish, mosquito, Drosophila, C. elegans and
    C. briggsae.

27
Nucleotide databases and "associated" genomic
projects/databases
  • LocusLink
  • From gene loci to curated sequences and
    descriptive information.

28
LocusLink
29
LocusLink
30
Nucleotide databases and "associated" genomic
projects/databases
  • RefSeq: Reference sequences of genomic contigs,
    mRNAs, and proteins.

31
Nucleotide databases and "associated" genomic
projects/databases
  • UniGene: an experimental system for
    automatically partitioning GenBank sequences into
    a non-redundant set of gene-oriented clusters.
    Each UniGene cluster contains sequences that
    represent a unique gene, as well as related
    information such as the tissue types in which the
    gene has been expressed and map location.

32
Nucleotide databases and "associated" genomic
projects/databases
  • Ensembl: Provides a bioinformatics framework to
    organise biology around the sequences of large
    genomes. Available now are human, mouse,
    rat, fugu, zebrafish, mosquito, Drosophila, C.
    elegans and C. briggsae.

33
Sequence Databases
  • 2. Protein

34
Sequence Database 2. Proteins
  • SWISS-PROT: created in 1986 (A. Bairoch)
    http://www.expasy.org/sprot/
  • TrEMBL: created in 1996; complement to
    SWISS-PROT, derived from EMBL CDS translations
    ("proteomic" version of EMBL)
  • PIR-PSD: Protein Information Resource
    http://pir.georgetown.edu/
  • GenPept: "proteomic" version of GenBank
  • Many specialized protein databases for specific
    families or groups of proteins (M. Primig)
  • Examples: AMSDb (antibacterial peptides), GPCRDB
    (7 TM receptors), IMGT (immune system), YPD
    (yeast), ...

35
The first protein db
36
Swiss-Prot
  • Collaboration between the SIB (Geneva) and
    EMBL/EBI (UK)
  • Fully manually annotated, non-redundant,
    cross-referenced, documented protein sequence
    database.
  • 113,000 sequences from more than 6,800
    different species; 70,000 references
    (publications); 550,000 cross-references
    (databases); 200 Mb of annotations
  • Weekly releases; available from about 50 servers
    across the world, the main source being ExPASy

37
TrEMBL (Translation of EMBL)
  • It is impossible to cope with the quantity of
    newly generated data AND to maintain the high
    quality of SWISS-PROT -> TrEMBL, created in 1996.
  • TrEMBL is automatically generated (from annotated
    EMBL coding sequences (CDS)) and annotated using
    software tools.
  • Contains everything that is not in SWISS-PROT.
  • SWISS-PROT + TrEMBL = all known protein
    sequences.
  • Well-structured SWISS-PROT-like resource.

38
The simplified story of a Swiss-Prot entry
Some data are not submitted to the public
databases!! (delayed or cancelled)
cDNAs, genomes, ...
  • "Automated"
  • Redundancy check (merge)
  • Family attribution (InterPro)
  • Annotation (computer)

EMBLnew EMBL
CDS
TrEMBLnew TrEMBL
  • "Manual"
  • Redundancy (merge, conflicts)
  • Annotation (manual)
  • Swiss-Prot tools (macros)
  • Swiss-Prot documentation
  • Medline
  • Databases (MIM, MGD, ...)
  • Brainstorming

Swiss-Prot
Once in Swiss-Prot, the entry is no longer in
TrEMBL, but it is still in EMBL (archive)
39
Some nomenclature. Example: SRS6 at the Sanger
Center
http://www.sanger.ac.uk/srs6bin/cgi-bin/wgetz?-page+top
40
SWISS-PROT + (SP)TrEMBL + TrEMBLnew (SWALL,
SPTR) (Standard)
(Preliminary)
  • TrEMBL = SPTrEMBL + REMTrEMBL
  • SPTrEMBL contains TrEMBL entries which will be
    integrated into SWISS-PROT.
  • REMTrEMBL contains TrEMBL entries which will
    never be integrated into SWISS-PROT
    (immunoglobulins and T-cell receptors, synthetic
    sequences, patent application sequences,
    small fragments, CDS not coding for real
    proteins)
  • TrEMBLnew contains entries which have not yet
    been integrated into TrEMBL (weekly update to
    TrEMBL)
  • SPTR (SWall) = SWISS-PROT + (SP)TrEMBL +
    TrEMBLnew
  • ! Usually what we call TrEMBL is (SP)TrEMBL and
    does not include REMTrEMBL!
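These relationships can be read as plain set unions. The sketch below only illustrates that reading: the accession numbers are invented placeholders, not real entries.

```python
# Hypothetical accession numbers, used only to illustrate the set relationships.
swiss_prot = {"P00001", "P00002"}
sp_trembl = {"Q10001", "Q10002"}    # will be integrated into SWISS-PROT
rem_trembl = {"Q20001"}             # will never be integrated
trembl_new = {"Q30001"}             # not yet integrated into TrEMBL

# TrEMBL is the union of SPTrEMBL and REMTrEMBL.
trembl = sp_trembl | rem_trembl

# SPTR (SWall) is the union of SWISS-PROT, (SP)TrEMBL and TrEMBLnew.
sptr = swiss_prot | sp_trembl | trembl_new

print(len(trembl), len(sptr))
print(rem_trembl.isdisjoint(sptr))  # REMTrEMBL entries are not part of SPTR
```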

41
a Swiss-Prot entry overview
42
Protein name Gene name
45
Cross-references
46
Keywords
49
TrEMBL example
Original TrEMBL entry which has been integrated
into the SWISS-PROT EPO_HUMAN entry and thus
which is not found in TrEMBL anymore.
51
Swiss-Prot / TrEMBL: a minimum of redundancy
  • Combining Swiss-Prot and TrEMBL introduces some
    degree of redundancy
  • Only 100% identical sequences are automatically
    merged between SWISS-PROT and TrEMBL
  • Complete sequences or fragments with 1-3
    conflicts will be automatically merged soon
    (genome projects: check for chromosomal location
    and gene names)
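The rule that only 100% identical sequences are merged can be sketched as grouping entries by their exact sequence. This is an illustration of the idea only, not the procedure actually used by the Swiss-Prot/TrEMBL teams; the identifiers and sequences are invented.

```python
from collections import defaultdict

def merge_identical(entries):
    """Group entry identifiers whose sequences are 100% identical."""
    by_sequence = defaultdict(list)
    for identifier, sequence in entries:
        by_sequence[sequence.upper()].append(identifier)
    return list(by_sequence.values())

# Hypothetical entries: (identifier, protein sequence)
entries = [
    ("SP_A", "MGVHECPAWL"),
    ("TR_B", "MGVHECPAWL"),   # identical to SP_A, so it is merged with it
    ("TR_C", "MGVHECPAWI"),   # one conflict, so it stays separate
]

merged = merge_identical(entries)
print(merged)
```

Entries that differ by even one residue stay separate here, which is why fragments and sequences with a few conflicts remain a source of redundancy.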

52
Swiss-Prot / TrEMBL: a minimum of redundancy
Human EPO: Blastp results
53
Swiss-Prot and the cross-references (X-ref)
  • SWISS-PROT was the 1st database with X-ref.
  • Explicitly X-referenced to 36 databases:
    X-ref to DNA (EMBL/GenBank/DDBJ), 3D-structure
    (PDB), literature (Medline), genomic (MIM, MGD,
    FlyBase, SGD, SubtiList, etc.), 2D-gel
    (SWISS-2DPAGE), specialized db (PROSITE,
    TRANSFAC)
  • Implicitly X-referenced to 17 additional db
    added by the ExPASy servers on the WWW (e.g.
    GeneCards, PRODOM, HUGE, etc.)
  • Gasteiger et al., Curr. Issues Mol. Biol.
    (2001), 3(3): 47-55

54
Swiss-Prot
  • Domains, functional sites, protein families:
    PROSITE, InterPro, Pfam, PRINTS, SMART,
    Mendel-GFDb
  • Human diseases: MIM
  • Protein-specific dbs: GCRDb, MEROPS, REBASE,
    TRANSFAC
  • 2D and 3D structural dbs: HSSP, PDB
  • Organism-specific dbs: DictyDb, EcoGene, FlyBase,
    HIV, MaizeDB, MGD, SGD, StyGene, SubtiList, TIGR,
    TubercuList, WormPep, Zebrafish
  • PTM: CarbBank, GlycoSuiteDB
  • 2D-gel protein databases: SWISS-2DPAGE,
    ECO2DBASE, HSC-2DPAGE, Aarhus and Ghent,
    MAIZE-2DPAGE
  • Nucleotide sequence db: EMBL, GenBank, DDBJ
55
  • http://pir.georgetown.edu/

56
  • UniProt
  • United Protein database
  • Swiss-Prot + TrEMBL + PIR
  • Born in Oct 2002
  • "NIH pledges cash for global protein database":
    The United States is turning to European
    bioinformatics facilities to help it meet
    its researchers' future needs for databases of
    protein sequences. European institutions are set
    to be the main recipients of a $15-million,
    three-year grant from the US National Institutes
    of Health (NIH), to set up
    a global database of information on protein
    sequence and function known as the
    United Protein Databases, or UniProt (Nature,
    419, 101 (2002))

57
Species Specific Databases
58
Species Specific Databases
  • Contain information on gene chromosomal location
    (mapping) and nomenclature, and provide links to
    sequence databases; usually do not contain
    sequences (but cross-link to them)
  • All species whose genome has been sequenced,
    annotated and published come with their own
    species genome database that enables scientists
    to retrieve information about DNA and its gene
    products (see also NAR Database issue 2003 at
    http://nar.oupjournals.org/content/vol31/issue1/
    and the new NAR Web server Issue 2003 at
    http://nar.oupjournals.org/content/vol31/issue13)
  • AMOS links: http://www.expasy.org/alinks.html#Organisms
  • Species specific db session

59
Species Specific DBs examples
  • Human
  • GDB: The Genome Database is the official central
    repository for genomic mapping data resulting
    from the Human Genome Initiative. Although GDB
    has historically focused on gene mapping, as the
    Genome Project moves from mapping to sequence to
    functional analysis, GDB's focus will be
    broadened. Extensions are under development in
    the representation of sequence-level genome
    content, including sequence variations, along
    with richer descriptions of function and
    phenotype.
  • GeneCards - Db integrating information on human
    genes
  • GeneLynx - Portal to the human genome

60
GeneCards
an electronic encyclopedia of biological and
medical information
61
Gene Lynx
62
Gene Lynx
63
Species Specific DBs examples
  • Mouse
  • MGI: Mouse Genome Informatics provides
    integrated access to data on the genetics,
    genomics, and biology of the laboratory mouse
  • C. elegans
  • WormBase: an international consortium
    of biologists and computer scientists dedicated
    to providing the research community with
    accurate, current, accessible information
    concerning the genetics, genomics and biology of
    C. elegans and some related nematodes.
  • Yeast
  • SGD: database of the molecular biology and
    genetics of the yeast Saccharomyces cerevisiae
  • Arabidopsis
  • TAIR: The Arabidopsis Information Resource
    provides a comprehensive resource for the
    scientific community working with Arabidopsis
    thaliana, a widely used model plant. TAIR
    consists of a searchable relational database,
    which includes many different datatypes.
  • Drosophila
  • FlyBase: a database of genetic
    and molecular data for Drosophila. FlyBase
    includes data on all species from the family
    Drosophilidae; the primary species represented is
    Drosophila melanogaster.

64
Protein families and domains DB
65
Protein families and domains DB
  • Most proteins have "modular" structures
  • Estimation: 3 domains / protein
  • Motifs: conserved regions within a domain, can be
    identified by multiple sequence alignments
  • Protein motifs can be defined by different
    methods (descriptors)
  • Patterns
  • Profiles
  • HMMs

66
Protein families and domains DB
  • Contain biologically significant motif
    "descriptors" (patterns / profiles / HMMs)
    formulated in such a way that, with appropriate
    computational tools, one can rapidly and reliably
    determine to which known family of proteins (if
    any) a new sequence belongs.
  • Used as a tool to identify the function of
    uncharacterized proteins translated from genomic
    or cDNA sequences ("functional diagnostic")
  • Either manually curated (e.g. PROSITE, Pfam,
    etc.) or automatically generated (e.g. ProDom,
    DOMO)

67
Protein families and domains DB
InterPro
  • PROSITE: Patterns / Profiles
  • ProDom: Aligned motifs (PSI-BLAST) (Pfam B)
  • PRINTS: Aligned motifs
  • Pfam: HMM (Hidden Markov Models)
  • SMART: HMM
  • TIGRfam: HMM
  • DOMO: Aligned motifs
  • BLOCKS: Aligned motifs (PSI-BLAST)
  • CDD (CDART): PSI-BLAST (PSSM) of Pfam and SMART
68
Protein families and domains DB: PROSITE
  • Contains fully annotated functional domains,
    based on two methods: patterns and profiles
  • Entries are deposited in PROSITE in two distinct
    files
  • Pattern/profiles with the list of all matches in
    Swiss-Prot
  • Documentation
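A PROSITE pattern is close to a regular expression: elements are separated by "-", "x" stands for any residue, "[ST]" lists allowed residues, "{P}" lists forbidden residues, and "(n)" or "(n,m)" gives a repeat count. The sketch below translates this core syntax into a Python regular expression. It is a simplified illustration, not the PROSITE scanning software; the function name is an assumption, and rarer syntax (for example ">" inside square brackets) is not handled. The example uses the well-known N-glycosylation site pattern N-{P}-[ST]-{P} on a fragment of the erythropoietin translation shown earlier.

```python
import re

def prosite_to_regex(pattern):
    """Translate the core PROSITE pattern syntax into a Python regex string."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' anchors to the N-terminus
        anchor_end = element.endswith(">")       # '>' anchors to the C-terminus
        element = element.strip("<>")
        # Separate the residue part from an optional repeat such as (2) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                           # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # forbidden residues
        # '[...]' groups and single residues are already valid regex syntax.
        piece = core + ("{" + repeat + "}" if repeat else "")
        regex.append(("^" if anchor_start else "") + piece + ("$" if anchor_end else ""))
    return "".join(regex)

n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
fragment = "AKEAENITTGCAEHCSLNENITVPDTKVN"
hits = [m.start() for m in re.finditer(n_glyc, fragment)]
print(n_glyc, hits)
```

Note that `re.finditer` reports non-overlapping matches only, whereas a real pattern scan also reports overlapping ones.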

69
PROSITE Documentation
70
PROSITE entry
Diagnostic performance
List of matches
71
PROSITE entry
Diagnostic performance
List of matches
72
PROSITE access
Search for an entry
Search for the occurrence of a domain in your
protein
73
Protein families and domains DB: Pfam
74
Protein families and domains DB: InterPro
  • Composite DB: direct access to different protein
    family DBs

75
Protein families and domains DB: InterPro
76
Mutations and Polymorphisms DB
77
Mutations and Polymorphisms DB
  • Contain information on sequence variations,
    linked or not to genetic diseases
  • Mainly human, but also OMIA - Online Mendelian
    Inheritance in Animals
  • General db
  • OMIM
  • HGMD - Human Gene Mutation db
  • SVD - Sequence variation db
  • HGBASE - Human Genic Bi-Allelic Sequences db
  • dbSNP - Human single nucleotide polymorphism
    (SNP) db
  • Disease-specific db: most of these databases are
    either linked to a single gene or to a single
    disease
  • p53 mutation db
  • ADB - Albinism db (mutations in human genes
    causing albinism)
  • Asthma and Allergy gene db

78
Mutations and Polymorphisms: Definitions
  • SNPs: single nucleotide polymorphisms; occur
    approximately once every 100 to 300 bases
  • (distinction between sequencing error and
    polymorphism!)
  • c-SNPs: coding single nucleotide polymorphisms
    (single nucleotide polymorphisms within cDNA
    sequences)
  • SAPs: single amino-acid polymorphisms
  • Missense mutation -> SAP
  • Nonsense mutation -> STOP
  • Insertion/deletion of nucleotides -> frameshift
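The three consequences listed above can be made concrete with the standard genetic code. The following sketch classifies a single-codon substitution as silent, missense (SAP) or nonsense (STOP), and an insertion/deletion as frameshift or in-frame. The function names and the example codons are illustrative assumptions; special cases such as a mutated stop codon are not treated separately.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense (STOP)"
    return "missense (SAP)"

def classify_indel(length_change):
    """An insertion/deletion shifts the frame unless its size is a multiple of 3."""
    return "frameshift" if length_change % 3 else "in-frame"

print(classify_substitution("GAG", "GTG"))  # Glu -> Val
print(classify_substitution("TGG", "TGA"))  # Trp -> stop
print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
print(classify_indel(-1), classify_indel(3))
```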

79
Mutations and Polymorphisms DB: examples
  • OMIM: Online Mendelian Inheritance in Man
  • Catalog of human genes and genetic disorders
  • Contains a summary of literature and reference
    information. It also contains links to
    publications and sequence information.
  • TSC: The SNP Consortium
  • Public/private collaboration: Bayer, Roche, IBM,
    Pfizer, Novartis, Motorola
  • SNPs: dbSNP at NCBI
  • Collaboration between the National Human Genome
    Research Institute and the National Center for
    Biotechnology Information (NCBI)
  • Chromosome 21 dbSNP
  • A joint project between the Division of Medical
    Genetics of the
    University of Geneva Medical School and the SIB

81
Mutations and Polymorphisms DB
  • Generally of modest size; the lack of
    coordination and standards in these databases
    makes it difficult to access the data.
  • There are initiatives to unify these databases
  • SVD: Sequence Variation Database project at EBI
    (HMutDB)
  • (http://www2.ebi.ac.uk/mutations/)
  • HUGO Mutation Database Initiative (MDI).
  • Human Genome Variation Society
  • (http://www.genomic.unimelb.edu.au/mdi/dblist/dblist.html)

82
Proteomics Databases: SWISS-2DPAGE
83
Proteomics DB
  • Contain information obtained by 2D-PAGE: images
    of master gels and descriptions of identified
    proteins
  • Examples: SWISS-2DPAGE, ECO2DBASE, Maize-2DPAGE,
    Sub2D, Cyano2DBase, etc.
  • Composed of image and text files

84
Proteomics DB SWISS-2DPAGE
85
Proteomics DB SWISS-2DPAGE
86
Proteomics DB SWISS-2DPAGE
87
3D structures Databases: PDB
88
3D structures Databases: PDB
  • Worldwide repository for the processing and
    distribution of 3-D biological macromolecular
    structure data
  • Proteins represent more than 90% of available
    structures (others are DNA, RNA, sugars, viruses,
    protein/DNA complexes)
  • http://www.pdb.org
  • Contains protein structures solved experimentally
    (X-ray, NMR, EM)
  • Provides
  • Coordinates (often structure factors, NOEs,
    other experimental data)
  • stored as PDB or mmCIF files
  • Images
  • Links to derived data, e.g. similar structures,
    fold families, etc.
89
3D structures Databases PDB
  • SHEET 3 S10 PHE 66 PHE 70 -1 O ASN 67 N LEU 60 12CA 68
  • SHEET 4 S10 TYR 88 TRP 97 -1 O PHE 93 N VAL 68 12CA 69
  • SHEET 5 S10 ALA 116 ASN 124 -1 O HIS 119 N HIS 94 12CA 70
  • SHEET 6 S10 LEU 141 VAL 150 -1 O LEU 144 N LEU 120 12CA 71
  • SHEET 7 S10 VAL 207 LEU 212 1 O ILE 210 N GLY 145 12CA 72
  • SHEET 8 S10 TYR 191 GLY 196 -1 O TRP 192 N VAL 211 12CA 73
  • SHEET 9 S10 LYS 257 ALA 258 -1 O LYS 257 N THR 193 12CA 74
  • SHEET 10 S10 LYS 39 TYR 40 1 O LYS 39 N ALA 258 12CA 75
  • TURN 1 T1 GLN 28 VAL 31 TYPE VIB (CIS-PRO 30) 12CA 76
  • TURN 2 T2 GLY 81 LEU 84 TYPE II(PRIME) (GLY 82) 12CA 77
  • TURN 3 T3 ALA 134 GLN 137 TYPE I (GLN 136) 12CA 78
  • TURN 4 T4 GLN 137 GLY 140 TYPE I (ASP 139) 12CA 79
  • TURN 5 T5 THR 200 LEU 203 TYPE VIA (CIS-PRO 202) 12CA 80
  • TURN 6 T6 GLY 233 GLU 236 TYPE II (GLY 235) 12CA 81
  • CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21 2 12CA 82
  • ORIGX1 1.000000 0.000000 0.000000 0.00000 12CA 83
  • ORIGX2 0.000000 1.000000 0.000000 0.00000 12CA 84
  • ORIGX3 0.000000 0.000000 1.000000 0.00000 12CA 85
  • SCALE1 0.023419 0.000000 0.006100 0.00000 12CA 86
Coordinates of each atom
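Each PDB record starts with a record name (SHEET, TURN, CRYST1, ...), so individual records are easy to pick out. As a small sketch, the code below reads the unit-cell parameters from the CRYST1 record shown on this slide and derives the cell volume with the general triclinic formula. Two simplifications are assumptions of this example: the record is split on whitespace (real PDB files use fixed columns), and the function names are made up.

```python
import math

def parse_cryst1(record):
    """Read the six unit-cell parameters from a whitespace-separated CRYST1 record."""
    fields = record.split()
    if fields[0] != "CRYST1":
        raise ValueError("not a CRYST1 record")
    a, b, c, alpha, beta, gamma = (float(x) for x in fields[1:7])
    return a, b, c, alpha, beta, gamma

def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from edge lengths and angles given in degrees."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)

record = "CRYST1 42.700 41.700 73.000 90.00 104.60 90.00 P 21"
cell = parse_cryst1(record)
volume = cell_volume(*cell)
print(cell)
print(round(volume))
```

For this monoclinic cell (alpha = gamma = 90 degrees) the general formula reduces to a * b * c * sin(beta).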
90
3D structures Databases PDB
91
Conclusions
92
Database retrieval tools
  • Query tools associated with the databases
  • Sequence Retrieval System (SRS, Europe): allows
    any flat-file db to be indexed to any other;
    allows queries to be formulated across a wide
    range of different db types via a single
    interface, without any worry about data
    structures or query languages
  • Entrez (NCBI): less flexible than SRS but
    exploits the concept of "neighbouring", which
    allows related articles in different db to be
    linked together, whether or not they are
    cross-referenced directly
  • ATLAS: specific for macromolecular sequence db
    (e.g. NRL-3D)
  • ...

93
Where and what ?
  • Swiss Institute of Bioinformatics (Geneva,
    Lausanne, Basel)
  • Swiss-Prot
  • PROSITE
  • TrEMBL
  • ExPASy server
  • EMBnet server: BLAST

http://www.isb-sib.ch
  • European Bioinformatics Institute (UK/D)
  • EMBL
  • InterPro
  • TrEMBL

http://www.ebi.ac.uk/embl/
  • National Center for Biotechnology Information
    (USA)
  • BLAST
  • GenBank
  • LocusLink
  • RefSeq

http://www.ncbi.nlm.nih.gov/
  • The Wellcome Trust Sanger Institute (UK)
  • Ensembl
  • Pfam

http://www.sanger.ac.uk/