DNA Databanks - PowerPoint PPT Presentation

Slides: 86
Provided by: yuch

Transcript and Presenter's Notes

Title: DNA Databanks


1
DNA Databanks
  • Speaker: Yu-Chung Chang
  • Institute of Biochemistry
  • National Yang-Ming University

2
Biological Databases
  • DNA databanks
    • GenBank, DDBJ, EMBL, ...
  • Protein databases
    • PIR, Swiss-Prot, PRF, GenPept, TrEMBL, PDB, ...
  • EST databases
    • dbEST, DOTS, UniGene, GIs, STACK, ...
  • Structure databases
    • MMDB, PDB, Swiss-3DIMAGE, ...
  • Pathway databases
    • KEGG, BRITE, TRANSPATH, ...
  • Integrated databases
    • SRS
  • Motif or cis-element databases
    • Prosite, Pfam, BLOCKS, TransFac, PRINTS, URLs, ...
  • Gene, protein, and disease databases
    • GeneCards, OMIM, OMIA, ...
  • Taxonomy databases
  • Literature databases
    • PubMed, Medline, ...
  • Patent databases
    • Apipa, CA-STN, IPN, USPTO, EPO, Beilstein, ...
  • Others
    • RNA databases, ...

3
DNA Databanks
  • cDNA resources
    • GenBank (NCBI), Nucleotide Sequence Database
      (EMBL), DDBJ, MGC, ...
  • Genomic DNA resources
    • HTG, dbGSS, GOLD, ERGO, ...
  • EST resources
    • dbEST, UniGene, GIs, STACK, DOTS, ...
  • Others
    • dbSTS, UniSTS, dbSNP, TransFac, ISIS, Repbase,
      ...

4
GenBank at National Center for Biotechnology
Information (NCBI)
  • GenBank is the NIH genetic sequence database, an
    annotated collection of all publicly available
    DNA sequences.
  • There are approximately 11,720,000,000 bases in
    10,897,000 sequence records as of February 2001.
  • GenBank is part of the International Nucleotide
    Sequence Database Collaboration, which comprises
    the DNA Data Bank of Japan (DDBJ), the European
    Molecular Biology Laboratory (EMBL), and GenBank
    at NCBI. These three organizations exchange data
    on a daily basis.
  • http://www.ncbi.nlm.nih.gov/Entrez/
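
As a quick check on the February 2001 figures quoted on this slide, the average record length follows from dividing total bases by total records. The sketch below is only that arithmetic; the variable names are illustrative.

```python
# Figures quoted on the slide (GenBank, February 2001).
total_bases = 11_720_000_000
total_records = 10_897_000

# Mean sequence length per record, in base pairs.
average_length = total_bases / total_records
print(f"Average record length: about {average_length:.0f} bp")
```

The result is a little over 1,000 bp per record, which fits a database dominated by short reads such as ESTs alongside longer genomic entries.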

5
NCBI-SITEMAP
6
European Molecular Biology Laboratory (EMBL)
  • The EMBL Nucleotide Sequence Database constitutes
    Europe's primary nucleotide sequence resource.
    Main sources for DNA and RNA sequences are direct
    submissions from individual researchers, genome
    sequencing  projects and patent applications.
  • http://www.ebi.ac.uk/Databases/index.html

7
EBI-databases tools
8
DNA Data Bank of Japan (DDBJ) http://www.ddbj.nig.ac.jp/
  • Database Search
    • Getentry, SFgate WAIS, SRS, Homology Search,
      TXSearch, SQmatch
  • Data Analysis
    • Malign, ClustalW
  • Genome Analysis
    • GTOP
  • Protein Structure
    • PDB Retriever, SSThread, LIBRA I

9
(No Transcript)
10
Genome Projects
  • Whole genome sequences
  • EST projects
  • MGC projects
  • SNP projects
  • GSS projects
  • STS projects

11
Graphs created on 12 Dec 2000
12
Graphs created on 12 Dec 2000
13
GenBank Sequence Submission Policy
  • At this time the following types of submissions
    are NOT acceptable:
    • sequences of less than 50 bp in length.
    • computer generated or otherwise predicted
      sequences (e.g. EST assembled sequences).
    • third party sequences downloaded from a sequence
      database or journal.
    • one genomic sequence with multiple exons joined
      together without the sequence of the intervening
      introns.
    • primer only sequences.

14
GenBank Sequence Submission Policy (cont.)
  • At this time the following types of submissions
    are NOT acceptable:
    • protein only sequences.
    • non-biologically contiguous sequences containing
      internal unsequenced spacers.
    • sequences containing a mix of genomic and mRNA
      sequence represented as a single sequence.
  • EST submissions should be submitted through the
    dbEST system.
  • As of 1 January 2000, Genome Survey Sequences
    (GSSs) should not be submitted through BankIt;
    use the dbGSS system.
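
Some of the policy rules on these two slides can be expressed as a simple pre-submission check. The sketch below is hypothetical and is not part of BankIt or Sequin: the function name and the keyword flags are assumptions, and the flags stand for facts only the submitter knows.

```python
MIN_LENGTH_BP = 50  # the slide rejects sequences shorter than 50 bp


def submission_problems(sequence, *, predicted=False, third_party=False,
                        primer_only=False):
    """Return a list of policy problems for a candidate submission.

    The keyword flags are hypothetical: they describe properties that
    cannot be read from the sequence string itself.
    """
    problems = []
    if len(sequence) < MIN_LENGTH_BP:
        problems.append("shorter than 50 bp")
    if predicted:
        problems.append("computer generated or predicted sequence")
    if third_party:
        problems.append("third party sequence")
    if primer_only:
        problems.append("primer only sequence")
    return problems


print(submission_problems("ACGT" * 10))  # 40 bp, so one problem is reported
print(submission_problems("ACGT" * 20))  # 80 bp, no problems found
```

An empty list only means that none of these particular rules was triggered; the remaining rules, such as joined exons or mixed genomic and mRNA sequence, need human judgment.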

15
Data Submission
  • WWW
  • Bankit
  • WebIn
  • Sakura
  • Sequin
  • e-mail
  • Sequin
  • Diskette
  • Sequin

16
DNA Databases at NCBI
  • Nucleotides
  • dbEST
  • UniGene
  • dbGSS
  • dbSTS
  • UniSTS
  • RefSeq
  • MGC
  • dbSNP
  • HTGs
  • UniVec

17
dbEST http://www.ncbi.nlm.nih.gov/dbEST/index.html
  • dbEST is a database of expressed sequence tags:
    short, single pass read cDNA (mRNA) sequences.
    It also includes cDNA sequences from differential
    display experiments and RACE experiments.

18
dbGSS http://www.ncbi.nlm.nih.gov/dbGSS/index.html
  • Database of genome survey sequences.
  • Short, single pass read genomic sequences.
  • Exon trapped sequences.
  • Cosmid/BAC/YAC ends.
  • Alu PCR sequences.
  • GSS sequences are available from two sources:
    dbGSS and the GSS division of GenBank. The
    sequences and accession numbers in both sources
    are the same but the record formats differ.

19
dbSTS http://www.ncbi.nlm.nih.gov/dbSTS/index.html
  • Database of sequence tagged sites.
  • Short sequences that are operationally unique in
    the genome, used to generate mapping reagents.
  • STS sequences are available from two sources:
    dbSTS and the STS division of GenBank. The
    sequences and accession numbers in both sources
    are the same but the record formats differ.

20
HTGs http://www.ncbi.nlm.nih.gov/HTGS/
  • High throughput genome sequences from large scale
    genome sequencing centers.
  • Unfinished (phase 0, 1, 2) and finished (phase 3)
    sequences.
  • Sequence data in this division are available for
    BLAST homology searches against either the "htgs"
    database or the "month" database, which includes
    all new submissions for the prior month.

21
dbSNP http://www.ncbi.nlm.nih.gov/SNP/
  • Database of single nucleotide polymorphisms.
  • Small-scale insertions/deletions.
  • Polymorphic repetitive elements.
  • Microsatellite variation.

22
New HTC (High Throughput cDNA) division
  • At the May 2000 collaborative meeting,
    DDBJ/EMBL/GenBank agreed to create a new database
    division, HTC, to represent unfinished High
    Throughput cDNA sequences. HTC sequences may
    include 5'UTR and 3'UTR regions and (part of) a
    coding region. Once finished, these sequences
    will be moved to the corresponding taxonomic
    division. HTC sequence entries will include the
    keyword 'HTC'. The keyword will be removed once
    the entry has been included in the taxonomic
    division.

23
Mammalian Gene Collection (MGC)
http://www.ncbi.nlm.nih.gov/MGC/
  • The Mammalian Gene Collection (MGC) project is a
    new effort by the NIH to generate full-length
    complementary DNA (cDNA) resources.

24
Entrez -A search retrieval system

25
Entrez Searching
  • Subject searching
  • Phrase searching
  • Searching for authors
  • Searching for unique identifiers
  • Searching by molecular weight
  • Range searching
  • Truncating searching (Wildcard searching)
  • Combining sets

26
Entrez -Subject searching
  • Text searching
  • hiv-1
  • Subject terms are automatically combined
  • hiv-1 protease is searched as hiv-1 AND protease
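
The automatic combination of subject terms can be pictured as joining the typed words with AND. The sketch below only mimics the behaviour described on this slide; it is not Entrez's actual query parser, and the function name is made up for illustration.

```python
def combine_subject_terms(query):
    """Join whitespace-separated subject terms with AND."""
    return " AND ".join(query.split())


print(combine_subject_terms("hiv-1 protease"))  # hiv-1 AND protease
```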


27
Entrez -Phrase searching
  • "hiv-1 protease"
  • Using quotes forces Entrez to check a phrase list
    against which the search terms are matched.
  • It is not adjacency searching.
  • If the search phrase is not in the phrase list,
    Entrez treats it as a subject search.

28
Entrez -Searching for authors
  • Chang YC
  • Search only the author field
  • Chang
  • Search all fields
  • Subject searching
  • Do not use punctuation.

29
Entrez -Searching for unique identifiers
  • Accession numbers
    • GenBank/EMBL/DDBJ: U12345, AF123456
    • GenPept: AAA12345
    • SwissProt PIR: P12345
    • RefSeq: NM_123456, NT_123456, NP_123456,
      NC_123456, XM_123456, XP_123456
  • Sequence identification numbers
    • GI numbers: 6995995
    • Version numbers: AF123456.3
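
The identifier shapes listed on this slide can be told apart with a few regular expressions. This is a rough sketch: the patterns are inferred only from the example identifiers shown here, real accession formats are more varied, and a one-letter accession such as U12345 or P12345 cannot be assigned to a database by shape alone.

```python
import re

# Patterns inferred from the slide's examples only; order matters because
# the first matching pattern wins.
PATTERNS = [
    ("RefSeq accession", re.compile(r"^(NM|NT|NP|NC|XM|XP)_\d{6}$")),
    ("version number", re.compile(r"^[A-Z]{1,2}\d{5,6}\.\d+$")),
    ("GenPept accession", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("GenBank/EMBL/DDBJ accession", re.compile(r"^[A-Z]{2}\d{6}$")),
    ("one-letter accession", re.compile(r"^[A-Z]\d{5}$")),
    ("GI number", re.compile(r"^\d+$")),
]


def classify(identifier):
    """Return a label for the first pattern the identifier matches."""
    for label, pattern in PATTERNS:
        if pattern.match(identifier):
            return label
    return "unknown"


for example in ["AF123456", "AAA12345", "NM_123456", "6995995", "AF123456.3"]:
    print(example, "->", classify(example))
```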

30
Entrez -Searching by molecular weight
  • 010600[Molecular Weight]
  • 012345[MOLWT]
  • 010000:050000[MOLWT]
  • 002000:010000[MOLWT] AND human[Organism]
  • field name ? feature table

31
Entrez -Range searching
  • Accession numbers [ACCN], sequence length [SLEN],
    and molecular weight [MOLWT]
  • AF114696:AF114714[ACCN]
  • Not for GI and Version numbers
  • 3000:4000[SLEN]
  • 002002:002100[MOLWT]
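
Range expressions of this kind can be built with a small helper. The sketch assumes the usual Entrez notation of `low:high[FIELD]` and the six-digit, zero-padded molecular weights used in the slide examples; the function name and the `width` parameter are illustrative.

```python
def range_query(low, high, field, width=0):
    """Build a low:high[FIELD] range expression.

    width pads both bounds with leading zeros, as the molecular weight
    examples on the slides do (six digits).
    """
    return f"{str(low).zfill(width)}:{str(high).zfill(width)}[{field}]"


print(range_query(3000, 4000, "SLEN"))              # 3000:4000[SLEN]
print(range_query(2002, 2100, "MOLWT", width=6))    # 002002:002100[MOLWT]
print(range_query("AF114696", "AF114714", "ACCN"))  # AF114696:AF114714[ACCN]
```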

32
Entrez -Truncating searching
  • Wildcard searching
  • Root word plus *
  • bacte*, retroviru*
  • Only retrieve the first 150 variations of
    truncated terms
  • Left-handed truncation is not possible
  • *ology
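
The truncation rules on this slide can be illustrated against a toy term index. The sketch assumes the asterisk is the wildcard character and applies the 150-variation limit stated above; the index contents and function name are invented for the example, and the real Entrez index is not modelled.

```python
MAX_VARIATIONS = 150  # limit stated on the slide


def expand_truncated(term, index):
    """Expand a right-truncated term such as 'bacte*' against a term index."""
    if term.startswith("*"):
        raise ValueError("left-handed truncation is not supported")
    if not term.endswith("*"):
        return [term] if term in index else []
    root = term[:-1]
    matches = sorted(word for word in index if word.startswith(root))
    return matches[:MAX_VARIATIONS]


index = {"bacteria", "bacterial", "bacteriophage", "retrovirus", "biology"}
print(expand_truncated("bacte*", index))
# ['bacteria', 'bacterial', 'bacteriophage']
```

Because only the first 150 variations are used, a very short root may silently miss terms; a longer root keeps the expansion within the limit.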

33
Entrez -Combining sets
  • Use your search History to combine documents
  • #1 AND #4

34
Entrez -Boolean operators
  • AND, OR, NOT
  • bacteria AND virus NOT phage
  • (bacteria AND virus) NOT phage
  • hiv-1 OR bacterial protease
  • hiv-1 OR (bacterial AND protease)
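
The first pair of examples suggests that operators without parentheses are applied from left to right. A minimal sketch of that evaluation order using Python sets follows; the record IDs and the `postings` table are invented for illustration, and parentheses are not handled.

```python
def evaluate_left_to_right(tokens, postings):
    """Evaluate 'term OP term OP term' strictly left to right.

    postings maps each term to the set of record IDs that contain it.
    """
    result = set(postings[tokens[0]])
    for op, term in zip(tokens[1::2], tokens[2::2]):
        hits = postings[term]
        if op == "AND":
            result &= hits
        elif op == "OR":
            result |= hits
        elif op == "NOT":
            result -= hits
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


postings = {
    "bacteria": {1, 2, 3},
    "virus": {2, 3, 4},
    "phage": {3},
}
query = "bacteria AND virus NOT phage".split()
print(evaluate_left_to_right(query, postings))  # {2}
```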

35
Entrez -Boolean operators
36
Entrez -Using limits
37
Entrez -Limit a search to a particular
database field
  • You are only interested in nucleotide sequences
    from the mouse.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Select Limits.
  • In the "Limited To" section, select Organism from
    the Search Field pull-down menu.
  • Type "mouse" without quotes in the query box and
    select Go.

38
Entrez -Limit a search to a particular
database field
  • You are only interested in protein sequences that
    are less than 50 amino acids in length.
  • Select the Protein database from the black menu
    bar or the Search pull-down menu.
  • Select Limits.
  • In the "Limited To" section, select Sequence
    Length from the Search Field pull-down menu.
  • Type "0:50" without quotes in the query box and
    select Go.

39
Entrez -Exclude certain kinds of sequences
  • You are interested in mitochondrial carriers but
    you do not want the EST sequences.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Type "mitochondrial carrier" without quotes in
    the query box.
  • Select Limits.
  • In the "Limited To" section, check the box next
    to "Exclude ESTs" and select Go.

40
Entrez -Limit the search to a particular
molecule type
  • You are only interested in Cryptosporidium
    ribosomal RNA sequences.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Type "cryptosporidium" without quotes in the
    query box.
  • Select Limits.
  • In the "Limited To" section, select the
    "Molecule" pull-down menu, choose rRNA, and
    select Go.

41
Entrez -Limit the search to a particular gene
location
  • You are interested in the genes in the
    chloroplast of flowering plants.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Type "flowering plants" without quotes in the
    query box.
  • Select Limits.
  • In the "Limited To" section, select the "Gene
    Location" pull-down menu, choose chloroplast,
    and select Go.

42
Entrez -Limit the search to records from a
particular sequence database
  • You are interested only in cysteine phosphatase
    protein sequences submitted directly to PIR.
  • Select the Protein database from the black menu
    bar or the Search pull-down menu.
  • Type "cysteine phosphatase" without quotes in the
    query box.
  • Select Limits.
  • In the "Limited To" section, select the "Only
    From" pull-down menu and choose PIR and select Go.

43
Entrez -Limit the search by date
  • You want to see any nucleotide sequences from
    pigs added to the database (or updated) in the
    last 30 days.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Type "pigs" without quotes in the query box.
  • Select Limits.
  • In the "Limited To" section, select Organism
    from the Search Field pull-down menu.
  • Also in the "Limited To" section, select the
    "Modification Date" pull-down menu, choose 30
    days, and select Go.

44
Entrez -Limit the search by date
  • You want to retrieve all mouse or human
    nucleotide sequences added to the database (or
    updated) during 1997.
  • Select the Nucleotide database from the black
    menu bar or the Search pull-down menu.
  • Type "mouse OR human" without quotes in the query
    box.
  • Select Limits.
  • In the "Limited To" section, select Organism
    from the Search Field pull-down menu.
  • Also in the "Limited To" section, select the
    "Modification Date" pull-down menu and choose
    Modification Date. In the date boxes, type the
    dates in the format YYYY/MM/DD. You can tab from
    box to box in the date fields. Select Go.

45
Entrez -Using more than one limit at a time
  • You are interested in the protein translations of
    human GenBank nucleotide sequences added to the
    protein database (or updated) in the last 30
    days. You do not want patent records.
  • Select the Protein database from the black menu
    bar or the Search pull-down menu.
  • Type "human" without quotes in the query box.
  • Select Limits.
  • In the "Limited To" section, select Organism
    from the Search Field pull-down menu.
  • On the same screen, select the exclude patents
    check box, select GenBank from the Only From
    pull-down menu, and finally select 30 days from
    the Modification Date pull-down menu and select
    Go.

46
Entrez -Writing advanced search statements
  • Find all human nucleotide sequences with LTR
    annotations.
  • In the Nucleotide database use the following
    expression:
  • LTR[FKEY] AND human[ORGN]
  • Find drosophila population studies published in
    the Journal of Molecular Evolution.
  • In the PopSet database use the following
    expression:
  • j mol evol[JOUR] AND drosophila[ORGN]

47
Entrez -Writing advanced search statements
  • Find all human protein sequences with lengths
    between 50 and 60 amino acids and that were
    entered into the database during 1999.
  • In the Protein database use the following
    expression:
  • human[ORGN] AND 50[SLEN]:60[SLEN] AND 1999[MDAT]
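
Advanced statements like these are field-tagged terms joined by Boolean operators, so they are easy to assemble programmatically. The sketch below builds such a statement and, as an assumption that goes beyond the slides (which describe only the web interface), wraps it in a URL for NCBI's E-utilities `esearch` endpoint. No request is sent; the helper names are illustrative.

```python
from urllib.parse import urlencode

# E-utilities search endpoint; not mentioned in the slides.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged(term, field):
    """Attach a field tag, e.g. tagged('human', 'ORGN') -> 'human[ORGN]'."""
    return f"{term}[{field}]"


def build_query(*clauses):
    """Join field-tagged clauses with AND."""
    return " AND ".join(clauses)


def esearch_url(database, query):
    """Return the esearch URL for a query without contacting the server."""
    return f"{ESEARCH_URL}?{urlencode({'db': database, 'term': query})}"


query = build_query(tagged("LTR", "FKEY"), tagged("human", "ORGN"))
url = esearch_url("nucleotide", query)
print(query)  # LTR[FKEY] AND human[ORGN]
print(url)
```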

48
(No Transcript)
49
Feature key or descriptor line
Feature qualifiers
50
Feature Key Name (partial list)
  • allele
  • attenuator
  • CAAT_signal
  • CDS
  • enhancer
  • exon
  • gene
  • GC_signal
  • iDNA
  • intron
  • J_region
  • LTR
  • misc_binding
  • misc_feature
  • mRNA
  • polyA_signal
  • polyA_site
  • STS
  • 3'UTR
  • 5'clip

ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt
51
Feature Qualifiers (partial list)
  • /anticodon
  • /bound_moiety
  • /citation
  • /codon
  • /codon_start
  • /cons_splice
  • /db_xref
  • /direction
  • /EC_number
  • /evidence
  • /function
  • /gene
  • /map
  • /note
  • /organism
  • /phenotype
  • /rpt_family
  • /translation
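
Feature keys and qualifiers appear together in the feature table of a flat-file record: a key line gives the feature and its location, and the lines that follow, each starting with a slash, carry its qualifiers. The sketch below parses a small, made-up fragment in that layout. It is a simplification that ignores qualifier values wrapped over several lines, and the sample gene name is invented.

```python
def parse_features(lines):
    """Parse feature-table lines into a list of feature dictionaries."""
    features = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.startswith("/"):
            # Qualifier line: attach it to the most recent feature key.
            name, _, value = text[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Feature key line: key followed by its location.
            key, location = text.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


# Illustrative fragment, not taken from a real record.
sample = [
    "     CDS             1..206",
    '                     /gene="example"',
    "                     /codon_start=1",
    "     polyA_signal    250..255",
]
for feature in parse_features(sample):
    print(feature["key"], feature["location"], feature["qualifiers"])
```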

52
(No Transcript)
53
GenBank EST format
54
GenBank GSS format
55
GenBank GSS format
56
GOLD: Genomes OnLine Database
http://wit.integratedgenomics.com/GOLD/
  • The Genomes OnLine Database is a World Wide Web
    resource for comprehensive access to information
    regarding complete and ongoing genome projects
    around the world.

57
SRS: Sequence Retrieval System
http://srs.ebi.ac.uk/
58
SRS: Sequence Retrieval System
http://srs.ebi.ac.uk/
59
SRS: Sequence Retrieval System
http://srs.ebi.ac.uk/
60
Deambulum
http://www.infobiogen.fr/services/deambulum/english/
61
Deambulum
http://www.infobiogen.fr/services/deambulum/english/
62
Deambulum READSEQ
http://www.infobiogen.fr/services/deambulum/english/
63
NCGR: National Center for Genome Resources
GSDB: Genome Sequence Database
http://www.ncgr.org/research/sequence/data_retrieval.html
64
NCGR: National Center for Genome Resources
GSDB: Genome Sequence Database
http://www.ncgr.org/research/sequence/data_retrieval.html
65
NCGR: National Center for Genome Resources
GSDB: Genome Sequence Database
http://www.ncgr.org/research/sequence/data_retrieval.html
66
BIOBASE
http://www.gene-regulation.com/pub/databases.html
67
Repbase
http://charon.girinst.org/index.html
68
Minisatellite Database
http://minisatellites.u-psud.fr/
69
STRBase
http://www.cstl.nist.gov/div831/strbase/index.htm
70
ASDB
http://devnull.lbl.gov:8888/alt/
71
ASDB
http://devnull.lbl.gov:8888/alt/
72
ASDB
http://devnull.lbl.gov:8888/alt/
73
ISIS
http://www.introns.com/
74
ExInt: An Exon-Intron Database of Eukaryotic Organism
http://intron.bic.nus.edu.sg/exint/exint.html
75
ExInt: An Exon-Intron Database of Eukaryotic Organism
http://intron.bic.nus.edu.sg/exint/exint.html
76
ExInt: An Exon-Intron Database of Eukaryotic Organism
http://intron.bic.nus.edu.sg/exint/exint.html
77
MethDB
http://www.methdb.de/
  • The purpose of this database is to provide the
    scientific community with a resource to:
  • store DNA methylation data
  • search for methylation patterns and profiles
  • correlate methylation and expression data of
    genes

78
Small RNA Database
http://mbcr.bcm.tmc.edu/smallRNA/smallrna.html
79
UTRs
http://igs-server.cnrs-mrs.fr/gauthere/UTR/index.html
80
UTRdb UTRsite
http://bigarea.area.ba.cnr.it:8000/EmbIT/UTRHome/
81
TARD
http://wwwicg.bionet.nsc.ru/SRCG/Translation/
  • Gene expression is often regulated at the level
    of mRNA translation. The structural
    characteristics of mRNA correlate with
    translation efficiency and specificity.
    Determination of "active elements" could be very
    useful for prediction of the gene expression
    pattern under both normal and stress conditions
    because not all mRNAs can be translated when
    stressed. Prediction of the gene expression
    pattern might be useful for biotechnology and
    cDNA analysis.

82
RAD: RNA Abundance Database
http://www.cbil.upenn.edu/RAD2/
  • RAD (RNA Abundance Database) is a public gene
    expression database designed to hold data from
    array-based (microarrays, high-density oligo
    arrays, macroarrays) and nonarray-based (SAGE)
    experiments.
  • The ultimate goal is to allow comparative
    analysis of experiments performed by different
    laboratories using different platforms and
    investigating different biological systems.

83
RAD: RNA Abundance Database
http://www.cbil.upenn.edu/RAD2/
84
Cook your food by yourself
  • Farms
  • Markets
  • Restaurants
  • Cooking skills
  • Sequencing centers
  • Nucleotide databases
  • Value-added databases
  • Bioinformatics

85
Exercise
  1. Please try to write a search statement for
    finding all mouse nucleotide sequences with CDS
    annotations.