Bioinformatics Data and Databases Stuart M. Brown, Ph.D - PowerPoint PPT Presentation

Slides: 54
Provided by: medNyuEd
Transcript and Presenter's Notes

1
Bioinformatics Data and Databases
  • Stuart M. Brown, Ph.D.
  • Director NYU Bioinformatics Core

2
Biologists Collect Lots of Data
  • Hundreds of thousands of species
  • Millions of articles in scientific journals
  • Genetic information
  • gene names
  • phenotype of mutants
  • location of genes/mutations on chromosomes
  • linkage (distances between genes)

3
  • High Throughput lab technology
  • PCR
  • Rapid inexpensive DNA sequencing
  • Many methods of collecting genotype data
  • Assays for specific polymorphisms
  • Genome-wide SNP chips
  • Must have data quality assessment prior to
    analysis

4
What is a Database?
  • Organized data
  • Information is stored in "records" and "fields"
  • Fields are categories
  • Must contain data of the same type
  • Records contain data that is related to one object

5
A Spreadsheet can be a Database
  • Columns are Fields
  • Rows are Records
  • Can search for a term within just one field
  • Or combine searches across several fields
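
A minimal sketch of the idea above, assuming a tiny in-memory table. Each dictionary is one record (row) and each key is one field (column). The gene names and values are placeholders invented for illustration, and `search` is a hypothetical helper, not part of any library.

```python
# Each record is one row; each key is a field (column).
# The entries below are illustrative placeholders, not real annotations.
records = [
    {"gene": "geneA", "organism": "Homo sapiens", "chromosome": "11"},
    {"gene": "geneB", "organism": "Mus musculus", "chromosome": "7"},
    {"gene": "geneC", "organism": "Homo sapiens", "chromosome": "7"},
]


def search(rows, **criteria):
    """Return rows whose fields equal every given value (a Boolean AND)."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Search within just one field.
one_field = search(records, organism="Homo sapiens")
# Combine searches across two fields.
two_fields = search(records, organism="Homo sapiens", chromosome="7")

print([r["gene"] for r in one_field])   # ['geneA', 'geneC']
print([r["gene"] for r in two_fields])  # ['geneC']
```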

6
(No Transcript)
7
Data Formats
  • How to organize various types of genetic data?
  • Need standard formats
  • DNA sequence: G, A, T, C, but what about gaps, unknown
    letters, etc.?
  • How many letters per line?
  • Spaces, numbers, headers, etc.?
  • Store as a string, code as binary numbers, etc.?
  • Use a completely different format for proteins?
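
To make the "string versus binary numbers" question concrete, here is a small sketch of one possible binary coding, 2 bits per base. The particular bit assignment is an arbitrary assumption, and `pack` and `unpack` are hypothetical helper names.

```python
# Map each unambiguous base to a 2-bit code (this assignment is arbitrary).
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # index in this string equals the 2-bit code


def pack(seq):
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODES[base]  # KeyError for gaps, N, etc.
    return value


def unpack(value, length):
    """Recover the DNA string; the length has to be stored separately."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


packed = pack("GATC")
print(bin(packed))        # 0b10001101
print(unpack(packed, 4))  # GATC
```

The compact form has no room for gaps or unknown letters, which is exactly the trade-off the slide raises; a plain text string handles them without extra machinery.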

8
FASTA Format
  • In the process of writing a similarity searching
    program (in 1985), William Pearson designed a
    simple text format for DNA and protein sequences
  • The FASTA format is now universal for all
    databases and software that handles DNA and
    protein sequences

One header line, starting with ">" and ending with a return.
All other characters are part of the sequence.
Most software ignores spaces and carriage returns. Some also ignores numbers.
>URO1 uro1.seq Length: 2018 November 9, 2000 11:50 Type: N Check: 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
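
The rules on this slide can be sketched as a short reader for a single record. This is one possible reading of them, not Pearson's code: it takes the first line (minus the leading ">") as the header and drops whitespace and digits from everything after it. The function name and the demo record are hypothetical.

```python
def read_fasta_record(text):
    """Split one FASTA record into (header, sequence).

    The header is the first line without its leading '>'.
    Whitespace and digits in the remaining lines are discarded.
    """
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    header = lines[0][1:].strip()
    sequence = "".join(
        ch
        for line in lines[1:]
        for ch in line
        if not ch.isspace() and not ch.isdigit()
    )
    return header, sequence


# A made-up record with position numbers and internal spaces.
example = """>demo_seq hypothetical example
1 CGCAGAAAGA GGAGGCGC
19 TTGCCTTCAG
"""
header, seq = read_fasta_record(example)
print(header)  # demo_seq hypothetical example
print(seq)     # CGCAGAAAGAGGAGGCGCTTGCCTTCAG
```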
9
Multi-Sequence FASTA file
>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
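
A file like this one can be walked record by record. The sketch below assumes each header has the shape `ID key=value; key=value;`, as in FlyBase-style protein files; that shape is an assumption about the input, and `parse_fasta`, `header_fields` and the two short demo records are hypothetical.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from multi-record FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def header_fields(header):
    """Split 'ID key=value; key=value;' into (ID, {key: value})."""
    ident, _, rest = header.partition(" ")
    fields = {}
    for part in rest.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key] = value
    return ident, fields


# Two invented records, kept short so the length check is easy to follow.
demo = """>seqA type=protein; species=Dmel; length=6;
MRCLMP
>seqB type=protein; species=Dmel; length=4;
MRHV
"""
for header, seq in parse_fasta(demo):
    ident, fields = header_fields(header)
    # Compare the declared length with the number of residues actually read.
    print(ident, fields["species"], int(fields["length"]) == len(seq))
```

Checking the declared `length` against the residues read is a cheap way to notice a truncated record.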

10
Other Standards?
  • Other types of important medical and genetic data
    may not have universal standards
  • Genotype/haplotype
  • Clinical records
  • Gene expression
  • Protein structure
  • Alignments
  • Phylogenetic trees

11
SNPStats
HapStat
12
Reformatting Data Files
  • Much of the routine (yet annoying) work of
    bioinformatics involves messing around with data
    files to get them into formats that will work
    with various software
  • Then messing around with the results produced by
    that software to create a useful summary
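
A typical small reformatting job, sketched under the assumption that the input is a two-column, tab-delimited table of name and sequence and that the downstream program wants wrapped FASTA. The function name, the sample data and the default line width are all illustrative choices.

```python
def table_to_fasta(tsv_text, width=60):
    """Convert 'name<TAB>sequence' lines into wrapped FASTA text."""
    out = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, sequence = line.split("\t", 1)
        sequence = sequence.strip()
        out.append(">" + name.strip())
        # Break the sequence into fixed-width lines.
        out.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(out) + "\n"


tsv = "seq1\tGATTACAGATTACA\nseq2\tCCGGTTAA\n"
print(table_to_fasta(tsv, width=10), end="")
# >seq1
# GATTACAGAT
# TACA
# >seq2
# CCGGTTAA
```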

13
Public Databases
  • In addition to your own experimental data, access
    to public data is essential for epidemiology
  • Complete genome sequences (human and
    pathogens/vectors)
  • SNPs
  • Genotypes
  • Population Sets
  • Supplemental data for specific Journal articles

14
GenBank is a Database
  • Contains all DNA and protein sequences described
    in the scientific literature or collected in
    publicly funded research
  • Flatfile: composed entirely of text
  • you could print the whole thing out
  • Each submitted sequence is a record
  • Has fields for Organism, Date, Author, etc.
  • Unique identifier for each sequence
  • Locus and Accession
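
Because a flatfile record is plain text, its fields can be pulled out with ordinary string handling. The sketch below assumes the GenBank layout in which a keyword occupies the first 12 columns and its value starts at column 13, with the sequence following an ORIGIN line and the record ending at "//". The record shown is invented and heavily simplified (DEMO0001 is not a real accession), and `parse_flatfile` is a hypothetical helper.

```python
def parse_flatfile(text):
    """Return ({keyword: value}, sequence) for one simplified flatfile record."""
    fields, sequence = {}, []
    last_key, in_sequence = None, False
    for line in text.splitlines():
        if line.startswith("//"):  # end-of-record marker
            break
        if in_sequence:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        key, value = line[:12].strip(), line[12:].strip()
        if key == "ORIGIN":
            in_sequence = True
        elif key:
            fields[key] = value
            last_key = key
        elif last_key:
            # A line with a blank keyword column continues the previous field.
            fields[last_key] += " " + value
    return fields, "".join(sequence)


# Invented record for illustration only.
record = """LOCUS       DEMO0001  28 bp  DNA
DEFINITION  Hypothetical demonstration record
            spanning two lines.
ACCESSION   DEMO0001
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 cgcagaaaga ggaggcgctt gccttcag
//
"""
fields, seq = parse_flatfile(record)
print(fields["ACCESSION"])  # DEMO0001
print(fields["ORGANISM"])   # synthetic construct
print(len(seq))             # 28
```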

15
Fields
16
Accession Numbers!!
  • Databases are designed to be searched by
    accession numbers (and locus IDs)
  • These are guaranteed to be non-redundant,
    accurate, and not to change.
  • Searching by gene names and keywords is doomed to
    frustration and probable failure
  • Neither scientists nor computers can be trusted
    to accurately and consistently annotate database
    entries
  • If only scientists would refer to genes by
    accession numbers in all published work!
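
When accession numbers are pulled out of papers or spreadsheets, a light shape check helps catch typos before a search. The pattern below is an assumed simplification (letters, an optional underscore, digits, and an optional ".version" suffix), not an official grammar, and it cannot tell whether a record actually exists. The example identifier, AAF48569.1, is one that appears in the FASTA headers earlier in this talk.

```python
import re

# Assumed shape: letters, optional underscore, digits, optional ".version".
ACCESSION = re.compile(r"^([A-Z]+_?\d+)(?:\.(\d+))?$")


def split_accession(text):
    """Return (accession, version); version is None when no suffix is given."""
    match = ACCESSION.match(text.strip())
    if match is None:
        raise ValueError(f"not an accession-like identifier: {text!r}")
    accession, version = match.groups()
    return accession, (int(version) if version else None)


print(split_accession("AAF48569.1"))  # ('AAF48569', 1)
print(split_accession("AAF48569"))    # ('AAF48569', None)
```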

17
http://www.ncbi.nlm.nih.gov/Genbank
  • GenBank is managed by the National Center for
    Biotechnology Information (NCBI), part of the
    U.S. National Library of Medicine at the NIH
  • Once upon a time, GenBank mailed out sequences on
    CD-ROM disks a few times per year.
  • Now GenBank is over 100 billion bases
  • Scientists access GenBank directly over the Web
    at www.ncbi.nlm.nih.gov

18
What is GenBank? GenBank is the NIH genetic
sequence database, an annotated collection of all
publicly available DNA sequences (Nucleic Acids
Research 2007 Jan; 35(Database issue): D21-5).
There are approximately 65,369,091,950 bases in
61,132,599 sequence records in the traditional
GenBank divisions and 80,369,977,826 bases in
17,960,667 sequence records in the WGS division
as of August 2006.
19
Relational Databases
  • Databases can be more complex than a single
    spreadsheet
  • GenBank has proteins and SNPs as well as DNA
  • Some fields (e.g. phosphorylation sites) apply to
    protein, but not DNA
  • Better to create a separate spreadsheet format
    for Protein records
  • Each different spreadsheet is called a Table
  • Different Tables are linked by key fields
  • (e.g. DNA and protein for same gene)
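
The two-table idea can be sketched with Python's built-in sqlite3 module. The schema below is an assumption made for illustration, not GenBank's real design: a `dna` table, a `protein` table with its own protein-only field, and `gene_id` as the key field linking them. All identifiers and values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE dna (
    gene_id   TEXT PRIMARY KEY,
    organism  TEXT,
    sequence  TEXT
);
CREATE TABLE protein (
    protein_id     TEXT PRIMARY KEY,
    gene_id        TEXT REFERENCES dna(gene_id),  -- key field linking the tables
    phospho_sites  TEXT,                          -- applies to protein only
    sequence       TEXT
);
""")
conn.execute("INSERT INTO dna VALUES (?, ?, ?)",
             ("gene1", "Homo sapiens", "ATGGCC"))
conn.execute("INSERT INTO protein VALUES (?, ?, ?, ?)",
             ("prot1", "gene1", "S2", "MA"))

# Follow the key field from the DNA record to its protein record.
rows = conn.execute("""
    SELECT dna.gene_id, dna.organism, protein.protein_id, protein.sequence
    FROM dna JOIN protein ON protein.gene_id = dna.gene_id
    WHERE dna.organism = ?
""", ("Homo sapiens",)).fetchall()
print(rows)  # [('gene1', 'Homo sapiens', 'prot1', 'MA')]
```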

20
Many Tables at NCBI
  • The NCBI hosts a huge interconnected database
    system that, in addition to DNA and protein,
    includes
  • Journal Articles (PubMed)
  • Genetic Diseases (OMIM)
  • Polymorphisms (dbSNP)
  • Cytogenetics (CGH/SKY/FISH CGAP)
  • Gene Expression (GEO)
  • Taxonomy
  • Chemistry (PubChem)

21
Database Design
  • A database can only be searched in ways that it
    was designed to be searched
  • You can search within a specific Field in a
    specific Table - and sometimes can combine
    searches from different Fields and/or Tables
  • (Boolean "AND" and "OR" searches)
  • Bad to search for "human hemoglobin" in a
    'Description' field
  • Much better to search for "Homo sapiens" in
    'Organism' AND "HBB" in 'gene name'

22
Web Query
  • Most scientific databases have a web-based query
    tool
  • It may be simple

23
or complex
24
ENTREZ is the GenBank web query tool
25
Advanced query interface
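
The field-restricted Boolean search recommended under Database Design can also be written as text in Entrez's bracketed field-tag style. The sketch below only builds the query string and a URL for NCBI's E-utilities `esearch` endpoint; it does not contact the server. The helper names are hypothetical, and the choice of the `nuccore` database and of the `[Organism]` and `[Gene Name]` tags is an assumption about which fields the target database offers.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(field_terms):
    """Combine {field: term} pairs into one Boolean AND query string."""
    return " AND ".join(
        f'"{term}"[{field}]' for field, term in field_terms.items()
    )


def build_url(db, field_terms):
    """Return a search URL for the given database; no request is sent."""
    return ESEARCH + "?" + urlencode({"db": db, "term": build_term(field_terms)})


criteria = {"Organism": "Homo sapiens", "Gene Name": "HBB"}
print(build_term(criteria))  # "Homo sapiens"[Organism] AND "HBB"[Gene Name]
print(build_url("nuccore", criteria))
```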
26
ENTREZ has pre-computed links between Tables
  • Relationships between sequences are computed with
    BLAST
  • Relationships between articles are computed with
    "MeSH" terms (shared keywords)
  • Relationships between DNA and protein sequences
    rely on accession numbers
  • Relationships between sequences and PubMed
    articles rely on both shared keywords and the
    mention of accession numbers in the articles.

27
NCBI Databases contain more than just DNA and
protein sequences
28
(No Transcript)
29
(No Transcript)
30
Other Important Databases
  • Genomes
  • Proteins
  • Biochemical Regulatory Pathways
  • Gene Expression
  • Genetic Variation (mutants, SNPs)
  • Protein-Protein Interactions
  • Gene Ontology (Biological Function)

31
UCSC Genome Browser
Search by gene name
or by sequence
32
(No Transcript)
33
Lots of additional data can be added as optional
"tracks" - anything that can be mapped to
locations on the genome
34
Ensembl at EBI/EMBL
35
(No Transcript)
36
(No Transcript)
37
SNPs (Single Nucleotide Polymorphisms)
  • Genetic variation
  • Can be alleles of genes
  • also differences in non-coding regions collected
    from genome sequencing of different individuals
  • dbSNP at the NCBI - all public SNP data
  • SNP Consortium at CSHL - high quality set

38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
KEGG: Kyoto Encyclopedia of Genes and Genomes
  • Enzymatic and regulatory pathways
  • Mapped out by EC number and cross-referenced to
    genes in all known organisms
  • (wherever sequence information exists)
  • Parallel maps of regulatory pathways

42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
NCI BioCarta
46
(No Transcript)
47
Protein-Protein Interactions
  • Metabolic and regulatory pathways
  • Transcription factors
  • Co-expression
  • Biochemical data
  • crosslinking
  • yeast 2-hybrid
  • affinity tagging
  • Useful feedback to genome annotation/protein
    function and gene expression

48
(No Transcript)
49
BIND - The Biomolecular Interaction Network
Database
50
Gene Ontology
  • Genetics is a messy science
  • Scientists have been working in isolation on
    individual species for many years - naming genes,
    mutants, odd phenotypes
  • sonic hedgehog
  • Now that we have complete genome sequences, how
    to reconcile the names across all species?
  • Gene Ontology uses a single 3 part system
  • Molecular function (specific tasks)
  • Biological process (broad biological goals - e.g.
    cell division)
  • Cellular component (location)
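
The three-part system can be pictured as a small data structure: every annotation ties a gene to a term in exactly one of the three categories. The sketch below uses invented gene names and plain-text terms as placeholders (no real annotations or term identifiers), and both helper functions are hypothetical.

```python
from collections import defaultdict

# The three categories of the ontology.
ASPECTS = ("molecular_function", "biological_process", "cellular_component")

# (gene, aspect, term) triples; gene names and terms are placeholders.
annotations = [
    ("geneA", "molecular_function", "protein binding"),
    ("geneA", "biological_process", "cell division"),
    ("geneA", "cellular_component", "nucleus"),
    ("geneB", "biological_process", "cell division"),
]


def group_by_aspect(rows, gene):
    """Collect one gene's terms under each of the three categories."""
    grouped = defaultdict(list)
    for name, aspect, term in rows:
        if aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {aspect}")
        if name == gene:
            grouped[aspect].append(term)
    return dict(grouped)


def genes_sharing(rows, aspect, term):
    """Find genes annotated with the same term, whatever they are named."""
    return sorted({name for name, a, t in rows if a == aspect and t == term})


print(group_by_aspect(annotations, "geneA"))
print(genes_sharing(annotations, "biological_process", "cell division"))
# ['geneA', 'geneB']
```

A shared vocabulary lets differently named genes from different species be found through the same term, which is the reconciliation problem the slide describes.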

51
(No Transcript)
52
Database Search Strategies
  • General search principles - not limited to
    sequence (or to biology)
  • Use accession numbers whenever possible
  • Start with broad keywords and narrow the search
    using more specific terms
  • Try variants of spelling, numbers, etc.
  • Search all relevant databases
  • Be persistent!!

53
Bioinformatics Paradigm
  • Find the data
  • Download the data
  • Reformat the data
  • Collect the samples
  • Run molecular analysis
  • Filter the data
  • Run analysis software
  • Collect and sort results
  • Publish / Data sharing