Bioinformatics Data and Databases Stuart M. Brown, Ph.D


1
Bioinformatics Data and Databases

Stuart M. Brown, Ph.D.
Director NYU Bioinformatics Core

2
Biologists Collect Lots of Data

Hundreds of thousands of species
Millions of articles in scientific journals
Genetic information
gene names
phenotype of mutants
location of genes/mutations on chromosomes
linkage (distances between genes)

High Throughput lab technology
PCR
Rapid inexpensive DNA sequencing
Many methods of collecting genotype data
Assays for specific polymorphisms
Genome-wide SNP chips
Must have data quality assessment prior to
analysis

4
What is a Database?

Organized data
Information is stored in "records" and "fields"
Fields are categories
Must contain data of the same type
Records contain data that is related to one object

5
A Spreadsheet can be a Database

Columns are Fields
Rows are Records
Can search for a term within just one field
Or combine searches across several fields
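
As a minimal sketch of the record/field idea, the rows below are a small hand-typed table used only for illustration, and `search` is a hypothetical helper rather than part of any database product. Each dictionary is one record, each key is one field, and passing several fields combines them as an AND search.

```python
# Each record is one row of the "spreadsheet"; each key is one field (column).
records = [
    {"gene": "HBB", "organism": "Homo sapiens", "chromosome": "11"},
    {"gene": "HBA1", "organism": "Homo sapiens", "chromosome": "16"},
    {"gene": "Hbb-b1", "organism": "Mus musculus", "chromosome": "7"},
]


def search(rows, **criteria):
    """Return the rows whose fields equal every given value (a Boolean AND)."""
    return [
        row
        for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Search within just one field.
human = search(records, organism="Homo sapiens")

# Combine searches across two fields.
human_chr11 = search(records, organism="Homo sapiens", chromosome="11")

print([row["gene"] for row in human])
print([row["gene"] for row in human_chr11])
```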

7
Data Formats

How to organize various types of genetic data?
Need standard formats
DNA sequence: G, A, T, C, but what about gaps, unknown
letters, etc.?
How many letters per line?
Spaces, numbers, headers, etc.?
Store as a string, code as binary numbers, etc.?
Use a completely different format for proteins?
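
To make the "string versus binary numbers" question concrete, here is a rough sketch of a 2-bit packing scheme. The code table and the function names `pack` and `unpack` are assumptions chosen for illustration, not a standard. It shows the trade-off: four bases fit in one byte, but gaps and unknown letters such as N have no code, which is one reason plain text formats are so common.

```python
# Hypothetical 2-bit code table: only the four unambiguous bases fit.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an A/C/G/T string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        # Raises KeyError for N, gaps, or any other letter.
        value = (value << 2) | CODES[base]
    return value


def unpack(value, length):
    """Recover the original string; the length must be stored separately."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))


packed = pack("GATC")
print(packed, unpack(packed, 4))
```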

8
FASTA Format

In the process of writing a similarity searching
program (in 1985), William Pearson designed a
simple text format for DNA and protein sequences
The FASTA format is now universal for all
databases and software that handles DNA and
protein sequences

One header line, starts with > and ends with a return
All other characters are part of the sequence
Most software ignores spaces and carriage returns
Some also ignores numbers
>URO1 uro1.seq Length 2018 November 9, 2000 1150 Type N Check 3854 ..
CGCAGAAAGAGGAGGCGCTTGCCTTCAGCTTGTGGGAAATCCCGAAGATGGCCAAAGACA
ACTCAACTGTTCGTTGCTTCCAGGGCCTGCTGATTTTTGGAAATGTGATTATTGGTTGTT
GCGGCATTGCCCTGACTGCGGAGTGCATCTTCTTTGTATCTGACCAACACAGCCTCTACC
CACTGCTTGAAGCCACCGACAACGATGACATCTATGGGGCTGCCTGGATCGGCATATTTG
TGGGCATCTGCCTCTTCTGCCTGTCTGTTCTAGGCATTGTAGGCATCATGAAGTCCAGCA
GGAAAATTCTTCTGGCGTATTTCATTCTGATGTTTATAGTATATGCCTTTGAAGTGGCAT
CTTGTATCACAGCAGCAACACAACAAGACTTTTTCACACCCAACCTCTTCCTGAAGCAGA
TGCTAGAGAGGTACCAAAACAACAGCCCTCCAAACAATGATGACCAGTGGAAAAACAATG
9
Multi-Sequence FASTA file

>FBpp0074027 type=protein; loc=X:complement(16159413..16159860,16160061..16160497); ID=FBpp0074027; name=CG12507-PA; parent=FBgn0030729,FBtr0074248; dbxref=FlyBase:FBpp0074027,FlyBase_Annotation_IDs:CG12507-PA,GB_protein:AAF48569.1,GB_protein:AAF48569; MD5=123b97d79d04a06c66e12fa665e6d801; release=r5.1; species=Dmel; length=294;
MRCLMPLLLANCIAANPSFEDPDRSLDMEAKDSSVVDTMGMGMGVLDPTQ
PKQMNYQKPPLGYKDYDYYLGSRRMADPYGADNDLSASSAIKIHGEGNLA
SLNRPVSGVAHKPLPWYGDYSGKLLASAPPMYPSRSYDPYIRRYDRYDEQ
YHRNYPQYFEDMYMHRQRFDPYDSYSPRIPQYPEPYVMYPDRYPDAPPLR
DYPKLRRGYIGEPMAPIDSYSSSKYVSSKQSDLSFPVRNERIVYYAHLPE
IVRTPYDSGSPEDRNSAPYKLNKKKIKNIQRPLANNSTTYKMTL
>FBpp0082232 type=protein; loc=3R:complement(9207109..9207225,9207285..9207431); ID=FBpp0082232; name=mRpS21-PA; parent=FBgn0044511,FBtr0082764; dbxref=FlyBase:FBpp0082232,FlyBase_Annotation_IDs:CG32854-PA,GB_protein:AAN13563.1,GB_protein:AAN13563; MD5=dcf91821f75ffab320491d124a0d816c; release=r5.1; species=Dmel; length=87;
MRHVQFLARTVLVQNNNVEEACRLLNRVLGKEELLDQFRRTRFYEKPYQV
RRRINFEKCKAIYNEDMNRKIQFVLRKNRAEPFPGCS
>FBpp0091159 type=protein; loc=2R:complement(2511337..2511531,2511594..2511767,2511824..2511979,2512032..2512082); ID=FBpp0091159; name=CG33919-PA; parent=FBgn0053919,FBtr0091923; dbxref=FlyBase:FBpp0091159,FlyBase_Annotation_IDs:CG33919-PA,GB_protein:AAZ52801.1,GB_protein:AAZ52801; MD5=c91d880b654cd612d7292676f95038c5; release=r5.1; species=Dmel; length=191;
MKLVLVVLLGCCFIGQLTNTQLVYKLKKIECLVNRTRVSNVSCHVKAINW
NLAVVNMDCFMIVPLHNPIIRMQVFTKDYSNQYKPFLVDVKIRICEVIER
RNFIPYGVIMWKLFKRYTNVNHSCPFSGHLIARDGFLDTSLLPPFPQGFY
QVSLVVTDTNSTSTDYVGTMKFFLQAMEHIKSKKTHNLVHN
>FBpp0070770 type=protein; loc=X:join(5584802..5585021,5585925..5586137,5586198..5586342,5586410..5586605); ID=FBpp0070770; name=cv-PA; parent=FBgn0000394,FBtr0070804; dbxref=FlyBase:FBpp0070770,FlyBase_Annotation_IDs:CG12410-PA,GB_protein:AAF46063.1,GB_protein:AAF46063; MD5=0626ee34a518f248bbdda11a211f9b14; release=r5.1; species=Dmel; length=257;
MEIWRSLTVGTIVLLAIVCFYGTVESCNEVVCASIVSKCMLTQSCKCELK
NCSCCKECLKCLGKNYEECCSCVELCPKPNDTRNSLSKKSHVEDFDGVPE
LFNAVATPDEGDSFGYNWNVFTFQVDFDKYLKGPKLEKDGHYFLRTNDKN
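
A multi-sequence FASTA file can be read with very little code. The sketch below is one possible reader, not the parser used by any particular tool. The function name `parse_fasta` and the two demo records are made up for illustration. It follows the rules described above: a line starting with ">" begins a new record, and spaces, line breaks, and digits inside the sequence are dropped.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Keep letters plus '*' and '-'; drop spaces and digits.
            chunks.append("".join(c for c in line if c.isalpha() or c in "*-"))
    if header is not None:
        yield header, "".join(chunks)


# Made-up demo input: two records, one with spaces and a position number.
example = """>seq1 demo DNA record
GATC GATC
  1 GGCC
>seq2 demo protein record
MRHVQFLAR
"""

fasta_records = list(parse_fasta(example))
for name, seq in fasta_records:
    print(name, len(seq))
```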

10
Other Standards?

Other types of important medical and genetic data
may not have universal standards
Genotype/haplotype
Clinical records
Gene expression
Protein structure
Alignments
Phylogenetic trees

11
SNPStats
HapStat
12
Reformatting Data Files

Much of the routine (yet annoying) work of
bioinformatics involves messing around with data
files to get them into formats that will work
with various software
Then messing around with the results produced by
that software to create a useful summary
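
A typical small reformatting job looks like the sketch below: re-wrapping a sequence to a fixed line width for one program, or flattening records into a tab-delimited table for a spreadsheet. The function names and the 60-letter default width are assumptions for illustration only.

```python
def wrap_sequence(seq, width=60):
    """Split a sequence string into lines of at most `width` letters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def to_fasta(name, seq, width=60):
    """Format one record as FASTA text with a fixed line width."""
    return "\n".join([">" + name] + wrap_sequence(seq, width)) + "\n"


def to_tab(records):
    """Format (name, sequence) pairs as tab-delimited name, length, sequence."""
    return "".join(f"{name}\t{len(seq)}\t{seq}\n" for name, seq in records)


print(to_fasta("demo", "GATCGATCGA", width=4), end="")
print(to_tab([("demo", "GATCGATCGA")]), end="")
```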

13
Public Databases

In addition to your own experimental data, access
to public data is essential for epidemiology
Complete genome sequences (human and
pathogens/vectors)
SNPs
Genotypes
Population Sets
Supplemental data for specific Journal articles

14
GenBank is a Database

Contains all DNA and protein sequences described
in the scientific literature or collected in
publicly funded research
Flatfile: composed entirely of text
you could print the whole thing out
Each submitted sequence is a record
Has fields for Organism, Date, Author, etc.
Unique identifier for each sequence
Locus and Accession

15
Fields
16
Accession Numbers!!

Databases are designed to be searched by
accession numbers (and locus IDs)
These are guaranteed to be non-redundant,
accurate, and not to change.
Searching by gene names and keywords is doomed to
frustration and probable failure
Neither scientists nor computers can be trusted
to accurately and consistently annotate database
entries
If only scientists would refer to genes by
accession numbers in all published work!

17
http://www.ncbi.nlm.nih.gov/Genbank

GenBank is managed by the National Center for
Biotechnology Information (NCBI), part of the
U.S. National Library of Medicine at the NIH
Once upon a time, GenBank mailed out sequences on
CD-ROM disks a few times per year.
Now GenBank is over 100 billion bases
Scientists access GenBank directly over the Web
at www.ncbi.nlm.nih.gov

18
What is GenBank? GenBank is the NIH genetic
sequence database, an annotated collection of all
publicly available DNA sequences (Nucleic Acids
Research 2007 Jan; 35(Database issue): D21-5).
There are approximately 65,369,091,950 bases in
61,132,599 sequence records in the traditional
GenBank divisions and 80,369,977,826 bases in
17,960,667 sequence records in the WGS division
as of August 2006.
19
Relational Databases

Databases can be more complex than a single
spreadsheet
GenBank has proteins and SNPs as well as DNA
Some fields (e.g. phosphorylation sites) apply to
protein, but not DNA
Better to create a separate spreadsheet format
for Protein records
Each different spreadsheet is called a Table
Different Tables are linked by key fields
(e.g. DNA and protein for same gene)
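
The two-table idea can be tried out with Python's built-in `sqlite3` module. This is only a toy sketch: the table names, column names, and the single placeholder row in each table are invented for illustration and do not reflect GenBank's real schema. The `gene_id` column is the key field that links a protein record to its DNA record.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dna (
    gene_id  TEXT PRIMARY KEY,
    organism TEXT,
    sequence TEXT
);
CREATE TABLE protein (
    protein_id    TEXT PRIMARY KEY,
    gene_id       TEXT REFERENCES dna(gene_id),  -- key field linking the tables
    phospho_sites TEXT                           -- applies to protein only
);
""")

# Placeholder rows, not real database entries.
conn.execute("INSERT INTO dna VALUES (?, ?, ?)",
             ("gene1", "Homo sapiens", "ATGGTGCAC"))
conn.execute("INSERT INTO protein VALUES (?, ?, ?)",
             ("prot1", "gene1", "S10,T25"))

# Follow the key field from the DNA table to the protein table.
rows = conn.execute("""
    SELECT dna.gene_id, dna.organism, protein.protein_id, protein.phospho_sites
    FROM dna
    JOIN protein ON protein.gene_id = dna.gene_id
    WHERE dna.organism = ?
""", ("Homo sapiens",)).fetchall()
print(rows)
```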

20
Many Tables at NCBI

The NCBI hosts a huge interconnected database
system that, in addition to DNA and protein,
includes
Journal Articles (PubMed)
Genetic Diseases (OMIM)
Polymorphisms (dbSNP)
Cytogenetics (CGH/SKY/FISH CGAP)
Gene Expression (GEO)
Taxonomy
Chemistry (PubChem)

21
Database Design

A database can only be searched in ways that it
was designed to be searched
You can search within a specific Field in a
specific Table - and sometimes can combine
searches from different Fields and/or Tables
(Boolean "AND" and "OR" searches)
Bad to search for "human hemoglobin" in a
'Description' field
Much better to search for "Homo sapiens" in
'Organism' AND "HBB" in 'gene name'
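
A fielded Boolean search like this can also be written as a query string. The sketch below builds one in the `value[Field]` style used by Entrez and places it in a URL for the NCBI E-utilities `esearch` service. The helper name `fielded_query` is hypothetical, and the endpoint, the `nucleotide` database name, and the `[Organism]` and `[Gene Name]` field tags are written from memory, so they should be checked against the current NCBI documentation. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint; verify against NCBI documentation.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(terms):
    """Join (field, value) pairs as '"value"[field]' terms combined with AND."""
    return " AND ".join(f'"{value}"[{field}]' for field, value in terms)


term = fielded_query([("Organism", "Homo sapiens"), ("Gene Name", "HBB")])
url = ESEARCH + "?" + urlencode({"db": "nucleotide", "term": term})

print(term)
print(url)
```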

22
Web Query

Most Scientific databases have a web-based query
tool
It may be simple

23
or complex
24
ENTREZ is the GenBank web query tool
25
Advanced query interface
26
ENTREZ has pre-computed links between Tables

Relationships between sequences are computed with
BLAST
Relationships between articles are computed with
"MeSH" terms (shared keywords)
Relationships between DNA and protein sequences
rely on accession numbers
Relationships between sequences and PubMed
articles rely on both shared keywords and the
mention of accession numbers in the articles.

27
NCBI Databases contain more than just DNA and
protein sequences
30
Other Important Databases

Genomes
Proteins
Biochemical Regulatory Pathways
Gene Expression
Genetic Variation (mutants, SNPs)
Protein-Protein Interactions
Gene Ontology (Biological Function)

31
UCSC Genome Browser
Search by gene name
or by sequence
33
Lots of additional data can be added as optional
"tracks" - anything that can be mapped to
locations on the genome
34
Ensembl at EBI/EMBL
37
SNPs (Single Nucleotide Polymorphisms)

Genetic variation
Can be alleles of genes
also differences in non-coding regions collected
from genome sequencing of different individuals
dbSNP at the NCBI - all public SNP data
SNP Consortium at CSHL - high quality set

41
KEGG: Kyoto Encyclopedia of Genes and Genomes

Enzymatic and regulatory pathways
Mapped out by EC number and cross-referenced to
genes in all known organisms
(wherever sequence information exists)
Parallel maps of regulatory pathways

45
NCI BioCarta
47
Protein-Protein Interactions

Metabolic and regulatory pathways
Transcription factors
Co-expression
Biochemical data
crosslinking
yeast 2-hybrid
affinity tagging
Useful feedback to genome annotation/protein
function and gene expression

49
BIND - The Biomolecular Interaction Network
Database
50
Gene Ontology

Genetics is a messy science
Scientists have been working in isolation on
individual species for many years - naming genes,
mutants, odd phenotypes
sonic hedgehog
Now that we have complete genome sequences, how
to reconcile the names across all species?
Gene Ontology uses a single 3-part system
Molecular function (specific tasks)
Biological process (broad biological goals - e.g.
cell division)
Cellular component (location)

52
Database Search Strategies

General search principles - not limited to
sequence (or to biology)
Use accession numbers whenever possible
Start with broad keywords and narrow the search
using more specific terms
Try variants of spelling, numbers, etc.
Search all relevant databases
Be persistent!!
