Part 7 Collecting and Storing Sequences in Lab - PowerPoint PPT Presentation

About This Presentation
Title:

Part 7 Collecting and Storing Sequences in Lab

Description:

Title: PowerPoint Presentation Last modified by: lu Created Date: 8/25/2009 4:39:34 PM Document presentation format: On-screen Show Company: Dan Graur – PowerPoint PPT presentation

Slides: 81
Provided by: lmbeSeuE
Transcript and Presenter's Notes

Title: Part 7 Collecting and Storing Sequences in Lab


1
Part 7 Collecting and Storing Sequences in Lab
2
What is Bioinformatics?
3
NIH definitions
What is Bioinformatics? - Research, development,
and application of computational tools and
approaches for expanding the use of biological,
medical, behavioral, and health data, including
the means to acquire, store, organize, archive,
analyze, or visualize such data.
What is Computational Biology? - The development
and application of analytical and theoretical
methods, mathematical modeling and computational
simulation techniques to the study of biological,
behavioral, and social data.
4
NSF introduction
Large databases that can be accessed and analyzed
with sophisticated tools have become central to
biological research and education. The
information content in the genomes of organisms,
in the molecular dynamics of proteins, and in
population dynamics, to name but a few areas, is
enormous. Biologists are increasingly finding
that the management of complex data sets is
becoming a bottleneck for scientific advances.
Therefore, bioinformatics is rapidly becoming a key
technology in all fields of biology.
5
NSF mission statement
The present bottlenecks in bioinformatics include
the education of biologists in the use of
advanced computing tools, the recruitment of
computer scientists into this evolving field, the
limited availability of developed databases of
biological information, and the need for more
efficient and intelligent search engines for
complex databases.
7
Molecular Bioinformatics
Molecular Bioinformatics involves the use of
computational tools to discover new information
in complex data sets (from the one-dimensional
information of DNA through the two-dimensional
information of RNA and the three-dimensional
information of proteins, to the four-dimensional
information of evolving living systems).
8
From DNA to Genome
[Timeline figure: 1955 to 1985]
9
[Timeline figure, continued: 1990 to 2000]
10
Origin of bioinformatics and biological
databases
The first protein sequence reported was that of
bovine insulin in 1956, consisting of 51
residues.
Nearly a decade later, the first nucleic acid
sequence was reported, that of yeast alanine tRNA,
with 77 bases.
11
In 1965, Dayhoff gathered all the available
sequence data to create the first bioinformatic
database (Atlas of Protein Sequence and
Structure). The Protein DataBank followed in
1972 with a collection of ten X-ray
crystallographic protein structures. The
SWISSPROT protein sequence database began in
1987.
12
Nucleotides
13
Complete Genomes
14
What can we do with sequences and other types of
molecular information?
15
Open reading frames
Functional sites
Annotation
Structure, function
16
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAA
ATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGT
TTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCG
GGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACG
GAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG
AAT ...... .............. TGAAAAACGTA
17
promoter
TF binding site
CCTGACAAATTCGACGTGCGGCATTGCATGCAGACGTGCATG CGTGCAA
ATAATCAATGTGGACTTTTCTGCGATTATGGAAGAA CTTTGTTACGCGT
TTTTGTCATGGCTTTGGTCCCGCTTTGTTC AGAATGCTTTTAATAAGCG
GGGTTACCGGTTTGGTTAGCGAGA AGAGCCAGTAAAAGACGCAGTGACG
GAGATGTCTGATG CAA TAT GGA CAA TTG GTT TCT TCT CTG
AAT ................................. ............
.. TGAAAAACGTA
Transcription Start Site
Ribosome binding Site
ORF: Open Reading Frame; CDS: Coding Sequence
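The idea of an open reading frame on the slide above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the slides: it scans the forward strand only, treats ATG as the sole start codon, and uses the three standard stop codons. The function name `find_orfs` and the short test sequence are hypothetical.

```python
# Illustrative sketch: scan the forward strand for open reading frames (ORFs).
# Assumptions (not from the slides): ATG is the only start codon and
# TAA/TAG/TGA are the stop codons; the reverse strand is ignored.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    dna = dna.upper().replace(" ", "")
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


example = "CCATGCAATATGGATAACC"  # hypothetical sequence
for start, end, frame in find_orfs(example):
    print(frame, start, end, example[start:end])
```

Real gene finders also consider the reverse strand, alternative start codons, and signals such as the ribosome binding site, so this sketch only captures the bare definition.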
18
Comparing ORFs Identifying orthologs Inferences
on structure and function
Comparative genomics
Comparing functional sites Inferences on
regulatory networks
19
Similarity profiles
Researchers can learn a great deal about the
structure and function of human genes by
examining their counterparts in model organisms.
20
Alignment preproinsulin
Xenopus MALWMQCLP-LVLVLLFSTPNTEALANQHL
Bos     MALWTRLRPLLALLALWPPPPARAFVNQHL

Xenopus CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ
Bos     CGSHLVEALYLVCGERGFFYTPKARREVEG

Xenopus AQVNGPQDNELDG-MQFQPQEYQKMKRGIV
Bos     PQVG---ALELAGGPGAGGLEGPPQKRGIV

Xenopus EQCCHSTCSLFQLENYCN
Bos     EQCCASVCSLYQLENYCN
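A pairwise alignment like the preproinsulin one is often summarized by its percent identity. The sketch below is one possible definition, chosen here as an assumption: columns where either sequence has a gap are skipped, and identity is the share of matching residues among the remaining columns. The function name `percent_identity` is hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue  # gap columns are excluded by assumption
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# One 30-residue block of the Xenopus/Bos preproinsulin alignment
xenopus = "CGSHLVEALYLVCGDRGFFYYPKIKRDIEQ"
bos = "CGSHLVEALYLVCGERGFFYTPKARREVEG"
print(round(percent_identity(xenopus, bos), 1))
```

Other conventions divide by the full alignment length or by the shorter sequence, so reported identities are only comparable when the definition is stated.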
21
(No Transcript)
22
  • Ultraconserved Elements in the Human Genome
  • Gill Bejerano, Michael Pheasant, Igor Makunin,
    Stuart Stephen, W. James Kent, John S. Mattick,
    David Haussler (Science 2004, 304:1321-1325)
  • There are 481 segments longer than 200 base
    pairs (bp) that are absolutely conserved (100%
    identity with no insertions or deletions) between
    orthologous regions of the human, rat, and mouse
    genomes. Nearly all of these segments are also
    conserved in the chicken and dog genomes, with an
    average of 95% and 99% identity, respectively.
    Many are also significantly conserved in fish.
    These ultraconserved elements of the human genome
    are most often located either overlapping exons
    in genes involved in RNA processing or in introns
    or nearby genes involved in the regulation of
    transcription and development.
  • There are 156 intergenic, untranscribed,
    ultraconserved segments
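The definition of an ultraconserved segment (100% identity, no insertions or deletions, above a length threshold) can be illustrated on a toy multiple alignment. This is not the procedure used in the cited paper, which worked on whole-genome alignments; it is a minimal sketch, and the function name `conserved_runs` and the short sequences are hypothetical.

```python
def conserved_runs(aligned, min_len):
    """Return (start, end) column ranges, end-exclusive, of at least min_len
    columns in which every sequence carries the same non-gap character."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    run_start = None
    # Iterate one column past the end so that a trailing run is closed.
    for col in range(length + 1):
        same = (
            col < length
            and aligned[0][col] != "-"
            and all(s[col] == aligned[0][col] for s in aligned)
        )
        if same and run_start is None:
            run_start = col
        elif not same and run_start is not None:
            if col - run_start >= min_len:
                runs.append((run_start, col))
            run_start = None
    return runs


toy = ["ACGTACGT", "ACGTTCGT", "ACGAACGT"]  # stand-ins for human, rat, mouse
print(conserved_runs(toy, 3))
```

With real data the threshold would be 200 columns rather than 3, as stated on the slide.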

23
Junk Supporting evidence
Junk is real!
24
Genome-wide profiling of mRNA levels
Protein levels Co-expression of genes and/or
proteins
Functional genomics
Identifying protein-protein interactions Networks
of interactions
25
Understanding the function of genes and other
parts of the genome
26
Structural genomics
Assign structure to all proteins encoded in a
genome
27
Structural Genomics
Currently: 27,761 structures
28
Structural Genomics
Estimate
29
Origin of tools
Immediately after the establishment of the first
databases, tools became available to search them
- at first in a very simple manner, looking for
keyword matches and short sequence words and,
then, in a more sophisticated manner by using
pattern matching, alignment based methods, and
machine learning techniques.
30
Despite the huge explosion in the number and
length of sequences, the tools used for storage,
retrieval, analysis, and dissemination of data in
bioinformatics are very similar to those from
15-20 years ago.
31
Biological databases
32
Database or databank?
  • Initially
  • Databank (in UK)
  • Database (in the USA)
  • Solution
  • The abbreviation db

33
What is a database?
  • A collection of data
  • structured
  • searchable (index) -> table of contents
  • updated periodically (release) -> new edition
  • cross-referenced (hyperlinks) -> links with
    other db
  • Includes also associated tools (software)
    necessary for access, updating, information
    insertion, information deletion.
  • Data storage management: flat files, relational
    databases

34
Database: a "flat file" example
Flat-file database ("flat file", 3 entries)
  • Accession number: 1
  • First Name: Amos
  • Last Name: Bairoch
  • Course: Pottery 2000, Pottery 2001
  • //
  • Accession number: 2
  • First Name: Dan
  • Last Name: Graur
  • Course: Pottery 2000, Pottery 2001, Ballet 2001,
    Ballet 2002
  • //
  • Accession number: 3
  • First Name: John
  • Last Name: Travolta
  • Course: Ballet 2001, Ballet 2002
  • //
  • Easy to manage: all the entries are visible at
    the same time!
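The flat-file layout above can be read with a few lines of Python. The sketch assumes a "Key: value" line format with "//" closing each entry, similar in spirit to sequence database flat files; the function name `parse_flat_file` is hypothetical.

```python
# The three entries from the slide, written with an assumed "Key: value" layout.
FLAT_FILE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: Pottery 2000, Pottery 2001
//
Accession number: 2
First Name: Dan
Last Name: Graur
Course: Pottery 2000, Pottery 2001, Ballet 2001, Ballet 2002
//
Accession number: 3
First Name: John
Last Name: Travolta
Course: Ballet 2001, Ballet 2002
//
"""


def parse_flat_file(text):
    """Split a flat file into a list of dicts, one per '//'-terminated entry."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "//":  # end-of-entry marker
            if current:
                records.append(current)
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:  # tolerate a missing final '//'
        records.append(current)
    return records


records = parse_flat_file(FLAT_FILE)
# Answering a question means scanning every entry in the file.
ballet_teachers = [r["Last Name"] for r in records if "Ballet" in r["Course"]]
print(ballet_teachers)
```

The scan over all records is the price of the format's simplicity: there is no index, and repeated values such as course names are stored once per entry.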

35
Database: a "relational" example
Relational database ("table file")

Teacher   Accession number   Education
Amos      1                  Biochemistry
Dan       2                  Genetics
John      3                  Scientology

Course                  Year         Involved teachers
Advanced Pottery        2000, 2001   1, 2
Ballet for Fat People   2001, 2002   2, 3
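The relational example can be tried with Python's built-in `sqlite3` module. The schema below is an assumption made for illustration: the slide shows two tables, and the third table (`teaches`) is added here to hold one row per course, teacher, and year, which is the usual way to store the multi-valued "Year" and "Involved teachers" columns. Table and column names are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # throwaway in-memory database
con.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE teaches (course_id INTEGER REFERENCES course(id),
                      accession INTEGER REFERENCES teacher(accession),
                      year      INTEGER);
""")
con.executemany(
    "INSERT INTO teacher VALUES (?, ?, ?)",
    [(1, "Amos", "Biochemistry"), (2, "Dan", "Genetics"), (3, "John", "Scientology")],
)
con.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [(1, "Advanced Pottery"), (2, "Ballet for Fat People")],
)
con.executemany(
    "INSERT INTO teaches VALUES (?, ?, ?)",
    [(1, 1, 2000), (1, 1, 2001), (1, 2, 2000), (1, 2, 2001),
     (2, 2, 2001), (2, 2, 2002), (2, 3, 2001), (2, 3, 2002)],
)

# Who taught a ballet course in 2001? The join follows the accession numbers.
rows = con.execute("""
    SELECT t.name
    FROM teacher AS t
    JOIN teaches AS x ON x.accession = t.accession
    JOIN course  AS c ON c.id = x.course_id
    WHERE c.title LIKE 'Ballet%' AND x.year = 2001
    ORDER BY t.accession
""").fetchall()
names = [row[0] for row in rows]
print(names)
```

Each fact is stored once, and the accession numbers act as the cross-references between tables, at the cost of needing a query language to see an entry as a whole.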
36
Why biological databases?
  • Exponential growth in biological data.
  • Data (genomic sequences, 3D structures, 2D gel
    analysis, MS analysis, microarrays, etc.) are no
    longer published in a conventional manner, but
    directly submitted to databases.
  • Essential tools for biological research.

37
Distribution of sequences
  • Books, articles 1968 -> 1985
  • Computer tapes 1982 -> 1992
  • Floppy disks 1984 -> 1990
  • CD-ROM 1989 ->
  • FTP 1989 ->
  • On-line services 1982 -> 1994
  • WWW 1993 ->
  • DVD 2001 ->

38
Some statistics
  • More than 1000 different biological databases
  • Variable size: < 100 Kb to > 20 Gb
  • DNA: > 20 Gb
  • Protein: 1 Gb
  • 3D structure: 5 Gb
  • Other: smaller
  • Update frequency: daily to annually to seldom to
    forget about it.
  • Usually accessible through the web (some free,
    some not)

39
  • Some databases in the field of molecular
    biology
  • AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb,
  • ARR, AsDb, BBDB, BCGD, Beanref, BioImage,
  • BioMagResBank, BIOMDB, BLOCKS, BovGBASE,
  • BOVMAP, BSORF, BTKbase, CANSITE, CarbBank,
  • CARBHYD, CATH, CAZY, CCDC, CD40Lbase, CGAP,
  • ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG,
  • CyanoBase, dbCFC, dbEST, dbSTS, DDBJ, DGP, DictyDb,
  • Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC,
  • ECGC, ECO2DBASE, EcoCyc, EcoGene, EMBL, EMD db,
  • ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView,
  • GCRDB, GDB, GENATLAS, GenBank, GeneCards,
  • Genline, GenLink, GENOTK, GenProtEC, GIFTS,
  • GPCRDB, GRAP, GRBase, gRNAsdb, GRR, GSDB,
  • HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD,
  • HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
  • HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat,
  • KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,

40
Categories of databases for Life Sciences
  • Sequences (DNA, protein)
  • Genomics
  • Mutation/polymorphism
  • Protein domain/family
  • Proteomics (2D gel, Mass Spectrometry)
  • 3D structure
  • Metabolism
  • Bibliography
  • Expression (microarrays, etc.)
  • Specialized

41
Resources
NCBI (National Center for Biotechnology
Information) is a resource for molecular biology
information. NCBI creates and maintains public
databases, conducts research in computational
biology, develops software tools for analyzing
genome data, and disseminates biomedical
information. The NCBI site is constantly being
updated and some of the changes include new
databases and tools for data mining. NCBI
offers several searchable literature, molecular
and genomic databases and many bioinformatic
tools. An up-to-date list of databases and tools
can be found on the NCBI Sitemap.
42
Literature Databases
Bookshelf: A collection of searchable biomedical
books linked to PubMed.
PubMed: Allows searching by author names, journal
titles, and a new Preview/Index option. The PubMed
database provides access to over 12 million MEDLINE
citations back to the mid-1960s. It includes
History and Clipboard options, which may enhance
your search session.
PubMed Central: The U.S. National Library of
Medicine digital archive of life science journal
literature.
OMIM: Online Mendelian Inheritance in Man is a
database of human genes and genetic disorders
(also OMIA).
43
GenBank: http://www.ncbi.nlm.nih.gov/Genbank/
EBI: http://www.ebi.ac.uk/
DDBJ: http://www.ddbj.nig.ac.jp/
44
Type in a Query term
  • Enter your search words in the
  • query box and hit the Go button

http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Searching
45
The Syntax
  • 1. Boolean operators AND, OR, NOT must be entered
    in UPPERCASE (e.g., promoters OR response
    elements). The default is AND.
  • 2. Entrez processes all Boolean operators in a
    left-to-right sequence. The order in which Entrez
    processes a search statement can be changed by
    enclosing individual concepts in parentheses. The
    terms inside the parentheses are processed first.
    For example: g1p3 OR (response AND element AND
    promoter).
  • 3. Quotation marks: The term inside the quotation
    marks is read as one phrase (e.g., "public health"
    is different from public health, which will also
    include articles on public latrines and their
    effect on health workers).
  • 4. Asterisk: Extends the search to all terms that
    start with the letters before the asterisk. For
    example, dia* will include such terms as
    diaphragm, dial, and diameter.
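The effect of left-to-right processing and of parentheses can be seen by modelling each search term as a set of matching record IDs. This is a toy model and not how Entrez is implemented; the index contents, the record IDs, and the function name `evaluate` are all hypothetical.

```python
def evaluate(tokens, index):
    """Evaluate a query left to right; a nested list stands for parentheses."""

    def operand(tok):
        if isinstance(tok, list):  # parenthesized sub-query: evaluate it first
            return evaluate(tok, index)
        return index.get(tok, set())

    result = operand(tokens[0])
    # Tokens alternate: operand, operator, operand, operator, ...
    for op, tok in zip(tokens[1::2], tokens[2::2]):
        rhs = operand(tok)
        if op == "AND":
            result = result & rhs
        elif op == "OR":
            result = result | rhs
        elif op == "NOT":
            result = result - rhs
        else:
            raise ValueError(f"unknown operator: {op}")
    return result


# Hypothetical index: term -> IDs of the records that contain it
index = {
    "g1p3": {1, 2},
    "response": {2, 3, 4},
    "element": {3, 4},
    "promoter": {4, 5},
}

flat = evaluate(
    ["g1p3", "OR", "response", "AND", "element", "AND", "promoter"], index)
grouped = evaluate(
    ["g1p3", "OR", ["response", "AND", "element", "AND", "promoter"]], index)
print(flat, grouped)
```

Without parentheses the OR is applied first and the later ANDs shrink the result; with parentheses the records matching g1p3 alone are kept.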

46

47
Refine the Query
  • Often a search finds too many (or too few)
    sequences, so you can go back and try again with
    more (or fewer) keywords in your query
  • The History feature allows you to combine any
    of your past queries.
  • The Limits feature allows you to limit a query
    to specific organisms, sequences submitted during
    a specific period of time, etc.
  • Many other features are designed to search for
    literature in MEDLINE

48
Related Items
  • You can search for a text term in sequence
    annotations or in MEDLINE abstracts, and find all
    articles, DNA, and protein sequences that mention
    that term.
  • Then from any article or sequence, you can move
    to "related articles" or "related sequences".
  • Relationships between sequences are computed with
    BLAST
  • Relationships between articles are computed with
    "MESH" terms (shared keywords)
  • Relationships between DNA and protein sequences
    rely on accession numbers
  • Relationships between sequences and MEDLINE
    articles rely on both shared keywords and the
    mention of accession numbers in the articles.

49
(No Transcript)
50
(No Transcript)
51
(No Transcript)
52
Database Search Strategies
  • General search principles - not limited to
    sequence (or to biology).
  • Start with broad keywords and narrow the search
    using more specific terms.
  • Try variants of spelling, numbers, etc.
  • Search many databases.
  • Be persistent!!

53
PubMed
  • MEDLINE publication database
  • Over 17,000 journals
  • Some other citations
  • Papers from 1960 and on
  • Over 12,000,000 entries
  • Alerting services
  • http://www.pubcrawler.ie/
  • http://www.biomail.org/

54
Searching PubMed
  • Structureless searches
  • Automatic term mapping
  • Structured searches
  • Tags, e.g. [au], [ta], [dp], [ti]
  • Boolean operators, e.g. AND, OR, NOT, ()
  • Additional features
  • Subsets, limits
  • Clipboard, history

55
  • Start working
  • Search PubMed
  • cuban cigars
  • cuban OR cigars
  • "cuban cigars"
  • cuba* cigar*
  • (cuba* cigar*) NOT smok*
  • Fidel Castro
  • "fidel castro"
  • #6 NOT #7

56
Details and History in PubMed
57
Details and History in PubMed
58
The OMIM (Online Mendelian Inheritance in Man)
  • Genes and genetic disorders
  • Edited by team at Johns Hopkins
  • Updated daily

59
MIM Number Prefixes
* gene with known sequence
+ gene with known sequence and phenotype
# phenotype description, molecular basis known
% mendelian phenotype or locus, molecular basis unknown
no prefix: other, mainly phenotypes with suspected mendelian basis
60
Searching OMIM
  • Search Fields
  • Name of trait, e.g., hypertension
  • Cytogenetic location, e.g., 1p31.6
  • Inheritance, e.g., autosomal dominant
  • Gene, e.g., coagulation factor VIII

61
OMIM search tags
All Fields          ALL
Allelic Variant     AV or VAR
Chromosome          CH or CHR
Clinical Synopsis   CS or CLIN
Gene Map            GM or MAP
Gene Name           GN or GENE
Reference           RE or REF
62
(No Transcript)
63
Start working
Search OMIM
  • How many types of hemophilia are there?
  • For how many is the affected gene known?
  • What are the genes involved in hemophilia A?
  • What are the mutations in hemophilia A?
64
Online Literature databases
1. How to use the UH online Library?
2. Online glossaries
3. Google Scholar
4. Google Books
5. Web of Science
65
How to use the online UH Library?
http://info.lib.uh.edu/
66
(No Transcript)
67
Online Glossaries
Bioinformatics: http://www.geocities.com/bioinformaticsweb/glossary.html
and http://big.mcw.edu/
Genomics: http://www.geocities.com/bioinformaticsweb/genomicglossary.html
Molecular Evolution: http://workshop.molecularevolution.org/resources/glossary/
Biology dictionary: http://www.biology-online.org/dictionary/satellite_cells
Other glossaries, e.g., the list of phobias: http://www.phobialist.com/class.html
68
4. Google Scholar http://www.scholar.google.com/
69
What is Google Scholar?
Enables you to search specifically for scholarly
literature, including peer-reviewed papers,
theses, books, preprints, abstracts and technical
reports from all broad areas of research.
70
Use Google Scholar to find articles from a wide
variety of academic publishers, professional
societies, preprint repositories and
universities, as well as scholarly articles
available across the web.
71
Google Scholar orders your search results by how
relevant they are to your query, so the most
useful references should appear at the top of the
page.
This relevance ranking takes into account the
full text of each article, the article's author,
the publication in which the article appeared, and
how often it has been cited in scholarly
literature.
72
What other DATA can we retrieve from the record?
73
(No Transcript)
74
(No Transcript)
75
5. Google Book Search
76
(No Transcript)
77
Start working
Search Google Books
  • How many times is the tail of the giraffe
    mentioned in On the Origin of Species by
    Mr. Darwin?
78
6. Web of Science
http://portal01.isiknowledge.com.ezproxy.lib.uh.edu/portal.cgi?DestApp=WOS&Func=Frame
79
(No Transcript)
80
(No Transcript)