An Introduction to Bioinformatics. CSE, Marmara University mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/12/09 Source http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt - PowerPoint PPT Presentation

Provided by: steve961
Transcript and Presenter's Notes

1
An Introduction to Bioinformatics. CSE, Marmara University
mimoza.marmara.edu.tr/~m.sakalli/cse546 Oct/12/09
Source: http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt
2
Terminology
  • Bioinformatics: using computational techniques to
    access, analyze, and interpret biological
    information. Tool building. Biocomputing and
    computational biology are synonyms.
  • Sequence analysis is the study of molecular
    sequence data.
  • Genomics analyzes the context of genes or
    complete genomes.
  • Proteomics is the subdivision of genomics
    concerned with analyzing the protein complement,
    i.e. the proteome.
  • The Human Genome Project and numerous other
    projects keep the data coming at alarming rates.
  • Homo sapiens: 3.2 billion base pairs.
    Estimates of the number of genes were in the
    100,000 range, but the count turns out to be only
    about twice that of a fruit fly, between 25,000
    and 35,000!
  • The protein-coding region of the genome is only
    about 1% or so; much of the remainder is
    "jumping", selfish DNA, of which much may be
    involved in regulation and control.

3
  • Three major databases, each with its own specific
    format. They are mirrored among each other and
    share accession codes, but NOT identifier names.
  • 1) National Center for Biotechnology Information
    (NCBI), the National Library of Medicine (NLM),
    at the NIH (GenBank and GenPept).
  • http://www.ncbi.nlm.nih.gov/
  • http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
  • Georgetown University's National Biomedical
    Research Foundation Protein Identification
    Resource (PIR), and the Naval Research Lab
    sequences of three-dimensional structure.
  • http://www-nbrf.georgetown.edu/
  • http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html
  • 2) European Molecular Biology Laboratory (EMBL)
  • http://www.ebi.ac.uk/embl/index.html,
    http://www.embl-heidelberg.de/
  • European Bioinformatics Institute (EBI)
  • http://www.ebi.ac.uk/
  • Swiss Institute of Bioinformatics (SIB), Expert
    Protein Analysis System (ExPASy)
  • http://www.expasy.ch/, http://www.expasy.org/links.html
  • Nucleotide sequence database, amino acid sequence
    databases
  • http://expasy.cbr.nrc.ca/sprot/
  • 3) DNA Data Bank of Japan (DDBJ)
  • http://www.ddbj.nig.ac.jp/

4
  • Atlas of Protein Sequence and Structure: the
    first well-recognized protein sequence database,
    mid-sixties, by Dr. Margaret Dayhoff.
  • DDBJ began in 1984, GenBank in 1982, and EMBL in
    1980. They are all attempts at establishing an
    organized, reliable, comprehensive, and openly
    available library of genetic sequences.
  • Each program needs to recognize particular
    aspects of the sequence files; the flexibility
    this demands of a program is a headache. NCBI's
    ASN.1 format and its Entrez interface attempt to
    reduce these problems.
  • Unfortunately, unlike internet standards (with
    IEEE working groups or the Internet Engineering
    Task Force and its RFCs, for example), sequence
    formats have no comparable standardization;
    format issues are the most confusing and troubling
    aspect of working with primary sequence data.
  • Sequence database installations are commonly a
    complex ASCII/binary mix, but neither relational
    nor object-oriented (and often proprietary).
  • They contain several very long text files, each
    containing different types of information, all
    related to particular sequences.
  • Software is usually required to interact with
    these databases, e.g. ReadSeq by Don Gilbert (a
    reformatting program for DNA and protein
    sequences that accepts single or multiple inputs
    in 18 different formats and converts them to a
    specified format).

5
  • http://www.molecularevolution.org/
  • AWTY (Are We There Yet?) is a system for
    graphically exploring convergence of Markov Chain
    Monte Carlo (MCMC) chains in Bayesian
    phylogenetic inference (Nylander et al. 2008).
  • FigTree graphically displays phylogenetic trees.
  • Clustal W (Thompson et al. 1994) is for global
    multiple sequence alignment. It uses a progressive
    alignment algorithm with affine gap penalties and
    a guide tree based on sequence similarity to
    align DNA or amino acid sequences. The affine gap
    cost model penalizes insertions and deletions
    using a linear function in which one term is
    length independent and the other is length
    dependent: Gap penalty = Gapopen + Len ×
    Gapextend. Recent reviews compare multiple
    alignment algorithms (e.g., Hickson et al. 2000,
    Thompson et al. 1999, and McClure et al. 1994).
    Morrison and Ellis (1997) discuss the effects of
    nucleotide sequence alignment on the estimation
    of phylogenetic hypotheses. The current version
    is Clustal W2 (Larkin et al. 2007). The program
    is also available with a graphical user
    interface, Clustal X.
  • BEAST (with BEAUti), Bayesian Evolutionary Analysis
    Sampling Trees, is for evolutionary inference from
    molecular sequences, by Andrew Rambaut and Alexei
    Drummond (Drummond et al. 2002, 2005, 2006).
  • FASTA compares pairs of protein or DNA sequences,
    and also compares a single protein or DNA
    sequence to a database or library. It is fast and
    runs as a local or remote service.
  • GARLI (Genetic Algorithm for Rapid Likelihood
    Inference) performs phylogenetic searches on
    aligned nucleotide datasets using the maximum
    likelihood criterion.
  • MAFFT uses the fast Fourier transform (FFT) to
    optimize protein alignments based on physical
    properties of the amino acids (Katoh et al. 2002,
    2005). The program uses progressive alignment
    followed by refinement, also known as iterative
    alignment.
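The affine gap cost described for Clustal W above can be illustrated with a few lines of Python. This is only a sketch of the formula Gap penalty = Gapopen + Len × Gapextend; the function name and the numeric values are illustrative assumptions, not Clustal W's actual defaults or code.

```python
def affine_gap_penalty(length, gap_open, gap_extend):
    """Cost of one gap of `length` residues under the affine model.

    gap_open is the length-independent term; length * gap_extend is the
    length-dependent term. A gap of length 0 costs nothing.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0
    return gap_open + length * gap_extend


# Illustrative values only (assumed, not Clustal W defaults).
GAP_OPEN = 10
GAP_EXTEND = 1

# One gap of six residues versus three separate gaps of two residues.
one_long_gap = affine_gap_penalty(6, GAP_OPEN, GAP_EXTEND)
three_short_gaps = 3 * affine_gap_penalty(2, GAP_OPEN, GAP_EXTEND)
print(one_long_gap, three_short_gaps)
```

With these values the single long gap is cheaper than the three short ones, even though both remove six residues. This is the reason affine penalties favor alignments with fewer, longer gaps.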

6
  • All sequence databases contain (in their own
    format):
  • Name (genetic identifier): LOCUS, ENTRY, ID
  • Definition: a brief, one-line, textual sequence
    description.
  • Accession number: a constant data identifier.
  • Source and classification (taxonomy) information.
  • Complete literature references.
  • Comments and keywords.
  • The all-important FEATURE table!
  • A summary or checksum line.
  • The sequence itself.

7
  • LOCUS HSEF1AR 1506 bp
    mRNA linear PRI 12-SEP-1993
  • DEFINITION Human mRNA for elongation factor 1
    alpha subunit (EF-1 alpha).
  • ACCESSION X03558
  • VERSION X03558.1 GI:31097
  • KEYWORDS elongation factor; elongation factor
    1.
  • SOURCE human.
  • ORGANISM Homo sapiens
  • Eukaryota; Metazoa; Chordata;
    Craniata; Vertebrata; Euteleostomi;
  • Mammalia; Eutheria; Primates;
    Catarrhini; Hominidae; Homo.
  • REFERENCE 1 (bases 1 to 1506)
  • AUTHORS Brands,J.H., Maassen,J.A., van
    Hemert,F.J., Amons,R. and Moller,W.
  • TITLE The primary structure of the alpha
    subunit of human elongation
  • JOURNAL Eur. J. Biochem. 155 (1), 167-171
    (1986)
  • MEDLINE 86136120
  • FEATURES Location/Qualifiers
  • source 1..1506
  • /organism="Homo sapiens"
  • /db_xref="taxon:9606"
  • CDS 54..1442
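A record like the one above can be read with very little code once its layout is known. The sketch below assumes the usual flat-file convention that top-level keywords (LOCUS, DEFINITION, ACCESSION, ...) start in the first column and continuation lines are indented. The function names and the abbreviated sample text are illustrative, not part of any NCBI tool.

```python
def parse_header_fields(text):
    """Collect top-level keyword fields up to (not including) FEATURES."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FEATURES"):
            break
        if not line[0].isspace():
            # A new keyword starts in column one.
            keyword, _, value = line.partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:
            # Indented lines continue the previous field.
            fields[current] += " " + line.strip()
    return fields


def span_length(location):
    """Length in nt of a simple 'start..end' location (1-based, inclusive)."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1


# Abbreviated sample built from the fields shown on the slide.
SAMPLE = """\
LOCUS       HSEF1AR      1506 bp    mRNA    linear   PRI 12-SEP-1993
DEFINITION  Human mRNA for elongation factor 1 alpha subunit
            (EF-1 alpha).
ACCESSION   X03558
FEATURES             Location/Qualifiers
     source          1..1506
     CDS             54..1442
"""

fields = parse_header_fields(SAMPLE)
print(fields["ACCESSION"], "|", fields["DEFINITION"])
print(span_length("54..1442"), "nt in the CDS")
```

The CDS span 54..1442 covers 1389 nt, i.e. 463 codons. If the last codon is the stop codon, that corresponds to a 462-residue protein, which agrees with the 462 AA length given in the SWISS-PROT entry for the same protein.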

8
EMBL and SWISS-PROT
  • ID EF11_HUMAN STANDARD; PRT; 462 AA.
  • AC P04720; P04719;
  • DT 13-AUG-1987 (Rel. 05, Created)
  • DE Elongation factor 1-alpha 1 (EF-1-alpha-1)
    (Elongation factor 1 A-1)
  • DE (eEF1A-1) (Elongation factor Tu) (EF-Tu).
  • GN EEF1A1 OR EEF1A OR EF1A.
  • OS Homo sapiens (Human),
  • OS Bos taurus (Bovine), and
  • OS Oryctolagus cuniculus (Rabbit).
  • OC Eukaryota; Metazoa; Chordata; Craniata;
    Vertebrata; Euteleostomi;
  • OC Mammalia; Eutheria; Primates; Catarrhini;
    Hominidae; Homo.
  • OX NCBI_TaxID=9606, 9913, 9986;
  • RN [1]
  • RP SEQUENCE FROM N.A.
  • RC SPECIES=Human;
  • RX MEDLINE=86136120; PubMed=3512269;
  • RA Brands J.H.G.M., Maassen J.A., van Hemert
    F.J., Amons R., Moeller W.
  • RT "The primary structure of the alpha subunit
    of human elongation . -binding sites."
  • RL Eur. J. Biochem. 155:167-171(1986).
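In the EMBL/SWISS-PROT style every line begins with a two-letter line code (ID, AC, OS, ...), and a code may repeat over several lines. The sketch below groups lines by code. It assumes the semicolon-separated punctuation of the flat-file format, and the helper names and shortened sample are illustrative only.

```python
from collections import defaultdict


def parse_line_codes(text):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2 or not line.strip():
            continue
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


def split_list(values):
    """Join repeated lines and split a semicolon-separated list."""
    joined = " ".join(values)
    return [item.strip() for item in joined.split(";") if item.strip()]


# Shortened sample taken from the entry shown on the slide.
SAMPLE = """\
ID   EF11_HUMAN     STANDARD;      PRT;   462 AA.
AC   P04720; P04719;
OS   Homo sapiens (Human),
OS   Bos taurus (Bovine), and
OS   Oryctolagus cuniculus (Rabbit).
"""

record = parse_line_codes(SAMPLE)
accessions = split_list(record["AC"])
species = " ".join(record["OS"])
print(accessions)
print(species)
```

Keeping repeated codes as a list preserves the fact that one entry can describe the same protein in several organisms, as the three OS lines do here.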

9
PIR/NBRF format
  • ENTRY EFHU1 type complete
    iProClass View of EFHU1
  • TITLE translation elongation factor
    eEF-1 alpha-1 chain - human
  • ALTERNATE_NAMES translation elongation factor Tu
  • ORGANISM formal_name Homo sapiens
    common_name man
  • cross-references taxon:9606
  • DATE 30-Jun-1988 sequence_revision
    05-Apr-1995 text_change..
  • ACCESSIONS B24977; A25409; A29946; A32863;
    I37339
  • REFERENCE A93610
  • authors Rao, T.R.; Slobin, L.I.
  • journal Nucleic Acids Res. (1986) 14:2409
  • title Structure of the amino-terminal
    end of mammalian elongation
  • accession B24977
  • molecule_type mRNA
  • residues 1-82,'A',84-94 label RAO
  • cross-references EMBL:X03689; NID:g31109;
    PIDN:CAA27325.1;
  • PID:g31110.
  • GENETICS
  • gene GDB:EEF1A1; EEF1A; EF1A
  • cross-references GDB:118791; OMIM:130590

10
Examples of DBs with specialized types of sequences
  • Almost all the links: the Human Genome Ensembl
    project at http://www.ensembl.org/
  • Patterns, motifs, and profiles: REBASE, EPD,
    PROSITE.
  • Aligned multiple sequence entries: RDP and ALN.
  • Functionally, structurally, or phylogenetically
    ordered: iProClass and the HOVERGEN vertebrate
    gene db.
  • HIV Database and the Giardia lamblia Genome
    Project.
  • 3D structure: atomic coordinate data is necessary
    to define the tertiary shape of a particular
    biological molecule. Protein Data Bank (PDB) and
    Rutgers Nucleic Acid Database.
  • MolBio: molecular visualization with special
    software.
  • Genomic linkage mapping databases for H. sapiens,
    Mus, Drosophila, C. elegans, Saccharomyces,
    Arabidopsis, E. coli.
  • OMIM: Online Mendelian Inheritance in Man
  • Phylogenetic tree databases, e.g. the Tree of
    Life.

11
  • There's a bewildering assortment of different
    databases and ways to access and manipulate the
    information within them. The key is to learn how
    to use that information in the most efficient
    manner.
  • For example: given a novel genome sequence, find
    all genes and p-genes.
  • I want to design "sequence capture" probes for
    the exons of 40 genes that cause RP.
  • Obtain the exonic sequence, with at least 100
    nt flanking, and 1000 nt of the promoter from
    the transcription start.
  • I propose a new way to find disease-causing
    mutations in humans. I want to look only in
    genes that have regions that 1) are highly
    conserved across species, 2) contain known
    functional protein domains (e.g. transmembrane
    domains), and 3) have mRNA secondary structure.
    Is this a good idea?
  • 1859: Charles Darwin's The Origin of Species
  • Basic Mendelian genetics
  • Mendel's laws
  • independent assortment
  • independent segregation
  • mitosis and meiosis
  • dominant/recessive and pedigrees (the graphs of
    phenotype)
  • alleles
  • Basic molecular genetics
  • DNA
  • RNA
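The probe-design task earlier on this slide (exonic sequence with at least 100 nt flanking, plus 1000 nt of promoter from the transcription start) reduces to interval arithmetic once exon coordinates are known. The sketch below assumes 1-based inclusive coordinates and a known strand. The function names and the example coordinates are invented for illustration and do not refer to any real gene.

```python
def flanked_exons(exons, flank=100):
    """Widen each (start, end) exon by `flank` nt on both sides.

    Coordinates are 1-based and inclusive; starts are clamped at 1.
    Ends are not clamped, since the sequence length is not known here.
    """
    return [(max(1, start - flank), end + flank) for start, end in exons]


def promoter_interval(tss, strand, length=1000):
    """Interval of `length` nt directly upstream of the transcription start."""
    if strand == "+":
        return (max(1, tss - length), tss - 1)
    if strand == "-":
        # On the minus strand, upstream means higher coordinates.
        return (tss + 1, tss + length)
    raise ValueError("strand must be '+' or '-'")


# Hypothetical gene on the plus strand (coordinates are made up).
exons = [(5000, 5150), (6200, 6420)]
capture_regions = flanked_exons(exons)
promoter = promoter_interval(5000, "+")
print(capture_regions)
print(promoter)
```

The resulting intervals would then be used to pull the actual bases from a genome sequence, a step that depends on the database and tool chosen.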

12
Pearson FastA format; GCG single
sequence format
  • >EFHU1 PIR1 release 71.01
  • MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
  • KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
  • NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
  • GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
  • MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
  • QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
  • EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
  • GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
  • IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
  • VTKSAQKAQKAK
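The Pearson FastA layout is simple enough to read by hand: a header line that begins with ">" followed by one or more lines of sequence. The sketch below is a minimal reader under that assumption. The function name is illustrative, and the sample holds only the first and last sequence lines of the record above, not the full 462 residues.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Excerpt only: first and last sequence lines of the EFHU1 record.
SAMPLE = """\
>EFHU1 PIR1 release 71.01
MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
VTKSAQKAQKAK
"""

records = parse_fasta(SAMPLE)
header, sequence = records[0]
print(header, len(sequence))
```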

!!AA_SEQUENCE 1.0
P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
N;Alternate names: translation elongation factor Tu
F;1-223/Domain: eEF-1 alpha domain I, GTP-binding status predicted <EF1>
F;8-156/Domain: translation elongation factor Tu homology <ETU>
F;14-21/Region: nucleotide-binding motif A (P-loop)
F;153-156/Region: GTP-binding NKXD motif
F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding status predicted <EF2>
F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding status predicted <EF3>
F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) status predicted
F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) status predicted
EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..
1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE
351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
451 VTKSAQKAQK AK
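The "Check" value in the GCG header is a checksum over the sequence. The sketch below follows the scheme as it is commonly described for GCG files: each residue's character code is multiplied by a position weight that cycles from 1 to 57, and the sum is taken modulo 10000. That description is an assumption from general knowledge of the format rather than something stated on the slides, and the function name is illustrative.

```python
def gcg_checksum(sequence):
    """Checksum in the style commonly described for GCG sequence files.

    Assumed scheme: weight cycles 1..57 by position, each weight is
    multiplied by the upper-case character code, sum taken mod 10000.
    """
    total = 0
    for index, residue in enumerate(sequence):
        weight = index % 57 + 1
        total += weight * ord(residue.upper())
    return total % 10000


# Small hand-checkable cases: "A" is code 65, "C" is code 67.
print(gcg_checksum("A"))
print(gcg_checksum("AC"))
```

Because the weights depend on position, swapping two different residues changes the result, so a check of this kind catches simple transcription errors that a plain sum would miss.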