Data Acquisition Tools - PowerPoint PPT Presentation

Slides: 37
Provided by: tvisw
Transcript and Presenter's Notes

Title: Data Acquisition Tools


1
  • Data Acquisition Tools and Techniques

2
In this presentation
  • Part 1: Sequencing Technology
  • Part 2: Genomic Databases

3
Part 1
  • Sequencing Technology

4
Principles of DNA sequencing
  • DNA sequencing is performed using an automated
    version of the chain termination reaction, in
    which limiting amounts of dideoxyribonucleotides
    generate nested sets of DNA fragments with
    specific terminal bases
  • Four reactions are set up, one for each of the
    four bases in DNA, each incorporating a different
    fluorescent label
  • The DNA fragments are separated by PAGE and the
    sequence is read by a scanner as each fragment
    moves to the bottom of the gel
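
As a rough illustration of the principle above (not of any real base-calling software), the sketch below enumerates the nested set of terminated fragments for a newly synthesized strand and then reads the sequence back by ordering the fragments by length, as the gel does. The function names are hypothetical, and every termination point is listed deterministically instead of being sampled as in a real reaction.

```python
import random


def nested_fragments(synthesized: str) -> list[tuple[int, str]]:
    """Return (fragment length, labelled terminal base) for each termination point."""
    return [(i + 1, base) for i, base in enumerate(synthesized)]


def read_gel(fragments: list[tuple[int, str]]) -> str:
    """Shorter fragments reach the bottom of the gel first, so read labels by length."""
    return "".join(base for _, base in sorted(fragments))


strand = "GATTACA"                 # made-up example sequence
fragments = nested_fragments(strand)
random.shuffle(fragments)          # the reaction tube holds the fragments unordered
print(read_gel(fragments))         # prints GATTACA
```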

5
Types of DNA sequence
  • DNA sequences come in three major forms:
  • Genomic DNA comes directly from the genome and
    includes extragenic material as well as genes.
    In eukaryotes, genomic DNA contains introns
  • cDNA is reverse-transcribed from mRNA and
    corresponds only to the expressed parts of the
    genome. It does not contain introns
  • Recombinant DNA comes from the laboratory and
    comprises artificial DNA molecules such as
    cloning vectors

6
Genome sequencing strategies
  • Only short DNA molecules (about 800 bp) can be
    sequenced in one read, so large DNA molecules,
    such as genomes, must first be broken into
    fragments. Genome sequencing can be approached in
    two ways:
  • Shotgun sequencing involves the generation of
    random DNA fragments, which are sequenced in
    large numbers to provide genome-wide coverage
  • Clone contig sequencing involves the systematic
    production and sequencing of subclones
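
The idea of genome-wide coverage from random fragments can be sketched as follows. This is a toy model under stated assumptions: the genome is a random string, all reads have the same length, and the function names and parameters are hypothetical.

```python
import random


def shotgun_reads(genome: str, read_len: int, n_reads: int, seed: int = 0):
    """Draw reads of fixed length from random start positions in the genome."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - read_len + 1) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]


def coverage(genome_len: int, reads) -> list[int]:
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_len
    for start, seq in reads:
        for i in range(start, start + len(seq)):
            depth[i] += 1
    return depth


rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))   # made-up genome
reads = shotgun_reads(genome, read_len=100, n_reads=80)
depth = coverage(len(genome), reads)
mean_depth = sum(depth) / len(depth)    # n_reads * read_len / genome length
print(mean_depth, "positions with no read:", depth.count(0))
```

Even when the mean depth is high, some positions may remain uncovered by chance, which is why random fragments are sequenced in large numbers.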

7
Sequence quality control
  • High quality sequence data is generated by
    performing multiple reads on both DNA strands
  • Preliminary trace data is then base called and
    assessed for quality using a program such as
    Phred
  • Vector sequences and repeated DNA elements are
    masked off and then the sequence is assembled
    into contigs using a program such as Phrap
  • Remaining inconsistencies must be addressed by
    human curators
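
Phred expresses base-call quality as a score Q = -10 log10(P), where P is the estimated probability that the call is wrong. The sketch below converts between the two and flags low-quality bases; the masking helper and its threshold of 20 are illustrative assumptions, not part of Phred or Phrap.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score for a given base-call error probability."""
    return -10 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Error probability implied by a quality score."""
    return 10 ** (-q / 10)


def mask_low_quality(seq: str, quals: list[int], min_q: int = 20) -> str:
    """Replace bases below the (assumed) threshold with N."""
    return "".join(b if q >= min_q else "N" for b, q in zip(seq, quals))


print(phred_quality(0.001))                         # Q30: 1 error in 1000 calls
print(mask_low_quality("ACGT", [30, 10, 25, 5]))    # prints ANGN
```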

8
Single-pass sequencing
  • Sequence data of lower quality can be generated
    by single reads (single-pass sequencing)
  • Although somewhat inaccurate, single-pass
    sequences such as ESTs and GSSs can be generated
    in large amounts very quickly and inexpensively

9
RNA sequencing
  • Most RNA sequences are deduced from the
    corresponding DNA sequences, but special methods
    are required for the identification of modified
    nucleotides. These include biochemical assays,
    NMR spectroscopy and MS

10
Protein sequencing
  • Most protein sequencing is nowadays carried out
    by MS, a technique in which accurate molecular
    masses are calculated from the mass/charge ratio
    of ions in a vacuum
  • Soft ionization methods allow MS analysis of
    large macromolecules such as proteins
  • Sequences can be deduced by comparing the masses
    of tryptic peptide fragments to those predicted
    from virtual digests of proteins in databases
  • Also, de novo sequencing can be carried out by
    generating nested sets of peptide fragments in a
    collision cell and calculating the difference in
    mass between fragments differing in length by a
    single amino acid residue
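
The de novo approach in the last bullet can be sketched in an idealized form: given the masses of a nested set of fragments, each difference between neighbours is matched to a residue mass. The table holds monoisotopic residue masses for a subset of amino acids only, and the model ignores terminal groups, ion charge and noise, so it is a sketch of the arithmetic, not of a real spectrum interpreter. Function names are hypothetical.

```python
# Monoisotopic residue masses in daltons (subset; Leu and Ile are left out
# because they have the same mass and cannot be told apart this way).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "D": 115.02694, "E": 129.04259,
    "F": 147.06841, "Y": 163.06333,
}


def ladder_masses(peptide: str) -> list[float]:
    """Masses of the nested fragments (prefixes) of a peptide, idealized."""
    total, masses = 0.0, []
    for aa in peptide:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def read_ladder(masses: list[float], tol: float = 0.01) -> str:
    """Infer one residue per mass difference between successive fragments."""
    seq, prev = [], 0.0
    for m in sorted(masses):
        diff = m - prev
        match = [aa for aa, mass in RESIDUE_MASS.items() if abs(mass - diff) <= tol]
        seq.append(match[0] if match else "?")   # "?" marks an unexplained gap
        prev = m
    return "".join(seq)


print(read_ladder(ladder_masses("GASPVTDEFY")))   # prints GASPVTDEFY
```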

11
Importance of protein interactions
  • They underlie most cellular functions.
    Protein-protein interactions result in formation
    of transient or stable multi-subunit complexes
  • Understanding of these complexes is required for
    functional annotation of proteins and is a step
    towards the elucidation of molecular pathways
    such as signaling cascades and regulatory
    networks
  • Protein interactions with nucleic acids form an
    important area of study, since such interactions
    are required for replication, transcription,
    recombination, DNA repair and many other
    processes. Proteins also interact with small
    molecules, which act as ligands, substrates,
    cofactors and allosteric regulators

12
Methods for protein interactions
  • Genetic methods
      • Suppressor mutant
      • Synthetic lethal effect
      • Dominant negative mutations
  • Affinity methods
      • Affinity chromatography
      • Co-immunoprecipitation
  • Molecular and atomic methods
      • X-ray crystallography
      • NMR spectroscopy
  • Other methods
      • FRET
      • SPR spectroscopy
      • SELDI
  • Library-based methods
      • Y2H system

13
Other methods
  • For larger proteins that do not readily form
    crystals, alternative analytical methods are
    required to deduce structures
  • These include X-ray fiber diffraction, electron
    microscopy and circular dichroism (CD)
    spectroscopy

14
Protein structure determination
  • X-ray crystallography
  • NMR spectroscopy
  • Other methods
      • X-ray fiber diffraction
      • Electron microscopy
      • CD spectroscopy

15
X-ray crystallography
  • Involves determination of protein structure by
    studying the diffraction pattern of X-rays through
    a precisely orientated protein crystal
  • The way in which X-rays are scattered depends on
    the electron density and spatial orientation of
    the atoms in the crystal
  • A mathematical method called the Fourier
    transform is used to reconstruct electron density
    maps from the diffraction data, allowing
    structural models to be built
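
The role of the Fourier transform can be illustrated with a one-dimensional toy example: a made-up density profile is transformed into complex coefficients, and the inverse transform recovers it. This is only a sketch of the mathematics. It assumes both amplitudes and phases are known, whereas a diffraction experiment records intensities only, so real structure determination must also recover the phases.

```python
import cmath


def dft(values: list[float]) -> list[complex]:
    """Discrete Fourier transform of a sampled 1-D density profile."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * h * x / n) for x, v in enumerate(values))
        for h in range(n)
    ]


def inverse_dft(coeffs: list[complex]) -> list[float]:
    """Rebuild the density profile from its Fourier coefficients."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * h * x / n) for h, c in enumerate(coeffs)).real / n
        for x in range(n)
    ]


density = [0.0, 1.0, 6.0, 1.0, 0.0, 0.0, 8.0, 0.0]   # made-up "atoms" at x=2 and x=6
coefficients = dft(density)
recovered = inverse_dft(coefficients)
print([round(v, 6) for v in recovered])
```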

16
NMR spectroscopy
  • NMR is a property of certain atoms that can
    switch between magnetic states in an applied
    magnetic field by absorbing electromagnetic
    radiation
  • The nature of the absorbance spectrum is
    influenced by the type of atom and its chemical
    context, so that NMR spectroscopy can
    discriminate between different chemical groups
  • NMR spectra are also modified by the proximity of
    atoms in space
  • Analysis of NMR spectra allows the 3D
    configuration of atoms to be reconstructed,
    resulting in a series of structural models
  • The technique is suitable only for the analysis
    of small, soluble proteins

17
2-D gel electrophoresis
  • The current method for studying proteins relies
    in part on a technique called two-dimensional gel
    electrophoresis, which separates proteins by
    charge and size
  • In this technique, researchers apply a solution
    of cell contents to a narrow polymer strip that
    has a gradient of acidity. When the strip is
    exposed to an electric current, each protein in
    the mixture settles into a layer according to its
    charge. Next, the strip is placed along the edge
    of a flat gel and exposed to electricity again.
    As the proteins migrate through the gel, they
    separate according to their molecular weight.
    The result is a smudgy pattern of dots, each
    of which contains a different protein
  • In academic laboratories, scientists generally
    use a tool similar to a hole puncher to cut the
    protein spots from 2-D gels for individual
    identification by another method, mass
    spectrometry
  • Nowadays, companies have started using robots
    to do this

18
Part 2
  • Genomic Databases

19
Types of databases
  • There are many types of databases available for
    researchers in the field of biology:
      • Primary sequence databases, for storage of raw
        experimental data
      • Secondary databases, which contain information
        on sequence patterns and motifs
      • Organism-specific databases
      • Other databases

20
Primary sequence databases
  • Three primary sequence databases are GenBank
    (NCBI), the Nucleotide Sequence Database (EMBL)
    and the DNA Databank of Japan (DDBJ)
  • These are repositories for raw sequence data, but
    each entry is extensively annotated and has a
    feature table to highlight the important
    properties of each sequence
  • The three databases exchange data on a daily basis

21
Subsidiary sequence databases
  • Particular types of sequence data are stored in
    subsidiaries of the main sequence databases. For
    instance, ESTs are stored in dbEST, a division of
    GenBank
  • There are also subsidiary databases for GSSs and
    unfinished genomic sequence data

22
Organism-specific resources
  • As well as general databases that serve the
    entire biology community, there are many
    organism-specific databases that provide
    information and resources for those researchers
    working on particular species
  • The number of such databases is growing as more
    genome projects are initiated, and many can be
    accessed from general genomics gateway sites such
    as GOLD

23
Organism-specific genomic databases
Organism | Database/resource | URL
Escherichia coli | EcoGene | http://bmb.med.miami.edu/EcoGene/EcoWeb
Escherichia coli | EcoCyc (Encyclopedia of E. coli genes and metabolism) | http://ecocyc.pangeasystems.com/ecocyc/ecocyc.html
Escherichia coli | Colibri | http://genolist.pasteur.fr/Colibri
Bacillus subtilis | SubtiList | http://genolist.pasteur.fr/SubtiList
Saccharomyces cerevisiae | Saccharomyces Genome Database (SGD) | http://genome-www.stanford.edu/Saccharmyces
Plasmodium falciparum | PlasmoDB | http://PlasmoDB.org
Arabidopsis thaliana | MIPS Arabidopsis thaliana Database (MAtDB) | http://mips.gsf.de/proj/thal/db
Arabidopsis thaliana | The Arabidopsis Information Resource (TAIR) | http://www.arabidopsis.org
Drosophila melanogaster | FlyBase | http://flybase.bio.indiana.edu
Caenorhabditis elegans | A C. elegans DataBase (ACeDB) | http://www.acedb.org
Mouse | Mouse Genome Database (MGD) | http://www.informatics.jax.org
Human | OnLine Mendelian Inheritance in Man (OMIM) | http://www.ncbi.nlm.nih.gov/omim

24
Finding organism-specific databases
  • Organism specific databases are widely
    distributed on the Internet
  • In order to find and interrogate databases on
    specific organisms, it is necessary to use a
    gateway site to access relevant databases and
    information resources
  • Worked examples are provided, using GOLD as the
    gateway and illustrated with Ebola virus, the
    bacterium E. coli, the fruit fly Drosophila
    melanogaster and the human genome

25
Useful gateway sites providing information on
multiple organism and genomic resources
Gateway site | URL
NCBI Genomic Biology | www.ncbi.nlm.nih.gov/Genomes/
GOLD (Genomes OnLine Database) | wit.integratedgenomics.com/GOLD
Organism specific genomic databases | www.unl.edu/stc-95/ResTools/biotools/biotools10.html
TIGR Microbial Database | www.tigr.org/tdb/mdb/mdbcomplete.html
Bacterial genomes | genolist.pasteur.fr
Yeast database | genome-www.stanford.edu/Saccharomyces/yeast_info.html
EnsEMBL genome database project | www.ensembl.org
MIPS (Munich Information Centre for Protein Sequences) | mips.gsf.de

26
[Images: nematode; baker's yeast cells]

27
Other databases
  • Specialized sequence databases for storage and
    analysis of particular types of sequences, e.g.
    rRNA and tRNA, introns, promoters and other
    regulatory elements
  • OMIM for the study of human genetics and
    molecular biology
  • Incyte and UniGene, which provide gene sequences
    and transcripts with expert annotation for use in
    drug design and research
  • Structural databases for protein structural
    data (e.g. PDB, MMDB), containing X-ray
    crystallography and NMR studies
  • Databases of proteins and higher-order functions,
    which store information on particular types of
    proteins such as receptors, signal transduction
    components, regulatory hierarchies and enzymes
  • Literature databases, which store scientific
    articles with a text search facility (e.g.
    Medline and PubMed)

28
Database tools for displaying and annotating
genomic sequence data
Viewer/format | URL
Artemis | www.sanger.ac.uk/Software/Artemis
ACeDB | www.acedb.org/Tutorial/brief-tutorial/shtml
Apollo | www.ensembl.org/apollo
EnsEMBL | www.ensembl.org
NCBI map viewer | www.ncbi.nlm.nih.gov
GoldenPath | genome.ucsc.edu

29
Database formats
  • There is no universally agreed format for genome
    databases and several viewers and browsers have
    been developed with graphical displays for
    genomic sequence analysis and annotation
  • One of the most versatile formats is ACeDB
    (originally designed for the nematode C.
    elegans), which has an object-oriented database
    architecture and is now used in many applications
    outside the field of genomic bioinformatics

30
Common formats
  • There are several conventions for representing
    nucleic acid and protein sequences, of which the
    following are widely used:
      • NBRF/PIR
      • FASTA
      • GDE
  • These formats have limited facilities for
    comments, which must include a unique identifier
    code and sequence accession number
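
Of the formats listed, FASTA is the simplest: a header line starting with ">" carries the identifier and an optional comment, and the following lines carry the sequence. A minimal parser sketch, assuming well-formed input and using made-up records:

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Map each record identifier (first word of the header) to its sequence."""
    records: dict[str, list[str]] = {}
    ident = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            ident = header[0] if header else ""
            records[ident] = []
        elif ident is not None:
            records[ident].append(line)      # sequences may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


example = """>seq1 hypothetical example record
GATTACAGATTACA
GATTACA
>seq2 another made-up record
MKTAYIAKQR
"""
print(parse_fasta(example))
```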

31
Formats for multiple sequence alignment
  • There are separate formats for multiple sequence
    alignment representation, of which the following
    are popular:
      • MSF
      • PHYLIP
      • ALN
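
As an example of one of these, the sketch below writes an alignment in the strict sequential PHYLIP layout: a first line with the number of sequences and the alignment length, then one line per sequence with the name padded to ten characters. The function name and the sequences are made up, and the relaxed and interleaved PHYLIP variants are not covered.

```python
def to_phylip(alignment: dict[str, str]) -> str:
    """Serialize aligned sequences of equal length as strict sequential PHYLIP."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    lines = [f"{len(alignment)} {lengths.pop()}"]
    for name, seq in alignment.items():
        lines.append(f"{name[:10]:<10}{seq}")   # names truncated/padded to 10 chars
    return "\n".join(lines)


print(to_phylip({"seqA": "GATT-ACA", "seqB": "GATTTACA"}))
```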

32
Files of structural data
  • Structural data are maintained as flat files
    using the PDB format
  • Such files contain orthogonal atomic co-ordinates
    together with annotations, comments and
    experimental details
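
In a PDB flat file, each atom occupies one fixed-width ATOM (or HETATM) line, with the coordinates in columns 31-54. The sketch below extracts a few fields by column position; the sample record is made up and assembled field by field so that the column layout is explicit.

```python
def parse_atom_record(line: str) -> dict:
    """Extract selected fields from one fixed-width ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not a coordinate record")
    return {
        "serial": int(line[6:11]),           # columns 7-11
        "name": line[12:16].strip(),         # columns 13-16
        "res_name": line[17:20].strip(),     # columns 18-20
        "chain": line[21],                   # column 22
        "res_seq": int(line[22:26]),         # columns 23-26
        "x": float(line[30:38]),             # columns 31-38
        "y": float(line[38:46]),             # columns 39-46
        "z": float(line[46:54]),             # columns 47-54
    }


# Hypothetical record: atom 1, backbone N of MET 1 in chain A
sample = (
    f"{'ATOM':<6}{1:>5} {'N':^4} {'MET'} {'A'}{1:>4}    "
    f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)
print(parse_atom_record(sample))
```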

33
Submission of sequences
  • Sequences may be submitted to any of the three
    primary databases using the tools provided by the
    database curators
  • Such tools include WebIn and BankIt, which can be
    used over the Internet, and Sequin, a stand-alone
    application

34
Database interrogation
  • All the databases discussed above can be searched
    by sequence similarity
  • However, detailed text-based searches of the
    annotations are also possible using tools such as
    Entrez
  • The simplest way to cross-reference between the
    primary nucleotide sequence databases and
    SWISS-PROT is to search by accession number, as
    this provides an unambiguous identifier of genes
    and their products
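
Text-based and accession-based queries can also be issued programmatically. The sketch below only builds request URLs for NCBI's E-utilities web interface to Entrez (esearch for text queries, efetch for retrieval by identifier) and sends nothing. The presentation itself mentions only Entrez, so the use of E-utilities, the helper names, the search term and the accession are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL for a text search of the annotations in one Entrez database."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_fasta_url(db: str, accession: str) -> str:
    """URL for fetching one record, by accession number, as FASTA text."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    )


print(esearch_url("nucleotide", "Ebola virus[Organism]"))
print(efetch_fasta_url("nucleotide", "U00096"))   # example accession as input
```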

35
Databases covered by Entrez
Category | Database
Nucleic acid sequences | Entrez nucleotides: sequences obtained from GenBank, RefSeq and PDB
Protein sequences | Entrez protein: sequences obtained from SWISS-PROT, PIR, PRF, PDB and translations from annotated coding regions in GenBank and RefSeq
3D structures | Entrez Molecular Modeling Database (MMDB)
Genomes | Complete genome assemblies from many sources
PopSet | From GenBank, sets of DNA sequences that have been collected to analyze the evolutionary relatedness of a population
OMIM | OnLine Mendelian Inheritance in Man
Taxonomy | NCBI Taxonomy Database
Books | Bookshelf
ProbeSet | Gene Expression Omnibus (GEO)
3D domains | Domains from the Entrez MMDB
Literature | PubMed

36
Databases covered by DBGET/LinkDB
Category | Database
Nucleic acid sequences | GenBank, EMBL
Protein sequences | SWISS-PROT, PIR, PRF, PDBSTR
3D structures | PDB
Sequence motifs | PROSITE, EPD, TRANSFAC
Enzyme reactions | LIGAND
Metabolic pathways | PATHWAY
Amino acid mutations | PMD
Amino acid indices | AAindex
Genetic diseases | OMIM
Literature | LITDB, Medline
Organism-specific gene catalogs | E. coli, H. influenzae, M. genitalium, M. pneumoniae, M. jannaschii, Synechocystis, S. cerevisiae