Introduction to Computational Molecular and Cell Biology - PowerPoint PPT Presentation


Slides: 44
Provided by: mch88


Transcript and Presenter's Notes

Title: Introduction to Computational Molecular and Cell Biology

Introduction to Computational Molecular and Cell Biology
  • BE 190C (BE131/231 starting Fall 2003)
  • UC Berkeley

Revolutionary Experimental Efforts in Biology
  • Functional Annotation Initiative: gene deletion projects, yeast two-hybrid screening, gene expression microarrays
  • Structural Genomics Initiative: high-throughput effort (NIH), new beamlines (LBNL ALS)
  • Human Genome Initiative: microbial organisms, C. elegans, human
Genome Annotation and Gene Finding
  • Experimentally studied
  • Function by homology
  • Steven Brenner, Dept. of Plant and Microbial Biology, MCB, Bioengineering: computational structural and functional genomics
  • Saira Mian, Lawrence Berkeley National Laboratory: gene-finding, homology detection, expression analysis
  • Lior Pachter, Dept. of Mathematics: algorithms and methods for gene recognition
  • Stephen Holbrook, Lawrence Berkeley National Laboratory: RNA structure analysis and gene-finding
Molecular Biology as Physical Systems
  • Teresa Head-Gordon, Dept. of BE: computational protein folding and structure prediction, characterizing protein environments
  • George Oster, Molecular and Cell Biology: theoretical models of molecular processes of motor proteins
Modeling the Cellular Program
Three mammalian signal transduction pathways that share common molecular elements. From the Signaling PAthway Database (SPAD).
  • Adam Arkin, Dept. of Bioengineering and Chemistry
  • Arup Chakraborty, Dept. of Chemical Engineering: cellular biochemical networks
Gene Expression and Analysis
  • Michael Eisen, LBNL: microarray experiments and data analysis
  • Daniel Rokhsar, Dept. of Physics, JGI: microarray data analysis
  • DNA microarrays measure expression levels and other aspects of gene regulation for all of an organism's genes.
  • Richard Karp, Dept. of EECS and Bioengineering: algorithms for physical mapping and gene expression
  • Terry Speed, Dept. of Statistics: gene mapping, microarray data
  • Mark van der Laan, Dept. of Biostatistics and Statistics, UC Berkeley: statistical microarray data analysis
Computational Biology
  • Molecular genetics, structural biology
  • Discrete mathematics, statistics
  • Linear algebra, calculus, scientific computing
  • BE131/231 (Fall): Introduction to Computational Biology
  • BE143/243 (Spring): Simulation Methods in Biology
  • BE/MCB/PMB 246: Computational Genomics
Molecular Biology Primer
  • DNA, RNA and proteins are large molecules which
    consist of a chain of smaller residues called
    nucleotides or amino acids, respectively
  • Central Dogma
  • DNA makes RNA makes protein
  • Sequence determines structure determines function
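As a rough illustration of "DNA makes RNA makes protein", the sketch below transcribes a short DNA string and translates it codon by codon. It is a minimal sketch under stated assumptions: the input is treated as the coding strand, the codon table is only a small subset of the standard genetic code, and the function names and example sequence are made up for this note.

```python
# Small subset of the standard genetic code (mRNA codon -> one-letter amino acid).
# "*" marks a stop codon. A real table has 64 entries.
CODON_TABLE = {
    "AUG": "M", "UUU": "F", "GGC": "G", "AAA": "K", "UGG": "W",
    "UAA": "*", "UAG": "*", "UGA": "*",
}


def transcribe(dna: str) -> str:
    """DNA coding strand -> mRNA: same sequence with T replaced by U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" = codon not in this subset
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


example_dna = "ATGTTTGGCAAATGGTAA"  # hypothetical example sequence
example_protein = translate(transcribe(example_dna))
```

The third arrow of the slide, sequence to structure to function, is the hard part and is what the rest of the lecture addresses.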

The Twenty Common Amino Acids
  • Hydrophobic
  • Hydrophilic
  • Acidic and Basic

Protein Structure
  • Macromolecular structure divided into
  • primary structure (1D sequence)
  • secondary structure (local 2D, 3D)
  • tertiary structure (global 3D)
  • quaternary structure
Protein Function
Some motifs play a direct part in the function of the protein.
Examples include
  • the 'E-F hand' calcium binding motif
  • the helix-turn-helix DNA binding motif
  • the 'zinc finger' DNA binding motif
Computational Protein Folding
One microsecond simulation of a fragment of the protein villin (Duan & Kollman, Science 1998).
Biophysics is governed by 4 fundamental theories
  • Quantum Mechanics: potential energy surfaces
  • Classical Mechanics: how to move on PE surfaces
  • Statistical Mechanics: microscopic to macroscopic
  • Thermodynamics: what we see in the macroscopic world
  • Numerical simulation when analytical statistical mechanics is intractable

Protein Databases: Computerized storage place for data
  • But a good database is so much more!
  • Permits user-defined inquiries
  • Permits adding, changing, retrieving, deleting of data
  • Permits analysis or visualization of data using
    linked software
  • Some standardized form to avoid dirty data
  • What kind of data do we need to store in
  • DNA sequence data
  • Protein sequence data
  • Protein structure
  • Protein function
  • Literature!

Why do we need Electronic Databases?
  • Explosion of data
  • in a storage sense (high energy physics is
  • in a density of information sense
  • Data and databases are developed at locations remote from the user
  • Data are inter-related, so the sources of information need to talk to each other
  • A fast moving field like computational biology needs the speed!

DNA Sequence or Nucleotide Databases
  • The three major public-domain databases for nucleotide sequences are
  • European Molecular Biology Laboratory (EMBL) Nucleotide Database
  • National Center for Biotechnology Information (NCBI) GenBank database
  • DNA Database of Japan (DDBJ)
  • Large-scale genome sequencing projects tend to maintain their own databases for specific species
  • The number of bases grows at an exponential rate
  • In the NCBI, today's total is 20,197,497,568 bases!

Protein Sequence Databases: SwissProt and TrEMBL
  • Maintained collaboratively by Swiss Institute for
    Bioinformatics (SIB) and the European
    Bioinformatics Institute (EBI)
  • SwissProt has a high level of annotation, a
    minimal level of redundancy, and high level of
    integration with other databases
  • description of protein function, domains,
    structure, post-translational modifications,
    variants, etc.
  • high quality because manually curated (but
    updates are slower)
  • input from GenBank, EMBL, DDBJ, literature,
    individual labs
  • TrEMBL: translation of EMBL (nucleotide) coding sequences to protein sequences that are not yet placed in SwissProt
  • a computer annotated expansion of SwissProt
  • 106,000 to 700,000 protein sequences

Protein Structure Databases
  • Protein Data Bank (PDB): information about protein 3D structure
  • http//
  • Typically x-ray, NMR experiment
  • Some predicted or theoretical models
  • BioMagResBank
  • NMR only
  • http//

The Protein Data Bank is the single worldwide
archive of primary structural data of biological
macromolecules. Many secondary sources of
information are derived from PDB data. It is the
starting point for studies in structural
bioinformatics. Berman, Westbrook et al. (2000),
Nucleic Acids Res. 28, 235-242 http//
Protein Sequence Databases: Secondary Sources
  • PROSITE http//
  • Database of protein families and domains
  • Grouped on the basis of similarities in their sequences into a limited number of families (fingerprint)
  • Proteins or protein domains belonging to a particular family generally share functional attributes and/or are derived from a common ancestor
  • Maintained by the European Bioinformatics Institute (EBI) and the Swiss Institute for Bioinformatics (SIB)
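A sequence "fingerprint" of the kind PROSITE stores can be written as a pattern and scanned against a protein sequence. The sketch below converts a pattern in PROSITE-style syntax into a Python regular expression. It is only an illustration: the function name is hypothetical, the converter ignores the `<` and `>` anchors, and the C2H2-zinc-finger-style pattern and the test sequence are written here for demonstration and should be checked against PROSITE itself before any real use.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regular expression syntax.

    Handles residues, x (any residue), [..] (one of), {..} (none of),
    and repeat counts such as (3) or (2,4). Anchors are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


# Illustrative pattern in the style of a C2H2 zinc finger signature.
zinc_finger_like = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger_like)

# Made-up sequence that contains one match starting at index 2.
sequence = "AACPECGKSFSRSDELTRHIRIHTG"
match = re.search(regex, sequence)
```

Regular expressions capture only exact pattern membership; profile-based methods are needed when family members diverge beyond a fixed pattern.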
Protein Structure Databases: Secondary Sources
  • Structural Classification of Proteins (SCOP) http//
  • Nearly all proteins have structural similarities with other proteins and sometimes share a common evolutionary origin.
  • The SCOP database, created by manual inspection and abetted by a battery of automated methods, aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.
  • Class, Architecture, Topology, Homologous superfamily (CATH)
  • Class, derived from secondary structure content, is assigned for more than 90% of protein structures automatically.
  • Architecture, which describes the gross orientation of secondary structures, independent of connectivities, is currently assigned manually.
  • The Topology level clusters structures according to their topological connections and numbers of secondary structures.
  • Homologous superfamilies cluster proteins with highly similar structures and functions.
  • The assignments of structures to topology families and homologous superfamilies are made by sequence and structure comparisons.
Other Computational Biology Databases
  • MEDLINE: bibliographic database covering the fields of medicine, nursing, dentistry, veterinary medicine, the health care system, and the preclinical sciences. http// abases/MEDLINE/medline.html
  • KEGG: the Kyoto Encyclopedia of Genes and Genomes is a database of current knowledge of information pathways that consist of interacting molecules or genes; it provides links from the gene catalogs produced by genome sequencing projects. http//
  • ArrayExpress: a public repository for microarray-based gene expression data. http//
Implicit Collaborations with Computer Science
  • Computer hardware, portability: applications described running on various platforms (T3D, T3E, IBM SP's, ASCI Red, Blue)
  • Information technologies and database management: integrating biological databases, data warehousing, ultra-high-speed networks
  • Ensuring scalability on parallel architectures: implicit algorithmic scaling, paradigm/software library support, tools for effective parallelization strategies, 100 teraflop
  • Meta problem solving environments: geographically distributed software paradigm, plug and play paradigm
  • Visualization: querying data which is information dense
The Need for Advanced Computing for Computational Biology
Computational complexity arises from inherent factors
  • 40,000 gene products just from human, plus genes from many other organisms
  • Experimental data is accumulating rapidly
  • $N^2$, $N^3$, $N^4$, etc. interactions between gene products
  • Combinatorial libraries of potential drugs/ligands
  • New materials that elaborate on native gene products from many organisms
Algorithmic issues to make it tractable
  • Objective functions
  • Optimization
  • Treatment of long-ranged interactions
  • Overcoming size and time-scale bottlenecks
  • Statistics
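To put numbers on the combinatorial growth mentioned above, the short calculation below counts how many distinct pairs, triples and quadruples can be formed from the slide's figure of 40,000 gene products. It is plain counting with `math.comb`; the variable names are chosen for this note, and it says nothing about which of those combinations actually interact.

```python
from math import comb

n_gene_products = 40_000  # figure quoted on the slide

# Number of distinct k-way combinations, growing roughly as N^k / k!
n_pairs = comb(n_gene_products, 2)
n_triples = comb(n_gene_products, 3)
n_quadruples = comb(n_gene_products, 4)

for k, count in ((2, n_pairs), (3, n_triples), (4, n_quadruples)):
    print(f"{k}-way combinations: {count:.3e}")
```

Even the pairwise count is close to a billion, which is why exhaustive enumeration gives way to the algorithmic issues listed on the slide.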
Computational Challenges in Biology
  (1) Drawing analogies with known protein structures: bioinformatics; sequence homology, structural homology; inverse folding, threading
  (2) Ab initio prediction: the ability to extrapolate to unknown folds; multiple minima problem; robust objective function
  (3) Ab initio folding: the ability to follow kinetics, mechanism; robust objective function; severe time-scale problem; proper treatment of long-ranged interactions
Computational Protein Folding
One microsecond simulation of a fragment of the protein villin (Duan & Kollman, Science 1998).
  (1) Robust objective function? All-atom simulation with molecular water present; some structure present
  (2) Severe time-scale problem? Required $10^9$ energy and force evaluations; parallelization (spatial decomposition)
  (3) Proper treatment of long-ranged interactions? X: cut-off interactions at 8 Å, poor by known simulation standards
  (4) Statistics (1 trajectory is anecdotal)? X: many trajectories required to characterize kinetics and thermodynamics
Computational Protein Folding
One microsecond simulation of a fragment of the protein villin (Duan & Kollman, Science 1998).
  (1) Robust objective function? All-atom simulation with molecular water present. Proper treatment of long-ranged interactions? X: cut-off interactions at 8 Å, poor by known simulation standards
  (2) Severe time-scale problem? Required $10^9$ energy/force evaluations; parallelization (spatial decomposition)
  (3) Statistics (1 trajectory is anecdotal)? X: many trajectories required to characterize kinetics and thermodynamics
The small protein did not fold.
Free Energy Function
  • Empirical force field (less accurate): cost $c N^2$
  • Ab initio (more accurate): cost $C N^3$ or worse, with $c \ll C$
  • Approximations to Schroedinger's equation: the field is making progress in this area, but it is still too difficult to solve for protein folding
  • Scales as $N$: fast timescales
  • Scales as $N^2$: slow timescales
Executing Motion on this Surface: Molecular Dynamics Simulation
  • Classical mechanics is determined by Hamilton's equations
  • $q_i$ are positions and $p_i$ are conjugate momenta
  • Phase space is defined as the union of $q_i$ and $p_i$
  • The state of our system at time $t$ is a point in phase space.
  • Knowing $q_i$ and $p_i$ at $t_0$ determines the trajectory
  • Newton's equations of motion are Hamilton's equations in Cartesian coordinates
Properties of Hamilton or Newtonian Dynamics
Hamilton's equations satisfy three symplectic properties with respect to time trajectories
  1. Time reversibility, i.e. invariance under $t \to -t$, $p \to -p$, $q \to q$ (easy to show this)
  2. Conservation of energy, i.e. $H(q,p)$ is the same for all times: $dH/dt = 0$
  3. Liouville's theorem, i.e. conservation of phase-space volumes: the volume of a phase-space element is invariant to time evolution (show proof)
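The proofs themselves are not in the transcript. A short sketch of properties 2 and 3 follows, assuming $H$ has no explicit time dependence and using Hamilton's equations $\dot{q}_i = \partial H/\partial p_i$, $\dot{p}_i = -\partial H/\partial q_i$.

```latex
% Conservation of energy: chain rule, then substitute Hamilton's equations
\frac{dH}{dt}
  = \sum_i \left( \frac{\partial H}{\partial q_i}\,\dot{q}_i
                + \frac{\partial H}{\partial p_i}\,\dot{p}_i \right)
  = \sum_i \left( \frac{\partial H}{\partial q_i}\frac{\partial H}{\partial p_i}
                - \frac{\partial H}{\partial p_i}\frac{\partial H}{\partial q_i} \right)
  = 0

% Liouville's theorem: the phase-space flow (\dot{q}, \dot{p}) is divergence free
\sum_i \left( \frac{\partial \dot{q}_i}{\partial q_i}
            + \frac{\partial \dot{p}_i}{\partial p_i} \right)
  = \sum_i \left( \frac{\partial^2 H}{\partial q_i\,\partial p_i}
                - \frac{\partial^2 H}{\partial p_i\,\partial q_i} \right)
  = 0
```

A divergence-free flow neither compresses nor expands a region of phase space, which is the volume-conservation statement.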
Numerical Simulation of Hamilton or Newtonian Dynamics
Leap Frog Algorithms (Taylor expansions)
Each integration cycle involves four steps
  1. calculate new position
  2. calculate velocities at mid-step
  3. calculate forces at new position
  4. complete the velocity move
The scheme satisfies
  1. Time reversibility
  2. Liouville's theorem, i.e. conservation of phase-space volumes (shear transformations only)
  3. Conservation of energy (almost), i.e. $H(q,p)$ is a constant to $O(\Delta t^3)$
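The four-step cycle listed on this slide corresponds to what is usually called the velocity Verlet form of the leapfrog family. Below is a minimal sketch for a single particle in one dimension. The harmonic force, unit mass, step size and function names are assumptions chosen for illustration, not values from the lecture.

```python
def velocity_verlet(q, v, force, mass, dt, n_steps):
    """Advance position q and velocity v by n_steps of size dt."""
    f = force(q)
    for _ in range(n_steps):
        q = q + dt * v + 0.5 * dt * dt * f / mass  # 1. new position
        v_half = v + 0.5 * dt * f / mass           # 2. velocity at mid-step
        f = force(q)                               # 3. force at new position
        v = v_half + 0.5 * dt * f / mass           # 4. complete the velocity move
    return q, v


def harmonic_force(q):
    """Force for the toy potential V(q) = q^2 / 2 (spring constant 1)."""
    return -q


def energy(q, v, mass=1.0):
    """Total energy for the toy harmonic oscillator."""
    return 0.5 * mass * v * v + 0.5 * q * q


q0, v0 = 1.0, 0.0
q1, v1 = velocity_verlet(q0, v0, harmonic_force, mass=1.0, dt=0.01, n_steps=10_000)
energy_drift = abs(energy(q1, v1) - energy(q0, v0))

# Time reversibility: flip the velocity, integrate the same number of steps,
# and the particle should retrace its path back to the starting point.
q_back, v_back = velocity_verlet(q1, -v1, harmonic_force, mass=1.0, dt=0.01, n_steps=10_000)
```

In this toy run the energy fluctuates within a small bound instead of drifting, and the reversed trajectory returns to the start up to floating-point round-off, which is the behavior the properties on the slide describe.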
How long does it take to fold a protein?
  • Proteins fold in a matter of milliseconds to seconds
  • Some experimental observables are captured on the microsecond timescale
  • Time-scale of motions bottlenecks the timestep ($\Delta t$): the timestep is limited by the fastest timescale in your system
  • Bond vibrations have a period of $10^{-14}$ seconds (10 fs): $\Delta t$ of 1 fs
  • Shake/rattle bonds (project out force along bond): $\Delta t$ of 2 fs
  • Multiple timescale algorithms (4 fs to 10 fs) (active area of research)
  • Preserve symplectic, reversible properties!
  • So we need $10^9$ to $10^{15}$ energy and force evaluations to try to fold
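The range of evaluation counts follows from dividing the simulated time by the timestep. The sketch below does that arithmetic in integer femtoseconds to avoid floating-point rounding; the constant and function names are made up for this note, and the timesteps are the ones quoted on the slide.

```python
# Durations expressed in femtoseconds (1 fs = 1e-15 s).
MICROSECOND_FS = 10**9
MILLISECOND_FS = 10**12
SECOND_FS = 10**15


def n_force_evaluations(simulated_time_fs: int, timestep_fs: int) -> int:
    """One energy/force evaluation per timestep."""
    return simulated_time_fs // timestep_fs


for label, duration in (("1 microsecond", MICROSECOND_FS),
                        ("1 millisecond", MILLISECOND_FS),
                        ("1 second", SECOND_FS)):
    for dt in (1, 2):  # plain MD, and MD with constrained bonds
        steps = n_force_evaluations(duration, dt)
        print(f"{label} at dt = {dt} fs: {steps:.1e} evaluations")
```

Doubling the timestep with constrained bonds halves the count but leaves the order of magnitude essentially unchanged, which is why the time-scale problem is called severe.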
Ewald Sums: the best way to do electrostatics
  • Conventional algorithm scales as $N^{3/2}$ at best
  • Particle Mesh Ewald (N)
  • Spatial Decomposition in r-space Parallelization
    of FFT's in k-space
  • Evaluate full Ewald sum in r-space using FMM

$N^2$ evaluation of energy and forces
$N$ evaluation of energy and forces
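For contrast with the Ewald-based methods, the quadratic cost comes from visiting every pair of particles once. The sketch below is the naive direct Coulomb sum for an isolated cluster in reduced units. It is explicitly not an Ewald sum (no periodic images, no reciprocal-space term), and the function name and test charges are assumptions for illustration.

```python
import itertools
import math


def coulomb_direct_sum(charges, positions):
    """Sum q_i * q_j / r_ij over all distinct pairs; returns (energy, pair count)."""
    energy = 0.0
    n_pairs = 0
    particles = zip(charges, positions)
    for (qi, ri), (qj, rj) in itertools.combinations(particles, 2):
        energy += qi * qj / math.dist(ri, rj)
        n_pairs += 1
    return energy, n_pairs


# Two unit charges separated by a distance of 2 (reduced units).
pair_energy, pair_count = coulomb_direct_sum(
    [1.0, 1.0], [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
)
```

The pair count is N(N-1)/2, so doubling the number of atoms roughly quadruples the work per step.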
Scope of the BlueGene Project
  • BlueGene
  • Is a RESEARCH project, involving computer
    hardware, software and science groups
  • Began December 1999
  • $100 M over 5 years
  • Synergy between IBM's interests and capabilities
    in high performance computing and the scientific
    needs of the field
  • The project will develop a computer with
    petaflop-level performance and use this computer
    for large scale biomolecular simulation to
    advance the understanding of biologically
    important processes, in particular that of the
    mechanisms behind protein folding.
  • The project will also advance our knowledge of
    cellular architectures (massively parallel
    computer systems built of replicable cells that
    integrate processors, memory and communication),
    and of the software needed to exploit those

Computational Protein Folding: Blue Gene
(2) Severe time-scale problem? Parallelization (spatial decomposition)
  • Replicated algorithm: stores all coordinates and forces at each processor; low overhead maintaining data distribution; $O(N)$ communication
  • Spatial decomposition: stores only coordinates and forces of atoms in box; pass data of 26 boxes within cut-off r in 6 passes; distributed data, better locality and scalability
  • Multiple time step not used; symplectic (important for long timescale numerical integration)!
  • $V_{fast}$ are updated more frequently with cost $O(N)$
  • $V_{slow}$ are updated less frequently with cost $O(N^2)$ or $O(N^{3/2})$
  • $O(M^2)$
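The "26 boxes" in the spatial decomposition are the neighbors of a box in a 3D grid whose edge is at least the cut-off: any atom within the cut-off of a given atom must sit in the same box or one of those 26. The serial sketch below shows the idea with a cell list and checks it against a brute-force pair search. It is a single-process illustration with no periodic boundaries and no message passing; the function names and the random test data are assumptions, not part of the Blue Gene design.

```python
import itertools
import math
import random
from collections import defaultdict


def pairs_brute_force(points, cutoff):
    """All index pairs closer than cutoff, checking every pair: O(N^2)."""
    return {(i, j) for i, j in itertools.combinations(range(len(points)), 2)
            if math.dist(points[i], points[j]) < cutoff}


def pairs_cell_list(points, cutoff):
    """Same pairs, but each atom is only compared with atoms in its own
    box and the 26 surrounding boxes (box edge equals the cutoff)."""
    cells = defaultdict(list)
    for index, point in enumerate(points):
        cells[tuple(math.floor(c / cutoff) for c in point)].append(index)

    offsets = list(itertools.product((-1, 0, 1), repeat=3))  # own box + 26 neighbors
    found = set()
    for cell, members in cells.items():
        for offset in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, offset))
            for i in members:
                for j in cells.get(neighbor, ()):
                    if i < j and math.dist(points[i], points[j]) < cutoff:
                        found.add((i, j))
    return found


rng = random.Random(0)
atoms = [(rng.uniform(0, 10), rng.uniform(0, 10), rng.uniform(0, 10))
         for _ in range(200)]
n_neighbor_boxes = len(list(itertools.product((-1, 0, 1), repeat=3))) - 1
```

In a parallel code each processor would own one or more boxes and exchange only the atoms in neighboring boxes, which is where the locality and scalability advantage over the replicated approach comes from.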
Blue Gene: A Special Purpose Computer for Simulating Protein Folding
  (1) Robust objective function: all-atom simulation with molecular water present; proper treatment of long-ranged interactions. Blue Gene will do proper Ewald. Part of the objective is to interrogate free energy functions
  (2) Severe time-scale problem: required $10^9$ energy/force evaluations; parallelization (spatial decomposition). Blue Gene will simulate on the microsecond-millisecond timescale
  (3) Statistics (1 trajectory is anecdotal): this is where Blue Gene can make a difference!
Ab Initio Protein Structure Prediction
Funneled Energy Landscape
Native State: Global Free Energy
  • Sequence, an objective function, a search method → Tertiary Structure
  • Protein and Aqueous Solvent Energy Surface
  • Incorporation of Constraints Predicted by Machine Learning Methods
  • Global Optimization Approach to Predict Tertiary Structure
  • Parallelization of Tree Search Problems

Protein Structure Prediction Illustrates How
Computational Biology is Multi-Disciplinary
  • Use of Constraints Predicted by Machine Learning
  • AI/Bioinformatics
  • Global Optimization Approach to Predict Tertiary Structure
  • Mathematical Optimization/Applied Mathematics
  • Parallelization of Tree Search Problems
  • Computer Science/Tools
  • Protein and Aqueous Solvent Energy Surface
  • Biophysics and physical chemistry
  • Experiments and theory

Critical Assessment of Structure Prediction (CASP)
  • It consists of three parts
  • The collection of targets from the experimental community
  • The collection of blind predictions from the
    modeling community over a period of 3 months
  • Comparative modeling (high sequence homology)
  • Fold recognition (high structural homology)
  • Ab initio (genuine new folds generally
  • The assessment and discussion of the results.
  • Organizers ranked protein targets by difficulty
  • Various objective measure/metrics have been

Global Optimization Algorithm: Stochastic
Stochastic/perturbation in sub-space of dihedral angles predicted coil
  (1) Local minimization of a set of start points in sub-space
  (2) Define a critical radius: a measure of whether a point is within a basin of attraction
  (3) Generate many sample points in sub-space volume, V
  (4) Evaluate r.m.s. between new sample points and minimizers of (1); if (r.m.s. < $r_k$) ignore this sample point
  (5) Minimize sample points not in critical distance, merge into (1)
  • Choose new set of coil dihedral angles and repeat
Crivelli, Philip, Byrd, Eskow, Schnabel, Yu, Head-Gordon (1999). In New Trends in Computational Methods for Large Molecular Systems, in press.
  • Probabilistic theoretical guarantees of global optimum in sub-spaces
  • Global optimization of full space: solve series of global optimum in sub-spaces?
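The sampling loop in steps (1) to (5) can be sketched on a toy problem. The example below applies the same idea to a one-dimensional double-well function rather than to protein dihedral angles. It is not the authors' code: the test function, the plain gradient-descent local minimizer, the fixed critical radius and all parameter values are assumptions made for illustration.

```python
import random


def f(x):
    """Toy energy with two minima; the deeper one is near x = -1.47."""
    return x**4 - 4 * x**2 + x


def grad_f(x):
    return 4 * x**3 - 8 * x + 1


def local_minimize(x, step=0.01, n_iter=5000):
    """Plain gradient descent standing in for a real local minimizer."""
    for _ in range(n_iter):
        x -= step * grad_f(x)
    return x


def stochastic_global_search(bounds, n_start, n_samples, r_crit, n_rounds, seed=0):
    rng = random.Random(seed)
    low, high = bounds
    minimizers = []

    def minimize_and_merge(x):
        x_min = local_minimize(x)
        if all(abs(x_min - m) > 1e-3 for m in minimizers):
            minimizers.append(x_min)

    for _ in range(n_start):                       # (1) minimize start points
        minimize_and_merge(rng.uniform(low, high))
    for _ in range(n_rounds):
        samples = [rng.uniform(low, high) for _ in range(n_samples)]  # (3)
        for x in samples:
            # (2), (4): skip samples inside the critical radius of a known minimizer
            if any(abs(x - m) < r_crit for m in minimizers):
                continue
            minimize_and_merge(x)                  # (5) minimize and merge
    return min(minimizers, key=f), minimizers


best_x, all_minimizers = stochastic_global_search(
    bounds=(-2.5, 2.5), n_start=2, n_samples=50, r_crit=0.3, n_rounds=3
)
```

The critical radius is what saves work: samples that fall close to an already known minimizer are discarded without paying for another local minimization.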

Execution time differs by minutes
Add new layer to hierarchy
  • Static number of tree nodes (14 configurations): increasing number of supervisors; gain in efficiency of a factor of 4-8 depending on job size
  • Expanded number of tree nodes (14 to 83 conformations): no loss in scalability
  • Hierarchical/dynamic load balancing: generic to large tree search problems
Crivelli & Head-Gordon (2000). Submitted to J. Parallel and Distributed Computing
Our CASP Blind Prediction Results
Emphasize: ab initio methods can be complementary to other approaches that rely on database tertiary structure information
Crivelli, Eskow, Bader, Lamberti, Byrd, Schnabel, Head-Gordon (2001). Biophysical Journal, in press
T124: New Fold, One of Most Difficult Targets
  • Submitted to CASP4: lowest energy structure, RMSD 8.8, EQR1148
  • Runs after CASP4: lowest energy structure, RMSD 7.7
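The RMSD values quoted above measure how far a predicted structure lies from the experimental one. A minimal sketch of the calculation follows. It assumes the two coordinate sets are already superimposed and list the same atoms in the same order; assessments such as CASP first find the optimal superposition, which is omitted here, and the function name and test coordinates are made up for this note.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two atoms, each displaced by exactly one unit along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
deviation = rmsd(reference, predicted)
```

Because the deviations are squared before averaging, a few badly placed regions can dominate the value, which is one reason CASP assessors use several metrics.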