Title: CS 598SS Probabilistic Methods in Biological Sequence Analysis
1CS 598SS Probabilistic Methods in Biological Sequence Analysis
2What is the course about?
- Bioinformatics / Computational Biology
- Algorithms for analyzing genomes
- Probabilistic methods
3What is the course format?
- Research course
- Lectures by instructor
- Student presentations of research papers
- 1 student per paper
- 2 papers in a session, typically
- Research project presentation
- 1-3 students per project
- 20-30 minute presentation at end of course
4Grading
- Project: 60%
- Paper presentation: 20%
- Participation (including assignments or quizzes, if any): 20%
5Expectations
- Programming skills (for the project)
- Basic exposure to probability theory
- Basic exposure to algorithms
6What you can do at the end of the course
- Start working on research projects in bioinformatics / biological sequence analysis
- Use principled approaches, supported by probability theory, instead of ad hoc methods
- If project succeeds, publish a paper in bioinformatics
- Join me as a graduate advisee?
7Administrative Details
- Instructor
- Saurabh Sinha
- Room 2122, Siebel Center
- Email: sinhas_at_cs.uiuc.edu
- Class hrs: Wed & Fri, 12:30pm-1:45pm, 1111 SC
- Office hrs: Thursdays, 2pm-3pm, 2122 SC
- CRN: 46042
- Credits: 4 graduate hrs
- Welcome to sit in, if not taking for credit
8Recommended books
- Not required, but recommended
- Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids -- Durbin, Eddy, Krogh, Mitchison
- Bioinformatics: The Machine Learning Approach -- Baldi, Brunak
9Companion Course
- CS498-CXZ Algorithms in Bioinformatics (Fall 2005)
- T/Th 12:30pm-1:45pm
- http://sifaka.cs.uiuc.edu/course/498cxz05f/info.html
10Why study bioinformatics?
- Molecular biology is the new frontier of 21st century science
- Computer science is the crown prince of 20th century engineering
- Bioinformatics is the application and development of computer science with the goal of supporting molecular biology
11Why study bioinformatics?
- Flood of data: several gigabytes of sequence and gene expression data
- Noise in the data
- Biological
- Experimental
- Algorithms needed to make discoveries
- Probabilistic methods
- Need for efficiency
12Why study bioinformatics?
- The big picture
- Human health and quality of life
- Fundamental science
- Billions of dollars being spent
- Health research gets the major chunk of the US Govt's funds
- Fundamental health research is at the molecular level
- Molecular biology research is increasingly a quantitative science
13Fundamental Science
- Recent issue of Science: top 25 questions
- What Is the Universe Made Of?
- What is the Biological Basis of Consciousness?
- Why Do Humans Have So Few Genes?
- To What Extent Are Genetic Variation and Personal Health Linked?
- Can the Laws of Physics Be Unified?
- How Much Can Human Life Span Be Extended?
- What Controls Organ Regeneration?
- How Can a Skin Cell Become a Nerve Cell?
- How Does a Single Somatic Cell Become a Whole Plant?
- How Does Earth's Interior Work?
- Are We Alone in the Universe?
- How and Where Did Life on Earth Arise?
- What Determines Species Diversity?
- What Genetic Changes Made Us Uniquely Human?
- How Are Memories Stored and Retrieved?
- How Did Cooperative Behavior Evolve?
- How Will Big Pictures Emerge from a Sea of Biological Data?
- How Far Can We Push Chemical Self-Assembly?
- What Are the Limits of Conventional Computing?
- Can We Selectively Shut Off Immune Responses?
- Do Deeper Principles Underlie Quantum Uncertainty and Nonlocality?
- Is an Effective HIV Vaccine Feasible?
- How Hot Will the Greenhouse World Be?
- What Can Replace Cheap Oil -- and When?
- Will Malthus Continue to Be Wrong?
14(No Transcript)
15Heredity and DNA
- Heredity: children resemble parents
- Easy to see
- Hard to explain
- DNA discovered as the physical (molecular) carrier of hereditary information
- Watson & Crick explain DNA structure in 1953
16Life, Cells, Proteins
- The study of life -> the study of cells
- Cells are born, do their job, duplicate, die
- All these processes controlled by proteins
17Protein functions
- Enzymes (catalysts)
- Control chemical reactions in cell
- E.g., Aspirin inhibits an enzyme that produces the inflammation messenger
- Transfer of signals/molecules between and inside cells
- E.g., sensing of environment
- Regulate activity of genes
18DNA
- DNA is a molecule: deoxyribonucleic acid
- Double helical structure (discovered by Watson, Crick & Franklin)
- Chromosomes are densely coiled and packed DNA
19Chromosome
DNA
SOURCE: http://www.microbe.org/espanol/news/human_genome.asp
20The DNA Molecule
5'
G -- C   A -- T   T -- A   G -- C   C -- G   G -- C   T -- A
G -- C   T -- A   T -- A   A -- T   A -- T   C -- G   T -- A
3'
Figure labels: Base, Nucleotide
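The pairing rule visible in the figure (A with T, G with C) can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, and the input string is simply the left-hand base of each pair on the slide, read in order.

```python
# Watson-Crick pairing as drawn on the slide: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the partner base for each position, in the same left-to-right order."""
    return "".join(COMPLEMENT[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the partner strand read in the opposite direction."""
    return complement(strand)[::-1]

# Left-hand bases of the 14 pairs shown on the slide.
left = "GATGCGTGTTAACT"
right = complement(left)            # the right-hand bases of the same pairs
opposite = reverse_complement(left) # the partner strand, read end to start
print(left, right, opposite)
```

Because the two strands run in opposite directions, sequence analysis tools usually work with the reverse complement rather than the plain complement.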
21Protein
- A protein is a sequence (chain) of amino acids
- 20 possible amino acids
- The amino acid sequence folds into a 3-D structure called a protein
22Protein Structure
Figure labels: Protein, DNA
The DNA repair protein MutY (blue) bound to DNA (purple). PNAS cover, courtesy Amie Boal.
23From DNA to Amino-acid sequence
Cell
SRC: http://www.biologycorner.com/resources/DNA-RNA.gif
24From DNA to Protein: In words
- DNA: nucleotide sequence
- Alphabet size 4 (A,C,G,T)
- DNA -> mRNA (single stranded)
- Alphabet size 4 (A,C,G,U)
- mRNA -> amino acid sequence
- Alphabet size 20
- Amino acid sequence folds into 3-dimensional molecule called protein
25What about RNA?
- RNA: ribonucleic acid
- U instead of T
- Usually single stranded
- Has base-pairing capability
- Can form simple non-linear structures
- Life may have started with RNA
26Central Dogma
- Information flows from DNA to RNA to Protein
- Once information has passed on to protein, it cannot come back to DNA
27DNA and genes
- DNA is a very long molecule
- If kept straight, will cover 5cm (!!) in human cell
- DNA in human has 3 billion base-pairs
- String of 3 billion characters!
- DNA harbors genes
- A gene is a substring of the DNA string
- A gene codes for a protein
28Genes code for proteins
- DNA -> mRNA -> protein can actually be written as Gene -> mRNA -> protein
- A gene is typically a few hundred base-pairs (bp) long
29Transcription
- Process of making a single stranded mRNA using double stranded DNA as template
- Only genes are transcribed, not all DNA
- Gene has a transcription start site and a transcription stop site
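Treating DNA as a string, transcription of one gene can be sketched as "cut out the region between the start and stop sites, then write U for T". The sketch below is a deliberate simplification: it assumes the input is the strand whose sequence matches the mRNA, and the function name, coordinates, and toy sequence are all hypothetical.

```python
def transcribe(dna: str, start: int, stop: int) -> str:
    """Return the mRNA for dna[start:stop] (0-based, stop exclusive).

    Simplifying assumption: `dna` is the strand that reads like the mRNA,
    so transcription amounts to replacing T with U.
    """
    gene = dna[start:stop]          # only the gene is transcribed, not all DNA
    return gene.replace("T", "U")

# Toy sequence: a 9-base "gene" flanked by untranscribed DNA.
dna = "CCCATGGCTTAAGGG"
mrna = transcribe(dna, start=3, stop=12)
print(mrna)
```

Real transcription reads the template strand and builds the complementary RNA; the end result matches the shortcut above, which is why the shortcut is common in sequence analysis code.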
30Step 1: From DNA to mRNA
Transcription
SOURCE: http://academy.d20.co.edu/kadets/lundberg/DNA_animations/rna.dcr
31Translation
- Process of making an amino acid sequence from (single stranded) mRNA
- Each triplet of bases translates into one amino acid
- Each such triplet is called a codon
- The translation is basically a table lookup
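The "table lookup" view of translation can be made concrete with a short Python sketch. It builds the standard genetic code as a dictionary of 64 codons and reads the mRNA three bases at a time. The names are made up for illustration, and the sketch assumes the reading frame starts at position 0 and ends at the first stop codon (marked `*`), which glosses over how a real cell finds the start.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G (third base fastest).
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base U
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]   # the table lookup
        if amino_acid == "*":                     # stop codon ends the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("AUGGCUUAA"))
```

With 4 bases there are 4 x 4 x 4 = 64 possible codons but only 20 amino acids, so several codons map to the same amino acid.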
32(No Transcript)
33The Genetic Code
SOURCE: http://www.bioscience.org/atlases/genecode/genecode.htm
34Step 2: mRNA to Amino acid sequence
Translation
SOURCE: http://bioweb.uwlax.edu/GenWeb/Molecular/Theory/Translation/translation.htm
35Gene structure
SOURCE: http://www.wellcome.ac.uk/en/genome/thegenome/hg02b001.html
36Gene structure
- Exons and Introns
- Introns are spliced out, and are not part of mRNA
- Promoter (upstream) of gene
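Splicing can be pictured as keeping the exon intervals of the transcript and joining them in order. The Python sketch below assumes the exon coordinates are already known (finding them is the hard part in practice); the function name, coordinates, and toy sequence are hypothetical, with the intron written in lowercase only to make it visible.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the exon intervals (0-based, end exclusive); introns are dropped."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Toy transcript: exon (6 bases) + intron (10 bases) + exon (6 bases).
pre_mrna = "AUGGCC" + "guaaguuuag" + "UUUUAA"
exons = [(0, 6), (16, 22)]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)
```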
37Gene expression
- Process of making a protein from a gene as template
- Transcription, then translation
- Can be regulated
38Gene Regulation
- Chromosomal activation/deactivation
- Transcriptional regulation
- Splicing regulation
- mRNA degradation
- mRNA transport regulation
- Control of translation initiation
- Post-translational modification
39Transcriptional regulation
TRANSCRIPTION FACTOR
40Transcriptional regulation
TRANSCRIPTION FACTOR
41The importance of gene regulation
42Genetic regulatory network controlling the development of the body plan of the sea urchin embryo
Davidson et al., Science, 295(5560):1669-1678.
43- That was the circuit responsible for development of the sea urchin embryo
- Nodes: genes
- Switches: gene regulation
- Change the switches and the circuit changes
- Gene regulation: significance
- Development of an organism
- Functioning of the organism
- Evolution of organisms
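One way to picture "change the switches and the circuit changes" is a toy Boolean network, where each gene is on or off and each rule says how a gene responds to the others. The three genes A, B, C, their wiring, and the synchronous update rule below are all made up for illustration; this is not the sea urchin network from the slide.

```python
def run(rules, state, steps=3):
    """Update every gene at once from the previous state, `steps` times."""
    for _ in range(steps):
        state = {gene: rule(state) for gene, rule in rules.items()}
    return state

# Hypothetical circuit: A stays as it is, A switches B on, B switches C on.
original = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: s["B"],
}
# Same genes, one switch changed: B now switches C off.
rewired = dict(original, C=lambda s: not s["B"])

start = {"A": True, "B": False, "C": False}
print(run(original, start))
print(run(rewired, start))
```

The genes are identical in both circuits; only the regulation of C differs, and that alone changes which genes end up switched on.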
44Genome
- The entire sequence of DNA in a cell
- All cells have the same genome
- All cells came from repeated duplications starting from initial cell (zygote)
- Human genome is 99.9% identical among individuals
- Human genome is 3 billion base-pairs (bp) long
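The last two figures can be combined into a quick back-of-the-envelope estimate. The snippet below reads 99.9 as a percentage and applies it uniformly along the genome, which is a simplifying assumption; the variable names are illustrative.

```python
genome_length_bp = 3_000_000_000   # "3 billion base-pairs" from the slide
identity = 0.999                   # 99.9, read as a percentage

# Positions at which two individuals would differ under this simple reading.
differing_bp = round(genome_length_bp * (1 - identity))
print(f"{differing_bp:,} base-pairs")
```

So even a 0.1% difference corresponds to about 3 million positions, which is why small percentages still matter at genome scale.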
45Genome features
- Genes
- Regulatory sequences
- The above two make up 5% of the human genome
- What's the rest doing?
- We don't know for sure
- Annotating the genome
- Task of bioinformatics
46Some genome sizes
Organism                              Genome size (base pairs)
Virus, Phage Φ-X174                   5387 (first sequenced genome)
Virus, Phage λ                        5 x 10^4
Bacterium, Escherichia coli           4 x 10^6
Plant, Fritillaria assyriaca          13 x 10^10 (largest known genome)
Fungus, Saccharomyces cerevisiae      2 x 10^7
Nematode, Caenorhabditis elegans      8 x 10^7
Insect, Drosophila melanogaster       2 x 10^8
Mammal, Homo sapiens                  3 x 10^9
Note: The DNA from a single human cell has a length of 1.8m.
47Evolution
- A model/theory to explain the diversity of life forms
- Some aspects known, some not
- An active field of research in itself
- Bioinformatics deals with genomes, which are
end-products of evolution. Hence bioinformatics
cannot ignore the study of evolution
48"endless forms most beautiful and most wonderful" -- Charles Darwin
49Evolution
- All organisms share the genetic code
- Similar genes across species
- Probably had a common ancestor
- Genomes are a wonderful resource to trace back the history of life
- Got to be careful though -- the inferences may require clever techniques
50Evolution
- Lamarck, Darwin, Weismann, Mendel
51Some mechanisms of evolution
- New or different genes
- Gene duplication: new gene
- Gene mutation: different gene
- New or different regulation of genes
- Switches change, therefore circuit changes, even though genes are same
- A difference of time scales