Title: Sequence Optimization For Synthetic Genes Using Genetic Algorithms
1Sequence Optimization For Synthetic Genes Using Genetic Algorithms
- David Sigfredo Angulo1
- Rob Vogelbacher1, Benjamin R. Capraro2, Tobin Sosnick2, Shohei Koide2
- 1 School of Computer Science, Telecommunications and Information Systems, DePaul University
- 2 Department of Biochemistry and Molecular Biology, The University of Chicago
2Introduction
- Genetic Algorithms
- Using ideas based on the biology of genes
- Create software that uses such stochastic means to search through large search spaces
- Resulting algorithm has nothing to do with genes
- Designing Genes
- This search space is huge
- REALLY NOVEL IDEA
- Use Genetic Algorithms based on genes to design genes!!
3Outline
- Short biology Tutorial
- DNA Sequence Generation
- Why is the problem difficult?
- IBG Gene Designer
- Genetic Algorithm (GA) solution
- Heuristics and Fitness Evaluation
4First
- Before the problem can be described
- Must give some background biochemistry principles
- Tutorial outline
- DNA
- Codons
- Protein
- Synthetic genes
- What are they and what are they used for?
- Restriction Enzymes
- Expressing Proteins using Vectors
5Transcription/Translation
- DNA -> RNA -> Protein
- Transcription (DNA to RNA): RNA Polymerase
- Translation (RNA to Protein): Ribosomes
6DNA
- Deoxyribonucleic acid
- Strand backbone is made of sugar phosphate molecules
- Strands connected by nitrogen containing nucleotide bases
- Two strands join making a double helix
- Each strand is made of nucleotides joined together
7(Figure: levels of DNA packing)
- 2 nm: short region of DNA double helix
- 11 nm: "beads on a string" form of chromatin
- 30 nm: chromatin fiber of packed nucleosomes
- 300 nm: section of chromosome in an extended form
- 700 nm: condensed section of chromosome
- 1100 nm: entire mitotic chromosome
8DNA
- Four nucleotides: A, G, T, C
9DNA Base Pairing
10Short Biology Tutorial
- Tutorial outline
- DNA
- Codons
- Protein
- Restriction Enzymes
- Expressing Proteins using Vectors
11DNA Sequence Generation: Codon to Amino Acid Translation
- http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
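The codon chart linked on this slide can also be written as a small lookup table. The following is a minimal sketch, assuming the standard genetic code; the names `CODON_TABLE` and `translate` are illustrative and are not taken from the GeneDesigner software.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, ordered to match
# product(BASES, repeat=3): TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate("ATGGCCTAA")` reads ATG (M), GCC (A), then stops at TAA.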
12Short Biology Tutorial
- Tutorial outline
- DNA
- Codons
- Protein
- Restriction Enzymes
- Expressing Proteins using Vectors
13Proteins AA Chains
14Proteins
- Amino Acid Chains Fold Into Complex 3D Structures
- Functional properties depend on 3D structure
- Usefulness depends on functional properties
- E.g. designing drugs
15Designed/Expressed Proteins Extremely Useful
- Designed Proteins
- Can be used to study protein structure
- Can be used to study effects of other proteins
- Can be designed to knock out other proteins
- Can be designed to block the action of other proteins
- Expressed proteins
- Expressed in cow's milk or chicken eggs
- Can manufacture drugs on large scales in this way
- E.g. insulin
16Synthetic Genes
- DNA sequences
- Backtranslated from a novel Protein or Amino Acid sequence
- DNA -> RNA -> Protein
- Transcription (DNA to RNA): RNA Polymerase
- Translation (RNA to Protein): Ribosomes
- We'll put the DNA for our designed protein into an organism (a vector)
- Then that vector will make (express) our protein
- But, how do we get the DNA into an organism???
17Short Biology Tutorial
- Tutorial outline
- DNA
- Codons
- Protein
- Restriction Enzymes
- Expressing Proteins using Vectors
18Restriction Enzyme Digests
- Watson and Crick, 1953
- Took 20 years to be able to do anything with DNA
- H. Smith (and others) made a discovery that allowed manipulation and deciphering of DNA
- Discovery was that bacteria produce enzymes that introduce breaks in double stranded DNA molecules whenever they encounter a specific string of nucleotides
- These enzymes are called Restriction Enzymes
- Restriction Enzymes can be used as precise scissors
- They let biologists cut (and paste) portions of DNA
19EcoRI
- EcoRI was one of the first Restriction Enzymes discovered
- "Eco" because it was isolated from E. coli (Escherichia coli)
- "R" for the strain it came from (RY13)
- "I" because it was the first Restriction Enzyme identified in that strain
- Now over 300 Restriction Enzymes known
- EcoRI cleaves (restricts, digests) DNA
- Between the G and A nucleotides
- Only when it encounters them in the string 5'-GAATTC-3'
- This is called the restriction site
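The cutting rule on this slide can be sketched in a few lines. This is a simplified illustration that models only one strand of the DNA; the function names `find_sites` and `digest` are hypothetical and chosen for this example.

```python
ECORI_SITE = "GAATTC"
CUT_OFFSET = 1  # EcoRI cuts between the G and the first A of the site


def find_sites(dna, site=ECORI_SITE):
    """Return the start index of every occurrence of site (overlaps allowed)."""
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def digest(dna, site=ECORI_SITE, cut_offset=CUT_OFFSET):
    """Cut one strand cut_offset bases into each site and return the fragments."""
    dna = dna.upper()
    fragments = []
    previous = 0
    for pos in find_sites(dna, site):
        cut = pos + cut_offset
        fragments.append(dna[previous:cut])
        previous = cut
    fragments.append(dna[previous:])
    return fragments
```

With this sketch, `digest("AAGAATTCGG")` finds the site at index 2 and splits the strand into `"AAG"` and `"AATTCGG"`. A sequence without the site comes back as a single fragment.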
20Sticky Ends
- Many restriction enzymes cut in such a way that some single stranded DNA is left at both ends
- These nucleotide sequences
- Are complementary to each other
- Are 5'-AATT-3' in the case of EcoRI
- Can base pair with other nucleotides in a sequence
- Thus, are called "sticky ends"
- Can temporarily hold two DNA strands together
- The enzyme ligase will permanently join those strands
- This is called ligation
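Why the EcoRI overhangs stick to each other can be checked with a reverse-complement helper. The sketch below is illustrative only; `reverse_complement` and `can_anneal` are names made up for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def can_anneal(overhang_a, overhang_b):
    """Two single-stranded overhangs pair when one is the reverse complement of the other."""
    return overhang_a.upper() == reverse_complement(overhang_b)
```

The EcoRI site GAATTC is its own reverse complement, so both cut ends carry the same 5'-AATT-3' overhang, and `can_anneal("AATT", "AATT")` is true. An overhang such as GATC would not pair with AATT.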
21Short Biology Tutorial
- Tutorial outline
- DNA
- Codons
- Protein
- Restriction Enzymes
- Expressing Proteins using Vectors
22Gene Synthesis: On the Lab Bench
- Initial Sequence Construction
- Oligonucleotides (short strands of DNA) are defined with complementary overlapping sites
- The sticky ends
- Assembly PCR
- Oligonucleotides and polymerase are mixed and placed in a thermocycler
- Creates contiguous DNA sequence from component oligos
23Gene Synthesis: On the Lab Bench (cont)
- After PCR, generated DNA sequence cut with restriction enzymes
- Expression host's plasmid cut with restriction enzymes
- Synthetic gene inserted into plasmid and plasmid repaired
- Expression Vectors
- Host organisms used to express the synthetic genes (make the protein)
- Typically E. coli
- Possibly Chickens or Cows
- Expression vector can now express protein coded for by synthetic gene
- A bit more complicated than described above!!!
24DNA Sequence Generation: Gene Insertion
25Outline
- Short biology Tutorial
- DNA Sequence Generation
- Why is the problem difficult?
- IBG Gene Designer
- Genetic Algorithm (GA) solution
- Heuristics and Fitness Evaluation
26DNA Sequence Generation: The Computational Problem
- Why is the problem difficult?
- Conflicting goals
- Avoid restriction sites
- Maximizing Codon Preference
- Thus, cannot use deterministic algorithm
- Degeneracy (redundancy) of the DNA code: 64 codons, 20 (21) amino acids (see next slide)
- Several synonymous codons are translated into the same amino acid
- Synonymous codons per AA vary from one to six (average is four codons per AA)
- Huge number of possible DNA Sequences
- Average 2^N for protein of amino acid length N
- Codon Preference
- Varying levels of tRNA assembly components in organisms
- Codon usage for a particular AA greatly influences protein expression
- (continued)
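The size of the search space for a given protein can be computed exactly: it is the product of the number of synonymous codons for each residue. The sketch below assumes the standard genetic code; `DEGENERACY` and `count_encodings` are illustrative names.

```python
from collections import Counter
from math import prod

# Standard genetic code, one letter per codon in TCAG order ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Number of synonymous codons for each amino acid (for example 1 for M, 6 for L).
DEGENERACY = Counter(AMINO_ACIDS)


def count_encodings(protein):
    """Return how many distinct DNA sequences encode the given protein.

    Letters that are not in the standard code contribute a factor of 0.
    """
    return prod(DEGENERACY[aa] for aa in protein.upper())
```

For example, `count_encodings("MW")` is 1 because methionine and tryptophan each have a single codon, while `count_encodings("LSR")` is 6 * 6 * 6 = 216. The count grows multiplicatively with every added residue, which is why exhaustive enumeration is impractical for whole proteins.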
27DNA Sequence Generation: Codon to Amino Acid Translation
- http://campus.queens.edu/faculty/jannr/Genetics/images/codon.jpg
28DNA Sequence Generation: The Computational Problem (cont)
- Why is the problem difficult?
- (continued)
- Restriction Enzymes
- The vector will contain many restriction enzymes
- If these cut up our DNA, we won't express our proteins
- We must design the DNA string using synonymous codons so that there are no restriction sites
- Helpful to include some other restriction sites
- We must design the DNA string using synonymous codons so that these are included
- (continued)
29DNA Sequence Generation: The Computational Problem (cont)
- Why is the problem difficult?
- (continued)
- mRNA Secondary Structure
- In prokaryotes, mRNA can fold into complex shapes
- This inhibits protein creation
- Oligonucleotide generation
- Want a specific melting temperature so that the complex folding doesn't take place
- The sticky ends must have the same melting temperature so that they will bind together.
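The slides do not say how melting temperature is calculated. As a rough illustration only, the sketch below uses the Wallace rule of thumb for short oligonucleotides, Tm = 2(A+T) + 4(G+C) in degrees Celsius. This is an assumption, not the method used by GeneDesigner, and `wallace_tm` and `tm_spread` are hypothetical names.

```python
def wallace_tm(oligo):
    """Estimate melting temperature in degrees C with the Wallace rule (short oligos only)."""
    s = oligo.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc


def tm_spread(overlaps):
    """Return the gap between the highest and lowest estimated Tm among overlap regions."""
    tms = [wallace_tm(o) for o in overlaps]
    return max(tms) - min(tms)
```

A design tool could try to keep `tm_spread` small across all overlap regions so that they anneal under the same conditions. For example, `wallace_tm("GAATTC")` is 2*4 + 4*2 = 16 and `wallace_tm("GGATCC")` is 2*2 + 4*4 = 20, giving a spread of 4.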
30Outline
- Short biology Tutorial
- DNA Sequence Generation
- Why is the problem difficult?
- IBG Gene Designer
- Genetic Algorithm (GA) solution
- Heuristics and Fitness Evaluation
31IBG GeneDesigner: Our Solution
32IBG GeneDesigner: Genetic Algorithm
- Uses a Genetic Algorithm for sequence optimization
- Tournament selection model
- Uniform and single-point crossover (behind the scenes; not user selectable at present)
- Mutation causes codon wobbling
- Sequence fitness determined by heuristic evaluation
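The ingredients listed on this slide can be put together in a small sketch. It is not the GeneDesigner code: "codon wobbling" is interpreted here as swapping a codon for a synonymous one, only uniform crossover is shown, the fitness function is a toy placeholder that penalizes EcoRI sites, and all names and parameter values are assumptions.

```python
import random
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
SYNONYMS = {}
for codon, aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def random_individual(protein, rng):
    """An individual is a list of codons, one per residue of the protein."""
    return [rng.choice(SYNONYMS[aa]) for aa in protein]


def toy_fitness(codons, forbidden="GAATTC"):
    """Placeholder heuristic: fewer forbidden restriction sites is better."""
    return -"".join(codons).count(forbidden)


def tournament(population, fitness, rng, size=3):
    """Pick `size` random individuals and return the fittest one."""
    return max(rng.sample(population, size), key=fitness)


def uniform_crossover(a, b, rng):
    """Take each codon from either parent with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def mutate(codons, protein, rng, rate=0.05):
    """With probability `rate`, redraw a codon from the synonyms of its residue."""
    return [rng.choice(SYNONYMS[aa]) if rng.random() < rate else c
            for c, aa in zip(codons, protein)]


def evolve(protein, generations=30, pop_size=20, seed=0, fitness=toy_fitness):
    """Run a simple generational GA and return the best DNA sequence found."""
    rng = random.Random(seed)
    population = [random_individual(protein, rng) for _ in range(pop_size)]
    for _ in range(generations):
        population = [
            mutate(
                uniform_crossover(
                    tournament(population, fitness, rng),
                    tournament(population, fitness, rng),
                    rng,
                ),
                protein,
                rng,
            )
            for _ in range(pop_size)
        ]
    return "".join(max(population, key=fitness))
```

Because crossover and mutation both operate on whole codons and only exchange synonymous codons, every individual always encodes the same protein; only the DNA spelling changes. A call such as `evolve("MEFKL")` returns a 15-nucleotide sequence for that peptide.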
33IBG GeneDesigner: Fitness Evaluation
- GeneDesigner heuristics
- Manipulation of nucleotide percentages/ratios to reduce mRNA secondary structure formation
- Inclusion and Exclusion of restriction sites
- Restriction sites requested for inclusion should only occur once
- Matching of codon preference
- Oligonucleotide generation
- Fitness determined by melting points, start and end nucleotide
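One way to combine heuristics like these into a single score is a weighted sum. The sketch below is an assumption about how such a function might look, not the actual GeneDesigner scoring. The weights, the default `target_gc` of 0.40, and the `usage` table format (codon mapped to a relative usage between 0 and 1) are all hypothetical, and oligonucleotide scoring is left out.

```python
def count_occurrences(dna, site):
    """Count occurrences of site in dna, including overlapping ones."""
    count = 0
    start = dna.find(site)
    while start != -1:
        count += 1
        start = dna.find(site, start + 1)
    return count


def gc_fraction(dna):
    """Fraction of G and C nucleotides in a non-empty sequence."""
    return (dna.count("G") + dna.count("C")) / len(dna)


def codon_preference(dna, usage):
    """Mean relative usage of the codons in dna; codons missing from usage score 0."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return sum(usage.get(c, 0.0) for c in codons) / len(codons)


def fitness(dna, exclude=(), include=(), usage=None,
            target_gc=0.40, weights=(1.0, 1.0, 1.0, 1.0)):
    """Higher is better. Each term mirrors one heuristic from the slide."""
    dna = dna.upper()
    w_excl, w_incl, w_gc, w_codon = weights
    score = 0.0
    # Every occurrence of an excluded restriction site is penalized.
    score -= w_excl * sum(count_occurrences(dna, s) for s in exclude)
    # Requested sites should occur exactly once; penalize the distance from 1.
    score -= w_incl * sum(abs(count_occurrences(dna, s) - 1) for s in include)
    # Nucleotide ratio: penalize deviation from the target G/C fraction.
    score -= w_gc * abs(gc_fraction(dna) - target_gc)
    # Codon preference: reward codons the host organism uses often.
    if usage:
        score += w_codon * codon_preference(dna, usage)
    return score
```

For instance, with `exclude=["GAATTC"]` the sequence `"ATGGAGTTCAAA"` scores higher than `"ATGGAATTCAAA"`, because the second contains the excluded site. A weighted sum is only one possible design; it makes the trade-off between conflicting goals explicit but requires the weights to be tuned.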
34IBG GeneDesigner: Future Work
- Algorithm parameters
- Systematically manipulate GA parameters to identify default values for sequence optimization
- Population size
- Number of generations
- Mutation rate
- Convergence criteria
- Modify heuristic weighting scheme
- Selection models
- Experiment with alternative selection models (Roulette wheel, elitism, limit population replacement)
35IBG GeneDesigner: Future Work
- Move algorithm to ECJ architecture
- Use the Strength-Pareto multi-objective optimization algorithm
- Create web-based version of application
- Explore island model effects on optimization
36Results
- IBG GeneDesigner was utilized to generate a nucleotide sequence for the SH3 domain of α-spectrin1.
- The codon optimization option was set for expression in E. coli with a 40% G/C bias
- We also used the application to generate four assembly PCR template oligonucleotide sequences to produce the protein coding sequence flanked by desired restriction enzyme recognition sites.
- The calculated Tm values of the three overlapping regions were within 1.6 °C
- Promoting similar annealing behavior between strands.
- Success of the reaction was confirmed by DNA sequencing of a pUC19 expression vector containing the PCR product cloned between restriction sites included in the gene design.
- Summary: Protein Made!!!
37Input: Protein Sequence, Vector, Restriction Enzymes
38Input: Flanking Sequences
39Input: Algorithm Parameters and Fitness Scores
40Output: Generation of Oligonucleotides
42Acknowledgements
- Graduate student who did much of the coding
- Rob Vogelbacher
- University of Chicago undergraduate who used it to build a protein
- Benjamin R. Capraro
- His advisor
- Tobin Sosnick
- Our collaborator at University of Chicago
- Shohei Koide