Title: CS1315: Introduction to Media Computation
1COMP.3500/5800 Topics in Bioinformatics
- Webpage
- http://www.cs.uml.edu/kim/580.html
- What is bioinformatics?
- Study of DNA sequences, genomes, and proteins
- Modeling/Inference
- What is involved in bioinformatics?
- Biology
- Statistics, Linear Algebra
- Computer Science
- Algorithms
- Machine learning
2COMP.3500/5800 Topics in Bioinformatics
- Why study bioinformatics?
- What makes us different? 99.9% of our genomes are identical
- How are different cells developed from the same genome?
- Study mutations in the genome -> drug development
- For CS,
- Study how bioinformatics algorithms are implemented
3Topics today
- Textbook, pgs. 16-19
- DNA and RNA
- Genes to Proteins
- transcription
- Translation
- Genome
4DNA and RNA, Genes to Proteins, Genomes
5DNA and RNA
- DNA (deoxyribonucleic acid) and RNA (ribonucleic acid) are composed of linear chains of monomeric units called nucleotides
- A nucleotide has three parts: a sugar, a phosphate, and a base
- Four bases
6Base Types
- Nucleic acid bases are of two types
- Pyrimidine: C, T, U (two nitrogens in a 6-member ring at positions 1 and 3)
- Purine: A, G (pyrimidine ring fused to an imidazole ring (C3H4N2))
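7IUPAC single-letter codes for sets of bases
R: A or G (purine)
Y: C or T (pyrimidine)
W: A or T
M: A or C
K: G or T
B: C, G, or T (not A)
S: C or G
N: A, C, G, or T (any base)
H: A, C, or T (not G)
V: A, C, or G (not T)
D: A, G, or T (not C)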
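The single-letter codes on this slide appear to be the standard IUPAC nucleotide ambiguity codes. As a minimal sketch of how such codes are used, the lookup table and the `matches` helper below (both names are illustrative, not from the lecture) test whether a concrete DNA string fits an ambiguous pattern.

```python
# IUPAC nucleotide codes: each letter names a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}

def matches(pattern, seq):
    """Return True if every base of seq is allowed by the code at that position."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), seq.upper()))

print(matches("RY", "AC"))  # purine followed by pyrimidine
print(matches("RY", "CA"))  # pyrimidine followed by purine
```

A dictionary of strings keeps the sketch short; a real tool would more likely compile the pattern into a regular expression.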
8Primary Structure of DNA and RNA
- Nucleotides are joined by phosphodiester bonds and form a sugar-phosphate backbone
- Sugar is deoxyribose in DNA (left) and ribose in RNA (right)
- Nitrogen-containing nucleobases are bonded to the sugar
9Online course on Biology
- Educational Portal
- DNA chemical structure
- http://education-portal.com/academy/lesson/dna-and-the-chemical-structure-of-nucleic-acids.html
10Secondary Structure
- Double helix: 1953, Watson and Crick, using X-ray diffraction
- Sugar-phosphate backbone is the outer part of the helix
- Two strands run in antiparallel directions
- Dimensions
- Inside diameter of backbone: 11 Å (1.1 nm)
- Outer diameter: 20 Å (1 Å = 10^-10 m = 0.1 nm)
- Length of one complete turn: 34 Å, 10 base pairs
- Major and minor grooves: where drugs or polypeptides bind to DNA
11Secondary Structure of DNA
- Two strands are complementary
- Base pairing: A-T, G-C
- A pyrimidine and a purine form complementary H bonds
12Monomer counts in DNA
- In double strands
- # of A = # of T, # of G = # of C
- Erwin Chargaff's 1st Parity Rule, 1951
- In a single strand?
- # of A ≈ # of T, # of G ≈ # of C
- Erwin Chargaff's 2nd Parity Rule
- How about oligomer (a few successive bases) frequencies?
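The first parity rule follows directly from base pairing, which a short count can confirm. The sketch below is illustrative only: the helper names and the toy sequence are assumptions, and a sequence this short says nothing about the second rule, which only holds approximately on long genomic strands.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def base_counts(seq):
    """Count A, C, G, T in a sequence (other symbols are ignored)."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ACGT"}

def opposite_strand(seq):
    """Return the paired strand, written 5'->3' (the reverse complement)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

strand = "ACTAAGCG"                        # toy single strand
duplex = strand + opposite_strand(strand)  # bases of both strands together

print(base_counts(strand))  # a single strand need not be balanced
print(base_counts(duplex))  # in the duplex, A == T and G == C
```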
13Oligomer Frequencies
- Oligomer length k
- Window of k sliding by one base (overlapping k-1 bases)
- A simple counting program
- May have to contend with long sequences
- An oligomer and its reverse complement
- ACT vs. AGT
A C T A A G C G
A C T A A G C G
A C T A A G C G
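The simple counting program mentioned on this slide can be sketched as follows. This is one possible version, not the course's own code: the function names are made up for illustration, the input is assumed to contain only A, C, G, and T, and the whole sequence is held in memory, so a long genome would need to be streamed in chunks instead.

```python
from collections import Counter

PAIRS = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(kmer):
    """Reverse complement of an oligomer, e.g. ACT -> AGT."""
    return "".join(PAIRS[base] for base in reversed(kmer))

def count_kmers(seq, k):
    """Slide a window of length k one base at a time and count each oligomer."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

def count_canonical_kmers(seq, k):
    """Merge each oligomer with its reverse complement under one key."""
    merged = Counter()
    for kmer, n in count_kmers(seq, k).items():
        merged[min(kmer, reverse_complement(kmer))] += n
    return merged

trimers = count_kmers("ACTAAGCG", 3)
print(sorted(trimers))  # ['AAG', 'ACT', 'AGC', 'CTA', 'GCG', 'TAA']
```

Merging an oligomer with its reverse complement is useful when the strand of the input is arbitrary, since ACT on one strand is AGT on the other.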
14Trimer Frequencies in Yeast
15Trimer Frequencies in a Few Species
16Importance of Hydrogen Bonding
- Many consider the hydrogen bond essential to the evolution of life
- An individual hydrogen bond is weak, but many H bonds collectively exert a very strong force
- Orderly, repetitive arrangement of H bonds in polymers determines their shape
18Online course on Biology
- Educational Portal
- Four bases
- http://education-portal.com/academy/lesson/dna-adenine-guanine-cytosine-thymine-complementary-base-pairing.html
19Chromosome Length
- 3.4 Å per base
- 3 billion bases
- 1.8 meters of DNA
- 0.09 mm of chromatin after being wound on histones
- Five families of histones
- H1/H5, H2A, H2B, H3, and H4
21RNA
- Sugar in an RNA nucleotide is ribose rather than 2-deoxyribose
- Thymine is replaced by uracil (U)
- RNA polymers are usually a few thousand nucleotides or shorter
- RNA in cells is usually single-stranded
- RNA is considered to be the original gene-coding material, and it still codes genes in a few viruses
22RNA Types
- Four RNAs are involved in protein synthesis
RNA Type           Size      Function
Transfer RNA       Small     Transports AAs to protein synthesis sites
Ribosomal RNA      Variable  Combines with proteins to form the ribosome, where the polypeptide chain grows
Messenger RNA      Variable  Carries the AA sequence information transcribed from genes
Small nuclear RNA  Small     Processes the initial mRNA into its mature form in eukaryotes
23Online course on Biology
- Educational Portal
- RNA
- http://education-portal.com/academy/lesson/differences-between-rna-and-dna-types-of-rna-mrna-trna-rrna.html
24DNA and RNA, Genes to Proteins, Genomes
25Central Dogma in Molecular Biology
- Info flows in one direction
- DNA/genome
- A template or a roadmap
- RNA
- Copies of genes to be expressed (activated)
- Protein
- Biochemical molecules performing biological
functions
26Gene to Protein: Transcription and Translation
27Translation
28Transcription
coding, sense
anti-sense
29Transcription
30Gene to Protein
[Figure labels: 5' UTR, protein coding region, 3' UTR, mRNA, non-protein coding regions, Protein 1, Protein 2, exon, intron, intergenic, UTR]
31Alternative Splicing
32Translation
- Genetic Code
- A triplet (called a codon)
- Ribosome moves along mRNA 3 bases at a time
- Degenerate coding
- 4 x 4 x 4 = 64 possible triplets map onto 20 amino acids
- For 8 AAs the 3rd base is irrelevant: immune to mutation at that position
- Anti-codon: reverse complement of a codon
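The degeneracy claims on this slide can be checked against the standard genetic code. The sketch below builds the 64-entry table from the conventional compact listing (bases in T, C, A, G order, `*` for stop); the function names are illustrative, and only the standard nuclear code is assumed, not the organelle variants.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Read an mRNA three bases at a time until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

def third_base_irrelevant():
    """Amino acids that have a codon family where any third base gives the same AA."""
    result = set()
    for first, second in product(BASES, repeat=2):
        family = {CODON_TABLE[first + second + third] for third in BASES}
        if len(family) == 1:
            result.add(family.pop())
    return result

print(len(CODON_TABLE))                 # 64 triplets
print(sorted(third_base_irrelevant()))  # the 8 amino acids with a 4-fold family
print(translate("AUGAAACGCUAUAGCG"))    # MKRYS
```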
33Genetic Code
34Genetic Code
35Genetic Code
36Amino Acids
- General structure of amino acids
- an amino group
- a carboxyl group
- α-carbon bonded to a hydrogen and a side-chain group, R
- R determines the identity of a particular amino acid
- Protein: a sequence of AAs
- R: large, white and gray
- C: black
- Nitrogen: blue
- Oxygen: red
- Hydrogen: white
37Translation and Transcription in Details
39mRNA (messenger RNA) another view
40Transcription
- Gene sequence is copied from one strand
- Sense strand has the same sequence as the mRNA (with T in place of U)
- Antisense strand is used as the template to generate the mRNA sequence
- 5' CGCTATAGCGTTTCAT 3' -- antisense, template strand
- 3' GCGATATCGCAAAGTA 5' -- sense, coding strand
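The two strands on this slide can be used to trace transcription by hand. The sketch below is a simplified model under stated assumptions: the function names are illustrative, the whole template is transcribed, and promoters, termination, and splicing are ignored.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(strand):
    """Return the paired strand, also written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

def transcribe(template_5to3):
    """mRNA (5'->3') made from a template strand that is given 5'->3'."""
    coding_5to3 = reverse_complement(template_5to3)  # the sense strand
    return coding_5to3.replace("T", "U")             # RNA uses U instead of T

template = "CGCTATAGCGTTTCAT"  # antisense strand from the slide, 5'->3'
print(reverse_complement(template))  # ATGAAACGCTATAGCG (sense strand, 5'->3')
print(transcribe(template))          # AUGAAACGCUAUAGCG
```

Read 5' to 3', the sense strand starts with ATG, so the mRNA begins with the start codon AUG.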
41Template, anti-sense
sense
42RNAs
- Protein-coding RNA: mRNA
- Non-coding RNAs
- tRNA (transfer RNA)
- rRNA (ribosomal RNA)
- Involved in protein synthesis
- 80-85% of total RNA
- snRNA (small nuclear RNA)
- Localized to the nucleus
- Consists of families of RNAs responsible for functions such as RNA splicing
- E.g., the spliceosome contains five snRNAs: U1, U2, U4, U5, and U6
- snoRNA (small nucleolar RNA)
- Synthesis of rRNA occurs in the nucleolus, a specialized structure within the nucleus, facilitated by snoRNAs
- miRNA (micro RNA)
- Short (about 22 nt), regulating gene expression
43Ribosomes for Translation
- Role of ribosomes in protein synthesis
- Coordinate protein synthesis by placing mRNA, tRNA, and protein factors in their correct positions
- Components of ribosomes catalyze at least some of the chemical reactions occurring during translation
- Each ribosome consists of two subunits
- 60S and 40S in eukaryotes (18S, 28S, and 5.8S rRNAs are cut from a 45S precursor)
- 50S and 30S in bacteria (16S, 23S, 5S)
44Schematic drawing of secondary structure for 16S
rRNA. Intrachain folding pattern includes loops
and double-stranded regions.
45Protein Synthesis
- Translation from mRNA to protein
- mRNA is transported out of the cell nucleus, then translated
- tRNA (transfer RNA)
47tRNA
- Anti-codon and AA ends do not know each other
- Aminoacyl-tRNA synthetase recognizes a DHU loop
and determines which AA is attached to a tRNA
48Is the genetic code arbitrary?
- http://www.cs.uml.edu/kim/580/SA_genetic_code.pdf
- Douglas Hofstadter, Scientific American, early 80s
- Where is the genetic code stored?
- Ribosome translates a codon to an AA
- How does the ribosome know which AA to attract for a codon?
- tRNA has an anti-codon at one end and an AA at the other
- Does a tRNA know the genetic code?
- No. Aminoacyl-tRNA synthetase matches up the DHU loop to an AA
49Is the genetic code arbitrary?
- In a cell
- Remove all mRNA and tRNA and discard them
- Remove all DNA, modify it according to a new genetic code, and insert it back into the cell
- Leave all ribosomes and other components intact
- Will the cell function the same way (or, will the same set of proteins be manufactured under the new genetic code)?
50Is the genetic code arbitrary?
- New DNA has new coding for tRNA, mRNA
- New tRNA
- According to the DHU loop, synthetase matches the new anti-codon to the AA that would have been matched under the old genetic code
- Therefore, the new genetic code will generate the same proteins
51HW 1
- Read Hofstadter's article
- Submit a report on 1/30/17, including
- Re-statement of his hypothesis
- Bases of his claim
- Whether you agree with his claim
- Justification of your argument
- Include references in the report
52DNA and RNA, Genes to Proteins, Genomes
53Genome
- Genome
- The entire DNA of a cell is the genome
- Individual units coding for proteins or RNA are genes
- A gene starts with ATG and ends with one or two stop codons
- Called ORF (Open Reading Frame)
- Biological Info
- Contained in genome
- Encoded in nucleotide sequences of DNA or RNA
- Partitioned into discrete units, genes
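The ORF definition above (start at ATG, read in triplets, end at a stop codon) translates directly into a scan over the three forward reading frames. This is a deliberately minimal sketch: `find_orfs` and its `min_codons` parameter are illustrative names, only the given strand is searched, and real gene finders add the reverse strand, length thresholds, and statistical models.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop stretch in frames 0-2.

    start and end are 0-based, end exclusive; min_codons counts the codons
    before the stop codon, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                    # close it at the stop codon
    return orfs

print(find_orfs("CCATGAAACGCTATAGCTGACC"))
# [(2, 20, 'ATGAAACGCTATAGCTGA')]
```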
54Cell
- Different levels of cells
- Prokaryote (pro for before; karyon, kernel in Greek)
- Eukaryote (eu for true)
- Main difference is the presence of organelles, especially the nucleus, in eukaryotes
Organelle              Prokaryotes          Eukaryotes
Nucleus                No definite nucleus  Present
Cell membrane          Present              Present
Mitochondria           None                 Present
Endoplasmic reticulum  None                 Present
Ribosomes              Present              Present
Chloroplasts           None                 Present in green plants
55animal cell
plant cell
Prokaryotic cell
56Three Domains
- Classification purely based on biochemistry (RNA)
- C. Woese, 1981
- Eubacteria (true bacteria)
- Archaea (archaebacteria, early bacteria)
- Eukarya (eukaryotes)
57Genome Sequencing Projects
- Major genome sequencing centers
- U.S. Dept. of Energy Joint Genome Institute (435 projects)
- J. Craig Venter Institute (302)
- The Institute for Genomic Research (TIGR) (206)
- Washington Univ. (184)
- Institut Pasteur, Univ. of Tokyo
- www.ncbi.nlm.nih.gov/genomes/static/lcenters.html
- National Center for Biotechnology Information
- Completely sequenced genomes include
- Several hundred bacteria, over 20 archaea, and over 30 eukarya
- Human (Homo sapiens), chimpanzee (Pan troglodytes), mouse (Mus musculus), brown rat (Rattus norvegicus), dog (Canis familiaris), thale cress (Arabidopsis thaliana), rice (Oryza sativa), fruit fly (Drosophila melanogaster), yeast (Saccharomyces cerevisiae)
- http://www.ebi.ac.uk/2can/genomes/genomes.html has descriptions of species and their clinical and scientific significance
- http://www.genomesonline.org has current status of genome projects
58Genome Databases
- Completed genomes
- ftp site -- ftp://ftp.ncbi.nlm.nih.gov/genomes/
- http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/allorg.html
- http://www.ebi.ac.uk/genomes/mot/index.html
- http://pir.georgetown.edu/pirwww/search/genome.html
- Organism-specific databases
- http://www.unl.edu/stc-95/ResTools/biotools/biotools10.html
- http://www.fp.mcs.anl.gov/gaasterland/genomes.html
- http://www.hgmp.mrc.ac.uk/GenomeWeb/genome-db.html
- http://www.bioinformatik.de/cgi-bin/browse/Catalog/Databases/Genome_Proejcts
59Genomes of Prokaryotes
- Circular double-stranded DNA
- Protein-coding regions do not contain introns
- Protein-coding regions are partially organized into operons: tandem genes transcribed into a single mRNA molecule
- The density of coding regions is high
- 89% in E. coli
trpE
trpD
The trp operon in E. coli begins with a control region, followed by genes performing successive steps in the synthesis of the amino acid tryptophan
60Genome of E.Coli
- Many E. coli proteins were known before the sequencing (1853 proteins)
- Genome of Escherichia coli, strain MG1655, published in 1997
- By F. Blattner at Univ. Wisconsin
- 4.64 Mbp
- 4284 protein-coding genes, 122 structural RNA genes, non-coding repeat sequences, regulatory elements, etc.
- Average size of ORF is 317 AA
- Average inter-genic gap is 118 bp
- About ¾ of transcription units contain single genes, and the rest are operons (gene clusters)
- 60% of protein functions are known
- http://wishart.biology.ualberta.ca/BacMap/index.html contains an atlas of bacterial genome diagrams (2005)
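The E. coli figures on this slide are enough for a back-of-the-envelope check of the high coding density quoted for prokaryotes. The sketch below uses only the gene count, average ORF size, and genome size given here; the function name is illustrative, and stop codons and RNA genes are left out, so the result is a rough lower estimate.

```python
def coding_fraction(n_genes, avg_orf_aa, genome_bp):
    """Fraction of the genome covered by protein-coding ORFs (3 bp per AA)."""
    coding_bp = n_genes * avg_orf_aa * 3
    return coding_bp / genome_bp

fraction = coding_fraction(n_genes=4284, avg_orf_aa=317, genome_bp=4_640_000)
print(f"{fraction:.1%}")  # 87.8%, close to the quoted coding density
```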
62Genome of Archaea
- Microorganism Methanococcus jannaschii
- Thrives in hydrothermal vents at temperatures from 48 to 94 °C
- B genes from 45 strains
- Capable of self-reproduction from inorganic components
- Metabolism is to synthesize methane from H2 and CO2
- Sequenced in 1996 by TIGR
- 1.665 Mbp chromosome containing a circular DNA molecule, plus two extra-chromosomal elements
- 1,784 protein-coding regions
- Proteins in archaea for transcription and translation are closer to those in eukaryotes
- Proteins involved in metabolism are closer to those of bacteria
63Genomes of Eukarya
- Majority of DNA is in the nucleus
- Organized into chromosomes, each containing a single DNA molecule
- Smaller amount of DNA in organelles such as mitochondria and chloroplasts
- Organelles originated as intra-cellular parasites
- Organelle genomes usually have circular forms, but are sometimes linear or multi-circular
- Genetic code is different from the one for nuclear genes
- Diverse among species
- Humans have 23 pairs of chromosomes, chimpanzees have 24 pairs
- Human chromosome 2 is equivalent to a fusion of chimpanzee chromosomes 12 and 13
- List of genome sequences
- http://en.wikipedia.org/wiki/List_of_sequenced_eukaryotic_genomes
64Genome of Saccharomyces cerevisiae (Yeast)
- Simplest eukaryotic organism
- Sequencing by about 100 labs, completed in 1996
- 12.06 Mbp
- 16 chromosomes
- 6,172 protein-coding genes
- Dense: only 231 genes contain introns
65Genome of Caenorhabditis elegans (C. elegans)
- Completed in 1998
- First full DNA sequence of a multi-cellular organism
- 97 Mbp
- Paired chromosomes
- XX for a self-fertilizing hermaphrodite (simultaneously male and female)
- XO for male
- Avg. 5 introns per gene
- Proteins
- 42% have homologues in other species
- 34% specific to nematodes (round worms)
- 24% no known homologues
Chromosome  Size (Mbp)  Protein genes  Kbp/gene
I           7.9         2803           5.06
II          8.5         3559           3.05
III         7.6         2508           5.40
IV          9.2         3094           5.17
V           9.8         4082           4.15
X           10.1        2631           6.54
66Genome of Drosophila melanogaster (Fruit fly)
- Completed in 1999 by Celera Genomics and Berkeley
- 180 Mbp
- Five chromosomes: 3 large autosomes, Y, and a tiny fifth
- 13,601 genes, 1 gene/8 Kbp
- Has 289 homologues to human disease genes
- Such as cancer, cardiovascular, neurological, etc.
- There are fly models for Parkinson's disease and malaria
67Genome of Arabidopsis thaliana
- Relatively small genome, 146 Mbp, completed in 2000
- Five chromosomes
- 25,498 predicted genes, 1 gene/4.6 kbp
- Proteins
- Most A. thaliana proteins have homologues in animals
- 60% of genes have human homologues, e.g., BRCA2
- Gene distribution
- Nucleus genome size (125 Mbp), genes (25,500)
- Chloroplast genome (154 Kbp), genes (79)
- Mitochondrion genome (367 Kbp), genes (58)
68- 20 of 54 genes in a 340-Kbp stretch of the rice genome (top) are conserved and retain the same order in five A. thaliana strands
69Human Genome
- Human Genome Project
- Conceived in 1984, begun in 1990, completed in 2001, ahead of the 2003 schedule
- What did the sequence reveal?
- 3 Bbp (base pairs)
- 24 chromosomes
- 22 autosomes plus two sex chromosomes (X, Y)
- Longest 250 Mbp, shortest 55 Mbp
- Mitochondrial genome
- Circular DNA molecule of 16,569 bp
- 10^13 cells
- How many is 3 Bbp?
- A typical 11-pt font can print 60 nucleotides in 10 cm (about 4 in)
- In this format, 3 Bbp writes out to about 5,000 km (roughly 3,100 mi)
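The printed-length estimate is simple arithmetic and easy to reproduce. The sketch below assumes the metric figure on the slide (60 nucleotides per 10 cm line) and a genome of exactly 3 billion bases; the function name is illustrative.

```python
KM_PER_MILE = 1.609344

def printed_length_km(n_bases, bases_per_line=60, line_cm=10.0):
    """Length of the sequence if printed as one continuous line of text."""
    lines = n_bases / bases_per_line
    return lines * line_cm / 100 / 1000   # cm -> m -> km

km = printed_length_km(3_000_000_000)
print(f"{km:,.0f} km")                # 5,000 km
print(f"{km / KM_PER_MILE:,.0f} mi")  # 3,107 mi
```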
70Genome of Homo sapiens
- 22 chromosomes plus X (163 Mbp) and Y (51 Mbp)
- Web resources
- Interactive access to DNA and protein sequences
- http://www.ensembl.org
- Images of chromosomes, maps, loci
- http://www.ncbi.nlm.nih.gov/projects/genome/guide/
- Gene map 99
- http://www.ncbi.nlm.nih.gov/genemap99
- Overview of human genome structure
- http://www.ims.u-tokyo.ac.jp/imsut/en
- SNP (single nucleotide polymorphisms)
- http://snp.cshl.org
- Human genetic diseases
- http://www.ncbi.nlm.nih.gov/Omim (Online Mendelian Inheritance in Man)
- http://www.geneclinics.org/profiles/all-html
71Human Genome Insights (ENCODE)
- Majority of genome is transcribed
- 50% transposons
- 25% protein-coding genes / 1.3% exons
- 23,700 protein-coding genes
- 160,000 transcripts
- Average gene: 36,000 bp
- 7 exons @ 300 bp
- 6 introns @ 5,700 bp
- 7 alternatively spliced products (95% of genes)
- RefSeq: 34,600 reference sequence genes (includes pseudogenes, known RNA genes)
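The average-gene figures on this slide can be cross-checked against each other. The sketch below assumes a 3 Bbp genome and reads the two small numbers for genes and exons as percentages of the genome; both are assumptions, and the variable names are illustrative.

```python
EXONS_PER_GENE, EXON_BP = 7, 300
INTRONS_PER_GENE, INTRON_BP = 6, 5_700
PROTEIN_CODING_GENES = 23_700
GENOME_BP = 3_000_000_000  # assumed genome size (3 Bbp)

exon_bp_per_gene = EXONS_PER_GENE * EXON_BP
gene_bp = exon_bp_per_gene + INTRONS_PER_GENE * INTRON_BP

gene_fraction = PROTEIN_CODING_GENES * gene_bp / GENOME_BP
exon_fraction = PROTEIN_CODING_GENES * exon_bp_per_gene / GENOME_BP

print(gene_bp)                 # 36300, i.e. roughly 36,000 bp per gene
print(f"{gene_fraction:.1%}")  # 28.7% of the genome inside genes
print(f"{exon_fraction:.1%}")  # 1.7% of the genome in exons
```

The results land near the quoted 25% and 1.3%, which is as close as averages of this kind can be expected to agree.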
72Genome of Homo sapiens (contd)
- Repeat sequences: >50% of the genome
- Short interspersed nuclear elements (SINEs) 13%, LINEs 21%
- Simple stutters (repeats of short oligomers, including mini- and micro-satellites)
- Triplet repeats such as CAG are implicated in numerous diseases (CAG codes for glutamine, giving glutamine repeats in the protein)
- SNP (pronounced "snip")
- A->T mutation in beta-globin changes Glu -> Val, creating a sticky surface on haemoglobin molecules => sickle-cell anaemia
- Progeria
- Avg 1 SNP/Kbp (100 SNPs per 100 Kbp)
- Many 100-Kbp regions tend to remain intact, with fewer than five SNPs
- => discrete combinations of SNPs define an individual's haplotype (haploid genotype)
- Individual genomes are characterized by a distribution of genetic markers including SNPs
- Intl HapMap Consortium
73Genome of Homo sapiens (contd)
- SNP consortium
- Collects human SNPs, nearly 5 million SNPs
- Shows
- Most variations appear in all populations
- However, a few SNPs are unique to particular populations
- Genomes of individuals from Japan and China are very similar
- Chromosome X varies more than other chromosomes (X is more subject to selective pressure)
- Mitochondrial DNA
- Double-stranded closed circular molecule of 16,569 bp
- Inherited almost exclusively through maternal lines
- Not subject to recombination, and changes only by mutation
- About 1 mutation every 25,000 years
74mtDNA and Y
- mtDNA: inherited through maternal lines
- Both sons and daughters get it from their mother
- All existing sequence variants are traced back to a single woman (Mitochondrial Eve) in Africa roughly 200,000 years ago
- Supports the out-of-Africa hypothesis
- Avg difference in mtDNA between pairs of individuals is 61.1, between Africans is 76.7, between non-Africans is 38.5
- More divergent populations have existed in Africa for much longer than in the rest of the world
- Y chromosome
- Most recent common male ancestor (Y-chromosome Adam) lived around 59,000 years ago
- Most divergent sequences are found among Africans
75Other Species
Organism                     Genome size   # of genes
Epstein-Barr virus           0.17 Mbp      80
E. coli                      4.6 Mbp       4,406
Yeast (S. cerevisiae)        12.5 Mbp      6,172
Nematode worm (C. elegans)   100.3 Mbp     19,099
Thale cress (A. thaliana)    115.4 Mbp     25,498
Fruit fly (D. melanogaster)  128.3 Mbp     13,601
Human (H. sapiens)           3223.0 Mbp    20,500
Fugu (Takifugu rubripes)     390.0 Mbp     30,000
Wheat                        16000.0 Mbp   30,000
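One way to read this table is as gene density: genome size grows far faster than gene count. The sketch below computes genes per Mbp from the table's own numbers; the dictionary layout and function name are illustrative choices, not part of the lecture.

```python
# (genome size in Mbp, number of genes), copied from the table above
GENOMES = {
    "Epstein-Barr virus": (0.17, 80),
    "E. coli": (4.6, 4_406),
    "Yeast": (12.5, 6_172),
    "Nematode worm": (100.3, 19_099),
    "Thale cress": (115.4, 25_498),
    "Fruit fly": (128.3, 13_601),
    "Human": (3223.0, 20_500),
    "Fugu": (390.0, 30_000),
    "Wheat": (16000.0, 30_000),
}

def genes_per_mbp(size_mbp, n_genes):
    """Gene density in genes per million base pairs."""
    return n_genes / size_mbp

density = {name: genes_per_mbp(*row) for name, row in GENOMES.items()}
for name in sorted(density, key=density.get, reverse=True):
    print(f"{name:20s} {density[name]:8.1f} genes/Mbp")
```

On these numbers the bacterium is the densest genome and wheat the sparsest, with human close to the bottom.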
76Genome
- One or more chromosomes that contain the code (genes) that directs the synthesis of proteins essential for an organism's structure and function
- Human: 22 pairs of homologous chromosomes + XY
- http://www.ncbi.nlm.nih.gov/genome/?term=txid9606[orgn]
77Genes
- Allele
- Alternative forms of the same gene
- Dominant, recessive