Genome Science - PowerPoint PPT Presentation



Genome Science


Genome Science Ka-Lok Ng Dept. of Bioinformatics Asia University – PowerPoint PPT presentation

Slides: 125
Provided by: edut1550


Transcript and Presenter's Notes

Title: Genome Science

Genome Science
  • Ka-Lok Ng
  • Dept. of Bioinformatics
  • Asia University

The Core Aims of Genomics Science
  • (1) An integrated web-based database and research
  • access to the enormous volume of data
  • web interfaces
  • Relational databases
  • Generic Model Organism Database (GMOD) project
    (http//) → to develop reusable components
    suitable for creating new community databases
    of biology

The Core Aims of Genomics Science
  • (2) To assemble physical and genetic maps
  • location of genes in a genome
  • physical distance, and relative position
    defined by recombination frequencies
  • the map is crucial for comparing the genomes
    of related species
  • related phenotypic and genetic data
  • used in animal and plant breeding
  • extend to more species with greater accuracy

The Core Aims of Genomics Science
  • (3) To generate and order genomic and expressed
    gene sequences
  • High-volume sequencing
  • The basic technique was developed by Fred Sanger
  • Shotgun approach → assemble into contigs, then
    scaffolds (sets of contigs), then the whole
    genome
  • mRNA is unstable
  • Coding parts → cDNA clones made from mRNA
  • Expressed sequence tags (ESTs)
  • Obtaining full-length cDNA is not easy because of
    mRNA structure

The Core Aims of Genomics Science
  • (3) To generate and order genomic and expressed
    gene sequences
  • mRNA → cDNA → EST
  • Whole genome reconstruction
  • Reverse transcription → cDNA
  • EST: a partial cDNA sequence, sequenced from
    either the 5' or the 3' end
  • Alternative splicing → no one-to-one
    correspondence between ESTs and genes

The Core Aims of Genomics Science
  • (4) Identify and annotate the complete set of genes
    encoded within a genome
  • From the complete sequence of a genome → genes
  • Alignment of cDNA, DNA and protein sequences
  • Gene-finding software: ORFs, transcription start
    and termination sites, exon/intron boundaries
  • Then gene annotation → linking sequence to
    genetic function, expression, locus information,
    and comparative data from homologous proteins
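
A minimal sketch of the ORF-scanning step that gene-finding software performs. It is a simplification made up for illustration: it scans only the three forward reading frames, treats ATG as the only start codon, and ignores introns, so it is not how any particular gene finder works. The function name `find_orfs` and the toy sequence are assumptions, not from the slides.

```python
# Toy ORF scan: forward strand only, ATG start, standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end) pairs of ATG...stop ORFs.

    `end` is exclusive and includes the stop codon; `min_codons`
    counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # close it and keep scanning
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGGCTTGACC"                      # hypothetical sequence
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end])
```

Real gene finders add the reverse strand, splice-site models and codon-usage statistics on top of this kind of scan.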

The Core Aims of Genomics Science
  • (5) To characterize DNA sequence diversity
  • Single-nucleotide polymorphisms (SNPs)
  • About 90 percent of human genome variation comes
    in the form of single nucleotide polymorphisms
    (neither harmful nor beneficial)
  • Theoretically, a SNP could have four possible
    forms, or alleles (different seq. alternative),
    since there are four types of bases in DNA. But
    in reality, most SNPs have only two alleles. For
    example, if some people have a T at a certain
    place in their genome while everyone else has a
    G, that place in the genome is a SNP with a T
    allele and a G allele.
  • The human genome contains more than 10 million
    SNPs → about one in every 100 to 300 bp
  • Find associations between SNP variation and
    phenotypic variation, e.g. sickle-cell anemia

Sickle-cell anemia and SNP
  • http//

The Core Aims of Genomics Science
  • (5) To characterize DNA sequence diversity
  • Characterize the level of haplotype structure due
    to linkage disequilibrium (LD)
  • Haplotype: a set of adjacent polymorphisms found
    on a single chromosome
  • LD: groups of closely linked alleles tend
    to be inherited together; this can be used to map
    human disease genes very accurately
  • Knowledge of LD is used to do disease locus
    mapping
  • In the human genome, haplotypes tend to be
    approximately 60,000 bp in size and therefore
    contain up to 60 SNPs that travel as a group.
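
A minimal sketch of how linkage disequilibrium between two SNPs can be quantified from phased haplotypes, using the standard coefficient D = p(AB) - p(A)p(B) and the squared correlation r². The function name and the sample haplotypes are hypothetical, chosen only to illustrate the two extremes.

```python
def linkage_disequilibrium(haplotypes):
    """Return (D, r2) for two biallelic SNPs.

    `haplotypes` is a list of 2-character strings, one allele per SNP
    (for example "TG"). The alleles of the first haplotype are used as
    the reference alleles.
    """
    n = len(haplotypes)
    a, b = haplotypes[0][0], haplotypes[0][1]
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h == a + b for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


if __name__ == "__main__":
    # Hypothetical samples: complete LD versus no LD.
    print(linkage_disequilibrium(["TG"] * 5 + ["CA"] * 5))
    print(linkage_disequilibrium(["TG", "TA", "CG", "CA"]))
```

An r² near 1 means one SNP predicts the other almost perfectly, which is what lets a haplotype block be tagged by a few SNPs.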

The Core Aims of Genomics Science
  • Mendel's Laws enable the outcome of genetic
    crosses to be predicted.

A and B on different chromosomes
The Core Aims of Genomics Science
  • Genes on the same chromosome should display
    linkage
  • Genes A and B are on the same chromosome and so
    should be inherited together. Mendel's Second Law
    should therefore not apply to the inheritance of
    A and B, but holds for the inheritance of A and
    C, or B and C. Mendel did not discover linkage
    because the seven genes that he studied were each
    on a different pea chromosome.

Partial linkage. Partial linkage was discovered
in the early 20th century. The cross shown here
was carried out by Bateson, Saunders and Punnett
in 1905 with sweet peas. The parental cross gives
the typical dihybrid result (see the figure on the
right), with all the F1 plants displaying the
same phenotype, indicating that the dominant
alleles are purple flowers and long pollen
grains. The F1 cross gives unexpected results, as
the progeny show neither a 9:3:3:1
ratio (expected for genes on different
chromosomes) nor a 3:1 ratio (expected if the
genes are completely linked). An unusual ratio is
typical of partial linkage.
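
A minimal sketch of how a departure from the 9:3:3:1 ratio can be checked with a chi-square statistic. The phenotype counts below are hypothetical and are not the 1905 data; 7.815 is the usual 5% critical value for 3 degrees of freedom.

```python
def chi_square_9331(observed):
    """Chi-square statistic of four F2 phenotype counts against 9:3:3:1."""
    total = sum(observed)
    expected = [total * w / 16 for w in (9, 3, 3, 1)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


CRITICAL_5_PERCENT_DF3 = 7.815

if __name__ == "__main__":
    # Hypothetical counts: parental classes in excess, recombinants rare.
    counts = [120, 10, 10, 20]
    stat = chi_square_9331(counts)
    print(round(stat, 2), "rejects 9:3:3:1:", stat > CRITICAL_5_PERCENT_DF3)
```

A large statistic with an excess of the parental classes is the pattern expected under partial linkage.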
The Core Aims of Genomics Science
  • (5) To characterize DNA sequence diversity
  • the farther apart two genes are, the more they
    tend to assort independently (randomly) →
    higher recombination frequency

Higher freq. → farther apart
Vermilion
The Core Aims of Genomics Science
  • (6) To compile atlases of gene expression
  • analyzing profiles of transcription and protein
  • traditional method: Northern blots
  • modern technology: microarrays
  • relative level of expression (differential
    expression)
  • patterns of covariation in gene expression →
    clues to unknown gene function (guilt by
    association)
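
A minimal sketch of the covariation idea: an uncharacterized gene whose expression profile correlates strongly with a known gene becomes a candidate for a related function. The gene names and expression values are hypothetical, and Pearson correlation is only one of several similarity measures that could be used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_partner(target, profiles):
    """Name of the gene whose profile correlates best with `target`."""
    others = (g for g in profiles if g != target)
    return max(others, key=lambda g: pearson(profiles[target], profiles[g]))


# Hypothetical expression levels across four conditions.
PROFILES = {
    "known_gene": [1.0, 2.0, 3.0, 4.0],
    "unknown_1": [2.0, 4.0, 6.0, 8.0],
    "unknown_2": [4.0, 3.0, 2.0, 1.0],
}

if __name__ == "__main__":
    print(best_partner("known_gene", PROFILES))
```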

The Core Aims of Genomics Science
  • (7) To accumulate functional data, including
    biochemical and phenotypic properties of genes
  • Near-saturation mutagenesis (screening hundreds
    of thousands of mutants to identify genes that
    affect traits as diverse as embryogenesis,
    immunology, and behavior)
  • high-throughput reverse genetics (methods to
    systematically and specifically inactivate
    individual genes).
  • Yeast Genome Deletion Project: http//www-sequence.
  • Mouse: http//
  • Proteomics: detecting protein expression and
    protein-protein interactions
  • Pharmacogenomicists study the interactions
    between small molecules (i.e. potential drugs)
    and proteins
  • Functional genomics: a crucial component is the
    study of various model organisms
  • Clone library: a collection of DNA fragments that
    are cloned into a vector

The Core Aims of Genomics Science
With Smith's site-directed mutagenesis the
researchers can study in detail how proteins
function and how they interact with other
biological molecules. Site-directed mutagenesis
can be used, for example, to systematically
change amino acids in enzymes, in order to better
understand the function of these important
biocatalysts. The researchers can also analyze
how a protein is folded into its biologically
active three-dimensional structure. The method
can also be used to study the complex cellular
regulation of the genes and to increase our
understanding of the mechanism behind genetic and
infectious diseases, including cancer.
GTC → Valine
GCC → Alanine
Site-directed mutagenesis
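
The GTC → GCC example can be sketched as a single-base substitution followed by a codon lookup. The table below is deliberately partial (only the valine and alanine codons of the standard genetic code), and the helper name `mutate` is made up for this illustration.

```python
# Partial standard genetic code: valine and alanine codons only.
CODON_TABLE = {
    "GTT": "Val", "GTC": "Val", "GTA": "Val", "GTG": "Val",
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
}


def mutate(codon, position, new_base):
    """Return `codon` with the base at 0-based `position` replaced."""
    return codon[:position] + new_base + codon[position + 1:]


if __name__ == "__main__":
    original = "GTC"
    mutant = mutate(original, 1, "C")          # change the middle base T to C
    print(original, CODON_TABLE[original], "->", mutant, CODON_TABLE[mutant])
```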
The Core Aims of Genomics Science
  • (8) To provide the resources for comparison with
    other genomes.
  • Comparative maps → allow genetic data from one
    species to be used in another species
  • Comparative maps → local gene order along a
    chromosome tends to be conserved → synteny (human
    and mouse genomes)
  • Even without synteny, conservation of gene
    function is known (say, from fly to primate)
  • Gene order conservation (GOC)

  • Mapping Genomes Genetic Maps
  • Genetic map: the relative order of genetic
    markers in linkage groups, in which the distance
    between markers is expressed in units of
    recombination frequency
  • Genetic markers: sequence tags, repeats,
    restriction enzyme polymorphisms (cutting sites)
  • In diploid organisms, genetic maps are
    assembled from data on the co-segregation
    of genetic markers, either in pedigrees or in
    the progeny of controlled crosses.
  • Genetic distance unit → centiMorgan (cM)
  • In humans, 1 cM = 1% recombination frequency
  • In humans, 1 cM ≈ 1 Mbp
  • 100 cM → 1 crossover occurs per chromosome per
    generation
  • Markers on different chromosomes have a 50-50
    chance of co-segregation → 50 cM (0.5 crossover
    occurs per generation)
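
A minimal sketch of converting a recombination fraction into a map distance. For small fractions the distance is roughly 100 × r cM; for larger ones a map function corrects for undetected double crossovers. The Haldane and Kosambi functions below are the standard textbook forms; the slides do not say which one, if any, was used.

```python
from math import log


def haldane_cm(r):
    """Haldane map distance in cM for recombination fraction 0 <= r < 0.5."""
    return -50.0 * log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM for recombination fraction 0 <= r < 0.5."""
    return 25.0 * log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


if __name__ == "__main__":
    for r in (0.01, 0.10, 0.25, 0.40):
        print(r, round(100 * r, 1), round(kosambi_cm(r), 1), round(haldane_cm(r), 1))
```

Both functions approach 100 × r as r shrinks and grow without bound as r approaches 0.5, which is why unlinked markers cannot be placed on a single map.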

Mapping Genomes Genetic Maps
(A) A pair of different parental
chromosomes (green and blue colors). (B) A table
showing the frequency of recombinants between
each marker. Larger number indicates that the
genes are farther apart. (C) The most likely
genetic map from the entire data. In this
hypothetical example, two linkage groups are
inferred, the top one is longer than 50 cM.
Genetic distance: 0.11 → 11 cM, 0.22 → 21 cM,
0.25 → 24 cM, 0.33 → 33 cM
Figure 1.1
Mapping Genomes Genetic Maps
  • Software for the assembly of genetic maps
  • http//
  • Multiple factors lead to high variation in the
    correspondence between physical and genetic
    distance
  • Recombination rate varies along a chromosome
    (centromeres and telomeres are less
    recombinogenic than general euchromatin) →
    hot spots and cold spots of recombination

Exercise 1.1 (Part 1) Constructing a genetic map
Constructing a genetic map for four recessive loci:
thickskin, reddish, sour, petite. After
identifying two true-breeding trees that are
either completely wild-type or mutant for all
four loci, the breeder crosses them, and then
plants an orchard of F2 (second generation)
trees. Q. Based on the following frequencies of
mutant classes, determine which loci are likely
to be on the same chromosome and which are the
most closely linked.
Exercise 1.1 (Part 2) Constructing a genetic map
  • Answer: the total number of trees is 968. Assuming
    independent assortment, each recessive phenotype
    should appear in 1/4 of the trees → 242.
    Observed: 242 petite (127 + 42 + 38 + 12 + 10 + 8
    + 3 + 2), 249 reddish, 247 sour and 236 thickskin
  • Unlinked loci are expected to segregate
    independently → about 60 trees (that is,
    1/4 × 1/4 × 968) in each double-mutant class
Approximate solution
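
The expectation used in this exercise can be sketched as follows. The totals and single-mutant counts come from the exercise; the double-mutant count passed to `excess_over_independence` is hypothetical, since the frequency table itself is not reproduced in this transcript.

```python
TOTAL_TREES = 968  # from the exercise


def expected_single(total):
    """Expected number of F2 trees showing one recessive phenotype (1/4)."""
    return total / 4


def expected_double(total):
    """Expected number of double mutants for two unlinked loci (1/4 * 1/4)."""
    return total / 16


def excess_over_independence(observed_double, total):
    """Ratio of observed to expected double mutants; above 1 hints at linkage."""
    return observed_double / expected_double(total)


if __name__ == "__main__":
    observed_singles = {"petite": 242, "reddish": 249, "sour": 247, "thickskin": 236}
    for locus, count in observed_singles.items():
        print(locus, count, "expected", expected_single(TOTAL_TREES))
    print("expected double mutants per pair:", expected_double(TOTAL_TREES))
    # Hypothetical double-mutant count, NOT taken from the exercise table.
    print(excess_over_independence(121, TOTAL_TREES))
```

Pairs of loci whose double-mutant class is far above about 60 trees are the candidates for being on the same chromosome; the largest excess marks the most closely linked pair.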
Mapping Genomes Physical Maps
  • Physical maps
  • an assembly of contiguous stretches of
    chromosomal DNA (contigs) in which the
    distance between landmark sequences of DNA is
    expressed in kilobases
  • the ultimate physical map is the complete
    sequence
  • Applications
  • (1) provide a scaffold upon which polymorphic
    markers can be placed
  • (2) facilitate finer-scale linkage mapping
  • (3) confirm linkages inferred from recombination
    frequencies
  • (4) resolve ambiguities about the order of
    closely linked genes
  • (5) enable detailed comparisons of regions of
    synteny between genomes

Mapping Genomes Physical Maps
  • Two strategies are used to assemble contigs
  • (1) Alignment of randomly isolated clones based on
    shared restriction fragment length profiles
  • YAC: 1 Mbp long fragments
  • BAC: 100 kbp long fragments
  • Plasmid: kbp long fragments
  • Automatic restriction profiling (Ch. 2)
    → assemble contigs (short for "contiguous"
    stretches of DNA)

Genomic clone library
Unlike the case of φX174, no large genome could
be completely sequenced without an extra round
of fragmentation into manageably sized chunks.
In other words, it had to be transferred into one
or more clone libraries, from which individual
clones were picked to be "subcloned" in M13 for
sequencing. The general outline of the procedure
is shown at right. You can see that φX174
bypassed the first stage, the construction of a
clone library from the target genome. A cDNA
library is made from RNA that has been reverse
transcribed into cDNA; such libraries are used
for EST sequencing projects.
Cloning vectors
Mapping Genomes Physical Maps
  • (2) Hybridization-based approaches: chromosome
    walking
  • Chromosome walking is used as a means of finding
    adjacent genes (positional cloning), or parts of
    a gene which are missing in the original clone as
    well as to analyze long stretches of eukaryotic
    DNA. This task requires finding a set of
    overlapping fragments of DNA that spans the
    distance between the marker and the gene.
  • Genomic DNA is shown in blue. Selected clones
    from a library of cloned genomic DNA fragments
    are shown in red. The initial probe, probe a, is
    specific to gene A or exon A and allows
    identification of clones 1 and 2. A new probe,
    probe b, is prepared from one end of clone 2 and
    used to isolate new clones 3 and 4 from the
    genomic library. Probe c, prepared from clone 4
    is used to identify clone 5, etc. The orientation
    of the clones is determined by restriction
    mapping of the clones. Clone 6 contains the
    desired gene B or exon B.

Mapping Genomes Cytogenetic Maps
  • Historically, these aid in the alignment of
    physical and genetic maps
  • Cytogenetic maps are the banding patterns
    observed through a microscope on stained
    chromosome spreads
  • Traditional preparations: salivary gland polytene
    chromosomes (greatly enlarged relative
    to their usual condition) of insects and
    Giemsa-banded mammalian metaphase karyotypes
  • http//
  • Chromosomes → the genetic material → phenotypes
    or medical conditions correlate with the deletion
    or rearrangement of chromosome sections
  • Cytogenetic maps are aligned with the physical map
    through in situ hybridization: a cloned
    fragment is annealed to a single location on the
    cytogenetic map
  • NCBI Genomic Biology http//
  • Keyword: HOX AND homo[ORGN]

Mapping Genomes Cytogenetic Maps
Alignment of cytological, physical, and genetic
maps.
  • Cytological map: a representation of a
    chromosome based on the pattern of staining of
    bands
  • Physical map: the location of transcripts
    and sites of insertions and deletions
  • Genetic map: recombination rates vary along a
    chromosome, typically reduced near the telomere
    and centromere
  • Distances between genetic, physical and
    cytological markers are not uniform
  • How to search for genes on a genome map?
    See my lecture notes for the Bioinformatics class.
  • Comparative Genomics

Synteny: conservation of gene order between
chromosome segments of two or more organisms.
Homologs: highly conserved loci derived from a
common ancestral locus.
Orthologs: similar genes in different species that
arose as a result of an evolutionary split
(speciation).
Paralogs: similar genes that arose as a result of
duplication.
  • Conservation of gene order is an inverse
    function of the time since divergence from the
    ancestral locus.
  • Note: rates of divergence vary considerably at
    all taxonomic levels.
  • The Japanese pufferfish genome, 7.5 times smaller
    than the human genome, shows extensive gene order
    similarity with humans: around 50-80% is in the
    same order as is found in the human genome

Comparative Genomics
  1. Chromosome painting is used to define regions of
    synteny; it covers regions (0.1 of a chromosome arm)
  2. Each chromosome of one species is labeled with a
    set of fluorescent dyes, and hybridized to
    chromosome spreads of the other genome.
  3. Uses the fluorescent in situ hybridization (FISH)
    technique to detect DNA sequences in metaphase
    spreads of animal cells. The fluorescently
    labeled hybrid karyotype is shown at the bottom.

Comparative Genomics
Synteny between cat and human genomes. Ideograms
for each of the 24 human chromosomes, shown on
the right in each pair, are aligned against
color-coded representations of the corresponding
cat chromosomes.
  • Cat: six groups (A-F) of 2-4 chromosomes each
  • Top row: 12 autosomes that are essentially
    syntenic along their length, except for some
    rearrangements
  • Bottom row: 10 autosomes that have at least one
    major rearrangement
  • The two sex chromosomes are essentially syntenic
    between cat and human
Comparative Genomics
  • Sequence conservation → functional importance
  • High-resolution comparative physical mapping
    found a 1 Mbp synteny region between human and
    mouse
  • May contain hundreds of genes, with local
    inversions and insertions/deletions involving
    one or a few genes
  • Families of genes organized in tandem clusters
  • Considerable size variation in intergenic "junk"
    DNA

Comparative Genomics
  • Identifying genes and regulatory regions in
    sequenced genomes is challenging
  • Open reading frames (ORFs) are usually good
    indication of genes
  • However, it is difficult to determine which ORFs
    belong to a gene
  • Many mammalian genes have small exons and large
    introns
  • Regulatory sequences are even more difficult to
    identify

Comparative Genomics
  • Computer programs analyze genomic sequence
  • GeneFinder
  • Look for ORFs, splice sites, poly A addition
    sites, etc.
  • Predict gene structure
  • Frequently wrong
  • Usually miss exons at beginning or end of gene
  • Sometimes predict an exon when one doesn't really
    exist

Comparative Genomics
  • When comparing genomes of different species, the
    genes normally have the same exon-intron
    structure
  • Look for conserved ORFs in both genomes
  • Frequently permits accurate identification of
    genes
  • Fugu-human comparison found >1,000 genes
  • Mouse-human comparison indicates only 25,000
    genes in the genome

Example of sequence comparison
  • Comparison of the human and mouse spermidine
    synthase genes revealed an additional intron in
    the human gene that is not found in the mouse

5,500 bp
  • The Human Genome Project (HGP)
  • Objectives
  • Generation of high-resolution genetic and
    physical maps that will help in the
    localization of disease-associated genes.
  • The attainment of sequence benchmarks, leading to
    generation of a complete genome sequence by the
    year 2005. (A draft version was achieved in May
    2000, but finished sequence required an error
    rate of less than 1 in 10,000 bp.)
  • Identification of each and every gene in the
    genome by a combination of bioinformatic
    identification of open reading frames (ORFs),
    generation of voluminous EST databases, and
    collation of functional data, including
    comparative data from other animal genome
    projects.
  • Compilation of exhaustive polymorphism databases,
    in particular of SNPs, to facilitate integration
    of genomic and clinical data, as well as studies
    of human diversity and evolution.

The Human Genome Project (HGP)
  • Table 1.1 Initial Goals of the HGP
  • From the First 5-Year Plan 1993-1998
  • Table 1.2 A Blueprint for the Future of the HGP
  • 15 Grand Challenges in the Third 5-Year Plan
    2003 2005
  • HGP budget set aside for research on the
    ethical, legal, and social implications of
    genetic research (the ELSI project)

The Human Genome Project
The architecture of the Human Genome Project in
the twenty-first century. Three major themes for
future genome research are founded on six
pillars of genome resources.
  • Box 1.1 The Ethical, Legal, and Social
    Implications of the HGP
  • Funding: the National Human Genome Research
    Institute (NHGRI) commits 5% of its annual budget
    to ELSI research
  • Funding covers three types of activities: regular
    research grants, education grants, and intramural
    programs at the NIH campus
  • Web sites http//
  • http//
  • 4 major objectives
  • 4 main subject areas

  • Great concern is the privacy and confidentiality
    of genetic information.
  • Especially Iceland (http//) and Estonia (http//)
  • → government-sponsored databases of medical
    records have been supplied to medical research
  • Psychological impact and potential for
    stigmatization inherent in the
    generation of genetic data → racial mistrust and
    socioeconomic differences in gathering of and
    access to genetic information
  • Reproductive issues
  • Potential moral (possible legal) obligations once
    data has been obtained.
  • Philosophical discussions human responsibility,
    human right to play God with genetic material,
    meaning of free will in relation to genetically
    influenced behaviors
  • Genetically Modified Organisms (GMOs)
  • 1998 Five new major aims

1.7 (Part 1) Whose genome was sequenced?
The content of the Human Genome
  • Completion of the first draft of the HGP was
    announced at press conference in May 2000, but
    publication of the result was delayed until Feb.
    of 2001.
  • Needs refinement of the seq. assembly, including
    gap closure, gene annotation, and prediction
  • It is estimated that the total number of genes is
    somewhere around 25,000 ( two times greater than
    gene contents of the fruit-fly and C.elegans, and
    five times greater than yeast, see Table 1.3 for
    more details)
  • Table 1.3 Comparison of Gene Content in some
    Representative Genomes
  • No dramatic differences in gene content between
    humans and other mammals.
  • Sep. 1994: the first high-resolution genetic map
    of the complete genome, with 23 linkage groups (one
    per chromosome) and 1200 markers at an average
    of 1 cM intervals
  • Around 1995: physical map with 52,000 sequence tag
    sites (STS) at 60 kbp intervals
  • 1998: 3000 SNPs
  • Middle of 2004: 1.8 million mapped SNPs, see The
    SNP Consortium (TSC) http//
  • Providing polymorphic markers at 2 kb intervals
    and placing 85% of all exons within 5 kb of a SNP.
  • 2000: the first draft of the smallest human
    chromosome, chromosome 21, was published

The content of the Human Genome
  • Two questions for the HGP
  • (1) Whose genome was sequenced?
  • The sequence is derived from a collection of
    several libraries obtained from a set of
    anonymous donors. Both the IHGSC and the private
    firm Celera Genomics assembled their seq. from
    multiple libraries of ethnically diverse
    individuals
  • One particular individual's DNA contributed
    3/4 and 2/3 of the raw seq., respectively.

Size of shaded sector: amount of seq.
contributed by a single individual
The content of the Human Genome
  • The Celera sample included at least one
    individual from each of four ethnic groups, as
    well as both males and females.
  • Craig Venter admitted that his own DNA
    contributed substantially to the Celera sequence
  • His own poodle contributed to the
    first-draft canine genome seq.

The Human Genome Project
  • (2) When can we regard it as finished?
  • The complete seq. of 99% of human euchromatin has
    been published to an estimated error rate of 1
    event in 100,000 bases.
  • Human polymorphism is an order of magnitude
    greater than this → at least 10 SNPs for each
    seq. error
  • Extensive tracts of heterochromatin (where there
    are few or no genes, e.g. at centromeres and
    telomeres), mostly associated with centromeres,
    which may account for as much as 20% of the total
    genome, will probably never be sequenced.
  • Since the completion of the first draft → the HGP
    focuses on characterizing human diversity.
  • International HapMap project: map all of the
    major haplotypes in the human genome and
    characterize their distribution among
    populations, as a step toward identification of
    human disease susceptibility factors, see

Figure 1.8 The National Center for
Biotechnology Information (NCBI) Web site.
  • Internet Resources
  • NCBI and Ensembl

NCBI http//
Ensembl http//, a collaboration between
EMBL-EBI and the Sanger Center in the UK. Both
sites provide high-resolution physical maps of
any segment of the genome.
Several genome views: UCSC Genome Browser
http://genome.cse.ucsc.edu
Commercial web sites: Incyte Genomics,
Celera, Rosetta Inpharmatics, Informax, and LION
Biosciences http//
Internet Resources NCBI and Ensembl
  • Ex. 1.2 Use the NCBI and Ensembl genome browsers
    to examine a human disease gene. Use OMIM to
    identify a gene that is implicated in the
    etiology of the disease.
  • Ans.
  • Go to http// → Asthma →
    find a gene of interest, for example
    Interleukin 13 (IL13). This page gives a lot of
    textual information and links to other sites,
    including the Human Gene Mutation Database (HGMD)
    and Entrez Gene
  • (a) What are the various identifiers of the gene?
  • 147683
  • (b) Where is the gene located on the chromosome
    (cytologically and physically)?
  • The cytological location is 5q31 (chromosome 5,
    long arm). Click on Gene map locus → 5q31 → click
    location 5q31 → click NCBI MapViewer
  • → position 132.02 Mb; the Gene ID for IL13 is 3596
  • → Gene aliases: ALRH, P600, IL-13, MGC116786,
    MGC116788, MGC116789
  • (c) What is the RefSeq for the gene?
  • The RefSeq is NM_002188, an mRNA seq.
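
A minimal sketch of how a cytological location such as 5q31 can be split into its parts (chromosome, arm, region, band). The function name and the regular expression are assumptions written for this illustration; they cover only simple single-band locations, not ranges such as 5q31-q33.

```python
import re

# chromosome, arm (p = short, q = long), region digit, band digit, optional sub-band
BAND_PATTERN = re.compile(r"(\d{1,2}|[XY])([pq])(\d)(\d)(?:\.(\d+))?")


def parse_band(location):
    """Split a simple cytogenetic location like '5q31' into its components."""
    match = BAND_PATTERN.fullmatch(location)
    if match is None:
        raise ValueError(f"unrecognized band notation: {location!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",
        "region": int(region),
        "band": int(band),
        "sub_band": sub_band,
    }


if __name__ == "__main__":
    print(parse_band("5q31"))
```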

Internet Resources NCBI and Ensembl
  • (d) How many exons are there in the major
    transcript, and how long is it?
  • From Entrez Gene → Display Gene table → 4
    exons, 1282 bp long, encoding a 146 amino acid
    protein; or use NCBI MapViewer → Consensus CDS
  • From RefSeq ID NM_002188, link to GenBank
    → signal peptide (interleukin 13 precursor), 34 aa
    (seq. 15..116),
  • mat_peptide (interleukin 13 precursor), 98 aa
  • (e) What is known about the function of the gene?
  • See NCBI description - This gene encodes an
    immunoregulatory cytokine produced primarily by
    activated Th2 cells. This cytokine is involved in
    several stages of B-cell maturation and
  • (f) Do the two annotations agree? Which browser
    do you prefer, and why?
  • Ensembl http//, select
    gene → type IL-13 → Ensembl gene ID
  • GeneView shows: Exons 4; Transcript
    length 1,282 bp; Protein length 146 residues
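
A small sketch relating a feature's nucleotide span to its length in amino acids. It assumes the "15 116" quoted for the signal peptide is a GenBank-style location 15..116 (inclusive, 1-based), which is consistent with the 34 aa stated; the helper names are made up for this illustration.

```python
def feature_length(location):
    """Length in nucleotides of a simple GenBank-style span 'start..end'."""
    start, end = (int(part) for part in location.split(".."))
    return end - start + 1          # both ends are inclusive


def codon_count(location):
    """Number of whole codons covered by the span."""
    return feature_length(location) // 3


if __name__ == "__main__":
    span = "15..116"                # assumed reading of the slide's "15 116"
    print(feature_length(span), "nt =", codon_count(span), "codons")
```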

Internet Resources - OMIM
  • Online Mendelian Inheritance in Man
  • A database that provides text summarizing recent
    genetic research in response to a query about a
    particular disease, as well as links to MedLine
    and GenBank and other information.
  • Intended for physicians and human geneticists
  • Disease types include muscle, metabolic,
    cardiovascular, and physiological disorders.
  • OMIM lists in excess of 15,000 known
    disease-causing Mendelian disorders.
  • GEO BLAST tool: searches for all genes in the gene
    expression database that have similar seq., and
    then compares levels of expression of the genes
    across species and experimental conditions.

Figure 1.9 The Mendelian Inheritance in Man
(OMIM) Web site
Internet Resources - OMIM
OMIM http//
Use OMIM help
Internet Resources - OMIM
OMIM has a defined numbering system: certain
positions within the number indicate information
about the genetic disorder itself. The first
digit gives the mode of inheritance of the disorder:
  • 1: autosomal dominant
  • 2: autosomal recessive
  • 3: X-linked locus or phenotype
  • 4: Y-linked locus or phenotype
  • 5: mitochondrial
  • 6: autosomal locus or phenotype
Internet Resources - OMIM
  • The distinction between 1 or 2 and 6 is that entries
    cataloged before May 1994 were assigned either a
    1 or 2, whereas entries after that date were
    assigned a 6 regardless of whether the mode of
    inheritance was dominant or recessive.
  • An asterisk (*) before the number: the phenotype
    caused by the gene at this locus is not
    influenced by genes at other loci; however, the
    disorder itself may be caused by mutations at
    multiple loci
  • A number sign (#) before the number: the
    phenotype is caused by two or more genetic
    mutations
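
The first-digit rule can be sketched as a simple lookup. The dictionary mirrors the categories listed above; the function name `mim_category` is hypothetical, and the two example numbers are the ones that appear in these slides (604896 for MKKS, 147683 for IL13).

```python
# First digit of a MIM number → category, as described in the slides.
MIM_FIRST_DIGIT = {
    "1": "autosomal dominant (entries created before May 1994)",
    "2": "autosomal recessive (entries created before May 1994)",
    "3": "X-linked locus or phenotype",
    "4": "Y-linked locus or phenotype",
    "5": "mitochondrial",
    "6": "autosomal locus or phenotype (entries created after May 1994)",
}


def mim_category(mim_number):
    """Return the inheritance category implied by a six-digit MIM number."""
    digits = str(mim_number)
    if len(digits) != 6 or digits[0] not in MIM_FIRST_DIGIT:
        raise ValueError(f"not a recognized MIM number: {mim_number!r}")
    return MIM_FIRST_DIGIT[digits[0]]


if __name__ == "__main__":
    for number in (604896, 147683):
        print(number, "->", mim_category(number))
```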

Internet Resources - OMIM
  • Example 604896 (MKKS)

Display allelic variants
Allelic variants: a description of the clinical or
biochemical outcome of that particular mutation
is given after each allelic variant
Allelic variants for MKKS
Internet Resources - OMIM
  • OMIM indicates that the gene SRY encodes a
    transcription factor that is a member of the
    high-mobility group-box family of DNA binding
    proteins. Mutations in this gene give rise to XY
    females with gonadal dysgenesis; translocation
    of part of the Y chromosome containing this gene
    to the X chromosome gives rise to XX males
  • Q 1a. An allelic variant of SRY causing sex
    reversal with partial ovarian function has been
    cataloged in OMIM. What was the mutation at the
    amino acid level and what is observed in XY mice
    carrying this mutation?
  • Ans. Use SRY AND human for the OMIM search →
    then view the list of allelic variants. Variant
    0020 is the correct entry. The mutation is
    Gln2Ter; XY mice are fertile females, although
    fertility is reduced and ovaries fail early.

Internet Resources - OMIM
  • Q1b. Follow the Gene Map link in the left sidebar
    to access the MIM gene map, one other gene is
    found at the same cytogenetic map location. What
    is the name of this gene, and what methods were
    used to map the gene to this location?
  • Ans. Click Gene Map in the left sidebar. The
    correct gene is ZFY. Under the Methods column,
    REn and A are listed. Clicking on the Methods
    hyperlink at the top of the column shows the key
    to the abbreviations. REn stands for neighbor
    analysis in restriction fragments; A stands for
    in situ hybridization
Figure 1.10 (Part 1) A gallery of animal genome
sequencing projects
Animal Genome Projects
  • The International Sequencing Consortium (ISC)
  • A database of animal and plant genome sequencing
    projects
  • Some of these organisms are shown in Figure 1.10

Figure 1.10 (Part 2) A gallery of animal genome
sequencing projects
Animal Genome Projects
  • At the National Human Genome Research Institute
    (NHGRI), the decision to commit the tens of
    millions of dollars required for any new genome
    is made by a council of senior genome scientists
    on the basis of a 10-page white paper
  • They weigh the expected impact of the sequence on
    enabling biomedical research and the annotation
    of sequence function
  • A draft genome can be produced for most animals
    within 3-6 months

1.10 (Part 3) A gallery of animal genome
sequencing projects
  • GenBank Files Box 1.2
  • There are many ways to present the structure and
    annotation of a gene or seq.
  • due to alternative splicing and transcription
    start sites (TSS), and to the small errors that
    occur during cDNA cloning
  • all genomes are full of polymorphisms
  • The same gene may be represented by multiple
    different seqs. or annotations in the genome
  • RefSeq: hand curation by experts
  • Example: human HoxA1, 11421562
  • Go to http//
  • LOCUS XM_004915, GI:14751246
  • Followed by the references
  • Features section (CDS, misc_feature, etc.),
    links to GeneID, MIM, CDD
  • Next comes the seq. in FASTA format; it can also
    be displayed in XML or ASN.1 file format
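
A minimal sketch of reading FASTA-formatted text, where each record is a ">" header line followed by one or more sequence lines. The record below is a made-up placeholder, not the HOXA1 sequence, and `parse_fasta` is a hypothetical helper rather than a function from any library.

```python
def parse_fasta(text):
    """Return {header: sequence} for FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            header = line[1:]             # drop the leading '>'
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


if __name__ == "__main__":
    example = ">example_record hypothetical sequence\nATGGC\nCTGA\n"
    for name, seq in parse_fasta(example).items():
        print(name, len(seq), seq)
```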

GenBank Files Box 1.2
Use Entrez Gene HOXA1
Two isoforms
GenBank format
Graph display HOXA1
GenBank Files Box 1.2
Ensembl - http// Gene
GenBank Files Box 1.2
  • UCSC Genome Browser http//
  • Gene HOXA1

Figure 1.11 The Mouse Genome Informatics (MGI)
Web site
  • Rodent Genome Projects
  • Mouse Genome Informatics (MGI) http//www.informat
  • Three major advantages of rodent research are:
  • 1. The existence of a large number of mutant
    strains that, combined with whole-genome
    mutagenesis, lead to genetic analysis of every
    identified locus in the genome
  • 2. The existence of a panel of approximately 100
    commonly used laboratory mouse strains
    with well-characterized genealogy: a resource
    for the study of genetic variation
  • 3. The existence of conserved seq. blocks is
    generally an indicator of functional constraint
  • 2002: draft of the mouse genome
  • 2004: draft of the rat genome

Rodent Genome Projects
  • Functional genomic analysis of rat has been
    stimulated by three major advances achieved in
    the 1990s
  • The technology for targeted (site-directed)
    mutagenesis by homologous recombination of the
    wild-type locus with a disrupted copy
  • Saturation random (unbiased) mutagenesis programs
    - gather information about the entire sequence
    space, i.e., the relationship between aa sequence,
    3D protein structure and function
  • Emergence of phenomic analysis, in which
    mutagenized lines are subject to biochemical,
    physiological, immunological, morphological, and
    behavioral tests in parallel → large-scale
    identification of genes required for non-lethal
    phenotypes

Figure 1.12 Mouse-human synteny and sequence
Rodent Genome Projects
  • Conservation of gene order and DNA seq. between
    the human and mouse genomes
  • http//
  • Blocks of synteny between mouse (chr. 11) and
    parts of five different human chromosomes
  • Enlarged view of a small region of human 5q31. In
    this approximately 1 Mb region there is almost
    perfect correspondence in the order, orientation,
    and spacing of 23 putative genes, including four
  • Enlargement of the alignment of 50 kb that
    includes the genes KIF-3A, IL-4 and IL-13. Blue
    dots show the distribution of conserved seq.
    (with 50-100% identity). Two of the conserved
    blocks (red bars) fall between genes, whereas
    most of the others (blue bars) are in the introns
    and exons of the genes.
  • Use PipMaker http//
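
The percent-identity view described above can be approximated with a sliding window over a pairwise alignment. The sketch below is not PipMaker's algorithm; it assumes two already-aligned, equal-length strings (with `-` for gaps) and uses invented toy sequences. The function name `window_identity` and the 50% cut-off handling are illustrative assumptions.

```python
def window_identity(aln_a, aln_b, window=10, step=10):
    """Return (start, percent identity) for each window of an alignment.

    Gap columns ('-') never count as matches.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aln_a) - window + 1, step):
        pairs = zip(aln_a[start:start + window], aln_b[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Invented toy alignment: two conserved windows followed by a diverged one.
HUMAN = "ACGTACGTACGGTTAACCGGATATATATAT"
MOUSE = "ACGTACGTACGGTTTACCGGCGCGCGCGCG"

if __name__ == "__main__":
    for start, pct in window_identity(HUMAN, MOUSE):
        label = "conserved" if pct >= 50 else "not conserved"
        print(start, pct, label)
```

Windows scoring at or above 50% correspond to the dots plotted in the figure; everything below that threshold is simply not drawn.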

Exercise 1.3 Compare the structure of a gene in
a mouse and a human
Rodent Genome Projects
  • Use NCBI http//
  • choose Genome biology
  • mouse chr.11
  • use Maps and options
  • → add human gene map

Rodent Genome Projects
  • Mouse Genome Informatics (MGI)
  • http//
  • Integrate physical and genetic maps
  • Search for ortholog genes
  • Online comparison of the mouse and human genome

Rodent Genome Projects
  • Ex. 1.3
  • Use either NCBI or Ensembl browser, explore the
    structure of the gene used in Fig. 1.2 in a mouse
    and a human (and other vertebrates)
  • Ans. Ensembl http//
  • type in human IL13 (ENSG00000169194) →
    Orthologue Prediction → view all genes in
    MultiContigView → IL13 is on mouse chr. 11,
    human chr. 5, and rat chr. 10

Box 1.2 (Part 2) GenBank Files
Other Vertebrate Biomedical Models
  • 2004: chicken (G. gallus) and dog (C.
    familiaris) genomes are fully sequenced
  • Motivation: biomedical
  • Chicken: model for oncogenesis and virology
  • Dog: model for complex diseases such as asthma,
    parasite infection, cancer,
    arthritis, diabetes, and behavioral
  • Applications
  • Artificial selection on breed diversity
  • Research into avian evolution
  • Vertebrate development → Zebrafish
  • transparent embryogenesis, ease of culture,
    existence of dense genetic map
  • Found that thousands of genes are required for
    proper development of organs
  • http//
  • a variety of ecologically and commercially
    important fish species, such as sticklebacks,
    cichlids,

Other Vertebrate Biomedical Models
  • Sequencing of nonhuman primates, such as rhesus
    macaque and chimpanzee, is intended to help
    understand the origins of diversity in the immune
    system as well as mechanisms of pathogen
  • Comparison of human and chimp seq.
  • Many genes seem to have been positively selected
  • Humans are differentiated from chimps by small
    deletions up to 10 kb in length, which occur on
    average every 500 kb along chromosome 21

Animal Breeding Projects
  • OMIA (Australia) genome maps for over a dozen
    species of agricultural importance
  • http//
  • Access data on inheritance patterns for species
    other than human and mouse
  • Benefits of breeding programs lie in improvements
    in yield, infectious disease resistance,
    adaptation to climatic conditions, improved food
    quality, maximizing the benefits of transgenic
  • These goals will be met both through enhanced
    genetic map development and association studies
    using SNP technology
  • ArkDBs (UK, Roslin Institute in Edinburgh)
  • http//
  • genome resources for 10 species

Invertebrate Model Organisms
  • Generic Model Organism Database (GMOD)
  • http//
  • A coordinated effort of the mammalian,
    invertebrate, and plant genome communities to
    standardize web tool construction and
    implementation and to provide open source
    software for database management

Figure 1.13 The GMOD project
Invertebrate Model Organisms
A 40 kb region of cytological band 43E of fruit
fly, centered on the saxophone gene.
Figure 1.14 Drosophila gene annotation
Invertebrate Model Organisms
  • Flybase
  • http//
  • Search for the gene symbol sax
  • click the gene region map
  • http//
  • each gene either has a number beginning with CG
    or is identified by its standard name (e.g. sax)
  • show gene and mRNA
  • transposable element insertions (Burdock, one is
    shown in pink)

Invertebrate Model Organisms
  • The first multicellular eukaryote to be
    sequenced completely was C. elegans, in 1998
  • Fruit fly sequence completed in 2000
  • Decades of genetic analysis have led to the
    molecular characterization of up to 20% of the
    complement of genes in these two organisms
  • Over 90% of the true genes seem to have been
  • Assigned a tentative function based on seq.
  • 1/3 to 1/4 of the predicted genes remain
    "orphans" with no known seq. similarity to genes
    in any other organism → without functional data

Invertebrate Model Organisms
  • Ongoing EST sequencing, gene structure and
    mutational analysis
  • Unexpected: there may be 50% more genes in the
    C. elegans genome (19,000) than there are in the
    fly genome (13,500), despite the fact that the
    fly is much more complex at several levels,
    including (1) the number of cells, (2) number of
    cell types, and (3) organization of the nervous
  • Nematode: a surprising surplus of
    steroid-hormone receptors
  • Fruit fly: olfactory receptor family
  • There is no simple relationship between gene
    number and tissue complexity
  • There is a high degree of conservation of all the
    major regulatory and biochemical pathways, most of
    which are identifiable not only in both nematodes
    and flies but also in the unicellular eukaryote
    yeast and in vertebrate genomes

Invertebrate Model Organisms
  • Functional genomics → a major impact of the
    invertebrate genome projects is the prospect of
    obtaining mutations in every single gene of the
  • In fly: by a combination of saturation
    mutagenesis and a library of overlapping
    deficiencies (deletions) that remove every segment
    of each chromosome
  • In nematode: saturation mutagenesis and RNAi
    (double-stranded RNA fed to the worms)

Figure 1.15 Human disease genes in model
Invertebrate Model Organisms
  • >60% of a sample of 289 human disease genes have
    orthologous genes in the fly
  • <60% in nematode
  • 20% in yeast
  • Fig. 1.15 shows the fraction of human disease
    genes in each of six categories that have
    orthologs in the fly, nematode and yeast genome,
    as detected by seq. similarity at three levels of
  • Conservation of genetic interactions across the
    animal kingdom → uncover genes that interact
    with known disease-promoting loci
  • Pharmaceutical companies are interested in
    invertebrate genomics for its potential to
    identify drugs that affect neural function
  • Example: fluoxetine resistance in nematodes,
    alcohol tolerance in flies
  • Molecular interactions between gene products can
    be conserved, which allows the functional
    comparison of genes across species

Honey Bee
  • http//

  • Sea urchin
  • http//

Box 1.3 Managing and Distributing Genome Data
  • Internet technology is essential for genomic
  • NCBI, EBI, LIMS (laboratory information
    management systems)
  • DB: RDB (relational DB) and OODB
    (object-oriented DB)
  • RDB: very effective for sorting, searching, and
    distributing data that fits into table form
  • OODB: good at handling complex data structures
    and useful for performing analyses on
    sequence objects (data with functions for
    operating on the data) → a very efficient
    programming approach
  • DB query language: SQL (structured query
    language)
  • http//
  • Scripting language (no need to compile): Perl,
    good for extracting and processing text files
  • http//
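
To make the relational-database and SQL bullets concrete, here is a minimal sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `gene`, its columns, and the helper names `build_db` and `chromosomes_for` are invented for illustration; the three rows reuse the IL13 chromosome assignments quoted in the answer to Exercise 1.3.

```python
import sqlite3


def build_db():
    """Create an in-memory relational table of gene locations (toy schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene (symbol TEXT, organism TEXT, chromosome TEXT)"
    )
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?)",
        [
            ("IL13", "human", "5"),
            ("Il13", "mouse", "11"),
            ("Il13", "rat", "10"),
        ],
    )
    return conn


def chromosomes_for(conn, symbol):
    """SQL query: where does each organism carry the given gene symbol?"""
    cursor = conn.execute(
        "SELECT organism, chromosome FROM gene "
        "WHERE UPPER(symbol) = UPPER(?) ORDER BY organism",
        (symbol,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for organism, chromosome in chromosomes_for(build_db(), "IL13"):
        print(organism, chromosome)
```

The data fits naturally into rows and columns, which is why sorting and searching it with a single `SELECT` is so direct.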

Box 1.3 Managing and Distributing Genome Data
Plant Genome Projects
  • Arabidopsis thaliana: the first plant genome to
    be sequenced, between 1999 and 2000
  • 115 Mb, 25,000 genes, about 2 times the number of
    fly genes
  • Evolved via two rounds of whole genome
    duplication → shuffling of chromosome regions
    and considerable gene loss
  • >1500 tandem arrays (generally 2 or 3 copies) of
    repeated genes have been identified, 11,000 gene
  • Some geneticists regard this number as
    representative of the minimal complexity required
    to support multicellularity
  • It is believed that all plant and animal genomes
    represent modifications of a toolkit of gene
    families that evolved >10^9 years ago

Figure 1.16 Chromosome duplications in the
Arabidopsis thaliana genome
Plant Genome Projects
  • >30 segmental duplications
  • 7 intra-chromosomal duplications are shown as
    duplicated blocks of color within three of the
    five chromosomes: five duplications occur in the
    first chromosome, and the fourth and fifth
    chromosomes display one duplication apiece
  • Another two dozen inter-chromosomal segmental
    duplications. A twist in the band → an inversion
    accompanied the duplication event

Plant Genome Projects
  • Plant genomes: plant-specific genes
  • Enzymes required for cell wall biosynthesis
  • Transport proteins that move organic nutrients,
    inorganic ions, toxic compounds, metabolites, and
    even proteins and nucleic acids between cells
  • Enzymes required for photosynthesis, such as
    Rubisco and electron transport proteins
  • Products involved in plant turgor,
    phototropic and gravitropic
  • Enzymes and cytochromes involved in the
    production of secondary metabolites found in
    flowering plants
  • A large number of pathogen resistance (R) genes,
    analogous to the mammalian immune system. R genes
    are dispersed throughout the genome rather than
    localized in a single complex

Plant Genome Projects
  • Plants share with animals many of the gene
    families - Intercellular communication,
    transcriptional regulation, signal transduction
  • A. thaliana lacks homologs of the Ras G-protein
    family and tyrosine kinase receptors, Rel,
    forkhead, nuclear steroid receptor transcription
  • TAIR: The Arabidopsis Information Resource
  • UK CropNet http//

Grasses and Legumes
  • Genome projects for >50 different plant species
    are under way
  • The most important major feed crops: the
    grasses maize, rice, wheat, sorghum, barley;
    the forage legumes soybean, alfalfa;
    forage rye grasses, fescues → several
    genomes are very large → whole genome sequencing
    is impractical
  • Both rice (Oryza sativa) and maize (Zea mays)
    have relatively small genomes
  • Two major rice cultivars, japonica
    rice and indica rice
  • MaizeGDB http//
  • waxy rice

Figure 1.17 Rice-Arabidopsis synteny
Rice-Arabidopsis synteny
  • Comparison of genome sequences of rice and
    Arabidopsis → extensive complex patterns of
  • 20 of 54 genes in a 340-kb-long region of the rice
    genome (top) retain the same order in five
    different 80- to 200-kb regions of the
    Arabidopsis genome (below).
  • Conserved genes (red and green boxes) are found
    on both rice and Arabidopsis strands, but are
    interspersed by a variable number of different
    genes (yellow boxes) in Arabidopsis. Shaded boxes
    above the rice chromosome indicate that the
    conserved gene is in the opposite relative
    orientation on the Arabidopsis chromosomes.

Grasses and Legumes
  • Economically important traits include resistance
    to a broad range of pathogens; flowering time,
    seed set, grain morphology, and related yield
    traits; tolerance to drought, salt, heavy metals
    and other extreme environmental circumstances;
    and measures of feed quality such as protein and
    sugar content.
  • Improved through genetic engineering and
    specialized plant breeding techniques
  • Genome projects → reveal much information
    regarding the evolution of domesticated species

Grasses and Legumes
  • Teosinte versus Maize
  • Modern maize is a derivative of the wild
    progenitor teosinte, which had multiple tillers.
  • Throughout the coding region of tb1, the level of
    polymorphism is substantially the same in a
    sample of maize and teosinte. However, in the 5'
    UTR, there is a dramatic reduction in the level
    of polymorphism in maize relative to that seen in
    teosinte.
Figure 1.18 Teosinte branched 1 and the
evolution of maize
Other Flowering Plants
  • >90 angiosperm genome projects are listed on the
    US Department of Agriculture web site
  • African, Australian, European, US projects
  • Genetic maps and search for a common set of plant
  • For some species, large EST seq. projects are
    also in place → enable comparative genomic
  • Arabidopsis + grasses + several model organisms →
    shed light on plant evolution

Figure 1.19 Forest genomics
Other Flowering Plants
  • Forest trees: potential for economic impact
  • High-density genetic maps of spruce, loblolly
    and several other pines, a few species of
    Eucalyptus
  • Traits: wood quality, growth and flowering
    parameters
  • Dendrome web site http//
  • Comparative analyses and transcription profiling
    of genes involved in wood properties, including
    lignins and enzymes that regulate cell wall
    biosynthesis
  • Crop plants: potato, tomato, tobacco, beans,
    cotton
  • Analyzing the genome diversity → affects
    productivity, yield and quality improvements
  • No plant equivalent of the HGP's ELSI initiative
    has been established.
Microbial Genome Projects
  • The minimal genome
  • 1995: the 1st complete genome, H. influenzae →
    M. genitalium → 3 other bacteria
  • 1997: E. coli
  • Seq. information: genome structures (GC content,
    transposable elements, recombination), genome
    content (total number of genes, conserved gene
  • Gene annotation for prokaryotes is more
    straightforward: ORFs tend to be uninterrupted
    and genes tend to be closely spaced; however, the
    assignment of genes to operons is not trivial
  • About 3/4 of the genes in a microbial genome can
    be assigned a function based on their similarity
    to genes in other organisms or by identifying
    protein domains
  • TIGR http//
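
Because prokaryotic ORFs tend to be uninterrupted, a first-pass gene scan can simply look for a start codon followed in frame by a stop codon. The sketch below is a deliberately simplified illustration, not a production gene finder: it scans only the forward strand, accepts only ATG as a start, and `find_orfs` with its `min_codons` threshold is a hypothetical helper. The test sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) pairs for ATG...stop ORFs, forward strand only.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


# Invented toy sequence with one short ORF: ATG AAA TTT TAG
TOY_DNA = "CCATGAAATTTTAGGG"

if __name__ == "__main__":
    for begin, end in find_orfs(TOY_DNA):
        print(begin, end, TOY_DNA[begin:end])
```

Grouping the resulting genes into operons is the harder step noted above and is not attempted here.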

Microbial Gene contents
  • M. genitalium: 0.6 Mb, 471 genes
  • H. influenzae: 1.8 Mb, 1750 genes
  • E. coli K12: 4.6 Mb, 4288 genes → average
    gene length 1.1 kb
  • Gene duplication and divergence in large genomes,
  • gene loss in small genomes
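
The figures above allow a quick check of gene density. Dividing genome size by gene count gives the average amount of DNA per gene, which is an upper bound on mean gene length because it ignores intergenic sequence; for E. coli it comes out close to the 1.1 kb quoted. The function name `kb_per_gene` is illustrative, and the numbers are the ones listed on the slide.

```python
# Genome size in Mb and gene count, as listed on the slide.
GENOMES = {
    "M. genitalium": (0.6, 471),
    "H. influenzae": (1.8, 1750),
    "E. coli K12": (4.6, 4288),
}


def kb_per_gene(size_mb, n_genes):
    """Average kb of genome per gene (genome size / gene count)."""
    return size_mb * 1000 / n_genes


if __name__ == "__main__":
    for name, (size_mb, n_genes) in GENOMES.items():
        print(name, round(kb_per_gene(size_mb, n_genes), 2), "kb per gene")
```

All three genomes land near 1 kb per gene, which is consistent with closely spaced genes regardless of genome size.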

Exercise 1.4 Compare two microbial genomes using
the CMR
The minimal genome
  • the minimum complement of genes that are
    necessary and sufficient to maintain a living
  • To define genetically: what is life?
  • Two general strategies
  • Bioinformatics strategy: identify which genes
    are present in each and every sequenced genome
  • Some functions can be performed by
    non-orthologous genes
  • Conserved orthologs + a small number of
    alternatives: 256 genes

The minimal genome
  • Experimental strategy: systematically knock out
    the function of individual genes; mutations that
    cannot be recovered define genes that are likely
    to be components of the minimal genome
  • M. genitalium: mutations recovered in 120 of the
    470 genes
  • B. subtilis (4100 genes): 271 genes are
    indispensable under favorable growth
    conditions; they cover metabolism, cell division
    and shape, and synthesis of the cellular envelope
  • Synthetic lethal: the nonviability
    of a combination of two or more
    individually viable mutations
  • Infer that life can be supported by a genome of
    between 250 and 350 genes
  • Build a viable organism from scratch by stitching
    together artificially synthesized genes,
    e.g. build a poliovirus
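
The definition of synthetic lethality above translates directly into a lookup over knockout results: a pair qualifies when each single knockout is viable but the double knockout is not. The sketch assumes viability has already been scored as True/False; the gene names, the data and the helper `synthetic_lethal_pairs` are invented for illustration.

```python
from itertools import combinations


def synthetic_lethal_pairs(viable_single, viable_double):
    """Return gene pairs that are viable singly but nonviable together.

    viable_single: dict gene -> bool (is the single knockout viable?)
    viable_double: dict frozenset({gene1, gene2}) -> bool
    Pairs with no double-knockout measurement are skipped.
    """
    pairs = []
    for a, b in combinations(sorted(viable_single), 2):
        both_viable_alone = viable_single[a] and viable_single[b]
        double_is_lethal = viable_double.get(frozenset((a, b))) is False
        if both_viable_alone and double_is_lethal:
            pairs.append((a, b))
    return pairs


# Invented toy results: geneC is essential on its own, geneA + geneB are
# individually dispensable but lethal in combination.
SINGLE = {"geneA": True, "geneB": True, "geneC": False, "geneD": True}
DOUBLE = {
    frozenset(("geneA", "geneB")): False,
    frozenset(("geneA", "geneD")): True,
}

if __name__ == "__main__":
    print(synthetic_lethal_pairs(SINGLE, DOUBLE))
```

Such pairs explain why counting only individually essential genes underestimates the minimal gene set.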

Figure 1.20A Describing the minimal genome
The minimal genome
Deeper color → presence of a gene. Pale color →
the gene is absent in that species. Genes a, d, f
are present in all species, so are inferred to be
necessary for life.
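
The comparison in the figure amounts to intersecting the gene sets of all species. The sketch below uses an invented presence table arranged so that genes a, d and f are shared, mirroring the figure; the species labels and the helper `core_genes` are hypothetical, and the sketch ignores the complication of non-orthologous genes performing the same function.

```python
def core_genes(presence):
    """Return the sorted genes present in every species.

    presence: dict species -> iterable of gene names found in that genome.
    """
    gene_sets = [set(genes) for genes in presence.values()]
    return sorted(set.intersection(*gene_sets))


# Invented toy presence table.
PRESENCE = {
    "species 1": ["a", "b", "d", "f"],
    "species 2": ["a", "c", "d", "e", "f"],
    "species 3": ["a", "d", "f", "g"],
}

if __name__ == "__main__":
    print(core_genes(PRESENCE))
```

Adding more genomes can only shrink this set, so the estimate depends on which species are included.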
Figure 1.20B Describing the minimal genome
The minimal genome
  • Mutagenesis experiments
  • Establish which genes are essential by
    systematically knocking out each functional gene
    and seeing whether the organism can survive
    without it
  • The overlap of these two approaches may define
    the minimal genome.

Figure 1.21 TIGR representation of a typical microbial
Sequenced Microbial Genomes
  • TIGR Comprehensive Microbial Resource (CMR)
  • New site http//
  • 39 genomes were generated by TIGR, and the rest
    by Brazil, Japan → Omniome DB
  • Streptococcus pneumoniae TIGR4
  • The outer and inner circles represent genes
    encoded on the two strands of the chromosomes
  • Genes from HMM: blue
  • BLAST: yellow, Omniome: pink
  • Click align genome: MUMmer
  • Click Analyses for more tools, such as

Box 1.2 (Part 1) GenBank Files
Environmental Sequencing
  • Sequencing DNA extracted from an environment such
    as ocean, soil, or intestinal flora
  • The main reason is that the vast majority of
    bacteria cannot be cultured in vitro → our
    knowledge of microflora is both limited by and
    biased by sampling
  • Pilot projects: identifying novel genes has the
    potential to change oceanographers' understanding
    of the mechanisms of photosynthesis and global
    carbon and nitrogen cycling
  • Proteorhodopsin genes: suggest that light
    harvesting need not be coupled to chlorophyll in
  • C. Venter identified >1M new genes and almost
    150 new types of bacteria
  • Fecal material: the human gut contains >500
    different species of bacteria, <30 can be
    cultured outside the body

  • Completed in 1997
  • MIPS http//
  • SGD http//

Parasite Genomics
  • World Health Organization (WHO)
  • 10 tropical diseases that affect billions of
    people worldwide
  • Eradicating the pathogenic agents
  • Crop damage caused by parasitic plant nematodes
    costs billions of dollars

Parasite Genomics
  • Aims
  • Identify species-specific genes
  • Understanding the developmental genetics
  • Polymorphism surveys that address the population
    biology of the parasites
  • Mapping the genomics of the mosquito

  • 100 genomes, 10 days and 10 million dollars
  • 2006 News, http//

2009/07/09 http//
  • The End