CS177 Lecture 8: Bioinformatics Databases (and genetic diseases)

1
CS177 Lecture 8: Bioinformatics Databases (and
genetic diseases)
  • Tom Madej 11.01.04

2
Lecture overview
  • Very brief overview of on-line databases.
  • Formulating queries in Entrez.
  • Example: Molecular biology of diseases.

3
Bioinformatics Resources
  • Reference: Chapter 3 in Sequence - Evolution -
    Function, E.V. Koonin and M.Y. Galperin, Kluwer
    Academic 2003.
  • Available on the NCBI Bookshelf.

4
Sequence Databases
  • GenBank, EMBL, DDBJ: archival (International
    Nucleotide Sequence Database Collaboration);
    sequences have a common accession
  • SWISS-PROT: curated, non-redundant, entries
    hyperlinked e.g. to PubMed; TrEMBL: entries not
    yet ready for SWISS-PROT
  • Motifs: PROSITE, BLOCKS, PRINTS
  • Domains: Pfam, SMART, ProDom, COGs (NCBI)
  • Motifs/domains: InterPro, CDD (NCBI)

5
More databases
  • Structure: PDB/RCSB, MMDB (NCBI), SCOP, CATH,
    FSSP
  • Organism-specific: e.g. E. coli, B. subtilis,
    Synechocystis sp. (bacteria); yeast (unicellular
    eukaryote); Arabidopsis, C. elegans (WormBase),
    fruit fly, human
  • COGs: clusters of orthologous groups; KEGG:
    biochemical pathways; BIND: protein-protein
    interactions; ENZYME, LIGAND: enzymes and their
    substrates
  • PubChem (NCBI): chemical substances

6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
PubChem (new)
10
(No Transcript)
11
The (ever expanding) Entrez System
[Diagram of Entrez database nodes. Labels: NLM Catalog, PubChem, Compounds, BioAssays, Substances, Literature, Organism, Expression, HomoloGene, Gene]
12
Links Between and Within Nodes
[Diagram of links between and within Entrez nodes. Labels: Word weight, 3-D Structures, VAST, Phylogeny, Protein sequences, BLAST; each link is marked "Computational"]
13
PubMed: Computation of Related Articles
  • The neighbors of a document are those documents
    in the database that are the most similar to it.
    The similarity between documents is measured by
    the words they have in common, with some
    adjustment for document lengths.
  • The value of a term depends on two types of
    information, global and local:
  • G - the number of different documents in the
    database that contain the term
  • L - the number of times the term occurs in a
    particular document

14
Global and local weights
  • The global weight of a term is greater for the
    less frequent terms. The presence of a term that
    occurred in most of the documents would really
    tell one very little about a document.
  • The local weight of a term is the measure of its
    importance in a particular document. Generally,
    the more frequent a term is within a document,
    the more important it is in representing the
    content of that document.

15
How we define similar documents
  • The similarity between two documents is computed
    by adding up the weights (local wt1 × local wt2 ×
    global wt) of all of the terms the two documents
    have in common. All results are ranked and the
    most similar documents become Related Articles.
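
The weighting described on the last three slides can be sketched in a few lines of Python. This is a toy illustration, not PubMed's actual formula: it assumes the local weight is simply the raw count L, the global weight is log(N/G) for N documents, and it ignores the document-length adjustment. All function names are made up for this example.

```python
import math
from collections import Counter

def global_weights(docs):
    """Assumed global weight: log(N / G), which is larger for rarer terms."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / g) for term, g in doc_freq.items()}

def similarity(doc1, doc2, gw):
    """Sum of local wt1 * local wt2 * global wt over the shared terms."""
    c1, c2 = Counter(doc1), Counter(doc2)
    return sum(c1[t] * c2[t] * gw[t] for t in c1.keys() & c2.keys())

def related(index, docs, gw):
    """Rank every other document by similarity to docs[index]."""
    scores = [(similarity(docs[index], d, gw), i)
              for i, d in enumerate(docs) if i != index]
    return [i for _, i in sorted(scores, reverse=True)]

docs = [
    "hemoglobin mutation sickle cell anemia".split(),
    "sickle cell anemia malaria resistance".split(),
    "p53 tumor suppressor mutation".split(),
]
gw = global_weights(docs)
print(related(0, docs, gw))
```

Document 1 shares three terms with document 0 while document 2 shares only one, so document 1 ranks first among the neighbors of document 0.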

16
Entrez database queries
  • The databases are indexed by different sets of
    terms.
  • You can get to a particular DB by selecting it
    and then entering a null query.
  • The Preview/Index tab displays the index terms
    and can be used to formulate a query (if you
    can't remember the syntax for the index).
  • Limits can be used e.g. to select publications
    in a specified time range.
  • Details shows the interpretation of the query.

17
(No Transcript)
18
Exercises!
  • How many protein structures are there that
    include DNA and are from bacteria?
  • In PubMed, how many articles are there from the
    journal Science and have Alzheimer in the title
    or abstract, and amyloid beta anywhere? How
    many since the year 2000?
  • Notice that the results are not 100% accurate!
  • In 3D Domains, how many domains are there with no
    more than two helices and 8 to 10 strands and are
    from the mouse?
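
Queries like the ones in these exercises follow the Entrez pattern value[field], combined with AND, with low:high[field] for ranges. The sketch below only builds query strings and a count-only E-utilities ESearch URL; it does not contact NCBI. The helper names are invented for this example, the PubMed field tags used are Journal, Title/Abstract, All Fields and Date - Publication, and the helixcount and strandcount field names are taken from the 3D Domain indexing slide later in this lecture, so treat them as assumptions about the current index.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def term(value, field):
    """One fielded Entrez term; multi-word values are quoted."""
    if " " in value:
        value = f'"{value}"'
    return f"{value}[{field}]"

def range_term(low, high, field):
    """Entrez range syntax, e.g. 8:10[strandcount]."""
    return f"{low}:{high}[{field}]"

def all_of(*terms):
    """Require every term (Boolean AND)."""
    return " AND ".join(terms)

def esearch_url(db, query):
    """Build an ESearch URL that asks only for the hit count (not fetched here)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": query, "rettype": "count"})

pubmed_query = all_of(
    term("Science", "Journal"),
    term("Alzheimer", "Title/Abstract"),
    term("amyloid beta", "All Fields"),
)
since_2000 = all_of(pubmed_query, range_term(2000, 3000, "Date - Publication"))
domains_query = all_of(
    range_term(0, 2, "helixcount"),
    range_term(8, 10, "strandcount"),
    term("mouse", "organism"),
)

print(pubmed_query)
print(since_2000)
print(domains_query)
print(esearch_url("pubmed", pubmed_query))
```

Comparing the generated string with what the Details tab shows is a quick way to see how Entrez actually interpreted a query.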

19
Investigating genetic diseases
  • Now we will see examples of how bioinformatics
    databases can be used to investigate genetic
    diseases.

20
Gene variants that can affect protein function
  • Mutation to a stop codon truncates the protein
    product!
  • Insertion/deletion of multiple bases changes the
    sequence of amino acid residues.
  • Single point change could alter folding
    properties of the protein.
  • Single point change could affect the active site
    of the protein.
  • Single point change could affect an interaction
    site with another molecule.

21
Lodish et al. Molecular Cell Biology, W.H.
Freeman 2000
22
Sickle cell anemia
  • The first molecular disease, i.e. the first
    genetic disease with a known molecular basis.
  • The most common variant is caused by a Glu6Val
    mutation in the hemoglobin β-chain (HbS).
    However, there are 100s of other mutations that
    can cause this (OMIM lists 524 variants!).
  • This mutation causes the hemoglobin to
    polymerize; in turn, the red blood cells form
    sickle shapes and clump together under low oxygen
    conditions or high hemoglobin concentrations.
  • Confers some resistance to malaria, by inhibiting
    parasite growth.

23
NHLBI web site
24
Exercise!
  • Find an appropriate Hemoglobin structure and view
    it in Cn3D.
  • Check the position of the Glu6Val mutation.

25
P53 tumor suppressor protein
  • Li-Fraumeni syndrome: only one functional copy of
    p53 predisposes to cancer.
  • Mutations in p53 are found in most tumor types.
  • p53 binds to DNA and stimulates another gene to
    produce p21, which binds to another protein, cdk2.
    This prevents the cell from progressing through
    the cell cycle.

26
G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21:
217-228.
27
Exercise!
  • Use Cn3D to investigate the binding of p53 to
    DNA.
  • Formulate a query for Structure that will require
    the DNA molecules to be present (there are 2
    structures like this).

28
Important note!
  • Most diseases (e.g. cancer) are complex and
    involve multiple factors (not just a single
    malfunctioning protein!).

29
Investigating a genetic disease
  • The following EST comes from a hemochromatosis
    patient. Your task is to identify the gene and
    the specific mutation causing the illness, and
    why the protein is not functioning properly.
  • The sequence:
    TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
    TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
    ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
    GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
    TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
    GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
    TGGATCAGCCCCTCATTGTGATCTGGG

30
ESTs
  • Expressed Sequence Tags: useful for discovering
    genes, obtaining data on gene
    expression/regulation, and in genome mapping.
  • Short nucleotide sequences (200-500 bases or so)
    derived from mRNA expressed in cells.
  • The introns from the genes will already be
    spliced out.
  • mRNA is unstable, however, and so it is reverse
    transcribed into cDNA.
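
As a small illustration of the last point, the strand produced by reverse transcription is the reverse complement of the mRNA, written in DNA letters. The sketch below assumes an idealized, error-free copy of a made-up nine-base mRNA; the function names are chosen for this example only.

```python
# Base pairing used when copying RNA into DNA.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """First-strand cDNA, written 5'->3': the reverse complement of the mRNA."""
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))

def sense_cdna(mrna):
    """DNA strand with the same sense as the mRNA: U replaced by T."""
    return mrna.upper().replace("U", "T")

mrna = "AUGGCCUGA"  # made-up example sequence
print(first_strand_cdna(mrna))
print(sense_cdna(mrna))
```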

31
Hemochromatosis 2
  • BLAST the EST vs. the Human genome (could take a
    few minutes).
    - Which chromosome is hit?
    - What is the contig that is hit (reference
      assembly)?
    - Is the EST identical to the genomic sequence?
    - Take note of the coords of the difference.
  • Click on Genome View.
  • Select the map element at the bottom
    corresponding to the contig.

32
Hemochromatosis 3
  • What gene is hit? Zoom in on the BLAST hit a few
    times.
  • Display the entire gene sequence via "dl" and
    "Display".
  • Copy and save the genomic sequence.
  • Record the coords for the start of the genomic
    sequence.

33
Hemochromatosis 4
  • Click on a UniGene link: Hs.233325.
  • Note: "Expression profile" presents data for the
    expression level of the gene in various tissues.
  • How many mRNAs and ESTs are there for the HFE
    gene?
  • Take note of the mRNA accession: NM_000410.

34
Hemochromatosis 5
  • Go to Spidey: http://www.ncbi.nlm.nih.gov/spidey/
  • To determine the intron/exon structure, paste the
    HFE gene sequence into the upper box, and enter
    the HFE mRNA accession NM_000410 in the lower
    box.
  • Click Align.

35
Hemochromatosis 6
  • How many exons are there?
  • Which exon codes the residue that is changed in
    the original EST? (You have to do a little
    arithmetic!)
  • Record some of the protein sequence around the
    changed residue: EQRYTCQVEHPG
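
The "little arithmetic" amounts to converting the contig coordinate of the mismatch into an offset within the gene sequence you saved, then checking which exon interval contains it. The sketch below shows that bookkeeping. Every number in it is made up for illustration, the helper names are hypothetical, and it assumes the gene lies on the forward strand of the contig; substitute the coordinates you recorded and the exon table from the alignment.

```python
def to_gene_offset(genomic_pos, gene_start):
    """Convert a 1-based contig coordinate to a 1-based offset in the saved gene sequence."""
    return genomic_pos - gene_start + 1

def find_exon(offset, exons):
    """Return the 1-based number of the exon containing offset, or None if it is intronic."""
    for number, (start, end) in enumerate(exons, start=1):
        if start <= offset <= end:
            return number
    return None

# Hypothetical values, NOT the real HFE coordinates.
gene_start = 1_000_001      # contig coordinate of the first base of the saved sequence
mismatch_pos = 1_000_450    # contig coordinate of the EST/genome difference
exons = [(1, 120), (400, 520), (900, 1100)]  # exon intervals in gene-sequence offsets

offset = to_gene_offset(mismatch_pos, gene_start)
print(offset, find_exon(offset, exons))
```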

36
Hemochromatosis 7
  • From the Map Viewer page click on the HFE gene
    link.
  • How many HFE transcripts are there? Which is the
    longest isoform?
  • Follow "Links" to "Protein" and then to the
    report for NP_000401.
  • Determine the residue number that corresponds to
    the mutation.

37
(No Transcript)
38
RNA splicing and isoforms
39
Hemochromatosis 8
  • What effect does the mutation in the original EST
    have on the protein? (Look at the table for the
    Genetic Code.)
  • Go back to the Gene Report; read the summary and
    take note of the GeneRIF bibliography.
  • Now go to "Links" and then to "GeneView in dbSNP"
    to get a list of known SNPs.
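
The genetic-code lookup can also be done programmatically. The sketch below translates an in-frame stretch of the EST from the earlier slide with the standard genetic code and compares it with the normal protein sequence recorded earlier (EQRYTCQVEHPG). The reading frame was chosen by hand so that the translation lines up with that peptide; the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate DNA in frame 0, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def differences(reference, observed):
    """List (index, reference residue, observed residue) for each mismatch."""
    return [(i, r, o) for i, (r, o) in enumerate(zip(reference, observed)) if r != o]

# In-frame stretch copied from the EST (second-to-last line of the sequence).
est_fragment = "GAGCAGAGATATACGTACCAGGTGGAGCACCCAGGC"
reference_peptide = "EQRYTCQVEHPG"

est_peptide = translate(est_fragment)
print(est_peptide)
print(differences(reference_peptide, est_peptide))
```

The single mismatch is a cysteine in the reference replaced by a tyrosine in the EST (codon TAC where TGC would encode cysteine), a one-base G to A change.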

40
Hemochromatosis 9
  • In the SNP list note that the one you want is
    currently shown.
  • Select "view rs in gene region" and then click on
    "view rs".
  • How many nonsynonymous substitutions do you see?
  • Do you see the one we are particularly interested
    in?

41
Digression: SNPs
  • Single Nucleotide Polymorphisms.
  • A single base change that can occur in a person's
    DNA.
  • On average SNPs occur about 1% of the time; most
    are outside of protein coding regions.
  • Some SNPs may cause a disease; some may be
    associated with a disease; others may affect
    disposition to a disease; others may be simple
    genetic variation.
  • dbSNP archives SNPs and other variations such as
    small-scale deletion/insertion polymorphisms
    (DIPs), etc.

42
(No Transcript)
43
Hemochromatosis 10
  • Back to the Gene Report, click on Links and go
    to OMIM (can also get there via the Map
    Viewer).
  • In the OMIM entry you can read a bit; also click
    on "View List" for Allelic Variants, where you
    can see the mutation again.

44
Hemochromatosis 11
  • From the Gene Report again follow Links to
    Protein and scroll down to NP_000401.
  • Click on Domains and then Show Details.
  • What is the Conserved Domain in the region of
    interest?
  • Follow the link to the CD.
  • Click on View 3D Structure.

45
Hemochromatosis 12
  • Look for residue position 282 in the query
    sequence.
  • Highlight that column.
  • Is the Cys282 conserved in the family?
  • The C282Y mutation therefore likely has the
    effect of

46
Aligning a sequence on a structure with Cn3D
(example)
  • Example: Use structure 1n3eA, align sequence for
    1m5xA.
  • In Sequence/Alignment Viewer window select the
    menu item Imports/Show Imports.
  • In the Import Viewer window select the menu item
    Edit/Import Sequences.
  • In the Select Chain dialogue box select 1N3E A
    and click OK.
  • In the Select Import Source dialogue box select
    Network via GI/Accession and click OK.
  • In the Import Identifier dialogue box enter the
    accession 31615545 and click OK. The new
    sequence will appear.
  • Select Algorithms/BLAST single and use the
    cursor to click anywhere on the 1m5xA sequence to
    align it using BLAST.

47
Aligning a sequence on a structure with Cn3D
(example cont.)
  • Select the menu item Alignments/Merge All to
    make the new alignment appear in the
    Sequence/Alignment Viewer window.
  • The alignment should now appear in the
    Sequence/Alignment Viewer window, aligned
    residues will be red.
  • Close the Import Viewer window, pick another
    color style for the alignment, if desired (e.g.
    identity).
  • You can do this with multiple sequences
    especially useful if there is no CD for the
    structure.

48
PDB
49
PDB File Header
HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1
COMPND   2 MOLECULE: DNA TOPOISOMERASE I
COMPND   3 CHAIN: A
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765
COMPND   5 EC: 5.99.1.2
COMPND   6 ENGINEERED: YES
COMPND   7 MUTATION: YES
COMPND   8 MOL_ID: 2
COMPND   9 MOLECULE: DNA (5'-
COMPND  10 D(CAPAPAPAPAPGPAPCPTPCPAPGPAPAPAPAPAPTP
COMPND  11 TPTPTPT)-3')
COMPND  12 CHAIN: C
COMPND  13 ENGINEERED: YES
COMPND  14 MOL_ID: 3
COMPND  15 MOLECULE: DNA (5'-
COMPND  16 D(CAPAPAPAPAPTPTPTPTPTPCPTPGPAPGPTPCPTP
COMPND  17 TPTPTPT)-3')
COMPND  18 CHAIN: D
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS
SOURCE   5 MOL_ID: 2
SOURCE   6 SYNTHETIC: YES
SOURCE   7 MOL_ID: 3
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM: X-PLOR 3.1
REMARK   3   AUTHORS: BRUNGER
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
...
50
PDB File Data
ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81           N
ATOM      2  CA  TRP A 203      30.797  -4.667  36.431  1.00 49.96           C
ATOM      3  C   TRP A 203      30.369  -3.337  35.766  1.00 49.18           C
ATOM      4  O   TRP A 203      29.315  -3.238  35.147  1.00 49.27           O
ATOM      5  CB  TRP A 203      30.518  -5.863  35.513  1.00 46.77           C
ATOM      6  CG  TRP A 203      30.847  -5.651  34.081  1.00 44.60           C
ATOM      7  CD1 TRP A 203      32.028  -5.234  33.553  1.00 49.72           C
ATOM      8  CD2 TRP A 203      29.980  -5.876  32.984  1.00 43.73           C
ATOM      9  NE1 TRP A 203      31.956  -5.191  32.177  1.00 45.45           N
ATOM     10  CE2 TRP A 203      30.704  -5.582  31.805  1.00 45.23           C
ATOM     11  CE3 TRP A 203      28.657  -6.305  32.877  1.00 46.48           C
ATOM     12  CZ2 TRP A 203      30.149  -5.705  30.539  1.00 46.06           C
ATOM     13  CZ3 TRP A 203      28.101  -6.431  31.622  1.00 43.08           C
ATOM     14  CH2 TRP A 203      28.849  -6.131  30.463  1.00 45.77           C

Annotated record:
ATOM      1  N   TRP A 203      30.156  -4.908  37.767  1.00 50.81
Fields, left to right: Name (ATOM), Atom Number (1), Atom Name (N), Residue Name (TRP), Chain ID (A), Residue Number (203), X (30.156), Y (-4.908), Z (37.767), Occupancy (1.00), Temperature Factor (50.81)
Issues: Justification, Nomenclature
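
Because ATOM records are fixed-width, the fields labeled above are read by column position rather than by splitting on spaces, which is why justification matters. The sketch below follows the standard PDB column layout (serial in columns 7-11, atom name 13-16, residue name 18-20, chain 22, residue number 23-26, x/y/z in 31-54, occupancy 55-60, temperature factor 61-66, element 77-78). The two helper functions are written for this example and are not part of any NCBI or PDB toolkit.

```python
def format_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, temp, element):
    """Build one fixed-width ATOM record (alternate location and insertion code left blank)."""
    return (
        "ATOM".ljust(6)
        + f"{serial:>5}" + " "
        + f"{name:<4}" + " "
        + f"{res_name:>3}" + " "
        + chain
        + f"{res_seq:>4}" + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}"
        + f"{occ:6.2f}{temp:6.2f}" + " " * 10
        + f"{element:>2}"
    )

def parse_atom_line(line):
    """Slice an ATOM record by column; Python slices are 0-based and end-exclusive."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain_id": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }

# First atom of 1EJ9 chain A as shown on the slide; " N" keeps the name in columns 14-16.
line = format_atom_line(1, " N", "TRP", "A", 203, 30.156, -4.908, 37.767, 1.00, 50.81, "N")
atom = parse_atom_line(line)
print(line)
print(atom["atom_name"], atom["res_name"], atom["chain_id"], atom["res_seq"],
      atom["x"], atom["y"], atom["z"])
```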
51
From Coordinates to Models
1EJ9 Human topoisomerase I
52
Building the Structure Summary
Taxonomy
Pubmed
Protein
3D Domains
Domains
Nucleotide
53
Indexing into MMDB
Structure
  • Import only experimentally determined structures
  • Convert to ASN.1
  • Verify sequences
  • Create backbone model (Cα, P only)
  • Create single-conformer model

Add chemical bonds:
inter-residue-bonds atom-id-1 molecule-id 1 , residue-id 1 , atom-id 1 ,
atom-id-2 molecule-id 1 , residue-id 2 , atom-id 9 ,
Add secondary structure:
id 1 , name "helix 1" , type helix ,
location subgraph residues interval molecule-id 1 , from 49 , to 61 ,
54
Structure Indexing
topoisomerase AND 2[dnachaincount] AND
human[organism]
  • Entrez
  • MMDB-ID
  • MMDB entry date
  • EC number
  • Organism
  • Ligands
  • PDB code
  • PDB name
  • PDB description
  • Experimental
  • Method
  • Resolution
  • Literature
  • Article title
  • Author
  • Journal
  • Publication date
  • Counters
  • Ligand types
  • Modified amino acids
  • Modified nucleotides
  • Modified ribonucleotides
  • Protein chains
  • DNA chains
  • RNA chains
  • PDB
  • Accession
  • Release date
  • Class
  • Source
  • Description
  • Comment

55
Creating Sequence Records
One record per chain
Protein
Nucleotide
Nucleotide
1EJ9C
1EJ9D
1EJ9A
56
Annotating Secondary Structure
1EJ9 Human topoisomerase I
α-Helices, β-strands, coils/loops
57
Creating 3D Domains
3D Domain 0 1EJ9A0 entire polypeptide
58
Creating 3D Domains
1EJ9A1
1EJ9A4
1EJ9A3
1EJ9A5
1EJ9A2
< 3 Secondary Structure Elements
59
3D Domain Indexing
  • Entrez
  • SDI
  • MMDB-ID
  • Accession
  • MMDB entry date
  • Organism
  • Domain number
  • Cumulative number
  • Literature
  • Article title
  • Author
  • Publication date
  • Counters
  • Modified amino acids
  • α-Helices
  • β-Strands
  • Residues
  • Molecular weight

Find all viral four helix bundles
  • PDB
  • Accession
  • Release date
  • Class
  • Source
  • Description
  • Comment

4[helixcount] AND 0[strandcount] AND 0[domainno]
AND viruses[organism]
REMEMBER: 3D Domain 0 is the entire polypeptide
chain!