Bioinformatics 101 BimCore www.bimcore.emory.edu - PowerPoint PPT Presentation

Slides: 50
Provided by: jesses6
Transcript and Presenter's Notes

1
Bioinformatics 101: BimCore www.bimcore.emory.edu
  • Bioinformatics is the computational management,
    analysis and dissemination of biological
    information.
  • 3 main types of data: genome sequences, protein
    sequences and structure, and functional genomics
    (expression data).
  • Genomics: mapping, sequencing and analysis of
    genomes (sequences, structural and functional).
  • Proteomics: qualitative and quantitative
    comparison of a proteome (complete set of
    proteins of an organism) under different
    conditions to unravel biological processes.

2
Information available
  • Databases
  • Genome sequences: FlyBase, GenBank, EMBL,
    SwissProt, PDB, ESTs, BACs, etc.
  • Experimental information
  • Published information
  • PubMed, MedLine

3
How to locate relevant information
  • Text searches
  • Query matches
  • Sort by relevance
  • >> Sequence-to-sequence comparisons

4
Definitions
  • Homolog: sequences that share a common
    ancestor; may have similar function.
  • Paralogue: homologous sequences within a species
    that arose by gene duplication; may have similar
    function.
  • Orthologue: homologous sequences in different
    species, separated by a speciation event;
    probably the same function.
  • Analog: non-homologous proteins that have similar
    folding or similar functional sites, which arose
    through convergent evolution.

5
Evolution and Alignment
  • Similarity is an observable quantity (%
    identity).
  • Homology is a conclusion drawn from the data:
    that two genes share a common evolutionary
    history.
  • Alignments reflect amino acid substitutions, and
    insertions and deletions.
  • Certain regions are more conserved than others.
  • In biomolecular sequences (DNA, RNA, AA seqs),
    high sequence similarity usually implies
    significant structural and / or functional
    similarity.
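
As a small illustration of similarity as an observable quantity, percent identity can be read directly off a gapped alignment. This is a minimal sketch: the function name `percent_identity` is hypothetical, and dividing by the full alignment length (gap columns included) is one convention among several, assumed here.

```python
def percent_identity(top: str, bottom: str) -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: gap columns ('-') count toward the alignment length.
    """
    if len(top) != len(bottom) or not top:
        raise ValueError("aligned strings must be non-empty and of equal length")
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    return 100.0 * identical / len(top)


# 7 identical columns out of 9
print(round(percent_identity("ACGT-ACGT", "ACGTTAC-T"), 1))  # 77.8
```

The number says nothing about homology by itself; it is only the observation from which homology may or may not be inferred.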

6
Sequence alignment based on Edit Distance
  • Transforming one string into another by a series
    of edit operations on the individual characters.
  • The operations allowed are insertion of a
    character in the first string, deletion of a
    character from the first string, or the
    substitution of a character in the first string
    with a character in the second string.
  • To limit the transforming operations to one
    string, an insertion in one string can be
    considered a deletion in the other string and
    vice versa.

7
Edit Distance Example
  • v-intner-             vintner
    RIMDMDMMI             RRRMRRR
    wri-t-ers             writers
  • Edit Distance = 5     Edit Distance = 6
  • (R)eplacements = 1, (I)nsertions/(D)eletions = 1,
    (M)atches = 0
  • Edit Distance (sometimes referred to as
    Levenshtein distance, Sov. Phys. Dokl. 10:707-10,
    1966) is the minimum number of edit operations to
    transform one string into the other.
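
The minimum over all possible edit scripts is usually found by dynamic programming. The following is a minimal sketch under the slide's unit costs (replacement = 1, insertion/deletion = 1, match = 0); the function name `edit_distance` is illustrative, not something from the presentation.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and replacements turning a into b."""
    # prev[j] is the distance between the prefix of a handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1        # (M)atch = 0, (R)eplacement = 1
            curr.append(min(prev[j] + 1,       # (D)eletion from a
                            curr[j - 1] + 1,   # (I)nsertion into a
                            prev[j - 1] + cost))
        prev = curr
    return prev[-1]


print(edit_distance("vintner", "writers"))  # 5
```

The result agrees with the gapped alignment on the slide, and shows that the ungapped alignment (6 operations) is not minimal.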

8
Similarity Example
  • v-intner-             vintner
    RIMDMDMMI             RRRMRRR
    wri-t-ers             writers
  • Similarity = 4        Similarity = 1
  • (R)eplacements = 0, (I)nsertions/(D)eletions = 0,
    (M)atches = 1
  • Similarity is the value of the alignment that
    maximizes the total alignment score.
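
The two scoring views can be compared by scoring the same fixed alignments under each set of weights. This is a small sketch; `score_alignment` and its parameter names are hypothetical, and the weights are the ones quoted on the edit distance and similarity slides.

```python
def score_alignment(top: str, bottom: str, match: int, replace: int, indel: int) -> int:
    """Sum the per-column scores of two gapped strings ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += indel      # (I)nsertion or (D)eletion column
        elif x == y:
            total += match      # (M)atch column
        else:
            total += replace    # (R)eplacement column
    return total


gapped = ("v-intner-", "wri-t-ers")
ungapped = ("vintner", "writers")

# Edit-distance weights: M = 0, R = 1, I/D = 1 (smaller is better)
print(score_alignment(*gapped, match=0, replace=1, indel=1))    # 5
print(score_alignment(*ungapped, match=0, replace=1, indel=1))  # 6

# Similarity weights: M = 1, R = 0, I/D = 0 (larger is better)
print(score_alignment(*gapped, match=1, replace=0, indel=0))    # 4
print(score_alignment(*ungapped, match=1, replace=0, indel=0))  # 1
```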

9
DNA Scoring Scheme
  • Default Scoring Matrix (Smith Waterman) for DNA
         A   B   C   D   G   H   K   M   N   R   S   T   U   V   W   X   Y
    A   10  -9  -9  10  -9  10  -9  10  10  10  -9  -9  -9  10  10  10  -9
    B   -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
    C   -9  10  10  -9  -9  10  -9  10  10  -9  10  -9  -9  10  -9  10  10
    D   10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10  10  10
    G   -9  10  -9  10  10  -9  10  -9  10  10  10  -9  -9  10  -9  10  -9
    H   10  10  10  10  -9  10  10  10  10  10  10  10  10  10  10  10  10
    K   -9  10  -9  10  10  10  10  -9  10  10  10  10  10  10  10  10  10
    M   10  10  10  10  -9  10  -9  10  10  10  10  -9  -9  10  10  10  10
    N   10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
    R   10  10  -9  10  10  10  10  10  10  10  10  -9  -9  10  10  10  -9
    S   -9  10  10  10  10  10  10  10  10  10  10  -9  -9  10  -9  10  10
    T   -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
    U   -9  10  -9  10  -9  10  10  -9  10  -9  -9  10  10  -9  10  10  10
    V   10  10  10  10  10  10  10  10  10  10  10  -9  -9  10  10  10  10
    W   10  10  -9  10  -9  10  10  10  10  10  -9  10  10  10  10  10  10
    X   10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
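
The entries shown in this matrix appear consistent with a simple rule: two symbols score 10 when the sets of bases they stand for overlap, and -9 otherwise. The sketch below reproduces that rule from the standard IUPAC nucleotide ambiguity codes. Treating X like N (any base) and U like T are assumptions made here to match the table, and `dna_score` is a hypothetical helper name.

```python
# Standard IUPAC nucleotide codes, each mapped to the bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "U": "T",                      # assumption: U is scored like T
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
    "X": "ACGT",                   # assumption: X is scored like N
}


def dna_score(x: str, y: str, match: int = 10, mismatch: int = -9) -> int:
    """Score two symbols: match if their base sets share at least one base."""
    return match if set(IUPAC[x]) & set(IUPAC[y]) else mismatch


columns = "ABCDGHKMNRSTUVWXY"
print("A", [dna_score("A", c) for c in columns])
print("W", [dna_score("W", c) for c in columns])
```

The two printed rows can be compared against rows A and W of the table above.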

10
Amino Acid Scoring Schemes
  • Identity scoring
  • Genetic code scoring
  • Chemical similarity
  • Observed substitutions

11
Amino acid substitution matrix
  • Align protein structures or sequences of known
    functionally similar proteins.
  • Look at the frequency of amino acid
    substitutions.
  • Compute a log-odds ratio matrix:
  • M(ai,aj) = log(observed freq(ai,aj) / expected
    freq(ai,aj))
  • + is more likely than random
  • - is less likely than random
  • 0 is at the random base rate
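
The log-odds formula can be tried on made-up numbers to see how the sign behaves. In this sketch the frequencies are hypothetical, the expected pair frequency is taken as the product of the two background frequencies (an assumption about how "expected" is defined), and base 2 is chosen only for readability.

```python
import math


def log_odds(observed_pair_freq: float, freq_i: float, freq_j: float,
             base: float = 2.0) -> float:
    """log(observed / expected), with expected = freq_i * freq_j (assumed)."""
    expected = freq_i * freq_j
    return math.log(observed_pair_freq / expected, base)


# Hypothetical background frequencies for two residues: 0.10 and 0.05,
# so the pair is expected by chance at a frequency of 0.005.
print(log_odds(0.010, 0.10, 0.05))   # positive: seen twice as often as chance
print(log_odds(0.005, 0.10, 0.05))   # close to zero: at the random base rate
print(log_odds(0.0025, 0.10, 0.05))  # negative: seen half as often as chance
```

Published matrices typically scale and round these values to integers, which is why the PAM and BLOSUM tables contain whole numbers.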

12
PAM: Percent Accepted Mutations
  • Globally align protein sequences (at least 85%
    identity)
  • Calculate the frequency of amino acid
    substitutions
  • Divide by the normalized frequency of amino acids
    -> log ratios.
  • A 1 PAM matrix specifies a unit of evolutionary
    change: 1 accepted point mutation per 100
    residues.
  • Create a higher PAM matrix by multiplying the 1
    PAM matrix by itself.
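
The "multiply by itself" step applies to the mutation probability matrix (each row gives the chance of a residue staying the same or changing); the log-odds scores are derived from the result afterwards. The sketch below uses a made-up 3-letter alphabet with hypothetical probabilities, chosen so that each residue is conserved 99% of the time, just to show the mechanics.

```python
def mat_mul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Multiply m by itself n times (n = 0 gives the identity matrix)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix for a toy 3-residue alphabet; rows sum to 1.
toy_pam1 = [
    [0.990, 0.007, 0.003],
    [0.006, 0.990, 0.004],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 3) for p in row])
```

After 250 steps the diagonal is much smaller than 0.99, which is the sense in which a higher PAM number models more divergent sequences.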

13
PAM 250 matrix
       A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y  Z
    A  2  0 -2  0  0 -4  1 -1 -1 -1 -2 -1  0  1  0 -2  1  1  0 -6 -3  0
    B  0  2 -4  3  2 -5  0  1 -2  1 -3 -2  2 -1  1 -1  0  0 -2 -5 -3  2
    C -2 -4 12 -5 -5 -4 -3 -3 -2 -5 -6 -5 -4 -3 -5 -4  0 -2 -2 -8  0 -5
    D  0  3 -5  4  3 -6  1  1 -2  0 -4 -3  2 -1  2 -1  0  0 -2 -7 -4  3
    E  0  2 -5  3  4 -5  0  1 -2  0 -3 -2  1 -1  2 -1  0  0 -2 -7 -4  3
    F -4 -5 -4 -6 -5  9 -5 -2  1 -5  2  0 -4 -5 -5 -4 -3 -3 -1  0  7 -5
    G  1  0 -3  1  0 -5  5 -2 -3 -2 -4 -3  0 -1 -1 -3  1  0 -1 -7 -5 -1
    H -1  1 -3  1  1 -2 -2  6 -2  0 -2 -2  2  0  3  2 -1 -1 -2 -3  0  2
    I -1 -2 -2 -2 -2  1 -3 -2  5 -2  2  2 -2 -2 -2 -2 -1  0  4 -5 -1 -2
    K -1  1 -5  0  0 -5 -2  0 -2  5 -3  0  1 -1  1  3  0  0 -2 -3 -4  0
    L -2 -3 -6 -4 -3  2 -4 -2  2 -3  6  4 -3 -3 -2 -3 -3 -2  2 -2 -1 -3
    M -1 -2 -5 -3 -2  0 -3 -2  2  0  4  6 -2 -2 -1  0 -2 -1  2 -4 -2 -2
    N  0  2 -4  2  1 -4  0  2 -2  1 -3 -2  2 -1  1  0  1  0 -2 -4 -2  1
    P  1 -1 -3 -1 -1 -5 -1  0 -2 -1 -3 -2 -1  6  0  0  1  0 -1 -6 -5  0
    Q  0  1 -5  2  2 -5 -1  3 -2  1 -2 -1  1  0  4  1 -1 -1 -2 -5 -4  3
    R -2 -1 -4 -1 -1 -4 -3  2 -2  3 -3  0  0  0  1  6  0 -1 -2  2 -4  0
    S  1  0  0  0  0 -3  1 -1 -1  0 -3 -2  1  1 -1  0  2  1 -1 -2 -3  0
    T  1  0 -2  0  0 -3  0 -1  0  0 -2 -1  0  0 -1 -1  1  3  0 -5 -3 -1

14
BLOSUM: Blocks Substitution Matrix
  • Align ungapped protein sequences from the BLOCKS
    database
  • Look at the frequency of amino acid
    substitutions.
  • Compute a log-odds ratio matrix (as in the PAM
    calculations)
  • Other BLOSUM matrices: cluster sequences at a
    different similarity level and recalculate the
    matrix.
  • The number following indicates the percent
    identity within the set: BLOSUM62 = 62% identity.

15
BLOSUM62 matrix
       A  B  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  X  Y  Z
    A  4 -2  0 -2 -1 -2  0 -2 -1 -1 -1 -1 -2 -1 -1 -1  1  0  0 -3 -1 -2 -1
    B -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
    C  0 -3  9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -1 -2 -4
    D -2  6 -3  6  2 -3 -1 -1 -3 -1 -4 -3  1 -1  0 -2  0 -1 -3 -4 -1 -3  2
    E -1  2 -4  2  5 -3 -2  0 -3  1 -3 -2  0 -1  2  0  0 -1 -2 -3 -1 -2  5
    F -2 -3 -2 -3 -3  6 -3 -1  0 -3  0  0 -3 -4 -3 -3 -2 -2 -1  1 -1  3 -3
    G  0 -1 -3 -1 -2 -3  6 -2 -4 -2 -4 -3  0 -2 -2 -2  0 -2 -3 -2 -1 -3 -2
    H -2 -1 -3 -1  0 -1 -2  8 -3 -1 -3 -2  1 -2  0  0 -1 -2 -3 -2 -1  2  0
    I -1 -3 -1 -3 -3  0 -4 -3  4 -3  2  1 -3 -3 -3 -3 -2 -1  3 -3 -1 -1 -3
    K -1 -1 -3 -1  1 -3 -2 -1 -3  5 -2 -1  0 -1  1  2  0 -1 -2 -3 -1 -2  1
    L -1 -4 -1 -4 -3  0 -4 -3  2 -2  4  2 -3 -3 -2 -2 -2 -1  1 -2 -1 -1 -3
    M -1 -3 -1 -3 -2  0 -3 -2  1 -1  2  5 -2 -2  0 -1 -1 -1  1 -1 -1 -1 -2
    N -2  1 -3  1  0 -3  0  1 -3  0 -3 -2  6 -2  0  0  1  0 -3 -4 -1 -2  0
    P -1 -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2  7 -1 -2 -1 -1 -2 -4 -1 -3 -1
    Q -1  0 -3  0  2 -3 -2  0 -3  1 -2  0  0 -1  5  1  0 -1 -2 -2 -1 -1  2
    R -1 -2 -3 -2  0 -3 -2  0 -3  2 -2 -1  0 -2  1  5 -1 -1 -3 -3 -1 -2  0
    S  1  0 -1  0  0 -2  0 -1 -2  0 -2 -1  1 -1  0 -1  4  1 -2 -3 -1 -2  0
    T  0 -1 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1  0 -1 -1 -1  1  5  0 -2 -1 -2 -1
16
Which matrix
  • BLOSUM finds short, highly similar sequences.
  • BLOSUM is usually best for local similarity
    searches.
  • BLOSUM62 is the default for BLAST.
  • Less divergent <----------> More divergent
    BLOSUM80       BLOSUM62       BLOSUM45
    PAM1           PAM120         PAM250

17
Which matrix
  • If sequences are thought to be related, a PAM250
    is best.
  • When sequences are not known to be related,
    PAM120 tends to give more sensitivity.
  • To pick up short segments of highly similar
    sequences, PAM40 is a good choice.

18
Global Alignment
  • Optimal Alignment over the entire length of both
    sequences.

19
Global Alignment
  • Needleman-Wunsch algorithm
  • Every residue of the two sequences has to
    participate.
  • Guaranteed to calculate an optimal similarity
    score.
  • Cannot detect domains.
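
A score-only version of the Needleman-Wunsch recurrence can be sketched as follows. The linear gap penalty and the default weights are assumptions for illustration; with match = 1 and mismatch = gap = 0 the score reduces to the similarity measure used on the earlier "vintner" / "writers" slide.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score with a linear gap penalty."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j];
    # aligning anything against an empty prefix costs one gap per residue.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diagonal,            # align ca with cb
                            prev[j] + gap,       # ca against a gap
                            curr[j - 1] + gap))  # cb against a gap
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("vintner", "writers", match=1, mismatch=0, gap=0))  # 4
```

Because every residue of both sequences must take part, unrelated flanking regions are always charged to the score, which is why a short shared domain can go unnoticed.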

20
Local Alignment
  • Locates the highest scoring alignment regardless
    of position and length.

21
Local Alignment
  • Smith-Waterman algorithm
  • Guaranteed to find all significant matches to a
    given query.
  • Can find regions of strong similarity (domains).
  • Computationally expensive.
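
The local variant differs from the global recurrence mainly in that cell scores are floored at zero and the alignment is traced back from the highest-scoring cell. Below is a minimal sketch with a linear gap penalty; the weights are hypothetical and this version reports only the single best local alignment.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The floor at 0 lets an alignment start anywhere.
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTTACGTACGTTTTT", "GGGGACGTACGGGGG"))  # (14, 'ACGTACG', 'ACGTACG')
```

The full table has one cell per pair of positions, so time and memory grow with the product of the two sequence lengths; that is the expense the slide refers to.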

22
BLAST: Basic Local Alignment Search Tool
  • Objective: find all local regions of similarity
    distinguishable from random.
  • Local alignments
  • Gaps permitted
  • Statistically sound, but no guarantee of
    optimality
  • Fast
  • Less sensitive for (shorter) sequences.

23
BLAST: three-step algorithm
  • Compile a list of high-scoring words of length w
    (w = 4 for proteins, w = 12 for nucleic acids)
  • Scan the database for word hits with a score
    greater than the threshold T
  • Extend each word hit in both directions to find
    High Scoring Pairs with scores greater than S.
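
The three steps can be sketched in a heavily simplified form. Assumptions in this sketch: word hits are exact matches (rather than neighborhood words scored against a threshold T), extension is ungapped and stops with a simple drop-off rule, `min_score` plays the role of S, and all function names and parameter values are hypothetical. It is meant to show the seed-and-extend idea, not how BLAST itself is implemented.

```python
from collections import defaultdict


def word_index(query: str, w: int) -> dict:
    """Step 1: map every length-w word of the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def extend(query, subject, qi, si, w, match=1, mismatch=-2, x_drop=5):
    """Step 3: ungapped extension of one word hit, first right, then left.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far.
    Returns (score, query_start, subject_start, length).
    """
    best = score = w * match
    right = left = 0
    k = 0
    while qi + w + k < len(query) and si + w + k < len(subject):
        score += match if query[qi + w + k] == subject[si + w + k] else mismatch
        k += 1
        if score > best:
            best, right = score, k
        elif best - score > x_drop:
            break
    score = best
    k = 0
    while qi - 1 - k >= 0 and si - 1 - k >= 0:
        score += match if query[qi - 1 - k] == subject[si - 1 - k] else mismatch
        k += 1
        if score > best:
            best, left = score, k
        elif best - score > x_drop:
            break
    return best, qi - left, si - left, left + w + right


def find_hsps(query, subject, w=4, min_score=8):
    """Steps 2 and 3: scan the subject for word hits, extend, keep high scorers."""
    index = word_index(query, w)
    hsps = set()  # different seeds often extend to the same segment pair
    for si in range(len(subject) - w + 1):
        for qi in index.get(subject[si:si + w], ()):
            hsp = extend(query, subject, qi, si, w)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)


print(find_hsps("CCCACGTACGTAGGG", "TTTTACGTACGTATTTT"))  # [(9, 3, 4, 9)]
```

Only positions that share a word are ever examined, which is where the speed comes from, and also why a similarity with no exact word in common would be missed by this sketch.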

24
BLAST: word size
  • A word is any short sequence < 6 letters
  • Protein (1 - 2), nucleotide (1 - 6)
  • High word size results in:
  • Faster
  • Less sensitive
  • More selective

25
Other BLAST programs
    Program    Query                  Database
    BLASTN     nucleic acid           nucleic acid
    BLASTP     protein                protein
    BLASTX     translated NA          protein
    TBLASTN    protein                translated NA
    TBLASTX    translated NA          translated NA

26
Other BLAST Programs
27
Other BLAST Programs
28
Other BLAST Programs
29
Blast Databases
  • Nucleotide Sequence Databases
  • nr: All GenBank+RefSeq Nucleotides+EMBL+DDBJ+PDB
    sequences (but no EST, STS, GSS, or phase 0, 1 or
    2 HTGS sequences). No longer "non-redundant".
  • est: Database of GenBank+EMBL+DDBJ sequences from
    EST Divisions (est_human, est_mouse, est_others)
  • gss: Genome Survey Sequence, includes
    single-pass genomic data, exon-trapped sequences,
    and Alu PCR sequences.
  • htgs: Unfinished High Throughput Genomic
    Sequences, phases 0, 1 and 2 (finished, phase 3
    HTG sequences are in nr)
  • pat: Nucleotides from the Patent division of
    GenBank.
  • mito: Database of mitochondrial sequences
  • vector: Vector subset of GenBank(R), NCBI, in
    ftp://ftp.ncbi.nih.gov/blast/db/
  • pdb: Sequences from the 3-dimensional structure
    from Brookhaven Protein Data Bank
  • GENOMES (yeast, E. coli, Drosophila genome)
  • month: All new or revised GenBank+EMBL+DDBJ+PDB
    sequences released in the last 30 days.
  • alu: Select Alu repeats from REPBASE, suitable
    for masking Alu repeats from query sequences.
  • dbsts: Database of GenBank+EMBL+DDBJ sequences
    from STS Divisions.
  • chromosome: Searches Complete Genomes, Complete
    Chromosome, or contigs from the NCBI Reference
    Sequence project.

30
Blast Databases
  • Peptide Sequence Databases
  • nr: All non-redundant GenBank CDS
    translations+RefSeq Proteins+PDB+SwissProt+PIR+PRF
  • swissprot: Last major release of the SWISS-PROT
    protein sequence database (no updates)
  • pat: Proteins from the Patent division of
    GenPept.
  • pdb: Sequences derived from the 3-dimensional
    structure from Brookhaven Protein Data Bank
  • yeast: Yeast (Saccharomyces cerevisiae) genomic
    CDS translations
  • ecoli: Escherichia coli genomic CDS translations
  • Drosophila genome: Drosophila genome proteins
    provided by Celera and Berkeley Drosophila
    Genome Project (BDGP).
  • month: All new or revised GenBank CDS
    translation+PDB+SwissProt+PIR+PRF released in the
    last 30 days.

31
Other Blast Programs
  • MegaBlast - optimized for aligning sequences that
    differ slightly as a result of sequencing or
    other similar "errors".
  • Uses a larger word size (16)
  • Is up to 10 times faster than more common
    sequence similarity programs
  • Able to efficiently handle much longer DNA
    sequences
  • Non-affine gapping parameters (open 0,
    extension variable)

32
Other Blast Programs
  • TraceBlast - optimized for cross species
    comparisons.
  • word size (11)
  • Expect value (10)

33
Other BLAST programs
  • Gapped BLAST (BLAST 2.0)
  • Extends word hits from ungapped to gapped,
    generating gapped alignments
  • PSI-BLAST
  • Position-specific iterated BLAST: uses gapped
    BLAST to generate a profile over multiple
    iterations; the profile is used instead of the
    input sequence and distance matrix.

34
Limitations of BLAST
  • Needs islands of strong homology
  • Limits on the combination of scoring and penalty
    values
  • Variants (blastx, tblastn, tblastx) use 6-frame
    translation, yet still miss sequences with
    frameshifts, etc.
  • Finds and reports only local alignments.

35
Rules of Thumb
  • For short amino acid sequences (20 - 40
    residues), 50% identity can happen by chance.
  • (Increase the expect value and decrease the word
    size.)
  • If A and B are homologous, and B and C are
    homologous, then A and C are homologous, even if
    you cannot see it.
  • Locate and filter regions of low complexity;
    they lead to false-positive alignments
    (similarity without homology).
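
One simple way to picture low-complexity filtering is to mask windows whose residue composition has low Shannon entropy. This is only a stand-in sketch: BLAST's own filters (SEG for proteins, DUST for nucleotides) use different measures, and the window size, entropy cut-off and function names below are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(segment: str) -> float:
    """Entropy in bits of the residue composition of a segment."""
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in Counter(segment).values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 1.5,
                        mask_char: str = "X") -> str:
    """Replace every residue covered by a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# Hypothetical protein fragment with a poly-Q run in the middle.
print(mask_low_complexity("MKVLAAGICSQQQQQQQQQQQQQQQQTWFLPERD"))
```

Masked residues no longer seed or score matches, so a repetitive stretch cannot produce a high-scoring hit on composition alone.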

36
FASTA: Fast Alignment
  • Rapid global alignment
  • Allows alignment to shift frames
  • Not a strong mathematical basis

37
FASTA
  • Show diagrams.

38
LALIGN
  • A FASTA derivative for local alignments
  • Compares two protein sequences to identify
    regions of similarity
  • Will report several sequence alignments within a
    given sequence
  • Works for internal repeats that are missed by
    FASTA because of gaps.

39
Precomputed Alignments
  • Related sequences, related structures, related
    articles, summaries, etc.
  • InterPro
  • Pfam
  • ProDom
  • Smart
  • etc.

40
(No Transcript)
41
(No Transcript)
42
BimCore: BioMolecular Computing Resource
  • Founded in 1992 as a subscription-based
    computing support service for Emory
    bioinformatics-based research.
  • Bioinformatics is the MIS for molecular
    biology.
  • Our mission is to serve as a human interface
    between researchers and computing technology
    enabling investigators to refine scope of
    biological investigations and accelerate
    discovery.

43
BimCore: BioMolecular Computing Resource
  • Sequence Analysis, Molecular Modeling,
    Microarray Analysis
  • BimCore provides:
  • Software and computational hardware (evaluate
    software, purchase Emory site licenses)
  • Training (workshops, courses and online
    tutorials)
  • Collaborations (evaluation, direct approach,
    training and interpretation)
  • Computer expertise (programming, evaluate
    needs, direct purchase and set-up)

44
The Biology of Molecular Biology, Robert J. Huskey
http://www.people.virginia.edu/~rjh9u/humbiol.html
45
Sequence Analysis Facility
  • DNA: ...ATATAA...GTA...ATGCTAGGCGCTTCTATCTTC
         ..................UACGAUCCGCGAAGAUAGAAG
  • RNA: UAC GAU CCG CGA AGA UAG AAG...
  • Protein: M L G A S I F ..
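
The relationships between the sequences on this slide can be checked with a few lines of code. The sketch below assumes the standard genetic code; the helper names are hypothetical. It shows that reading the first strand (ATGCTAGG...) codon by codon gives M L G A S I F, that the second line is its base-by-base complement written with U, and how the six reading frames used by the translated BLAST variants are produced.

```python
# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def complement_as_rna(dna: str) -> str:
    """Base-by-base complement of a DNA string, written with U instead of T."""
    return dna.translate(str.maketrans("ACGT", "UGCA"))


def reverse_complement(dna: str) -> str:
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str) -> dict:
    """Translations of the three forward and three reverse reading frames."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


coding = "ATGCTAGGCGCTTCTATCTTC"       # end of the first strand on the slide
print(complement_as_rna(coding))       # UACGAUCCGCGAAGAUAGAAG
print(translate(coding))               # MLGASIF

# One inserted base (a frameshift) moves the rest of the protein to frame +2.
shifted = "ATGCTA" + "G" + "GGCGCTTCTATCTTC"
print(six_frames(shifted)["+1"])       # MLGRFYL
print(six_frames(shifted)["+2"])       # C*GASIF
```

The frameshift example illustrates why a six-frame translated search can still miss or split a hit: the protein continues, but in a different frame.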
46
Sequence Analysis Facility
  • Given a sequence (DNA or protein), search against
    biological databases for similarities in type
    and function.
  • Generate and review sequence alignments.
  • Discover the evolutionary path of a DNA or
    protein sequence (phylogeny).
  • Given a sequence of DNA (e.g. ACGTGTGGG),
    locate genes and mutations.
  • Assemble sequence fragments into a larger
    component sequence.

47
Molecular Modeling Center
  • Analyze protein structure, interpret the
    mechanism of action.
  • Generate a model structure by Homology
    Modeling.
  • Mutate a protein structure, predict the effect on
    the protein.
  • Design inhibition of protein function, drug
    design and molecular docking. Emerson

48
Microarray Data Analysis Facility
  • Determine the expression level of every gene in
    an organism, tissue or cell culture.
  • Monitor the changes in expression levels in
    response to environmental changes, such as
    stress, disease, drug.
  • Lead to a study and understanding of the
    biological response.

49
  • Sequence Analysis Facility
  • Given a sequence (DNA or protein), search against
    biological databases for similarities in type
    and function.
  • Generate and review sequence alignments.
  • Discover the evolutionary path of a DNA or
    protein sequence (phylogeny).
  • Given a sequence of DNA (e.g. ACGTGTGGG), locate
    genes and mutations.
  • Assemble sequence fragments into a larger
    component sequence.
  • Molecular Modeling Center
  • Generate a model structure by Homology Modeling.
  • Mutate a protein structure, predict the effect on
    the protein.
  • Design inhibition of protein function, drug
    design and molecular docking.
  • Microarray Data Analysis Facility
  • Determine the expression level of every gene in
    an organism, tissue or cell culture.
  • Monitor the changes in expression levels in
    response to environmental changes, such as
    stress, disease, drug.
  • Lead to a study and understanding of the
    biological response.