Bioinformatics Overview: Problems and Algorithms

1
Bioinformatics Overview: Problems and Algorithms
  • Debra T. Burhans, Ph.D.
  • Director, Bioinformatics Program
  • Canisius College
  • burhansd_at_canisius.edu
  • SIGCSE Workshop, February 2005

2
Outline
  • Overview of bioinformatics
  • Problems and Algorithms

3
Overview of Bioinformatics
4
What is Bioinformatics?
  • There are many definitions of Bioinformatics
  • The field is so new that it is still in the
    process of being defined
  • Bioinformatics involves the application of
    computational techniques to the representation
    and analysis of biological data
  • Bioinformatics is intertwined with a number of
    different sciences
  • According to the National Institutes of Health,
    bioinformatics is research, development, or
    application of computational tools and approaches
    for expanding the use of biological, medical,
    behavioral or health data, including those to
    acquire, store, organize, archive, analyze, or
    visualize such data.

5
Bioinformatics is Interdisciplinary
  • Biology: complex systems, source of interesting
    problems
  • Chemistry: biochemistry underlies molecular
    biology
  • Physics: physical phenomena at the molecular level
  • Mathematics: modeling
  • Statistics: understanding large data sets,
    biostatistics
  • Computer Science: formulating and solving problems,
    information representation, integration, storage,
    management
  • Informatics
  • Medicine/Health Professions: patient
    information, disease data

6
Bio Informatics
  • Biologists are generating an enormous amount of
    data with new high-throughput laboratory
    technologies
  • A single experiment can now yield thousands of
    data points
  • Sequencing experiments
  • Microarrays (gene and protein expression arrays)
  • There are thousands of journals, most of which
    are available electronically
  • There is no way for a scientist to
    analyze/understand this data without the aid of
    computational tools and statistical analyses
  • Bio: source of data and problems
  • Informatics: storing, retrieving, analyzing,
    understanding data with the use of computers

7
A Few Problems and Algorithms
8
Sequences
  • There are three different sequence types of
    interest
  • DNA
  • 4-letter alphabet: A, C, G, T; a 5th letter, N,
    is added for unknown
  • Nucleotides: adenine, cytosine, guanine, thymine
  • RNA
  • 4-letter alphabet: A, C, G, U
  • Like DNA but with uracil instead of thymine
  • Protein
  • 20-letter alphabet: A, R, N, D, C, E, Q, G, H, I,
    L, K, M, F, P, S, T, W, Y, V; a 21st letter, X, is
    added to represent an unknown aa
  • Amino acids (aa)
  • Alanine (A), Arginine (R), Asparagine (N),
    Aspartic acid (D), Cysteine (C), Glutamic acid
    (E), Glutamine (Q), Glycine (G), Histidine (H),
    Isoleucine (I), Leucine (L), Lysine (K),
    Methionine (M), Phenylalanine (F), Proline (P),
    Serine (S), Threonine (T), Tryptophan (W),
    Tyrosine (Y), Valine (V), Unknown (X)

9
The Central Dogma involves Mappings
  • DNA -> RNA
  • RNA is transcribed from DNA with the following
    nucleotide match-ups: A->U, T->A, C->G, G->C
  • RNA -> Protein
  • Protein is synthesized from RNA as groups of
    three letters of the RNA code (codons) are mapped
    onto single amino acids
  • There are only 20 amino acids yet 64 possible
    codons, so the code is redundant
  • Good problems arise with regular expressions
  • Create a regular expression that describes the
    set of codons that code for each amino acid
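The two mappings and the regular-expression exercise can be sketched in Python as follows. This is a minimal illustration, not a full implementation: the names `transcribe`, `translate`, `CODONS`, `GLYCINE` and `LEUCINE` are made up for this sketch, the codon table is deliberately partial, and strand direction is ignored so that bases are mapped position by position exactly as on the slide.

```python
import re

# Nucleotide match-ups from the slide: A->U, T->A, C->G, G->C
PAIRING = str.maketrans("ATCG", "UAGC")

def transcribe(template_dna):
    """Map each DNA base to its RNA partner, position by position."""
    return template_dna.upper().translate(PAIRING)

# Partial codon table (standard genetic code); only a few of the 64 entries.
CODONS = {
    "AUG": "M",
    "UUU": "F", "UUC": "F",
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def translate(rna):
    """Read codons left to right and stop at the first stop codon.

    '?' marks a codon that is missing from the partial table above.
    """
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODONS.get(rna[i:i + 3], "?")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Regular expressions describing the codon sets of two amino acids.
GLYCINE = re.compile(r"GG[UCAG]")         # GGU, GGC, GGA, GGG
LEUCINE = re.compile(r"CU[UCAG]|UU[AG]")  # CUU, CUC, CUA, CUG, UUA, UUG

rna = transcribe("TACAAACCCATT")  # "AUGUUUGGGUAA"
print(rna, translate(rna))        # AUGUUUGGGUAA MFG
```

Glycine needs a single character class because all four of its codons share the prefix GG; leucine needs an alternation because its six codons fall into two prefix families.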

10
Sequence Alignment
  • Problem of matching sequences (DNA, RNA, protein)
  • Helps to identify unknown sequences by comparing
    them to large databases of sequences whose
    function may be understood
  • Alignments may be global or local
  • Alignments may be gapped or ungapped
  • Alignments are rarely perfect in biology

11
Simple Alignment Problem
  • What is the relationship between two sequences,
    for example
  • ACTTA
  • AGGTACTAGACTTATTATATACTTAACTATATACTTAAAA
  • Overlap and containment
  • In biology the problem is much more complex due
    to insertions, deletions and changes in
    individual bases
  • Could be mutations (biological)
  • Could be errors in sequencing or data entry
  • Alignments are scored: number of matches,
    mismatches
  • Pairwise vs. multiple alignments (consensus
    sequence)
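For the exact-match version of this problem, a short Python sketch is enough. The helper names `find_all` and `ungapped_counts` are hypothetical, and the sketch deliberately ignores insertions and deletions, which is the part that makes the biological problem hard.

```python
def find_all(pattern, text):
    """Return every start index at which pattern occurs in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions

def ungapped_counts(a, b):
    """Count (matches, mismatches) between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, len(a) - matches

short = "ACTTA"
long_seq = "AGGTACTAGACTTATTATATACTTAACTATATACTTAAAA"
print(find_all(short, long_seq))          # [9, 20, 32]
print(ungapped_counts("ACTTA", "ACTAT"))  # (3, 2)
```

The short sequence is contained in the long one three times, at 0-based positions 9, 20 and 32.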

12
Alignment with Gaps
  • Insertions and deletions in sequences (indels)
    lead to the notion of a gapped alignment
  • In addition to match and mismatch scores include
    a gap penalty
  • A C G T - - T
  • A - T T T T T
  • With a match score of 2, a mismatch score of -2,
    and a gap penalty of -1, this alignment scores
    2 - 1 - 2 + 2 - 1 - 1 + 2 = 1
  • The scoring parameters are adjusted to reflect
    the underlying biology
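The column-by-column scoring of a gapped alignment can be checked with a few lines of Python. This is a sketch; the function name `score_alignment` and its parameter names are assumptions, with defaults set to the values used on this slide.

```python
def score_alignment(top, bottom, match=2, mismatch=-2, gap=-1):
    """Sum per-column scores of two aligned rows; '-' marks a gap."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap        # insertion or deletion
        elif a == b:
            total += match
        else:
            total += mismatch
    return total

print(score_alignment("ACGT--T", "A-TTTTT"))  # 1
```

Changing the three parameters is how the scoring is adjusted to reflect the underlying biology, for example by penalizing gaps more heavily than mismatches.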

13
Multiple Sequence Alignment
Alignment of protein sequences
14
A look at scoring matrices
  • BLOSUM62
  • Created from multiple sequence alignments
  • Larger numbered matrices reflect more closely
    related sequences
  • PAM250
  • 1 PAM is the amount of evolutionary change that
    yields, on average, one substitution in 100 amino
    acid (aa) residues
  • A PAM matrix is a matrix of similarity scores for
    all possible pairs of residues (protein)
  • The matrix was derived from aa replacements
    occurring in related proteins

15
Local vs. Global Alignment
  • Global alignment is concerned with aligning two
    sequences end to end
  • Local alignment seeks the highest scoring
    alignment between subsequences
  • BLAST: Basic Local Alignment Search Tool
  • The primary tool available for scoring alignments
  • Parameters can be set in web interface
  • Did you BLAST your sequence?
  • Many flavors, including
  • Nucleotide-nucleotide
  • Protein-protein
  • Protein-nucleotide

16
BLAST Exercise (handout)
17
Dynamic Programming
  • Alignments can be computed using dynamic
    programming
  • Needleman-Wunsch (global alignment)
  • Smith-Waterman (local alignment)
  • Good programming exercise

18
Alignment Matrix
Choose a scoring metric, e.g. score 1 for a match,
score 0 for a mismatch, and score 0 for the gap penalty.
Three steps in dynamic programming: initialization,
matrix fill (scoring), and traceback (alignment).
For each position, $M_{i,j}$ is defined to be the
maximum score at position $i,j$, i.e.
$$M_{i,j} = \max \{\, M_{i-1,j-1} + S_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\}$$
where $M_{i-1,j-1} + S_{i,j}$ is the match/mismatch
step along the diagonal, $M_{i,j-1} + w$ is a gap in
sequence 1, and $M_{i-1,j} + w$ is a gap in sequence 2.
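The three steps can be sketched in Python as below. This is a minimal global alignment in the Needleman-Wunsch style, assuming sequence 1 runs along the rows (index i) and sequence 2 along the columns (index j). The function name is hypothetical and the defaults use the example scoring above (match 1, mismatch 0, gap 0). When several alignments tie for the best score, the traceback returns just one of them.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=0):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1

    # Step 1, initialization: first row and column hold cumulative gap penalties.
    M = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        M[i][0] = i * gap
    for j in range(1, cols):
        M[0][j] = j * gap

    # Step 2, matrix fill: each cell is the maximum of the three predecessors.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,  # match/mismatch on the diagonal
                          M[i][j - 1] + gap,    # gap in sequence 1
                          M[i - 1][j] + gap)    # gap in sequence 2

    # Step 3, traceback: walk from the bottom-right corner back to the origin.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and M[i][j] == M[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return M[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))

score, row1, row2 = needleman_wunsch("GAATTCAGTTA", "GGATCGA")
print(score)  # 6
print(row1)
print(row2)
```

With match 1 and both penalties 0, the score is simply the number of matching columns, which is why this scoring is convenient for a first exercise.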
19
Fragment Reassembly
  • Shotgun sequencing involves chopping up DNA into
    small pieces, sequencing those pieces, then
    figuring out how they all fit together
  • The original structure is reconstructed via
    fragment reassembly
  • This was the technique used by J. Craig
    Venter to revolutionize the sequencing of the
    human genome
  • Only possible due to computational power

20
Fragment Reassembly Example
  • Reassemble a set of sequences into a single
    sequence
  • ACCT
  • CTTAG
  • TAGTAGTAG
  • AGGTC
  • Construct overlap graph
  • Problem: repetitive regions may be collapsed
  • Not minimal superstring!
  • In reality the sequences are hundreds of bases
    long and there are thousands of them
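The overlap graph for the four fragments above can be built with a short Python sketch. The helper names `overlap`, `overlap_graph` and `merge_path` are hypothetical, and the order in which fragments are merged is supplied by hand here. Choosing that order automatically is the hard part of assembly and is not attempted.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(fragments, min_overlap=1):
    """Map each ordered pair (a, b) to its overlap length, keeping only long-enough edges."""
    graph = {}
    for a in fragments:
        for b in fragments:
            if a != b:
                k = overlap(a, b)
                if k >= min_overlap:
                    graph[(a, b)] = k
    return graph

def merge_path(path):
    """Merge fragments in the given order, collapsing each pairwise overlap."""
    merged = path[0]
    for prev, nxt in zip(path, path[1:]):
        merged += nxt[overlap(prev, nxt):]
    return merged

fragments = ["ACCT", "CTTAG", "TAGTAGTAG", "AGGTC"]
for (a, b), k in overlap_graph(fragments, min_overlap=2).items():
    print(a, "->", b, "overlap", k)
print(merge_path(fragments))  # ACCTTAGTAGTAGGTC
```

With a minimum overlap of 2, the graph has four edges, and following ACCT, CTTAG, TAGTAGTAG, AGGTC gives a 16-base reconstruction that keeps all three copies of the TAG repeat.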

21
Pioneers are Revolutionizing Biology
Leaving colleagues and rivals to comb through the
finished human code in search of individual
genes, he has decided to sequence the genome of
Mother Earth. "My greatest success is that I
managed to get hated by both worlds," Venter told
me on St. Barts.
What separates him from your average 58-year-old
nude beachcomber is that he's in the midst of a
scientific enterprise as ambitious as anything
he's ever done.
22
Pioneers are Revolutionizing Biology
He wanted to play God, so he cracked the human
genome. Now he wants to play Darwin and collect
the DNA of everything on the planet.
In March of 2004, he announced that his Sargasso
team had discovered at least 1,800 new species
and more than 1.2 million new genes.
23
Gene Finding
  • Many programs and approaches are used
  • http://www.binf.ku.dk/users/krogh/genefinding.html
  • HMMs
  • Search for particular features associated with
    genes
  • Start codon: ATG
  • Stop codons: TAA, TAG, TGA
  • ORFs (open reading frames)
  • Areas of high complexity (a lot of nucleotide
    variation)
  • Splice sites
  • Protein binding sites
  • etc.
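The simplest of these features, an open reading frame bounded by a start codon and an in-frame stop codon, can be found with a short Python scan. This is a sketch under strong assumptions: it looks at the forward strand only, ignores splice sites and every other signal in the list, and the names `find_orfs` and `min_codons` are invented for the example. Real gene finders such as the HMM-based programs combine many signals.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG followed by an in-frame stop codon.

    Indices are 0-based and end is exclusive. The stop codon is included in
    the returned sequence and counted in the codon length.
    """
    dna = dna.upper()
    orfs = []
    start = dna.find("ATG")
    while start != -1:
        # Step through the sequence three bases at a time in the ATG's frame.
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                break
        start = dna.find("ATG", start + 1)
    return orfs

print(find_orfs("CCATGAAATAGGG"))  # [(2, 11, 'ATGAAATAG')]
```

An ATG with no in-frame stop before the end of the sequence is skipped, and nested start codons are reported as separate, overlapping ORFs.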

24
Phylogenetic Trees
  • Relationships between species or sequences

25
The Tree of Life
26
Ribosomal Small Subunit RNA Tree of Life
[Figure: phylogenetic tree of the three domains: Bacteria, Archaea, and Eukarya]
27
Conclusion
  • Lots of interesting data and real problems
  • Can be incorporated into CS courses at all levels