Title: Basic Overview of Bioinformatics Tools and Biocomputing Applications I
1Basic Overview of Bioinformatics Tools and
Biocomputing Applications I
- Dr Tan Tin Wee
- Director
- Bioinformatics Centre
2Software Tools
- Data stored in retrievable forms in database systems
- Data generated by machines: DNA/protein sequencers, automated systems
[Diagram labels: Automated Machines, Research Labs, Biological Data, Analytical Tools, Databases, New Knowledge]
3Common Computational Analyses
- Sequence Assembly
- Simple sequence analysis
- Translation and reverse complement, ORF
- Composition statistics (protein, DNA)
- Molecular mass
- Total charge and pI, local hydropathy
- Simple determination of secondary structures
- Restriction site analysis
- Internal repeat analysis
- Detection of active sites, functional residues,
characteristic structures, substrates, and
processing signals
4Common Computational Analyses
- Database sequence search
- Multiple alignment
- 2° and 3° structure prediction, transmembrane helix detection
- Structure modeling
- Docking prediction and design
- Hidden Markov model searches
5Sequence Assembly
- Fragmented data from DNA sequencers
- Detection of Overlap
- Merging of Contigs
- Assembly into continuous sequence
[Figure: assembly diagram with strand ends labelled 5' and 3']
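The overlap-and-merge steps above can be sketched as follows. This is a minimal illustration, assuming error-free reads that all come from the same strand. The function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the greedy merging strategy are illustrative choices, not the method of any particular assembly package.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining contigs stay separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads, made up for this example
reads = ["ATCTCGATAC", "GATACTACTA", "CTACTACG"]
print(greedy_assemble(reads))
```

Real assemblers also have to handle sequencing errors, reads from both strands, and repeats, none of which this sketch attempts.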
6Sequence Format Interconversion
- DNA/protein and other sequence data come in different formats
- Annotations
- Different programs use different formats
- Interconversion utility tools
- e.g. READSEQ, TOGCG, TOSTADEN, etc.
7Simple Sequence Analysis
1. Linear sequence, e.g. DNA/protein
2. Open a window of n residues (n variable, from 1 upwards) and slide it along the sequence
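The sliding-window idea underlies most of the analyses that follow. A minimal sketch, in which the name `sliding_windows` and the `step` parameter are illustrative assumptions:

```python
from typing import Iterator


def sliding_windows(seq: str, n: int, step: int = 1) -> Iterator[tuple[int, str]]:
    """Yield (start, window) for every window of length n along seq."""
    if n < 1 or step < 1:
        raise ValueError("window size and step must be at least 1")
    for start in range(0, len(seq) - n + 1, step):
        yield start, seq[start:start + n]


for start, window in sliding_windows("ATCTCG", 3):
    print(start, window)
```

With `n=1` this visits single residues (as in complementing a strand); with `n=3` and `step=3` it visits codons.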
8Some Simple Sequence Analysis Applications
- DNA complementary strand, e.g. COMPLEMENT, REVERSE
- Open window of size 1
- A ---> T
- C ---> G
- T ---> A
- G ---> C
- Slide to next window of 1
- Proceed to end of sequence
- Reverse order of complement
- 5' ...ATCTCGATACTACTACG...3'
- 3' ...TAGAGCTATGATGATGC...5'
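The procedure above can be sketched in a few lines. The function names are illustrative and this is not the code of the COMPLEMENT or REVERSE programs; only unambiguous bases (A, C, G, T) are handled.

```python
# Translation table for a window of size 1: A<->T, C<->G
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq: str) -> str:
    """Complement each base, keeping the original left-to-right order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse so the result reads 5' to 3'."""
    return complement(seq)[::-1]


top = "ATCTCGATACTACTACG"
print(complement(top))          # bottom strand written 3' to 5'
print(reverse_complement(top))  # bottom strand written 5' to 3'
```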
9Some Simple Sequence Analysis Applications
- DNA to protein sequence translation, e.g. TRANSLATE
- Open window of 3 bases
- Look up the codon (genetic code) table
- Assign amino acid residue
- Slide window to next 3 bases
- Proceed till stop codon detected
- Repeat whole procedure for six frames
ATACTACTGAGATCTAGGCTAGTACTGCGTGCG
Forward strand: Frames 1-3
Complement: Frames 4-6
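A minimal sketch of six-frame translation, assuming the standard genetic code and unambiguous DNA bases. The names `translate` and `six_frames` and the `stop_at_stop` flag are illustrative, not the interface of the TRANSLATE program; unknown codons are reported as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str, stop_at_stop: bool = False) -> str:
    """Translate seq codon by codon; '*' marks a stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):  # window of 3 bases, sliding by 3
        aa = CODON_TABLE.get(seq[i:i + 3].upper(), "X")
        if aa == "*" and stop_at_stop:
            break
        protein.append(aa)
    return "".join(protein)


def six_frames(seq: str) -> dict[int, str]:
    """Frames 1-3 from the given strand, frames 4-6 from its reverse complement."""
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(seq[offset:])
        frames[offset + 4] = translate(rev[offset:])
    return frames


dna = "ATACTACTGAGATCTAGGCTAGTACTGCGTGCG"
for frame, protein in sorted(six_frames(dna).items()):
    print(frame, protein)
```

Here every frame is translated in full with stops shown as `*`; passing `stop_at_stop=True` to `translate` instead ends at the first stop codon, as in the procedure above.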
10Some Simple Sequence Analysis Applications
- Detect open reading frames, e.g. ORF
- Translate sequence, report long stretches between start and stop codons
- Compositional analysis
- e.g. Calculate total A, T, G, C
- e.g. Calculate total molecular mass of protein, percentages of amino acids
- e.g. Total charge composition, pI
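A sketch of ORF detection and base composition, assuming ATG as the only start codon and scanning the three forward frames only (the complementary strand would be handled by running the same scan on the reverse complement). The names `find_orfs`, `base_composition`, and `min_codons` are illustrative.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for each ATG...stop stretch; end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:  # codons before the stop
                    orfs.append((start, i + 3, frame + 1))
                start = None
    return sorted(orfs)


def base_composition(seq: str) -> dict[str, int]:
    """Count each base in the sequence."""
    counts = Counter(seq.upper())
    return {base: counts.get(base, 0) for base in "ATGC"}


dna = "CCATGAAATTTGGGTAACC"  # hypothetical sequence for the example
print(find_orfs(dna))
print(base_composition(dna))
```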
11Some Simple Sequence Analysis Applications
- Simple prediction of secondary structure of a protein sequence
- Decide a window size
- Compute for each window of amino acids the statistical potential to form helix, beta sheet, turn, etc. (Chou-Fasman, GOR, etc. algorithms)
- Use a statistical potential chart
- Plot potentials in graphical or pictorial format
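The windowed calculation can be sketched as below. This is only the averaging step, not the full Chou-Fasman method (which adds nucleation and extension rules). The helix propensity values are the commonly tabulated Chou-Fasman figures as recalled here and should be checked against a published table before use; the name `window_scores` and the default window of 6 are assumptions.

```python
# Helix propensities per residue (assumed values; verify against a published chart)
HELIX_PROPENSITY = {
    "E": 1.51, "M": 1.45, "A": 1.42, "L": 1.21, "K": 1.16,
    "F": 1.13, "Q": 1.11, "W": 1.08, "I": 1.08, "V": 1.06,
    "D": 1.01, "H": 1.00, "R": 0.98, "T": 0.83, "S": 0.77,
    "C": 0.70, "Y": 0.69, "N": 0.67, "P": 0.57, "G": 0.57,
}


def window_scores(protein: str, window: int = 6) -> list[float]:
    """Average helix propensity for each window along the protein."""
    scores = []
    for i in range(len(protein) - window + 1):
        segment = protein[i:i + window]
        scores.append(sum(HELIX_PROPENSITY[aa] for aa in segment) / window)
    return scores


# Hypothetical peptide: helix-favouring residues followed by Gly/Pro
for position, score in enumerate(window_scores("AEMLKQGPGS"), start=1):
    print(position, round(score, 2), "#" * int(score * 10))
```

The crude text bars stand in for the graphical plot; the same loop with a beta-sheet or turn chart gives the other potentials.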
12Some Simple Sequence Analysis Applications
- Restriction mapping, e.g. MAP, MAPPLOT, MAPSORT, PLASMIDMAP, etc.
- Table of restriction enzymes and cut sites, e.g. EcoRI, BamHI, AluI and their cut sites, e.g. GAATTC, AATT
- Take a DNA sequence
- Pattern match against the list of cut sites
- For each match, assign restriction enzyme
- Calculate distance between cut sites
- Display in table, graphical, or restriction map, etc.
[Figures: gel; Plasmidmap]
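A sketch of the pattern-matching and distance steps for a linear DNA molecule. The three-enzyme table (recognition site plus cut offset within the site on the top strand) and the names `find_cuts` and `fragment_lengths` are illustrative; a real table would hold hundreds of enzymes, including ambiguous recognition sites.

```python
import re

# enzyme -> (recognition site, offset of the cut within the site on the top strand)
ENZYMES = {
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "AluI": ("AGCT", 2),     # AG^CT
}


def find_cuts(seq: str) -> list[tuple[int, str]]:
    """Return (cut position, enzyme) for every recognition site in seq."""
    seq = seq.upper()
    cuts = []
    for name, (site, offset) in ENZYMES.items():
        # Lookahead so that overlapping sites are all reported
        for match in re.finditer(f"(?={site})", seq):
            cuts.append((match.start() + offset, name))
    return sorted(cuts)


def fragment_lengths(seq_len: int, positions: list[int]) -> list[int]:
    """Distances between successive cuts on a linear molecule."""
    bounds = [0] + sorted(set(positions)) + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCTTAGCTGGGATCCAA"  # hypothetical sequence for the example
cuts = find_cuts(dna)
print(cuts)
print(sorted(fragment_lengths(len(dna), [pos for pos, _ in cuts]), reverse=True))
```

Sorting the fragment lengths in descending order gives the band order expected on a gel.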
13Some Simple Sequence Analysis Applications
- Protein sequence motif pattern matching, e.g. PROSITEMAP, MOTIFS, BLOCKS, etc.
- Table/database of sequence patterns/motifs and their signature sequence, e.g. Arg-Gly-Asp (RGD), or consensus sequence (e.g. PROSITE, BLOCKS db)
- Take protein sequence
- Pattern match against the list of signature sites
- For each match, assign potential function according to database
- Display in table or graphically, or hyperlinked
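Motif scanning amounts to matching a table of patterns against the protein. In this sketch the two-entry `MOTIFS` table is a stand-in for a real motif database: the RGD tripeptide from the text, and the N-glycosylation pattern N-{P}-[ST]-{P} hand-converted to a regular expression. The function name and the 1-based reporting are assumptions.

```python
import re

# Stand-in motif table: description -> regular expression
MOTIFS = {
    "RGD cell attachment sequence": r"RGD",
    "N-glycosylation site, N-{P}-[ST]-{P}": r"N[^P][ST][^P]",
}


def scan_motifs(protein: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, matched residues, description) for each hit."""
    hits = []
    for description, pattern in MOTIFS.items():
        # Lookahead with a capture group reports overlapping matches too
        for match in re.finditer(f"(?=({pattern}))", protein.upper()):
            hits.append((match.start() + 1, match.group(1), description))
    return sorted(hits)


# Hypothetical protein fragment for the example
for position, residues, description in scan_motifs("MKRGDLNGSAPNPTQ"):
    print(position, residues, description)
```

A match only suggests a potential function; short patterns such as these occur by chance in many unrelated proteins.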
14Some Simple Sequence Analysis Applications
- Peptide cleavage maps, e.g. PEPTIDESORT, PEPTIDEMAP
- Table of proteases vs cleavage sites, e.g. trypsin, chymotrypsin, and chemical cleavage sites, e.g. cyanogen bromide
- Pattern match with entire protein sequence
- Calculate size of peptide fragments
- Sort and map, plot as electrophoretic patterns on a log-linear simulated digest
- Compute partial digest patterns
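A sketch of a complete digest, using two simplified cleavage rules: trypsin cutting after Lys or Arg unless the next residue is Pro, and cyanogen bromide cutting after Met. The rule table and the name `digest` are illustrative; fragment size is reported in residues rather than mass, and partial digests are not modelled.

```python
import re

# agent -> zero-width pattern marking the position of each cut (simplified rules)
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",   # after K or R, but not before P
    "cyanogen bromide": r"(?<=M)",  # after M
}


def digest(protein: str, agent: str) -> list[str]:
    """Cut the protein at every site for the agent (complete digest)."""
    pieces = re.split(CLEAVAGE_RULES[agent], protein.upper())
    return [piece for piece in pieces if piece]  # drop empty ends


protein = "MAKRPGKLMEER"  # hypothetical peptide for the example
for agent in CLEAVAGE_RULES:
    fragments = sorted(digest(protein, agent), key=len, reverse=True)
    print(agent, [(fragment, len(fragment)) for fragment in fragments])
```

Splitting on a zero-width pattern requires Python 3.7 or later.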
15Some Simple Sequence Analysis Applications
- DOTPLOT - self-comparison
- Take a Window size
- Compare against entire length of own sequence
- Report matches above a threshold
- Plot on Graph
- Slide window, repeat till end of sequence
- Detection of Internal repeats
- Pairwise comparison - detection of homology
[Figure: dot plot with Sequence A on both axes]
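The windowed comparison can be sketched as follows, with a text grid standing in for the graph. The names `dotplot` and `render` and the identity-count scoring are assumptions; real programs often score windows with a substitution matrix instead.

```python
def dotplot(a: str, b: str, window: int = 3, threshold: int = 3) -> list[tuple[int, int]]:
    """Return (i, j) wherever window i of a matches window j of b at or above threshold."""
    dots = []
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            if matches >= threshold:
                dots.append((i, j))
    return dots


def render(a: str, b: str, dots: list[tuple[int, int]], window: int) -> str:
    """Draw the dots as a text grid: rows follow a, columns follow b."""
    marked = set(dots)
    rows = []
    for i in range(len(a) - window + 1):
        rows.append("".join("*" if (i, j) in marked else "."
                            for j in range(len(b) - window + 1)))
    return "\n".join(rows)


seq = "ACGTACGT"  # hypothetical sequence containing an internal repeat
dots = dotplot(seq, seq, window=4, threshold=4)
print(render(seq, seq, dots, window=4))
```

In a self-comparison the main diagonal is always filled; dots off the diagonal indicate internal repeats. Passing two different sequences gives the pairwise comparison.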
16Some Simple Sequence Analysis Applications
- RNA secondary structure analysis
- Mfold, PlotFold, FoldRNA, Squiggles, Circles, Domes, Mountains, StemLoop
- Folding of RNA into stems, loops
- Calculation of energy - prediction of stability of structure
- Display of structure and alternatives
[Figure: RNA folded into a stem-loop, with strands AUGC and UACG paired, flanked by ...AUCGA and AUCUC...]
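As a much simpler stand-in for the energy calculations performed by programs such as Mfold, the sketch below counts the maximum number of nested base pairs with Nussinov-style dynamic programming. It does not compute free energies. The pairing set (Watson-Crick plus G-U wobble), the minimum hairpin loop of 3 unpaired bases, and the name `max_base_pairs` are assumptions of this example.

```python
# Allowed pairs: Watson-Crick plus G-U wobble
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop bases per hairpin loop."""
    n = len(rna)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]  # best[i][j]: max pairs within rna[i..j]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))  # hypothetical hairpin-forming sequence
```

Energy-based folding uses the same recursive decomposition but scores stacks and loops with measured thermodynamic parameters instead of counting pairs.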
17Database Searching
- Text-based database searching: using a text string to match an annotation in a sequence database record, i.e. keyword search
- Sequence-based database searching: using a biological sequence to match the whole or parts of its sequence to the sequence of every record in a sequence database
18Text-Based Database Searching
- Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems
- Search concepts
- Boolean search - AND, OR, NOT
- Broadening the search
- Narrowing the search
- Proximity searching, soundex
- Wild card, stemming, e.g. Thala* for thalasemia, thalassemia, thalassemic
- Use standard string search algorithms and Boolean operations, vocabulary matches
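Boolean and wild-card searching can be sketched over a toy set of annotation records. The three records are invented for this example, and the `search` function with its `all_of`, `any_of`, and `none_of` parameters is a hypothetical interface, not that of Entrez, SRS, or any other system named above.

```python
from fnmatch import fnmatchcase

# Invented annotation records, for illustration only
RECORDS = {
    "rec1": "human period homolog circadian rhythm gene",
    "rec2": "Drosophila per gene circadian clock",
    "rec3": "human beta thalassemia mutation in beta globin gene",
}


def matches(text: str, term: str) -> bool:
    """True if any word of the text matches the term ('*' is a wild card)."""
    return any(fnmatchcase(word, term.lower()) for word in text.lower().split())


def search(all_of=(), any_of=(), none_of=()) -> list[str]:
    """AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for record_id, text in RECORDS.items():
        if (all(matches(text, term) for term in all_of)
                and (not any_of or any(matches(text, term) for term in any_of))
                and not any(matches(text, term) for term in none_of)):
            hits.append(record_id)
    return hits


print(search(all_of=["human", "per"]))                      # too narrow
print(search(all_of=["human"], any_of=["per", "period"]))   # broadened with OR
print(search(all_of=["thala*"]))                            # wild card
print(search(all_of=["gene"], none_of=["human"]))           # NOT
```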
19Text-based Database Searching
- Example: to find the human homolog of the Drosophila per gene
- Procedure
- Web to Entrez
- All Fields: enter "human" "per"
- Hits returned, irrelevant - broaden search
- "human" "period" - more hits
- Check every one, find the human RIGUI gene
- Hit and miss, clever guesswork; free-form or controlled vocabulary (MeSH terms)? Use Boolean searches?
20Sequence-based Database Searching
- Homology Search
- Global or Local Sequence Alignment
- Needleman-Wunsch algorithm
- Smith-Waterman algorithm
- Lipman-Pearson FASTA
- Altschul's BLAST
- Take a sequence, pairwise comparison with each
sequence in the database
21Sequence-based Database Searching
- Basic Assumptions
- Sequences of homologous genes/proteins diverge over time even though structure and/or function change little
- Significant sequence similarity is inferred as potential structural/functional similarity or common evolutionary origin
- Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level
22Sequence-based Database Searching
- Global alignment: forces complete alignment of the two input sequences in the pairwise comparison
- Local alignment: looks for local stretches of similarity and tries to align the most similar segments
- Algorithms used may be similar, but the output is different; statistics are needed to assess results
23Sequence-based Database Searching
- Alignment Scoring
- Substitution score and substitution matrix: PAM, BLOSUM
- Affine gap costs/gap penalty and gap scores
- Optimal alignments, dynamic programming: Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH)
- Additional heuristics - FASTA, BLAST
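The dynamic-programming idea behind both optimal methods can be sketched in one function that returns only the alignment score. For brevity it assumes a simple match/mismatch score in place of a PAM or BLOSUM matrix and a linear gap penalty in place of affine gap costs; the name `alignment_score` and its default values are illustrative.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1,
                    gap: int = -2, local: bool = False) -> int:
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    # First row: leading gaps cost nothing locally, j * gap globally
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# Hypothetical sequences sharing the segment GATTACA
query, subject = "TTTGATTACATTT", "CCGATTACACC"
print("global:", alignment_score(query, subject))
print("local: ", alignment_score(query, subject, local=True))
```

The only differences between the two modes are the first row and column, the floor at zero, and where the final score is read, which is why the algorithms look similar while their outputs differ. A database search repeats this comparison against every record; FASTA and BLAST add heuristics to avoid filling the whole table each time.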