Title: Challenges in Bioinformatics Part II
1 Challenges in Bioinformatics Part II
- Vasileios Hatzivassiloglou
- University of Texas at Dallas
2 More observations on GloFish
- GloFish license prohibits:
- intentional breeding
- sale/barter/trade of offspring
- Attitudes on recombinant DNA / genetic engineering in Europe
- Le Monde Diplomatique (January 2004)
- Frankenfish and the future
3 Sequence analysis
- Given a genome sequence, determine the gene regions
- Uses data from observed output (expression)
- Models based on transducers and Markov models
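A minimal sketch of the Markov-model idea, assuming a two-state hidden Markov model decoded with the Viterbi algorithm. The state names ("G" for gene-like, "N" for non-gene) and every probability below are made-up illustrative values, not parameters from the slides or from any real gene finder.

```python
import math

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for seq under an HMM."""
    if not seq:
        return []
    # Work in log space so long sequences do not underflow.
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(seq)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for being in state s at position t.
            prev, best = max(
                ((p, score[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            score[t][s] = best + math.log(emit_p[s][seq[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(seq) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy parameters: "G" regions are GC-rich, "N" regions are AT-rich.
STATES = ("G", "N")
START = {"G": 0.5, "N": 0.5}
TRANS = {"G": {"G": 0.9, "N": 0.1}, "N": {"G": 0.1, "N": 0.9}}
EMIT = {
    "G": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "N": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}

labels = "".join(viterbi("ATATGCGCGCATAT", STATES, START, TRANS, EMIT))
```

With these toy numbers the GC-rich stretch in the middle is labeled "G" and the flanks "N". A real gene finder would estimate the probabilities from annotated sequences and expression data.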
4 String Similarity
- Two issues:
- Given two sequences, what is their similarity?
- How are two similar sequences aligned?
- Global alignment: algorithms based on dynamic programming
- Local alignment: FASTA, BLAST
- Useful for homology search (finding related genes / proteins in different organisms)
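Global alignment by dynamic programming can be sketched as follows. This is a Needleman-Wunsch style illustration. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for readability, not values from the slides.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right cell to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best_score, row_a, row_b = global_align("GATTACA", "GATACA")
```

The table takes time and memory proportional to the product of the two lengths. That cost is one reason heuristic local-alignment tools such as FASTA and BLAST are preferred for database-scale homology search.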
5 Motif finding
- Motifs are short sequences in DNA or proteins that are over-represented in a sample
- Find motifs and understand their role
- Typically with statistical word counting methods
- Alternatively with a predictive model and the EM (expectation-maximization) algorithm
- Important for recognizing promoter regions
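A minimal sketch of statistical word counting: count every length-k word and compare the observed count with the count expected if bases occurred independently at their overall frequencies. The function name, the background model, and the three toy sequences are assumptions made for illustration.

```python
from collections import Counter

def kmer_enrichment(seqs, k):
    """Map each k-mer to (observed count, observed / expected count)."""
    counts = Counter()
    base_counts = Counter()
    windows = 0
    for s in seqs:
        base_counts.update(s)
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
            windows += 1
    total = sum(base_counts.values())
    freq = {base: c / total for base, c in base_counts.items()}
    result = {}
    for kmer, observed in counts.items():
        # Expected count under an independent-base background model.
        p = 1.0
        for base in kmer:
            p *= freq[base]
        result[kmer] = (observed, observed / (p * windows))
    return result

# Hypothetical toy sample: each sequence contains the word TATAA.
sample = ["ACGTATAAGC", "TTATAAGG", "CCTATAAT"]
table = kmer_enrichment(sample, 5)
most_frequent = max(table, key=lambda kmer: table[kmer][0])
```

An observed/expected ratio well above 1 marks a candidate motif. On real data the background model and a significance test matter far more than in this toy case.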
6 Clustering
- Genes that are co-expressed in a DNA microarray experiment may be functionally related
- Cluster genes on
- sequence similarity or
- expression data
- Issues include feature selection, outlier detection, validation of clusters (mathematically and biologically)
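One simple way to cluster on expression data, sketched under stated assumptions: genes whose expression profiles have a Pearson correlation above a threshold are linked, and linked genes form a cluster (single linkage). The gene names, profiles, and the 0.9 threshold are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_clusters(profiles, threshold=0.9):
    """Group genes connected by correlations >= threshold (single linkage)."""
    genes = list(profiles)
    parent = {g: g for g in genes}

    def find(g):
        # Union-find root lookup with path halving.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                parent[find(g)] = find(h)
    clusters = {}
    for g in genes:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical expression levels across four microarray conditions.
profiles = {
    "geneA": [1, 2, 3, 4],
    "geneB": [2, 4, 6, 8.5],
    "geneC": [4, 3, 2, 1],
    "geneD": [8, 6, 4, 2.5],
}
groups = coexpression_clusters(profiles)
```

Here geneA and geneB rise together while geneC and geneD fall together, so two clusters result. The issues on the slide (feature selection, outliers, validation) are exactly what this sketch ignores.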
7 Classification
- Detect particular types of proteins or genes (e.g., find the gene responsible for a disease, find highly active proteins)
- HIV and leukemia as applications
- Multiple models based on different feature spaces
- Many algorithms, including decision trees, kNN, and support vector machines (SVM)
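Of the algorithms listed, kNN is the shortest to illustrate. The sketch below assumes each protein is already described by a numeric feature vector. The two features, the labels "active" and "inactive", and the training points are invented for the example.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # train is a list of (feature_vector, label) pairs.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set.
train = [
    ((1.0, 1.0), "active"),
    ((1.2, 0.9), "active"),
    ((0.9, 1.1), "active"),
    ((5.0, 5.0), "inactive"),
    ((5.2, 4.8), "inactive"),
    ((4.9, 5.1), "inactive"),
]
prediction = knn_predict(train, (1.1, 1.0))
```

The choice of feature space decides what "nearest" means, which is why the slide mentions multiple models over different feature spaces.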
8 Protein folding
- Proteins fold into a native state
- 3D properties important for drug interactions (pharmacogenomics)
- Variations from the native state can be indicators of disease (Alzheimer's, mad cow disease)
- Folding depends on chemical interactions and thermodynamics
- Direct simulation impractical (1 day for 1 ns)
- Treat as an optimization problem with approximate solutions (minimize an energy function)
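To make "minimize an energy function" concrete, here is a toy sketch using the 2D HP lattice model, a standard textbook simplification that the slides do not name. Each residue is H (hydrophobic) or P (polar), the chain sits on a square grid, and the energy is -1 for every pair of H residues that are grid neighbors but not chain neighbors. The search is exhaustive, which only works for very short chains. Realistic methods replace it with approximate optimization.

```python
def hp_energy(seq, coords):
    """Energy of a lattice conformation: -1 per non-consecutive H-H contact."""
    index_at = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, (x, y) in enumerate(coords):
        if seq[i] != "H":
            continue
        # Look only right and up so each contact is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and seq[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

def best_fold(seq):
    """Enumerate all self-avoiding walks and return (min energy, coordinates)."""
    if not seq:
        return 0, []
    best = [None, None]  # [energy, coords]

    def extend(coords):
        if len(coords) == len(seq):
            e = hp_energy(seq, coords)
            if best[0] is None or e < best[0]:
                best[0], best[1] = e, list(coords)
            return
        x, y = coords[-1]
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (x + dx, y + dy)
            if nxt not in coords:  # keep the walk self-avoiding
                coords.append(nxt)
                extend(coords)
                coords.pop()

    extend([(0, 0)])
    return best[0], best[1]

energy, conformation = best_fold("HPPH")
```

The number of conformations grows exponentially with chain length, so even this toy model shows why approximate solutions are needed.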
9 Protein Folding: Prior Knowledge
- Knowledge from other proteins can help
- Find similar primary structure in other proteins with known folding
- Folding can be predicted (at least locally) and experimentally verified
10 Phylogenetics
- Recovery of the tree of life
- Important for molecular biology because
- we can predict missing sequences from known sequences in related species
- we can predict function from related genes/proteins in another species
11 Phylogenetics
- Goal: Classify species based on data
- Cladistics
- Characters are the differentiating features, which have different states (e.g., flower color)
- Construct matrix of features and automatically locate best discriminating feature
- Misleading evidence can exist (homoplasies) due to convergent evolution, e.g.
- wings in insects and birds
12 Phylogenetics: Other approaches
- Maximum likelihood models
- Estimate probability of change for each character and arrange species in maximum likelihood tree
- Molecular systematics
- Measure similarity between short sequences of DNA (haplotypes) and use clustering to create the tree
- Can also use mRNA
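The molecular-systematics recipe (measure sequence similarity, then cluster into a tree) can be sketched with Hamming distance and average-linkage (UPGMA) clustering. The choice of UPGMA is an assumption, since the slides only say "clustering", and the four 10-base haplotypes are invented, not data from any study.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def upgma(haplotypes):
    """Build a tree (nested tuples) by repeatedly merging the closest clusters."""
    # Each cluster maps a key to (subtree, list of member names).
    clusters = {name: (name, [name]) for name in haplotypes}

    def dist(m1, m2):
        # Average pairwise distance between the members of two clusters.
        total = sum(hamming(haplotypes[a], haplotypes[b]) for a in m1 for b in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        k1, k2 = min(
            combinations(sorted(clusters), 2),
            key=lambda pair: dist(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        t1, m1 = clusters.pop(k1)
        t2, m2 = clusters.pop(k2)
        clusters[k1 + "+" + k2] = ((t1, t2), m1 + m2)
    return next(iter(clusters.values()))[0]

# Hypothetical haplotypes, far shorter than real ones.
haplotypes = {
    "dog1": "ACGTACGTAC",
    "dog2": "ACGTACGTAT",
    "wolf": "ACGAACGTTT",
    "coyote": "TTGAACCTTG",
}
tree = upgma(haplotypes)
```

In this toy data the two dog haplotypes join first, then the wolf, and the coyote attaches last as the outgroup.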
13 Phylogeny of the dog (1997)
- Haplotype based on a sequence of 261 bp
- Compared 72 dog breeds and 27 geographically-based groups of wolves, as well as a control group of other canids (coyotes, jackals)
- 27 haplotypes for dogs, 26 for wolves
- Maximum differences: 10 (dog), 12 (wolf), 12 (dog-wolf)
- Minimum difference: 20 (dog vs. other canids)
- Evidence shows
- Dog evolved from wolf
- Four classes of dogs, some more homogeneous
- Evidence conflicts with current dog breed classification
14 Other important computational issues
- Data storage
- Efficient database access and search
- Text search and information retrieval
- Lossless compression
- Database interactions
- Representation and visualization
- Image processing
- Robotics
15 Curation versus discovery
- Much easier (and faster) to have an expert check system results than produce such results from scratch
- Automated discovery followed by curation
- increases thoroughness
- potentially removes bias (assuming system is not biased)
16 Curation
- Experimental results need to be verified by experts
- This is a large and time-consuming task
- How can it be facilitated?
- Interface issues: access to primary data, access to literature
- Concurrent verification
- Can it be modeled and automated?
- AI and statistical models
17 Knowledge modeling
- Models of biological processes and the steps in them (e.g., actions in regulatory networks)
- Needed to support automated processing of extracted data
- Distinct from data extraction
18 Examples of modeled knowledge
- Types of protein actions (bind, activate, phosphorylate, ...)
- Constraints on actions
- (Functional) classes of proteins
- Ontologies of concepts in the biological domain
- Much of this is derived via text mining