Data Mining For Bioinformatics: Tools and Applications

1
Data Mining For Bioinformatics: Tools and Applications
  • Craig A. Struble
  • Department of Mathematics, Statistics, and
    Computer Science
  • craig.struble_at_marquette.edu

2
Overview
  • Clustering
  • Hierarchical Clustering, SOMs, Model-based
    clustering
  • Classification
  • SVMs, neural networks
  • Tool building

3
Clustering
  • Basic idea
  • Group similar things together
  • d(x,y): distance function between x and y
  • Euclidean, mismatches, etc.
  • Bioinformatics context
  • Similar expression profiles imply similar
    function
  • This is under some scrutiny
  • Unsupervised
  • Useful when no other information is available
  • Just to see what happens
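
The two distance functions named on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation; the function names and sample values are made up for the example.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mismatches(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(s, t) if a != b)

profile_a = [0.0, 3.0, 1.0]
profile_b = [4.0, 0.0, 1.0]
print(euclidean(profile_a, profile_b))   # 5.0
print(mismatches("GATTACA", "GACTATA"))  # 2
```

Euclidean distance suits numeric expression profiles, while a mismatch count suits aligned sequences of equal length.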

4
Example Data
  • Genes x Experiments
  • 6000 genes x 16 experiments
  • Could use ratios or other values for data
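
One possible in-memory layout for a genes x experiments matrix is sketched below. It assumes the "ratios" are log2(sample / reference) intensities, which the slide does not specify; the gene names and intensities are hypothetical.

```python
import math

def log2_ratios(sample, reference):
    """Convert paired intensities into log2(sample / reference) values."""
    return [math.log2(s / r) for s, r in zip(sample, reference)]

# Hypothetical intensities for two genes across three experiments.
sample = {"geneA": [200.0, 400.0, 100.0], "geneB": [50.0, 100.0, 100.0]}
reference = {"geneA": [100.0, 100.0, 100.0], "geneB": [100.0, 100.0, 100.0]}

# Rows are genes, columns are experiments.
matrix = {gene: log2_ratios(sample[gene], reference[gene]) for gene in sample}
print(matrix["geneA"])  # [1.0, 2.0, 0.0]
print(matrix["geneB"])  # [-1.0, 0.0, 0.0]
```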

5
Hierarchical Clustering
  • Bottom up (agglomerative)
  • Top down (divisive)
  • Linkage
  • How groups are combined (or split)
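
A minimal sketch of bottom-up (agglomerative) clustering, showing where the linkage rule enters. It is a naive implementation for illustration only, assuming Euclidean distance; the function names and sample points are not from the presentation.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linkage_distance(c1, c2, points, linkage):
    """Distance between two clusters, each a list of point indices."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of members
        return min(dists)
    if linkage == "complete":    # farthest pair of members
        return max(dists)
    if linkage == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, n_clusters, linkage="average"):
    """Start with singletons and merge the two closest clusters repeatedly."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], points, linkage)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerative(points, 3))  # [[0, 1], [2, 3], [4]]
```

Recording the order of merges, rather than stopping at a fixed number of clusters, would give the tree usually drawn as a dendrogram.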

6
Hierarchical Clustering (Example)
  • Analyzing yeast data (different experiment)

7
Self Organizing Maps
  • Also called Kohonen maps
  • Example of neural networks

8
Self Organizing Maps
  • Yeast data set
  • Typical values
  • Error bars
  • Using GeneSOM in R
  • Many other visualizations

9
Clustering With Models
  • Create/select representative points (i.e. models)
  • Perform cluster analysis (K-means, K-medoids,
    etc.)
  • Classify/identify real data items by finding
    which representative they cluster with
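
The three steps above can be sketched with K-means, where the representatives are cluster means. This is a minimal version that takes the initial representatives as an argument; the names and sample points are invented for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(centroids, x):
    """Index of the representative closest to x."""
    return min(range(len(centroids)), key=lambda j: euclidean(centroids[j], x))

def kmeans(points, initial_centroids, max_iter=100):
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(centroids, p) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old representative
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = kmeans(points, initial_centroids=[points[0], points[2]])
print(centroids)                       # [[1.0, 1.5], [8.5, 8.0]]
print(labels)                          # [0, 0, 1, 1]
print(nearest(centroids, (7.0, 9.0)))  # 1
```

The last line shows the final step: a new item is identified by the representative it falls closest to.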

10
Example: Yeast Sporulation
  • Chu et al., Science 282

11
Example: Yeast Sporulation

12
After Clustering
  • Try multiple sequence alignment of genes closely
    clustered together
  • Include upstream/downstream sequence to look for
    promoter regions, etc.
  • Search in KEGG for metabolic pathways genes may
    be involved with
  • Look at functional classification of genes in the
    same cluster
  • May be able to assign putative function to genes
    with unknown function (again, this is under
    scrutiny)

13
Classification
  • Use data to create a classifier
  • A predictive model for labeling new data items
  • Supervised learning
  • Generate data associated with known labels
  • Train the specific technique with labeled data

14
Support Vector Machines
  • Find a hyperplane to separate data points
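
A minimal sketch of finding a separating hyperplane w·x + b = 0 for a linear SVM. It assumes plain full-batch subgradient descent on the regularized hinge loss, a simple stand-in for the quadratic-programming solvers used by real SVM packages; the parameter values and toy data are arbitrary.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on lam/2 * |w|^2 + mean hinge loss; labels are +1/-1."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wk for wk in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b)
            if margin < 1:  # point is inside the margin or misclassified
                for k in range(dim):
                    grad_w[k] -= yi * xi[k] / n
                grad_b -= yi / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Report which side of the hyperplane x falls on."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else -1

X = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

The hinge loss penalizes only points that fall inside the margin, which is what pushes the hyperplane away from both classes.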

15
Feature Selection
  • Identify subset of attributes as most important
  • Identify group of genes that play most important
    role in distinguishing classes
  • Use information gain or other statistical
    measures to determine the importance of an
    attribute
  • In many cases, feature selection is the true goal
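
Information gain for one attribute can be sketched as below. The example assumes expression values have already been discretized into levels such as "high" and "low"; the gene names and labels are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

labels = ["tumor", "tumor", "normal", "normal"]
gene_a = ["high", "high", "low", "low"]  # separates the classes perfectly
gene_b = ["high", "low", "high", "low"]  # carries no class information
print(information_gain(gene_a, labels))  # 1.0
print(information_gain(gene_b, labels))  # 0.0
```

Ranking genes by this score and keeping the top few is one simple way to pick the subset of attributes that best distinguishes the classes.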

16
Example: Leukemia Classification
  • Ovarian Cancer (Furey et al., 2000)
  • Also tested on Leukemia data (Golub et al., 1999)
  • 97,802 DNA clones used
  • 31 tissue samples
  • Cancerous and normal ovarian tissue
  • Non-ovarian tissue

17
Example (cont.)
  • Feature selection
  • High scoring genes differ most on average and
    have small deviations in value
  • 50 relevant clones identified
  • Leave one out testing
  • 80% accuracy
  • Really testing if selected features are good
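
A sketch of the scoring and testing idea on this slide. The score |mean+ - mean-| / (sd+ + sd-) is an assumption chosen to match "differ most on average and have small deviations"; the slide does not give the formula. The nearest-mean classifier is a simple stand-in, not the SVM used in the paper, and the data are invented.

```python
from statistics import mean, stdev

def signal_to_noise(pos_values, neg_values):
    """Large when the class means differ and the within-class spread is small."""
    spread = stdev(pos_values) + stdev(neg_values)
    if spread == 0:
        return float("inf") if mean(pos_values) != mean(neg_values) else 0.0
    return abs(mean(pos_values) - mean(neg_values)) / spread

def rank_features(samples, labels, top_k):
    """Indices of the top_k highest-scoring features (columns)."""
    scores = []
    for j in range(len(samples[0])):
        pos = [s[j] for s, l in zip(samples, labels) if l == 1]
        neg = [s[j] for s, l in zip(samples, labels) if l == -1]
        scores.append((signal_to_noise(pos, neg), j))
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:top_k]]

def train_nearest_mean(train_x, train_y, top_k=1):
    # Features are selected inside each fold, using training data only.
    features = rank_features(train_x, train_y, top_k)
    centroids = {}
    for label in (1, -1):
        rows = [x for x, l in zip(train_x, train_y) if l == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return features, centroids

def predict_nearest_mean(model, x):
    features, centroids = model
    def dist(label):
        return sum((x[j] - c) ** 2 for j, c in zip(features, centroids[label]))
    return min(centroids, key=dist)

def leave_one_out_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and test on the held-out one."""
    correct = 0
    for i in range(len(samples)):
        model = train(samples[:i] + samples[i + 1:], labels[:i] + labels[i + 1:])
        correct += int(predict(model, samples[i]) == labels[i])
    return correct / len(samples)

samples = [[5.0, 1.0, 3.0], [5.2, 3.0, 1.0], [4.8, 2.0, 2.0],
           [1.0, 1.5, 2.5], [1.2, 2.5, 1.5], [0.8, 2.0, 2.0]]
labels = [1, 1, 1, -1, -1, -1]
print(rank_features(samples, labels, top_k=1))  # [0]
print(leave_one_out_accuracy(samples, labels,
                             train_nearest_mean, predict_nearest_mean))  # 1.0
```

Selecting features inside each fold keeps the held-out sample from influencing which features are chosen.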

18
Neural Networks
  • Network of connected neurons
  • Trained with data with known output values
  • Errors propagated through network for learning
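
How errors are propagated back through a network can be sketched with a tiny 2-2-1 network. The sigmoid activations, squared-error loss, starting weights, and learning rate are all assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(params, x):
    """Return the hidden activations and the network output."""
    W1, b1, W2, b2 = params
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output

def loss(params, x, target):
    _, output = forward(params, x)
    return 0.5 * (output - target) ** 2

def backprop(params, x, target):
    """Gradients of the squared error with respect to every weight and bias."""
    W1, b1, W2, b2 = params
    hidden, output = forward(params, x)
    # Error signal at the output neuron.
    delta_out = (output - target) * output * (1.0 - output)
    grad_W2 = [delta_out * h for h in hidden]
    # Propagate the error back through the output weights to the hidden layer.
    delta_hidden = [delta_out * w * h * (1.0 - h) for w, h in zip(W2, hidden)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    return grad_W1, delta_hidden, grad_W2, delta_out

def sgd_step(params, grads, lr):
    W1, b1, W2, b2 = params
    gW1, gb1, gW2, gb2 = grads
    return (
        [[w - lr * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)],
        [b - lr * g for b, g in zip(b1, gb1)],
        [w - lr * g for w, g in zip(W2, gW2)],
        b2 - lr * gb2,
    )

params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, target = [1.0, 0.5], 1.0
before = loss(params, x, target)
for _ in range(100):
    params = sgd_step(params, backprop(params, x, target), lr=0.5)
after = loss(params, x, target)
print(before > after)  # True
```

Training on a real data set repeats the same step over many labeled examples instead of a single one.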

[Figure: network diagram with input, hidden, and output layers]

19
Example: Cleavage Site Prediction (Nielsen et al., 1997)
  • Predict where cleavage sites are in protein
    precursors
  • Training data from SWISS-PROT database

20
Example (cont.)
  • Input layer
  • Groups of 20 neurons per location
  • 20 amino acids
  • Sliding window of 5-39 amino acids
  • Hidden layer
  • 0-10 neurons tried
  • Output layer
  • 2 neurons per location, P(c),P(s)
  • Trained with backpropagation
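
The input encoding described above, 20 neurons per window position, can be sketched as a one-hot encoding. The symmetric odd-sized window, the residue ordering, and the example peptide are assumptions for illustration and may differ from the published method.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def encode_window(window):
    """One-hot encode a window: 20 inputs per position, exactly one set to 1."""
    vector = []
    for residue in window:
        block = [0] * len(AMINO_ACIDS)
        block[AMINO_ACIDS.index(residue)] = 1  # raises ValueError if unknown
        vector.extend(block)
    return vector

def sliding_windows(sequence, size):
    """Yield (center_position, window) for every full odd-sized window."""
    half = size // 2
    for center in range(half, len(sequence) - half):
        yield center, sequence[center - half:center + half + 1]

sequence = "MKVLAAGIVG"  # hypothetical peptide fragment
for center, window in sliding_windows(sequence, size=5):
    inputs = encode_window(window)
    print(center, window, len(inputs))  # each window gives 5 * 20 = 100 inputs
```

The network would then be trained to output P(c) and P(s) for the position at the center of each window.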

[Figure: network diagram; the surviving labels are the outputs P(c) and P(s)]

21
Tool Building
  • Many commercial packages have lots of tools
  • Not always integrated
  • May not provide enough flexibility
  • Sometimes you've just gotta do it yourself

22
Tool Building
  • Identify problem to work on
  • E.g. predict where miRNAs are on the genome
  • Determine where to get data
  • E.g. NCBI, KEGG, literature
  • Determine the final format of your data
  • E.g. Oracle, PostgreSQL, CSV, XML, etc.
  • Select data mining techniques to use
  • Literature and experience

23
Tool Building
  • Select user interface style
  • Web-based vs. applet vs. application
  • Upload data/download data?
  • Select visualizations
  • Decide how to present the data
  • Communicate with your users

24
Tool Building
  • Select software to use
  • Does it contain the data mining techniques to
    use?
  • Is the source code available?
  • Library vs. application vs. interpreter
  • What do you know and what are you willing to
    learn?
  • Eventually, you'll build up a collection of tools
    to build on

25
Typical architecture for small project
[Figure: architecture diagram; box labels: Perl Scripts (twice), Clustering, Neural Network, Output]

26
Where to Find Examples
  • Search Google (http://www.google.com)
  • E.g. "self organizing maps bioinformatics"
  • Citeseer (http://citeseer.nj.nec.com/)
  • PubMed (http://www.ncbi.nih.gov)
  • Web pages and papers usually contain links to
    software used

27
Be Prepared
  • Programming for bioinformatics/data mining often
    requires knowing many languages (Perl, C, Java
    at a minimum)
  • Practice on the supplied sample data sets, if any
  • Read, read, read!

28
Useful free tools
  • R (http://www.r-project.org)
  • Statistics, plotting, clustering, etc.
  • Octave (http://www.octave.org)
  • Matlab clone (Matlab itself is EXTREMELY useful
    and popular)
  • PostgreSQL (http://www.postgresql.org)
  • Powerful ORDBMS (when you can't afford Oracle)
  • http://bistro.mscs.mu.edu/cstruble/biolinks.html