Data Mining in Bioinformatics - PowerPoint PPT Presentation


PPT – Data Mining in Bioinformatics PowerPoint presentation | free to download - id: 70ab7-YWFiN


The Adobe Flash plugin is needed to view this content

Get the plugin now

View by Category
About This Presentation

Data Mining in Bioinformatics


Data Mining in Bioinformatics – PowerPoint PPT presentation

Number of Views:999
Avg rating:3.0/5.0
Slides: 74
Provided by: lisag5


Write a Comment
User Comments (0)
Transcript and Presenter's Notes

Title: Data Mining in Bioinformatics

Data Mining in Bioinformatics
  • Introduction
  • Interdisciplinary Problem Statement
  • Microarray Problem Overview
  • Microarray Data Processing
  • Image Analysis and Data Mining
  • Prior Knowledge
  • Data Mining Methods
  • Database and Optimization Techniques
  • Visualization
  • Validation
  • Artificial Immune Systems
  • Summary

Introduction Recommended Literature
  • 1. Bioinformatics The Machine Learning Approach
    by P. Baldi S. Brunak, 2nd edition, The MIT
    Press, 2001
  • 2. Data Mining Concepts and Techniques by J.
    Han M. Kamber, Morgan Kaufmann Publishers, 2001
  • 3. Pattern Classification by R. Duda, P. Hart and
    D. Stork, 2nd edition, John Wiley Sons, 2001

Bioinformatics, Computational Biology, Data Mining
  • Bioinformatics is an interdisciplinary field
    about the information processing problems in
    computational biology and a unified treatment of
    the data mining methods for solving these
  • Computational Biology is about modeling real data
    and simulating unknown data of biological
    entities, e.g.
  • Genomes (viruses, bacteria, fungi, plants,
  • Proteins and Proteomes
  • Biological Sequences
  • Molecular Function and Structure
  • Data Mining is searching for knowledge in data
  • Knowledge mining from databases
  • Knowledge extraction
  • Data/pattern analysis
  • Data dredging
  • Knowledge Discovery in Databases (KDD)

Basic Terms in Biology
  • Example
  • The human body contains 100 trillion cells
  • Inside each cell is a nucleus
  • Inside the nucleus are two complete sets of the
    human genome (except in egg, sperm cells and
    blood cells)
  • Each set of genomes includes 30,000-80,000 genes
    on the same 23 chromosomes
  • Gene A functional hereditary unit that occupies
    a fixed location on a chromosome, has a specific
    influence on phenotype, and is capable of
  • Chromosome A DNA containing linear body of the
    cell nuclei responsible for determination and
    transmission of hereditary characteristics

Basic Terms in Data Mining
  • Data Mining A step in the knowledge discovery
    process consisting of particular algorithms
    (methods) that under some acceptable objective,
    produces a particular enumeration of patterns
    (models) over the data.
  • Knowledge Discovery Process The process of using
    data mining methods (algorithms) to extract
    (identify) what is deemed knowledge according to
    the specifications of measures and thresholds,
    using a database along with any necessary
    preprocessing or transformations.
  • A pattern is a conservative statement about a
    probability distribution.
  • Webster A pattern is (a) a natural or chance
    configuration, (b) a reliable sample of traits,
    acts, tendencies, or other observable
    characteristics of a person, group, or

Introduction Problems in Bioinformatics Domain
  • Problems in Bioinformatics Domain
  • Data production at the levels of molecules,
    cells, organs, organisms, populations
  • Integration of structure and function data, gene
    expression data, pathway data, phenotypic and
    clinical data,
  • Prediction of Molecular Function and Structure
  • Computational biology synthesis (simulations)
    and analysis (machine learning)


Microarray Problem Major Objective
  • Major Objective Discover a comprehensive theory
    of lifes organization at the molecular level
  • The major actors of molecular biology the
    nucleic acids, DeoxyriboNucleic acid (DNA) and
    RiboNucleic Acids (RNA)
  • The central dogma of molecular biology

Proteins are very complicated molecules with 20
different amino acids.
Input and Output of Microarray Data Analysis
  • Input Laser image scans (data) and underlying
    experiment hypotheses or experiment designs
    (prior knowledge)
  • Output
  • Conclusions about the input hypotheses or
    knowledge about statistical behavior of
  • The theory of biological systems learnt
    automatically from data (machine learning
  • Model fitting, Inference process

Overview of Microarray Problem
Biology Application Domain
Data Analysis
Microarray Experiment
Data Mining
Image Analysis
Experiment Design and Hypothesis
Data Warehouse
Artificial Intelligence (AI)
Knowledge discovery in databases (KDD)
Statistics Community
  • Random Variables
  • Statistical Measures
  • Probability and Probability Distribution
  • Confidence Interval Estimations
  • Test of Hypotheses
  • Goodness of Fit
  • Regression and Correlation Analysis

Artificial Intelligence (AI) Community
  • Issues
  • Prior knowledge (e.g., invariance)
  • Model deviation from true model
  • Sampling distributions
  • Computational complexity
  • Model complexity (overfitting)

Collect Data
Choose Features
Choose Model
Train Classifier
Evaluate Classifier
Design Cycle of Predictive Modeling
Knowledge Discovery in Databases (KDD) Community
Microarray Data Mining and Image Analysis Steps
  • Image Analysis
  • Normalization
  • Grid Alignment
  • Spot Quality Assurance Control
  • Feature construction (selection and extraction)
  • Data Mining
  • Prior knowledge
  • Statistics
  • Machine learning
  • Pattern recognition
  • Database techniques
  • Optimization techniques
  • Visualization
  • Validation
  • Issues
  • Cross validation techniques


Microarray Image Analysis

Why Data Mining ? Sequence Example
  • Biology Language and Goals
  • A gene can be defined as a region of DNA.
  • A genome is one haploid set of chromosomes with
    the genes they contain.
  • Perform competent comparison of gene sequences
    across species and account for inherently noisy
    biological sequences due to random variability
    amplified by evolution
  • Assumption if a gene has high similarity to
    another gene then they perform the same function
  • Analysis Language and Goals
  • Feature is an extractable attribute or
    measurement (e.g., gene expression, location)
  • Pattern recognition is trying to characterize
    data pattern (e.g., similar gene expressions,
    equidistant gene locations).
  • Data mining is about uncovering patterns,
    anomalies and statistically significant
    structures in data (e.g., find two similar gene
    expressions with confidence gt x)

Types of Expected Data Mining and Analysis Results
  • Hypothetical Examples
  • Binary answers using tests of hypotheses
  • Drug treatment is successful with a confidence
    level x.
  • Statistical behavior (probability distribution
  • A class of genes with functionality X follows
    Poisson distribution.
  • Expected events
  • As the amount of treatment will increase the gene
    expression level will decrease.
  • Relationships
  • Expression level of gene A is correlated with
    expression level of gene B under varying
    treatment conditions (gene A and B are part of
    the same pathway).
  • Decision trees
  • Classification of a new gene sequence by a
    domain expert.


Prior Knowledge Experiment Design
  • Microarray sources of systematic and random
  • Feature selection and variability
  • Expectations and Hypotheses
  • Data cleaning and transformations
  • Data mining method selection
  • Interpretation

Collect Data
Data Cleaning and Transformations
Choose Features
Prior Knowledge
Choose Model and Data Mining Method
Prior Knowledge from Experiment Design
  • Complexity Levels of Microarray Experiments
  • Compare single gene in a control situation versus
    a treatment situation
  • Example Is the level of expression (up-regulated
    or down-regulated) significantly different in the
    two situations? (drug design application)
  • Methods t-test, Bayesian approach
  • Find multiple genes that share common
  • Example Find related genes that are dependent?
  • Methods Clustering (hierarchical, k-means,
    self-organizing maps, neural network, support
    vector machines)
  • Infer the underlying gene and protein networks
    that are responsible for the patterns and
    functional pathways observed
  • Example What is the gene regulation at system
  • Directions mining regulatory regions, modeling
    regulatory networks on a global scale
  • Goal of Future Experiment Designs Understand
    biology at the system level, e.g., gene networks,
    protein networks, signaling networks, metabolic
    networks, immune system and neuronal networks.

Data Mining Techniques

Inductive Statistics
Descriptive Statistics
Make forecast and inferences
Describe data
Are two sample sets identically distributed ?
Statistical t-test
  • Gene Expression Level in Control and Treatment
  • Is the behavior of a single gene different in
    Control situation than in Treatment situation ?

Normalized distance
Normalized distance t follows a Student
distribution with f degrees of freedom.
If tgtthresh then the control and treatment data
populations are considered to be different.
  • m sample mean
  • s variance

  • AND

Machine Learning
Machine Learning
Natural groupings
Pattern Recognition
Pattern Recognition
k-nearest neighbors, support vectors
Locally Weighted Learning
Statistical Models
Linear Correlation and Regression
Decision Trees
Neural Networks
NN representation and genetic algorithm based
NN representation and gradient based optimization
Unsupervised Learning and Clustering
  • A cluster is a collection of data objects that
    are similar to one another within the same
    cluster and are dissimilar to the objects in
    other clusters.
  • Examples of data objects
  • gene expression levels, sets of co-regulated
    genes (pathways), protein structures
  • Categories of Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods

Natural groupings
Unsupervised Clustering Partitioning Methods
Example Centroid-Based Technique
  • K-means Algorithm partitions a set of n objects
    into k clusters so that the resulting
    intra-cluster similarity is high but the
    inter-cluster similarity is low.
  • Input number of desired cluster k
  • Output k labels assigned to n objects
  • Steps
  • Select k initial clusters centers
  • Compute similarity as a distance between an
    object and each cluster center
  • Assign a label to an object based on the minimum
  • Repeat for all objects
  • Re-compute the clusters centers as a mean of
    all objects assign to a given cluster
  • Repeat from Step 2 until objects do not change
    their labels.

Unsupervised Clustering Partitioning Methods
Example Representative-Based Technique
  • K-medoids Algorithm partitions a set of n objects
    into k clusters so that it minimizes the sum of
    the dissimilarities of all the objects to their
    nearest medoid.
  • Input number of desired cluster k
  • Output k labels assigned to n objects
  • Steps
  • Select k initial objects as the initial medoids
  • Compute similarity as a distance between an
    object and each cluster medoid
  • Assign a label to an object based on the minimum
  • Repeat for all objects
  • Randomly select a non-medoid object and swap with
    the current medoid it would decrease
    intra-cluster square error
  • Repeat from Step 2 until objects do not change
    their labels.

Unsupervised Clustering Hierarchical Clustering
  • Hierarchical Clustering partitions a set of n
    objects into a tree of clusters
  • Types of Hierarchical Clustering
  • Agglomerative hierarchical clustering
  • Bottom-up strategy of building clusters
  • Divisive hierarchical clustering
  • Top-down strategy of building clusters

Unsupervised Agglomerative Hierarchical Clustering
  • Agglomerative Hierarchical Clustering partitions
    a set of n objects into a tree of clusters with a
    bottom-up strategy.
  • Steps
  • Assign a unique label to each data object and
    form n clusters
  • Find nearest clusters and merge them
  • Repeat Step 2 till the number of desired clusters
    is equal to the number of merged clusters.
  • Types of Agglomerative Hierarchical Clustering
  • The nearest neighbor algorithms (minimum or
    single-linkage algorithm, minimal spanning tree)
  • The farthest neighbor algorithms (maximum or
    complete-linkage algorithm)

Unsupervised Clustering Density-Based Clustering
  • Density-Based Spatial Clustering with Noise
    aggregates objects into clusters if the objects
    are density connected.
  • Density connected objects
  • Simplified explanation P and Q are density
    connected if there is an object O such that both
    P and Q are density connected to O.
  • Aggregate P and Q if they are density connected
    with respect to R-radius neighborhood and Minimum
    Object criteria

Supervised Learning or Classification
  • Classification is a two-step process consisting
    of learning classification rules followed by
    assignment of classification label.

Supervised Learning Decision Tree
  • Decision tree algorithm constructs a tree
    structure in a top-down recursive
    divide-and-conquer manner

Age lt 25 ?
Risk High
Sports car ?
Risk High
Risk Low
Car Insurance Risk Assessment
Visualization of Decision Boundaries
Supervised Learning Bayesian Classification
  • Bayesian Classification is based on Bayes theorem
    and it can predict class membership
  • Bayes Theorem (X-data sample, H-hypothesis of
    data label)
  • P(H/X) posterior probability
  • P(H) prior probability
  • Classification-maximum posteriori hypothesis

Statistical Models Linear Discriminant
  • Linear Discriminant Functions form boundaries
    between data classes.
  • Finding Linear Discriminant Functions is achieved
    by minimizing a criterion error function.

Linear discriminant function
Quadratic discriminant function
Finding w coefficients -Gradient Descent
Procedures -Newtons algorithm
Artificial Neural Networks
  • Artificial Neural Network (ANN) is a
    computational analogue of neurons.
  • Artificial neural network is a set of connected
    input/output units where each connection has a
    weight associated with it.
  • Phase I learning adjust weights such that the
    network predicts accurately class labels of the
    input samples
  • Phase II classification- assign labels by
    passing an unknown sample through the network

Network topology or Structure
Artificial Neural Networks (cont.)
  • Steps
  • Initial weights from -1,1
  • Propagate the inputs forward
  • Backpropagate the error
  • Terminate learning (training) if (a) delta w lt
    thresh or (b) percentage of misclassified samples
    lt thresh or (c) max number of iterations has been
  • Pros Cons of ANN Good performance with noisy
    data, rule extraction long training, poor
    interpretability, trial-and-error network design

Unit or node j
Support Vector Machines (SVM)
  • SVM algorithm finds a separating hyperplane with
    the largest margin and uses it for classification
    of new samples

  • AND

Data Types and Databases
  • Relational Databases
  • Data Warehouses
  • Transactional Databases
  • Advanced Database Systems
  • Object-Relational
  • Spatial and Temporal
  • Time-Series
  • Multimedia
  • Text
  • Heterogeneous, Legacy, and Distributed
  • WWW

Structure - 3D Anatomy
Function 1D Signal
Metadata Annotation
Database Techniques
  • Database Design and Modeling (tables, procedures,
    functions, constraints)
  • Database Interface to Data Mining System
  • Efficient Import and Export of Data
  • Database Data Visualization
  • Database Clustering for Access Efficiency
  • Database Performance Tuning (memory usage, query
  • Database Parallel Processing (multiple servers
    and CPUs)
  • Distributed Information Repositories (data

Search and Optimization Techniques Search Types
  • Types of search methods
  • Calculus-based
  • Indirect (solve a nonlinear set of equations)
  • Direct (follow local gradient - hill climbing)
  • Enumerative (search objective function values at
    every point dynamic programming)
  • Random (search with random sampling)
  • Randomized search methods guide the search with
    random processes simulated annealing, genetic

Search and Optimization Techniques Challenges
  • Search and optimization challenges
  • Global versus local maxima
  • Existence of derivatives (calculus-based)
  • High dimensionality
  • Highly nonlinear search space (global versus
    local maxima)
  • Large search space
  • Example A genome with N genes can encode 2N
    states (active or inactive states, regulated is
    not considered). Human genome 230,000
    Nematode genome 220,000 patterns.

Genetic Algorithm
  • Genetic Algorithm (GA) based optimization is a
    computational analogue of Darwins evolution
    theory (survival of the fittest).
  • Description of GA based optimization
  • Uses coding of the parameter set (not the
    parameters themselves)
  • Searches from a population of points (not a
    single point)
  • Uses an objective function (not derivatives or
    other auxiliary knowledge)
  • Employs probability transition rules (not
    deterministic rules)
  • Is composed of three operators
  • Reproduction (or selection)
  • Crossover
  • Mutation
  • Reference D. Goldberg Genetic Algorithms in
    Search, Optimization Machine Learning,Addison-We
    sley Publishing Co., 1989.

Genetic Algorithm Additional Operators
  • Additional operators
  • Niching for optimization of multimodal and
    multiobjective functions
  • Fitness sharing the number of individuals
    residing near any peak will be proportional to
    the height of that peak (reduce individual
    fitness according to their similarity)
  • Crowding spread individuals among the most
    prominent peaks and do not allocate individuals
    proportionally to fitness (maintain diversity)
  • Speciation for optimization of multimodal
  • Mating restriction scheme (restrict mating or
    crossover according to the similarity among

Genetic Algorithm Example
Objective Function
(on,off,on,off) input sequence is converted to a
string (1010)
  • Steps
  • Randomly generate initial population of size n2
    e.g., strings 0110 1100
  • Reproduction is a process of copying strings
    according to their objective function a
    roulette wheel
  • Crossover proceeds in two steps (1) random mating
    of strings and (2) selecting random positions of
    each string for mating e.g., obtain 1110 0100
  • Mutation is the occasional random alteration of
    the value of a string position to protect
    premature loss of information obtain 0110 0100


  • Data 3D cubes,distribution charts, curves,
    surfaces, link graphs, image frames and movies,
    parallel coordinates
  • Results pie charts, scatter plots, box plots,
    association rules, parallel coordinates,
    dendograms, temporal evolution

Parallel coordinates
Pie chart
Temporal evolution
Novel Visualization of Features
Feature Selection and Visualization
Feature Selection
Mean Feature Image
Novel Visualization of Clustering Results
Class Labeling and Visualization
Isodata (K-means) Clustering
Mean Feature Image
Label Image

Why Validation?
  • Validation type
  • Within the existing data
  • With newly collected data
  • Errors and uncertainties
  • Systematic or random errors
  • Unknown variables - number of classes
  • Noise level - statistical confidence due to
  • Model validity error measure, model over-fit or
  • Number of data points - measurement replicas
  • Other issues
  • Experimental support of general theories
  • Exhaustive sampling is not permissive

Error Detection Example of Spot Screening
Mask Image Location and Size Screening
Mask Image No Screening
Mask Image SNR Screening
Cross Validation Example
  • One-tier cross validation
  • Train on different data than test data
  • Two-tier cross validation
  • The score from one-tier cross validation is used
    by the bias optimizer to select the best learning
    algorithm parameters ( of control points) . The
    more you optimize the more you over-fit. The
    second tier is to measure the level of over-fit
    (unbiased measure of accuracy).
  • Useful for comparing learning algorithms with
    control parameters that are optimized.
  • Number of folds is not optimized.
  • Computational complexity
  • folds of top tier X folds of bottom tier X
    control points X CPU of algorithm


Artificial Immune Systems
  • Artificial Immune Systems (AIS) are adaptive
    systems, inspired by theoretical immunology and
    observed immune functions, principles and models,
    which are applied to problem solving.
  • Other types of AIS are hybrids of ANN, GA and
    fuzzy systems combined with theoretical
    immunology models
  • Applications of AIS
  • Pattern recognition (surveillance of infectious
  • Fault and anomaly detection ((image inspection
    and segmentation)
  • Data analysis (reinforced, unsupervised learning)
  • Agent-based systems
  • Scheduling (adaptive scheduling)
  • Autonomous navigation and control (walking
  • Search and optimization methods (constrained,
    time-dependent optimization)
  • Security of information systems (virus detection,
    network intrusion)

Basic Terms Used in Artificial Immune Systems
  • Immune system is understood as a complex set of
    cells and molecules that protect our bodies
    against infection under constant attack by
    antigens (foreign or self-antigens)
  • Immune system consists of two-tier line of
    defense adaptive (lymphocytes B-cells
    T-cells) and innate (granulocytes macrophages)
    immune systems. Both systems depend upon the
    activity of white blood cells (leukocytes).

The organs that make up the immune system
(lymphoid organs) are thymus bone marrow
(primary) and tonsils,adenoids, spleen, appendix,
lymph nodes, lymphatic vessels, peyers patches
Mechanisms Adapted in Artificial Immune Systems
  • Pattern recognition lymphocytes (B-cells
    T-cells) carry surface receptors capable of
    recognizing antigens
  • Example recognition via complementary regions
  • The clonal selection principle only cells
    capable of recognizing an antigen stimulus will
    proliferate and differentiate into effector cells
  • Immune learning and memory reinforced
  • learning strategy
  • Self/Nonself discrimination distinguish between
    molecules of its own cell (self) and foreign
    molecules (nonself)- positive and negative
    selection, clonal expansion and ignorance

Why Artificial Immune System?
  • Pattern recognition cells and molecules of the
    immune system have several ways of recognizing
  • Uniqueness each individual possesses its own
    immune system
  • Self identity other than native elements to
    the body can be recognized and eliminated by the
    immune system
  • Diversity there exist varying types of elements
    that together protect the body
  • Disposability no single native element is
    essential for the functioning of the immune
  • Autonomy there is no central element controlling
    the immune system
  • Multi-layered multiple layers of different
    mechanisms provide overall security
  • No secure layer any cell of the organism can be
    attacked by the IS
  • Anomaly detection the IS can recognize and react
    to pathogens that the body has never encountered
  • Dynamically changing coverage the IS maintains a
    circulating repertoire of lymphocytes constantly
    being changed through cell death, production and

Why Artificial Immune System? (cont.)
  • Distributivity the immune elements are
    distributed all over the body
  • Noise tolerance an absolute recognition of
    pathogens is not required (tolerance to molecular
  • Resilience the IS is capable of functioning
    despite disturbances
  • Fault tolerance the complementary roles of
    several immune components allow the re-allocation
    of tasks to other elements
  • Robustness diversity number of immune elements
  • Immune learning and memory the molecules of the
    IS can adapt to themselves, structurally and in
    number, to the antigenic challenges
  • Predator-prey pattern of responsepathogens goes
    up gtimmune cells goes up
  • Self-organization clonal selection and affinity
    maturation are responsible for selecting the most
    adapted cells to be maintained as long living
    memory cells
  • Integration with other systems the IS
    communicates with parts of the body

General Framework for Artificial Immune Systems
  • General Framework for AIS
  • A representation for the components of the system
  • A set of mechanisms to evaluate the interaction
    of individuals with the environment and each
    other (input stimuli, 1 to N fitness functions or
    other means) Affinity measures
  • Procedures of adaptation that govern the dynamics
    of the system (e.g., behavior over time) -

ReferenceL. N. de Castro and J. Timmis,
Artificial Immune Systems A New Computational
Intelligence Approach,Springer 2002.
Components of Artificial Immune Systems
  • Representation
  • Generalized shape of any molecule in shape space
    is described by an attribute string (set of
    coordinates) of length L.
  • Shape-space describes interactions between
    molecules of the immune system and antigens.
  • Immune system is represented as a pattern
    (molecular) recognition system that is designed
    to identify shapes.
  • Affinity Measures
  • Euclidean, Manhattan and Hamming
  • Real-valued, integer, symbolic or alphabet
    sub-string spaces

Components of Artificial Immune Systems
  • Immune system algorithms
  • Bone marrow model generate repertoire of cells
    and molecules (generate random attribute strings)
  • Thymus model generate repertoire of cells and
    molecules capable of performing self/non-self
    discrimination (Positive selection initialize
    strings, evaluate affinity and keep strings with
    affinity lt threshold Negative selection
    eliminate strings gt threshold)
  • Clonal selection algorithms modeling interaction
    control of the IS and external environment or
    antigens (similar to GA without crossover and
    with affinity proportional to reproduction and
  • Immune network models simulate immune networks
    (differential equations describing the dynamics)

Examples of Artificial Immune Systems

Summary Interdisciplinary Science
  • CS and ECE have been used to gain a better
    understanding of biological processes through
    modeling and simulation
  • CS and ECE have been enriched with the
    introduction of biological ideas, e.g., ANN, GA,
    cellular automata, artificial life, artificial
    immune systems (AIS)
  • New fields bio-informatics, bio-medical
  • Bilateral interactions between CS, ECE and
  • Biologically motivated computing (ANN, GA,
    artificial immune systems)
  • Computationally motivated biology (cellular
  • Computing with biological mechanisms
    (silicon-based computing gt quantum and DNA

Summary Bioinformatics
  • Bioinformatics and Microarray problem
  • Interdisciplinary Challenges Terminology
  • Understanding Biology and Computer Science
  • Data mining and image analysis steps
  • Image Analysis
  • Experiment Design as Prior Knowledge
  • Expected Results of Data Mining
  • Which Data Mining Technique to Use?
  • Data Mining Challenges Complexity, Data Size,
    Search Space
  • Validation
  • Confidence in Obtained Results?
  • Error Screening
  • Cross validation techniques
  • Artificial Systems
  • Biologically motivated computing