Data Mining for Bioinformatics

1
Data Mining for Bioinformatics
  • Craig A. Struble, Ph.D.
  • Marquette University
  • craig.struble@marquette.edu

2
Overview
  • Survey of KDD for Bioinformatics
  • KDD overview
  • Bioinformatics data
  • Survey of KDD steps
  • Case Study: miRNA Project
  • Identifying the problem
  • Data collection with Perl
  • Selection/cleansing
  • Future work
  • Next Time

3
Knowledge Discovery in Databases
[Figure: the KDD process. Data → cleaning and integration → data warehouse → selection and transformation → prepared data → data mining → patterns → evaluation and visualization → knowledge, stored in a knowledge base.]
4
Bioinformatics Data
  • DNA Sequences
  • Genes
  • Location, introns, exons, function, etc.
  • Gene products
  • RNA, Proteins
  • Pathways
  • Signaling, metabolic, genomic, etc.

5
Bioinformatics Data
  • Experimental
  • Gene expression, knockouts, etc.
  • Literature
  • Diseases, viruses, bacteria
  • Organisms
  • Textbooks
  • Expert knowledge
  • Unpublished
  • Insights
  • Etc.

6
KDD for Bioinformatics
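[Figure: the KDD process adapted to bioinformatics.
Data sources: genomic, literature, experimental, expert knowledge.
Cleaning and integration: normalization, curation, validation, etc.
Selection: sampling, expressed genes, homologs, etc.
Data mining: clustering, SVMs, ILP, classification, etc.
Stages: data, data warehouse, prepared data, patterns, knowledge, with evaluation and visualization.
Annotation: "Often not explicitly implemented".]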
7
Data Collection and Cleansing
  • Perl scripts (BioPerl)
  • From literature
  • Read a paper and enter the information
  • Supplemental data for papers
  • Public databases
  • GenBank
  • Stanford Microarray Database
  • SWISS-Prot
  • Etc.

8
Data Cleansing
  • Remove invalid, redundant, or otherwise useless
    data
  • Extrapolate missing data values
  • Data formatting/transformation
  • Binning, normalization, scaling, etc.
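
The cleansing steps listed above can be sketched in a few lines of core Perl. This is a minimal illustration, not code from the miRNA project: the function names (impute_mean, min_max_scale, equal_width_bin) and the expression values are invented for the example, and simple mean imputation stands in for whatever estimate of missing values suits the data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Replace undefined (missing) values with the mean of the observed values.
# Mean imputation is an assumption of this sketch, not the deck's method.
sub impute_mean {
    my @values   = @_;
    my @observed = grep { defined } @values;
    return @values unless @observed;    # nothing to estimate from
    my $mean = sum(@observed) / @observed;
    return map { defined $_ ? $_ : $mean } @values;
}

# Rescale values linearly onto the interval [0, 1].
sub min_max_scale {
    my @values = @_;
    my ($lo, $hi) = (min(@values), max(@values));
    return (0) x @values if $hi == $lo;    # constant column
    return map { ($_ - $lo) / ($hi - $lo) } @values;
}

# Assign each value to one of $bins equal-width bins, numbered 0 .. $bins-1.
sub equal_width_bin {
    my ($bins, @values) = @_;
    my @unit = min_max_scale(@values);
    return map {
        my $idx = int($_ * $bins);
        $idx == $bins ? $bins - 1 : $idx;    # the maximum falls in the last bin
    } @unit;
}

my @expression = (120, undef, 480, 300, 1035);    # made-up measurements
my @filled = impute_mean(@expression);
my @scaled = min_max_scale(@filled);
my @binned = equal_width_bin(3, @filled);

print join(", ", map { sprintf "%.3f", $_ } @scaled), "\n";
print join(", ", @binned), "\n";
```

Scaling before binning keeps the bin boundaries independent of the measurement units, which matters when values from different experiments are combined.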

9
Data Selection
  • Database queries for specific genes, organisms,
    sequences, etc.
  • Statistical analysis (microarray)
  • Random sampling
  • Etc.
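
Random sampling, one of the selection methods above, needs only the core List::Util module. The sketch below draws a sample without replacement; the function name random_sample and the gene identifiers are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Draw $n items uniformly at random, without replacement.
sub random_sample {
    my ($n, @items) = @_;
    $n = @items if $n > @items;    # cannot draw more than we have
    return () if $n <= 0;
    my @shuffled = shuffle(@items);
    return @shuffled[0 .. $n - 1];
}

my @genes  = map "gene_$_", 1 .. 20;    # hypothetical identifiers
my @subset = random_sample(5, @genes);
print join(" ", @subset), "\n";
```

Calling srand with a fixed seed before sampling makes the selection repeatable, which helps when a result has to be reproduced later.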

10
Data Mining Techniques
  • Statistical
  • Principal Component Analysis
  • ANOVA
  • Outlier analysis
  • Discrimination
  • Some clustering techniques (K-Means)

11
Data Mining Techniques
  • Machine Learning
  • Neural Networks
  • Support Vector Machines
  • Decision Trees
  • Inductive Logic Programming
  • Fuzzy Logic
  • Rough Sets
  • Bayesian Belief Networks

12
Data Mining Techniques
  • More Techniques
  • Clustering
  • Self Organizing Maps
  • Hidden Markov Models
  • Maximum Likelihood Estimators
  • Association Rules

13
Kinds of Techniques
  • Unsupervised
  • Technique makes no assumption about a priori
    knowledge
  • Useful when not much is known
  • Supervised
  • Attach class labels to data items
  • Identify (or learn about) properties that
    distinguish classes

14
Kinds of Techniques
  • Unsupervised
  • Clustering
  • SOMs
  • Supervised
  • Support Vector Machines
  • Neural Networks
  • Bayesian Belief Networks
  • HMMs

15
Kinds of Techniques
  • Supervised techniques require training
  • Data split into training and test sets
  • Many kinds of validation
  • N-way cross validation
  • Leave one out testing
  • Etc
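
As a rough sketch of N-way cross validation, the Perl below splits item indices into folds and pairs each held-out fold with the remaining training indices. The function names (k_folds, train_test_splits) are hypothetical, and no particular learner is assumed. Setting the number of folds equal to the number of items gives leave-one-out testing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Partition the indices 0 .. $n-1 into $k folds of near-equal size.
sub k_folds {
    my ($n, $k) = @_;
    die "need 2 <= k <= n\n" unless $k >= 2 && $k <= $n;
    my @order = shuffle(0 .. $n - 1);
    my @folds;
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    push @{ $folds[$_ % $k] }, $order[$_] for 0 .. $#order;
    return @folds;
}

# For each fold, pair the held-out test indices with the remaining training indices.
sub train_test_splits {
    my @folds = @_;
    my @splits;
    for my $i (0 .. $#folds) {
        my @train = map { @{ $folds[$_] } } grep { $_ != $i } 0 .. $#folds;
        push @splits, { test => [ @{ $folds[$i] } ], train => \@train };
    }
    return @splits;
}

my @splits = train_test_splits(k_folds(10, 5));
for my $s (@splits) {
    printf "test: %-8s train: %s\n", "@{ $s->{test} }", "@{ $s->{train} }";
}
```

Each item lands in exactly one test set, so every item is used for testing once and for training in all the other rounds.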

16
Visualization of Results
  • Graphs/Charts
  • Rules
  • If expression of X < 1035, then tissue is
    cancerous
  • Largely dependent on the technique used

17
Case Study: miRNA Project
  • Started January 2002
  • Participants
  • Dr. Craig Struble
  • Dr. Stephen Munroe
  • Dr. John Simms
  • Parthav Jailwala
  • Peigang Li
  • http://bistro.mscs.mu.edu/miRNA

18
Case Study: miRNA Project
  • Lee, R. C. & Ambros, V. An extensive class of
    small RNAs in Caenorhabditis elegans. Science
    294, 862-864 (2001).
  • Lagos-Quintana, M., Rauhut, R., Lendeckel, W. &
    Tuschl, T. Identification of novel genes coding
    for small expressed RNAs. Science 294, 853-858
    (2001).
  • Hutvágner, G. et al. A cellular function for the
    RNA-interference enzyme Dicer in the maturation
    of the let-7 small temporal RNA. Science 293,
    834-838 (2001).
  • Lau, N. C., Lim, L. P., Weinstein, E. G. & Bartel,
    D. P. An abundant class of tiny RNAs with
    probable regulatory roles in Caenorhabditis
    elegans. Science 294, 858-86 (2001).

19
Research Questions
  • Can we identify features of existing miRNAs that
    can be used to predict the existence of other
    miRNA genes?
  • Which mRNA (messenger RNA) are targeted by
    miRNAs?
  • What other family-wide behavioral and structural
    questions can be answered about miRNAs?

20
Current Implementation
[Figure: current implementation. Components: GenBank, miRNA library, BLAST reports, homolog library, and multiple sequence alignment, connected by three Perl scripts. Stage labels: data warehouse, data selection/cleansing, initial mining and cleansing.]
21
Perl
  • Practical Extraction and Report Language
  • Language of choice for many bioinformaticians
  • Excellent support for parsing/transforming data
  • http://www.perl.com

22
Data Collection with Perl
E.g., using Entrez
23
Data Collection with Perl
Construct a URL to search and access information
in Entrez
24
Data Collection with Perl
  • Use LWP module
  • Makes network connections easy
  • Use BioPerl (http://www.bioperl.org)
  • Perl modules/objects for handling bioinformatics
    data
  • Handles connections to databases

25
Sample Perl Script
#!/usr/local/bin/perl
# Simple Entrez Query in Perl
# Craig A. Struble

# For internet requests and protocols
use LWP;

# A user agent for testing
my $ua = LWP::UserAgent->new;
$ua->agent('miRNA/0.1 ');

# URL base for Entrez search
my $NCBI_ENTREZ = 'http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?';
26
Script (cont.)
# Building up the URL for the Entrez Search
my $search_URL = $NCBI_ENTREZ        # URL Base
    . 'cmd=Search'                   # Command
    . '&db=nucleotide'               # Database
    . '&dispmax=100'                 # Max results
    . '&term=miRNA'                  # Search term
    . '&doptcmdl=FASTA';             # Result format

# Make an HTTP GET request for an Entrez search
my $req = HTTP::Request->new(GET => $search_URL);
$req->push_header(Connection => 'Keep-Alive');

# Get the response
my $res = $ua->request($req);
27
Script (cont.)
# Check the response. If it's OK, print out the content
if ($res->is_success) {
    print $res->content;
} else {
    print $res->error_as_HTML;
    exit 1;
}
28
Sample Result
<input name="showndispmax" type="hidden" value="100"><input name="page" type="hidden" value="0"></table></td></tr>
</table><dl><dt><table cellpadding="0" cellspacing="0" width="100"><tr><td><input name="uid" type="checkbox" value="17646034"><b>1 </b>AJ421749. Homo sapiens micr...gi17646034</td>
<td align="right"><SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cmd=Display&amp;dopt=nucleotide_pubmed&amp;from_uid=17646034">PubMed, </a></SPAN>
<SPAN><a CLASS="dblinks" href="query.fcgi?db=nucleotide&amp;cmd=Display&amp;dopt=nucleotide_taxonomy&amp;from_uid=17646034">Taxonomy</a></SPAN>
</td>
</tr></table></dt></dl><pre>>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
</pre><dl><dt><table cellpadding="0" cellspacing="0" width="100"><tr><td><input name="uid" type="checkbox" value="17646061"><b>2 </b>AJ421776. Drosophila melano...gi17646061</td>
29
Parsing Result
  • Result is a big, ugly HTML file
  • Need to extract the data inside <pre> tags
  • Fortunately, Perl can come to the rescue!

30
Parsing Result with Perl
#!/usr/local/bin/perl
# Use an HTML parser
use HTML::TreeBuilder;

# Extract out FASTA entries for each file on the command line
foreach my $file_name (@ARGV) {
    # Build an HTML Parse Tree
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($file_name);

    # FASTA entries are in PRE tags
    @entries = $tree->find_by_tag_name('pre');

    # Print out each entry
    foreach my $entry (@entries) {
        @children = $entry->content_list;
        print $children[0] . "\n";    # first child is text content
    }
}
31
Processed Results
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
>gi|17646060|emb|AJ421775.1|DME421775 Drosophila melanogaster microRNA miR-13b-2
TATCACAGCCATTTTGACGAGT
>gi|17646059|emb|AJ421774.1|DME421774 Drosophila melanogaster microRNA miR-13b-1
TATCACAGCCATTTTGACGAGT
>gi|17646058|emb|AJ421773.1|DME421773 Drosophila melanogaster microRNA miR-13a
TATCACAGCCATTTTGATGAGT
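
FASTA output like this is easy to load back into a script for later steps. The sketch below is a minimal core-Perl reader, not part of the original project (BioPerl's Bio::SeqIO does the same job more robustly). The function name parse_fasta and the record field names are hypothetical, and the identifier layout gi|number|database|accession.version|locus is assumed from the entries shown.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse FASTA text into a list of hash records:
# { id, accession, description, sequence }.
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next unless length $line;
        if ($line =~ /^>(\S+)\s*(.*)$/) {
            my ($id, $description) = ($1, $2);
            # Assumed layout: gi|<number>|<db>|<accession.version>|<locus>
            my @fields = split /\|/, $id;
            push @records, {
                id          => $id,
                accession   => (@fields >= 4 ? $fields[3] : $id),
                description => $description,
                sequence    => '',
            };
        }
        elsif (@records) {
            $records[-1]{sequence} .= $line;    # sequence may span several lines
        }
    }
    return @records;
}

my $fasta = <<'END';
>gi|17646034|emb|AJ421749.1|HSA421749 Homo sapiens microRNA miR-27
TTCACAGTGGCTAAGTTCCGCT
>gi|17646061|emb|AJ421776.1|DME421776 Drosophila melanogaster microRNA miR-14
TCAGTCTTTTTCTCTCTCCTA
END

for my $rec (parse_fasta($fasta)) {
    printf "%s\t%d nt\t%s\n",
        $rec->{accession}, length($rec->{sequence}), $rec->{description};
}
```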
32
Getting BLAST Reports
  • Can automate getting BLAST reports with Perl
  • URL format documentation is available at
  • http://www.ncbi.nlm.nih.gov/BLAST/Doc/urlapi.html
  • Perl code not displayed

33
Parsing BLAST Reports
  • Use BioPerl's Bio::Tools::BPlite
  • Find high scoring pairs that contain surrounding
    sequence
  • BLAST also reports original sequence hits
  • Extract out matching sequence with up and
    downstream surrounding sequence

34
Perl Script
#!/usr/local/bin/perl
# Create homolog database from BLAST reports
# Author: Craig A. Struble

# Various BioPerl modules to use
use Bio::Tools::BPlite;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Seq;
35
Script (cont.)

#
# Function: rev_comp
# Description: Calculates the reverse complement of a DNA sequence.
#
sub rev_comp {
    my @seqs;
    foreach my $seq (@_) {
        # Work on a copy so the caller's sequence is not modified in place
        my $rc = reverse $seq;
        $rc =~ tr/AaCcTtGg/TtGgAaCc/;
        push @seqs, $rc;
    }
    # wantarray checks whether we were called in list context
    return wantarray ? @seqs : $seqs[0];
}
36
Script (cont.)

#
# Function: around_seq
# Description: Returns the upstream and downstream sequence around an HSP
# Parameters:
#   hsp - the high scoring pair
#   seq - the sequence of reference
#   upstream - number of basepairs upstream
#   downstream - number of basepairs downstream
#
sub around_seq {
    my ($hsp, $seq, $upstream, $downstream) = @_;
    # Code deleted due to space
    return $subseq;
}
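
The body of around_seq is omitted on the slide. As a rough idea of what such an extraction involves, the sketch below works on a plain string and explicit coordinates rather than BioPerl objects. It is an assumption-laden illustration, not the author's implementation: the function name flanked_region, its parameter order, the 1-based inclusive coordinates, and the clamping at the sequence ends are all choices made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the hit [$start, $end] (1-based, inclusive) together with $upstream
# and $downstream flanking bases, clamped to the ends of $sequence.
# For strand -1 the window is reverse-complemented.
sub flanked_region {
    my ($sequence, $start, $end, $strand, $upstream, $downstream) = @_;
    ($start, $end) = ($end, $start) if $start > $end;

    # On the minus strand, "upstream" lies at higher coordinates.
    my ($left, $right) = $strand < 0
        ? ($downstream, $upstream)
        : ($upstream, $downstream);

    my $from = $start - $left;
    $from = 1 if $from < 1;
    my $to = $end + $right;
    $to = length($sequence) if $to > length($sequence);

    my $region = substr($sequence, $from - 1, $to - $from + 1);
    if ($strand < 0) {
        $region = reverse $region;
        $region =~ tr/AaCcTtGg/TtGgAaCc/;
    }
    return $region;
}

my $genome = "AAAACCCCGGGGTTTT";    # toy sequence; the hit is CCCC at 5..8
print flanked_region($genome, 5, 8,  1, 2, 3), "\n";
print flanked_region($genome, 5, 8, -1, 2, 3), "\n";
```

Swapping the flank sizes on the minus strand keeps "upstream" meaning the 5' side of the hit as read on its own strand.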
37
Script (cont.)
# Open the BLAST report
open(BLAST, "<" . $ARGV[0]) or die "open failed";
$report = new Bio::Tools::BPlite(-fh => \*BLAST);
$gb = new Bio::DB::GenBank;

# Open output file
$out = Bio::SeqIO->new('-file' => ">$ARGV[1]", '-format' => 'fasta');

# Amount up and downstream to get
$upstream = $ARGV[2];
$downstream = $ARGV[3];
38
Script (cont.)
while (my $sbjct = $report->nextSbjct) {
    my ($db, $accv, $acc, $rest) = split /\|/, $sbjct->name;
    $seq = $gb->get_Seq_by_acc($acc);
    print $seq->accession_number . "\n";
    while (my $hsp = $sbjct->nextHSP) {
        my $seqstr = around_seq($hsp, $seq, $upstream, $downstream);
        my $subseq = Bio::Seq->new('-seq' => $seqstr,
                                   '-accession_number' => $seq->accession_number,
                                   '-display_id' => $seq->accession_number
                                       . "_" . $hsp->subject->start
                                       . ".." . $hsp->subject->end
                                       . "_" . $hsp->subject->strand);
        $out->write_seq($subseq);
    }
}
39
Results
>AC084471_10966..10987_-1
TCCCCCTTGGTCCCTTCTCATATACCATACTACATTTCTTTCAAAACTAACCGGGATTTT
TCAGGGGATTGCAGGATGATGGCTCTACACTGGGGTACGGTGAGGTAGTAGGTTGTATAG
TTTAGAATATTACTCTCGGTGAACTATGCAAGTTTCTACCTCACCGAATACCAGGTTCTC
AACTGCATCGTGTCAATTACTCTCAAACGACGGACACCTTCA
>AF274345_1763..1784_1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
>Z70203_12425..12446_-1
CACATCTCCCTTTGAATTTATATGTCTAATTTAACAACAAGTACTAATCCATTTTTCAGG
CAAGCAGGCGATTGGTGGACGGTCTACACTGTGGATCCGGTGAGGTAGTAGGTTGTATAG
TTTGGAATATTACCACCGGTGAACTATGCAATTTTCTACCTTACCGGAGACAGAACTCTT
CGAAGCTGCGTCGTCTTGCTCTCACAACTTTCTTTTCGTTTT
40
Multiple Sequence Alignment
  • Currently using clustalw/clustalx
  • Eventually generate web pages with sequence
    alignments
  • Investigate conserved regions of the surrounding
    sequence

41
Multiple Sequence Alignment
42
Future Work
  • Process homolog library with RNA fold prediction
    software (mFold)
  • Collect together fold structure information and
    other information
  • Transform into logical representation for ILP
    analysis
  • Store data in a database (Postgres)

43
Next Time
  • Applications of
  • Clustering
  • Neural Networks
  • Support Vector Machines
  • Etc.
  • Available tools to use, etc.