Transcript and Presenter's Notes

Title: Nada Lavrac


1
  • Nada Lavrac
  • Subgroup Discovery
  • Recent Biomedical Applications
  • Solomon seminar, Ljubljana, January 2008

2
Talk outline
  • Data mining in a nutshell
  • Subgroup discovery in a nutshell
  • Relational data mining and propositionalization
    in a nutshell
  • DNA Microarray Data Analysis with SD
  • RSD approach to Descriptive Analysis of
    Differentially Expressed Genes
  • Future work: Towards service-oriented knowledge
    technologies for information fusion

3
Data Mining in a Nutshell
Data Mining: knowledge discovery from data (data → model, patterns, ...)
Given: a transaction data table, a relational database, text documents, Web pages, ...
Find: a classification model, a set of interesting patterns
4
Data Mining in a Nutshell
Data Mining: knowledge discovery from data (data → model, patterns, ...)
Given: a transaction data table, a relational database, text documents, Web pages, ...
Find: a classification model, a set of interesting patterns
[Diagram: a symbolic model / symbolic patterns classify a new unclassified instance with an explanation; a black-box classifier classifies it with no explanation]
5
Data mining example. Input: Contact lens data
6
Output: Decision tree for contact lens prescription
tear prod. = reduced → NONE
tear prod. = normal → test astigmatism:
  astigmatism = no → SOFT
  astigmatism = yes → test spect. pre.:
    spect. pre. = myope → HARD
    spect. pre. = hypermetrope → NONE
7
Output: Classification/prediction rules for contact lens prescription
tear production = reduced → lenses = NONE
tear production = normal AND astigmatism = yes AND spect. pre. = hypermetrope → lenses = NONE
tear production = normal AND astigmatism = no → lenses = SOFT
tear production = normal AND astigmatism = yes AND spect. pre. = myope → lenses = HARD
DEFAULT: lenses = NONE
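The rule list on this slide is an ordered decision list: rules are tried top to bottom and the last line is the default. A minimal Python sketch (function and argument names are my own, not from the slides):

```python
def prescribe(tear_production, astigmatism, spect_pre):
    """Apply the contact-lens rule list in order; the final return is the DEFAULT rule."""
    if tear_production == "reduced":
        return "NONE"
    if tear_production == "normal" and astigmatism == "yes" and spect_pre == "hypermetrope":
        return "NONE"
    if tear_production == "normal" and astigmatism == "no":
        return "SOFT"
    if tear_production == "normal" and astigmatism == "yes" and spect_pre == "myope":
        return "HARD"
    return "NONE"  # DEFAULT rule

print(prescribe("normal", "no", "myope"))  # SOFT
```

Note that, unlike the subgroup descriptions discussed later, rule order matters here: each rule implicitly excludes the examples covered by the rules above it.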
8
Task reformulation: Concept learning problem
(positive vs. negative examples of the target class)
9
Classification versus Subgroup Discovery
  • Classification task: constructing models from
    data (constructing sets of predictive rules)
  • Predictive induction aimed at learning models
    for classification and prediction
  • Classification rules aim at covering only
    positive examples
  • A set of rules forms a domain model
  • Subgroup discovery task: finding interesting
    patterns in data (constructing individual
    subgroup-describing rules)
  • Descriptive induction aimed at exploratory data
    analysis
  • Subgroup descriptions aim at covering a
    significant proportion of positive examples
  • Each rule (pattern) is an independent chunk of
    knowledge

10
Classification versus Subgroup Discovery










11
Talk outline
  • Data mining in a nutshell
  • Subgroup discovery in a nutshell
  • Relational data mining and propositionalization
    in a nutshell
  • DNA Microarray Data Analysis with SD
  • RSD approach to Descriptive Analysis of
    Differentially Expressed Genes


12
Subgroup discovery in a nutshell
  • SD task definition
  • Given: a set of labeled training examples and a target class of interest
  • Find: descriptions of the most interesting subgroups of target-class examples, which
  • are as large as possible
  • (high target class coverage)
  • have a significantly different distribution of the target class examples
  • (high TP/FP ratio, high RelAcc, high significance)
  • Other (subjective) criteria of interestingness:
  • surprising to the user, simple, useful, actionable

13
CHD Risk Group Discovery Task
  • Task: Find and characterize population subgroups with high CHD risk
  • Input: Patient records described by stage A (anamnestic), stage B (anamnestic and laboratory), and stage C (anamnestic, laboratory and ECG) attributes
  • Output: Best subgroup descriptions that are most actionable for CHD risk screening at the primary health-care level

14
Subgroup discovery in the CHD application
  • From best induced subgroup descriptions, five
    were selected by the expert as most actionable
    for CHD risk screening (by GPs)
  • A1: CHD ← male ∧ positive family history ∧ age > 46
  • A2: CHD ← female ∧ body mass index > 25 ∧ age > 63
  • B1: CHD ← ..., B2: CHD ← ..., C1: CHD ← ...
  • Principal risk factors (found by subgroup mining)
  • Supporting risk factors (found by statistical analysis):
  • A1: psychosocial stress, as well as cigarette smoking, hypertension and overweight
  • A2: ...

15
Characteristics of Subgroup Discovery Algorithms
Remark: Subgroup discovery can be viewed as
cost-sensitive rule learning, rewarding covered
TPs and punishing covered FPs.
  • SD algorithm does not look for a single complex
    rule to describe all examples of target class A
    (all CHD patients), but several rules that
    describe parts (subgroups) of A.
  • SD prefers rules that are accurate (cover only
    CHD patients) and have high generalization
    potential (cover large patient subgroups)
  • This is modeled by parameter g of the rule
    quality heuristic of SD.
  • SD naturally uses example weights in its
    procedure for repetitive subgroup generation, via
    its weighted covering algorithm and its rule
    quality evaluation heuristic.

16
Weighted covering algorithm for rule set
construction
[Figure: CHD patients vs. other patients; every example starts with weight 1.0]
For learning a set of subgroup-describing rules,
SD implements an iterative weighted covering
algorithm. The quality of a rule is measured by
trading off coverage and precision.
17
Weighted covering algorithm for rule set
construction
[Figure: rule "f2 and f3" covers a subgroup of CHD patients; all example weights are still 1.0]
  • Rule quality measure in SD: q(Cl ← Cond) = TP / (FP + g)
  • Rule quality measure in CN2-SD: WRAcc(Cl ← Cond) = p(Cond) × (p(Cl|Cond) − p(Cl)) = coverage × (precision − default precision)
  • = Pos/N × Neg/N × (TPr − FPr)
  • Coverage: sum of the covered weights; Precision: purity of the covered examples
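The two rule quality measures on this slide, and the identity between the two forms of WRAcc, can be checked numerically with a short sketch (function names and the toy counts are my own):

```python
def q_g(tp, fp, g):
    # SD rule quality heuristic: q(Cl <- Cond) = TP / (FP + g)
    return tp / (fp + g)

def wracc(tp, fp, pos, neg):
    # CN2-SD heuristic: WRAcc = p(Cond) * (p(Cl|Cond) - p(Cl))
    #                         = coverage * (precision - default precision)
    n = pos + neg
    coverage = (tp + fp) / n
    precision = tp / (tp + fp)
    return coverage * (precision - pos / n)

# Toy counts: a rule covering 20 of 40 positives and 5 of 60 negatives
tp, fp, pos, neg = 20, 5, 40, 60
n = pos + neg

# The equivalent form WRAcc = Pos/N * Neg/N * (TPr - FPr):
alt = (pos / n) * (neg / n) * (tp / pos - fp / neg)
assert abs(wracc(tp, fp, pos, neg) - alt) < 1e-12
print(round(wracc(tp, fp, pos, neg), 3))  # 0.1
```

The identity follows by expanding p(Cond)(p(Cl|Cond) − p(Cl)) into (TP·Neg − FP·Pos)/N², which factors as Pos·Neg/N² times (TP/Pos − FP/Neg).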

18
Weighted covering algorithm for rule set
construction
[Figure: after the first rule is learned, the weights of the CHD patients it covers are decreased from 1.0 to 0.5]
In contrast with classification rule learning
algorithms (e.g. CN2), the covered positive
examples are not deleted from the training set in
the next rule learning iteration; they are
re-weighted, and the next best rule is learned.
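The weighted covering loop described above can be sketched in Python. This is a minimal sketch, not the SD/CN2-SD implementation: names are my own, rule quality is the slide's coverage × precision, and the multiplicative 0.5 re-weighting mirrors the 1.0 → 0.5 weights shown in the figures (actual systems offer several re-weighting schemes).

```python
def weighted_covering(examples, candidates, n_rules=2, decay=0.5):
    """Greedily pick rules by weighted coverage x precision; shrink (rather than
    delete) the weights of the positive examples each chosen rule covers."""
    weights = [1.0] * len(examples)          # one weight per example
    candidates = list(candidates)
    chosen = []

    def quality(rule):
        covered = [i for i, (label, x) in enumerate(examples) if rule(x)]
        if not covered:
            return 0.0, covered
        coverage = sum(weights[i] for i in covered if examples[i][0])
        precision = sum(1 for i in covered if examples[i][0]) / len(covered)
        return coverage * precision, covered

    for _ in range(n_rules):
        if not candidates:
            break
        scored = [(quality(r), r) for r in candidates]
        (best_q, covered), best = max(scored, key=lambda s: s[0][0])
        if best_q == 0.0:
            break
        chosen.append(best)
        candidates.remove(best)
        for i in covered:
            if examples[i][0]:               # only covered positives are re-weighted
                weights[i] *= decay
    return chosen

# Toy data: (is_CHD_patient, attribute dict); rules are simple predicates
examples = [(True, {"a": 1}), (True, {"a": 1}), (True, {"a": 0}), (False, {"a": 0})]
rules = [lambda x: x["a"] == 1, lambda x: x["a"] == 0]
chosen = weighted_covering(examples, rules)
print(len(chosen))  # 2
```

Because covered positives keep a reduced weight instead of vanishing, later rules may still cover them, which yields overlapping but relatively independent subgroups.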
19
Subgroup visualization
The CHD task: Find, characterize and visualize
population subgroups with high CHD risk (large
enough, distributionally unusual, most actionable)
20
Induced subgroups and their statistical
characterization
Subgroup A2 for female patients: High CHD risk IF
body mass index over 25 kg/m2 (typically 29) AND
age over 63 years.
Supporting characteristics are positive family
history and hypertension. Women in this risk group
typically have slightly increased LDL cholesterol
values and normal but decreased HDL cholesterol values.
21
Statistical characterization of expert-selected
subgroups
22
Statistical characterization of subgroups
  • starts from induced subgroup descriptions
  • statistical significance of all available features (all risk factors) is computed given two populations: true positive cases (CHD patients correctly included in the subgroup) and all negative cases (healthy subjects)
  • the χ2 test at the 95% confidence level is used
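For a binary risk factor, the characterization step above reduces to a χ2 test on a 2×2 contingency table. A sketch with illustrative counts (the critical value 3.841 is for 1 degree of freedom at the 95% level; the data are invented):

```python
def chi2_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

CRIT_95 = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

# Hypothetical counts: rows = risk factor present / absent,
# columns = true positive cases vs. negative cases
stat = chi2_2x2(30, 10, 20, 40)
print(stat > CRIT_95)  # True: the factor's distribution differs significantly
```

Computing the statistic per feature against the two populations (subgroup TPs vs. all negatives) reproduces the characterization tables shown on the previous slide.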

23
Propositional subgroup discovery algorithms
  • SD algorithm (Gamberger & Lavrac, JAIR 2002)
  • APRIORI-SD (Kavsek & Lavrac, AAI 2006)
  • CN2-SD (Lavrac et al., JMLR 2004): adapting the CN2 classification rule learner to subgroup discovery
  • Weighted covering algorithm
  • Weighted relative accuracy (WRAcc) search heuristic, with added example weights
  • WRAcc: trade-off between rule coverage and rule accuracy
  • Probabilistic classification
  • Evaluation with different rule interestingness
    measures

24
Subgroup discovery lessons learned
  • In expert-guided subgroup discovery, the expert
    may decide to choose sub-optimal subgroups, which
    are the most actionable
  • A weighted covering algorithm for rule subset construction (or rule set selection), using decreased weights of covered positive examples, can be used to construct/select a small set of relatively independent patterns
  • Additional evidence in the form of supporting factors increases the expert's confidence in rules resulting from automated discovery
  • Value added: subgroup visualization

25
Talk outline
  • Data mining in a nutshell
  • Subgroup discovery in a nutshell
  • Relational data mining and propositionalization
    in a nutshell
  • DNA Microarray Data Analysis with SD
  • RSD approach to Descriptive Analysis of
    Differentially Expressed Genes


26
Relational Data Mining (Inductive Logic
Programming) in a Nutshell
Relational Data Mining: knowledge discovery from data (data → model, patterns, ...)
Given: a relational database, a set of tables, sets of logical facts, a graph, ...
Find: a classification model, a set of interesting patterns
27
Relational Data Mining (ILP)
  • Learning from multiple tables
  • Complex relational problems
  • temporal data: time series in medicine, traffic
    control, ...
  • structured data representation of molecules and
    their properties in protein engineering,
    biochemistry, ...
  • Illustrative example: structured objects (trains)

28
RSD: Upgrading CN2-SD to Relational Subgroup
Discovery
  • Implementing a propositionalization approach to
    relational data mining, through efficient
    first-order feature construction
  • Using CN2-SD for propositional subgroup discovery

[Diagram: First-order feature construction → features → Subgroup discovery → rules]
29
Propositionalization in a nutshell
TRAIN_TABLE
Propositionalization task: Transform a
multi-relational (multiple-table) representation
into a propositional representation (single
table). Proposed in ILP systems LINUS (1991),
1BC (1999), ...
PROPOSITIONAL TRAIN_TABLE
30
Propositionalization in relational data mining
TRAIN_TABLE
Main propositionalization step: first-order feature construction
f1(T) :- hasCar(T,C), clength(C,short).
f2(T) :- hasCar(T,C), hasLoad(C,L), loadShape(L,circle).
f3(T) :- ...
Propositional learning: t(T) ← f1(T), f4(T)
Relational interpretation: eastbound(T) ← hasShortCar(T), hasClosedCar(T).
PROPOSITIONAL TRAIN_TABLE
31
Relational subgroup discovery
  • RSD algorithm (Lavrac et al., ILP 2002; Zelezny & Lavrac, MLJ 2006)
  • Implementing a propositionalization approach to relational learning, through efficient first-order feature construction
  • Syntax-driven feature construction, using Progol/Aleph style modeb/modeh declarations
  • f121(M) :- hasAtom(M,A), atomType(A,21)
  • f235(M) :- lumo(M,Lu), lessThr(Lu,-1.21)
  • Using CN2-SD for propositional subgroup discovery
  • mutagenic(M) ← feature121(M), feature235(M)

[Diagram: First-order feature construction → features → Subgroup discovery → rules]
32
RSD: Lessons learned
  • Efficient propositionalization can be applied to
    individual-centered, multi-instance learning
    problems
  • one free global variable (denoting an individual,
    e.g. molecule M)
  • one or more structural predicates (e.g.
    has_atom(M,A)), each introducing a new
    existential local variable (e.g. atom A), using
    either the global variable (M) or a local
    variable introduced by other structural
    predicates (A)
  • one or more utility predicates defining
    properties of individuals or their parts,
    assigning values to variables
  • feature121(M) :- hasAtom(M,A), atomType(A,21)
  • feature235(M) :- lumo(M,Lu), lessThr(Lu,-1.21)
  • mutagenic(M) :- feature121(M), feature235(M)

33
Talk outline
  • Data mining in a nutshell
  • Subgroup discovery in a nutshell
  • Relational data mining and propositionalization
    in a nutshell
  • DNA Microarray Data Analysis with SD
  • RSD approach to Descriptive Analysis of
    Differentially Expressed Genes


34
DNA microarray data analysis
  • Genomics: the study of genes and their function
  • Functional genomics is a typical scientific
    discovery domain characterized by
  • a very large number of attributes (genes)
    relative to the number of examples
    (observations).
  • typical values 7000-16000 attributes, 50-150
    examples
  • Functional genomics using gene expression
    monitoring by DNA microarrays (gene chips)
    enables
  • better understanding of many biological processes
  • improved disease diagnosis and prediction in
    medicine

35
Gene Expression Data: data mining format
[Table sketch: many features (genes 1, 2, ..., 100, ...), few cases]
36
Standard approach: High-Dimensional
Classification Models
  • Neural Networks, Support Vector Machines, ...

37
High-Dimensional Classification Models (contd)
  • Usually good at predictive accuracy
  • Golub et al., Science 286:531-537, 1999
  • Ramaswamy et al., PNAS 98:15149-54, 2001
  • Resistance to overfitting (mainly SVM, ensembles, ...)
  • But black-box models are hard to interpret
38
Subgroup discovery in DNA microarray data
analysis: Functional genomics domains
  • Two-class diagnosis problem of distinguishing between acute lymphoblastic leukemia (ALL, 27 samples) and acute myeloid leukemia (AML, 11 samples), with 34 samples in the test set. Every sample is described with gene expression values for 7129 genes.
  • Multi-class cancer diagnosis problem with 14 different cancer types; in total 144 samples in the training set and 54 samples in the test set. Every sample is described with gene expression values for 16063 genes.
  • http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi

39
Subgroup discovery in microarray data analysis
  • Applying the SD algorithm to a cancer diagnosis problem with 14 different cancer types (leukemia, CNS, lung cancer, lymphoma, ...)
  • Altogether 144 samples in the training set, 54 samples in the test set.
  • Every sample is described with gene expression values for 16063 genes.
  • IF (KIAA0128_gene EXPRESSED) AND (prostaglandin_d2_synthase_gene NOT_EXP) THEN Leukemia
  •                training set    test set
  • sensitivity    23/24           4/6
  • specificity    120/120         47/48

40
Subgroup discovery in microarray data analysis:
Experts' comments
  • SD results in simple IF-THEN rules,
    interpretable by biologists.
  • The best-scoring rule for leukemia shows
    expression of KIAA0128 (Septin 6) whose relation
    to the disease is directly explicable.
  • The second condition concerns the absence of Prostaglandin D synthase (PGDS). PGDS is an enzyme active in the production of Prostaglandins (pro-inflammatory and anti-inflammatory molecules). Elevated expression of PGDS has been found in brain tumors, ovarian and breast cancer [Su 2001, Kawashima 2001], while hematopoietic PGDS has not been, to our knowledge, associated with leukemias.

41
Propositional subgroup discovery:
Accuracy-interpretability trade-off
  • Patterns in the form of IF-THEN rules induced by
    SD
  • Interpretable by biologists
  • D. Gamberger, N. Lavrac, F. Zelezny, J. Tolar:
    J. Biomed. Informatics 37(5):269-284, 2004
  • Special care taken to avoid fluke rules
  • Still, inferior in terms of accuracy

42
Talk outline
  • Data mining in a nutshell
  • Subgroup discovery in a nutshell
  • Relational data mining and propositionalization
    in a nutshell
  • DNA Microarray Data Analysis with SD
  • RSD approach to Descriptive Analysis of
    Differentially Expressed Genes


43
Accuracy-interpretability trade-off
  • Dilemma: Accuracy or interpretability?
  • Our approach: achieve both at the same time
  • Learn an accurate high-dimensional classifier
  • Learn comprehensible summarizations of the genes in the classifier by relational subgroup discovery
  • In Learning 2, instances are genes, not patients!

44
Actual approach to Learning 1: Identifying sets
of differentially expressed genes in data
preprocessing
45
Identifying differentially expressed genes
46
Identifying differentially expressed genes
  • We want to find genes that display a large
    difference in gene expression between groups and
    are homogeneous within groups
  • Typically, one would use statistical tests (e.g.
    t-test) and calculate p-values (e.g. permutation
    test)
  • p-values from these tests have to be corrected
    for multiple testing (e.g. Bonferroni correction)

The two-sample t-statistic is used to test the
equality of the group means m1, m2.
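The t-statistic and permutation p-value mentioned above can be sketched in a few lines (a sketch with my own function names; this uses the pooled-variance form of the two-sample t statistic, and real analyses would use a statistics library):

```python
import math
import random

def t_statistic(xs, ys):
    """Pooled-variance two-sample t statistic for equality of the group means."""
    n1, n2 = len(xs), len(ys)
    m1, m2 = sum(xs) / n1, sum(ys) / n2
    v1 = sum((x - m1) ** 2 for x in xs) / (n1 - 1)
    v2 = sum((y - m2) ** 2 for y in ys) / (n2 - 1)
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)   # pooled variance
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def permutation_p(xs, ys, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the observed t statistic."""
    rng = random.Random(seed)
    observed = abs(t_statistic(xs, ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:len(xs)], pooled[len(xs):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)      # add-one to avoid p = 0

print(round(t_statistic([5, 6, 7], [1, 2, 3]), 2))  # 4.9
```

For the multiple-testing correction, Bonferroni simply rescales each p-value by the number of genes tested: p_adjusted = min(1.0, p * n_genes).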
47
Ranking of differentially expressed genes
The genes can be ordered in a ranked list L
according to their differential expression between
the classes. The challenge is to extract meaning
from this list, i.e. to describe the genes. The
terms of the Gene Ontology were used as a
vocabulary for describing the genes. Selected
genes have different influence on the classifier.
The weight of that influence can be extracted from
the learned model (e.g. a voting algorithm or SVM)
or from the gene selection algorithm, in the form
of a score or weight. The description of the
selected genes should be biased towards the genes
with higher weights.
48
Statistical Significance Meets Biological
Relevance: Motivation for relational feature
construction
  • Gene A is obvious, well-known, and not
    interesting
  • Gene J activates gene X, which is an oncogene

49
Relational Subgroup Discovery
  • Learning 2, technically
  • Discovery of gene subgroups which
  • largely overlap with those associated by the
    classifier with a given class
  • can be compactly summarized in terms of their
    features
  • What are features?
  • attributes of the original attributes (genes),
    and
  • first-order features extracted from the Gene
    Ontology and NCBI gene annotation database ENTREZ
  • Recent work first-order features generated from
    GO, ENTREZ and KEGG

50
Gene Ontology (GO)
  • GO is a database of terms for genes
  • Function - What does the gene product do?
  • Process - Why does it perform these activities?
  • Component - Where does it act?
  • Known genes are annotated to GO terms
    (www.ncbi.nlm.nih.gov)
  • Terms are connected as a directed acyclic graph
    (is_a, part_of)
  • Levels represent the specificity of the terms

12093 biological process terms, 1812 cellular
component terms, 7459 molecular function terms
51
Gene Ontology (2)
12093 biological process terms, 1812 cellular
component terms, 7459 molecular function terms
52
Multi-Relational representation
[Diagram: relational schema. GENE (main table, class labels) is linked to the GENE-FUNCTION, GENE-PROCESS, GENE-COMPONENT and GENE-GENE INTERACTION tables; the FUNCTION, PROCESS and COMPONENT hierarchies are connected by is_a / part_of links]
53
Encoding as relational background knowledge
  • Prolog facts:
  • predicate(geneID, CONSTANT).
  • interaction(geneID, geneID).
  • component(2532,'GO:0016020').
  • component(2532,'GO:0005886').
  • component(2534,'GO:0008372').
  • function(2534,'GO:0030554').
  • function(2534,'GO:0005524').
  • process(2534,'GO:0007243').
  • interaction(2534,5155).
  • interaction(2534,4803).

fact(class, geneID, weight).
fact('diffexp', 64499, 5.434).
fact('diffexp', 2534, 4.423).
fact('diffexp', 5199, 4.234).
fact('diffexp', 1052, 2.990).
fact('diffexp', 6036, 2.500).
fact('random', 7443, 1.0).
fact('random', 9221, 1.0).
fact('random', 23395, 1.0).
fact('random', 9657, 1.0).
fact('random', 19679, 1.0).
Basic, plus generalized background knowledge
using GO: zinc ion binding -> metal ion binding ->
ion binding -> binding
54
RSD: First-order feature construction
First-order features with support > min_support:
  • f(7,A) :- function(A,'GO:0046872').
  • f(8,A) :- function(A,'GO:0004871').
  • f(11,A) :- process(A,'GO:0007165').
  • f(14,A) :- process(A,'GO:0044267').
  • f(15,A) :- process(A,'GO:0050874').
  • f(20,A) :- function(A,'GO:0004871'), process(A,'GO:0050874').
  • f(26,A) :- component(A,'GO:0016021').
  • f(29,A) :- function(A,'GO:0046872'), component(A,'GO:0016020').
  • f(122,A) :- interaction(A,B), function(B,'GO:0004872').
  • f(223,A) :- interaction(A,B), function(B,'GO:0004871'), process(B,'GO:0009613').
  • f(224,A) :- interaction(A,B), function(B,'GO:0016787'), component(B,'GO:0043231').

(The interaction features are existential: B is an existentially quantified variable.)
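Evaluating such first-order features over the annotation facts is what turns the relational representation into a propositional gene-by-feature table. A toy Python rendering (not the Prolog-based RSD implementation; gene IDs, GO terms and feature names here are illustrative):

```python
# Illustrative annotation facts, in the spirit of the earlier Prolog facts
function = {2534: {"GO:0030554", "GO:0005524"}, 5155: {"GO:0004872"}}
interaction = {2534: {5155, 4803}}

def has_function(gene, term):
    return term in function.get(gene, set())

def feat_fun(a):
    # in the style of f(7,A) :- function(A, term): a property of the gene itself
    return has_function(a, "GO:0030554")

def feat_int(a):
    # in the style of f(122,A) :- interaction(A,B), function(B, term):
    # B is existential, so the feature holds if ANY interacting gene qualifies
    return any(has_function(b, "GO:0004872") for b in interaction.get(a, set()))

# The propositional (gene x feature) table handed to the subgroup discovery step
genes = [2534, 2532]
table = {g: (feat_fun(g), feat_int(g)) for g in genes}
print(table[2534])  # (True, True)
print(table[2532])  # (False, False)
```

The `any(...)` call is where the existential quantification of B becomes concrete: one qualifying interaction partner suffices to set the feature to true.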
55
Propositionalization
56
Propositional learning: subgroup discovery
[Figure: rule "f2 and f3" covering 4 positive and 0 negative examples]
57
Subgroup Discovery
[Figure: differentially expressed vs. not differentially expressed genes; every gene starts with weight 1.0]
58
Subgroup Discovery
[Figure: rule "f2 and f3" covers a subgroup of differentially expressed genes; all weights are still 1.0]
In RSD (using the propositional learner CN2-SD):
Quality of a rule = Coverage x Precision.
Coverage: sum of the covered weights;
Precision: purity of the covered genes.
59
Subgroup Discovery
[Figure: the weights of the covered differentially expressed genes are decreased from 1.0 to 0.5]
RSD naturally uses gene weights in its procedure
for repetitive subgroup generation, via its
heuristic rule evaluation (weighted relative
accuracy).
60
Summary: The RSD approach
  • The RSD system
  • constructs relational logic features of genes, such as interaction(g,G), function(G,protein_binding)
  • ("g interacts with another gene whose functions include protein binding")
  • features are subject to constraints (undecomposability, minimum support, ...)
  • then discovers subgroups using these features

Example subgroup description: Genes coding for proteins located in the nucleus, whose functions include protein binding and whose related processes include transcription, are highly expressed in the TEL class.
61
Experiments
  • We have applied the proposed methodology of
    first-order feature construction and subgroup
    discovery to discover descriptions of groups of
    differentially expressed genes on three popular
    classification problems from gene expression
    data:
  • ALL vs. AML, 6817 genes, 73 labeled samples, 2
    classes
  • Subtypes ALL, 22283 genes, 132 labeled samples, 6
    classes
  • 14 cancers types, 16063 genes, 198 labeled
    samples, 14 classes
  • For all three problems and all classes we
    selected a set of most differentially expressed
    genes (highest t-score ranking) and the same
    number of randomly chosen non-differentially
    expressed genes
  • In initial work 50 genes selected
  • In recent work Using Gene Set Enrichment
    Analysis (GSEA) to determine the top-ranked genes

62
Results: Discovered subgroup descriptions
  • Descriptions of differentially expressed genes for acute lymphoblastic leukemia (ALL) vs. acute myeloid leukemia (AML) (Golub 99):
  • [12, 0] interaction(A,B), process(B,'humoral immune response').
  • [11, 0] interaction(A,B), function(B,'peptidase activity').
  • [8, 0] interaction(A,B), process(B,'proteolysis'). interaction(A,B), function(B,'peptidase activity').
  • [10, 0] interaction(A,B), process(B,'immune response'), component(B,'extracellular space').
  • [8, 1] function(A,'signal transducer activity').

63
Results: Discovered subgroup descriptions (2)
  • Subtypes of ALL, 6 classes, each vs. all others (Ross 03)
  • BCR subtype:
  • [9, 0] interaction(A,B), function(B,'metal ion binding'). component(A,'membrane'). process(A,'cell adhesion').
  • [8, 0] interaction(A,B), function(B,'transmembrane receptor activity'). function(A,'receptor activity').
  • [10, 1] interaction(A,B), function(B,'protein homodimerizat. activity'). interaction(A,B), process(B,'response to stimulus').

64
Results: Clear effect of using background
knowledge and weights in learning
65
Related and Recent work
  • There are several approaches for descriptive analysis of gene expression data: Onto-Express, GOstat, GoMiner, FunSpec, FatiGO, GOTermFinder.
  • They do not consider the importance weights and do not use the interaction information
  • They define a cutoff (fixed or calculated) in the gene list and calculate the significance of single GO terms.
  • Our recent work overcomes the fixed cut-off problem
  • In SEGS (Search for Enriched Gene Sets), the cutoff is dynamically determined through Gene Set Enrichment Analysis
  • Extended feature construction
  • GO, ENTREZ, KEGG

66
Summary of the presented RSD approach
  • A method for adding interpretability to
    high-dimensional gene expression based
    classifiers was presented
  • Sequence of two data mining tasks
  • predictive classifier construction
  • descriptive subgroup discovery
  • The 2nd learning task integrates public gene
    annotation data through relational features
  • Highlight: genes are attributes in the 1st task,
    but become examples in the 2nd
  • Good, because of their abundance

67
Summary of recent work
  • The SEGS approach (Trajkovski, Lavrac and Tolar, JBI 2008, in press) to descriptive subgroup discovery, together with Gene Set Enrichment Analysis for initial gene set selection, proved very effective
  • In the 2nd learning task, propositionalization through first-order feature construction was in this experiment
  • not used to transform the multiple-table representation of training instances into a single-table representation,
  • but used to transform the information available in structured web repositories (GO, KEGG, ENTREZ) into features used in exploratory data analysis
  • Side effect: a consistent, automatically updateable database of biological background knowledge is now available on the Web: kt.ijs.si/segs/

68
Future work: Towards service-oriented knowledge
technologies for information fusion
  • Steps towards SOKT for information fusion
  • Service-oriented knowledge technologies (SOKT)
    workshop in Ljubljana, Jan 9-18, 2008
  • Implementation of the SEGS workflow as a first
    step towards a SOKT toolbox (previously named
    KMET), in collaboration with Leiden University
  • Implementation of other modules of the future
    SOKT toolbox

69
Conclusions and Thanks
  • Results show that our methodology is capable of
    automatic extraction of meaningful biomedical
    knowledge.
  • The extracted knowledge can be used for guiding
    medical research, generating different
    interpretations of the learned model, or for
    constructing complex gene features for building
    interpretable classifiers.
  • We have high hopes to discover novel, yet
    reliable medical knowledge from the relational
    combination of gene expression data with public
    gene annotation databases.
  • Thanks
  • This work was done in collaboration with Dragan
    Gamberger, Filip Zelezny, Igor Trajkovski and
    Jakub Tolar.