Concept Extraction based on Optimal Clique Searches

1
Concept Extraction based on Optimal Clique Searches
1st Franco-Japanese Workshop on Information Search, Integration and Personalization, June 30 - July 2, 2003, at Hokudai VBL
Makoto HARAGUCHI and Yoshiaki OKUBO
{makoto, yoshiaki}@db-ei.eng.hokudai.ac.jp
http://mhjcc3-ei.eng.hokudai.ac.jp/index-e.html
  • An English paper is currently under submission to an international conference.
  • Abstract in English:
    http://mhjcc3-ei.eng.hokudai.ac.jp/web/makoto/hp/pdfs/ida03.ps
  • Japanese report: "Concept Learning based on Optimal Clique Searches", JSAI SIG report SIG-FAI-A202-11 (12/19), pp. 63-66, 2002.

2
Data Abstraction: an information-theoretic extension of attribute-oriented induction
[Diagram: a large DB is transformed by a generalization step into a generalized DB; the generalization is controlled by a machine-readable dictionary (MRD), which is used to select from a set of abstractions, called data abstractions; a classification algorithm then produces rules from the generalized DB.]
The effect of data abstraction (the MRD is used as a generator of possible data abstractions for categorical attributes):
  • Improvement of readability and efficiency
  • Reduction of the number of output rules

3
MRD as a generator of candidate clusters of attribute values
A cluster of categorical values dominated by a term in the MRD is a candidate.
Domain of the categorical attribute.
A possible partition π = { {a1, a2, a3}, {a5, a6}, ... }, where each cell is a candidate generated by the MRD.
Select the partition π that minimizes the conditional entropy after the abstraction (a sketch of this criterion follows).
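As an illustration of this selection criterion, here is a minimal Python sketch that scores a candidate partition by the conditional class entropy after abstraction. The data layout (a dict of (value, class) counts) and all names are our assumptions, not the authors' code.

```python
import math
from collections import defaultdict

def conditional_entropy(joint, partition):
    """H(class | abstracted attribute) after merging the values
    of each cell of the partition into a single abstract value.

    joint: dict mapping (attribute_value, class_label) -> count
    partition: list of sets of attribute values (the cells of pi)
    """
    total = sum(joint.values())
    h = 0.0
    for cell in partition:
        cls = defaultdict(float)            # class counts of the merged cell
        for (a, c), n in joint.items():
            if a in cell:
                cls[c] += n
        cell_total = sum(cls.values())
        if cell_total == 0:
            continue
        h_cell = -sum((n / cell_total) * math.log2(n / cell_total)
                      for n in cls.values() if n > 0)
        h += (cell_total / total) * h_cell  # weight by the cell probability
    return h

# choose the MRD-generated partition with the smallest score:
# best = min(candidate_partitions, key=lambda p: conditional_entropy(joint, p))
```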
4
Entropy Minimization Principle
The IBM Data Mining group: the optimal binary partition (1997).
Given the prior distribution, and an attribute that is numerical or Boolean, they provide a very efficient way to compute the optimal binary partition.
An extension
K. Takahata, "Optimal clustering of class distributions": consider an optimal m-partition consisting of m cells, given a parameter m.
5
The optimal partition is not always a good one
[Figure: class distributions plotted on the line x + y = 1; two major distributions, each with probability .32, and a minor distribution (0.1, 0.9) with probability .04.]
The optimal 3-partition: a cluster of closer distributions is formed by the optimality criterion.
A cluster of major distributions, whose probability, 0.64, is sufficiently large compared with the probability of the minor distribution, 0.04.
What we want instead is an admissible cluster (whose entropy is not too high) of major distributions, together with some minor distributions as well.
6
Empirical condition for data abstraction to work well
A metric for class distributions.
Core: many distributions close to each other.
Exceptions: distributions whose probability is relatively low compared with, or which are relatively close to, the core of the cluster.
The information loss incurred by adding such exceptions is relatively small (the entropy does not increase much).
7
Entropy as a Constraint
  • Without a dictionary,
  • never consider a partition with optimal entropy;
  • instead, attain a maximal probability (support) under an entropy condition, to extract a major, reasonable concept of attribute values w.r.t. a given target class.

Maximize Pr(g) subject to H(g) ≤ d, where H(g) is the entropy of cluster g.
8
Basic Definitions
  • A schema and its instances.
  • The target class and the attribute domain.
  • Cluster: a set of attribute values.
  • Atomic distribution: the posterior class distribution, given a single attribute value.
  • Posterior class distribution, given a cluster of attribute values.
  • The distribution obtained by merging two distributions is their linear combination, weighted by their probabilities (see the sketch below).
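To make the merge operation concrete, here is a minimal Python sketch of merging two class distributions as a probability-weighted linear combination, together with the entropy measure used throughout; the function names and the example values are our own.

```python
import math

def merge(p1, w1, p2, w2):
    """Linear combination of two class distributions,
    weighted by their probabilities w1 and w2."""
    w = w1 + w2
    return [(w1 * a + w2 * b) / w for a, b in zip(p1, p2)]

def entropy(p):
    """Shannon entropy (in bits) of a class distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

# e.g. merging a major distribution with the minor one from slide 5:
q = merge([0.9, 0.1], 0.32, [0.1, 0.9], 0.04)  # -> [0.811..., 0.188...]
print(entropy(q))                              # about 0.70 bits
```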
9
Entropy is not monotone w.r.t. the addition of a new distribution
[Figure: the binary entropy function H(p, 1-p) is concave over the simplex, with maximum log2 2 = 1; Z_d denotes the region where the entropy is within the upper bound d.]
  • The entropy of a smaller set of values is not necessarily within a given bound d, even when a larger superset is within d: for example, {a1, a2} may violate the bound (NG) while {a1, a2, a3} is OK (a numerical illustration follows).
No bottom-up construction of solutions is possible in general. So we introduce a new distance notion and make a bottom-up construction possible.
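A tiny numerical illustration of this non-monotonicity, with distributions and weights of our own choosing:

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def merged(dists, weights):
    """Probability-weighted linear combination of class distributions."""
    w = sum(weights)
    return [sum(wi * p[k] for wi, p in zip(weights, dists)) / w
            for k in range(len(dists[0]))]

d = 0.8                                  # entropy bound
a1, a2, a3 = [0.99, 0.01], [0.01, 0.99], [0.99, 0.01]
w1, w2, w3 = 0.1, 0.1, 0.8

print(entropy(merged([a1, a2], [w1, w2])))           # 1.0  > d : {a1, a2} NG
print(entropy(merged([a1, a2, a3], [w1, w2, w3])))   # ~0.49 <= d : {a1, a2, a3} OK
```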
10
Distance Notion
Basic strategy: collect two or more distributions that are close to each other and whose integrated (merged) distribution has entropy less than d.
The convex hull of the close distributions and the red region (the region of q such that d ≤ H(q)) should be separated. In order to guarantee this condition, we introduce the following definition.
p1 and p2 are close ⇔ either A(p1, p2) or A(p2, p1), where
A(pi, pj) ⇔ consider the perpendicular dropped from pj onto the boundary of the region {q : d ≤ H(q)}, with foot N; then pi and pj are on the same side of the tangent line at N (a simplified numerical sketch of this test follows).
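Below is a rough numerical sketch of the test A(pi, pj) in Python. As a simplification, the foot N is searched along the segment from pj toward the uniform distribution rather than along the exact perpendicular, and the same-side test uses the gradient of the entropy at N; this approximation and all names are our assumptions.

```python
import math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def grad_entropy(p, eps=1e-12):
    """Gradient of H(q) = -sum_k q_k log2 q_k."""
    return [-(math.log2(max(x, eps)) + 1.0 / math.log(2)) for x in p]

def approx_foot(pj, d, iters=60):
    """Approximate foot N on the level set H = d, found by bisection
    along the segment from pj toward the uniform distribution
    (H is concave and maximal at the uniform point, so it increases
    monotonically along this segment)."""
    u = [1.0 / len(pj)] * len(pj)
    lo, hi = 0.0, 1.0                    # assumes H(pj) < d <= H(u)
    for _ in range(iters):
        mid = (lo + hi) / 2
        q = [(1 - mid) * a + mid * b for a, b in zip(pj, u)]
        lo, hi = (mid, hi) if entropy(q) < d else (lo, mid)
    return [(1 - lo) * a + lo * b for a, b in zip(pj, u)]

def A(pi, pj, d):
    """pi and pj lie on the same side of the tangent at the foot N."""
    n = approx_foot(pj, d)
    g = grad_entropy(n)
    si = sum(gk * (a - b) for gk, a, b in zip(g, pi, n))
    sj = sum(gk * (a - b) for gk, a, b in zip(g, pj, n))
    return si * sj > 0

def close(p1, p2, d):
    return A(p1, p2, d) or A(p2, p1, d)
```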
11
Separation by a Tangent Line (Hyperplane)
For a set G of points (distributions) that satisfy the entropy constraint and are close to each other, the convex hull of G and the region of d ≤ H(q) can be separated by the tangent line. Since any merged distribution of G is a convex combination of its members, it lies in the convex hull of G and therefore stays within the entropy bound.
12
Definition of the Graph
  • Vertices: the attribute domain (via the corresponding distributions).
  • Edges: there is no edge between p and q ⇔ p and q satisfy the entropy constraint and are not close.
[Figure: the red region is the set of q s.t. d ≤ H(q); a complete subgraph G1 lies outside the red zone.]
We search for some G1, together with its exceptions G2, such that the probability is greatest (a graph-construction sketch follows).
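A minimal sketch of building this graph in Python, assuming close is the tangent-line test sketched earlier and checking the pairwise entropy constraint on merged pairs; this exact wiring of the edge condition is our reading of the slide, not the authors' code.

```python
import itertools, math

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def build_graph(dists, weights, d, close):
    """Vertices are attribute values (their class distributions);
    i and j are joined by an edge when their merged pair stays
    under the entropy bound d and the two distributions are close."""
    adj = {i: set() for i in range(len(dists))}
    for i, j in itertools.combinations(range(len(dists)), 2):
        w = weights[i] + weights[j]
        pair = [(weights[i] * a + weights[j] * b) / w
                for a, b in zip(dists[i], dists[j])]
        if entropy(pair) < d and close(dists[i], dists[j], d):
            adj[i].add(j)
            adj[j].add(i)
    return adj
```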
13
Branch-and-Bound Search (depth-first)
Q: the current node of the search tree (a clique); R: the candidate node set.
Inequality for branch-and-bound search: any clique extending Q has size at most |Q| + |R|.
A tighter condition: approximate the minimal chromatic number of the subgraph on R, under some order of nodes, by greedy coloring into color classes C1, C2, ...; any clique extending Q then has size at most |Q| + (number of color classes).
Addition of a candidate node q yields the successor node Q ∪ {q}.
Refined inequality for branch-and-bound search: prune whenever this bound does not exceed the size of the best clique found so far (a sketch follows).
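A compact Python sketch of the depth-first branch-and-bound clique search, with a greedy coloring of R approximating the chromatic-number bound; this is the textbook scheme the slide describes, not the authors' implementation.

```python
def coloring_bound(adj, R):
    """Greedy coloring of the candidate set R under a fixed order;
    the number of color classes upper-bounds the largest clique in R."""
    classes = []                          # each class is an independent set
    for v in R:
        for cls in classes:
            if all(u not in adj[v] for u in cls):
                cls.append(v)
                break
        else:
            classes.append([v])
    return len(classes)

def max_clique(adj):
    """adj: dict mapping each vertex to its set of neighbours."""
    best = []

    def expand(Q, R):
        nonlocal best
        if not R and len(Q) > len(best):
            best = list(Q)
        # refined inequality: |Q| + coloring bound must beat |best|
        if len(Q) + coloring_bound(adj, R) <= len(best):
            return
        for i, q in enumerate(R):
            expand(Q + [q], [v for v in R[i + 1:] if v in adj[q]])

    expand([], sorted(adj))
    return best
```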
14
Entropy-based Pruning
  • The order of distributions
  • Entropy-based pruning: if d ≤ H(Q ∪ {q}) for a candidate q ∈ R, then never generate the successor node Q ∪ {q}.
The safety: if Q ∪ {q} is a clique and d ≤ H(Q ∪ {q}), then, under the chosen order of distributions, the pruned successors cannot yield an admissible cluster, so no solution is lost (a code sketch follows).
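In code, the pruning drops into the expansion loop of the branch-and-bound sketch above; H_merged is an assumed helper of ours, not part of the paper.

```python
import math

def H_merged(dists, weights, nodes):
    """Entropy of the distribution merged over `nodes`
    (each node's class distribution weighted by its probability)."""
    w = sum(weights[v] for v in nodes)
    p = [sum(weights[v] * dists[v][k] for v in nodes) / w
         for k in range(len(dists[0]))]
    return -sum(x * math.log2(x) for x in p if x > 0)

# inside expand(Q, R) of the branch-and-bound sketch:
#     for i, q in enumerate(R):
#         if d <= H_merged(dists, weights, Q + [q]):
#             continue          # entropy-based pruning: skip Q ∪ {q}
#         expand(Q + [q], [v for v in R[i + 1:] if v in adj[q]])
```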
15
Experimentation (not yet sufficient!)
  • Census DB of 199,523 tuples with 42 attributes.
  • Target class: marital status (never married, married, divorced), with a single condition attribute, country of birth (self), of 42 values.

H ≤ d = 0.9
  • The branch-and-bound search is successful for graphs of 1000 nodes >> 42.
  • The clusters formed are very similar to those computed by NN given an allowable distance, but NN might miss the greatest cluster.
  • In addition, objects near the boundary that could be added under a lower threshold are computed as exceptions.

[Figure: example clusters with H < d; at d = .95: Hungary, Yugoslavia / Italy, Greece; at d = 1: ...]
Cluster g of ⟨A=a, B=b, C=c⟩ s.t. H(g) ≤ d
  • Clique search is much more meaningful when applied to a combined condition of more than two attributes, whose domain is about 1000 - 2000 vectors.

We seek the cluster with the greatest probability.
16
For two or more attributes
  • Condition attributes: Education (16 values) and Gender (2 values)
  • Target attribute: Workclass (8 classes)
  • The cluster with the greatest probability:

The clique found: {Preschool, 1st-6th, 9th, 10th, 11th, 12th, Prof-school} × {male, female}
  • It might be interpreted as "Primary Education"; 10th, 11th, 12th and Prof-school are exceptions.
  • It tells us that the Gender attribute is negligible (abstractable), given Education, for Workclass.
  • 7th and 8th are excluded; this might be due to the fact that the DB is relatively small.

17
End