Artificial Immune Systems: An Emerging Technology - PowerPoint PPT Presentation

Title: Artificial Immune Systems: An Emerging Technology Author: Computing Laboratory Last modified by: klopotek Created Date: 3/27/2001 8:49:50 AM – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski,
Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research
project 4 T11C 026 25 "Maps and intelligent
navigation in WWW using Bayesian networks and
artificial immune systems"
2
A search engine with SOM-based document set
representation
3
Map visualizations in 3D (BEATCA)
4
  • Documents are prepared by an indexer, which
    turns each document into a vector-space model
    representation
  • The indexer also identifies frequent phrases in
    the document set for clustering and labelling
    purposes
  • Subsequently, dictionary optimization is
    performed: terms with extreme entropy and
    extremely frequent terms are excluded
  • The map creator is then applied, turning the
    vector-space representation into a form
    appropriate for on-the-fly map generation
  • The best map (with respect to some similarity
    measure) is used by the query processor in
    response to the user's query

5
Document model in search engines
  • In the so-called vector model a document is
    considered as a vector in the space spanned by
    the words it contains.

[Figure: the example documents "My dog likes this food" and "When walking, I take some food" drawn as vectors in a space with axes dog, food and walk]
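The vector model on this slide can be illustrated with a minimal sketch. It assumes raw term counts and a deliberately crude tokenizer with no stemming, so "walking" is not reduced to "walk" as on the slide's axis; the helper names `tokenize`, `to_vector` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters (no stemming, no stop list)."""
    return re.findall(r"[a-z]+", text.lower())

def to_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

docs = ["My dog likes this food", "When walking, I take some food"]
# The space is spanned by every word that occurs in the document set.
vocabulary = sorted({term for doc in docs for term in tokenize(doc)})
vectors = [to_vector(doc, vocabulary) for doc in docs]
similarity = cosine(vectors[0], vectors[1])
```

The two example documents share only the term "food", so their vectors overlap in a single coordinate.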
6
Clustering document vectors
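[Figure: a document vector x in the document space and its counterpart on the 2D map, with labels r and m; a thick arrow marks a strong change of position]
Important difference from general clustering: not
only must each cluster contain similar documents,
but neighboring clusters must also be similar.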
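The requirement that neighboring clusters be similar is what a standard self-organizing map update provides. The sketch below is a generic SOM step, not necessarily the variant used in BEATCA; it assumes that m denotes a cell's reference vector, r a cell position on the map and x the document vector, and the Gaussian neighborhood, `alpha` and `sigma` are illustrative choices.

```python
import math

def som_step(grid, x, alpha=0.5, sigma=1.0):
    """One SOM update. grid maps a cell position r=(row, col) to its reference vector m."""
    def dist2(m):
        return sum((mi - xi) ** 2 for mi, xi in zip(m, x))

    # The winner is the cell whose reference vector is closest to x.
    winner = min(grid, key=lambda r: dist2(grid[r]))
    for r, m in list(grid.items()):
        # Cells near the winner on the map are pulled strongly, distant cells weakly.
        map_d2 = (r[0] - winner[0]) ** 2 + (r[1] - winner[1]) ** 2
        h = math.exp(-map_d2 / (2 * sigma ** 2))
        grid[r] = [mi + alpha * h * (xi - mi) for mi, xi in zip(m, x)]
    return winner

grid = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 0.0], (0, 2): [0.0, 1.0]}
x = [1.0, 0.2]
winner = som_step(grid, x)
```

Because the winner's map neighbors are moved toward the same document, adjacent cells end up holding similar reference vectors.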
7
Our problem
  • Instability
  • Pre-defined major themes needed
  • Our approach
  • Find a coarse clustering into a few themes

8
Bayesian Networks in Document Clustering
  • SOM document-map based search engines require
    initial document clustering in order to present
    results in a meaningful way.
  • Methods based on Latent Semantic Indexing appear
    to be promising for this purpose.
  • One of them, the PLSA, has been empirically
    investigated.
  • A modification is proposed to the original
    algorithm and an extension via TAN-like Bayesian
    networks is suggested.

9
A Bayesian Network
Represents a joint probability distribution as a
product of conditional probabilities of children
given their parents in a directed acyclic graph.
High compression, simplification of reasoning.
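As a small illustration of this factorization, the sketch below encodes a three-node network with a topic Z and two binary term indicators T1 and T2. The structure and all probability values are made up for the example.

```python
from itertools import product

# Each variable lists its parents; Z is a root, T1 and T2 depend on Z.
parents = {"Z": (), "T1": ("Z",), "T2": ("Z",)}

# Conditional probability tables: cpt[var][parent values][value].
cpt = {
    "Z": {(): {0: 0.6, 1: 0.4}},
    "T1": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
    "T2": {(0,): {0: 0.2, 1: 0.8}, (1,): {0: 0.5, 1: 0.5}},
}

def joint(assignment):
    """Joint probability as the product of P(child | parents) over all nodes."""
    p = 1.0
    for var, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[var][pa_values][assignment[var]]
    return p

# Summing the factorized joint over all assignments gives 1.
total = sum(
    joint(dict(zip(parents, values)))
    for values in product([0, 1], repeat=len(parents))
)
example = joint({"Z": 0, "T1": 1, "T2": 1})  # 0.6 * 0.1 * 0.8
```

The compression is visible even here: the full joint table over three binary variables needs 7 free parameters, while the factorized form needs 5.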
10
BN application in text processing
  • Document classification
  • Document Clustering
  • Query Expansion

11
Hidden variable approaches
  • PLSA (Probabilistic Latent Semantic Analysis)
  • PHITS (Probabilistic Hyperlink Analysis)
  • Combined PLSA/PHITS
  • Assumption of a hidden variable expressing the
    topic of the document.
  • The topic probabilistically influences the
    appearance of the document (links in PHITS, terms
    in PLSA)

12
PLSA - concept
[Figure: network diagram with nodes D, Z (hidden variable) and T1, T2, ..., Tn]
  • Let N be the term-document matrix of word counts,
    i.e., $N_{ij}$ denotes how often a term (single
    word or phrase) $t_i$ occurs in document $d_j$.
  • Probabilistic decomposition into factors $z_k$
    ($1 \le k \le K$):
  • $P(t_i \mid d_j) = \sum_k P(t_i \mid z_k) P(z_k \mid d_j)$,
    with non-negative probabilities and two sets of
    normalization constraints:
  • $\sum_i P(t_i \mid z_k) = 1$ for all $k$, and
  • $\sum_k P(z_k \mid d_j) = 1$ for all $j$.

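A tiny numeric sketch may help make the decomposition concrete. The two parameter tables below are invented values for 3 terms, 2 factors and 2 documents; the only point is that, when both normalization constraints hold, the mixture P(t_i | d_j) is itself a distribution over terms for every document.

```python
# p_t_given_z[i][k] = P(t_i | z_k): each column (fixed k) sums to 1.
p_t_given_z = [
    [0.7, 0.1],
    [0.2, 0.3],
    [0.1, 0.6],
]
# p_z_given_d[k][j] = P(z_k | d_j): each column (fixed j) sums to 1.
p_z_given_d = [
    [0.9, 0.2],
    [0.1, 0.8],
]
n_terms, K, n_docs = 3, 2, 2

def p_t_given_d(i, j):
    """P(t_i | d_j) = sum over k of P(t_i | z_k) * P(z_k | d_j)."""
    return sum(p_t_given_z[i][k] * p_z_given_d[k][j] for k in range(K))

# Each document's term distribution sums to 1 as a consequence of the constraints.
column_sums = [sum(p_t_given_d(i, j) for i in range(n_terms)) for j in range(n_docs)]
```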
13
PLSA - concept
  • PLSA aims at maximizing
    $L = \sum_{i,j} N_{ij} \log \sum_k P(t_i \mid z_k) P(z_k \mid d_j)$.
  • Factors $z_k$ can be interpreted as states of a
    latent mixing variable associated with each
    observation (i.e., word occurrence).
  • The Expectation-Maximization (EM) algorithm can
    be applied to find a local maximum of $L$.
  • different factors usually capture distinct
    "topics" of a document collection
  • by clustering documents according to their
    dominant factors, useful topic-specific document
    clusters often emerge

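The following is a minimal sketch of the textbook EM iteration for PLSA on a toy count matrix, included to show how L is maximized and how dominant factors yield clusters. It is not the authors' implementation; the matrix `N`, the random initialization and the iteration count are assumptions, and the code assumes every document contains at least one term.

```python
import math
import random

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def init(n_terms, n_docs, K, seed=0):
    """Random positive start: p_tz[k][i] = P(t_i | z_k), p_zd[j][k] = P(z_k | d_j)."""
    rng = random.Random(seed)
    p_tz = [normalize([rng.random() + 0.1 for _ in range(n_terms)]) for _ in range(K)]
    p_zd = [normalize([rng.random() + 0.1 for _ in range(K)]) for _ in range(n_docs)]
    return p_tz, p_zd

def log_likelihood(N, p_tz, p_zd):
    """L = sum_ij N_ij log sum_k P(t_i | z_k) P(z_k | d_j), skipping zero counts."""
    K = len(p_tz)
    return sum(
        N[i][j] * math.log(sum(p_tz[k][i] * p_zd[j][k] for k in range(K)))
        for i in range(len(N)) for j in range(len(N[0])) if N[i][j] > 0
    )

def em_step(N, p_tz, p_zd):
    K, n_terms, n_docs = len(p_tz), len(N), len(N[0])
    new_tz = [[0.0] * n_terms for _ in range(K)]
    new_zd = [[0.0] * K for _ in range(n_docs)]
    for i in range(n_terms):
        for j in range(n_docs):
            if N[i][j] == 0:
                continue
            # E-step: posterior of the factor for this (term, document) pair.
            post = normalize([p_tz[k][i] * p_zd[j][k] for k in range(K)])
            # M-step accumulators: expected counts weighted by the posterior.
            for k in range(K):
                new_tz[k][i] += N[i][j] * post[k]
                new_zd[j][k] += N[i][j] * post[k]
    return [normalize(row) for row in new_tz], [normalize(row) for row in new_zd]

# Toy term-document counts (rows are terms, columns are documents).
N = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
p_tz, p_zd = init(len(N), len(N[0]), K=2)
history = [log_likelihood(N, p_tz, p_zd)]
for _ in range(30):
    p_tz, p_zd = em_step(N, p_tz, p_zd)
    history.append(log_likelihood(N, p_tz, p_zd))

# Cluster each document by its dominant factor.
dominant = [max(range(2), key=lambda k: p_zd[j][k]) for j in range(len(N[0]))]
```

EM only guarantees that L does not decrease from one iteration to the next, so different seeds may end in different local maxima.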
14
EM algorithm step 0
Z randomly initialized
15
EM algorithm step 1
BN trained
16
EM algorithm step 2
Z is sampled for each record according to the
probability distribution
$P(Z=1 \mid D=d, T_1=t_1, \ldots, T_n=t_n)$,
$P(Z=2 \mid D=d, T_1=t_1, \ldots, T_n=t_n)$, ...
Z sampled from BN
GOTO step 1 until convergence (Z assignment
stable)
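Steps 0 to 2 can be sketched as a loop. This is a simplified illustration, not the authors' code: it assumes the Bayesian network is a Naive Bayes structure with Z as the parent of binary term indicators, drops the document variable D, uses Laplace smoothing, and caps the number of rounds because sampled assignments need not ever become exactly stable. The function names and the toy `records` are hypothetical.

```python
import math
import random

def train(records, z, K, n_values=2):
    """Step 1: estimate P(Z) and P(T_i = t | Z) from the current assignment of Z."""
    n = len(records[0])
    prior = [(z.count(k) + 1) / (len(z) + K) for k in range(K)]
    cond = [[[1.0] * n_values for _ in range(n)] for _ in range(K)]  # Laplace pseudo-counts
    for rec, k in zip(records, z):
        for i, t in enumerate(rec):
            cond[k][i][t] += 1
    for k in range(K):
        for i in range(n):
            total = sum(cond[k][i])
            cond[k][i] = [c / total for c in cond[k][i]]
    return prior, cond

def posterior(rec, prior, cond):
    """P(Z = k | T_1 = t_1, ..., T_n = t_n) under the Naive Bayes assumption."""
    weights = [
        prior[k] * math.prod(cond[k][i][t] for i, t in enumerate(rec))
        for k in range(len(prior))
    ]
    total = sum(weights)
    return [w / total for w in weights]

def sampled_em(records, K, max_iter=100, seed=0):
    rng = random.Random(seed)
    z = [rng.randrange(K) for _ in records]          # step 0: random initialization
    for _ in range(max_iter):
        prior, cond = train(records, z, K)           # step 1: train the network
        new_z = [                                    # step 2: sample Z per record
            rng.choices(range(K), weights=posterior(rec, prior, cond))[0]
            for rec in records
        ]
        if new_z == z:                               # assignment stable
            break
        z = new_z
    return z

# Toy records: each row holds binary indicators for four terms.
records = [(1, 1, 0, 0), (1, 0, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1), (0, 0, 1, 0)]
z = sampled_em(records, K=2)
```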
17
The problem
  • Too many adjustable variables
  • Pre-defined clusters not identified
  • Long computation times
  • Instability

18
Solution
  • Our suggestion
  • Use the "sharp" version of Naive Bayes: each
    document is assigned to its most probable class
  • We were successful
  • Up to five classes well clustered
  • High speed (with 20,000 documents)

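One way to read the "sharp" variant is as hard-assignment clustering: instead of sampling or keeping soft memberships, each document goes to its most probable class and the model is re-estimated. The sketch below assumes a multinomial Naive Bayes model over term counts with Laplace smoothing; the function name, the toy data and the iteration cap are hypothetical, and it does not reproduce the reported results.

```python
import math
import random

def hard_nb_cluster(docs, K, max_iter=50, seed=0):
    """Cluster count vectors by alternating Naive Bayes training and argmax assignment."""
    rng = random.Random(seed)
    n_terms = len(docs[0])
    z = [rng.randrange(K) for _ in docs]  # random initial classes
    for _ in range(max_iter):
        # Train: class priors and per-class term distributions, Laplace-smoothed.
        log_prior = [math.log((z.count(k) + 1) / (len(docs) + K)) for k in range(K)]
        counts = [[1.0] * n_terms for _ in range(K)]
        for doc, k in zip(docs, z):
            for i, c in enumerate(doc):
                counts[k][i] += c
        log_cond = [[math.log(c / sum(row)) for c in row] for row in counts]

        def score(doc, k):
            # Log of P(Z = k) times the product of P(t_i | Z = k) over term occurrences.
            return log_prior[k] + sum(c * log_cond[k][i] for i, c in enumerate(doc))

        # Sharp step: every document goes to its single most probable class.
        new_z = [max(range(K), key=lambda k: score(doc, k)) for doc in docs]
        if new_z == z:
            break
        z = new_z
    return z

# Toy documents as term-count vectors over four terms.
docs = [[3, 2, 0, 0], [2, 3, 0, 1], [0, 0, 3, 2], [0, 1, 2, 3]]
labels = hard_nb_cluster(docs, K=2)
```

With hard assignments each round is a single pass over the counts, and the assignment can only change in whole steps, which is consistent with the speed and stability motivation on the slide.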
19
Next step
  • Naive Bayes assumes document and term
    independence
  • What if they are in fact dependent?
  • Our solution
  • TAN APPROACH
  • First we create a BN of terms/documents
  • Then assume there is a hidden variable
  • Promising results, but a deeper study is needed

20
PLSA: a model with term TAN
[Figure: network diagram with hidden variable Z, document nodes D1, D2, ..., Dk and term nodes T1 to T6]
21
PLSA: a model with document TAN
[Figure: network diagram with hidden variable Z and term nodes T1, T2, ..., Ti]