Xavier Polanco - PowerPoint PPT Presentation

About This Presentation
Title:

Xavier Polanco

Slides: 43
Provided by: xavierpola

Transcript and Presenter's Notes

Title: Xavier Polanco


1
Textual Information Clustering and Visualization
for Knowledge Discovery and Management
  • Xavier Polanco
  • URI-INIST-CNRS

2
Introduction
  • We are concerned with the design and development
    of computer-based information analysis tools
  • Cluster analysis, computational linguistics and
    artificial intelligence techniques are combined

3
On the technology side
  • A computer-based information analysis system is
    an integrated environment that assists a user in
    carrying out the complex process of converting
    information from textual data sources into
    knowledge

4
Information Analysis System
[Diagram labels: Lexicons or terminological resources; French or English text data; Dataset or corpus; Term extraction and indexation; Bibliometric statistics; Clustering and mapping; DBMS-R; WWW server; SDOC; HENOCH; NEURODOC; MIRIAD; ILC; Mac; PC; WS]
5
Home Pages
Intranet
Extranet
6
Plan
  • Text Mining
  • Cluster Analysis
  • Visualization or Mapping
  • Knowledge Discovery
  • Knowledge Management

7
Textual Information
  • A large amount of information is available in
    textual form in databases and online sources
  • In this context, manual analysis and effective
    extraction of useful information are not possible
  • It is relevant to provide automatic tools for
    analyzing large textual collections

8
Text Mining
  • Text mining consists of extracting information
    from hidden patterns in large text-data
    collections
  • The results can be important both for the
    analysis of the collection and for providing
    intelligent navigation and browsing methods

9
Process
  • The text mining process can be organized roughly
    into five major steps:
  • Data Selection
  • Term Extraction and Filtering
  • Data Clustering and Classification
  • Mapping or Visualization
  • Result Interpretation
  • Iterative and interactive process

10
Natural Language Processing
  • Experience shows that a linguistic engineering
    approach ensures higher performance of the data
    mining algorithms
  • Part-of-speech tagging (tagging texts) and
    lemmatization are generally accepted tasks

11
The approach
  • Our approach to text mining is based on
    extracting meaningful terms from documents
  • In this presentation, the focus is on the term
    extraction process and on the need to organize
    the generated terms in a taxonomy

12
The main tasks
  • Term extraction or acquisition
  • Indexation
  • Human control and screening
  • Indexing quality control
  • Index screening → clustering phase

13
Language Engineering
[Diagram labels: Natural Language Engineering System; Lexicons; Text-DB; Indexed Corpus; Lexicon management and linguistic processing of texts: part-of-speech tagging, lemmatization, and indexation]
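The indexation task named on this slide can be illustrated with a minimal Python sketch. The lexicon and the two sentences are invented for illustration, and `index_corpus` is a hypothetical helper. Matching is a naive lowercase whole-word lookup that leaves out the part-of-speech tagging and lemmatization mentioned in the slides, so this is not the system's actual method.

```python
import re

def index_corpus(texts, lexicon):
    """Return {doc_id: sorted list of lexicon terms found in the text}.

    Naive sketch: lowercase the text and search each lexicon term as a
    whole-word phrase. No tagging or lemmatization is performed.
    """
    index = {}
    for doc_id, text in texts.items():
        lowered = text.lower()
        found = []
        for term in lexicon:
            pattern = r"\b" + re.escape(term.lower()) + r"\b"
            if re.search(pattern, lowered):
                found.append(term)
        index[doc_id] = sorted(found)
    return index

# Hypothetical lexicon and texts, invented for illustration only.
LEXICON = ["antibiotic resistance", "plasmid", "apple juice", "chromatography"]
TEXTS = {
    "d1": "Antibiotic resistance can be carried by a plasmid.",
    "d2": "Chromatography was used to analyse apple juice samples.",
}

if __name__ == "__main__":
    for doc_id, terms in index_corpus(TEXTS, LEXICON).items():
        print(doc_id, terms)
```

Without lemmatization, a plural such as "plasmids" would not match the lexicon entry "plasmid", which is one reason the linguistic processing step matters before indexing.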
14
Variation
15
Taxonomy
  • A taxonomic structure should improve text mining
  • Considering the clustering techniques that might
    be used in text mining, one must be mindful that
    more taxonomic classification capabilities should
    be incorporated into text mining
  • A taxonomic classification capability might also
    facilitate cluster interpretation by giving the
    user some kind of rules

16
Clustering
  • Clustering is a descriptive task where one seeks
    to identify a finite set of categories
  • Clustering is used to segment a database into
    subsets or clusters
  • Clustering means finding the clusters themselves
    from a given set of data

17
Clustering Process
[Diagram: D(n,p) → similarity measures s(x,y) or dissimilarity measures d(x,y) → clustering algorithm → C(m,p)]
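The slide names s(x,y) and d(x,y) without saying which measures are used. As one possible reading, the Python sketch below pairs cosine similarity with Euclidean distance on two binary keyword vectors. The choice of measures and the vectors themselves are assumptions made for illustration.

```python
import math

def cosine_similarity(x, y):
    """s(x, y): cosine of the angle between two keyword vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

def euclidean_distance(x, y):
    """d(x, y): straight-line distance between two keyword vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two illustrative binary keyword vectors (assumed for this sketch).
x = [1, 0, 1, 0, 1, 1]
y = [1, 0, 0, 1, 0, 1]

if __name__ == "__main__":
    print(round(cosine_similarity(x, y), 3))
    print(round(euclidean_distance(x, y), 3))
```

A similarity grows as two objects look more alike, while a dissimilarity shrinks. A clustering algorithm can be driven by either, provided it is told which direction means "closer".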
18
Documents × Keywords
      KW1  KW2  KW3  KW4  KW5  KW6
D1     1    0    1    0    1    1
D2     1    0    1    0    1    1
D3     0    1    0    1    0    0
D4     1    0    0    1    0    1
Di × KWj: {1, 0} or {1, 2, ..., n}
C1 = ({D1, D2}, {KW1, KW3, KW5, KW6})
C2 = ({D4}, {KW1, KW4, KW6})
C3 = ({D3}, {KW2, KW4})
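The slide does not say which algorithm produced C1, C2 and C3. As a sketch under assumptions, the Python code below groups the four documents with Jaccard similarity and a single-link threshold of 0.5, both chosen only for illustration. With these choices it happens to recover the same three groups.

```python
# Keyword sets read off the documents × keywords table of this slide.
DOCS = {
    "D1": {"KW1", "KW3", "KW5", "KW6"},
    "D2": {"KW1", "KW3", "KW5", "KW6"},
    "D3": {"KW2", "KW4"},
    "D4": {"KW1", "KW4", "KW6"},
}

def jaccard(a, b):
    """Shared keywords divided by all keywords used by either document."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def threshold_clusters(docs, threshold):
    """Single-link grouping: two documents share a cluster when a chain
    of pairs with similarity >= threshold connects them."""
    clusters = []
    for name in sorted(docs):
        linked = [c for c in clusters
                  if any(jaccard(docs[name], docs[m]) >= threshold for m in c)]
        merged = [name]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(sorted(merged))
    return sorted(clusters)

def describe(docs, cluster):
    """Keywords shared by every document of the cluster."""
    return sorted(set.intersection(*(docs[d] for d in cluster)))

if __name__ == "__main__":
    for cluster in threshold_clusters(DOCS, 0.5):
        print(cluster, describe(DOCS, cluster))
```

Lowering the threshold would start merging D4 with other documents, so the number of clusters depends directly on this assumed parameter.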
19
Clustering Algorithms
  • Major families of clustering methods:
  • Sequential algorithms
  • Hierarchical algorithms
      • Agglomerative algorithms
      • Divisive algorithms
  • Fuzzy clustering algorithms
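To make the first family concrete, here is a minimal Python sketch of a basic sequential scheme, sometimes called the leader algorithm. The Hamming distance, the threshold of 2, and the use of the first member as cluster representative are all assumptions of this sketch. The vectors are the rows of the documents × keywords table shown earlier.

```python
def hamming(x, y):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(x, y))

def sequential_clustering(vectors, max_distance):
    """One pass over the data: each vector joins the nearest existing
    cluster if its representative is within max_distance, otherwise it
    starts a new cluster. The representative is the first member."""
    leaders = []   # representative vector of each cluster
    clusters = []  # member names of each cluster
    for name, vec in vectors:
        best, best_d = None, None
        for i, leader in enumerate(leaders):
            d = hamming(vec, leader)
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is not None and best_d <= max_distance:
            clusters[best].append(name)
        else:
            leaders.append(vec)
            clusters.append([name])
    return clusters

VECTORS = [
    ("D1", [1, 0, 1, 0, 1, 1]),
    ("D2", [1, 0, 1, 0, 1, 1]),
    ("D3", [0, 1, 0, 1, 0, 0]),
    ("D4", [1, 0, 0, 1, 0, 1]),
]

if __name__ == "__main__":
    print(sequential_clustering(VECTORS, max_distance=2))
```

Sequential schemes read each item once, so they are cheap, but their result can depend on the order in which the documents are presented. Hierarchical methods avoid that dependence at a higher computational cost.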

20
Information Analysis Process
  • The text-data information analysis is divided
    into two phases
  • Cluster generation
  • Map display of clusters
  • A hypertext user interface enables the analyst to
    explore and interpret results

21
Example
[Diagram labels: Antibiotic Resistance; 2 DB; 4025 documents (1998-1999); Data; 30; Medicine; Molecular Biology; Hypertext; Clusters; Map]
22
Information Visualization
  • Definition: The use of computer-supported,
    interactive, visual representation of abstract
    data to amplify the acquisition or use of
    knowledge (Card et al., 1999)
  • Visual artifacts aid human thought
  • The progress of civilization can be read in the
    invention of visual artifacts, from writing to
    mathematics, to maps, to diagrams, to visual
    computing

23
Process
  • Raw Data → Data Tables
  • Data Tables → Clustering
  • Clustering → Visual Structures (Map)
  • Visual Structures → Views

24
Visual Structures
  • Data Tables are mapped to Visual Structures,
    which augment a spatial substrate with marks and
    graphical properties to encode information
  • A Graphic Representation is said to be expressive
    if all and only the data in the Data Table are
    also represented in the Visual Structure
  • A Graphic Representation is said to be more
    effective if it is faster to interpret

25
Map Display
  • We are concerned with map display of the clusters
  • A problem of particular interest is how to
    visualize a data set with many variables
  • Multivariate data are clustered, and
  • The clusters are then mapped

26
Mapping tools
  • For mapping, we use the following techniques
  • Density and Centrality Diagrams
  • Principal Component Analysis (PCA)
  • Multi-Layer Perceptrons (MLP)
  • Self-Organizing Maps (SOM)
  • Multi-SOMs
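As an illustration of one technique in this list, the following is a minimal pure-Python sketch of a self-organizing map. The grid size, learning-rate schedule, neighbourhood width, epoch count and random seed are arbitrary assumptions, and the code does not reproduce the NEURODOC or Multi-SOM implementations. The input vectors are the four document rows of the documents × keywords table.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid node whose weight vector is closest to x (squared Euclidean)."""
    return min(weights,
               key=lambda node: sum((w - v) ** 2
                                    for w, v in zip(weights[node], x)))

def train_som(data, rows, cols, epochs=200, lr0=0.5, sigma0=1.0, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                  # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)  # shrinking neighbourhood
        for x in data:
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(dim):
                    # Pull the node towards x, more strongly near the BMU.
                    w[i] += lr * h * (x[i] - w[i])
    return weights

DATA = [
    [1, 0, 1, 0, 1, 1],  # D1
    [1, 0, 1, 0, 1, 1],  # D2
    [0, 1, 0, 1, 0, 0],  # D3
    [1, 0, 0, 1, 0, 1],  # D4
]

if __name__ == "__main__":
    som = train_som(DATA, rows=3, cols=3)
    for name, vec in zip(["D1", "D2", "D3", "D4"], DATA):
        print(name, best_matching_unit(som, vec))
```

Each document is displayed at the grid position of its best matching unit, so documents with the same keyword profile, such as D1 and D2, always land on the same node.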

27
Multi-Layer Perceptron 1
[Figure: MLP with inputs x1 ... xi ... xp, weights Wcij in Wc(p,2) and Wsjk in Ws(2,p), and outputs s1 ... sk ... sp; map labels: ISEs-x2, prion, proteins, scrapie, human disease, spongiform encephalopathy, mankind, CJD]
28
Multi-Layer Perceptron 2
[Figure: map labels: protein, infection resistance, Agrobacterium, plasmids]
29
Multi-SOM Platform
30
Multi-Self-Organizing Map Display
Maps associated with 5 viewpoints:
  • Map 1: Plants
  • Map 2: Plant Parts
  • Map 3: Pathogen Agents
  • Map 4: Genetic Techniques
  • Map 5: Patenting Firms
[Figure: maps 5, 4, 2 and 1, with the Rice area activated]
Use of the inter-map communication mechanism
31
Knowledge Discovery
  • KD is informally defined as the extraction of
    useful knowledge from databases or large amounts
    of data
  • One of the most important research topics in KD
    is rule discovery or extraction
  • The discovered knowledge is usually expressed in
    the form of "if-then" rules

32
Association Rules
  • Association rules can be seen as one of the key
    tasks of KDD
  • The intuitive meaning of an association rule
    X → Y, where X and Y are keywords or descriptors,
    is that a document containing keyword X is likely
    to also contain keyword Y

33
Example
  • In a given food-industry corpus:
  • 98% of the documents dealing with apple juice
    relate it to the chromatography analytical
    technique
  • X → Y: apple juice → chromatography
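The proportion quoted in this example corresponds to what is usually called the confidence of a rule: among the documents that contain X, the share that also contain Y. The Python sketch below computes support and confidence on a hypothetical four-document toy corpus. The keyword sets are invented and do not reproduce the food-industry corpus or its figures.

```python
def support(docs, items):
    """Fraction of documents containing every keyword in items."""
    items = set(items)
    return sum(items <= d for d in docs) / len(docs)

def confidence(docs, x, y):
    """Among documents containing X, the share that also contain Y."""
    sx = support(docs, x)
    if sx == 0:
        raise ValueError("the premise never occurs in the corpus")
    return support(docs, set(x) | set(y)) / sx

# Hypothetical toy corpus: each document is a set of keywords.
CORPUS = [
    {"apple juice", "chromatography"},
    {"apple juice", "chromatography", "sugar"},
    {"apple juice", "fermentation"},
    {"milk", "chromatography"},
]

if __name__ == "__main__":
    x, y = {"apple juice"}, {"chromatography"}
    print("support(X)     =", support(CORPUS, x))
    print("confidence(XY) =", round(confidence(CORPUS, x, y), 3))
```

Confidence is not symmetric: the rule chromatography → apple juice is evaluated against the documents containing chromatography, so it can have a different value from apple juice → chromatography.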

34
The Galois Lattice
  • Our current research includes an approach based
    on the lattice structure to discover concepts and
    rules relating the objects (documents) and their
    properties (keywords)
  • The Galois lattice approach is also known as
    conceptual clustering

35
The concept lattice
Given the context (D1, T1), where D1 = {d1, d2, d3, d4} and T1 = {t1, t2, t3, t4, t5, t6}
Table: The input relation R (documents × keywords)
R     t1  t2  t3  t4  t5  t6
d1     1   0   1   0   1   1
d2     1   0   1   0   1   1
d3     0   1   0   1   0   0
d4     1   0   0   1   0   1
Hasse Diagram
C1 = (D1, Ø)
C2 = ({d1, d2, d4}, {t1, t6})
C3 = ({d3, d4}, {t4})
C4 = ({d1, d2}, {t1, t3, t5, t6})
C5 = ({d4}, {t1, t4, t6})
C6 = ({d3}, {t2, t4})
C7 = (Ø, T1)
The formal concept C4 has two own terms, {t3, t5}, and two inherited terms, {t1, t6}
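The seven concepts listed on this slide can be recomputed from the relation R. The Python sketch below uses a brute-force enumeration over all document subsets, which is only practical for a tiny context like this one. The function names are hypothetical, and this is not presented as the algorithm used by the author.

```python
from itertools import combinations

# The input relation R from this slide: document -> set of its terms.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}
TERMS = {"t1", "t2", "t3", "t4", "t5", "t6"}

def intent(docs):
    """Terms shared by all given documents (all terms for the empty set)."""
    shared = set(TERMS)
    for d in docs:
        shared &= R[d]
    return frozenset(shared)

def extent(terms):
    """Documents that contain every one of the given terms."""
    return frozenset(d for d, ts in R.items() if terms <= ts)

def formal_concepts():
    """All (extent, intent) pairs that are closed under both mappings."""
    concepts = set()
    names = sorted(R)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            i = intent(subset)
            concepts.add((extent(i), i))
    return concepts

if __name__ == "__main__":
    for ext, itt in sorted(formal_concepts(), key=lambda c: -len(c[0])):
        print(sorted(ext), sorted(itt))
```

Each pair is a set of documents together with exactly the terms they all share, which is why the approach doubles as a form of conceptual clustering.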
36
Association Rules Extraction
  • The formal concept C4 makes the following rules
    possible:
  • R1: t3 → t1 ∧ t6
  • R2: t5 → t1 ∧ t6
  • R3: t3 ↔ t5
  • Interpretation of R1 and R2: the use of term t3
    or t5 is always associated with that of terms t1
    and t6
  • Rule R3 expresses the mutual equivalence of the
    terms t3 and t5: all the documents which have the
    term t3 also have the term t5, and vice versa
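The three rules can be checked directly against the input relation R of the concept lattice slide. In this Python sketch, a rule holds when every document containing the premise terms also contains the conclusion terms. The helper name `holds` is an assumption of this sketch.

```python
# The input relation R from the concept lattice slide.
R = {
    "d1": {"t1", "t3", "t5", "t6"},
    "d2": {"t1", "t3", "t5", "t6"},
    "d3": {"t2", "t4"},
    "d4": {"t1", "t4", "t6"},
}

def holds(premise, conclusion):
    """True when every document containing all premise terms also
    contains all conclusion terms."""
    premise, conclusion = set(premise), set(conclusion)
    return all(conclusion <= terms
               for terms in R.values() if premise <= terms)

if __name__ == "__main__":
    print("R1:", holds({"t3"}, {"t1", "t6"}))
    print("R2:", holds({"t5"}, {"t1", "t6"}))
    print("R3:", holds({"t3"}, {"t5"}) and holds({"t5"}, {"t3"}))
    print("t1 -> t3:", holds({"t1"}, {"t3"}))
```

The last check shows that the rules do not run backwards: d4 contains t1 without t3, so the inherited terms do not imply the own terms of C4.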

37
Summary
Text Mining
Clustering
Mapping
Knowledge Discovery
38
Knowledge Management
  • A knowledge management system is concerned with
    the identification, acquisition, development,
    diffusion, use, and preservation of the
    enterprise's knowledge

39
KM Objectives
  • Using advanced technology
  • For facilitating creation, access, and reuse of
    knowledge
  • For converting knowledge from the sources
    accessible to an organization and connecting
    people with that knowledge

40
Project
  • Adding to the information analysis system a
    formalized operator for processing together:
  • The knowledge that is extracted from databases
  • The knowledge that the experts produce when they
    analyze the clusters, maps, concepts and rules

41
We have reached our last subject, but not the end!
42
Merci
Gracias
Obrigado
Thanks
Xavier Polanco