Database Clustering and Summary Generation - PowerPoint PPT Presentation

About This Presentation

Title:

Database Clustering and Summary Generation

Description:

(store-name = flea-market')) 'Typically, members in the cluster have spent more than $1,000 cash for shopping in a flea-market' Summary Representation ... – PowerPoint PPT presentation

Slides: 30

Provided by: eick

Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes

Title: Database Clustering and Summary Generation

1
Database Clustering and Summary Generation

Tae-Wan Ryu and Christoph F. Eick

2
Similarity Measures For Multi-valued Attributes
for Database Clustering

Tae-Wan Ryu and Christoph F. Eick
Department of Computer Science
University of Houston
Talk Organization
Database Clustering
Problems of Database Clustering
Extended Data Sets
Similarity Measures for Sets and Bags
An Architecture for Database Clustering
Summary and Conclusion

3
General KDD Steps
Data sources -> [Select/preprocess] -> Selected/Preprocessed data
-> [Transform] -> Transformed data
-> [Data mine] -> Extracted information
-> [Interpret/Evaluate/Assimilate] -> Knowledge
Data preparation
4
Research Goal

To develop methodologies, techniques, and tools
to create summaries from databases using cluster
analysis and genetic programming
Our approach
Partition the database into groups of similar
objects using cluster analysis
Find commonalities that objects belonging to each
group share using genetic programming

5
Database Summary Generation Steps and Example
<Steps>
Database
Database Clustering
Clusters: groups of similar objects
Summary Generation
Summaries describing the commonalities within
each group
<Example>
Restaurant database
Clusters: white collar, retired, young
Summaries: midnight, dinner, lunch
6
An Example Schema Diagram
7
Preprocessing for Database Clustering

Preparing input data sets for clustering
Appropriate data selection and preparation from a
database is an important task
Key Problems
How to support a user's viewpoint, including
attribute selection
Data model discrepancy between the storage format and
the input format that clustering algorithms
assume
How to cope with structural information,
especially 1:n and n:m relationships

8
Input Format for Data Mining Algorithms

Data Format for Input Data Sets
Single flat file format (basically, the data set
has to be stored as a single(!) relation)
Complex and structured formats
Problem: Almost all existing data mining and
clustering approaches assume that the input data set
is in a single flat file format.

9
An Example Database to Illustrate the Problems
with Relationship Information in Database
Clustering

(a) An example Personal relational database (cardinality ratio 1:n)

Person
ssn        name   age  sex
111111111  Johny  43   M
222222222  Andy   21   F
333333333  Post   67   M
444444444  Jenny  35   F

Purchase
ssn        location   ptype  amount  date
111111111  Warehouse  1      400     02-10-96
111111111  Grocery    2      70      05-14-96
111111111  Mall       3      200     12-24-96
222222222  Mall       2      300     12-23-96
222222222  Grocery    3      100     06-22-96
333333333  Mall       1      30      11-05-96

(b) Joined result: a joined table from the Person and Purchase relations

name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      300     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

ptype (payment type): 1 for cash, 2 for credit, and 3 for check.
10
Existing Approaches

Applying aggregate functions or generalization
operators to convert a multi-valued attribute
into a single-valued attribute.
Problems
User has to make a critical decision (e.g., which
aggregate function to use?)
Valuable related information may be lost.
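
A minimal sketch of the information loss mentioned above, using made-up purchase amounts (the customer names and values are illustrative assumptions, not data from the slides). Two customers with very different purchase histories become indistinguishable once SUM collapses the multi-valued attribute into a single value.

```python
# Hypothetical purchase amounts per customer (illustrative values only).
purchases = {
    "customer_a": [400, 70, 200],  # three separate purchases
    "customer_b": [670],           # one large purchase
}

# Collapse the multi-valued attribute with an aggregate function (SUM).
totals = {name: sum(amounts) for name, amounts in purchases.items()}

# After aggregation the two customers look identical...
same_total = totals["customer_a"] == totals["customer_b"]
# ...although their underlying bags of values differ.
same_bag = sorted(purchases["customer_a"]) == sorted(purchases["customer_b"])

print(totals, same_total, same_bag)
```

A different aggregate (for example MAX or COUNT) would separate these two customers but would merge others, which is the "critical decision" the slide refers to.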

11
Extended Data Sets
name   age  sex  ptype  amount  location
Johny  43   M    1      400     Mall
Johny  43   M    2      70      Grocery
Johny  43   M    3      200     Warehouse
Andy   21   F    2      100     Mall
Andy   21   F    3      100     Grocery
Post   67   M    1      30      Mall
Jenny  35   F    null   null    null

name   age  sex  p.ptype  p.amount    p.location
Johny  43   M    1,2,3    400,70,200  Mall, Grocery, Warehouse
Andy   21   F    2,3      100,100     Mall, Grocery
Post   67   M    1        30          Mall
Jenny  35   F    null     null        null
A converted table with a bag of values
How to measure similarity between bags of values?

Group similarity measures are needed.
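
The conversion from the joined table to the table with bags of values could be sketched as follows. This is only an illustration: the function name `to_extended` is hypothetical, records are keyed by name for brevity (a real tool would presumably use the primary key), and a person without purchases gets empty bags where the slide writes null.

```python
# Rows of the joined table: (name, age, sex, ptype, amount, location)
joined = [
    ("Johny", 43, "M", 1, 400, "Mall"),
    ("Johny", 43, "M", 2, 70, "Grocery"),
    ("Johny", 43, "M", 3, 200, "Warehouse"),
    ("Andy", 21, "F", 2, 100, "Mall"),
    ("Andy", 21, "F", 3, 100, "Grocery"),
    ("Post", 67, "M", 1, 30, "Mall"),
    ("Jenny", 35, "F", None, None, None),
]

def to_extended(rows):
    """Collapse joined rows into one record per person.

    Single-valued attributes are kept once; multi-valued attributes
    become bags (lists that may contain duplicates).
    """
    extended = {}
    for name, age, sex, ptype, amount, location in rows:
        record = extended.setdefault(
            name,
            {"age": age, "sex": sex, "p.ptype": [], "p.amount": [], "p.location": []},
        )
        if ptype is not None:  # rows produced by an outer join carry no purchase
            record["p.ptype"].append(ptype)
            record["p.amount"].append(amount)
            record["p.location"].append(location)
    return extended

extended = to_extended(joined)
for name, record in extended.items():
    print(name, record)
```

Lists are used rather than sets so that duplicates such as Andy's two purchases of 100 are preserved, which is what distinguishes a bag from a set.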

12
Approaches for Database Clustering
<Current approach>
Structured database -> Manual transformation -> Flat file -> Clustering algorithms
<Proposed approach>
Structured database -> Automated preprocessing -> Extended data set -> Generalized clustering algorithms
13
Related Work

LABYRINTH (Thompson et al.)
Ketterlin's extended COBWEB
KATE (Manago et al.)
SUBDUE (Holder et al.)
INLEN (Ribeiro et al.)
KBG (Bisson et al.), KLUSTER (Kietz et al.)

14
Research Objectives for Database Clustering

To alleviate the representational gap between
databases on the one hand and input formats of
clustering algorithms on the other hand
To design and implement semi-automatic tools to
facilitate database clustering
To generalize clustering algorithms

15
Generating Extended Data Sets From a Structured
Database
Database d1, d2, ..., dn
User's interests and objectives
Extended data set generator
Extended data set1
16
A Unified Similarity Measure for Clustering
Extended Data Sets

Group Similarity Measures
Mixed types: qualitative and quantitative types.
Qualitative type: Tversky's set-theoretical
similarity models.
Contrast model
$S(a,b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$,
where a and b are two objects, A and B denote
their sets of features, $\theta, \alpha, \beta \geq 0$, and f is
the cardinality of the set
Ratio model (e.g., normalized similarity)
$S(a,b) = \frac{f(A \cap B)}{f(A \cap B) + \alpha f(A - B) + \beta f(B - A)}$,
$\alpha, \beta \geq 0$
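
The two Tversky models can be written down directly for sets of qualitative values. The sketch below is illustrative only: the function names and the default parameter values are assumptions, and treating two empty sets as fully similar is a convention chosen here to avoid division by zero.

```python
def contrast_similarity(A, B, theta=1.0, alpha=1.0, beta=1.0):
    """Tversky contrast model with f taken as set cardinality."""
    A, B = set(A), set(B)
    return theta * len(A & B) - alpha * len(A - B) - beta * len(B - A)

def ratio_similarity(A, B, alpha=1.0, beta=1.0):
    """Tversky ratio model with f taken as set cardinality (result in [0, 1])."""
    A, B = set(A), set(B)
    common = len(A & B)
    denominator = common + alpha * len(A - B) + beta * len(B - A)
    # Assumption: two empty feature sets are treated as identical.
    return common / denominator if denominator else 1.0

johny_locations = {"Mall", "Grocery", "Warehouse"}
andy_locations = {"Mall", "Grocery"}

print(contrast_similarity(johny_locations, andy_locations))  # 2 - 1 - 0 = 1.0
print(ratio_similarity(johny_locations, andy_locations))     # 2 / (2 + 1 + 0)
```

With alpha = beta = 1 the ratio model reduces to the Jaccard coefficient, since the denominator then equals the size of the union of A and B.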

17
Group Similarity Measures... continued

Quantitative type: group average
Group average between groups A and B:
$d(A,B) = \frac{1}{n} \sum_{i=1}^{n} d(a,b)_i$,
where n is the total number of object pairs and
$d(a,b)_i$ is the dissimilarity measure for the ith
pair of objects a and b, with
$a \in A$, $b \in B$.
The group average is obtained by taking the average of all
the inter-object measures over those pairs of objects
in which the two objects belong to different groups.
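
A small sketch of the group average for two bags of numbers. The use of absolute difference as the pairwise dissimilarity d(a,b) is an assumption made for illustration; the slide does not fix a particular d.

```python
from itertools import product

def group_average(A, B, d=lambda a, b: abs(a - b)):
    """Average pairwise dissimilarity over all pairs (a, b) with a in A and b in B."""
    pairs = list(product(A, B))  # n = len(A) * len(B) object pairs
    if not pairs:
        raise ValueError("both groups must be non-empty")
    return sum(d(a, b) for a, b in pairs) / len(pairs)

johny_amounts = [400, 70, 200]
andy_amounts = [100, 100]

# Pairwise distances: 300, 300, 30, 30, 100, 100 -> 860 / 6
print(group_average(johny_amounts, andy_amounts))
```

Because every cross-group pair contributes, duplicates in a bag (such as Andy's two purchases of 100) influence the result, unlike in a set-based measure.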

18
A Framework for Mixed Type Similarity Measures
for Extended Data Sets

Gower's similarity measure for data sets with
mixed types.
Extended similarity measure for multi-valued data
sets with mixed types,
where m = l + q. The functions $s_l(a,b)$ and
$s_q(a,b)$ are similarity functions for qualitative
attributes and quantitative attributes,
respectively.
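
The formula itself is not preserved in this transcript, so the following is only one plausible reading: a Gower-style weighted average over l qualitative and q quantitative attributes. Everything specific here is an assumption for illustration: the function names, the choice of the ratio model (Jaccard) for s_l, the range-normalized group average for s_q, the weights, and the value range of 400 for amounts.

```python
from itertools import product

def s_qualitative(A, B):
    """Assumed s_l: ratio-model (Jaccard) similarity between two sets of values."""
    A, B = set(A), set(B)
    union = A | B
    return len(A & B) / len(union) if union else 1.0

def s_quantitative(A, B, value_range):
    """Assumed s_q: 1 minus the range-normalized group-average distance.

    Both bags are assumed to be non-empty.
    """
    pairs = list(product(A, B))
    average = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 - average / value_range

def mixed_similarity(a, b, qualitative, quantitative, weights):
    """Weighted average of per-attribute similarities over all m = l + q attributes."""
    numerator = 0.0
    denominator = 0.0
    for attr in qualitative:                         # the l qualitative attributes
        numerator += weights[attr] * s_qualitative(a[attr], b[attr])
        denominator += weights[attr]
    for attr, value_range in quantitative.items():   # the q quantitative attributes
        numerator += weights[attr] * s_quantitative(a[attr], b[attr], value_range)
        denominator += weights[attr]
    return numerator / denominator

johny = {"p.location": ["Mall", "Grocery", "Warehouse"], "p.amount": [400, 70, 200]}
andy = {"p.location": ["Mall", "Grocery"], "p.amount": [100, 100]}

similarity = mixed_similarity(
    johny, andy,
    qualitative=["p.location"],
    quantitative={"p.amount": 400.0},  # assumed value range of the attribute
    weights={"p.location": 1.0, "p.amount": 1.0},
)
print(similarity)
```

Raising the weight of one attribute pulls the overall similarity toward that attribute's score, which is how a user's viewpoint could enter the measure.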

19
Clustering Algorithms for Extended Data Sets

Nearest-neighbor clustering
DBSCAN
Leader algorithm
Hierarchical clustering

20
Database Clustering Environment
Tools: User Interface, Data Extraction Tool, Similarity Measure Tool, Clustering Tool
Libraries: library of similarity measures, library of clustering algorithms
Inputs: DBMS, type and weight information, default choice and domain information
Intermediate results: extended data set, similarity measure
Output: a set of clusters
21
A More Detailed Tool Architecture
22
A Join Template Form
Begin-spec
  Database-name DB
  Link-definitions Link-list
  Begin-join
    Dataset-of-interest Dsetintrest
    Selected-attributes Attr-list
    Objective-attributes Obj-attr-list
    Extended-data-set E
  End-join
End-spec
23
An Example of the Interface of the Extended Data
Set Generation Tool
Begin-spec
  DB-name Company
  Link-definitions
    superv(Employee.ssn, Employee.superssn),
    husband(Employee.ssn, Marriage.hssn),
    wife(Employee.ssn, Marriage.wssn),
    ehusband(Marriage.hssn, Employee.ssn),
    ewife(Marriage.wssn, Employee.ssn),
    works_on(Employee.ssn, Works_on.essn),
    project(Works_on.pno, Project.pnum),
    works_for(Employee.dno, Department.dnum),
    works_loc(Department.dnum, Dept_loc.dnum)
  Begin-join
    Dataset-of-interest Employee
    Selected-attributes ssn, sex, salary,
      superv.salary, wife.ewife.salary,
      works_on.hours, works_on.project.pname,
      works_for.works_loc.dloc
    Objective-attributes ssn
    Output-data-set E1
  End-join
End-spec
24
Algorithm to Generate Extended Data Sets

Project the Data Set of Interest by Primary key
and Selected Attributes
Join the Data set of Interest and related data
sets to get all related attributes for each
join-path
Group attributes together that describe the same
object
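
The three steps above can be sketched against a small in-memory SQLite database. This is an illustration under stated assumptions, not the authors' tool: the schema is a cut-down, hypothetical version of the Person/Purchase example, there is a single join path, and the attribute names `purchase.location` and `purchase.amount` are made up for the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (ssn TEXT PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE purchase (ssn TEXT, location TEXT, amount INTEGER);
INSERT INTO person VALUES
  ('111111111', 'Johny', 43), ('222222222', 'Andy', 21), ('444444444', 'Jenny', 35);
INSERT INTO purchase VALUES
  ('111111111', 'Warehouse', 400), ('111111111', 'Grocery', 70),
  ('222222222', 'Mall', 300);
""")

# Step 1: project the data set of interest by primary key and selected attributes.
extended = {
    ssn: {"name": name, "age": age, "purchase.location": [], "purchase.amount": []}
    for ssn, name, age in conn.execute("SELECT ssn, name, age FROM person")
}

# Step 2: join the data set of interest with the related data set along the join path.
related = conn.execute("""
    SELECT p.ssn, u.location, u.amount
    FROM person AS p JOIN purchase AS u ON u.ssn = p.ssn
    ORDER BY u.rowid
""").fetchall()

# Step 3: group the related attribute values that describe the same object.
for ssn, location, amount in related:
    extended[ssn]["purchase.location"].append(location)
    extended[ssn]["purchase.amount"].append(amount)

for ssn, record in extended.items():
    print(ssn, record)
conn.close()
```

Starting from the projection in step 1 rather than from the join keeps objects without related rows (Jenny here) in the extended data set, with empty bags.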

25
Summary Representation

Our approach uses database queries as our summary
representation language.
Queries that compute the objects belonging to a
cluster and no other objects are considered to be
perfect summaries for a cluster.
An example query for a cluster
(SELECT ssn, name, address
 FROM person, purchase
 WHERE (amount-spent > 1000) and
       (payment-type = 'cash') and
       (store-name = 'flea-market'))
"Typically, members in the cluster have spent
more than $1,000 cash for shopping in a flea-market"
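
Whether a query is a perfect summary can be checked by comparing its result with the cluster's members. In the sketch below the exact-match test follows the definition above; the precision and recall values are an added assumption, offered as one possible way to grade imperfect summaries, and the object identifiers are made up.

```python
def summary_quality(query_result, cluster):
    """Compare the objects a query returns with the members of a cluster.

    Returns (precision, recall, is_perfect), where is_perfect is True only if
    the query returns exactly the cluster's members and no other objects.
    """
    query_result, cluster = set(query_result), set(cluster)
    hits = query_result & cluster
    precision = len(hits) / len(query_result) if query_result else 0.0
    recall = len(hits) / len(cluster) if cluster else 0.0
    return precision, recall, query_result == cluster

cluster = {"111", "222", "333"}             # hypothetical ssn values in one cluster
imperfect = summary_quality({"111", "222", "444"}, cluster)
perfect = summary_quality({"111", "222", "333"}, cluster)

print(imperfect)  # two of three returned objects belong to the cluster
print(perfect)    # (1.0, 1.0, True)
```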

26
Summary and Contributions

Discussed the data model discrepancy between
database storage format and input data format for
traditional clustering algorithms
Discussed the problems of dealing with
relationship information in database clustering
Presented a different way of representing related
information using extended data sets
Introduced the design and architecture of
automatic tools to generate extended data sets
from databases
Generalized the traditional similarity measures
and presented a framework to cope with extended
data sets in similarity-based clustering

27
Architecture of MASSON

Components: Clustering module, user interface, GP based discovery system (Interface, GP engine, KB), DBMS, DB
Data: Schema information, Object set, clusters g1, g2, ..., gk, Domain knowledge, Query set, Query result, Discovered query set
Flow labels: cluster, generate, apply, return, evaluate, select
Inputs: user input, system input
28
Evolution Process
Initial generation: initial population q11, q12, ..., q1m
  evolve (selection, crossover, mutation)
Generation 2: evolved population q21, q22, ..., q2m
  evolve (selection, crossover, mutation)
Generation n: evolved population qn1, qn2, ..., qnm
  selection
Solution Q
n = number of generations, m = size of the population
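
The evolution loop can be illustrated with a deliberately simplified stand-in. The sketch below is a plain genetic algorithm over conjunctions of fixed conditions, not genetic programming over query trees as in the presented system. The objects, the condition set, the Jaccard-based fitness, and all parameter values are assumptions made for illustration.

```python
import random

# Hypothetical objects: (id, amount_spent, payment_type, store_name)
OBJECTS = [
    (1, 1500, "cash", "flea-market"),
    (2, 1200, "cash", "flea-market"),
    (3, 300, "credit", "mall"),
    (4, 1800, "check", "mall"),
    (5, 90, "cash", "grocery"),
]
CLUSTER = {1, 2}  # ids of the cluster to be summarized

# A query is a conjunction (frozenset) of condition names.
CONDITIONS = {
    "amount>1000": lambda o: o[1] > 1000,
    "payment=cash": lambda o: o[2] == "cash",
    "store=flea-market": lambda o: o[3] == "flea-market",
    "store=mall": lambda o: o[3] == "mall",
}
NAMES = sorted(CONDITIONS)

def run_query(query):
    """Return the ids of all objects that satisfy every condition in the query."""
    return {o[0] for o in OBJECTS if all(CONDITIONS[c](o) for c in query)}

def fitness(query):
    """Jaccard overlap between query result and cluster (1.0 = perfect summary)."""
    result = run_query(query)
    union = result | CLUSTER
    return len(result & CLUSTER) / len(union) if union else 0.0

def crossover(q1, q2, rng):
    """Child keeps each condition found in either parent with probability 1/2."""
    return frozenset(c for c in sorted(q1 | q2) if rng.random() < 0.5)

def mutate(query, rng):
    """Toggle one randomly chosen condition."""
    return query ^ {rng.choice(NAMES)}

def evolve(n_generations=20, m=8, seed=0):
    """Evolve a population of m queries for n_generations and return the best."""
    rng = random.Random(seed)
    population = [frozenset(rng.sample(NAMES, rng.randint(0, 2))) for _ in range(m)]
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: m // 2]  # selection: keep the fitter half
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(m - len(parents))
        ]
        population = parents + children  # parents survive, so the best never gets worse
    return max(population, key=fitness)

best = evolve()
print(sorted(best), fitness(best), run_query(best))
```

Keeping the selected parents in the next generation is a design choice made here so that the best fitness cannot decrease between generations; the slide does not say whether the original system does this.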
