Data Warehousing/Mining Comp 150 DW Chapter 5: Concept Description: Characterization and Comparison

About This Presentation

Slides: 60

Provided by: csTuftsE

Learn more at: https://www.cs.tufts.edu

Transcript and Presenter's Notes

1
Data Warehousing/Mining Comp 150 DW Chapter 5:
Concept Description: Characterization and Comparison

Instructor: Dan Hebert

2
Chapter 5: Concept Description: Characterization and Comparison

What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different classes
Mining descriptive statistical measures in large databases
Discussion
Summary

3
What is Concept Description?

Descriptive vs. predictive data mining
Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, discriminative forms
Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
Concept description
Characterization: provides a concise and succinct summarization of the given collection of data
Comparison: provides descriptions comparing two or more collections of data

4
Concept Description vs. OLAP

Concept description
can handle complex data types of the attributes and their aggregations
is a more automated process
OLAP
is restricted to a small number of dimension and measure types
is a user-controlled process

5
Data Generalization and Summarization-based
Characterization

Data generalization
A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
Approaches
Data cube approach (OLAP approach)
Attribute-oriented induction approach

[Figure: conceptual levels 1 to 5]
6
Characterization: Data Cube Approach (without using Attribute-Oriented Induction)

Perform computations and store results in data cubes
Strengths
An efficient implementation of data generalization
Computation of various kinds of measures, e.g., count(), sum(), average(), max()
Generalization and specialization can be performed on a data cube by roll-up and drill-down
Limitations
Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
Lacks intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach

7
Attribute-Oriented Induction

Proposed in 1989 (KDD '89 workshop)
Not confined to categorical data or to particular measures.
How is it done?
Collect the task-relevant data (the initial relation) using a relational database query
Perform generalization by attribute removal or attribute generalization.
Apply aggregation by merging identical generalized tuples and accumulating their respective counts.
Interactive presentation with users.

8
Basic Principles of Attribute-Oriented Induction

Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typically 2-8, specified or default.
Generalized relation threshold control: controls the final relation/rule size.

9
Basic Algorithm for Attribute-Oriented Induction

InitialRel: Query processing of task-relevant data, deriving the initial relation.
PreGen: Based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize?
PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.
Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
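
The PreGen and PrimeGen steps above can be sketched in a few lines of Python. This is a minimal single-pass illustration, not the textbook algorithm verbatim: the sample rows, the `MAJOR_TO_FIELD` hierarchy, the `generalize_gpa` operator and the threshold value are all hypothetical, and each attribute is generalized only one level rather than climbing the hierarchy repeatedly until the threshold is met.

```python
from collections import Counter

# Hypothetical one-level concept hierarchy for the major attribute.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science", "Math": "Science",
                  "EE": "Engineering", "Finance": "Business"}

def generalize_gpa(gpa):
    """Hypothetical generalization operator: numeric GPA -> coarse band."""
    if gpa >= 3.5:
        return "excellent"
    if gpa >= 3.0:
        return "very_good"
    return "good"

def attribute_oriented_induction(rows, operators, threshold):
    """rows: initial relation as a list of dicts.
    operators: attribute name -> generalization function.
    threshold: maximum number of distinct values tolerated per attribute."""
    attributes = list(rows[0].keys())
    # PreGen: decide per attribute whether to keep, generalize or remove it.
    plan = {}
    for attr in attributes:
        distinct = {row[attr] for row in rows}
        if len(distinct) <= threshold:
            plan[attr] = "keep"
        elif attr in operators:
            plan[attr] = "generalize"
        else:
            plan[attr] = "remove"
    # PrimeGen: apply the plan, merge identical tuples, accumulate counts.
    kept = [a for a in attributes if plan[a] != "remove"]
    counts = Counter()
    for row in rows:
        key = tuple(
            operators[a](row[a]) if plan[a] == "generalize" else row[a]
            for a in kept
        )
        counts[key] += 1
    return kept, counts

# Hypothetical initial relation.
rows = [
    {"name": "Ann", "gender": "F", "major": "CS", "gpa": 3.7},
    {"name": "Bob", "gender": "M", "major": "Physics", "gpa": 3.6},
    {"name": "Cy", "gender": "M", "major": "EE", "gpa": 3.2},
    {"name": "Di", "gender": "F", "major": "Math", "gpa": 3.9},
    {"name": "Ed", "gender": "M", "major": "Finance", "gpa": 2.8},
]
operators = {"major": MAJOR_TO_FIELD.get, "gpa": generalize_gpa}
kept, counts = attribute_oriented_induction(rows, operators, threshold=3)
for generalized_tuple, count in counts.items():
    print(generalized_tuple, count)
```

Here `name` is removed (many distinct values, no operator), `gender` is kept as-is, and `major` and `gpa` are generalized; the two female Science students with excellent GPA collapse into one generalized tuple with count 2.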

10
Example

DMQL: Describe general characteristics of graduate students in the Big-University database
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
from student
where status in "graduate"
Corresponding SQL statement:
SELECT name, gender, major, birth_place, birth_date, residence, phone, gpa
FROM student
WHERE status IN ('Msc', 'MBA', 'PhD')

11
Class Characterization: An Example
[Tables: Initial Relation; Prime Generalized Relation]
12
Presentation of Generalized Results

Generalized relation
Relations where some or all attributes are
generalized, with counts or other aggregation
values accumulated.
Cross tabulation
Mapping results into cross tabulation form
(similar to contingency tables).
Visualization techniques
Pie charts, bar charts, curves, cubes, and other
visual forms.
Quantitative characteristic rules
Mapping generalized result into characteristic
rules with quantitative information associated
with it, e.g.,

13
Presentation of Generalized Results (continued)

t-weight
An interestingness measure that describes the typicality of
each disjunct in the rule
each tuple in the corresponding generalized relation

$t\_weight = \mathrm{count}(q_a) / \sum_{i=1}^{n} \mathrm{count}(q_i)$

n: number of tuples for the target class in the generalized relation
q_1, ..., q_n: tuples for the target class in the generalized relation
q_a is in q_1, ..., q_n
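
As a small illustration of the t-weight, the sketch below divides each generalized tuple's count by the total count for the target class. The tuple labels and counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def t_weights(tuple_counts):
    """Map each generalized tuple of the target class to its t-weight."""
    total = sum(tuple_counts.values())
    return {q: count / total for q, count in tuple_counts.items()}

# Hypothetical counts for three generalized tuples of one target class.
tuple_counts = {"Science": 60, "Engineering": 40, "Business": 20}
weights = t_weights(tuple_counts)
for q, w in weights.items():
    print(f"{q}: {w:.2%}")
```

The t-weights of all tuples in the target class sum to 1, which is what lets them be read as the typicality of each disjunct in a characteristic rule.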

14
Presentation: Generalized Relation
15
Presentation: Crosstab
16
Implementation by Cube Technology

Construct a data cube on-the-fly for the given
data mining query
Facilitate efficient drill-down analysis
May increase the response time
A balanced solution: precomputation of a "subprime" relation
Use a predefined and precomputed data cube
Construct a data cube beforehand
Facilitate not only the attribute-oriented
induction, but also attribute relevance analysis,
dicing, slicing, roll-up and drill-down
Cost of cube computation and the nontrivial
storage overhead

17
Characterization vs. OLAP

Similarity
Presentation of data summarization at multiple
levels of abstraction.
Interactive drilling, pivoting, slicing and
dicing.
Differences
Automated desired level allocation.
Dimension relevance analysis and ranking when
there are many relevant dimensions.
Sophisticated typing on dimensions and measures.
Analytical characterization: data dispersion analysis.

18
Attribute Relevance Analysis

Why?
Which dimensions should be included?
How high a level of generalization?
Automatic vs. interactive
Reduce the number of attributes: easier-to-understand patterns
What?
A statistical method for preprocessing data
filter out irrelevant or weakly relevant attributes
retain or rank the relevant attributes
Relevance is related to dimensions and levels
Analytical characterization, analytical comparison

19
Attribute relevance analysis (cont'd)

How?
Data Collection
Analytical Generalization
Use information gain analysis (e.g., entropy or
other measures) to identify highly relevant
dimensions and levels.
Relevance Analysis
Sort and select the most relevant dimensions and
levels.
Attribute-oriented Induction for class
description
On selected dimension/level
OLAP operations (e.g. drilling, slicing) on
relevance rules

20
Relevance Measures

Quantitative relevance measure determines the
classifying power of an attribute within a set of
data.
Methods
information gain (ID3)
gain ratio (C4.5)
gini index
χ² contingency table statistics
uncertainty coefficient

21
Information-Theoretic Approach

Decision tree
each internal node tests an attribute
each branch corresponds to attribute value
each leaf node assigns a classification
ID3 algorithm
build decision tree based on training objects
with known class labels to classify testing
objects
rank attributes with information gain measure
minimal height
the least number of tests to classify an object

22
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
23
Entropy and Information Gain

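S contains s_i tuples of class C_i for i = 1, ..., m
Information measures the info required to classify any arbitrary tuple:
$I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$, where $s = s_1 + \cdots + s_m$
Entropy (weighted average) of attribute A with values {a_1, a_2, ..., a_v}:
$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} I(s_{1j}, \ldots, s_{mj})$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$
Information gained by branching on attribute A:
$\mathrm{Gain}(A) = I(s_1, \ldots, s_m) - E(A)$
More info gained => more discriminating attribute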
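
To make the entropy and information gain definitions concrete, here is a minimal Python sketch that works directly from (attribute value, class label) pairs. The counts used for Outlook are the ones from the widely reproduced 14-row PlayTennis data set; they are not present in this transcript and are included here as an assumption.

```python
from collections import Counter, defaultdict
from math import log2

def info(labels):
    """Expected information needed to classify a tuple, given class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(pairs):
    """pairs: list of (attribute_value, class_label) tuples."""
    labels = [label for _, label in pairs]
    partitions = defaultdict(list)
    for value, label in pairs:
        partitions[value].append(label)
    # Weighted average of the information in each partition.
    entropy = sum(len(part) / len(pairs) * info(part)
                  for part in partitions.values())
    return info(labels) - entropy

# Assumed counts: Outlook vs. PlayTennis in the classic 14-row example.
outlook = ([("sunny", "yes")] * 2 + [("sunny", "no")] * 3 +
           [("overcast", "yes")] * 4 +
           [("rain", "yes")] * 3 + [("rain", "no")] * 2)
gain_outlook = information_gain(outlook)
print(f"Gain(Outlook) = {gain_outlook:.3f}")
```

An attribute whose partitions are nearly pure (like the "overcast" branch here) contributes little entropy, so it yields a larger gain and ranks higher in relevance.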

24
Example: Analytical Characterization

Task
Mine general characteristics describing graduate students using analytical characterization
Given
attributes: name, gender, major, birth_place, birth_date, phone, and gpa
Gen(a_i): concept hierarchies on a_i
U_i: attribute analytical thresholds for a_i
T_i: attribute generalization thresholds for a_i
R: attribute relevance threshold

25
Example: Analytical Characterization (cont'd)

1. Data collection
target class: graduate student
contrasting class: undergraduate student
2. Analytical generalization using U_i
attribute removal
remove name and phone
attribute generalization
generalize major, birth_place, birth_date and gpa
accumulate counts
candidate relation: gender, major, birth_country, age_range and gpa

26
Example: Analytical characterization (2)
Candidate relation for target class: Graduate students (Σ = 120)
Candidate relation for contrasting class: Undergraduate students (Σ = 130)
27
Example: Analytical characterization (3)

3. Relevance analysis
Calculate the expected info required to classify an arbitrary tuple
Calculate the entropy of each attribute, e.g. major

28
Example: Analytical Characterization (4)

Calculate expected info required to classify a
given sample if S is partitioned according to the
attribute
Calculate information gain for each attribute
Information gain for all attributes

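[Equation images not preserved. Surviving values: I(s_1, s_2) = I(120, 130) = 0.9988; per-value terms for major: 0.9183, 0.9892 and 0; E(major) = 0.7873]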
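
The relevance calculation for major can be replayed numerically. The class totals (120 graduate, 130 undergraduate) come from the candidate relations above; the per-major class counts below are an assumption, since the tables did not survive in this transcript. They were chosen because they reproduce the values 0.9988, 0.9183, 0.9892, 0 and 0.7873 that do survive.

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm) from raw class counts; empty classes contribute 0."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# ASSUMED (graduate, undergraduate) counts per generalized major value.
major_counts = {
    "Science": (84, 42),
    "Engineering": (36, 46),
    "Business": (0, 42),
}

grad_total = sum(g for g, _ in major_counts.values())
undergrad_total = sum(u for _, u in major_counts.values())
total = grad_total + undergrad_total

expected_info = info(grad_total, undergrad_total)
entropy_major = sum((g + u) / total * info(g, u)
                    for g, u in major_counts.values())
gain_major = expected_info - entropy_major

print(f"I(s1, s2)   = {expected_info:.4f}")
print(f"E(major)    = {entropy_major:.4f}")
print(f"Gain(major) = {gain_major:.2f}")
```

Under these assumed counts the gain for major comes out at roughly 0.21, comfortably above a relevance threshold of 0.1, so major would be retained.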
29
Example: Analytical characterization (5)

4. Initial working relation (W0) derivation
R (attribute relevance threshold) = 0.1
remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
remove the contrasting class candidate relation
5. Perform attribute-oriented induction on W0 using T_i

Initial target class working relation W0: Graduate students
30
Mining Class Comparisons

Comparison: comparing two or more classes.
Method:
Partition the set of relevant data into the
target class and the contrasting class(es)
Generalize both classes to the same high level
concepts
Compare tuples with the same high level
descriptions
Present for every tuple its description and two
measures
support - distribution within single class
comparison - distribution between classes
Highlight the tuples with strong discriminant
features
Relevance Analysis
Find attributes (features) which best distinguish
different classes.

31
Example: Analytical comparison

Task
Compare graduate and undergraduate students using a discriminant rule.
DMQL query

use Big_University_DB
mine comparison as "grad_vs_undergrad_students"
in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
for "graduate_students"
where status in "graduate"
versus "undergraduate_students"
where status in "undergraduate"
analyze count
from student
32
Example: Analytical comparison (2)

Given
attributes: name, gender, major, birth_place, birth_date, residence, phone and gpa
Gen(a_i): concept hierarchies on attributes a_i
U_i: attribute analytical thresholds for attributes a_i
T_i: attribute generalization thresholds for attributes a_i
R: attribute relevance threshold

33
Example: Analytical comparison (3)

1. Data collection
target and contrasting classes
2. Attribute relevance analysis
remove attributes name, gender, major, phone
3. Synchronous generalization
controlled by user-specified dimension thresholds
prime target and contrasting class(es)
relations/cuboids

34
Example: Analytical comparison (4)
Prime generalized relation for the target class: Graduate students
Prime generalized relation for the contrasting class: Undergraduate students
35
Example: Analytical comparison (5)

4. Drill down, roll up and other OLAP operations
on target and contrasting classes to adjust
levels of abstractions of resulting description
5. Presentation
as generalized relations, crosstabs, bar charts,
pie charts, or rules
contrasting measures to reflect comparison
between target and contrasting classes
e.g. count

36
Quantitative Discriminant Rules

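C_j: target class
q_a: a generalized tuple that covers some tuples of the target class but can also cover some tuples of the contrasting class(es)
d-weight
$d\_weight = \mathrm{count}(q_a \in C_j) / \sum_{i=1}^{m} \mathrm{count}(q_a \in C_i)$
range: [0.0, 1.0] or [0%, 100%]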

37
Quantitative Discriminant Rules

High d-weight in target class indicates that
concept represented by generalized tuple is
primarily derived from target class
Low d-weight implies concept is derived from
contrasting class
Threshold can be set to control the display of
interesting tuples
Quantitative discriminant rule form:
$\forall X,\ \mathit{target\_class}(X) \Leftarrow \mathit{condition}(X)\ [d: d\_weight]$
Read: if X satisfies the condition, there is a probability (d-weight) that X is in the target class

38
Example: Quantitative Discriminant Rule
Count distribution between graduate and undergraduate students for a generalized tuple

Quantitative discriminant rule
where d-weight = 90/(90 + 210) = 30%
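
The d-weight arithmetic in this example can be checked with a short sketch. The counts 90 and 210 are taken from the slide; assigning 90 to graduate students and 210 to undergraduates is an assumption based on the order in which the classes are named.

```python
def d_weights(class_counts):
    """For one generalized tuple, map each class to its d-weight."""
    total = sum(class_counts.values())
    return {cls: count / total for cls, count in class_counts.items()}

# Counts of one generalized tuple in each class (class assignment assumed).
class_counts = {"graduate": 90, "undergraduate": 210}
weights = d_weights(class_counts)
for cls, w in weights.items():
    print(f"{cls}: {w:.0%}")
```

Because the d-weights of one tuple across all classes sum to 1, a low d-weight in the target class directly implies that the tuple is mostly drawn from the contrasting class(es).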

39
Example: Quantitative Description Rule

Quantitative description rule for target class
Europe

Crosstab showing associated t-weight, d-weight
values and total number (in thousands) of TVs and
computers sold at AllElectronics in 1998
40
Mining Data Dispersion Characteristics

Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance,
etc.
Numerical dimensions correspond to sorted
intervals
Data dispersion analyzed with multiple
granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed
cube

41
Measuring the Central Tendency

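Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: a holistic measure
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation for grouped data
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: $\mathit{mean} - \mathit{mode} \approx 3 \times (\mathit{mean} - \mathit{median})$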
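
The central tendency measures can be tried out with Python's standard `statistics` module. The data values and weights below are hypothetical; `weighted_mean` is a small helper written for this sketch, not a library function.

```python
import statistics

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

data = [2, 3, 3, 5, 7, 10]          # hypothetical sample, even length

mean = statistics.mean(data)        # 30 / 6
median = statistics.median(data)    # average of the middle two values, 3 and 5
mode = statistics.mode(data)        # most frequent value
scores = weighted_mean([80, 90], [1, 3])

print(mean, median, mode, scores)
```

The mean and weighted mean can be computed from running sums, whereas the median needs the sorted data as a whole, which is why the slide calls it a holistic measure.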

42
Measuring the Dispersion of Data

Quartiles, outliers and boxplots
Quartiles: Q1 (25th percentile), Q3 (75th percentile)
Inter-quartile range: IQR = Q3 - Q1
Five-number summary: min, Q1, M, Q3, max
Boxplot: ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation
Variance s^2 (algebraic, scalable computation)
Standard deviation s is the square root of the variance s^2
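
A rough sketch of the five-number summary and the 1.5 x IQR outlier rule follows. Quartile conventions differ between tools; this version assumes the "inclusive" method of `statistics.quantiles` (Python 3.8+), and the sample data is hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    ordered = sorted(data)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

def outliers(data):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

data = [4, 7, 8, 9, 10, 11, 12, 13, 40]   # hypothetical sample
summary = five_number_summary(data)
flagged = outliers(data)
print(summary)   # (4, 8.0, 10.0, 12.0, 40)
print(flagged)   # [40]
```

With Q1 = 8 and Q3 = 12 the IQR is 4, so the fences sit at 2 and 18; only the value 40 falls outside them and would be drawn as an individual point on a boxplot.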

43
Boxplot Analysis

Five-number summary of a distribution
Minimum, Q1, M, Q3, Maximum
Boxplot
Data is represented with a box
The ends of the box are at the first and third
quartiles, i.e., the height of the box is IQR
(interquartile range)
The median is marked by a line within the box
Whiskers: two lines outside the box extend to Minimum and Maximum

44
A Boxplot
45
DBMiner

4 examples of Boxplot Analysis

46
Mining Descriptive Statistical Measures in Large
Databases

Variance
Standard deviation: the square root of the variance
Measures spread about the mean
It is zero if and only if all the values are equal
Both the standard deviation and the variance are algebraic
47
Histogram Analysis

Graph displays of basic statistical class
descriptions
Frequency histograms
A univariate graphical method
Consists of a set of rectangles that reflect the
counts or frequencies of the classes present in
the given data

48
DBMiner

2 examples of Histogram Analysis

49
Quantile Plot

Displays all of the data (allowing the user to
assess both the overall behavior and unusual
occurrences)
Plots quantile information
For data x_i sorted in increasing order, f_i indicates that approximately 100 f_i % of the data are below or equal to the value x_i
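
The points of a quantile plot can be generated as below. The slide does not say how f_i is computed; this sketch assumes the common convention f_i = (i - 0.5) / n, and the data values are hypothetical.

```python
def quantile_points(data):
    """Return (f_i, x_i) pairs for a quantile plot, with f_i = (i - 0.5) / n."""
    ordered = sorted(data)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

points = quantile_points([40, 10, 30, 20])   # hypothetical unit prices
for f, x in points:
    print(f"{f:.3f}  {x}")
```

Plotting x_i against f_i gives the quantile plot; plotting the x_i of one data set against the x_i of another at matching f_i gives a Q-Q plot.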

50
Quantile-Quantile (Q-Q) Plot

Graphs the quantiles of one univariate
distribution against the corresponding quantiles
of another
Allows the user to view whether there is a shift
in going from one distribution to another

51
Scatter plot

Provides a first look at bivariate data to see
clusters of points, outliers, etc
Each pair of values is treated as a pair of
coordinates and plotted as points in the plane

52
Loess Curve

Adds a smooth curve to a scatter plot in order to
provide better perception of the pattern of
dependence
Loess curve is fitted by setting two parameters
a smoothing parameter, and the degree of the
polynomials that are fitted by the regression

53
Graphic Displays of Basic Statistical Descriptions

Histogram (shown before)
Boxplot (covered before)
Quantile plot: each value x_i is paired with f_i, indicating that approximately 100 f_i % of the data are ≤ x_i
Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
Scatter plot: each pair of values is treated as a pair of coordinates and plotted as a point in the plane
Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

54
AO Induction vs. Learning-from-example Paradigm

Difference in philosophies and basic assumptions
Positive and negative samples in learning-from-example: positive samples are used for generalization, negative samples for specialization
Positive samples only in data mining: hence generalization-based; to drill down, backtrack the generalization to a previous state
Difference in methods of generalization
Machine learning generalizes on a tuple-by-tuple basis
Data mining generalizes on an attribute-by-attribute basis

55
Comparison of Entire vs. Factored Version Space
56
Incremental and Parallel Mining of Concept
Description

Incremental mining: revision based on newly added data ΔDB (delta DB)
Generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR
Union R ∪ ΔR, i.e., merge counts and other statistical information to produce a new relation R'
Similar philosophy can be applied to data
sampling, parallel and/or distributed mining, etc.
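
The incremental update can be sketched with Python's `collections.Counter`, treating a generalized relation as a mapping from generalized tuples to counts. The city-to-country hierarchy, the existing relation R and the newly added rows are all hypothetical.

```python
from collections import Counter

# Hypothetical concept hierarchy used to reach R's level of abstraction.
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}

def generalize(rows):
    """Lift each raw (city, major) row to the (country, major) level of R."""
    return Counter((CITY_TO_COUNTRY[city], major) for city, major in rows)

# Existing generalized relation R: generalized tuple -> accumulated count.
R = Counter({("Canada", "Science"): 16, ("USA", "Science"): 22})

# Newly added raw data (delta DB).
delta_db = [("Toronto", "Science"), ("Seattle", "Science"),
            ("Vancouver", "Business")]

delta_R = generalize(delta_db)   # generalize only the new rows
R_new = R + delta_R              # merge counts; unseen tuples are added
print(R_new)
```

Only the new rows are generalized, so the cost of a revision depends on the size of the delta rather than on the whole database. The same merge step would apply to partial relations produced by samples or by parallel workers.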

57
Summary

Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large
databases
Discussion
Incremental and parallel mining of description
Descriptive mining of complex types of data

58
Homework Assignment

This homework assignment will utilize the data
warehouse you previously built incorporating the
hurricane data
Implement an automated Attribute-Oriented
Induction capability on your hurricane data
Input needed
Your relation tables for the hurricane data
A DMQuery for characterization
A list of attributes
A set of concept hierarchies or generalization
operators
A generalized relation threshold
Attribute generalization thresholds for each
attribute

59
Homework Assignment

Transform the DMQL statement to a relational
query (can do this by hand)
Use this relational query in your program to
retrieve data and then perform the AOI (see
algorithm on pg. 188 of book)
Visualize the results of the AOI via a
generalized relation
Due April 22
