Title: Data Warehousing/Mining Comp 150 DW Chapter 5: Concept Description: Characterization and Comparison
Chapter 5: Concept Description: Characterization and Comparison
- What is concept description?
- Data generalization and summarization-based characterization
- Analytical characterization: Analysis of attribute relevance
- Mining class comparisons: Discriminating between different classes
- Mining descriptive statistical measures in large databases
- Discussion
- Summary
What is Concept Description?
- Descriptive vs. predictive data mining
  - Descriptive mining: describes concepts or task-relevant data sets in concise, summarized, informative, and discriminative forms
  - Predictive mining: based on data and analysis, constructs models for the database and predicts the trend and properties of unknown data
- Concept description
  - Characterization: provides a concise and succinct summarization of the given collection of data
  - Comparison: provides descriptions comparing two or more collections of data
Concept Description vs. OLAP
- Concept description
  - can handle complex data types of the attributes and their aggregations
  - a more automated process
- OLAP
  - restricted to a small number of dimension and measure types
  - user-controlled process
Data Generalization and Summarization-based Characterization
- Data generalization
  - A process that abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
- Approaches
  - Data cube approach (OLAP approach)
  - Attribute-oriented induction approach
[Figure: conceptual levels, numbered 1 to 5]
Characterization: Data Cube Approach (without using Attribute-Oriented Induction)
- Perform computations and store results in data cubes
- Strengths
  - An efficient implementation of data generalization
  - Computation of various kinds of measures, e.g., count(), sum(), average(), max()
  - Generalization and specialization can be performed on a data cube by roll-up and drill-down
- Limitations
  - Handles only dimensions of simple nonnumeric data and measures of simple aggregated numeric values
  - Lacks intelligent analysis: cannot tell which dimensions should be used or what level the generalization should reach
Attribute-Oriented Induction
- Proposed in 1989 (KDD '89 workshop)
- Not confined to categorical data or particular measures
- How is it done?
  - Collect the task-relevant data (initial relation) using a relational database query
  - Perform generalization by attribute removal or attribute generalization
  - Apply aggregation by merging identical generalized tuples and accumulating their respective counts
  - Interactive presentation with users
Basic Principles of Attribute-Oriented Induction
- Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
- Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.
- Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
- Attribute-threshold control: typically 2-8, specified or default.
- Generalized relation threshold control: controls the final relation/rule size.
Basic Algorithm for Attribute-Oriented Induction
- InitialRel: query processing of task-relevant data, deriving the initial relation.
- PreGen: based on the analysis of the number of distinct values in each attribute, determine a generalization plan for each attribute: removal, or how high to generalize?
- PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation", accumulating the counts.
- Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
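The generalization and merging steps can be illustrated with a minimal Python sketch. It is not the algorithm from the course text: the function name, the per-level dictionary representation of concept hierarchies, the single shared threshold, and the toy student rows are all assumptions made for illustration.

```python
from collections import Counter

def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize rows (dicts) attribute by attribute and merge identical tuples.

    hierarchies maps an attribute name to a list of dicts, one per level,
    each mapping a value to its parent concept (a hypothetical structure).
    """
    attributes = list(rows[0].keys())
    rows = [dict(r) for r in rows]  # work on a copy of the initial relation
    for attr in list(attributes):
        levels = hierarchies.get(attr, [])
        level = 0
        # Attribute generalization: climb while there are too many distinct values.
        while len({r[attr] for r in rows}) > threshold and level < len(levels):
            mapping = levels[level]
            for r in rows:
                r[attr] = mapping.get(r[attr], r[attr])
            level += 1
        # Attribute removal: still too many values and no operator left.
        if len({r[attr] for r in rows}) > threshold:
            attributes.remove(attr)
    # Merge identical generalized tuples, accumulating their counts.
    return attributes, Counter(tuple(r[a] for a in attributes) for r in rows)

# Toy initial relation and hierarchies (made up for this sketch).
students = [
    {"name": "Ann", "major": "Physics", "birth_place": "Vancouver"},
    {"name": "Bob", "major": "Chemistry", "birth_place": "Toronto"},
    {"name": "Cho", "major": "Physics", "birth_place": "Seattle"},
    {"name": "Dev", "major": "Finance", "birth_place": "Montreal"},
]
hierarchies = {
    "major": [{"Physics": "Science", "Chemistry": "Science", "Finance": "Business"}],
    "birth_place": [{"Vancouver": "Canada", "Toronto": "Canada",
                     "Montreal": "Canada", "Seattle": "USA"}],
}
attrs, relation = attribute_oriented_induction(students, hierarchies, threshold=2)
for generalized_tuple, count in relation.items():
    print(dict(zip(attrs, generalized_tuple)), "count =", count)
```

Here name is removed because it has many distinct values and no hierarchy, while major and birth_place are each generalized one level before identical tuples are merged.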
Example
- DMQL: Describe general characteristics of graduate students in the Big-University database
    use Big_University_DB
    mine characteristics as "Science_Students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
    from student
    where status in "graduate"
- Corresponding SQL statement
    SELECT name, gender, major, birth_place, birth_date, residence, phone, gpa
    FROM student
    WHERE status IN ('Msc', 'MBA', 'PhD')
Class Characterization: An Example
[Table: Initial Relation]
[Table: Prime Generalized Relation]
Presentation of Generalized Results
- Generalized relation
  - Relations where some or all attributes are generalized, with counts or other aggregation values accumulated.
- Cross tabulation
  - Mapping results into cross tabulation form (similar to contingency tables).
- Visualization techniques
  - Pie charts, bar charts, curves, cubes, and other visual forms.
- Quantitative characteristic rules
  - Mapping generalized results into characteristic rules with quantitative information (e.g., a t-weight) associated with them.
Presentation of Generalized Results (continued)
- t-weight
  - Interestingness measure that describes the typicality of
    - each disjunct in the rule
    - each tuple in the corresponding generalized relation
  - n: number of tuples for the target class in the generalized relation
  - q1, ..., qn: tuples for the target class in the generalized relation
  - qa is in q1, ..., qn
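The t-weight formula itself is not shown in the text above. A common definition, assumed here, is the count of the generalized tuple qa divided by the total count over all n generalized tuples of the target class. The sketch below applies that assumed definition to a made-up generalized relation; the function name and the counts are hypothetical.

```python
def t_weights(counts):
    """Return the t-weight of each generalized tuple of the target class.

    counts maps a generalized tuple q_i to count(q_i); the t-weight of q_a
    is taken to be count(q_a) / sum of count(q_i) for i = 1..n (assumed).
    """
    total = sum(counts.values())
    return {q: c / total for q, c in counts.items()}

# Hypothetical generalized relation for a target class (counts are made up).
target_counts = {
    ("Science", "Canada"): 16,
    ("Science", "Foreign"): 14,
    ("Engineering", "Canada"): 10,
}
weights = t_weights(target_counts)
for q, w in weights.items():
    print(q, f"t-weight = {w:.1%}")
```

By construction the t-weights of one target class sum to 1, which is what lets them be read as the typicality of each disjunct in a characteristic rule.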
Presentation: Generalized Relation
Presentation: Crosstab
Implementation by Cube Technology
- Construct a data cube on-the-fly for the given data mining query
  - Facilitates efficient drill-down analysis
  - May increase the response time
  - A balanced solution: precomputation of a "subprime" relation
- Use a predefined and precomputed data cube
  - Construct a data cube beforehand
  - Facilitates not only attribute-oriented induction, but also attribute relevance analysis, dicing, slicing, roll-up and drill-down
  - Incurs the cost of cube computation and nontrivial storage overhead
Characterization vs. OLAP
- Similarities
  - Presentation of data summarization at multiple levels of abstraction.
  - Interactive drilling, pivoting, slicing and dicing.
- Differences
  - Automated desired level allocation.
  - Dimension relevance analysis and ranking when there are many relevant dimensions.
  - Sophisticated typing on dimensions and measures.
  - Analytical characterization: data dispersion analysis.
Attribute Relevance Analysis
- Why?
  - Which dimensions should be included?
  - How high a level of generalization?
  - Automatic vs. interactive
  - Reducing the number of attributes makes patterns easier to understand
- What?
  - statistical method for preprocessing data
    - filter out irrelevant or weakly relevant attributes
    - retain or rank the relevant attributes
  - relevance related to dimensions and levels
  - analytical characterization, analytical comparison
Attribute Relevance Analysis (cont'd)
- How?
  - Data collection
  - Analytical generalization
    - Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  - Relevance analysis
    - Sort and select the most relevant dimensions and levels.
  - Attribute-oriented induction for class description
    - On the selected dimensions/levels
  - OLAP operations (e.g., drilling, slicing) on relevance rules
Relevance Measures
- Quantitative relevance measure: determines the classifying power of an attribute within a set of data.
- Methods
  - information gain (ID3)
  - gain ratio (C4.5)
  - Gini index
  - χ² contingency table statistics
  - uncertainty coefficient
Information-Theoretic Approach
- Decision tree
  - each internal node tests an attribute
  - each branch corresponds to an attribute value
  - each leaf node assigns a classification
- ID3 algorithm
  - builds a decision tree from training objects with known class labels to classify testing objects
  - ranks attributes with the information gain measure
  - minimal height
    - the least number of tests to classify an object
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
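Entropy and Information Gain
- S contains $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$, with $s = \sum_{i=1}^{m} s_i$
- Information measures the info required to classify any arbitrary tuple:
  $I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$
- Entropy (weighted average) of attribute A with values $\{a_1, a_2, \ldots, a_v\}$, where $s_{ij}$ is the number of tuples of class $C_i$ with $A = a_j$:
  $E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$
- Information gained by branching on attribute A:
  $Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)$
- The more information gained, the more discriminating the attribute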
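The three quantities can be computed directly from class counts. The sketch below uses the standard ID3 definitions (expected information, attribute entropy as a weighted average, and their difference as the gain). The function names and the two-partition split are hypothetical; the 120 and 130 class sizes are the ones quoted for the graduate/undergraduate example later in the chapter.

```python
from math import log2

def info(class_counts):
    """Expected information I(s1, ..., sm) needed to classify a tuple."""
    total = sum(class_counts)
    return -sum((s / total) * log2(s / total) for s in class_counts if s > 0)

def entropy_of_attribute(partitions):
    """E(A): weighted average of info over the partitions induced by A.

    partitions is a list of per-value class-count lists [s1j, ..., smj].
    """
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A)."""
    return info(class_counts) - entropy_of_attribute(partitions)

# Class sizes from the graduate/undergraduate example: 120 vs. 130 tuples.
print(round(info([120, 130]), 4))  # 0.9988

# Hypothetical attribute splitting the 250 tuples into two partitions.
partitions = [[100, 30], [20, 100]]
print(round(gain([120, 130], partitions), 4))
```

Classes with a zero count are skipped in the sum, following the usual convention that 0 log 0 contributes nothing.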
Example: Analytical Characterization
- Task
  - Mine general characteristics describing graduate students using analytical characterization
- Given
  - attributes name, gender, major, birth_place, birth_date, phone, and gpa
  - Gen(ai) = concept hierarchies on ai
  - Ui = attribute analytical thresholds for ai
  - Ti = attribute generalization thresholds for ai
  - R = attribute relevance threshold
Example: Analytical Characterization (cont'd)
- 1. Data collection
  - target class: graduate student
  - contrasting class: undergraduate student
- 2. Analytical generalization using Ui
  - attribute removal
    - remove name and phone
  - attribute generalization
    - generalize major, birth_place, birth_date and gpa
    - accumulate counts
  - candidate relation: gender, major, birth_country, age_range and gpa
Example: Analytical Characterization (2)
Candidate relation for target class: Graduate students (Σ = 120)
Candidate relation for contrasting class: Undergraduate students (Σ = 130)
Example: Analytical Characterization (3)
- 3. Relevance analysis
  - Calculate the expected info required to classify an arbitrary tuple
  - Calculate the entropy of each attribute, e.g., major
Example: Analytical Characterization (4)
- Calculate the expected info required to classify a given sample if S is partitioned according to the attribute
- Calculate the information gain for each attribute
- Information gain for all attributes
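Worked values for major: I(s1, s2) = I(120, 130) = 0.9988; I(s1j, s2j) = 0.9183, 0.9892 and 0 for the three values of major; E(major) = 0.7873, so Gain(major) = 0.9988 - 0.7873 = 0.2115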
Example: Analytical Characterization (5)
- 4. Initial working relation (W0) derivation
  - R (attribute relevance threshold) = 0.1
  - remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
  - remove the contrasting class candidate relation
- 5. Perform attribute-oriented induction on W0 using Ti
Initial target class working relation W0: Graduate students
Mining Class Comparisons
- Comparison: comparing two or more classes.
- Method
  - Partition the set of relevant data into the target class and the contrasting class(es)
  - Generalize both classes to the same high-level concepts
  - Compare tuples with the same high-level descriptions
  - Present for every tuple its description and two measures:
    - support: distribution within a single class
    - comparison: distribution between classes
  - Highlight the tuples with strong discriminant features
- Relevance analysis
  - Find attributes (features) which best distinguish different classes.
Example: Analytical Comparison
- Task
  - Compare graduate and undergraduate students using a discriminant rule.
  - DMQL query
    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count
    from student
Example: Analytical Comparison (2)
- Given
  - attributes name, gender, major, birth_place, birth_date, residence, phone and gpa
  - Gen(ai) = concept hierarchies on attributes ai
  - Ui = attribute analytical thresholds for attributes ai
  - Ti = attribute generalization thresholds for attributes ai
  - R = attribute relevance threshold
Example: Analytical Comparison (3)
- 1. Data collection
  - target and contrasting classes
- 2. Attribute relevance analysis
  - remove attributes name, gender, major, phone
- 3. Synchronous generalization
  - controlled by user-specified dimension thresholds
  - prime target and contrasting class(es) relations/cuboids
Example: Analytical Comparison (4)
Prime generalized relation for the target class: Graduate students
Prime generalized relation for the contrasting class: Undergraduate students
Example: Analytical Comparison (5)
- 4. Drill down, roll up and other OLAP operations on target and contrasting classes to adjust levels of abstraction of the resulting description
- 5. Presentation
  - as generalized relations, crosstabs, bar charts, pie charts, or rules
  - contrasting measures to reflect comparison between target and contrasting classes
    - e.g., count
Quantitative Discriminant Rules
- Cj = target class
- qa = a generalized tuple that covers some tuples of the target class
  - but can also cover some tuples of the contrasting class
- d-weight
  - range: [0.0, 1.0] or [0%, 100%]
Quantitative Discriminant Rules
- A high d-weight in the target class indicates that the concept represented by the generalized tuple is primarily derived from the target class
- A low d-weight implies the concept is primarily derived from the contrasting class
- A threshold can be set to control the display of interesting tuples
- Quantitative discriminant rule form
  - Read: if X satisfies the condition, there is a probability (d-weight) that X is in the target class
Example: Quantitative Discriminant Rule
Count distribution between graduate and undergraduate students for a generalized tuple
- Quantitative discriminant rule
  - where 90/(90 + 210) = 30%
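The 30% figure follows from reading the d-weight as the share of a generalized tuple's count that falls in the target class. The sketch below assumes that definition, count of qa in Cj divided by the sum of its counts over all classes. The function name and the dictionary layout are hypothetical.

```python
def d_weight(counts_by_class, target):
    """d-weight of a generalized tuple q_a for the target class.

    counts_by_class maps each class C_i to count(q_a in C_i).
    """
    total = sum(counts_by_class.values())
    return counts_by_class[target] / total

# Counts from the example: 90 graduate and 210 undergraduate tuples.
counts = {"graduate": 90, "undergraduate": 210}
dw = d_weight(counts, "graduate")
print(f"d-weight = {dw:.0%}")  # d-weight = 30%
```

The same function handles more than one contrasting class, since the denominator sums over every class in the dictionary.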
Example: Quantitative Description Rule
- Quantitative description rule for target class Europe
Crosstab showing associated t-weight, d-weight values and total number (in thousands) of TVs and computers sold at AllElectronics in 1998
Mining Data Dispersion Characteristics
- Motivation
  - To better understand the data: central tendency, variation and spread
- Data dispersion characteristics
  - median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube
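Measuring the Central Tendency
- Mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
  - Weighted arithmetic mean: $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
- Median: a holistic measure
  - Middle value if there is an odd number of values, or the average of the middle two values otherwise
  - estimated by interpolation
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula: $mean - mode \approx 3 \times (mean - median)$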
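These measures map onto Python's standard `statistics` module, as the short sketch below shows. The sample values and the equal weights are made up for illustration.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
weights = [1] * len(values)  # equal weights reduce to the plain mean

mean = statistics.mean(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
median = statistics.median(values)    # averages the two middle values
modes = statistics.multimode(values)  # all most-frequent values

print("mean:", mean)
print("weighted mean:", weighted_mean)
print("median:", median)
print("modes:", modes)
```

This sample has two modes, so it is bimodal; `multimode` returns every value tied for the highest frequency rather than just one.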
Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 - Q1
  - Five number summary: min, Q1, M, Q3, max
  - Boxplot: ends of the box are the quartiles, the median is marked, whiskers are drawn, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
- Variance and standard deviation
  - Variance $s^2$ (algebraic, scalable computation)
  - Standard deviation $s$ is the square root of variance $s^2$
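The five-number summary and the 1.5 x IQR outlier rule can be sketched with the standard library. Quartile conventions differ between tools; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, and the function names and sample values are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

def iqr_outliers(values):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample values, chosen only for illustration.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(values))
print(iqr_outliers(values))
```

A different quartile method would shift Q1 and Q3 slightly, so the outlier fences depend on the convention chosen.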
Boxplot Analysis
- Five-number summary of a distribution:
  - Minimum, Q1, M, Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR (interquartile range)
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum
A Boxplot
DBMiner
- 4 examples of Boxplot Analysis
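Mining Descriptive Statistical Measures in Large Databases
- Variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right]$
- Standard deviation: the square root of the variance
  - Measures spread about the mean
  - It is zero if and only if all the values are equal
  - Both the deviation and the variance are algebraic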
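"Algebraic" here means the measure can be computed from a fixed number of distributive aggregates, so each data partition can be summarized independently and the summaries merged. The sketch below assumes the sample variance with an n - 1 denominator and keeps the triple (n, sum of x, sum of x squared) per partition. The function names and data are hypothetical.

```python
from math import sqrt

def partial_sums(values):
    """Distributive summaries of one data partition: (n, sum x, sum x^2)."""
    return len(values), sum(values), sum(x * x for x in values)

def merge(a, b):
    """Combine the summaries of two partitions component-wise."""
    return tuple(x + y for x, y in zip(a, b))

def variance(summary):
    """Sample variance s^2 = (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n, total, total_sq = summary
    return (total_sq - total * total / n) / (n - 1)

# Hypothetical partitions of one numeric measure.
part1, part2 = [2, 4, 4], [4, 5, 5, 7, 9]
summary = merge(partial_sums(part1), partial_sums(part2))
s2 = variance(summary)
print("variance:", s2, "std dev:", sqrt(s2))
```

The sum-of-squares form is convenient for merging but can lose precision on large values with small spread; it is shown because it makes the algebraic property explicit.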
Histogram Analysis
- Graph displays of basic statistical class descriptions
  - Frequency histograms
    - A univariate graphical method
    - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data
DBMiner
- 2 examples of Histogram Analysis
Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information
  - For data xi sorted in increasing order, fi indicates that approximately 100 fi% of the data are below or equal to the value xi
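The text does not say how fi is computed. One common convention, assumed here, is fi = (i - 0.5) / n for the i-th smallest of n values. The sketch below builds the (fi, xi) pairs that a quantile plot would draw; the function name and prices are made up.

```python
def quantile_pairs(values):
    """Pair each sorted value x_i with f_i = (i - 0.5) / n, i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical unit prices, chosen only for illustration.
prices = [74, 40, 43, 65, 47]
for f, x in quantile_pairs(prices):
    print(f"f = {f:.1f}  x = {x}")
```

Plotting x against f gives the quantile plot; plotting the x values of two data sets at matching f values gives a Q-Q plot.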
Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another
Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- A loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials that are fitted by the regression
Graphic Displays of Basic Statistical Descriptions
- Histogram (shown before)
- Boxplot (covered before)
- Quantile plot: each value xi is paired with fi, indicating that approximately 100 fi% of the data are ≤ xi
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence
AO Induction vs. Learning-from-example Paradigm
- Difference in philosophies and basic assumptions
  - Positive and negative samples in learning-from-example: positive samples are used for generalization, negative ones for specialization
  - Positive samples only in data mining: hence generalization-based; to drill down, backtrack the generalization to a previous state
- Difference in methods of generalization
  - Machine learning generalizes on a tuple-by-tuple basis
  - Data mining generalizes on an attribute-by-attribute basis
Comparison of Entire vs. Factored Version Space
Incremental and Parallel Mining of Concept Description
- Incremental mining: revision based on newly added data ΔDB
  - Generalize ΔDB to the same level of abstraction as the generalized relation R to derive ΔR
  - Union R ∪ ΔR, i.e., merge counts and other statistical information to produce a new relation R'
- A similar philosophy can be applied to data sampling, parallel and/or distributed mining, etc.
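The incremental step can be sketched with a counter per generalized tuple: the new data is generalized with the same mapping and its counts are added to the existing relation. The one-attribute relation, the city-to-country hierarchy, and all names below are hypothetical simplifications.

```python
from collections import Counter

def generalize(rows, mapping):
    """Map each raw value to its higher-level concept and count tuples."""
    return Counter(mapping[value] for value in rows)

# Hypothetical concept hierarchy and data, for illustration only.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Seattle": "USA"}
R = generalize(["Vancouver", "Toronto", "Seattle"], city_to_country)

# Newly added tuples are generalized to the same level, then merged.
delta_db = ["Toronto", "Seattle", "Seattle"]
delta_R = generalize(delta_db, city_to_country)
R_new = R + delta_R  # Counter addition sums the counts per tuple

print(R_new)
```

Because merging is just addition of counts, the same step works when the partial relations come from data samples or from separate workers in a parallel or distributed run.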
Summary
- Concept description: characterization and discrimination
- OLAP-based vs. attribute-oriented induction
- Efficient implementation of AOI
- Analytical characterization and comparison
- Mining descriptive statistical measures in large databases
- Discussion
  - Incremental and parallel mining of description
  - Descriptive mining of complex types of data
Homework Assignment
- This homework assignment will utilize the data warehouse you previously built incorporating the hurricane data
- Implement an automated Attribute-Oriented Induction capability on your hurricane data
- Input needed
  - Your relation tables for the hurricane data
  - A DMQuery for characterization
  - A list of attributes
  - A set of concept hierarchies or generalization operators
  - A generalized relation threshold
  - Attribute generalization thresholds for each attribute
Homework Assignment (continued)
- Transform the DMQL statement to a relational query (can do this by hand)
- Use this relational query in your program to retrieve data and then perform the AOI (see algorithm on pg. 188 of book)
- Visualize the results of the AOI via a generalized relation
- Due April 22