Data Mining overview - PowerPoint PPT Presentation

About This Presentation

Title:

Data Mining overview

Slides: 47

Provided by: csBg

Transcript and Presenter's Notes

1
Data Mining (overview)
2
Presentation overview

Introduction
Association Rules
Classification
Clustering
Similar Time Sequences
Similar Images
Outliers
WWW
Summary

3
Background

Corporations have huge databases containing a
wealth of information
Business databases potentially constitute a
goldmine of valuable business information
Very little functionality in database systems to
support data mining applications
Data mining: the efficient discovery of
previously unknown patterns in large databases

4
Applications

Fraud Detection
Loan and Credit Approval
Market Basket Analysis
Customer Segmentation
Financial Applications
E-Commerce
Decision Support
Web Search

5
Data Mining Techniques

Association Rules
Sequential Patterns
Classification
Clustering
Similar Time Sequences
Similar Images
Outlier Discovery
Text/Web Mining

6
Examples of Discovered Patterns

Association rules
98% of people who purchase diapers also buy beer
Classification
People with age less than 25 and salary > 40k
drive sports cars
Similar time sequences
Stocks of companies A and B perform similarly
Outlier Detection
Residential customers for telecom company with
businesses at home

7
Association Rules

Given
A database of customer transactions
Each transaction is a set of items
Find all rules X => Y that correlate the presence
of one set of items X with another set of items Y
Example: 98% of people who purchase diapers and
baby food also buy beer.
Any number of items in the consequent/antecedent
of a rule
Possible to specify constraints on rules (e.g.,
find only rules involving expensive imported
products)

8
Association Rules

Sample Applications
Market basket analysis
Attached mailing in direct marketing
Fraud detection for medical insurance
Department store floor/shelf planning

9
Confidence and Support

A rule must have some minimum user-specified
confidence
{1, 2} => 3 has 90% confidence if, when a customer
bought 1 and 2, in 90% of cases the customer
also bought 3.
A rule must have some minimum user-specified
support (how frequently the rule occurs)
{1, 2} => 3 should hold in some minimum percentage
of transactions to have business value

10
Example

For minimum support 50% and minimum confidence
50%, we have the following rules
1 => 3 with 50% support and 66% confidence
(1 and 3 occurred together in 50% of cases, but 3
occurred in only 2/3 of the cases where 1 occurred)
3 => 1 with 50% support and 100% confidence
(3 and 1 occurred together in 50% of cases, and
whenever 3 occurred 1 occurred too)
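The support and confidence figures in this example can be reproduced with a short Python sketch. The four transactions below are hypothetical (the transaction table is not part of this transcript); they were chosen only so that the numbers match the rules above.

```python
# Hypothetical transaction database (an assumption, not taken from the slides).
transactions = [
    {1, 2, 3},
    {1, 3},
    {1, 4},
    {2, 5, 6},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Support of antecedent and consequent together, divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

print(support({1, 3}, transactions))       # 1 and 3 together: 2 of 4 transactions
print(confidence({1}, {3}, transactions))  # 3 appears in 2 of the 3 transactions with 1
print(confidence({3}, {1}, transactions))  # 1 appears in every transaction with 3
```

A rule is reported only if both values reach the user-specified minimums, here 50% each.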

11
Quantitative Association Rules
Definition?

Quantitative attributes (e.g. age, income)
Categorical attributes (e.g. make of car)
Age: 30..39 and Married: Yes =>
NumCars: 2

min support = 40%, min confidence = 50%
12
Temporal Association Rules

Can describe the rich temporal character in data
Example
diaper -> beer (support = 5%, confidence =
87%)
Support of this rule may jump to 25% between 6 and
9 PM on weekdays
Problem: how to find rules that follow
interesting user-defined temporal patterns
Challenge is to design efficient algorithms that
do much better than finding every rule in every
time unit

13
Correlation Rules

Association rules do not capture correlations
Example
Suppose 90% of customers buy coffee, 25% buy tea
and 20% buy both tea and coffee
tea => coffee has high support (0.2) and
confidence (0.8)
tea and coffee are not positively correlated
expected support of customers buying both, if the
two purchases were independent, is 0.9 ×
0.25 = 0.225, which is above the observed 0.2
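The arithmetic behind this example can be checked with a minimal Python sketch. The ratio of observed to expected support is commonly called lift; that name is not used in the slides and is introduced here only for illustration.

```python
# Fractions of customers, taken from the coffee/tea example.
p_coffee = 0.90
p_tea = 0.25
p_both = 0.20

# Confidence of tea => coffee: of the tea buyers, how many also buy coffee.
confidence_tea_coffee = p_both / p_tea

# Support expected for "tea and coffee" if the two purchases were independent.
expected_both = p_coffee * p_tea

# Ratio of observed to expected support (often called lift).
# A value below 1 means the items co-occur less often than independence predicts.
lift = p_both / expected_both

print(confidence_tea_coffee, expected_both, lift)
```

Despite the high confidence, the ratio is below 1: knowing that a customer buys tea makes a coffee purchase slightly less likely than the 90% base rate.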

14
Sequential Patterns

Given
A sequence of customer transactions
Each transaction is a set of items
Find all maximal sequential patterns supported by
more than a user-specified percentage of
customers
Example: 10% of customers who bought a PC did a
memory upgrade in a subsequent transaction
10% is the support of the pattern

15
Classification

Given
Database of tuples, each assigned a class label
Develop a model/profile for each class
Example profile (good credit): (25 < age < 40
and income > 40k) or (married = YES)
Sample applications
Credit card approval (good, bad)
Bank locations (good, fair, poor)
Treatment effectiveness (good, fair, poor)

16
Decision Trees
50 Churners, 50 Non-Churners
  Old technology phone: 20 Churners, 0 Non-Churners
  New technology phone: 30 Churners, 50 Non-Churners
    Customer > 2.3 years: 5 Churners, 40 Non-Churners
    Customer < 2.3 years: 25 Churners, 10 Non-Churners
      Age < 55: 20 Churners, 0 Non-Churners
      Age > 55: 5 Churners, 10 Non-Churners
A decision tree is a predictive model that makes
a prediction on the basis of a series of decisions
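The "series of decisions" can be written out as nested tests. The sketch below encodes the churn tree from this slide under two assumptions: the branch labels belong to the counts in the order they are listed, and a value exactly on a boundary (2.3 years, age 55) goes to the "greater than" side. The function name and return format are hypothetical.

```python
def predict_churn(new_technology_phone, years_as_customer, age):
    """Walk the churn tree and return (label, churners, non_churners) for the leaf reached.

    The label is the majority class of the training records in that leaf.
    """
    if not new_technology_phone:
        return ("churner", 20, 0)        # old technology phone
    if years_as_customer >= 2.3:
        return ("non-churner", 5, 40)    # long-standing customer
    if age < 55:
        return ("churner", 20, 0)        # newer, younger customer
    return ("non-churner", 5, 10)        # newer, older customer

print(predict_churn(new_technology_phone=True, years_as_customer=1.0, age=40))
```

The four leaves together account for all 50 churners and 50 non-churners at the root.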
17
Decision Trees
DT create a segmentation of the original
data set. This segmentation is done for the
prediction of some information. The records that
fall in each segment are similar with respect to
the information being predicted. The DT and the
algorithms may be complex, but the results are
presented in an easy-to-understand way, quite
useful to the business user.
18
Decision Trees

DT in business
Automation: A very favorable technique for
automating data mining and predictive
modeling. DT embed automated solutions to
things that other techniques leave as a burden to
the user (4/4)
Clarity: The models are viewed as a tree of
simple decisions based on familiar predictors or
as a set of rules. The user can confirm the DT or
modify it by hand on the basis of his own expertise
(4/4)
ROI: Because DT work well with relational
databases, they provide well-integrated solutions
with highly accurate models (3/4)

19
Decision Trees

Pros
Fast execution time
Generated rules are easy to interpret by humans
Scale well for large data sets
Can handle high dimensional data
Cons
Cannot capture correlations among attributes
Consider only axis-parallel cuts

20
Clustering

Given
Data points and number of desired clusters K
Group the data points into K clusters
Data points within clusters are more similar than
across clusters
Sample applications
Customer segmentation
Market basket customer analysis
Attached mailing in direct marketing
Clustering companies with similar growth

21
Where to use clustering and nearest-neighbor
prediction

Clustering for clarity
A high-level view
Segmentation
Clustering for outlier analysis
To see records that stick out of the rest
e.g. Wine distributors produce a certain level of
profit. One store produces significantly lower
profit. It turns out that the distributor was
delivering to, but not collecting payment from, one
of its customers.
Nearest neighbor for prediction
Objects near to each other have similar
prediction values.
Examples: finding more documents like this one
among journal articles; predicting the next value
of a stock price based on its time series.
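A minimal Python sketch of nearest-neighbor prediction follows. The records, the choice of k = 3, and the use of Euclidean distance with a majority vote are all assumptions made for illustration; the slides do not prescribe them.

```python
import math
from collections import Counter

def knn_predict(history, query, k=3):
    """Return the majority label among the k historical records closest to `query`."""
    nearest = sorted(history, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (age, income in thousands) records with a known outcome.
history = [
    ((25, 40), "yes"),
    ((27, 42), "yes"),
    ((30, 45), "yes"),
    ((55, 90), "no"),
    ((60, 95), "no"),
]

print(knn_predict(history, (28, 41)))
print(knn_predict(history, (58, 92)))
```

The query inherits the prediction value of the records it lies closest to, which is the idea stated above.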

22
Outlier Discovery

Sometimes clustering is performed to see when one
record sticks out of the rest
E.g. One store stands out as producing
significantly lower profit. Closer examination
shows that the distributor was not collecting
payment from one of his customers
E.g. A sale of men's suits is being held in all
branches of a department store. All stores but
one have seen at least a 100% jump in revenue. It
turns out that this store had advertised via radio
rather than TV as the other stores did
Sample applications
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis

23
Outlier Discovery

Given
Data points and number of outliers (n) to find
Find top n outlier points
outliers are considerably dissimilar from the
remainder of the data

24
Statistical Approaches

Model underlying distribution that generates
dataset (e.g. normal distribution)
Use discordancy tests depending on
data distribution
distribution parameter (e.g. mean, variance)
number of expected outliers
Drawbacks
most tests are for single attribute
In many cases, data distribution may not be known

25
Differences between the nearest-neighbor
technique and clustering

Nearest neighbors
Used for prediction and consolidation
Space is defined by the problem to be solved
Generally only uses distance metrics to determine
nearness

Clustering
Used for consolidating data into a high-level
view and general grouping of records into like
behaviors
Space is defined as the default n-dimensional space,
or by the user, or as a predefined space driven by
past experience
Can use metrics other than distance to
determine the nearness of two records, e.g. linking
points together

26
How clustering and nearest-neighbor work

Looking at n-dimensional space
The distance between the cluster and a given data
point is often measured from the center of mass
of the cluster
The center can be calculated
By simply averaging the income and age of each record
By square error criterion
Other
Many clustering problems have hundreds of
dimensions. Our intuition works only in 2 or
3-dimensional space

Figure: customers of a golf equipment business.
Cluster 1: retirees with modest income
Cluster 2: middle-aged weekend golfers
Cluster 3: wealthy youth with exclusive club membership
Outliers
27
Traditional Algorithms

Partitional algorithms
Enumerate K partitions optimizing some criterion
Example: square-error criterion
m_i is the mean of cluster C_i
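Assuming the criterion meant here is the standard one, the square error of a partition is the sum, over all clusters C_i, of the squared Euclidean distances between each point in C_i and the cluster mean m_i. A minimal Python sketch with hypothetical 2-D points:

```python
def mean_point(points):
    """Component-wise mean m_i of a non-empty list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def square_error(clusters):
    """Sum over clusters C_i of squared Euclidean distances from each point to m_i."""
    total = 0.0
    for cluster in clusters:
        m = mean_point(cluster)
        total += sum(sum((x - mx) ** 2 for x, mx in zip(p, m)) for p in cluster)
    return total

# Hypothetical partition into K = 2 clusters.
clusters = [
    [(1.0, 1.0), (3.0, 1.0)],      # mean (2, 1), contributes 1 + 1
    [(10.0, 10.0), (10.0, 14.0)],  # mean (10, 12), contributes 4 + 4
]
print(square_error(clusters))  # 10.0
```

A partitional algorithm searches for the assignment of points to K clusters that makes this value as small as possible.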

28
How is nearness defined

The trivial case

ID  Name   Prediction  Age  Balance ($)  Income  Eyes  Gender
    Carla  Yes         21   2300         High    Blue  F
    Sue    ??          21   2300         High    Blue  F

A record exactly the same as the record to be
predicted is considered close. However, exact
matches are unlikely to be found

The Manhattan Distance metric adds up the
differences between each predictor between the
historical record and the record to be predicted
The Euclidean Distance metric calculates
distance the Pythagorean way (the square of the
hypotenuse is equal to the sum of squares of the
other two sides)
Others

The Manhattan Distance metric (an example)

Calculating the difference between ages (6 years)
and balances (3100) is simple.
Eye color predictor? E.g. match = 0, mismatch = 1
Income: assign numbers high = 3, medium = 2, low = 1
3108 = 6 + 3100 + 0 + 1 + 1
The result must be normalized (e.g. 0-100)
225 = 6 + 19 + 0 + 100 + 100
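The unnormalized sum in this example can be reproduced with a short Python sketch. Carla's values come from the table on the previous slide; the second record is hypothetical, chosen only so that the per-dimension differences are 6, 3100, 0, 1 and 1. Treating gender as match = 0, mismatch = 1, like eye color, is also an assumption.

```python
# Ordinal encoding for income, as suggested in the example.
INCOME_LEVEL = {"high": 3, "medium": 2, "low": 1}

def manhattan(a, b):
    """Sum of per-predictor differences between two customer records."""
    return (
        abs(a["age"] - b["age"])
        + abs(a["balance"] - b["balance"])
        + abs(INCOME_LEVEL[a["income"]] - INCOME_LEVEL[b["income"]])
        + (0 if a["eyes"] == b["eyes"] else 1)      # match = 0, mismatch = 1
        + (0 if a["gender"] == b["gender"] else 1)  # assumed to work like eye color
    )

carla = {"age": 21, "balance": 2300, "income": "high", "eyes": "blue", "gender": "F"}
# Hypothetical record to compare against (not given in the transcript).
other = {"age": 27, "balance": 5400, "income": "high", "eyes": "brown", "gender": "M"}

print(manhattan(carla, other))  # 6 + 3100 + 0 + 1 + 1 = 3108
```

The balance difference dominates the raw total, which is why the example goes on to normalize each dimension to a common range.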
30

Calculating dimension weights
Different dimensions may have different weights
e.g. In text classification not all words
(dimensions) are created equal: "entrepreneur" is
significant, "the" is not.
Two methods
The inverse frequency of the word is used: "the"
1/10,000, "entrepreneur" 1/10
The importance of the word to the topic to be
predicted: "entrepreneur" and "venture capital"
will be given higher weight than "tornado" when the
topic is starting a small business
Dimension weights have also been calculated via
adaptive algorithms where random weights are
tried initially and then slowly modified to
improve the accuracy of the system (neural
networks, genetic algorithms)

31
Hierarchy of Clusters

The hierarchy of clusters is viewed as a tree in
which the smallest clusters merge to create the
next highest level of clusters.
Agglomerative technique starts with as many
clusters as there are records. The clusters that
are nearest each other are merged to form the
next largest cluster. This merging is continued
until a hierarchy of clusters is built.
Divisive technique takes the opposite approach.
It starts with all the records in one cluster,
then tries to split that cluster into smaller
pieces, etc.
The hierarchy allows the end user to choose the
level to work with
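The agglomerative technique can be sketched in a few lines of Python. The sketch works on 1-D points and measures the distance between two clusters by their closest pair of members (single linkage); that linkage choice and the sample points are assumptions, since the slides only say that the nearest clusters are merged.

```python
def agglomerate(points):
    """Single-linkage agglomerative clustering of 1-D points.

    Returns the merges in order, each as (cluster_a, cluster_b, distance).
    """
    clusters = [(p,) for p in sorted(points)]  # start: one cluster per record
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters whose closest members are nearest.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = agglomerate([1, 2, 6, 7, 20])
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

Reading the merge list from the end backwards gives the divisive view of the same hierarchy, and stopping after any merge gives the clusters at that level.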

Large single cluster
Smallest clusters
32
Similar Time Sequences

Given
A set of time-series sequences
Find
All sequences similar to the query sequence
All pairs of similar sequences
whole matching vs. subsequence matching
Sample Applications
Financial market
Market basket data analysis
Scientific databases
Medical Diagnosis

33
Whole Sequence Matching

Basic Idea
Extract k features from every sequence
Every sequence is then represented as a point in
k-dimensional space
Use a multi-dimensional index to store and search
these points
Spatial indices do not work well for high
dimensional data
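The basic idea can be sketched as follows. The slides do not say which k features are extracted; this sketch assumes piecewise means (the average of each of k equal segments) and, for brevity, replaces the multi-dimensional index with a linear scan. The sequences are hypothetical.

```python
import math

def extract_features(sequence, k):
    """Piecewise means: split the sequence into k equal segments and average each.

    Assumes len(sequence) is a multiple of k; any remainder is ignored.
    """
    seg = len(sequence) // k
    return tuple(sum(sequence[i * seg:(i + 1) * seg]) / seg for i in range(k))

def most_similar(query, database, k=4):
    """Name of the stored sequence whose k-dimensional point is closest to the query's."""
    q = extract_features(query, k)
    return min(database, key=lambda name: math.dist(q, extract_features(database[name], k)))

# Hypothetical stored sequences.
database = {
    "rising": [1, 2, 3, 4, 5, 6, 7, 8],
    "falling": [8, 7, 6, 5, 4, 3, 2, 1],
    "flat": [4, 4, 4, 4, 5, 5, 5, 5],
}

print(extract_features(database["rising"], 4))
print(most_similar([1, 2, 3, 5, 5, 6, 8, 8], database))
```

In a real system the k-dimensional points would be stored in a spatial index, which is where the high-dimensionality problem noted above arises.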

34
Similar Time Sequences

Sequences are normalized with amplitude scaling
and offset translation
Two subsequences are considered similar if one
lies within an envelope of width ε around the
other, ignoring outliers
Two sequences are said to be similar if they have
enough non-overlapping time-ordered pairs of
similar subsequences
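A minimal Python sketch of the first two steps follows. It assumes that offset translation means subtracting the mean, that amplitude scaling means dividing by the value range, and that the envelope extends `width` above and below the other subsequence; the slides do not fix these details. Function names and data are hypothetical.

```python
def normalize(seq):
    """Offset translation (subtract the mean), then amplitude scaling (divide by the range)."""
    mean = sum(seq) / len(seq)
    shifted = [v - mean for v in seq]
    spread = max(shifted) - min(shifted)
    return [v / spread for v in shifted] if spread else shifted

def within_envelope(a, b, width, max_outliers=0):
    """True if b stays within `width` of a at all but at most `max_outliers` positions."""
    misses = sum(abs(x - y) > width for x, y in zip(a, b))
    return misses <= max_outliers

# Two hypothetical subsequences with the same shape at different scales.
a = [10, 12, 14, 12, 10]
b = [100, 120, 140, 120, 100]

print(within_envelope(a, b, 0.05))                        # raw values differ widely
print(within_envelope(normalize(a), normalize(b), 0.05))  # same shape after normalization
```

Pairs of subsequences that pass this test are the building blocks for the sequence-level similarity described above.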

35
Similar Sequences Found
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds in different fund
groups
36
Similar Images

Given
A set of images
Find
All images similar to a given image
All pairs of similar images
Sample applications
Medical diagnosis
Weather prediction
Web search engine for images
E-commerce

37
Similar Images

QBIC [Nib93, FSN95], [JFS95], WBIIS [WWWFS98]
Generates a single signature per image
Fails when the images contain similar objects,
but at different locations or varying sizes
[Smi97]
Divide an image into individual objects
Manual extraction can be very tedious and time
consuming
Inaccurate in identifying objects and not robust

38
WALRUS

Automatically extract regions from an image based
on the complexity of images
A single signature is used for each region
Two images are considered to be similar if they
have enough similar region pairs

39
WALRUS
Similarity Model
40
WALRUS (Overview)
Image Indexing Phase
Compute wavelet signatures for sliding windows
Cluster windows to generate regions
Insert regions into spatial index (R tree)
Image Querying Phase
Compute wavelet signatures for sliding windows
Cluster windows to generate regions
Find matching regions using spatial index
Compute similarity between query image and target
images
41
WALRUS
Query image
42
Web Mining Challenges

Today's search engines are plagued by problems
the abundance problem (99% of info of no interest
to 99% of people)
limited coverage of the Web (internet sources
hidden behind search interfaces)
limited query interface based on keyword-oriented
search
limited customization to individual users

43
Web is ..

The web is a huge collection of documents
Semistructured (HTML, XML)
Hyper-link information
Access and usage information
Dynamic
(i.e. New pages are constantly being generated)

44
Web Mining

Web Content Mining
Extract concept hierarchies/relations from the
web
Automatic categorization
Web Log Mining
Trend analysis (i.e. web dynamics info)
Web access association/sequential pattern
analysis
Web Structure Mining
Google: a page is important if important pages
point to it
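The recursive idea that importance flows along links can be sketched as a simplified PageRank-style power iteration. The damping factor of 0.85, the iteration count, and the four-page link graph are assumptions for illustration; this is not a description of any search engine's actual implementation.

```python
def page_importance(links, damping=0.85, iterations=50):
    """Iteratively score pages in a dict {page: [pages it links to]}.

    Every link target must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its importance on, split evenly over its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its importance evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: three pages point to "c", and "c" points to "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = page_importance(links)
print(sorted(rank, key=rank.get, reverse=True))
```

Page "a" has only one incoming link, but it comes from the most important page, so "a" ends up scoring well above "b" and "d".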

45
Improving Search/Customization

Learn about users' interests based on access
patterns
Provide users with pages, sites and
advertisements of interest

46
Summary

Data mining
Good science - leading position in research
community
Recent progress for large databases: association
rules, classification, clustering, similar time
sequences, similar image retrieval, outlier
discovery, etc.
Many papers were published in major conferences
Still promising and rich field with many
challenging research issues
Maturing in industry