Data Mining overview - PowerPoint PPT Presentation

Slides: 47 (provided by: csBg)


Data Mining (overview)
Presentation overview
  • Introduction
  • Association Rules
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outliers
  • WWW
  • Summary

Introduction
  • Corporations have huge databases containing a
    wealth of information
  • Business databases potentially constitute a
    goldmine of valuable business information
  • Yet database systems provide very little
    functionality to support data mining applications
  • Data mining: the efficient discovery of
    previously unknown patterns in large databases

Sample Applications
  • Fraud Detection
  • Loan and Credit Approval
  • Market Basket Analysis
  • Customer Segmentation
  • Financial Applications
  • E-Commerce
  • Decision Support
  • Web Search

Data Mining Techniques
  • Association Rules
  • Sequential Patterns
  • Classification
  • Clustering
  • Similar Time Sequences
  • Similar Images
  • Outlier Discovery
  • Text/Web Mining

Examples of Discovered Patterns
  • Association rules
  • 98% of people who purchase diapers also buy beer
  • Classification
  • People with age less than 25 and salary > 40k
    drive sports cars
  • Similar time sequences
  • Stocks of companies A and B perform similarly
  • Outlier Detection
  • Residential customers of a telecom company who run
    businesses at home

Association Rules
  • Given
  • A database of customer transactions
  • Each transaction is a set of items
  • Find all rules X => Y that correlate the presence
    of one set of items X with another set of items Y
  • Example: 98% of people who purchase diapers and
    baby food also buy beer
  • Any number of items may appear in the consequent or
    antecedent of a rule
  • Possible to specify constraints on rules (e.g.,
    find only rules involving expensive imported products)

Association Rules
  • Sample Applications
  • Market basket analysis
  • Attached mailing in direct marketing
  • Fraud detection for medical insurance
  • Department store floor/shelf planning

Confidence and Support
  • A rule must have some minimum user-specified
    confidence (how often the rule is correct)
  • 1 & 2 => 3 has 90% confidence if, when a customer
    bought 1 and 2, in 90% of cases the customer
    also bought 3
  • A rule must have some minimum user-specified
    support (how frequently the rule occurs)
  • 1 & 2 => 3 should hold in some minimum percentage
    of transactions to have business value

  • For minimum support 50% and minimum confidence
    50%, we have the following rules
  • 1 => 3 with 50% support and 66% confidence
  • ({1, 3} occurred in 50% of transactions, but when 1
    occurred, 3 also occurred in only 2/3 of cases)
  • 3 => 1 with 50% support and 100% confidence
  • ({1, 3} occurred in 50% of transactions, and whenever
    3 occurred, 1 occurred too)
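The support and confidence computations above can be sketched in a few lines of Python; the four-transaction database below is a hypothetical example chosen so that the rules come out with the slide's numbers (50% support, 66% and 100% confidence):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(X union Y) / support(X) for the rule X => Y."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Hypothetical transaction database chosen to match the slide's numbers
transactions = [{1, 3}, {1, 2, 3}, {1, 2}, {2, 5}]

print(support({1, 3}, transactions))       # 0.5 -> 50% support
print(confidence({1}, {3}, transactions))  # 0.666... -> 66% confidence (1 => 3)
print(confidence({3}, {1}, transactions))  # 1.0 -> 100% confidence (3 => 1)
```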

Quantitative Association Rules
  • Quantitative attributes (e.g. age, income)
  • Categorical attributes (e.g. make of car)
  • Age in [30..39] and Married = Yes => ...

(min support 40%, min confidence 50%)
Temporal Association Rules
  • Can describe the rich temporal character in data
  • Example
  • diaper => beer (support 5%, confidence ...)
  • Support of this rule may jump to 25% between 6 and
    9 PM on weekdays
  • Problem: how to find rules that follow
    interesting user-defined temporal patterns
  • The challenge is to design efficient algorithms that
    do much better than finding every rule in every
    time unit

Correlation Rules
  • Association rules do not capture correlations
  • Example
  • Suppose 90% of customers buy coffee, 25% buy tea,
    and 20% buy both tea and coffee
  • tea => coffee has high support (0.2) and
    confidence (0.8)
  • but tea and coffee are not correlated:
  • the expected support of customers buying both, if
    the two were independent, would be 0.9 × 0.25 = 0.225 > 0.2
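One standard way to quantify this gap is the lift of a rule: observed co-occurrence divided by the co-occurrence expected under independence. A minimal sketch using the slide's numbers:

```python
def lift(p_x, p_y, p_xy):
    """Observed co-occurrence divided by what independence would predict."""
    return p_xy / (p_x * p_y)

p_coffee, p_tea, p_both = 0.9, 0.25, 0.2
expected = p_coffee * p_tea            # 0.225 expected under independence
ratio = lift(p_tea, p_coffee, p_both)  # 0.2 / 0.225 = 0.888...
# ratio < 1: buying tea actually makes buying coffee slightly *less*
# likely, despite the rule's high support and confidence
```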

Sequential Patterns
  • Given
  • A sequence of customer transactions
  • Each transaction is a set of items
  • Find all maximal sequential patterns supported by
    more than a user-specified percentage of customers
  • Example: 10% of customers who bought a PC did a
    memory upgrade in a subsequent transaction
  • 10% is the support of the pattern
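A sketch of how a sequential pattern's support might be counted; the customer histories below are made up, and each customer is represented as a time-ordered list of transaction itemsets:

```python
def supports(sequence, pattern):
    """True if `pattern` (a list of itemsets) occurs, in order, within
    `sequence`, a customer's time-ordered list of transaction itemsets."""
    i = 0
    for transaction in sequence:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

# Hypothetical customer histories: a PC purchase, possibly followed by
# a memory upgrade in a later transaction
customers = [
    [{"PC"}, {"memory"}],
    [{"PC"}, {"printer"}],
    [{"phone"}],
]
pattern = [{"PC"}, {"memory"}]
pct = sum(supports(c, pattern) for c in customers) / len(customers)
# pct is the support of the pattern over this toy database
```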

Classification
  • Given
  • A database of tuples, each assigned a class label
  • Develop a model/profile for each class
  • Example profile (good credit): (25 < age < 40
    and income > 40k) or (married = YES)
  • Sample applications
  • Credit card approval (good, bad)
  • Bank locations (good, fair, poor)
  • Treatment effectiveness (good, fair, poor)

Decision Trees
[Decision tree for predicting churners]
50 Churners / 50 Non-Churners
├─ New technology phone → 30 Churners / 50 Non-Churners
│   ├─ Customer < 2.3 years → 25 Churners / 10 Non-Churners
│   │   ├─ Age < 55 → 20 Churners / 0 Non-Churners
│   │   └─ Age > 55 → 5 Churners / 10 Non-Churners
│   └─ Customer > 2.3 years → 5 Churners / 40 Non-Churners
└─ Old technology phone → 20 Churners / 0 Non-Churners
A decision tree is a predictive model that makes
a prediction on the basis of a series of decisions
Decision Trees
DTs create a segmentation of the original data
set, done for the purpose of predicting some piece
of information. The records that fall into each
segment are similar with respect to the information
being predicted. The tree and the underlying
algorithms may be complex, but the results are
presented in an easy-to-understand way, quite
useful to the business user.
Decision Trees
  • DTs in business
  • Automation: a very favorable technique for
    automating data mining and predictive
    modeling; DTs embed automated solutions to
    things that other techniques leave as a burden to
    the user (4/4)
  • Clarity: the models are viewed as a tree of
    simple decisions based on familiar predictors, or
    as a set of rules. The user can confirm the DT or
    modify it by hand on the basis of his own expertise
  • ROI: because DTs work well with relational
    databases, they provide well-integrated solutions
    with highly accurate models (3/4)

Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts
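To illustrate the "axis-parallel cuts" point, here is a minimal sketch of how a tree-growing algorithm picks a single split: it exhaustively tries one-feature thresholds and keeps the cut with the lowest weighted Gini impurity. The churn data and feature layout are made up for illustration:

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_axis_parallel_cut(rows, labels):
    """Exhaustively try every (feature, threshold) cut and return the one
    with the lowest weighted Gini impurity -- an axis-parallel split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best[1], best[2]

# Made-up churn data: (tenure_years, age) per customer
rows = [(1.0, 30), (2.0, 40), (3.0, 50), (4.0, 60)]
labels = ["churn", "churn", "stay", "stay"]
print(best_axis_parallel_cut(rows, labels))  # (0, 2.0): split on tenure <= 2.0
```

Note that every candidate cut is of the form "feature <= threshold", i.e. perpendicular to one axis; a diagonal boundary involving two correlated attributes cannot be expressed in a single split.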

Clustering
  • Given
  • Data points and the number of desired clusters K
  • Group the data points into K clusters
  • Data points within a cluster are more similar to
    each other than to points in other clusters
  • Sample applications
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

Where to use clustering and nearest-neighbor
  • Clustering for clarity
  • A high-level view
  • Segmentation
  • Clustering for outlier analysis
  • To see records that stick out of the rest
  • e.g. Wine distributors produce a certain level of
    profit. One store produces significantly lower
    profit. Turns out that the distributor was
    delivering to but not collecting payment from one
    of its customers.
  • Nearest neighbor for prediction
  • Objects near each other have similar
    prediction values
  • Examples: finding more documents like a given one
    among journal articles; predicting the next value
    of a stock price based on its behavior over time

Outlier Discovery
  • Sometimes clustering is performed to see when one
    record sticks out of the rest
  • E.g. One store stands out as producing
    significantly lower profit. Closer examination
    shows that the distributor was not collecting
    payment from one of his customers
  • E.g. A sale of men's suits is being held in all
    branches of a department store. All stores but
    one have seen at least a 100% jump in revenue. It
    turns out that store had advertised via radio
    rather than TV, as the other stores did
  • Sample applications
  • Credit card fraud detection
  • Telecom fraud detection
  • Customer segmentation
  • Medical analysis

Outlier Discovery
  • Given
  • Data points and the number of outliers (n) to find
  • Find the top n outlier points
  • Outliers are considerably dissimilar from the
    remainder of the data

Statistical Approaches
  • Model underlying distribution that generates
    dataset (e.g. normal distribution)
  • Use discordancy tests depending on
  • data distribution
  • distribution parameter (e.g. mean, variance)
  • number of expected outliers
  • Drawbacks
  • most tests are for a single attribute
  • in many cases, the data distribution may not be known
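A minimal sketch of a single-attribute discordancy test under an assumed normal distribution (the classic z-score rule); the per-store profit figures are hypothetical:

```python
import math

def zscore_outliers(data, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the
    mean, assuming the data roughly follow a normal distribution."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return [x for x in data if abs(x - mean) / std > threshold]

# Hypothetical per-store profits; one store is far below the rest
profits = [100, 105, 98, 102, 101, 99, 103, 10]
print(zscore_outliers(profits, threshold=2.0))  # [10]
```

This also shows the drawbacks above: the test covers one attribute at a time, and the threshold is only meaningful if the normality assumption holds.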

Differences between the nearest-neighbor
technique and clustering
Nearest neighbor
  • Used for prediction and consolidation
  • Space is defined by the problem to be solved
  • Generally only uses distance metrics to determine
    nearness
Clustering
  • Used for consolidating data into a high-level
    view and general grouping of similar records
  • Space is defined as a default n-dimensional space,
    or by the user, or is a predefined space driven by
    past experience
  • Can use other metrics besides distance to
    determine the nearness of two records, e.g. linking
    points together

How clustering and nearest-neighbor work
  • Looking at n-dimensional space
  • The distance between a cluster and a given data
    point is often measured from the center of mass
    of the cluster
  • The center can be calculated
  • by simply averaging, e.g., the income and age of
    each record
  • by the square-error criterion
  • by other methods
  • Many clustering problems have hundreds of
    dimensions; our intuition works only in 2- or
    3-dimensional space

[Figure: customers of a golf equipment business fall
into three clusters. Cluster 1: retirees with modest
income. Cluster 2: middle-aged weekend golfers.
Cluster 3: wealthy youth with exclusive club
memberships.]
Traditional Algorithms
  • Partitional algorithms
  • Enumerate K partitions optimizing some criterion
  • Example: the square-error criterion
    E = Σ_{i=1..K} Σ_{p ∈ Ci} |p − mi|²
  • mi is the mean (center of mass) of cluster Ci
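A minimal k-means sketch along these lines: each pass reassigns points to the nearest center of mass and recomputes the means, greedily reducing the square-error criterion. The naive initialization and the toy (age, income) data are assumptions for illustration:

```python
def kmeans(points, k, iters=10):
    """Plain k-means sketch: assign each point to the nearest center,
    recompute each center as its cluster's center of mass, repeat.
    Each pass greedily reduces the square-error criterion E."""
    centers = [list(p) for p in points[:k]]      # naive initialization
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centers[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):
            if c:                                # keep old center if empty
                centers[i] = [sum(col) / len(c) for col in zip(*c)]
    return centers, clusters

# Made-up 2-D customer data (age, income) with two obvious groups
pts = [(25, 30), (26, 32), (24, 31), (60, 80), (62, 82), (61, 79)]
centers, clusters = kmeans(pts, k=2)
```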

How is nearness defined?
  • The trivial case

  ID  Name   Prediction  Age  Balance($)  Income  Eyes  Gender
  1   Carla  Yes         21   2,300       High    Blue  F
  2   Sue    ??          21   2,300       High    Blue  F

A record exactly the same as the record to be
predicted is considered close. However, it is
unlikely to find exact matches.
  • The Manhattan Distance metric adds up the
    differences in each predictor between the
    historical record and the record to be predicted
  • The Euclidean Distance metric calculates
    distance the Pythagorean way (the square of the
    hypotenuse is equal to the sum of the squares of
    the other two sides)
  • Others

  • The Manhattan Distance metric (an example)

Calculating the difference between ages (6 years)
and balances ($3,100) is simple. For the eye-color
predictor use e.g. match = 0, mismatch = 1; for
income, assign numbers: high = 3, medium = 2, low = 1.
Raw distance: 6 + 3,100 + 0 + 1 + 1 = 3,108
Each dimension must then be normalized (e.g. to 0-100):
6 + 19 + 0 + 100 + 100 = 225
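The same arithmetic can be sketched in code. The column ranges, the second record, and the choice to score any categorical mismatch as 100 after normalization are assumptions made up for illustration; the total comes out close to the 225 above:

```python
def manhattan(a, b, ranges):
    """Manhattan distance over mixed attributes: numeric differences are
    scaled to 0-100 by an assumed column range; categorical attributes
    score 0 on a match and 100 on a mismatch."""
    total = 0.0
    for x, y, rng in zip(a, b, ranges):
        if rng is None:                    # categorical attribute
            total += 0 if x == y else 100
        else:                              # numeric attribute
            total += abs(x - y) / rng * 100
    return total

# (age, balance, eyes, income, gender); the ranges and the second
# record are made up for illustration
ranges = [100, 16000, None, None, None]
carla = (21, 2300, "blue", "high", "F")
other = (27, 5400, "blue", "low", "M")
print(manhattan(carla, other, ranges))     # 6 + ~19 + 0 + 100 + 100 = ~225
```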
  • Calculating dimension weights
  • Different dimensions may have different weights
  • e.g. in text classification not all words
    (dimensions) are created equal: "entrepreneur" is
    significant, "the" is not
  • Two methods
  • The inverse frequency of the word is used: "the" →
    1/10,000, "entrepreneur" → 1/10
  • The importance of the word to the topic to be
    predicted: "entrepreneur" and "venture capital"
    will be given higher weight than "tornado" when
    the topic is starting a small business
  • Dimension weights have also been calculated via
    adaptive algorithms, where random weights are
    tried initially and then slowly modified to
    improve the accuracy of the system (neural
    networks, genetic algorithms)

Hierarchy of Clusters
  • The hierarchy of clusters is viewed as a tree in
    which the smallest clusters merge to create the
    next highest level of clusters.
  • Agglomerative technique starts with as many
    clusters as there are records. The clusters that
    are nearest each other are merged to form the
    next largest cluster. This merging is continued
    until a hierarchy of clusters is built.
  • Divisive techniques take the opposite approach:
    start with all the records in one cluster,
    then try to split that cluster into smaller
    pieces, and so on
  • The hierarchy allows the end user to choose the
    level to work with
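The agglomerative technique can be sketched directly; restricting to 1-D points and single-link cluster distance is a simplifying assumption:

```python
def single_link(a, b):
    """Cluster distance: the closest pair of points across the clusters."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points):
    """Start with one cluster per record, repeatedly merge the two
    nearest clusters, and record every level of the hierarchy."""
    clusters = [[p] for p in points]
    levels = [[tuple(c) for c in clusters]]
    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]],
                                                     clusters[ij[1]]))
        clusters[i].extend(clusters[j])
        del clusters[j]
        levels.append([tuple(sorted(c)) for c in clusters])
    return levels

levels = agglomerate([1, 2, 9, 10, 25])
# levels[0] has five singleton clusters; levels[-1] is one big cluster,
# and the user can pick any level in between to work with
```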

[Dendrogram: one large cluster at the top of the
hierarchy, the smallest clusters at the bottom]
Similar Time Sequences
  • Given
  • A set of time-series sequences
  • Find
  • All sequences similar to the query sequence
  • All pairs of similar sequences
  • whole matching vs. subsequence matching
  • Sample Applications
  • Financial market
  • Market basket data analysis
  • Scientific databases
  • Medical Diagnosis

Whole Sequence Matching
  • Basic Idea
  • Extract k features from every sequence
  • Every sequence is then represented as a point in
    k-dimensional space
  • Use a multi-dimensional index to store and search
    these points
  • Spatial indices do not work well for
    high-dimensional data

Similar Time Sequences
  • Sequences are normalized with amplitude scaling
    and offset translation
  • Two subsequences are considered similar if one
    lies within an envelope of a given width around
    the other, ignoring outliers
  • Two sequences are said to be similar if they have
    enough non-overlapping time-ordered pairs of
    similar subsequences
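The normalization step can be sketched as offset translation (subtract the mean) followed by amplitude scaling (divide by the standard deviation); the two toy series below differ only in scale:

```python
import math

def normalize(series):
    """Offset translation (subtract the mean), then amplitude scaling
    (divide by the standard deviation)."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return [(x - mean) / std for x in series]

# Two made-up price series with the same shape at different scales
a = [10, 12, 14, 12, 10]
b = [100, 120, 140, 120, 100]
na, nb = normalize(a), normalize(b)
# After normalization the two series coincide, so a Euclidean distance
# on the normalized values reveals their similarity in shape
```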

Similar Sequences Found
VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund
Two similar mutual funds from different fund families
Similar Images
  • Given
  • A set of images
  • Find
  • All images similar to a given image
  • All pairs of similar images
  • Sample applications
  • Medical diagnosis
  • Weather prediction
  • Web search engine for images
  • E-commerce

Similar Images
  • Single-signature approaches generate one signature
    per image
  • They fail when the images contain similar objects,
    but at different locations or in varying sizes
  • [Smi97]
  • Divides an image into individual objects
  • Manual extraction can be very tedious and
    time-consuming
  • Inaccurate in identifying objects and not robust

  • Automatically extract regions from an image, based
    on the complexity of the image
  • A single signature is used for each region
  • Two images are considered to be similar if they
    have enough similar region pairs

Similarity Model
WALRUS (Overview)
Image indexing phase
  • Compute wavelet signatures for sliding windows
  • Cluster windows to generate regions
  • Insert regions into a spatial index (R-tree)
Image querying phase
  • Compute wavelet signatures for sliding windows of
    the query image
  • Cluster windows to generate regions
  • Find matching regions using the spatial index
  • Compute similarity between the query image and
    each target image
Web Mining Challenges
  • Today's search engines are plagued by problems:
  • the abundance problem (99% of info is of no
    interest to 99% of people)
  • limited coverage of the Web (Internet sources
    hidden behind search interfaces)
  • limited query interfaces based on keyword-oriented
    search
  • limited customization to individual users

The Web is ...
  • The Web is a huge collection of documents, with
  • Semistructured content (HTML, XML)
  • Hyper-link information
  • Access and usage information
  • Dynamic content
  • (i.e. new pages are constantly being generated)

Web Mining
  • Web Content Mining
  • Extract concept hierarchies/relations from the Web
  • Automatic categorization
  • Web Log Mining
  • Trend analysis (i.e. web dynamics info)
  • Web access association/sequential pattern analysis
  • Web Structure Mining
  • Google: a page is important if important pages
    point to it
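Google's idea can be sketched as a PageRank-style power iteration; the damping factor 0.85 and the three-page graph below are illustrative assumptions:

```python
def pagerank(links, iters=50, d=0.85):
    """Power-iteration sketch: a page's score is a damped sum of the
    scores of the pages linking to it, split over their out-links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        rank = {
            p: (1 - d) / n
               + d * sum(rank[q] / len(links[q])
                         for q in pages if p in links[q])
            for p in pages
        }
    return rank

# Tiny made-up web graph: B and C both point at A, A points back at B
links = {"A": ["B"], "B": ["A"], "C": ["A"]}
rank = pagerank(links)
# A ends up with the highest score: important pages point to it
```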

Improving Search/Customization
  • Learn about users' interests based on access
    patterns
  • Provide users with pages, sites and
    advertisements of interest

Summary
  • Data mining
  • Good science, with a leading position in research
  • Recent progress for large databases: association
    rules, classification, clustering, similar time
    sequences, similar image retrieval, outlier
    discovery, etc.
  • Many papers published in major conferences
  • Still a promising and rich field with many
    challenging research issues
  • Maturing in industry