Data Mining PowerPoint PPT Presentation


Title: Data Mining


1
Section 5
  • Data Mining

2
Section Content
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

3
5.1 Data Mining Introduction
  • Data mining
  • the discovery of new information in terms of
    patterns or rules from huge amounts of data
  • mining tools should identify these patterns,
    rules and trends with minimal user input
  • data mining is related to
  • statistics (exploratory data analysis)
  • artificial intelligence (knowledge discovery and
    machine learning)
  • techniques from machine learning, statistics,
    neural networks and genetic algorithms are used
  • due to the vastness of the amount of data,
    efficiency/scalability of data mining algorithms
    is a key issue

4
Data Mining and Data Warehousing
  • The goal of data warehousing is to support
    decision making with data.
  • In conjunction with a data warehouse, data mining
    can help with certain types of decisions.
  • Data mining helps to extract new patterns/rules
    that cannot be found by merely querying or
    processing data.
  • Aggregated or summarised collections of data in
    warehouses improve the efficiency of data mining
    in these cases.
  • The potential use of data mining needs to be
    considered early in the design of a data
    warehouse.

5
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

6
5.2 Knowledge Discovery
  • Data mining is part of the knowledge discovery
    process
  • data selection
  • data cleansing
  • enrichment
  • data transformation / encoding
  • data mining
  • reporting and display
  • Example
  • Database: transaction database for a goods
    retailer
  • Client data: name, zip code, phone, date of
    purchase, item code, price, quantity, total amount

7
Knowledge Discovery - Example
  • New knowledge can be discovered from the client
    data
  • data selection
  • data about specific items or categories of items
  • items from stores in specific regions
  • data cleansing
  • correct incorrect zip codes
  • eliminate records with incorrect phone numbers
  • enrichment: add additional information
  • age, income, credit rating of client
  • data transformation: reduce the amount of data
  • group items into product categories
  • group zip codes into regions

8
Data Mining - Knowledge Discovery
  • Data mining might discover
  • co-occurrences - items that are typically bought
    together
  • association rules - when a customer buys video
    equipment, he/she also buys another electronic
    gadget
  • sequential patterns - when a customer buys a
    camera, then within 3 months he/she buys
    photographic supplies
  • classification trees - customers can be
    classified by frequency of visits, types of
    finance used, etc. combined with statistics about
    the classes
  • This information can then be used to for example
  • optimise store locations
  • run promotions
  • plan seasonal marketing strategies

9
Goals of Data Mining
  • Prediction
  • show how certain attributes within the data will
    behave in the future
  • example predict what customers will buy under
    certain discounts
  • example predict sales volume for some period
  • Identification
  • data patterns can be used to identify the
    existence of an item, an event, or an activity
  • example detecting intruders by the commands they
    execute

10
Goals of Data Mining
  • Classification
  • partition data such that different classes or
    categories can be identified
  • example customers can be categorised into
    regular and infrequent shoppers, into
    discount-seeking customers etc.
  • categorisation - e.g. into food categories - can
    reduce the complexity of data mining
  • Optimisation
  • optimise the use of limited resources (time,
    space, money, etc)
  • example what are the best products to spend our
    money on over the next three months?

11
Types of Knowledge Discovered
  • Co-occurrences
  • collection of items/actions/events that occur
    together
  • example items that are bought together by a
    consumer in a shop
  • Association rules
  • correlation of a set of items with another range
    of values for another set of variables
  • example when someone buys bread, he/she is
    likely to buy cheese
  • Classification hierarchies
  • create a hierarchy of classes from an existing
    set of events or transactions
  • example customers might be divided into a credit
    worthiness hierarchy based on their previous
    credit transactions

12
Types of Knowledge Discovered
  • Sequential patterns
  • search for a sequence of events or actions
  • example a patient that underwent cardiac surgery
    and later developed high blood urea, is likely to
    suffer from kidney problems
  • Patterns within time series
  • detection of similarities within positions of the
    time series
  • example a pattern in a time series of stock
    market prices may be used to predict employment
    rates
  • Categorisation and segmentation
  • partition a set of events or items into
    segments/categories/classes
  • example treatment data on a disease can be
    partitioned into groups based on the side effects
    that are caused

13
Counting Co-occurrences
  • The problem is to count co-occurring itemsets -
    motivated by market basket analysis.
  • A database of consumer transactions forms the
    basis
  • transaction: a single visit to a store, an order
    at a virtual store (Web site), or a single order
    through a mail-order catalog
  • a transaction consists of a transaction ID,
    customer ID, date, item and quantity
  • The goal is to identify items that are typically
    purchased together.
  • This can be used to improve the layout of shops
    or catalogs.

14
Frequent Itemsets (1)
  • Consider the following transaction table
  • Transaction Customer Date Items bought
  • 101 12 11/09 milk, bread, juice
  • 792 13 12/09 milk, juice
  • 1130 14 14/09 milk, eggs
  • 1735 13 14/09 bread, coffee, biscuits
  • Items bought in one visit are already grouped
    together into itemsets.
  • Support of an itemset: the fraction of
    transactions that contain all items in the
    itemset
  • Examples
  • {milk, juice} has a support of 50%
  • {bread, coffee} has a support of 25%
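The support computation on this slide can be sketched in Python (the transaction table is the one shown above; the function name `support` is our own choice):

```python
# Transaction table from the slide, keyed by transaction ID.
transactions = {
    101: {"milk", "bread", "juice"},
    792: {"milk", "juice"},
    1130: {"milk", "eggs"},
    1735: {"bread", "coffee", "biscuits"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)

print(support({"milk", "juice"}, transactions))    # 0.5
print(support({"bread", "coffee"}, transactions))  # 0.25
```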

15
Frequent Itemsets (2)
  • Large itemsets are itemsets that have a certain
    minimum support, i.e. are itemsets that occur
    frequently.
  • Example
  • for a minimum support of 40%, the large itemsets
    are {milk}, {bread}, {juice} and {milk, juice}
  • Proposition
  • every subset of a large itemset is also a large
    itemset
  • Algorithm
  • large itemsets can be computed incrementally
  • start with itemsets of cardinality 1 that have
    the required support

16
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

17
5.3 Association Rules
  • A database can be regarded as a collection of
    transactions.
  • Each transaction involves a set of items.
  • Example: the items in the basket of a shopper in
    a supermarket
  • Transaction Time Items bought
  • 101 6:35 milk, bread, juice
  • 792 7:38 milk, juice
  • 1130 8:05 milk, eggs
  • 1735 8:40 bread, coffee, biscuits

18
Association Rules
  • An association rule is of the form X => Y where X
    and Y are two disjoint sets of items
  • Example
  • for sets of goods as itemsets X and Y, the
    expression X => Y means that if a customer buys
    X, he/she is also likely to buy Y.
  • if the customer buys milk, he/she is also likely
    to buy juice.
  • The support for a rule X => Y is the percentage
    of transactions that hold all of the items in the
    union X ∪ Y.
  • Examples
  • Milk => Juice has 50% support
  • Bread => Juice has 25% support

19
Association Rules
  • The confidence of a rule X => Y is the percentage
    (fraction) of all transactions including X that
    also include Y.
  • Example
  • the rule Milk => Juice has confidence 66.7%
  • that means that 2/3 of all transactions with milk
    also include juice
  • Note that support and confidence might be
    different.
  • The goal is to discover rules with a certain
    minimum support and confidence.
  • These rules can be used for prediction: for a
    rule
  • Pen => Ink
  • offer discounts on pens and you might increase
    ink sales.

20
Association Rules
  • How to compute these rules?
  • Generate large itemsets (itemsets with a certain
    minimum support)
  • For each large itemset X, generate all rules with
    a certain minimum confidence (mconf)
  • for X and Y ⊂ X, let Z = X - Y (divide X
    into Y and Z)
  • if support(X) / support(Y) > mconf then
  • Y => Z is a valid rule
  • the confidence of rule Y => Z is defined as
    support(X) / support(Y)
  • Example for X = {milk, juice} and Y = {milk} ⊂
    {milk, juice},
  • let Z = {juice}
  • X, Y, Z have support 50%, 75% and 50%, resp.
    (itemset supports as on slide 14)
  • for mconf = 40%: milk => juice is a valid rule
    with confidence 66.7% (= 50/75)
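The rule-generation step above can be written as a short sketch (the helper names `support` and `rules_from` are our own, and the transaction table is the one from slide 14):

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(X, mconf):
    """Split the large itemset X into Y => Z for every non-empty
    proper subset Y, keeping rules whose confidence
    support(X) / support(Y) meets mconf."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):
        for Y in map(frozenset, combinations(X, r)):
            conf = support(X) / support(Y)
            if conf >= mconf:
                rules.append((set(Y), set(X - Y), conf))
    return rules

for Y, Z, conf in rules_from({"milk", "juice"}, mconf=0.4):
    print(Y, "=>", Z, round(conf, 3))
```

With mconf = 40% this reproduces milk => juice with confidence 0.667, and also yields juice => milk with confidence 1.0, since every transaction containing juice also contains milk.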

21
Generating Association Rules
  • In principle, generating rules based on large
    itemsets and their support is straightforward.
  • Computing all large itemsets and their support
    creates an efficiency problem if the number of
    items is very high.
  • If m is the number of items, then 2^m is the
    number of different itemsets.
  • Example a typical supermarket might have several
    thousands of items.
  • Computing the support of all itemsets might take
    a long time.
  • Reducing the combinatorial search space is
    therefore important - the following properties
    can be used
  • subsets of large itemsets are large
  • extensions of small itemsets are small

22
Association Rules - Algorithms
  • Outline of an algorithm that finds large
    itemsets
  • Step 1
  • test the support for itemsets of length 1 -
    called 1-itemsets - by scanning the database
  • discard those that do not meet the minimum
    requirement.
  • Step 2
  • extend large 1-itemsets into 2-itemsets by
    appending one item each time (this generates all
    itemsets of length two)
  • test the support and eliminate all 2-itemsets
    that do not meet the minimum support.
  • Step 3
  • repeat the above steps extend (k-1)-itemsets
    into k-itemsets.
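The three steps can be combined into one level-wise search. A minimal sketch (this is essentially the Apriori idea, though the slide does not name it; `large_itemsets` is our own name):

```python
def large_itemsets(transactions, minsup):
    """Level-wise search: start with large 1-itemsets, then
    repeatedly extend the surviving k-itemsets by one item and
    discard candidates below the minimum support."""
    n = len(transactions)
    items = {i for t in transactions for i in t}

    def sup(s):
        return sum(1 for t in transactions if s <= t) / n

    result = {}
    # Step 1: large 1-itemsets
    level = {frozenset([i]) for i in items if sup(frozenset([i])) >= minsup}
    while level:
        result.update({s: sup(s) for s in level})
        # Steps 2/3: extend each large itemset by one item, re-test support
        candidates = {s | {i} for s in level for i in items if i not in s}
        level = {c for c in candidates if sup(c) >= minsup}
    return result

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]
for s, v in large_itemsets(transactions, 0.4).items():
    print(set(s), v)
```

For a minimum support of 40% this yields exactly the large itemsets of slide 15: {milk}, {bread}, {juice} and {milk, juice}.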

23
Association Rules among Hierarchies
  • Items might be divided among disjoint hierarchies
    based on some classification, e.g. Beverage can
    be divided into Juice and Milk
  • Associations might occur among the hierarchies of
    items.
  • Example: healthy frozen yoghurt => bottled water
  • Particularly interesting are associations across
    hierarchies.
  • this kind of information can be used to arrange
    different kinds of items in a supermarket

24
Negative Associations
  • Negative associations are more difficult to
    detect than positive associations.
  • Example: 60% of customers who buy crisps do not
    buy bottled water.
  • There are usually more negative associations than
    positive ones.
  • The majority of itemset combinations do not occur
    in databases.
  • Finding interesting negative associations can be
    difficult.

25
Association Rules - Additional Considerations
  • Sampling
  • For very large databases, sampling improves
    efficiency.
  • Truly representative samples can help to find
    most of the rules.
  • The danger is that
  • false positives might be discovered (large
    itemsets that are not truly large)
  • true positives might be missing.
  • Other problems
  • Cardinality of itemsets and volume of
    transactions can be very high.
  • Variability of transactions (geographical,
    seasonal) makes sampling difficult.
  • Multiple classifications along different
    dimensions.

26
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

27
5.4 Sequential Patterns
  • Sequential patterns are based on sequences of
    itemsets.
  • Assume transactions to be ordered by time.
  • Example
  • transactions in a supermarket
  • <{milk, bread, juice}, {bread, eggs}, {milk,
    coffee, biscuits}> may be based on three visits
    of a customer
  • A subsequence of a sequence is obtained by
    deleting one or more itemsets.
  • Example
  • let <{milk, bread, juice}, {bread, eggs}, {milk,
    coffee, biscuits}> be the original sequence
  • <{milk, bread, juice}, {bread, eggs}> is a
    subsequence
  • <{milk, bread, juice}, {milk, coffee, biscuits}>
    is a subsequence

28
Support for Sequences
  • A sequence <a1, ..., am> is contained in another
    sequence S if
  • S has a subsequence <b1, ..., bm> such that ai ⊆
    bi for 1 <= i <= m
  • Example
  • <{milk, bread}, {coffee, biscuits}> is contained
    in <{milk, bread, juice}, {bread, eggs}, {milk,
    coffee, biscuits}>
  • The support of a sequence S is the percentage of
    a set of given sequences that contain S as a
    subsequence.
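Containment and support of sequences can be sketched as follows (helper names `contains` and `seq_support` are our own; itemsets are Python sets, sequences are lists of sets):

```python
def contains(S, T):
    """True if sequence T = <t1, ..., tm> is contained in S:
    an order-preserving subsequence of S exists whose itemsets
    are supersets of the corresponding ti."""
    remaining = iter(S)  # consume S left to right, preserving order
    return all(any(t <= s for s in remaining) for t in T)

def seq_support(T, sequences):
    """Fraction of the given sequences that contain T."""
    return sum(contains(S, T) for S in sequences) / len(sequences)

S = [{"milk", "bread", "juice"}, {"bread", "eggs"},
     {"milk", "coffee", "biscuits"}]
print(contains(S, [{"milk", "bread"}, {"coffee", "biscuits"}]))  # True
print(contains(S, [{"coffee"}, {"eggs"}]))                       # False
```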

29
Discovery of Patterns in Time Series
  • Time series are sequences of events.
  • An event might be a fixed type of transaction.
  • Example
  • closing price of a stock or fund each day.
  • Analysis of time series
  • find periods of time in which the stock did not
    fluctuate more than 1%
  • find period (week/month/quarter) with the
    greatest loss
  • identify stocks with similar behaviour

30
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

31
5.5 Classification and Regression
  • Classification Rules
  • Regression
  • Tree-structured Rules

32
Discovery of Classification Rules
  • Classification means defining/identifying a
    function that maps an object into one of many
    possible classes.
  • Example: a bank wants to classify loan applicants
    into loanworthy and not loanworthy
  • a classification rule could define the
    classification
  • not loanworthy: current monthly debt obligation
    exceeds 25% of monthly net income
  • loanworthy: otherwise
  • loanworthiness is a dependent, categorical
    attribute
  • In general there is one rule (set) per class
  • (var1 in range1) and ... and (varn in rangen)
  • => object O in class C1
  • var1, ..., varn are the predictor attributes

33
Support and Confidence
  • Again we can define support and confidence for
    these rules.
  • The support for a classification condition C is
    the percentage of tuples that satisfy C.
  • The support for a rule C1 => C2 is the support
    for the condition C1 AND C2. (C1 AND C2 is the
    set of objects that satisfy both C1 and C2.)
  • Consider those tuples that satisfy condition C1.
    The confidence for a rule C1 => C2 is the
    percentage of such tuples that also satisfy
    condition C2.

34
Regression
  • Regression is similar to classification, except
    that the dependent variable is numerical (and not
    categorical).
  • Rules (such as classification rules) can be
    regarded as functions.
  • A regression rule is a function that maps
    variables into a target class variable.
  • Example: LabTest(patientID, test1, ..., testn)
  • the values in that relation result from a series
    of lab tests
  • the target variable P is the probability of
    survival - a numerical variable
  • the regression rule
  • (test1 in range1) and ... and (testn in rangen)
    => P = x
  • the regression function is P = f(test1, ...,
    testn)

35
Regression (2)
  • If P appears as a function y = f(x1, ..., xn) and
    f is linear in the domain variables, then the
    process of deriving f from a given set of tuples
    <x1, ..., xn, y> is called linear regression.
  • Linear regression is a common statistical
    technique.
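Deriving such an f by least squares can be sketched with NumPy (the sample tuples are hypothetical, chosen so that the fit is exact):

```python
import numpy as np

# Sample tuples <x1, x2, y> generated from y = 2*x1 - x2 + 1.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 0.0]])
y = 2 * X[:, 0] - X[:, 1] + 1

# Fit y ~ w1*x1 + w2*x2 + b by least squares
# (append a column of ones for the intercept b).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # close to [2, -1, 1]
```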

36
Tree-Structured Rules
  • Specific classification and regression rules
    shall now be examined.
  • These are rules that can be represented as trees
    - called classification trees or decision trees.
  • These trees are typically the output of the data
    mining activity.
  • Each path from a root to a leaf node represents
    one classification rule.
  • Example Insurance risk determination for motor
    insurance

(Tree from the slide: the root splits on Age; Age >= 25
leads to leaf NO; Age < 25 leads to a Car Type node,
where sports leads to leaf YES and family to leaf NO.)
37
Decision Trees
  • A decision tree is a graphical representation of
    a collection of classification rules.
  • Each node in the tree is labelled with a
    predictor or splitting attribute.
  • Each outgoing edge of an internal node is
    labelled with a predicate that involves the
    splitting attribute.
  • Each leaf node is labelled with a value of the
    dependent attribute.
  • A classification rule can be associated with each
    leaf node - constructed as the conjunction of the
    predicates
  • Age < 25 and Car Type = sports for the YES-leaf
  • Decision trees are constructed in two phases
  • growth phase: create the tree based on specialised
    rules from an input database (relation)
  • pruning phase: reduce tree size by generalising
    rules
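The motor-insurance tree of slide 36 reads directly as code; each root-to-leaf path is one classification rule (the function name and encoding are our own):

```python
def insurance_risk(age, car_type):
    """Evaluate the decision tree: high risk is 'YES'."""
    if age >= 25:
        return "NO"      # rule: Age >= 25 => NO
    if car_type == "sports":
        return "YES"     # rule: Age < 25 and Car Type = sports => YES
    return "NO"          # rule: Age < 25 and Car Type = family => NO

print(insurance_risk(22, "sports"))  # YES
print(insurance_risk(40, "sports"))  # NO
```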

38
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

39
5.6 Other Types of Data Mining
  • Neural Networks
  • Genetic Algorithms
  • Clustering and Segmentation

40
Neural Networks
  • Techniques from artificial intelligence can be
    used to generalise regression.
  • Neural networks provide an iterative method to
    carry out this generalised regression.
  • Neural networks use a curve-fitting approach to
    infer a function from a set of samples.
  • This process is based on learning: a test sample
    is the initial input, and the system then
    incrementally infers functions based on more
    samples
  • Neural networks can be applied to classification
    problems.
  • Modelling time series with neural networks is
    difficult.

41
Genetic Algorithms (1)
  • Genetic algorithms (GA) are a class of randomised
    search procedures for adaptive and robust search
    over a wide range of search topologies.
  • Principle
  • Genetic algorithms extend the idea of
    characterising human DNA by a four-letter
    alphabet (A,C,T,G).
  • Construction
  • Devise an alphabet that allows the encoding of a
    solution to the decision problem in terms of
    strings of that alphabet.
  • Usage
  • Study the cutting and combination of strings
    (compare natural reproduction and evolution).
  • New generations of individuals (solutions) are
    generated and assessed - survival of the fittest.

42
Genetic Algorithms (2)
  • Generation of solutions - comparison with other
    techniques.
  • GA search uses a set of solutions during each
    generation rather than a single solution.
  • The search in the string-space represents a much
    larger parallel search in the space of encoded
    solutions.
  • The memory of the search completed is represented
    solely by the set of solutions available for
    generation.
  • A GA is a randomised algorithm since search
    mechanisms use probabilistic operators.
  • While progressing from one generation to the
    next, a GA finds near-optimal balance between
    knowledge acquisition and exploitation by
    manipulating encoded solutions.

43
Clustering and Segmentation
  • Clustering is about identification and
    classification.
  • Clustering tries to identify categories (or
    clusters) to which a data object can be mapped.
  • The categories can be disjoint or might overlap;
    they might be organised into trees.
  • A related problem multivariate probability
    density functions.
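As an illustration of cluster identification (not a method from the slides), a minimal 1-D k-means sketch that maps each data point to the nearest of k categories:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Alternate between assigning points to the nearest centre
    and recomputing each centre as its cluster mean
    (Lloyd's algorithm, restricted to 1-D data)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

data = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans_1d(data, 2))  # roughly [1.0, 10.0]
```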

44
Sections Covered
  • 5.1 Introduction
  • 5.2 Knowledge Discovery
  • 5.3 Association Rules
  • 5.4 Sequential Patterns
  • 5.5 Classification and Regression
  • 5.6 Other Forms of Data Mining
  • 5.7 Applications of Data Mining

45
5.7 Applications of Data Mining
  • Decision-making contexts
  • marketing
  • analysis of customer behaviour based on buying
    patterns
  • determination of marketing strategies (store
    locations, advertising campaigns, etc)
  • segmentation of customers, stores, products.
  • finance
  • analysis of creditworthiness of clients
  • performance analysis of finance investments
  • evaluation of financing options
  • fraud detection.

46
Applications
  • Manufacturing
  • optimisation of resources (machines, manpower,
    material)
  • optimal design of manufacturing process,
    shop-floor layout, etc.
  • Health care
  • analysis of effectiveness of certain treatments
  • optimisation of processes in a hospital
  • analysing side effects of drugs
  • relating patient wellness and doctor
    qualifications.