Title: Data Mining
Section 5
Section Content
- 5.1 Introduction
- 5.2 Knowledge Discovery
- 5.3 Association Rules
- 5.4 Sequential Patterns
- 5.5 Classification and Regression
- 5.6 Other Forms of Data Mining
- 5.7 Applications of Data Mining
5.1 Data Mining Introduction
- Data mining
  - the discovery of new information in terms of patterns or rules from huge amounts of data
  - mining tools should identify these patterns, rules and trends with minimal user input
  - data mining is related to
    - statistics: exploratory data analysis
    - artificial intelligence: knowledge discovery and machine learning
  - techniques from machine learning, statistics, neural networks and genetic algorithms are used
  - due to the sheer volume of data, efficiency/scalability of data mining algorithms is a key issue
Data Mining and Data Warehousing
- The goal of data warehousing is to support decision making with data.
- Data mining, used in conjunction with a data warehouse, can help with certain types of decisions.
- Data mining helps to extract new patterns/rules that cannot be found by merely querying or processing data.
- Aggregated or summarised collections of data in warehouses improve the efficiency of data mining in these cases.
- The potential use of data mining needs to be considered early in the design of a data warehouse.
5.2 Knowledge Discovery
- Data mining is part of the knowledge discovery process
  - data selection
  - data cleansing
  - enrichment
  - data transformation / encoding
  - data mining
  - reporting and display
- Example
  - Database: transaction database for a goods retailer
  - Client data: name, zip code, phone, date of purchase, item code, price, quantity, total amount
Knowledge Discovery - Example
- New knowledge can be discovered from the client data
  - data selection
    - data about specific items or categories of items
    - items from stores in specific regions
  - data cleansing
    - correct incorrect zip codes
    - eliminate records with incorrect phone numbers
  - enrichment: add additional information
    - age, income, credit rating of client
  - data transformation: reduce the amount of data
    - group items into product categories
    - group zip codes into regions
Data Mining - Knowledge Discovery
- Data mining might discover
  - co-occurrences: items that are typically bought together
  - association rules: when a customer buys video equipment, he/she also buys another electronic gadget
  - sequential patterns: when a customer buys a camera, then within 3 months he/she buys photographic supplies
  - classification trees: customers can be classified by frequency of visits, types of finance used, etc., combined with statistics about the classes
- This information can then be used, for example, to
  - optimise store locations
  - run promotions
  - plan seasonal marketing strategies
Goals of Data Mining
- Prediction
  - show how certain attributes within the data will behave in the future
  - example: predict what customers will buy under certain discounts
  - example: predict sales volume for some period
- Identification
  - data patterns can be used to identify the existence of an item, an event, or an activity
  - example: detecting intruders by the commands they execute
Goals of Data Mining
- Classification
  - partition data such that different classes or categories can be identified
  - example: customers can be categorised into regular and infrequent shoppers, discount-seeking customers, etc.
  - categorisation (e.g. into food categories) can reduce the complexity of data mining
- Optimisation
  - optimise the use of limited resources (time, space, money, etc.)
  - example: what are the best products to spend our money on over the next three months?
Types of Knowledge Discovered
- Co-occurrences
  - collection of items/actions/events that occur together
  - example: items that are bought together by a consumer in a shop
- Association rules
  - correlation of a set of items with another range of values for another set of variables
  - example: when someone buys bread, he/she is likely to buy cheese
- Classification hierarchies
  - create a hierarchy of classes from an existing set of events or transactions
  - example: customers might be divided into a creditworthiness hierarchy based on their previous credit transactions
Types of Knowledge Discovered
- Sequential patterns
  - search for a sequence of events or actions
  - example: a patient who underwent cardiac surgery and later developed high blood urea is likely to suffer from kidney problems
- Patterns within time series
  - detection of similarities within positions of the time series
  - example: a pattern in a time series of stock market prices may be used to predict employment rates
- Categorisation and segmentation
  - partition a set of events or items into segments/categories/classes
  - example: treatment data on a disease can be partitioned into groups based on the side effects that are caused
Counting Co-occurrences
- The problem is to count co-occurring itemsets, motivated by market basket analysis.
- A database of consumer transactions forms the basis
  - transaction: a single visit to a store, an order at a virtual store (Web site), or a single order through a mail-order catalog
  - a transaction consists of a transaction ID, customer ID, date, item and quantity
- The goal is to identify items that are typically purchased together.
- This can be used to improve the layout of shops or catalogs.
Frequent Itemsets (1)
- Consider the following transaction table

  Transaction  Customer  Date   Items bought
  101          12        11/09  milk, bread, juice
  792          13        12/09  milk, juice
  1130         14        14/09  milk, eggs
  1735         13        14/09  bread, coffee, biscuits

- Items bought in one visit are already grouped together into itemsets.
- Support of an itemset: the fraction of transactions that contain all items in the itemset
- Examples
  - {milk, juice} has a support of 50%
  - {bread, coffee} has a support of 25%
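The support figures can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the names `transactions` and `support` are illustrative assumptions, and the data is the four-row transaction table from this slide.

```python
# One set of items per transaction, taken from the transaction table.
transactions = [
    {"milk", "bread", "juice"},       # transaction 101
    {"milk", "juice"},                # transaction 792
    {"milk", "eggs"},                 # transaction 1130
    {"bread", "coffee", "biscuits"},  # transaction 1735
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"milk", "juice"}, transactions))    # 0.5
print(support({"bread", "coffee"}, transactions))  # 0.25
```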
Frequent Itemsets (2)
- Large itemsets are itemsets that have a certain minimum support, i.e. itemsets that occur frequently.
- Example
  - for a minimum support of 40%, the large itemsets are {milk}, {juice}, {milk, juice}, {bread}
- Proposition
  - every subset of a large itemset is also a large itemset
- Algorithm
  - large itemsets can be computed incrementally
  - start with itemsets of cardinality 1 that have the required support
5.3 Association Rules
- A database can be regarded as a collection of transactions.
- Each transaction involves a set of items.
- Example: the items in a basket that a shopper uses in a supermarket

  Transaction  Time  Items bought
  101          6:35  milk, bread, juice
  792          7:38  milk, juice
  1130         8:05  milk, eggs
  1735         8:40  bread, coffee, biscuits
Association Rules
- An association rule is of the form X => Y, where X and Y are two disjoint sets of items
- Example
  - for sets of goods as itemsets X and Y, the expression X => Y means that if a customer buys X, he/she is also likely to buy Y.
  - if the customer buys milk, he/she is also likely to buy juice.
- The support for a rule X => Y is the percentage of transactions that hold all of the items in the union X ∪ Y.
- Examples
  - Milk => Juice has 50% support
  - Bread => Juice has 25% support
Association Rules
- The confidence of a rule X => Y is the percentage (fraction) of all transactions including X that also include Y.
- Example
  - the rule Milk => Juice has confidence 66.7%
  - that means that 2/3 of all transactions with milk also include juice
- Note that support and confidence might be different.
- The goal is to discover rules with a certain minimum support and confidence.
- These rules can be used for prediction; for a rule Pen => Ink
  - offer discounts on pens and you might increase ink sales.
Association Rules
- How to compute these rules?
  - Generate large itemsets (itemsets with a certain minimum support)
  - For each large itemset X, generate all rules with a certain minimum confidence (mconf)
    - for X and Y ⊂ X, let Z = X - Y (divide X into Y and Z)
    - if support(X) / support(Y) >= mconf then Y => Z is a valid rule
    - the confidence of rule Y => Z is defined as support(X) / support(Y)
- Example: for X = {milk, juice} and Y = {milk} ⊂ {milk, juice}
  - let Z = {juice}
  - X, Y, Z have support 50%, 75% and 50%, respectively (see Frequent Itemsets (1) for the support of itemsets)
  - for mconf = 40%, {milk} => {juice} is a valid rule with confidence 66.7% (= 50/75)
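The rule-generation step can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the helper names `support` and `rules_from_itemset` are invented for the example, the data is the transaction table used throughout this section, and X is assumed to be a large itemset, so every subset has non-zero support.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def support(itemset):
    """Fraction of transactions containing all items of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def rules_from_itemset(x, mconf):
    """All rules Y => Z with Y a non-empty proper subset of X, Z = X - Y,
    and confidence support(X) / support(Y) of at least mconf."""
    x = frozenset(x)
    rules = []
    for size in range(1, len(x)):
        for y in combinations(sorted(x), size):
            y = frozenset(y)
            confidence = support(x) / support(y)
            if confidence >= mconf:
                rules.append((set(y), set(x - y), confidence))
    return rules

for y, z, confidence in rules_from_itemset({"milk", "juice"}, mconf=0.4):
    print(sorted(y), "=>", sorted(z), round(confidence, 3))
# ['juice'] => ['milk'] 1.0
# ['milk'] => ['juice'] 0.667
```

Note that the same large itemset yields two rules with different confidence, because the denominator support(Y) differs.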
Generating Association Rules
- In principle, generating rules based on large itemsets and their support is straightforward.
- Computing all large itemsets and their support creates an efficiency problem if the number of items is very high.
- If m is the number of items, then 2^m is the number of different itemsets.
- Example: a typical supermarket might have several thousand items.
- Computing the support of all itemsets might take a long time.
- Reducing the combinatorial search space is therefore important; the following properties can be used
  - subsets of large itemsets are large
  - extensions of small itemsets are small
Association Rules - Algorithms
- Outline of an algorithm that finds large itemsets
- Step 1
  - test the support for itemsets of length 1 (called 1-itemsets) by scanning the database
  - discard those that do not meet the minimum support requirement.
- Step 2
  - extend large 1-itemsets into 2-itemsets by appending one item each time (this generates all candidate itemsets of length two)
  - test the support and eliminate all 2-itemsets that do not meet the minimum support.
- Step 3
  - repeat the above steps: extend (k-1)-itemsets into k-itemsets.
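The three steps can be sketched as a level-wise search in Python. This is a minimal sketch rather than a tuned implementation: the function name `large_itemsets` is an assumption, and extending only with items that are themselves large relies on the property that subsets of large itemsets are large.

```python
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "coffee", "biscuits"},
]

def large_itemsets(transactions, min_support):
    """Return the set of all itemsets whose support is at least min_support."""
    n = len(transactions)

    def is_large(itemset):
        return sum(1 for t in transactions if itemset <= t) / n >= min_support

    # Step 1: scan for large 1-itemsets.
    all_items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in all_items if is_large(frozenset([i]))}
    large_items = {i for s in current for i in s}

    result = set()
    while current:
        result |= current
        # Steps 2 and 3: extend each large (k-1)-itemset by one large item,
        # then keep only the candidates that meet the minimum support.
        candidates = {s | {i} for s in current for i in large_items if i not in s}
        current = {c for c in candidates if is_large(c)}
    return result

for itemset in sorted(large_itemsets(transactions, 0.4), key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))
```

With a minimum support of 40% this yields the itemsets {bread}, {juice}, {milk} and {juice, milk}; the loop stops at length three because no 3-itemset reaches the threshold.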
Association Rules among Hierarchies
- Items might be divided among disjoint hierarchies based on some classification, e.g. Beverage can be divided into Juice and Milk.
- Associations might occur among the hierarchies of items.
- Example: healthy frozen yoghurt => bottled water
- Particularly interesting are associations across hierarchies.
  - this kind of information can be used to arrange different kinds of items in a supermarket
Negative Associations
- Negative associations are more difficult to detect than positive associations.
- Example: 60% of customers who buy crisps do not buy bottled water.
- There are usually more negative associations than positive ones.
- The majority of itemset combinations do not occur in databases.
- Finding interesting negative associations can be difficult.
Association Rules - Additional Considerations
- Sampling
  - For very large databases, sampling improves efficiency.
  - Truly representative samples can help to find most of the rules.
  - The danger is that
    - false positives might be discovered (large itemsets that are not truly large)
    - true positives might be missing.
- Other problems
  - Cardinality of itemsets and volume of transactions can be very high.
  - Variability of transactions (geographical, seasonal) makes sampling difficult.
  - Multiple classifications along different dimensions.
5.4 Sequential Patterns
- Sequential patterns are based on sequences of itemsets.
- Assume transactions to be ordered by time.
- Example
  - transactions in a supermarket
  - {milk, bread, juice}, {bread, eggs}, {milk, coffee, biscuits} may be based on three visits of a customer
- A subsequence of a sequence is obtained by deleting one or more itemsets.
- Example
  - let {milk, bread, juice}, {bread, eggs}, {milk, coffee, biscuits} be the original sequence
  - {milk, bread, juice}, {bread, eggs} is a subsequence
  - {milk, bread, juice}, {milk, coffee, biscuits} is a subsequence
Support for Sequences
- A sequence a1, ..., am is contained in another sequence S if
  - S has a subsequence b1, ..., bm such that ai ⊆ bi for 1 ≤ i ≤ m
- Example
  - {milk, bread}, {coffee, biscuits} is contained in {milk, bread, juice}, {bread, eggs}, {milk, coffee, biscuits}
- The support of a sequence S is the percentage of a set of given sequences that contain S.
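Containment and support for sequences can be sketched in Python as below. The function names are illustrative assumptions. The first customer sequence is the one from the slides; the second is a hypothetical sequence added only so that the support is not trivially 100%.

```python
def is_contained(a, s):
    """True if sequence `a` is contained in sequence `s`: the itemsets of `a`
    can be matched, in order, to supersets among the itemsets of `s`."""
    pos = 0
    for itemset in a:
        # Greedily take the earliest remaining itemset of s that covers this one.
        while pos < len(s) and not set(itemset) <= set(s[pos]):
            pos += 1
        if pos == len(s):
            return False
        pos += 1
    return True

def sequence_support(a, sequences):
    """Fraction of the given sequences that contain `a`."""
    return sum(1 for s in sequences if is_contained(a, s)) / len(sequences)

customer_1 = [{"milk", "bread", "juice"}, {"bread", "eggs"}, {"milk", "coffee", "biscuits"}]
customer_2 = [{"milk"}, {"eggs"}]  # hypothetical second customer
pattern = [{"milk", "bread"}, {"coffee", "biscuits"}]

print(is_contained(pattern, customer_1))                    # True
print(sequence_support(pattern, [customer_1, customer_2]))  # 0.5
```

The order of the itemsets matters: the reversed pattern is not contained in the first customer's sequence.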
Discovery of Patterns in Time Series
- Time series are sequences of events.
- An event might be a fixed type of transaction.
- Example
  - closing price of a stock or fund each day.
- Analysis of time series
  - find a period of time in which the stock did not fluctuate by more than 1%
  - find the period (week/month/quarter) with the greatest loss
  - identify stocks with similar behaviour
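As a small illustration of one such analysis, the following Python sketch finds the fixed-length period with the greatest loss. The function name `greatest_loss_window` and the price list are hypothetical, and loss is assumed to mean the price at the start of the period minus the price at its end.

```python
def greatest_loss_window(prices, window):
    """Return (start_index, loss) for the `window`-day period with the largest
    drop from its first to its last closing price.
    Assumes len(prices) >= window >= 1."""
    starts = range(len(prices) - window + 1)
    start = max(starts, key=lambda i: prices[i] - prices[i + window - 1])
    return start, prices[start] - prices[start + window - 1]

# Hypothetical daily closing prices.
closing = [100.0, 102.0, 101.0, 97.0, 95.0, 96.0, 99.0]
print(greatest_loss_window(closing, window=3))  # (2, 6.0)
```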
5.5 Classification and Regression
- Classification Rules
- Regression
- Tree-structured Rules
Discovery of Classification Rules
- Classification means defining/identifying a function that maps an object into one of many possible classes.
- Example: a bank wants to classify loan applicants into loanworthy and not loanworthy
  - a classification rule could define the classification
    - not loanworthy: current monthly debt obligation exceeds 25% of monthly net income
    - loanworthy: otherwise
  - loanworthiness is a dependent, categorical attribute
- In general there is one rule (set) per class
  - (var1 in range1) and ... and (varn in rangen) => object O in class C1
  - var1, ..., varn are the predictor attributes
Support and Confidence
- Again we can define support and confidence for these rules.
- The support for a classification condition C is the percentage of tuples that satisfy C.
- The support for a rule C1 => C2 is the support for the condition C1 ∧ C2. (C1 AND C2 describes the set of objects that satisfy both C1 and C2.)
- Consider those tuples that satisfy condition C1. The confidence for a rule C1 => C2 is the percentage of such tuples that also satisfy condition C2.
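These definitions can be tried out on the loan example. The sketch below is illustrative only: the applicant tuples are invented data, and the names `c1`, `c2` and `support` are assumptions. C1 is the condition "monthly debt exceeds 25% of monthly net income" and C2 is "not loanworthy".

```python
# Hypothetical tuples: (monthly_debt, monthly_net_income, loanworthy)
applicants = [
    (500, 4000, True),
    (1500, 4000, False),
    (900, 3000, False),
    (1200, 4000, True),   # satisfies C1 but was recorded as loanworthy
]

def c1(a):
    """Condition C1: debt obligation exceeds 25% of net income."""
    return a[0] > 0.25 * a[1]

def c2(a):
    """Condition C2: applicant is in the class 'not loanworthy'."""
    return not a[2]

def support(condition, tuples):
    """Fraction of tuples that satisfy `condition`."""
    return sum(1 for t in tuples if condition(t)) / len(tuples)

rule_support = support(lambda a: c1(a) and c2(a), applicants)
confidence = rule_support / support(c1, applicants)
print(rule_support)          # 0.5
print(round(confidence, 3))  # 0.667
```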
Regression
- Regression is similar to classification, except that the dependent variable is numerical (and not categorical).
- Rules (such as classification rules) can be regarded as functions.
- A regression rule is a function that maps variables into a numerical target variable.
- Example: LabTest(patientID, test1, ..., testn)
  - the values in that relation result from a series of lab tests
  - the target variable P is the probability of survival, a numerical variable
  - the regression rule
    - (test1 in range1) and ... and (testn in rangen) => P = x
  - the regression function is P = f(test1, ..., testn)
Regression (2)
- If P appears as a function y = f(x1, ..., xn)
  - and f is linear in the domain variables,
  - then the process of deriving f from a given set of tuples <x1, ..., xn, y> is called linear regression.
- Linear regression is a common statistical technique.
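For the simplest case of a single domain variable (n = 1), linear regression can be sketched in plain Python with the ordinary least-squares formulas. The restriction to one variable, the function name `fit_line` and the sample tuples are assumptions made for the illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b).
    Assumes at least two distinct x values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical tuples <x, y> that lie exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)  # 1.0 2.0
```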
Tree-Structured Rules
- Specific classification and regression rules shall now be examined.
- These are rules that can be represented as trees, called classification trees or decision trees.
- These trees are typically the output of the data mining activity.
- Each path from the root to a leaf node represents one classification rule.
- Example: insurance risk determination for motor insurance

  Age
    <= 25: Car Type
      sports: YES
      family: NO
    > 25: NO
Decision Trees
- A decision tree is a graphical representation of a collection of classification rules.
- Each internal node in the tree is labelled with a predictor or splitting attribute.
- Each outgoing edge of an internal node is labelled with a predicate that involves the splitting attribute.
- Each leaf node is labelled with a value of the dependent attribute.
- A classification rule can be associated with each leaf node
  - constructed as the conjunction of the predicates on the path from the root to that leaf
  - Age <= 25 and Car Type = sports for the YES-leaf
- Decision trees are constructed in two phases
  - growth phase: create tree based on specialised rules from an input database (relation)
  - pruning phase: reduce tree size by generalising rules
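The relationship between a tree and its rules can be sketched in Python using the motor insurance example. The data structure and the names `tree`, `classify` and `rules` are assumptions made for this illustration, and the split "Age <= 25" versus "Age > 25" is one reading of the slide's figure. The sketch only represents and uses a given tree; it does not implement the growth or pruning phases.

```python
# Internal node: (attribute, [(edge label, test function, subtree), ...]).
# Leaf node: a string holding the value of the dependent attribute.
tree = ("Age", [
    ("Age <= 25", lambda r: r["age"] <= 25, ("Car Type", [
        ("Car Type = sports", lambda r: r["car_type"] == "sports", "YES"),
        ("Car Type = family", lambda r: r["car_type"] == "family", "NO"),
    ])),
    ("Age > 25", lambda r: r["age"] > 25, "NO"),
])

def classify(node, record):
    """Follow the edge whose predicate holds until a leaf is reached.
    Raises StopIteration if no edge matches (e.g. an unknown car type)."""
    while not isinstance(node, str):
        _, edges = node
        node = next(sub for _, test, sub in edges if test(record))
    return node

def rules(node, path=()):
    """One (conjunction of predicates, class) pair per root-to-leaf path."""
    if isinstance(node, str):
        return [(" and ".join(path), node)]
    found = []
    for label, _, sub in node[1]:
        found.extend(rules(sub, path + (label,)))
    return found

print(classify(tree, {"age": 22, "car_type": "sports"}))  # YES
for condition, label in rules(tree):
    print(condition, "=>", label)
# Age <= 25 and Car Type = sports => YES
# Age <= 25 and Car Type = family => NO
# Age > 25 => NO
```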
5.6 Other Forms of Data Mining
- Neural Networks
- Genetic Algorithms
- Clustering and Segmentation
Neural Networks
- Techniques from artificial intelligence can be used to generalise regression.
- Neural networks provide an iterative method to carry out this generalised regression.
- Neural networks use a curve-fitting approach to infer a function from a set of samples.
- This process is based on learning: a test sample is the initial input, and the system then incrementally infers functions based on more samples.
- Neural networks can be applied to classification problems.
- Modelling time series with neural networks is difficult.
Genetic Algorithms (1)
- Genetic algorithms (GA) are a class of randomised search procedures for adaptive and robust search over a wide range of search topologies.
- Principle
  - Genetic algorithms extend the idea of characterising human DNA by a four-letter alphabet (A, C, T, G).
- Construction
  - Devise an alphabet that allows the encoding of a solution to the decision problem in terms of strings of that alphabet.
- Usage
  - Study the cutting and combination of strings (compare natural reproduction and evolution).
  - New generations of individuals (solutions) are generated and assessed: survival of the fittest.
Genetic Algorithms (2)
- Generation of solutions: comparison with other techniques.
  - GA search uses a set of solutions during each generation rather than a single solution.
  - The search in the string-space represents a much larger parallel search in the space of encoded solutions.
  - The memory of the search completed is represented solely by the set of solutions available for generation.
  - A GA is a randomised algorithm since search mechanisms use probabilistic operators.
  - While progressing from one generation to the next, a GA finds a near-optimal balance between knowledge acquisition and exploitation by manipulating encoded solutions.
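The ideas on the two slides above can be illustrated with a toy genetic algorithm in Python. Everything specific here is an assumption chosen for the sketch: the two-letter alphabet {0, 1}, the toy fitness function (number of 1s in the string), single-point crossover, the mutation rate, and keeping the best individual unchanged in each generation.

```python
import random

def fitness(individual):
    """Toy objective: the number of 1s in the encoded solution."""
    return sum(individual)

def crossover(a, b, rng):
    """Cut both parent strings at one random point and combine the pieces."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(individual, rate, rng):
    """Flip each letter independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in individual]

def evolve(length=20, pop_size=30, generations=40, rate=0.02, seed=0):
    """Return (best fitness in the initial population, best final individual)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    initial_best = max(fitness(p) for p in population)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # survival of the fittest
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
            for _ in range(pop_size - 1)
        ]
        population = [parents[0]] + children   # keep the best individual unchanged
    return initial_best, max(population, key=fitness)

initial_best, best = evolve()
print(initial_best, fitness(best))
```

Because the best individual is carried over unchanged, the best fitness can never decrease from one generation to the next; the random operators only affect how quickly it improves.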
Clustering and Segmentation
- Clustering is about identification and classification.
- Clustering tries to identify categories (or clusters) to which a data object can be mapped.
- The categories can be disjoint or might overlap; they might be organised into trees.
- A related problem: estimation of multivariate probability density functions.
5.7 Applications of Data Mining
- Decision-making contexts
  - marketing
    - analysis of customer behaviour based on buying patterns
    - determination of marketing strategies (store locations, advertising campaigns, etc.)
    - segmentation of customers, stores, products.
  - finance
    - analysis of creditworthiness of clients
    - performance analysis of financial investments
    - evaluation of financing options
    - fraud detection.
Applications
- Manufacturing
  - optimisation of resources (machines, manpower, material)
  - optimal design of manufacturing process, shop-floor layout, etc.
- Health care
  - analysis of effectiveness of certain treatments
  - optimisation of processes in a hospital
  - analysing side effects of drugs
  - relating patient wellness and doctor qualifications.