Tutorial on Data Mining

- Workshop of the Indian Database Research Community
- Sunita Sarawagi
- School of IT, IIT Bombay

Data mining

- Process of semi-automatically analyzing large databases to find interesting and useful patterns
- Overlaps with machine learning, statistics, artificial intelligence and databases, but is
- more scalable in number of features and instances
- more automated to handle heterogeneous data

Outline

- Applications
- Usage scenarios
- Overview of operations
- Mining research groups
- Relevance in India
- Ten research problems

Applications

- Customer relationship management
- identify those who are likely to leave for a competitor.
- Targeted marketing: identify likely responders to promotions
- Fraud detection: telecommunications, financial transactions
- Manufacturing and production
- Medicine: disease outcome, effectiveness of treatments
- Molecular/Pharmaceutical: identify new drugs
- Scientific data analysis
- Web site/store design and promotion

Usage scenarios

- Data warehouse mining
- assimilate data from operational sources
- mine static data
- Mining log data
- Continuous mining: example in process control
- Stages in mining
- data selection -> pre-processing (cleaning) -> transformation -> mining -> result evaluation -> visualization

Some basic operations

- Predictive
- Regression
- Classification
- Descriptive
- Clustering / similarity matching
- Association rules and variants
- Deviation detection

Classification

- Given old data about customers and payments, predict new applicants' loan eligibility.

[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) are fed to a classifier, which produces decision rules such as Salary > 5 L and Prof. = Exec; the rules label new applicants' data as good/bad.]

Classification methods

- Goal: Predict class Ci = f(x1, x2, .. xn)
- Regression (linear or any other polynomial)
- a*x1 + b*x2 + c = Ci.
- Nearest neighbour
- Decision tree classifier: divide decision space into piecewise constant regions.
- Probabilistic/generative models
- Neural networks: partition by non-linear boundaries

Nearest neighbor

- Define proximity between instances, find neighbors of new instance and assign majority class
- Case based reasoning: when attributes are more complicated than real-valued.

- Cons
- Slow during application.
- No feature selection.
- Notion of proximity vague

- Pros
- Fast training
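
The nearest neighbor procedure can be sketched in a few lines. This is a minimal illustration, assuming numeric attributes, Euclidean distance as the proximity measure and k = 3; the customer records and the names `knn_classify` and `euclidean` are hypothetical. It also shows why training is fast and application is slow: nothing is learned up front, and every query scans all stored instances.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3, distance=euclidean):
    """Return the majority class among the k training instances nearest to query.

    train is a list of (features, label) pairs. There is no training step:
    the full scan of train happens at application time.
    """
    neighbours = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (age, salary in lakhs) records with a good/bad label.
customers = [
    ((25, 2.0), "bad"),
    ((30, 3.0), "bad"),
    ((28, 2.5), "bad"),
    ((45, 8.0), "good"),
    ((50, 9.0), "good"),
    ((40, 7.5), "good"),
]
print(knn_classify(customers, (42, 8.2)))
```

Because age and salary are on different scales here, a real application would normally rescale the attributes first; the choice of proximity measure is exactly the vague point listed under Cons.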

Decision trees

- Tree where internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.

[Figure: example tree with internal node tests Salary < 1 M, Prof = teacher, Age < 30.]

Algorithm for tree building

- Greedy top-down construction.

Gen_Tree (Node, data)
    Make node a leaf? If yes, stop.
    Find best attribute and best split on attribute (selection criteria)
    Partition data on split condition
    For each child j of node: Gen_Tree (node_j, data_j)

Split criteria

- K classes, set S of instances partitioned into r subsets. Subset Si has fraction pij of instances of class j.
- Information entropy
- Gini index

[Figure: impurity plotted over the range 0 to 1 for r = 1, k = 2, showing the Gini curve; 1/4 is marked on the plot.]
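
The formulas for the two split criteria did not survive on this slide, so the sketch below uses the standard textbook definitions: entropy as -sum_j p_j log2 p_j and Gini as 1 - sum_j p_j^2, each averaged over the r subsets with weights proportional to subset size. Whether the slide used exactly these forms is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter

def class_fractions(labels):
    """Fractions p_ij of each class j present within one subset S_i."""
    counts = Counter(labels)
    total = len(labels)
    return [c / total for c in counts.values()]

def entropy(labels):
    """Information entropy of one subset: -sum_j p_j * log2(p_j)."""
    return -sum(p * math.log2(p) for p in class_fractions(labels))

def gini(labels):
    """Gini index of one subset: 1 - sum_j p_j^2."""
    return 1.0 - sum(p * p for p in class_fractions(labels))

def split_impurity(subsets, measure):
    """Size-weighted average impurity of the r subsets produced by a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * measure(s) for s in subsets if s)

# A split that separates the classes perfectly has zero impurity;
# one that leaves both subsets evenly mixed does not.
pure = [["good", "good"], ["bad", "bad"]]
mixed = [["good", "bad"], ["good", "bad"]]
print(split_impurity(pure, gini), split_impurity(mixed, gini))
```

The tree builder would evaluate `split_impurity` for every candidate split and keep the one with the lowest value.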

Scalable algorithm

- Input: table of records (rid, A1, A2, A3, C)
- Vertically partition data and sort on <attribute value, class>
- Finding best split
- Scan and maintain class counts in memory and find gini incrementally.
- Performing split
- Use split attribute to build rid to L/R hash in memory.
- Divide other attributes using above hash table.

[Figure: the input table split into sorted attribute lists (A1, C, rid), (A2, C, rid), (A3, C, rid).]
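
The scan over one sorted attribute list can be sketched as below. This is an illustration under stated assumptions, not the slide's exact algorithm: attributes are numeric, Gini is taken as 1 - sum_j p_j^2, candidate thresholds are midpoints between adjacent distinct values, and the names `best_split` and `partition_rids` are hypothetical. Class counts for the two sides are updated one record at a time, so each attribute list is read once.

```python
from collections import Counter

def gini_from_counts(counts, total):
    """Gini index 1 - sum_j p_j^2 computed from class counts."""
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def best_split(attribute_list):
    """Scan a list of (value, class, rid) tuples sorted on (value, class).

    Returns (threshold, weighted_gini) for the best binary split, keeping
    only the left/right class counts in memory.
    """
    n = len(attribute_list)
    right = Counter(label for _, label, _ in attribute_list)
    left = Counter()
    best_threshold, best_gini = None, float("inf")
    for i in range(n - 1):
        value, label, _ = attribute_list[i]
        left[label] += 1          # record i moves from the right side
        right[label] -= 1         # to the left side
        next_value = attribute_list[i + 1][0]
        if value == next_value:
            continue              # cannot split between equal values
        n_left = i + 1
        weighted = (n_left * gini_from_counts(left, n_left)
                    + (n - n_left) * gini_from_counts(right, n - n_left)) / n
        if weighted < best_gini:
            best_threshold, best_gini = (value + next_value) / 2, weighted
    return best_threshold, best_gini

def partition_rids(attribute_list, threshold):
    """Build the rid -> 'L'/'R' hash used to divide the other attribute lists."""
    return {rid: ("L" if value <= threshold else "R")
            for value, _, rid in attribute_list}

# Hypothetical records: (rid, A1, C)
records = [(1, 30, "bad"), (2, 55, "good"), (3, 25, "bad"), (4, 60, "good")]
a1_list = sorted((a1, c, rid) for rid, a1, c in records)
threshold, score = best_split(a1_list)
side = partition_rids(a1_list, threshold)
print(threshold, score, side)
```

The `side` dictionary plays the role of the in-memory hash: the lists for A2 and A3 are divided by looking up each rid, which keeps every child list sorted without re-sorting.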

Issues

- Preventing overfitting
- Occam's razor
- prefer the simplest hypothesis that fits the data
- Tree pruning methods
- Cross validation with separate test data
- Minimum description length (MDL) criteria
- Multi attribute tests on nodes to handle correlated attributes
- Linear multivariate
- Non-linear multivariate e.g. a neural net at each node.
- Methods of handling missing values

Pros and Cons of decision trees

- Cons
- Cannot handle complicated relationship between features
- simple decision boundaries
- problems with lots of missing data

- Pros
- Reasonable training time
- Fast application
- Easy to interpret
- Easy to implement
- Can handle large number of features

More information: http://www.stat.wisc.edu/~limt/treeprogs.html

Neural networks

- Useful for learning complex data like handwriting, speech and image recognition

[Figure: decision boundaries of a neural network compared with those of a classification tree.]

Neural network

- Set of nodes connected by directed weighted edges

[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3, next to a more typical NN with inputs x1, x2, x3, a layer of hidden nodes and a layer of output nodes.]
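
A basic unit and a small layered network can be sketched as follows. The slide does not say which activation function is used, so the logistic sigmoid here is an assumption, as are the bias term, the weights and the function names; the weights are picked only to show how values flow from inputs through hidden nodes to the output.

```python
import math

def sigmoid(z):
    """Logistic squashing function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def unit_output(inputs, weights, bias=0.0):
    """Output of one unit: squashed weighted sum of its inputs."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def forward(inputs, hidden_layer, output_layer):
    """Feed inputs through one hidden layer and one output layer.

    Each layer is a list of (weights, bias) pairs, one pair per node.
    """
    hidden = [unit_output(inputs, w, b) for w, b in hidden_layer]
    return [unit_output(hidden, w, b) for w, b in output_layer]

# Hypothetical weights, chosen only to illustrate the data flow.
x = [1.0, 0.0, 1.0]                        # x1, x2, x3
hidden_layer = [([0.5, -0.2, 0.1], 0.0),   # hidden node 1
                ([-0.3, 0.8, 0.4], 0.1)]   # hidden node 2
output_layer = [([1.0, -1.0], 0.0)]        # single output node
y = forward(x, hidden_layer, output_layer)
print(y)
```

Training (choosing the weights) is the slow part and is not shown; applying a trained network is just this forward pass.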

Pros and Cons of Neural Network

- Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing number of nodes

- Pros
- Can learn more complicated class boundaries
- Fast application
- Can handle large number of features

Conclusion: Use neural nets only if decision trees/nearest neighbour fail.

Bayesian learning

- Assume a probability model on generation of data.

- Apply Bayes' theorem to find most likely class as c = argmax_c P(c | x1, .., xn) = argmax_c P(x1, .., xn | c) P(c)
- Naïve Bayes: Assume attributes conditionally independent given class value
- Easy to learn probabilities by counting,
- Useful in some domains e.g. text
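
Learning by counting can be sketched as below for categorical attributes. This is a minimal illustration: the add-one (Laplace) smoothing is an assumption added so that unseen attribute values do not produce zero probabilities, and the records and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """Learn class counts and per-class attribute-value counts by counting.

    examples is a list of (attribute_tuple, label) pairs.
    """
    class_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, position) -> value counts
    values_seen = defaultdict(set)        # position -> distinct values
    for attrs, label in examples:
        class_counts[label] += 1
        for pos, value in enumerate(attrs):
            value_counts[(label, pos)][value] += 1
            values_seen[pos].add(value)
    return class_counts, value_counts, values_seen

def classify(model, attrs):
    """Return the class maximising log P(c) + sum_i log P(x_i | c)."""
    class_counts, value_counts, values_seen = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in class_counts.items():
        score = math.log(n_label / total)
        for pos, value in enumerate(attrs):
            # Add-one smoothing (an assumption of this sketch).
            numerator = value_counts[(label, pos)][value] + 1
            denominator = n_label + len(values_seen[pos])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical (profession, location) records.
data = [
    (("exec", "city"), "good"),
    (("exec", "town"), "good"),
    (("teacher", "city"), "good"),
    (("teacher", "town"), "bad"),
    (("clerk", "town"), "bad"),
]
model = train_naive_bayes(data)
print(classify(model, ("exec", "city")), classify(model, ("clerk", "town")))
```

Working with sums of logarithms rather than products of probabilities avoids numeric underflow when there are many attributes, as in text.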

Bayesian belief network

- Find joint probability over set of variables making use of conditional independence whenever known
- Learning parameters hard when hidden units: use gradient descent / EM algorithms
- Learning structure of network harder

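[Figure: belief network over the variables a, b, C, d and e, with a conditional probability table for b over the four combinations of a and d (rows 0.1 0.2 0.3 0.4 and 0.3 0.2 0.1 0.5; the negation marks on the column and row labels were lost).]

Variable e independent of d given b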

Clustering

- Unsupervised learning when old data with class labels not available e.g. when introducing a new product to a customer base
- Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.
- Identify micro-markets and develop policies for each
- Key requirement: Need a good measure of similarity between instances

Distance functions

- Numeric data: euclidean, manhattan distances
- Categorical data: 0/1 to indicate presence/absence followed by
- Hamming distance (number of dissimilar positions)
- Jaccard coefficients: number of shared 1s / (number of positions with a 1)
- data dependent measures: similarity of A and B depends on co-occurrence with C.
- Combined numeric and categorical data
- weighted normalized distance
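
The four basic measures can be written down directly. This is a small sketch using their usual definitions; for Jaccard it takes shared 1s divided by positions where either vector has a 1, and returning 1.0 for two all-zero vectors is a convention assumed here.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming(a, b):
    """Number of positions where two 0/1 vectors disagree."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Shared 1s divided by positions where either vector has a 1."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0

# Numeric example.
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))

# Categorical example: presence/absence of five items.
u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 1, 1]
print(hamming(u, v), jaccard(u, v))
```

Hamming counts shared 0s as agreement while Jaccard ignores them, which matters when most entries are 0, as in market baskets.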

Distance functions on high dimensional data

- Example: Time series, Text, Images
- Euclidean measures make all points equally far
- Reduce number of dimensions
- choose subset of original features using random projections, feature selection techniques
- transform original features using statistical methods like Principal Component Analysis
- Define domain specific similarity measures e.g. for images define features like number of objects, color histogram; for time series define shape based measures.
- Define non-distance based (model-based) clustering methods

Clustering methods

- Hierarchical clustering
- agglomerative vs divisive
- single link vs complete link
- Partitional clustering
- distance-based: K-means
- model-based: EM
- density-based

Partitional methods: K-means

- Criteria: minimize sum of square of distance
- Between each point and centroid of the cluster.
- Between each pair of points in the cluster
- Algorithm
- Select initial partition with K clusters: random, first K, K separated points
- Repeat until stabilization
- Assign each point to closest cluster center
- Generate new cluster centers
- Adjust clusters by merging/splitting
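
The loop above can be sketched as follows. This is a minimal version under stated assumptions: it seeds the centers with the first K points (one of the options listed), uses squared Euclidean distance, and leaves out the merge/split adjustment step; the function names and the sample points are hypothetical.

```python
def closest(point, centers):
    """Index of the center nearest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, k, max_iter=100):
    """Plain K-means seeded with the first k points.

    Returns (centers, assignment), where assignment[i] is the cluster
    index of points[i].
    """
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [closest(p, centers) for p in points]
        if new_assignment == assignment:
            break                      # stabilised: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                # keep the old center if a cluster is empty
                centers[j] = centroid(members)
    return centers, assignment

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centers, assignment = kmeans(points, k=2)
print(centers, assignment)
```

Each pass costs one distance computation per point, per center, per dimension, and the number of passes depends on the data and the seeding.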

Properties

- May not reach global optima
- Converges fast in practice: guaranteed for certain forms of optimization function
- Complexity: O(KndI)
- I number of iterations, n number of points, d number of dimensions, K number of clusters.
- Database research on scalable algorithms
- Birch: one/two pass of data by keeping R-tree like index in memory [Sigmod 96]

Model based clustering

- Assume data generated from K probability distributions. Need to find distribution parameters.
- EM algorithm: K Gaussian mixtures
- Iterate between two steps
- Expectation step: assign points to clusters
- Maximization step: estimate model parameters
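
The two steps can be sketched for the simplest case, a mixture of K one-dimensional Gaussians. The starting variances of 1, the uniform starting weights, the fixed iteration count, the variance floor and the function names are all assumptions of this illustration; the Expectation step here assigns points to clusters softly, as probabilities.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """EM for a mixture of K one-dimensional Gaussians.

    means gives the K starting means; variances start at 1 and mixing
    weights start uniform (both are assumptions of this sketch).
    """
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # Expectation step: soft-assign each point to every cluster.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens) or 1e-300   # guard against underflow to zero
            resp.append([d / total for d in dens])
        # Maximization step: re-estimate parameters from the soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                min_var)
            weights[j] = nj / len(data)
    return means, variances, weights

data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights = em_gmm_1d(data, means=[0.0, 6.0])
print(means, variances, weights)
```

With hard assignments and equal fixed variances this reduces to K-means, which is one way to see the relation between the distance-based and model-based methods.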

Association rules

- Given set T of groups of items
- Example: set of baskets of items purchased
- Goal: find all rules on itemsets of the form a --> b such that
- support of a and b > user threshold s
- conditional probability (confidence) of b given a > user threshold c
- Example: Milk --> bread
- Lot of work done on scalable algorithms

[Figure: example set T of baskets: {Milk, cereal}, {Tea, milk}, {Tea, rice, bread}, {cereal}.]
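
Support and confidence can be computed directly on the four example baskets. This is a brute-force sketch for illustration only; it is not one of the scalable algorithms the slide refers to, and the function names and thresholds are hypothetical.

```python
from itertools import permutations

def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, a, b):
    """Estimated conditional probability of b given a.

    Assumes a occurs in at least one basket.
    """
    return support(baskets, set(a) | set(b)) / support(baskets, a)

def pair_rules(baskets, s, c):
    """All single-item rules a --> b whose support and confidence exceed s and c."""
    items = sorted(set().union(*baskets))
    return [(a, b) for a, b in permutations(items, 2)
            if support(baskets, {a, b}) > s and confidence(baskets, {a}, {b}) > c]

# The four baskets from the slide (item names lower-cased).
T = [
    {"milk", "cereal"},
    {"tea", "milk"},
    {"tea", "rice", "bread"},
    {"cereal"},
]
print(support(T, {"milk", "cereal"}), confidence(T, {"milk"}, {"cereal"}))
print(pair_rules(T, s=0.2, c=0.9))
```

Enumerating all pairs is fine for five items but grows quickly with the number of items, which is why itemset counting needs the scalable algorithms mentioned above.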

Variants

- High confidence may not imply high correlation
- Use correlations. Find expected support and treat large departures from it as interesting.
- Brin et al. Limited attempt.
- More complete work in statistical literature on contingency tables.
- Still too many rules, need to prune...
- Does not imply causality as in Bayesian networks

Prevalent ≠ Interesting

- Analysts already know about prevalent rules
- Interesting rules are those that deviate from prior expectation
- Mining's payoff is in finding surprising phenomena

[Figure: timeline starting in 1995 on which the same finding, "Milk and cereal sell together!", is reported again and again.]

What makes a rule surprising?

- Does not match prior expectation
- Correlation between milk and cereal remains roughly constant over time

- Cannot be trivially derived from simpler rules
- Milk 10%, cereal 10%
- Milk and cereal 10% ... surprising
- Eggs 10%
- Milk, cereal and eggs 0.1% ... surprising!
- Expected 1%
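
The arithmetic behind the example can be checked in a few lines, reading the slide's numbers as percentages (an assumption, since the percent signs are missing in the text). Under independence the expected support of a combination is the product of the supports of its parts; the function names are hypothetical.

```python
def expected_support(*supports):
    """Support expected for a combined itemset if its parts were independent."""
    result = 1.0
    for s in supports:
        result *= s
    return result

def surprise_ratio(observed, *part_supports):
    """Observed support divided by the support expected under independence."""
    return observed / expected_support(*part_supports)

milk, cereal, eggs = 0.10, 0.10, 0.10
milk_and_cereal = 0.10      # observed support of {milk, cereal}
milk_cereal_eggs = 0.001    # observed support of {milk, cereal, eggs}

# {milk, cereal}: expected 1%, observed 10%, about ten times more than expected.
ratio_pair = surprise_ratio(milk_and_cereal, milk, cereal)

# {milk, cereal, eggs}: expected 10% * 10% = 1%, observed 0.1%,
# about ten times less than expected.
ratio_triple = surprise_ratio(milk_cereal_eggs, milk_and_cereal, eggs)

print(ratio_pair, ratio_triple)
```

A ratio far from 1 in either direction marks the itemset as surprising; a ratio near 1 means the rule follows from the simpler ones.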

Applications of fast itemset counting

- Find correlated events
- Applications in medicine: find redundant tests
- Cross selling in retail, banking
- Improve predictive capability of classifiers that assume attribute independence
- New similarity measures of categorical attributes [Mannila et al, KDD 98]

Mining market

- Around 20 to 30 mining tool vendors (1/5th the size of OLAP market).
- Major players
- Clementine,
- IBM's Intelligent Miner,
- SGI's MineSet,
- SAS's Enterprise Miner.
- All pretty much the same set of tools
- Many embedded products: fraud detection, electronic commerce applications

Integrating mining with DBMS

- Need to
- intermix operations
- iterate through results
- flexibly query and filter results and data
- Existing file-based, batched approach not satisfactory.
- Research challenge: Identify a collection of primitive, composable operators like in relational DBMS and build a mining engine

OLAP Mining integration

- OLAP (On Line Analytical Processing)
- Multidimensional view of data: factors are dimensions, quantity to be analyzed are measures/cells.
- Facilitates fast interactive exploration of multidimensional aggregates.
- OLAP products provide a minimal set of tools for analysis
- Heavy reliance on manual operations for analysis
- tedious and error-prone on large multidimensional data
- Ideal platform for vertical integration of mining but needs to be interactive instead of batch.

State of art in mining OLAP integration

- Decision trees [Information discovery, Cognos]
- find factors influencing high profits
- Clustering [Pilot software]
- segment customers to define hierarchy on that dimension
- Time series analysis [Seagate's Holos]
- Query for various shapes along time: spikes, outliers etc
- Multi-level Associations [Han et al.]
- find association between members of dimensions

New approach

- Identify complex operations with specific OLAP needs in mind (what does an analyst need?) rather than looking at mining operations and choosing what fits
- Two examples
- Exceptions in data to guide exploration
- One reason for manual exploration is to make sure that there are no surprises.
- Pre-mines abnormalities in data and points them out to analysts using highlights at aggregate levels
- Reasons for specific why questions at aggregate level
- most compactly represent the answer that user can quickly assimilate

Vertical integration: Mining on the web

- Web log analysis for site design
- what are popular pages,
- what links are hard to find.
- Electronic stores sales enhancements
- recommendations, advertisement
- Collaborative filtering: Net perception, Wisewire
- Inventory control: what was a shopper looking for and could not find..

Research problems

- Automatic model selection: different ways of solving same problem, which one to use?
- Automatic classification of complex data types especially time series data.
- Refreshing mined results: explaining and modeling changes along time
- Quality of mined results: guarding against wrong conclusions, chance discoveries
- Incorporating domain knowledge to filter results and improve result quality

Research problems

- Close integration with data sources to be mined
- Distributed mining across multiple relations at a single site or spread across multiple sites.
- Integration with other data analysis tools: example statistical tools, OLAP and SQL querying
- Interactive data mining: toolkit of micro operators
- Mixed media mining: link textual reports with images and numeric fields

Relevance in India

- Emerging application areas especially in the banking, retail industry and manufacturing processes
- Mining large scientific databases: export laws might require indigenous technology
- Rich research area with interesting algorithm components -- just need to implement.
- Too expensive to purchase US/Europe products
- Need to build usable prototypes not simply tweak algorithms for publications.

Summary

- What is data mining and an overview of the various operations
- Classification: regression, nearest neighbour, neural network, bayesian
- Clustering: distance based (k-means), distribution based (EM)
- Itemset counting
- Several operations: challenge is choosing the right operation for the problem
- New directions and identification of research problems