Tutorial on Data Mining - PowerPoint PPT Presentation

Slides: 44
Provided by: Sun43

1
Tutorial on Data Mining
  • Workshop of the Indian Database Research
    Community
  • Sunita Sarawagi
  • School of IT, IIT Bombay

2
Data mining
  • Process of semi-automatically analyzing large
    databases to find interesting and useful patterns
  • Overlaps with machine learning, statistics,
    artificial intelligence and databases, but is
      • more scalable in number of features and instances
      • more automated to handle heterogeneous data

3
Outline
  • Applications
  • Usage scenarios
  • Overview of operations
  • Mining research groups
  • Relevance in India
  • Ten research problems

4
Applications
  • Customer relationship management: identify those
    who are likely to leave for a competitor.
  • Targeted marketing: identify likely responders to
    promotions
  • Fraud detection: telecommunications, financial
    transactions
  • Manufacturing and production
  • Medicine: disease outcome, effectiveness of
    treatments
  • Molecular/Pharmaceutical: identify new drugs
  • Scientific data analysis
  • Web site/store design and promotion

5
Usage scenarios
  • Data warehouse mining
      • assimilate data from operational sources
      • mine static data
  • Mining log data
  • Continuous mining: for example, in process control
  • Stages in mining
      • data selection → pre-processing: cleaning →
        transformation → mining → result evaluation →
        visualization

6
Some basic operations
  • Predictive
      • Regression
      • Classification
  • Descriptive
      • Clustering / similarity matching
      • Association rules and variants
      • Deviation detection

7
Classification
  • Given old data about customers and payments,
    predict new applicants' loan eligibility.

[Figure: records of previous customers (Age, Salary, Profession, Location, Customer type) train a classifier that produces decision rules (e.g. Salary > 5 L, Prof. = Exec); the rules label new applicants' data as good/bad.]
8
Classification methods
  • Goal: predict class Ci = f(x1, x2, ..., xn)
  • Regression (linear or any other polynomial)
      • a*x1 + b*x2 + c = Ci.
  • Nearest neighbour
  • Decision tree classifier: divide decision space
    into piecewise constant regions.
  • Probabilistic/generative models
  • Neural networks: partition by non-linear
    boundaries

9
Nearest neighbor
  • Define proximity between instances, find
    neighbors of new instance and assign majority
    class
  • Case-based reasoning: used when attributes are more
    complicated than real-valued.
  • Cons
      • Slow during application.
      • No feature selection.
      • Notion of proximity vague
  • Pros
      • Fast training
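The majority-vote rule above can be sketched in a few lines. This is only an illustration: the choice of Euclidean distance, the value of k, and the toy customer records are assumptions, not part of the slide.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs.
    Returns the majority label among the k nearest training instances."""
    # Proximity is taken to be Euclidean distance (an assumption; the slide
    # leaves the notion of proximity open).
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (age, salary in lakhs) -> customer type
customers = [((25, 2.0), "bad"), ((30, 3.0), "bad"), ((45, 8.0), "good"),
             ((50, 9.0), "good"), ((40, 7.5), "good")]
print(knn_predict(customers, (42, 8.0)))
```

There is no training step at all, and every prediction scans the whole training set, which matches the "fast training, slow during application" trade-off listed above.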

10
Decision trees
  • Tree where internal nodes are simple decision
    rules on one or more attributes and leaf nodes
    are predicted class labels.

[Figure: example tree with node tests Salary < 1 M, Prof = teacher and Age < 30.]
11
Algorithm for tree building
  • Greedy top-down construction.

Gen_Tree(node, data)
    Make node a leaf? If yes, stop.
    Selection criteria: find the best attribute and the best split on that attribute.
    Partition data on the split condition.
    For each child j of node: Gen_Tree(node_j, data_j)
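A minimal runnable sketch of the greedy procedure, under stated assumptions: attributes are numeric, splits are binary tests of the form "attribute <= threshold", the selection criterion is weighted Gini impurity, and a node becomes a leaf when no split reduces impurity. The function names and the tuple-based tree representation are hypothetical.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (attribute index, threshold) with the lowest weighted Gini,
    or None if no split improves on the unsplit node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for a in range(len(rows[0])):
        for t in sorted({r[a] for r in rows})[:-1]:   # candidate thresholds
            left = [y for r, y in zip(rows, labels) if r[a] <= t]
            right = [y for r, y in zip(rows, labels) if r[a] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (a, t), score
    return best

def gen_tree(rows, labels):
    split = best_split(rows, labels)
    if split is None:                      # "make node a leaf?" -> yes
        return Counter(labels).most_common(1)[0][0]
    a, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[a] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[a] > t]
    # Recurse on each child with its partition of the data.
    return (a, t,
            gen_tree([r for r, _ in left], [y for _, y in left]),
            gen_tree([r for r, _ in right], [y for _, y in right]))

def classify(tree, row):
    # Internal nodes are (attribute, threshold, left, right); leaves are labels.
    while isinstance(tree, tuple):
        a, t, left, right = tree
        tree = left if row[a] <= t else right
    return tree

# Hypothetical toy data: (age, salary in lakhs) -> customer type
rows = [(25, 2.0), (30, 3.0), (45, 8.0), (50, 9.0), (40, 7.5)]
labels = ["bad", "bad", "good", "good", "good"]
tree = gen_tree(rows, labels)
print(tree, classify(tree, (28, 5.0)))
```

The sketch re-scans the data for every candidate threshold, which is fine for illustration but is exactly the cost that scalable tree builders avoid.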
12
Split criteria
  • K classes, set S of instances partitioned into r
    subsets. Subset Si has fraction pij of instances
    of class j.
  • Information entropy
  • Gini index

[Figure: Gini impurity curve for r = 1, k = 2; axis ticks 0 and 1, peak marked 1/4.]
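The formulas for the two criteria did not survive in this transcript. For reference, the commonly used textbook definitions are given below, writing S_i for the i-th subset and p_ij for the fraction of its instances that belong to class j; whether the slide used exactly these forms is an assumption.

```latex
\mathrm{Entropy}(S_i) = -\sum_{j=1}^{K} p_{ij}\,\log_2 p_{ij},
\qquad
\mathrm{Gini}(S_i) = 1 - \sum_{j=1}^{K} p_{ij}^{2},
\qquad
\mathrm{Impurity}(\text{split}) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\,\mathrm{Impurity}(S_i)
```

With K = 2 and class fraction p, this Gini form equals 2p(1 - p) and peaks at 1/2 when p = 1/2. A peak of 1/4, as the figure appears to mark, would correspond to the unscaled variant p(1 - p); both variants rank candidate splits identically.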
13
Scalable algorithm
  • Input: table of records with columns rid, A1, A2, A3, C
  • Vertically partition data and sort on <attribute
    value, class>
  • Finding best split
      • Scan and maintain class counts in memory and find
        gini incrementally.
  • Performing split
      • Use split attribute to build rid to L/R hash in memory.
      • Divide other attributes using above hash table.

[Figure: the input table split into sorted attribute lists (A1, C, rid), (A2, C, rid) and (A3, C, rid).]
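The two phases above can be sketched as follows. This is an in-memory illustration only, assuming each attribute list holds (value, class, rid) tuples already sorted by value and that splits are binary "value <= threshold" tests; the function names and sample lists are hypothetical.

```python
from collections import Counter

def gini_from_counts(counts, n):
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split_on_attribute(attribute_list):
    """One pass over a pre-sorted list of (value, class, rid) tuples,
    updating class counts on each side of the candidate split point."""
    n = len(attribute_list)
    above = Counter(cls for _, cls, _ in attribute_list)
    below = Counter()
    best_value, best_score = None, float("inf")
    for i, (value, cls, _) in enumerate(attribute_list[:-1], start=1):
        below[cls] += 1
        above[cls] -= 1
        if value == attribute_list[i][0]:
            continue                      # cannot split between equal values
        score = (i * gini_from_counts(below, i)
                 + (n - i) * gini_from_counts(above, n - i)) / n
        if score < best_score:
            best_value, best_score = value, score
    return best_value, best_score

def split_rids(attribute_list, split_value):
    """rid -> 'L'/'R' hash built from the split attribute."""
    return {rid: ("L" if value <= split_value else "R")
            for value, _, rid in attribute_list}

# Hypothetical attribute lists for two attributes of the same five records.
a1 = [(25, "bad", 0), (30, "bad", 1), (40, "good", 4), (45, "good", 2), (50, "good", 3)]
a2 = [(2.0, "bad", 0), (3.0, "bad", 1), (7.5, "good", 4), (8.0, "good", 2), (9.0, "good", 3)]

value, score = best_split_on_attribute(a1)
side = split_rids(a1, value)
# Dividing another attribute list with the hash keeps it sorted, so no re-sort is needed.
a2_left = [entry for entry in a2 if side[entry[2]] == "L"]
a2_right = [entry for entry in a2 if side[entry[2]] == "R"]
print(value, score, a2_left, a2_right)
```

Because the class counts are updated rather than recomputed, finding the best split on one attribute costs a single scan of its list.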
14
Issues
  • Preventing overfitting
      • Occam's razor: prefer the simplest hypothesis
        that fits the data
      • Tree pruning methods
          • Cross validation with separate test data
          • Minimum description length (MDL) criteria
  • Multi-attribute tests on nodes to handle
    correlated attributes
      • Linear multivariate
      • Non-linear multivariate, e.g. a neural net at each
        node.
  • Methods of handling missing values

15
Pros and Cons of decision trees
  • Cons
      • Cannot handle complicated relationship between
        features
      • simple decision boundaries
      • problems with lots of missing data
  • Pros
      • Reasonable training time
      • Fast application
      • Easy to interpret
      • Easy to implement
      • Can handle large number of features

More information: http://www.stat.wisc.edu/limt/treeprogs.html
16
Neural networks
  • Useful for learning complex data like
    handwriting, speech and image recognition

[Figure: decision boundaries of a neural network compared with those of a classification tree.]
17
Neural network
  • Set of nodes connected by directed weighted edges

[Figure: a basic NN unit with inputs x1, x2, x3 and weights w1, w2, w3, next to a more typical NN with inputs x1, x2, x3, hidden nodes and output nodes.]
18
Pros and Cons of Neural Network
  • Cons
      • Slow training time
      • Hard to interpret
      • Hard to implement: trial and error for choosing
        number of nodes
  • Pros
      • Can learn more complicated class boundaries
      • Fast application
      • Can handle large number of features

Conclusion: use neural nets only if decision
trees/nearest neighbour fail.
19
Bayesian learning
  • Assume a probability model on generation of data.
  • Apply Bayes' theorem to find the most likely class as
    c* = argmax_c P(c | x1, ..., xn)
       = argmax_c P(x1, ..., xn | c) P(c) / P(x1, ..., xn)
  • Naïve Bayes: assume attributes are conditionally
    independent given the class value
      • Easy to learn probabilities by counting
      • Useful in some domains, e.g. text
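A small sketch of "learning probabilities by counting" for naive Bayes on categorical attributes. The add-one (Laplace) smoothing, the `n_values` parameter, and the toy word-presence data are assumptions added for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Learning is just counting classes and (attribute, class, value) triples."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    for row, y in zip(rows, labels):
        for a, v in enumerate(row):
            value_counts[(a, y)][v] += 1
    return class_counts, value_counts

def predict(model, row, alpha=1.0, n_values=2):
    """Pick the class maximising P(c) * prod_a P(x_a | c).
    alpha is add-one smoothing and n_values is the assumed number of distinct
    values per attribute; both are choices made for this sketch."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_p = None, -1.0
    for c, n_c in class_counts.items():
        p = n_c / total
        for a, v in enumerate(row):
            p *= (value_counts[(a, c)][v] + alpha) / (n_c + alpha * n_values)
        if p > best_p:
            best_class, best_p = c, p
    return best_class

# Hypothetical text-like data: (contains "offer", contains "meeting") -> label
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
model = train_naive_bayes(rows, labels)
print(predict(model, (1, 0)), predict(model, (0, 1)))
```

The denominator of Bayes' theorem is the same for every class, so the sketch compares only the numerators.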

20
Bayesian belief network
  • Find joint probability over set of variables
    making use of conditional independence whenever
    known
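  • Learning parameters is hard when there are hidden
    units: use gradient descent / EM algorithms
  • Learning the structure of the network is harder

[Figure: belief network over variables a, b, C, d and e, with a table of probabilities for b under the four combinations of a and d (rows: 0.1 0.2 0.3 0.4 and 0.3 0.2 0.1 0.5). Variable e is independent of d given b.]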
21
Clustering
  • Unsupervised learning: used when old data with class
    labels is not available, e.g. when introducing a new
    product to a customer base
  • Group/cluster existing customers based on time
    series of payment history such that similar
    customers are in the same cluster.
  • Identify micro-markets and develop policies for
    each
  • Key requirement: need a good measure of
    similarity between instances

22
Distance functions
  • Numeric data: Euclidean, Manhattan distances
  • Categorical data: 0/1 to indicate
    presence/absence, followed by
      • Hamming distance (number of dissimilarities)
      • Jaccard coefficient: number of shared 1s /
        (number of 1s in either)
      • data-dependent measures: similarity of A and B
        depends on co-occurrence with C.
  • Combined numeric and categorical data
      • weighted normalized distance
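The named measures are short enough to write out. The sketch below assumes equal-length numeric vectors for the first two and 0/1 presence vectors for the last two; returning 1.0 for two all-zero vectors in `jaccard` is a convention chosen here, not something the slide specifies.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    # Number of positions where the 0/1 vectors disagree.
    return sum(a != b for a, b in zip(x, y))

def jaccard(x, y):
    # Shared 1s divided by positions where either vector has a 1.
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 1.0

print(euclidean((0, 0), (3, 4)))              # 5.0
print(manhattan((0, 0), (3, 4)))              # 7
print(hamming((1, 0, 1, 1), (1, 1, 0, 1)))    # 2
print(jaccard((1, 0, 1, 1), (1, 1, 0, 1)))    # 0.5
```

Note that Hamming counts disagreements (a distance) while Jaccard measures overlap (a similarity), so they point in opposite directions.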

23
Distance functions on high dimensional data
  • Examples: time series, text, images
  • Euclidean measures make all points equally far
  • Reduce number of dimensions
      • choose subset of original features using random
        projections, feature selection techniques
      • transform original features using statistical
        methods like Principal Component Analysis
  • Define domain-specific similarity measures, e.g.
    for images define features like number of
    objects, color histogram; for time series define
    shape-based measures.
  • Define non-distance based (model-based)
    clustering methods

24
Clustering methods
  • Hierarchical clustering
      • agglomerative vs. divisive
      • single link vs. complete link
  • Partitional clustering
      • distance-based: K-means
      • model-based: EM
      • density-based

25
Partitional methods: K-means
  • Criterion: minimize sum of squared distances
      • between each point and the centroid of its cluster, or
      • between each pair of points in the cluster
  • Algorithm
      • Select initial partition with K clusters: random,
        first K, K separated points
      • Repeat until stabilization
          • Assign each point to closest cluster center
          • Generate new cluster centers
          • Adjust clusters by merging/splitting
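A bare-bones sketch of the loop above, assuming Euclidean distance and the "first K" initialisation option. The optional merge/split adjustment is left out, and the sample points are made up for illustration.

```python
import math

def kmeans(points, k, max_iter=100):
    centers = [points[i] for i in range(k)]      # "first K" initialisation
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the closest cluster center.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                          for p in points]
        if new_assignment == assignment:         # stabilised
            break
        assignment = new_assignment
        # Generate new cluster centers as the mean of each cluster's members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two loose groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centers, assignment = kmeans(points, 2)
print(centers, assignment)
```

The result depends on the initial centers; a different initialisation can settle in a different local optimum.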

26
Properties
  • May not reach global optima
  • Converges fast in practice; convergence is guaranteed
    for certain forms of optimization function
  • Complexity: O(KndI)
      • I: number of iterations, n: number of points, d:
        number of dimensions, K: number of clusters.
  • Database research on scalable algorithms
      • Birch: one/two passes over the data by keeping an
        R-tree-like index in memory [Sigmod 96]

27
Model based clustering
  • Assume data generated from K probability
    distributions. Need to find distribution
    parameters.
  • EM algorithm: K Gaussian mixtures
  • Iterate between two steps
      • Expectation step: assign points to clusters
      • Maximization step: estimate model parameters
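The two steps can be illustrated for a one-dimensional mixture of Gaussians. This is a sketch under assumptions: initial means are supplied by the caller, variances start at 1, the iteration count is fixed, and a small variance floor is added to avoid degenerate components. The data is made up.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, initial_means, n_iter=50):
    k = len(initial_means)
    mus = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # Expectation step: soft-assign each point to the K components.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, mus, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # Maximization step: re-estimate weight, mean and variance per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)        # variance floor (assumption)
    return weights, mus, variances

xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
weights, mus, variances = em_gmm_1d(xs, initial_means=[0.0, 6.0])
print(weights, mus, variances)
```

Unlike K-means, the assignment here is a probability per cluster rather than a hard choice, which is what makes the method model-based.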

28
Association rules
  • Given set T of groups of items
      • Example: set of baskets of items purchased
  • Goal: find all rules on itemsets of the form
    a → b such that
      • support of a and b > user threshold s
      • conditional probability (confidence) of b given
        a > user threshold c
  • Example: Milk → bread
  • Lot of work done on scalable algorithms

Example T:
  • Milk, cereal
  • Tea, milk
  • Tea, rice, bread
  • cereal
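Support and confidence can be computed directly on the four example baskets. The sketch below is a brute-force illustration restricted to rules with a single item on each side; it is not one of the scalable algorithms the slide refers to, and the thresholds are arbitrary.

```python
from itertools import combinations

baskets = [{"milk", "cereal"}, {"tea", "milk"}, {"tea", "rice", "bread"}, {"cereal"}]

def support(itemset, baskets):
    # Fraction of baskets that contain every item of the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def rules(baskets, min_support, min_confidence):
    """All single-item rules lhs -> rhs whose support and confidence
    exceed the user thresholds s and c."""
    items = sorted(set().union(*baskets))
    found = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, baskets)
        if pair_support <= min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence = P(rhs | lhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, baskets)
            if confidence > min_confidence:
                found.append((lhs, rhs, pair_support, confidence))
    return found

print(support({"milk", "cereal"}, baskets))
for lhs, rhs, s, c in rules(baskets, min_support=0.2, min_confidence=0.6):
    print(f"{lhs} -> {rhs}: support={s}, confidence={c}")
```

On these baskets milk and bread never occur together, so the rule milk → bread has support 0 and would not be reported at any positive threshold.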
29
Variants
  • High confidence may not imply high correlation
  • Use correlations: find the expected support; large
    departures from it are interesting.
      • Brin et al.: limited attempt.
      • More complete work in statistical literature on
        contingency tables.
  • Still too many rules, need to prune...
  • Does not imply causality as in Bayesian networks

30
Prevalent ≠ Interesting
  • Analysts already know about prevalent rules
  • Interesting rules are those that deviate from
    prior expectation
  • Mining's payoff is in finding surprising phenomena

[Figure: "Milk and cereal sell together!" announced in 1995, and then the same finding announced again.]
31
What makes a rule surprising?
  • Does not match prior expectation
      • Correlation between milk and cereal remains
        roughly constant over time
  • Cannot be trivially derived from simpler rules
      • Milk 10%, cereal 10%
      • Milk and cereal 10%: surprising
      • Eggs 10%
      • Milk, cereal and eggs 0.1%: surprising!
      • Expected 1%
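The arithmetic behind this example, reading the slide's figures as percentages, is the support expected if the parts occurred independently. The helper names below are hypothetical.

```python
def expected_support(*supports):
    """Support expected if the given itemsets occurred independently."""
    expected = 1.0
    for s in supports:
        expected *= s
    return expected

def surprise_ratio(observed, *supports):
    """Observed support divided by the support expected under independence."""
    return observed / expected_support(*supports)

# Milk 10% and cereal 10% -> 1% expected together, but 10% is observed.
print(expected_support(0.10, 0.10), surprise_ratio(0.10, 0.10, 0.10))

# {milk, cereal} 10% and eggs 10% -> 1% expected, but only 0.1% is observed.
print(expected_support(0.10, 0.10), surprise_ratio(0.001, 0.10, 0.10))
```

A ratio far above 1 (milk and cereal, about 10) or far below 1 (adding eggs, about 0.1) flags an itemset that cannot be derived from its simpler parts.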

32
Applications of fast itemset counting
  • Find correlated events
      • Applications in medicine: find redundant tests
      • Cross selling in retail, banking
  • Improve predictive capability of classifiers that
    assume attribute independence
  • New similarity measures of categorical
    attributes [Mannila et al., KDD 98]

33
Mining market
  • Around 20 to 30 mining tool vendors; about 1/5th the
    size of the OLAP market.
  • Major players
      • Clementine,
      • IBM's Intelligent Miner,
      • SGI's MineSet,
      • SAS's Enterprise Miner.
  • All pretty much the same set of tools
  • Many embedded products: fraud detection,
    electronic commerce applications

34
Integrating mining with DBMS
  • Need to
      • intermix operations
      • iterate through results
      • flexibly query and filter results and data
  • Existing file-based, batched approach is not
    satisfactory.
  • Research challenge: identify a collection of
    primitive, composable operators like in
    relational DBMS and build a mining engine

35
OLAP Mining integration
  • OLAP (On Line Analytical Processing)
      • Multidimensional view of data: factors are
        dimensions, quantities to be analyzed are
        measures/cells.
      • Facilitates fast interactive exploration of
        multidimensional aggregates.
  • OLAP products provide a minimal set of tools for
    analysis
  • Heavy reliance on manual operations for analysis
      • tedious and error-prone on large multidimensional
        data
  • Ideal platform for vertical integration of mining
    but needs to be interactive instead of batch.

36
State of the art in mining-OLAP integration
  • Decision trees [Information Discovery, Cognos]
      • find factors influencing high profits
  • Clustering [Pilot Software]
      • segment customers to define hierarchy on that
        dimension
  • Time series analysis [Seagate's Holos]
      • Query for various shapes along time: spikes,
        outliers etc.
  • Multi-level associations [Han et al.]
      • find association between members of dimensions

37
New approach
  • Identify complex operations with specific
    OLAP needs in mind (what does an analyst need?)
    rather than looking at mining operations and
    choosing what fits
  • Two examples
      • Exceptions in data to guide exploration
          • One reason for manual exploration is to make
            sure that there are no surprises.
          • Pre-mine abnormalities in data and point them
            out to analysts using highlights at aggregate
            levels
      • Reasons for specific "why" questions at aggregate
        level
          • most compactly represent the answer so that the
            user can quickly assimilate it

38
Vertical integration: Mining on the web
  • Web log analysis for site design
      • what are popular pages,
      • what links are hard to find.
  • Electronic stores: sales enhancements
      • recommendations, advertisement
      • Collaborative filtering [Net Perceptions, Wisewire]
      • Inventory control: what was a shopper looking for
        and could not find?

39
Research problems
  • Automatic model selection: different ways of
    solving same problem, which one to use?
  • Automatic classification of complex data types,
    especially time series data.
  • Refreshing mined results: explaining and modeling
    changes along time
  • Quality of mined results: guarding against wrong
    conclusions, chance discoveries
  • Incorporating domain knowledge to filter results
    and improve result quality

40
Research problems
  • Close integration with data sources to be mined
  • Distributed mining across multiple relations at a
    single site or spread across multiple sites.
  • Integration with other data analysis tools, for
    example statistical tools, OLAP and SQL querying
  • Interactive data mining: toolkit of micro
    operators
  • Mixed media mining: link textual reports with
    images and numeric fields

41
Relevance in India
  • Emerging application areas, especially in the
    banking, retail industry and manufacturing
    processes
  • Mining large scientific databases: export laws
    might require indigenous technology
  • Rich research area with interesting algorithm
    components -- just need to implement.
  • Too expensive to purchase US/Europe products

42
  • Need to build usable prototypes not simply tweak
    algorithms for publications.

43
Summary
  • What is data mining and an overview of the
    various operations
      • Classification: regression, nearest neighbour,
        neural network, Bayesian
      • Clustering: distance-based (k-means),
        distribution-based (EM)
      • Itemset counting
  • Several operations: the challenge is choosing the
    right operation for the problem
  • New directions and identification of research
    problems