I don't need a title slide for a lecture

Provided by: rachelpo3

1
I don't need a title slide for a lecture
  • Long long ago,
  • in a galaxy far, far away

2
Outline
  • Background
  • Data mining
  • Association Rules
  • Classification
  • Clustering
  • Sequential Patterns
  • Sequence Similarity

3
Knowledge Discovery in Databases (KDD)
  • What is it?
  • Finding useful patterns in data
  • Why do we need it?
  • Terabytes of data
  • Impractical to manually search for patterns
  • Where does data mining come in?

4
Steps of a KDD process
  • Learn the application domain
  • Create a target dataset
  • Clean and preprocess data
  • Choose type of data mining
  • Pick an algorithm
  • Perform data mining
  • Interpret results

5
Databases vs. Data warehousing
  • Data warehousing
  • Storage of all data
  • Details or summaries
  • Metadata
  • Data cleaning, integration
  • Databases
  • Queries over current data
  • Persistent storage
  • Atomic updates

6
Databases vs. Data warehouses
  • Databases provide for
  • Queries over current data
  • Persistent storage
  • Atomic updates
  • Data warehouses provide for
  • Storage of all data
  • Metadata
  • Data cleaning, integration
  • Fast access to data

7
Who's interested?
  • Databases - large amounts of data
  • Artificial Intelligence - search, planning,
    machine learning
  • Information Retrieval - searching for similar
    documents
  • Image Processing - finding similar images

8
Types of data mining
  • Association Rules
  • Classification
  • Clustering
  • Sequential Patterns
  • Sequence Similarity

9
Association rules
  • What are they?
  • Looking for common co-occurrence relationships in basket
    data
  • Where are they used?
  • Store layout
  • Catalog design
  • Customer segmentation

10
Association rules example
Find all itemsets that occur at least twice, and
the association rule implied by each

11
Association rules metrics
  • For a rule a → b
  • support: a and b occur together in at least a
    fraction s of the n baskets
  • confidence: of all of the baskets containing a,
    at least a fraction c also contain b
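
The two metrics above can be computed directly from a list of baskets. The following is a minimal sketch, assuming baskets are sets of item names; the function names and the basket data are invented for illustration and do not come from the slides.

```python
from typing import FrozenSet, Iterable, List

Basket = FrozenSet[str]


def support(baskets: List[Basket], items: Iterable[str]) -> float:
    """Fraction of the baskets that contain every item in `items`."""
    wanted = frozenset(items)
    return sum(1 for basket in baskets if wanted <= basket) / len(baskets)


def confidence(baskets: List[Basket], lhs: Iterable[str], rhs: Iterable[str]) -> float:
    """Of the baskets containing `lhs`, the fraction that also contain `rhs`."""
    lhs_items = frozenset(lhs)
    lhs_support = support(baskets, lhs_items)
    if lhs_support == 0:
        return 0.0
    return support(baskets, lhs_items | frozenset(rhs)) / lhs_support


# Hypothetical basket data, invented for this example.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support(baskets, {"bread", "milk"}))       # 2 of the 4 baskets
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule bread → milk would be reported only if both values meet the chosen thresholds s and c.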

12
Association rules algorithms
  • Focus on finding support for itemsets
  • The naïve method
  • Combine itemsets of size k-1 that differ only on
    the last item to find Candidates_k
  • Measure support of the candidates from step 1 to
    form the large itemsets of size k
  • Increase k and repeat until no new large itemsets
    are found
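
The level-wise loop above can be sketched as follows. This is an illustrative sketch, not the code from the cited paper: itemsets are kept as sorted tuples, support is an absolute basket count, and the helper names and basket data are assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple

Itemset = Tuple[str, ...]


def join_candidates(large_prev: List[Itemset]) -> List[Itemset]:
    """Join sorted (k-1)-itemsets that differ only on their last item."""
    prev = sorted(large_prev)
    candidates = []
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            if a[:-1] == b[:-1]:
                candidates.append(a + (b[-1],))
    return candidates


def count_support(baskets: List[Set[str]], itemset: Itemset) -> int:
    """Number of baskets that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for basket in baskets if wanted <= basket)


def large_itemsets(baskets: List[Set[str]], min_count: int) -> Dict[Itemset, int]:
    """Grow itemsets one item at a time until no level yields a large itemset."""
    items = sorted({item for basket in baskets for item in basket})
    level = [(i,) for i in items if count_support(baskets, (i,)) >= min_count]
    found: Dict[Itemset, int] = {}
    while level:
        for itemset in level:
            found[itemset] = count_support(baskets, itemset)
        level = [c for c in join_candidates(level)
                 if count_support(baskets, c) >= min_count]
    return found


# Hypothetical baskets, looking for a support of 2.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
for itemset, count in large_itemsets(baskets, min_count=2).items():
    print(itemset, count)
```

Each pass rescans all baskets for every candidate, which is why reducing the number of candidates matters.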

13
Itemsets of size 1
Looking for support of 2

14
Finding candidate set 2

15
Finding candidate set 3

16
Apriori algorithm
  • An itemset cannot be a large itemset unless all
    of its subsets are large itemsets
  • Reduces number of candidate itemsets considered
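
The pruning rule can be applied before any support counting. A minimal sketch, assuming itemsets are sorted tuples; the function name `apriori_prune` is invented for this example.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Itemset = Tuple[str, ...]


def apriori_prune(candidates: Iterable[Itemset],
                  large_prev: Iterable[Itemset]) -> List[Itemset]:
    """Keep a k-item candidate only if all of its (k-1)-item subsets are large."""
    large = set(large_prev)
    return [
        c for c in candidates
        if all(subset in large for subset in combinations(c, len(c) - 1))
    ]


# ("b", "c") is not large, so ("a", "b", "c") is dropped without a data scan.
print(apriori_prune([("a", "b", "c")], [("a", "b"), ("a", "c")]))
# With all three 2-item subsets large, the candidate survives to be counted.
print(apriori_prune([("a", "b", "c")], [("a", "b"), ("a", "c"), ("b", "c")]))
```

Checking only the (k-1)-item subsets is enough, because those subsets were themselves kept only if their own subsets were large.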

17
Research directions
  • Online construction of rules
  • CARMA (Berkeley)
  • Pre-filtering the data
  • a posteriori (Limburgs Universitair Centrum)

18
Classification
  • What is it?
  • Rules that partition data into separate groups.
  • Where is it used?
  • to classify people as good/bad credit risks
  • weather prediction
  • fraud detection
  • Variation: best k of n (who to send flyers to)

19
Classification example

20
Possible solutions
  • Bayesian classification
  • Neural networks
  • Genetic algorithms
  • Decision Trees

21
Decision trees
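[Figure: decision tree. Root test "Salary < 25,000" with yes/no branches; one branch leads to a second test, "Graduate education?", also with yes/no branches. Leaves: Accept, Accept, Reject.]

22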
Decision trees
  • Build the tree in two steps
  • Step 1: build a perfect tree on sample data
    • At each node, pick a good attribute
    • Split data according to attribute
    • Recursively build tree on children
  • Step 2: prune the tree
    • Minimum Description Length
    • Cost of encoding tree structure
    • Cost of encoding split attribute
    • Cost of encoding leaf data records
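
The building step can be sketched as a short recursive function. The slides do not say how a "good attribute" is chosen, so this sketch assumes Gini impurity as the criterion; the records, attribute names, and function names are invented for illustration, and the MDL pruning step is not shown.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_attribute(rows, attributes, target):
    """Pick the attribute whose split gives the lowest weighted impurity."""
    def split_impurity(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return min(attributes, key=split_impurity)


def build_tree(rows, attributes, target):
    """Return a leaf label, or (attribute, {value: subtree}) for an inner node."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, attributes, target)
    remaining = [a for a in attributes if a != attr]
    children = {}
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        children[value] = build_tree(subset, remaining, target)
    return (attr, children)


def classify(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[row[attr]]
    return tree


# Hypothetical applicant records, invented for this example.
rows = [
    {"low_salary": False, "graduate": False, "decision": "accept"},
    {"low_salary": False, "graduate": False, "decision": "accept"},
    {"low_salary": False, "graduate": True, "decision": "accept"},
    {"low_salary": True, "graduate": True, "decision": "accept"},
    {"low_salary": True, "graduate": False, "decision": "reject"},
]
tree = build_tree(rows, ["low_salary", "graduate"], "decision")
print(tree)
print(classify(tree, {"low_salary": True, "graduate": False}))
```

A tree grown this way fits the sample exactly, which is why a separate pruning step is needed afterwards.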

23
Research directions
  • Integrate building and pruning
  • PUBLIC (Bell Labs)
  • Incremental Updates
  • BOAT (University of Wisconsin)

24
Clustering
  • What is it?
  • Given n points, separate them into k clusters
  • Where is it used?
  • Information retrieval - text classification
  • Identify similar web documents
  • Mapping the universe

25
Clustering example

26
Traditional clustering algorithms
  • Partitional
  • Determine k partitions that optimize a function
  • Common function is the square error function
  • Hierarchical
  • Each point starts as a cluster
  • Clusters are merged until k clusters remain
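
Both ideas can be sketched in a few lines: a hierarchical merge loop that stops at k clusters, and the square error function that a partitional method would try to minimize. This is an illustrative sketch; merging by closest centroids is an assumption (other merge criteria exist), and the points are invented.

```python
from itertools import combinations


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def square_error(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(sq_dist(p, centroid(c)) for c in clusters for p in c)


def agglomerate(points, k):
    """Start with one cluster per point; merge the closest pair until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]), centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
clusters = agglomerate(points, 2)
print(clusters)
print(square_error(clusters))
```

The merge loop recomputes every pairwise centroid distance on each pass, so it is only practical for small inputs.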

27
Clustering difficulties

28
Research directions
  • Higher dimension subspace clustering
  • CLIQUE (IBM Almaden)
  • Incremental clustering
  • Incremental DBScan (University of Munich)
  • Remove problems with outliers
  • CURE (Bell Labs)

29
Sequential patterns
  • What is it?
  • Given a set of events, find frequently occurring
    patterns
  • Where is it used?
  • Analyzing basket data
  • Medical diagnosis

30
Sequential patterns example

31
AprioriAll
  • Create all large events that occur once
  • Map each subset to numbers
  • While there still are large itemsets:
  • Find candidate itemsets of length k
  • Find large itemsets of length k
  • Increase k
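
The loop above can be sketched in simplified form. This sketch makes several assumptions that are not in the slides: only single items are treated as large itemsets, support is the number of customers whose sequence contains the pattern, and candidates are formed by joining patterns that share all but their last element. All names and data are invented for illustration.

```python
from collections import Counter


def contains(sequence, pattern):
    """True if the ids in `pattern` occur in order in distinct transactions."""
    pos = 0
    for transaction in sequence:
        if pos < len(pattern) and pattern[pos] in transaction:
            pos += 1
    return pos == len(pattern)


def mine_sequences(customers, min_count):
    # Find the large items: those bought by at least `min_count` customers.
    item_counts = Counter(
        item for seq in customers for item in set().union(*seq)
    )
    large_items = sorted(i for i, c in item_counts.items() if c >= min_count)

    # Map each large item to a number and rewrite the sequences with those ids.
    ids = {item: n for n, item in enumerate(large_items, start=1)}
    transformed = [[{ids[i] for i in t if i in ids} for t in seq]
                   for seq in customers]

    def count(pattern):
        return sum(1 for seq in transformed if contains(seq, pattern))

    # Level-wise search over sequences of length k = 1, 2, ...
    level = [(n,) for n in ids.values()]
    found = {}
    while level:
        for pattern in level:
            found[pattern] = count(pattern)
        candidates = [a + (b[-1],) for a in level for b in level
                      if a[:-1] == b[:-1]]
        level = [c for c in candidates if count(c) >= min_count]
    return ids, found


# Hypothetical customer histories: each is an ordered list of transactions.
customers = [
    [{"a"}, {"b"}, {"c"}],
    [{"a", "d"}, {"b"}],
    [{"a"}, {"c"}],
]
ids, found = mine_sequences(customers, min_count=2)
print(ids)
print(found)
```

Working with small integer ids instead of itemsets keeps the containment test in the inner loop cheap.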

32
Mapping the itemsets

33
Research directions
  • Time limitations
  • WINEPI (Helsinki/Microsoft)
  • Itemsets over multiple transactions
  • CSP (IBM Almaden)

34
Sequence Similarity
  • What is it?
  • Given a number of data sets, look for similar
    trends
  • Where is it used?
  • Find stocks with similar price movements
  • Find geological irregularities

35
Example
  • Are the two sequences similar?

36
Basic algorithm
  • Scale data
  • Match all gap-free sequences
  • Form pairs of large similar sequences
  • Find the longest common subsequence
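
Two of these steps, scaling and the longest common subsequence, can be sketched as follows. This is a simplified illustration, not the full algorithm: the gap-free window matching and pairing steps are omitted, scaling by mean and standard deviation is an assumption, and the tolerance `eps` and the sample series are invented.

```python
from statistics import mean, pstdev


def scale(series):
    """Shift and scale a series to zero mean and unit standard deviation."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s if s else 0.0 for x in series]


def lcs_length(xs, ys, eps):
    """Longest common subsequence, treating values within `eps` as equal."""
    table = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, start=1):
        for j, y in enumerate(ys, start=1):
            if abs(x - y) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Hypothetical price series: the second is the first at ten times the scale.
a = [10, 12, 14, 12, 10]
b = [100, 120, 140, 120, 100]
print(lcs_length(scale(a), scale(b), eps=0.1))
```

Dividing the LCS length by the series length gives a similarity score between 0 and 1 that can be compared against a threshold.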

37
Research directions
  • Finding surprising patterns
  • IBM Almaden

38
Data mining directions
  • Sampling
  • Fractals
  • Pre-partitioning data
  • Making data mining more accessible
  • User defined aggregation support

39
References
  • General data mining: http://www.almaden.ibm.com/cs/quest,
    www.bell-labs.com/project/serendip
  • Association Rules: Fast Algorithms for Mining
    Association Rules, Agrawal and Srikant, VLDB '94.
  • Classification: PUBLIC: A Decision Tree
    Classifier that Integrates Building and Pruning,
    Rastogi and Shim, VLDB '98.

40
References (cont.)
  • Clustering: CURE: An Efficient Clustering
    Algorithm for Large Databases, Guha, Rastogi,
    and Shim, SIGMOD '98.
  • Sequential Patterns: Mining Sequential Patterns:
    Generalizations and Performance Improvements,
    Srikant and Agrawal, EDBT '96.
  • Similarity Search: Fast Similarity Search in the
    Presence of Noise, Scaling, and Translation in
    Time-Series Databases, Agrawal, Lin, Sawhney,
    and Shim, VLDB '95.