Data Mining - PowerPoint PPT Presentation

Learn more at: https://www.cs.kent.edu
Title: Data Mining


1
Data Mining
2
Outline
  • What is data mining?
  • Data Mining Tasks
  • Association
  • Classification
  • Clustering
  • Data mining Algorithms
  • Are all the patterns interesting?

3
What is Data Mining
  • The huge number of databases and web pages makes
    manual information extraction next to impossible
    (remember the favored statement "I will bury them
    in data!")
  • Inability of many other disciplines (statistics,
    AI, information retrieval) to provide scalable
    algorithms that extract information and/or rules
    from databases
  • Necessity to find relationships among data

4
What is Data Mining
  • Discovery of useful, possibly unexpected data
    patterns
  • Subsidiary issues
  • Data cleansing
  • Visualization
  • Warehousing

5
Examples
  • A big objection was that it was looking for so
    many vague connections that it was sure to find
    things that were bogus and thus violate
    innocents' privacy.
  • The Rhine Paradox: a great example of how not to
    conduct scientific research.

6
Rhine Paradox --- (1)
  • David Rhine was a parapsychologist in the 1950s
    who hypothesized that some people had
    Extra-Sensory Perception.
  • He devised an experiment where subjects were
    asked to guess 10 hidden cards --- red or blue.
  • He discovered that almost 1 in 1000 had ESP ---
    they were able to get all 10 right! (Random
    guessing gets all 10 right with probability
    2^-10, about 1 in 1000.)

7
Rhine Paradox --- (2)
  • He told these people they had ESP and called them
    in for another test of the same type.
  • Alas, he discovered that almost all of them had
    lost their ESP.
  • What did he conclude?
  • Answer on next slide.

8
Rhine Paradox --- (3)
  • He concluded that you shouldn't tell people they
    have ESP; it causes them to lose it.

9
A Concrete Example
  • This example illustrates a problem with
    intelligence-gathering.
  • Suppose we believe that certain groups of
    evil-doers are meeting occasionally in hotels to
    plot doing evil.
  • We want to find people who at least twice have
    stayed at the same hotel on the same day.

10
The Details
  • 10^9 people being tracked.
  • 1000 days.
  • Each person stays in a hotel 1% of the time (10
    days out of 1000).
  • Hotels hold 100 people (so 10^5 hotels).
  • If everyone behaves randomly (i.e., no
    evil-doers) will the data mining detect anything
    suspicious?

11
Calculations --- (1)
  • Probability that persons p and q will be at the
    same hotel on day d:
  • 1/100 × 1/100 × 10^-5 = 10^-9.
  • Probability that p and q will be at the same
    hotel on two given days:
  • 10^-9 × 10^-9 = 10^-18.
  • Pairs of days:
  • ≈ 5 × 10^5.

12
Calculations --- (2)
  • Probability that p and q will be at the same
    hotel on some two days:
  • 5 × 10^5 × 10^-18 = 5 × 10^-13.
  • Pairs of people:
  • ≈ 5 × 10^17.
  • Expected number of suspicious pairs of people:
  • 5 × 10^17 × 5 × 10^-13 = 250,000.
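The arithmetic on the last two slides can be checked with a short script; the parameters (10^9 people, 1000 days, 10^5 hotels, a 1% stay rate) are the slides' own assumptions:

```python
# Back-of-the-envelope check of the hotel calculation.
from math import comb

people, days, hotels = 10**9, 1000, 10**5
p_in_hotel = 0.01                               # each person is in some hotel 1% of days

p_same_day = p_in_hotel * p_in_hotel / hotels   # both in a hotel, and the same one: 10^-9
p_two_days = p_same_day ** 2                    # same hotel on two given days: 10^-18
pairs_of_days = comb(days, 2)                   # ~5 * 10^5
pairs_of_people = comb(people, 2)               # ~5 * 10^17

expected_suspicious = pairs_of_people * pairs_of_days * p_two_days
print(round(expected_suspicious))               # ~250,000
```

Using exact binomial coefficients rather than the slide's rounded 5 × 10^5 and 5 × 10^17 gives roughly 249,750 expected "suspicious" pairs, matching the slide's 250,000.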

13
Conclusion
  • Suppose there are (say) 10 pairs of evil-doers
    who definitely stayed at the same hotel twice.
  • Analysts have to sift through 250,010 candidates
    to find the 10 real cases.
  • Not gonna happen.
  • But how can we improve the scheme?

14
Appetizer
  • Consider a file consisting of 24471 records. The
    file contains at least two condition attributes,
    A and D.

A\D      0       1     total
0        9272    232    9504
1       14695    272   14967
total   23967    504   24471
15
Appetizer (cont)
  • Probability that a person has A: P(A) ≈ 0.6
  • Probability that a person has D: P(D) ≈ 0.02
  • Conditional probability that a person has D
    given that the person has A: P(D|A) =
    P(A∧D)/P(A) = (272/24471)/0.6 ≈ 0.02
  • P(A|D) = P(A∧D)/P(D) ≈ 0.54
  • What can we say about dependencies between A and
    D?

A\D      0       1     total
0        9272    232    9504
1       14695    272   14967
total   23967    504   24471
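These probabilities can be recomputed directly from the contingency table with a quick sketch:

```python
# Cell counts from the A/D contingency table: (A, D) -> count.
table = {(0, 0): 9272, (0, 1): 232, (1, 0): 14695, (1, 1): 272}
n = sum(table.values())                      # 24471

p_a = (table[(1, 0)] + table[(1, 1)]) / n    # P(A)  ~ 0.61
p_d = (table[(0, 1)] + table[(1, 1)]) / n    # P(D)  ~ 0.02
p_ad = table[(1, 1)] / n                     # P(A and D)

p_d_given_a = p_ad / p_a                     # ~0.02, close to P(D)
p_a_given_d = p_ad / p_d                     # ~0.54, close to P(A)
# Since P(D|A) ~ P(D) and P(A|D) ~ P(A), A and D look nearly independent.
```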
16
Appetizer(3)
  • So far we did not ask anything that statistics
    would not have asked. So is data mining just
    another word for statistics?
  • We hope that the response will be a resounding NO
  • The major difference is that statistical methods
    work with random data samples, whereas the data
    in databases is not necessarily random
  • The second difference is the size of the data set
  • The third difference is that statistical samples
    do not contain dirty data

17
Architecture of a Typical Data Mining System
(layers, top to bottom)
  • Graphical user interface
  • Pattern evaluation
  • Data mining engine (consulting a knowledge base)
  • Database or data warehouse server (filtering)
  • Data cleaning and data integration
  • Data warehouse / databases
18
Data Mining Tasks
  • Association (correlation and causality)
  • Multi-dimensional vs. single-dimensional
    association
  • age(X, "20..29") ∧ income(X, "20..29K") ⇒
    buys(X, "PC") [support = 2%, confidence = 60%]
  • contains(T, "computer") ⇒ contains(T,
    "software") [1%, 75%]
  • What is support? The percentage of the tuples
    in the database that have age between 20 and 29,
    income between 20K and 29K, and a PC purchase
  • What is confidence? The probability that if a
    person is between 20 and 29 with income between
    20K and 29K, then that person buys a PC
  • Clustering (getting data that are close together
    into the same cluster)
  • What does "close together" mean?

19
Distances between data
  • Distance between data is a measure of
    dissimilarity between data.
  • d(i,j) ≥ 0;  d(i,j) = d(j,i);  d(i,j) ≤ d(i,k) +
    d(k,j)
  • Euclidean distance between <x1,x2,...,xk> and
    <y1,y2,...,yk>: sqrt((x1-y1)² + ... + (xk-yk)²)
  • Standardize variables by finding the standard
    deviation and dividing each xi by the standard
    deviation of X
  • Covariance(X,Y) = (1/k) Σi (xi - mean(X))(yi - mean(Y))
  • Boolean variables and their distances
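A minimal sketch of these computations (Euclidean distance, standardization, covariance), assuming plain Python lists as data vectors:

```python
from math import sqrt

def euclidean(x, y):
    # Euclidean distance between two k-dimensional points.
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def standardize(values):
    # Center on the mean and divide by the (population) standard deviation.
    k = len(values)
    mean = sum(values) / k
    sd = sqrt(sum((v - mean) ** 2 for v in values) / k)
    return [(v - mean) / sd for v in values]

def covariance(xs, ys):
    # Covariance(X,Y) = (1/k) * sum_i (x_i - mean(X)) * (y_i - mean(Y))
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / k

print(euclidean([0, 0], [3, 4]))   # 5.0
```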

20
Data Mining Tasks
  • Outlier analysis
  • Outlier a data object that does not comply with
    the general behavior of the data
  • It can be considered as noise or exception but is
    quite useful in fraud detection, rare events
    analysis
  • Trend and evolution analysis
  • Trend and deviation regression analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
  • Other pattern-directed or statistical analyses

21
Are All the Discovered Patterns Interesting?
  • A data mining system/query may generate thousands
    of patterns; not all of them are interesting.
  • Suggested approach: human-centered, query-based,
    focused mining
  • Interestingness measures: a pattern is
    interesting if it is easily understood by humans,
    valid on new or test data with some degree of
    certainty, potentially useful, novel, or
    validates some hypothesis that a user seeks to
    confirm
  • Objective vs. subjective interestingness
    measures
  • Objective: based on statistics and structures of
    patterns, e.g., support, confidence, etc.
  • Subjective: based on the user's belief in the data,
    e.g., unexpectedness, novelty, actionability, etc.

22
Are All the Discovered Patterns Interesting? -
Example


tea\coffee   coffee   no coffee   total
tea             20        5          25
no tea          70        5          75
total           90       10         100

Conditional probability that if one buys coffee,
one also buys tea: 20/90 = 2/9. Conditional
probability that if one buys tea she also buys
coffee: 20/25 = .8. However, the probability that
she buys coffee is .9. So, is it a significant
inference that if a customer buys tea she also buys
coffee? Are buying tea and buying coffee
independent activities?
23
How to measure Interestingness
  • RI = |X ∧ Y| - |X|·|Y|/N
  • Support and confidence: |X ∧ Y|/N is the support
    and |X ∧ Y|/|X| is the confidence of X ⇒ Y
  • Chi² = Σ (|XY| - E(XY))² / E(XY)
  • J(X⇒Y) = P(Y)·(P(X|Y)·log(P(X|Y)/P(X)) + (1 -
    P(X|Y))·log((1 - P(X|Y))/(1 - P(X))))
  • Sufficiency(X⇒Y) = P(X|Y)/P(X|¬Y); Necessity(X⇒Y)
    = P(¬X|Y)/P(¬X|¬Y). Interestingness of Y⇒X is
  • NC = (1 - N(X⇒Y))·P(Y), if N(X⇒Y) is less than 1,
    or 0 otherwise

24
Can We Find All and Only Interesting Patterns?
  • Find all the interesting patterns Completeness
  • Can a data mining system find all the interesting
    patterns?
  • Association vs. classification vs. clustering
  • Search for only interesting patterns
    Optimization
  • Can a data mining system find only the
    interesting patterns?
  • Approaches
  • First generate all the patterns and then filter
    out the uninteresting ones.
  • Generate only the interesting patterns: mining
    query optimization

25
Clustering
  • Partition data set into clusters, and one can
    store cluster representation only
  • Can be very effective if data is clustered but
    not if data is smeared
  • Can have hierarchical clustering and be stored in
    multi-dimensional index tree structures
  • There are many choices of clustering definitions
    and clustering algorithms.

26
Example Clusters
[Figure: scatter plot showing three clusters of
points, with a few outliers falling outside every
cluster]
27
Sampling
  • Allow a mining algorithm to run in complexity
    that is potentially sub-linear to the size of the
    data
  • Choose a representative subset of the data
  • Simple random sampling may have very poor
    performance in the presence of skew
  • Develop adaptive sampling methods
  • Stratified sampling
  • Approximate the percentage of each class (or
    subpopulation of interest) in the overall
    database
  • Used in conjunction with skewed data
  • Sampling may not reduce database I/Os (page at a
    time).
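A minimal sketch of stratified sampling, under the assumption that records are dicts and the stratum is given by a key function (both names are illustrative, not from the slides):

```python
import random

def stratified_sample(records, key, fraction, seed=0):
    # Group records into strata, then sample each stratum
    # in proportion to its share of the data.
    rng = random.Random(seed)
    strata = {}
    for r in records:
        strata.setdefault(key(r), []).append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(fraction * len(group)))   # keep rare strata represented
        sample.extend(rng.sample(group, k))
    return sample

data = [{"cls": "A"}] * 90 + [{"cls": "B"}] * 10
s = stratified_sample(data, key=lambda r: r["cls"], fraction=0.1)
print(len(s))   # 10 records: 9 from class A, 1 from class B
```

A simple random sample of the same size could easily miss class B entirely, which is the skew problem the slide mentions.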

28
Sampling
SRSWOR (simple random sample without
replacement)
SRSWR (simple random sample with replacement)
29
Sampling
Cluster/Stratified Sample
Raw Data
30
Discretization
  • Three types of attributes
  • Nominal: values from an unordered set
  • Ordinal: values from an ordered set
  • Continuous: real numbers
  • Discretization:
  • divide the range of a continuous attribute into
    intervals
  • Some classification algorithms only accept
    categorical attributes.
  • Reduce data size by discretization
  • Prepare for further analysis

31
Discretization
  • Discretization
  • reduce the number of values for a given
    continuous attribute by dividing the range of the
    attribute into intervals. Interval labels can
    then be used to replace actual data values.

32
Discretization
[Flowchart: sort attribute → select cut point →
evaluate measure → if the measure is satisfied,
done; otherwise split/merge and repeat, or stop]
33
Discretization
  • Dynamic vs Static
  • Local vs Global
  • Top-Down vs Bottom-Up
  • Direct vs Incremental

34
Discretization Quality Evaluation
  • Total number of Intervals
  • The Number of Inconsistencies
  • Predictive Accuracy
  • Complexity

35
Discretization - Binning
  • Equal-width: the range between the min and max
    values is split into intervals of equal width
  • Equal-frequency: each bin contains
    approximately the same number of data points
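The two binning schemes can be sketched as follows (plain Python, hypothetical values):

```python
def equal_width_bins(values, k):
    # Split [min, max] into k intervals of equal width.
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Bin index per value; clamp the maximum into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    # Assign bins by rank so each bin gets ~len(values)/k items.
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    per_bin = len(values) / k
    for rank, i in enumerate(order):
        bins[i] = min(int(rank / per_bin), k - 1)
    return bins

vals = [1, 2, 3, 4, 5, 6, 100]
print(equal_width_bins(vals, 3))       # the outlier dominates the ranges
print(equal_frequency_bins(vals, 3))   # roughly equal counts per bin
```

Note how a single outlier (100) squeezes all the other values into one equal-width bin, while equal-frequency binning stays balanced.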

36
Entropy-Based Discretization
  • Given a set of samples S, if S is partitioned
    into two intervals S1 and S2 using boundary T,
    the entropy after partitioning is
    E(S,T) = (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2)
  • The boundary that minimizes the entropy function
    over all possible boundaries is selected as a
    binary discretization.
  • The process is applied recursively to the
    partitions obtained until some stopping criterion
    is met
  • Experiments show that it may reduce data size and
    improve classification accuracy
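A sketch of one step of this procedure: choose the boundary T that minimizes the weighted entropy (|S1|/|S|)·Ent(S1) + (|S2|/|S|)·Ent(S2), shown on a small hypothetical sample:

```python
from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_boundary(samples):
    # samples: list of (value, class_label) pairs.
    samples = sorted(samples)
    best_t, best_e = None, float("inf")
    for i in range(1, len(samples)):
        t = (samples[i - 1][0] + samples[i][0]) / 2   # candidate boundary
        left = [lab for v, lab in samples if v <= t]
        right = [lab for v, lab in samples if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(samples)
        if e < best_e:
            best_t, best_e = t, e
    return best_t

data = [(1, "no"), (2, "no"), (3, "no"), (10, "yes"), (11, "yes")]
print(best_boundary(data))   # 6.5, the class boundary
```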

37
Data Mining Primitives, Languages, and System
Architectures
  • Data mining primitives What defines a data
    mining task?
  • A data mining query language
  • Design graphical user interfaces based on a data
    mining query language
  • Architecture of data mining systems

38
Why Data Mining Primitives and Languages?
  • Data mining should be an interactive process
  • User directs what to be mined
  • Users must be provided with a set of primitives
    to be used to communicate with the data mining
    system
  • Incorporating these primitives in a data mining
    query language
  • More flexible user interaction
  • Foundation for design of graphical user interface
  • Standardization of data mining industry and
    practice

39
What Defines a Data Mining Task ?
  • Task-relevant data
  • Type of knowledge to be mined
  • Background knowledge
  • Pattern interestingness measurements
  • Visualization of discovered patterns

40
Task-Relevant Data (Minable View)
  • Database or data warehouse name
  • Database tables or data warehouse cubes
  • Condition for data selection
  • Relevant attributes or dimensions
  • Data grouping criteria

41
Types of knowledge to be mined
  • Characterization
  • Discrimination
  • Association
  • Classification/prediction
  • Clustering
  • Outlier analysis
  • Other data mining tasks

42
A Data Mining Query Language (DMQL)
  • Motivation
  • A DMQL can provide the ability to support ad-hoc
    and interactive data mining
  • By providing a standardized language like SQL
  • Hope to achieve a similar effect like that SQL
    has on relational database
  • Foundation for system development and evolution
  • Facilitate information exchange, technology
    transfer, commercialization and wide acceptance
  • Design
  • DMQL is designed with the primitives described
    earlier

43
Syntax for DMQL
  • Syntax for specification of
  • task-relevant data
  • the kind of knowledge to be mined
  • concept hierarchy specification
  • interestingness measure
  • pattern presentation and visualization
  • Putting it all together a DMQL query

44
Syntax for task-relevant data specification
  • use database database_name, or use data warehouse
    data_warehouse_name
  • from relation(s)/cube(s) where condition
  • in relevance to att_or_dim_list
  • order by order_list
  • group by grouping_list
  • having condition

45
Specification of task-relevant data
46
Syntax for specifying the kind of knowledge to be
mined
  • Characterization
  • Mine_Knowledge_Specification ::= mine
    characteristics [as pattern_name] analyze
    measure(s)
  • Discrimination
  • Mine_Knowledge_Specification ::= mine
    comparison [as pattern_name] for
    target_class where target_condition {versus
    contrast_class_i where contrast_condition_i}
    analyze measure(s)
  • Association
  • Mine_Knowledge_Specification ::= mine
    associations [as pattern_name]

47
Syntax for specifying the kind of knowledge to be
mined (cont.)
  • Classification
  • Mine_Knowledge_Specification ::= mine
    classification [as pattern_name] analyze
    classifying_attribute_or_dimension
  • Prediction
  • Mine_Knowledge_Specification ::= mine
    prediction [as pattern_name] analyze
    prediction_attribute_or_dimension {set
    attribute_or_dimension_i = value_i}

48
Syntax for concept hierarchy specification
  • To specify what concept hierarchies to use:
  • use hierarchy <hierarchy> for
    <attribute_or_dimension>
  • We use different syntax to define different types
    of hierarchies
  • schema hierarchies
  • define hierarchy time_hierarchy on date as
    [date, month, quarter, year]
  • set-grouping hierarchies
  • define hierarchy age_hierarchy for age on
    customer as
  • level1: {young, middle_aged, senior} < level0:
    all
  • level2: {20, ..., 39} < level1: young
  • level2: {40, ..., 59} < level1: middle_aged
  • level2: {60, ..., 89} < level1: senior

49
Syntax for concept hierarchy specification (Cont.)
  • operation-derived hierarchies
  • define hierarchy age_hierarchy for age on
    customer as
  • {age_category(1), ..., age_category(5)} :=
    cluster(default, age, 5) < all(age)
  • rule-based hierarchies
  • define hierarchy profit_margin_hierarchy on item
    as
  • level_1: low_profit_margin < level_0: all
  • if (price - cost) < $50
  • level_1: medium_profit_margin < level_0: all
  • if ((price - cost) >= $50) and ((price -
    cost) <= $250)
  • level_1: high_profit_margin < level_0: all
  • if (price - cost) > $250

50
Syntax for interestingness measure specification
  • Interestingness measures and thresholds can be
    specified by the user with the statement:
  • with <interest_measure_name> threshold =
    threshold_value
  • Example:
  • with support threshold = 0.05
  • with confidence threshold = 0.7

51
Syntax for pattern presentation and visualization
specification
  • We have syntax which allows users to specify the
    display of discovered patterns in one or more
    forms:
  • display as <result_form>
  • To facilitate interactive viewing at different
    concept levels, the following syntax is defined:
  • Multilevel_Manipulation ::= roll up on
    attribute_or_dimension | drill down on
    attribute_or_dimension | add
    attribute_or_dimension | drop
    attribute_or_dimension

52
Putting it all together the full specification
of a DMQL query
  • use database AllElectronics_db
  • use hierarchy location_hierarchy for B.address
  • mine characteristics as customerPurchasing
  • analyze count
  • in relevance to C.age, I.type, I.place_made
  • from customer C, item I, purchases P,
    items_sold S, works_at W, branch B
  • where I.item_ID = S.item_ID and S.trans_ID =
    P.trans_ID
  • and P.cust_ID = C.cust_ID and P.method_paid =
    "AmEx"
  • and P.empl_ID = W.empl_ID and W.branch_ID =
    B.branch_ID and B.address = "Canada" and
    I.price >= 100
  • with noise threshold = 0.05
  • display as table

53
DMQL and SQL
  • DMQL: Describe general characteristics of
    graduate students in the Big-University database
  • use Big_University_DB
  • mine characteristics as Science_Students
  • in relevance to name, gender, major, birth_place,
    birth_date, residence, phone, gpa
  • from student
  • where status in "graduate"
  • Corresponding SQL statement:
  • Select name, gender, major, birth_place,
    birth_date, residence, phone, gpa
  • from student
  • where status in {"Msc", "MBA", "PhD"}

54
Decision Trees
  • Example
  • Conducted a survey to see which customers were
    interested in a new model car
  • Want to select customers for an advertising
    campaign

training set
55
One Possibility

age < 30?
  Y: city = sf?   Y: likely    N: unlikely
  N: car = van?   Y: likely    N: unlikely
56
Another Possibility

car = taurus?
  Y: city = sf?   Y: likely    N: unlikely
  N: age < 45?    Y: likely    N: unlikely
57
Issues
  • Decision tree cannot be too deep
  • would not have statistically significant amounts
    of data for lower decisions
  • Need to select tree that most reliably predicts
    outcomes

58
Top-Down Induction of Decision Tree
Attributes: Outlook, Temperature, Humidity,
Wind
PlayTennis: {yes, no}
59
Entropy and Information Gain
  • S contains si tuples of class Ci for i = 1, ..., m
  • Information measures the info required to
    classify any arbitrary tuple:
    I(s1, ..., sm) = -Σi (si/s)·log2(si/s)
  • Entropy of attribute A with values {a1, a2, ..., av}:
    E(A) = Σj ((s1j + ... + smj)/s)·I(s1j, ..., smj)
  • Information gained by branching on attribute A:
    Gain(A) = I(s1, ..., sm) - E(A)

60
Example Analytical Characterization
  • Task
  • Mine general characteristics describing graduate
    students using analytical characterization
  • Given
  • attributes: name, gender, major, birth_place,
    birth_date, phone, and gpa
  • Gen(ai): concept hierarchies on ai
  • Ui: attribute analytical thresholds for ai
  • Ti: attribute generalization thresholds for ai
  • R: attribute relevance threshold

61
Example Analytical Characterization (contd)
  • 1. Data collection
  • target class: graduate student
  • contrasting class: undergraduate student
  • 2. Analytical generalization using Ui
  • attribute removal
  • remove name and phone
  • attribute generalization
  • generalize major, birth_place, birth_date and
    gpa
  • accumulate counts
  • candidate relation: gender, major, birth_country,
    age_range and gpa

62
Example Analytical characterization (3)
  • 3. Relevance analysis
  • Calculate expected info required to classify an
    arbitrary tuple
  • Calculate entropy of each attribute e.g. major

63
Example Analytical Characterization (4)
  • Calculate expected info required to classify a
    given sample if S is partitioned according to the
    attribute
  • Calculate information gain for each attribute
  • Information gain for all attributes

64
Example Analytical characterization (5)
  • 4. Initial working relation (W0) derivation
  • R = 0.1
  • remove irrelevant/weakly relevant attributes from
    the candidate relation ⇒ drop gender and
    birth_country
  • remove the contrasting class candidate relation
  • 5. Perform attribute-oriented induction on W0
    using Ti

Initial target class working relation W0
Graduate students
65
What Is Association Mining?
  • Association rule mining
  • Finding frequent patterns, associations,
    correlations, or causal structures among sets of
    items or objects in transaction databases,
    relational databases, and other information
    repositories.
  • Applications
  • Basket data analysis, cross-marketing, catalog
    design, loss-leader analysis, clustering,
    classification, etc.
  • Examples:
  • Rule form: Body ⇒ Head [support, confidence].
  • buys(x, "diapers") ⇒ buys(x, "beers") [0.5%,
    60%]
  • major(x, "CS") ∧ takes(x, "DB") ⇒ grade(x, "A")
    [1%, 75%]

66
Association Rule Mining
transaction id
customer id
products bought
sales records
market-basket data
  • Trend: Products p5, p8 often bought together
  • Trend: Customer 12 likes product p9

67
Association Rule
  • Rule: {p1, p3, p8}
  • Support: number of baskets where these products
    appear
  • High-support set: support ≥ threshold s
  • Problem: find all high-support sets

68
Association Rule Basic Concepts
  • Given: (1) a database of transactions, (2) each
    transaction is a list of items (purchased by a
    customer in a visit)
  • Find: all rules that correlate the presence of
    one set of items with that of another set of
    items
  • E.g., 98% of people who purchase tires and auto
    accessories also get automotive services done
  • Applications
  • * ⇒ Maintenance Agreement (what the store
    should do to boost Maintenance Agreement sales)
  • Home Electronics ⇒ * (what other products
    should the store stock up on?)
  • Attached mailing in direct marketing
  • Detecting "ping-ponging" of patients, faulty
    collisions

69
Rule Measures Support and Confidence
  • Find all the rules X ∧ Y ⇒ Z with minimum
    confidence and support
  • support, s: probability that a transaction
    contains {X, Y, Z}
  • confidence, c: conditional probability that a
    transaction having {X, Y} also contains Z

[Venn diagram: customers buying diaper, beer, or
both]

  • Let minimum support = 50% and minimum confidence
    = 50%; then we have
  • A ⇒ C (50%, 66.6%)
  • C ⇒ A (50%, 100%)

70
Mining Association Rules: An Example
Min. support: 50%   Min. confidence: 50%
  • For rule A ⇒ C:
  • support = support({A, C}) = 50%
  • confidence = support({A, C})/support({A}) =
    66.6%
  • The Apriori principle:
  • Any subset of a frequent itemset must be frequent

71
Mining Frequent Itemsets the Key Step
  • Find the frequent itemsets: the sets of items
    that have minimum support
  • A subset of a frequent itemset must also be a
    frequent itemset
  • i.e., if {A, B} is a frequent itemset, both {A}
    and {B} must be frequent itemsets
  • Iteratively find frequent itemsets with
    cardinality from 1 to k (k-itemsets)
  • Use the frequent itemsets to generate association
    rules.

72
The Apriori Algorithm
  • Join Step: Ck is generated by joining Lk-1 with
    itself
  • Prune Step: Any (k-1)-itemset that is not
    frequent cannot be a subset of a frequent
    k-itemset
  • Pseudo-code:
    Ck: candidate itemset of size k
    Lk: frequent itemset of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
        increment the count of all candidates in
        Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
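The pseudo-code above can be sketched as a compact, non-optimized Python implementation; min_support is an absolute count here, and the example transactions are illustrative:

```python
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]

    def freq(cands):
        # Keep candidates contained in at least min_support transactions.
        return {c for c in cands
                if sum(c <= t for t in transactions) >= min_support}

    items = {i for t in transactions for i in t}
    L = freq({frozenset([i]) for i in items})      # L1: frequent items
    result = set(L)
    k = 1
    while L:
        # Join step: unions of Lk members that have exactly k+1 items...
        cands = {a | b for a in L for b in L if len(a | b) == k + 1}
        # ...prune step: every k-subset must itself be frequent.
        cands = {c for c in cands
                 if all(frozenset(s) in L for s in combinations(c, k))}
        L = freq(cands)
        result |= L
        k += 1
    return result

T = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(T, min_support=2))
```

On these four transactions with min_support = 2, the frequent itemsets include {B, C, E}, while {D} and {A, B} fall below the threshold.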

73
The Apriori Algorithm Example
[Worked example: scan database D to count C1 and
obtain L1; join L1 to form C2 and scan D to obtain
L2; join L2 to form C3 and scan D to obtain L3]
74
How to Generate Candidates?
  • Suppose the items in Lk-1 are listed in an order
  • Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
      p.itemk-1 < q.itemk-1
  • Step 2: pruning
    forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

75
How to Count Supports of Candidates?
  • Why is counting supports of candidates a problem?
  • The total number of candidates can be very large
  • One transaction may contain many candidates
  • Method
  • Candidate itemsets are stored in a hash-tree
  • Leaf node of hash-tree contains a list of
    itemsets and counts
  • Interior node contains a hash table
  • Subset function finds all the candidates
    contained in a transaction

76
Example of Generating Candidates
  • L3 = {abc, abd, acd, ace, bcd}
  • Self-joining: L3 * L3
  • abcd from abc and abd
  • acde from acd and ace
  • Pruning:
  • acde is removed because ade is not in L3
  • C4 = {abcd}

77
Criticism to Support and Confidence
  • Example 1: (Aggarwal & Yu, PODS'98)
  • Among 5000 students
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
  • play basketball ⇒ eat cereal [40%, 66.7%] is
    misleading because the overall percentage of
    students eating cereal is 75%, which is higher
    than 66.7%.
  • play basketball ⇒ not eat cereal [20%, 33.3%] is
    far more accurate, although with lower support
    and confidence

78
Criticism to Support and Confidence (Cont.)
  • Example 2:
  • X and Y: positively correlated
  • X and Z: negatively related
  • support and confidence of X ⇒ Z dominate
  • We need a measure of dependent or correlated
    events
  • P(B|A)/P(B) is also called the lift of rule A ⇒ B

79
Other Interestingness Measures Interest
  • Interest (correlation, lift):
    P(A ∧ B) / (P(A)·P(B))
  • taking both P(A) and P(B) into consideration
  • P(A ∧ B) = P(A)·P(B), if A and B are independent
    events
  • A and B are negatively correlated if the value is
    less than 1; otherwise A and B are positively
    correlated

80
Classification vs. Prediction
  • Classification
  • predicts categorical class labels
  • classifies data (constructs a model) based on the
    training set and the values (class labels) in a
    classifying attribute and uses it in classifying
    new data
  • Prediction
  • models continuous-valued functions, i.e.,
    predicts unknown or missing values
  • Typical Applications
  • credit approval
  • target marketing
  • medical diagnosis
  • treatment effectiveness analysis

81
Classification Process Model Construction
Classification Algorithms
IF rank = "professor" OR years > 6 THEN tenured =
"yes"
82
Classification Process Use the Model in
Prediction
(Jeff, Professor, 4)
Tenured?
83
Supervised vs. Unsupervised Learning
  • Supervised learning (classification)
  • Supervision The training data (observations,
    measurements, etc.) are accompanied by labels
    indicating the class of the observations
  • New data is classified based on the training set
  • Unsupervised learning (clustering)
  • The class labels of training data are unknown
  • Given a set of measurements, observations, etc.
    with the aim of establishing the existence of
    classes or clusters in the data

84
Training Dataset
This follows an example from Quinlan's ID3
85
Output: A Decision Tree for buys_computer

age?
  <30:    student?         no → no;        yes → yes
  30..40: yes
  >40:    credit rating?   excellent → yes; fair → no
86
Algorithm for Decision Tree Induction
  • Basic algorithm (a greedy algorithm)
  • Tree is constructed in a top-down recursive
    divide-and-conquer manner
  • At start, all the training examples are at the
    root
  • Attributes are categorical (if continuous-valued,
    they are discretized in advance)
  • Examples are partitioned recursively based on
    selected attributes
  • Test attributes are selected on the basis of a
    heuristic or statistical measure (e.g.,
    information gain)
  • Conditions for stopping partitioning
  • All samples for a given node belong to the same
    class
  • There are no remaining attributes for further
    partitioning majority voting is employed for
    classifying the leaf
  • There are no samples left

87
Information Gain (ID3/C4.5)
  • Select the attribute with the highest information
    gain
  • Assume there are two classes, P and N
  • Let the set of examples S contain p elements of
    class P and n elements of class N
  • The amount of information needed to decide if an
    arbitrary example in S belongs to P or N is
    defined as
    I(p, n) = -(p/(p+n))·log2(p/(p+n)) -
    (n/(p+n))·log2(n/(p+n))

88
Information Gain in Decision Tree Induction
  • Assume that using attribute A a set S will be
    partitioned into sets {S1, S2, ..., Sv}
  • If Si contains pi examples of P and ni examples
    of N, the entropy, or the expected information
    needed to classify objects in all subtrees Si, is
    E(A) = Σi ((pi + ni)/(p + n))·I(pi, ni)
  • The encoding information that would be gained by
    branching on A:
    Gain(A) = I(p, n) - E(A)

89
Attribute Selection by Information Gain
Computation
  • Class P: buys_computer = "yes"
  • Class N: buys_computer = "no"
  • I(p, n) = I(9, 5) = 0.940
  • Compute the entropy for age:
    E(age) = (5/14)·I(2,3) + (4/14)·I(4,0) +
    (5/14)·I(3,2) = 0.694
  • Hence Gain(age) = I(9, 5) - E(age) = 0.246
  • Similarly, compute the gain for the remaining
    attributes: Gain(income) = 0.029, Gain(student) =
    0.151, Gain(credit_rating) = 0.048
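These numbers can be reproduced with a few lines; the per-partition class counts for age, (2,3), (4,0) and (3,2), are those of the standard 14-sample training set used above:

```python
from math import log2

def info(*counts):
    # I(s1, ..., sm) = -sum (si/s) * log2(si/s), skipping empty classes.
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

i_total = info(9, 5)                       # I(9, 5) = 0.940
# age partitions: <=30 -> (2 yes, 3 no), 31..40 -> (4, 0), >40 -> (3, 2)
parts = [(2, 3), (4, 0), (3, 2)]
e_age = sum((p + n) / 14 * info(p, n) for p, n in parts)   # E(age) = 0.694
gain_age = i_total - e_age                 # Gain(age) = 0.246
print(round(i_total, 3), round(e_age, 3), round(gain_age, 3))
```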

90
Gini Index (IBM IntelligentMiner)
  • If a data set T contains examples from n classes,
    the gini index, gini(T), is defined as
    gini(T) = 1 - Σj pj²
  • where pj is the relative frequency of class j
    in T.
  • If a data set T is split into two subsets T1 and
    T2 with sizes N1 and N2 respectively, the gini
    index of the split data is defined as
    gini_split(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2)
  • The attribute that provides the smallest
    gini_split(T) is chosen to split the node (need
    to enumerate all possible splitting points for
    each attribute).

91
Extracting Classification Rules from Trees
  • Represent the knowledge in the form of IF-THEN
    rules
  • One rule is created for each path from the root
    to a leaf
  • Each attribute-value pair along a path forms a
    conjunction
  • The leaf node holds the class prediction
  • Rules are easier for humans to understand
  • Example:
  • IF age = "<30" AND student = "no" THEN
    buys_computer = "no"
  • IF age = "<30" AND student = "yes" THEN
    buys_computer = "yes"
  • IF age = "31..40" THEN buys_computer = "yes"
  • IF age = ">40" AND credit_rating = "excellent"
    THEN buys_computer = "yes"
  • IF age = ">40" AND credit_rating = "fair" THEN
    buys_computer = "no"
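One rule per root-to-leaf path can be sketched as a small recursive tree walk; the nested-dict tree encoding is hypothetical, with branch labels taken from the rules above:

```python
def extract_rules(tree, path=()):
    # A leaf is a bare class label; a node is {attribute: {value: subtree}}.
    if not isinstance(tree, dict):
        cond = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {cond} THEN buys_computer = {tree}"]
    attr, branches = next(iter(tree.items()))
    rules = []
    for value, subtree in branches.items():
        rules += extract_rules(subtree, path + ((attr, value),))
    return rules

tree = {"age": {"<30": {"student": {"no": "no", "yes": "yes"}},
                "31..40": "yes",
                ">40": {"credit_rating": {"excellent": "yes", "fair": "no"}}}}
for r in extract_rules(tree):
    print(r)
```

This emits exactly one IF-THEN rule per path, five in total for this tree.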

92
Avoid Overfitting in Classification
  • The generated tree may overfit the training data
  • Too many branches; some may reflect anomalies due
    to noise or outliers
  • The result is poor accuracy for unseen samples
  • Two approaches to avoid overfitting
  • Prepruning: halt tree construction early; do not
    split a node if this would result in the goodness
    measure falling below a threshold
  • Difficult to choose an appropriate threshold
  • Postpruning: remove branches from a "fully grown"
    tree, yielding a sequence of progressively pruned
    trees
  • Use a set of data different from the training
    data to decide which is the best pruned tree

93
Approaches to Determine the Final Tree Size
  • Separate training (2/3) and testing (1/3) sets
  • Use cross validation, e.g., 10-fold cross
    validation
  • Use all the data for training
  • but apply a statistical test (e.g., chi-square)
    to estimate whether expanding or pruning a node
    may improve the entire distribution
  • Use minimum description length (MDL) principle
  • halting growth of the tree when the encoding is
    minimized

94
Scalable Decision Tree Induction Methods in Data
Mining Studies
  • SLIQ (EDBT'96, Mehta et al.)
  • builds an index for each attribute; only the
    class list and the current attribute list reside
    in memory
  • SPRINT (VLDB'96, J. Shafer et al.)
  • constructs an attribute list data structure
  • PUBLIC (VLDB'98, Rastogi & Shim)
  • integrates tree splitting and tree pruning: stop
    growing the tree earlier
  • RainForest (VLDB'98, Gehrke, Ramakrishnan &
    Ganti)
  • separates the scalability aspects from the
    criteria that determine the quality of the tree
  • builds an AVC-list (attribute, value, class label)

95
Bayesian Theorem
  • Given training data D, the posterior probability
    of a hypothesis h, P(h|D), follows Bayes'
    theorem:
    P(h|D) = P(D|h)·P(h) / P(D)
  • MAP (maximum a posteriori) hypothesis:
    h_MAP = argmax_h P(h|D) = argmax_h P(D|h)·P(h)
  • Practical difficulty: requires initial knowledge
    of many probabilities, significant computational
    cost

96
Naïve Bayes Classifier (I)
  • A simplified assumption: attributes are
    conditionally independent
  • Greatly reduces the computation cost; only count
    the class distribution.

97
Naive Bayesian Classifier (II)
  • Given a training set, we can compute the
    probabilities

98
Bayesian classification
  • The classification problem may be formalized
    using a-posteriori probabilities
  • P(C|X): prob. that the sample tuple
    X = <x1, ..., xk> is of class C
  • E.g., P(class=N | outlook=sunny, windy=true, ...)
  • Idea: assign to sample X the class label C such
    that P(C|X) is maximal

99
Estimating a-posteriori probabilities
  • Bayes' theorem
  • P(C|X) = P(X|C) P(C) / P(X)
  • P(X) is constant for all classes
  • P(C) = relative freq of class C samples
  • The C such that P(C|X) is maximum is the C such
    that P(X|C) P(C) is maximum
  • Problem: computing P(X|C) directly is infeasible!

100
Naïve Bayesian Classification
  • Naïve assumption: attribute independence
  • P(x1, ..., xk|C) = P(x1|C) ... P(xk|C)
  • If the i-th attribute is categorical, P(xi|C) is
    estimated as the relative freq of samples having
    value xi as the i-th attribute in class C
  • If the i-th attribute is continuous, P(xi|C) is
    estimated via a Gaussian density function
  • Computationally easy in both cases

101
Play-tennis example estimating P(xiC)
outlook
P(sunny|p) = 2/9      P(sunny|n) = 3/5
P(overcast|p) = 4/9   P(overcast|n) = 0
P(rain|p) = 3/9       P(rain|n) = 2/5
temperature
P(hot|p) = 2/9        P(hot|n) = 2/5
P(mild|p) = 4/9       P(mild|n) = 2/5
P(cool|p) = 3/9       P(cool|n) = 1/5
humidity
P(high|p) = 3/9       P(high|n) = 4/5
P(normal|p) = 6/9     P(normal|n) = 1/5
windy
P(true|p) = 3/9       P(true|n) = 3/5
P(false|p) = 6/9      P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
102
Play-tennis example classifying X
  • An unseen sample X = <rain, hot, high, false>
  • P(X|p) P(p) =
    P(rain|p) P(hot|p) P(high|p) P(false|p) P(p)
    = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
  • P(X|n) P(n) =
    P(rain|n) P(hot|n) P(high|n) P(false|n) P(n)
    = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
  • Sample X is classified in class n (don't play)
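The whole computation fits in a few lines of Python. This is a minimal sketch with illustrative helper names, assuming the standard 14-tuple play-tennis training set that yields the per-attribute probabilities shown earlier:

```python
from collections import defaultdict

# Standard play-tennis training set: (outlook, temperature, humidity, windy) -> class
data = [
    (('sunny','hot','high','false'),'n'),    (('sunny','hot','high','true'),'n'),
    (('overcast','hot','high','false'),'p'), (('rain','mild','high','false'),'p'),
    (('rain','cool','normal','false'),'p'),  (('rain','cool','normal','true'),'n'),
    (('overcast','cool','normal','true'),'p'), (('sunny','mild','high','false'),'n'),
    (('sunny','cool','normal','false'),'p'), (('rain','mild','normal','false'),'p'),
    (('sunny','mild','normal','true'),'p'),  (('overcast','mild','high','true'),'p'),
    (('overcast','hot','normal','false'),'p'), (('rain','mild','high','true'),'n'),
]

def train(data):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_count = defaultdict(int)
    attr_count = defaultdict(int)  # key: (class, attribute_index, value)
    for x, c in data:
        class_count[c] += 1
        for i, v in enumerate(x):
            attr_count[(c, i, v)] += 1
    return class_count, attr_count

def score(x, c, class_count, attr_count, n):
    """P(X|c) * P(c) under the naive attribute-independence assumption."""
    p = class_count[c] / n
    for i, v in enumerate(x):
        p *= attr_count[(c, i, v)] / class_count[c]
    return p

class_count, attr_count = train(data)
x = ('rain', 'hot', 'high', 'false')
scores = {c: score(x, c, class_count, attr_count, len(data)) for c in ('p', 'n')}
# The class with the larger score wins: here n (don't play).
```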

103
Association-Based Classification
  • Several methods for association-based
    classification
  • ARCS: quantitative association mining and
    clustering of association rules (Lent et al., '97)
  • It beats C4.5 in (mainly) scalability and also
    accuracy
  • Associative classification (Liu et al., '98)
  • It mines high-support, high-confidence rules of
    the form cond_set => y, where y is a class label
  • CAEP (Classification by aggregating emerging
    patterns) (Dong et al., '99)
  • Emerging patterns (EPs): itemsets whose support
    increases significantly from one class to
    another
  • Mine EPs based on minimum support and growth rate

104
What Is Prediction?
  • Prediction is similar to classification
  • First, construct a model
  • Second, use model to predict unknown value
  • Major method for prediction is regression
  • Linear and multiple regression
  • Non-linear regression
  • Prediction is different from classification
  • Classification predicts a categorical class
    label
  • Prediction models continuous-valued functions

105
Regression Analysis and Log-Linear Models in
Prediction
  • Linear regression: Y = α + β X
  • Two parameters, α and β, specify the line and
    are estimated from the data at hand
  • using the least-squares criterion on the known
    values Y1, Y2, ..., X1, X2, ...
  • Multiple regression: Y = b0 + b1 X1 + b2 X2
  • Many nonlinear functions can be transformed into
    the above
  • Log-linear models
  • The multi-way table of joint probabilities is
    approximated by a product of lower-order tables
  • Probability: p(a, b, c, d) = αab βac χad δbcd
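The least-squares criterion for the simple linear case can be sketched in pure Python (illustrative helper name, not from the slides):

```python
def linear_fit(xs, ys):
    """Least-squares estimates of alpha (intercept) and beta (slope) for Y = alpha + beta*X."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # beta = covariance(X, Y) / variance(X); alpha shifts the line through the means.
    beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    alpha = my - beta * mx
    return alpha, beta
```

For example, points lying exactly on Y = 1 + 2X recover alpha = 1 and beta = 2.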

106
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to objects in other clusters
  • Cluster analysis
  • Grouping a set of data objects into clusters
  • Clustering is unsupervised classification: no
    predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

107
General Applications of Clustering
  • Pattern Recognition
  • Spatial Data Analysis
  • create thematic maps in GIS by clustering feature
    spaces
  • detect spatial clusters and explain them in
    spatial data mining
  • Image Processing
  • Economic Science (especially market research)
  • WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar
    access patterns

108
Examples of Clustering Applications
  • Marketing: help marketers discover distinct
    groups in their customer bases, and then use this
    knowledge to develop targeted marketing programs
  • Land use: identification of areas of similar land
    use in an earth observation database
  • Insurance: identifying groups of motor insurance
    policy holders with a high average claim cost
  • City planning: identifying groups of houses
    according to their house type, value, and
    geographical location
  • Earthquake studies: observed earthquake
    epicenters should be clustered along continent
    faults

109
What Is Good Clustering?
  • A good clustering method will produce high
    quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its implementation.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns.

110
Types of Data in Cluster Analysis
  • Data matrix
  • Dissimilarity matrix

111
Measure the Quality of Clustering
  • Dissimilarity/Similarity metric: similarity is
    expressed in terms of a distance function,
    typically a metric d(i, j)
  • There is a separate quality function that
    measures the goodness of a cluster.
  • The definitions of distance functions are usually
    very different for interval-scaled, boolean,
    categorical, ordinal and ratio variables.
  • Weights should be associated with different
    variables based on applications and data
    semantics.
  • It is hard to define "similar enough" or "good
    enough"
  • the answer is typically highly subjective

112
Similarity and Dissimilarity Between Objects
  • Distances are normally used to measure the
    similarity or dissimilarity between two data
    objects
  • A popular one is the Minkowski distance:
    d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ...
    + |xip - xjp|^q)^(1/q)
  • where i = (xi1, xi2, ..., xip) and j = (xj1, xj2,
    ..., xjp) are two p-dimensional data objects, and
    q is a positive integer
  • If q = 1, d is the Manhattan distance

113
Similarity and Dissimilarity Between Objects
  • If q = 2, d is the Euclidean distance
  • Properties
  • d(i,j) >= 0
  • d(i,i) = 0
  • d(i,j) = d(j,i)
  • d(i,j) <= d(i,k) + d(k,j)
  • One can also use weighted distance, parametric
    Pearson product-moment correlation, or other
    dissimilarity measures
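The Minkowski distance translates directly into code (illustrative helper name):

```python
def minkowski(i, j, q):
    """Minkowski distance between two p-dimensional points.

    q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
    """
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)
```

For the classic 3-4-5 triangle, the Euclidean distance is 5 while the Manhattan distance is 7.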

114
Binary Variables
  • A contingency table for binary data: for objects
    i and j, let a = # of attributes where both are 1,
    b = # where i is 1 and j is 0, c = # where i is 0
    and j is 1, d = # where both are 0
  • Simple matching coefficient (invariant, if the
    binary variable is symmetric):
    d(i, j) = (b + c) / (a + b + c + d)
  • Jaccard coefficient (noninvariant if the binary
    variable is asymmetric):
    d(i, j) = (b + c) / (a + b + c)
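Both coefficients can be computed from raw 0/1 vectors by counting the four contingency cells directly (illustrative helper, not from the slides):

```python
def binary_dissim(x, y, jaccard=False):
    """Dissimilarity between two equal-length binary (0/1) vectors.

    Contingency counts: a = both 1, b = x-only 1, c = y-only 1, d = both 0.
    """
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if jaccard:
        # Asymmetric case: negative matches (d) are ignored.
        return (b + c) / (a + b + c)
    # Symmetric case: simple matching coefficient.
    return (b + c) / (a + b + c + d)
```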
115
Dissimilarity between Binary Variables
  • Example
  • gender is a symmetric attribute
  • the remaining attributes are asymmetric binary
  • let the values Y and P be set to 1, and the value
    N be set to 0

116
Major Clustering Methods
  • Partitioning algorithms: construct various
    partitions and then evaluate them by some
    criterion
  • Hierarchical algorithms: create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
  • Density-based: based on connectivity and density
    functions
  • Grid-based: based on a multiple-level granularity
    structure
  • Model-based: a model is hypothesized for each of
    the clusters, and the idea is to find the best
    fit of the data to the hypothesized models

117
Partitioning Algorithms Basic Concept
  • Partitioning method: construct a partition of a
    database D of n objects into a set of k clusters
  • Given k, find a partition into k clusters that
    optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all
    partitions
  • Heuristic methods: the k-means and k-medoids
    algorithms
  • k-means (MacQueen, '67): each cluster is
    represented by the center of the cluster
  • k-medoids or PAM (Partitioning Around Medoids)
    (Kaufman & Rousseeuw, '87): each cluster is
    represented by one of the objects in the cluster

118
The K-Means Clustering Method
  • Given k, the k-means algorithm is implemented in
    4 steps
  • Partition objects into k nonempty subsets
  • Compute seed points as the centroids of the
    clusters of the current partition. The centroid
    is the center (mean point) of the cluster.
  • Assign each object to the cluster with the
    nearest seed point.
  • Go back to Step 2; stop when assignments no
    longer change.
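The 4 steps translate almost line for line into Python. This is a minimal sketch on 2-D points with an illustrative helper name, not a production implementation:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points (tuples): assign, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: seed with k distinct objects
    for _ in range(iters):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centroids[c][0]) ** 2
                                + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Step 2: recompute centroids as the mean point of each cluster
        new = [(sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
               if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        # Step 4: stop when centroids (hence assignments) no longer change
        if new == centroids:
            break
        centroids = new
    return centroids, clusters
```

On two well-separated groups of points the loop converges in a few iterations to one centroid per group.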

119
The K-Means Clustering Method
  • Example

120
Comments on the K-Means Method
  • Strength
  • Relatively efficient: O(tkn), where n is the
    number of objects, k the number of clusters, and
    t the number of iterations. Normally, k, t << n.
  • Often terminates at a local optimum. The global
    optimum may be found using techniques such as
    deterministic annealing and genetic algorithms
  • Weakness
  • Applicable only when the mean is defined; what
    about categorical data?
  • Need to specify k, the number of clusters, in
    advance
  • Unable to handle noisy data and outliers
  • Not suitable for discovering clusters with
    non-convex shapes