Mid 2 revision and Asssociation Rules (II) - PowerPoint PPT Presentation

1 / 98
About This Presentation
Title:

Mid 2 revision and Asssociation Rules (II)

Description:

Mid 2 revision and Asssociation Rules (II) Prof. Sin-Min Lee Department of Computer Science Example Data Instance Natural Join and Intersection Natural join: special ... – PowerPoint PPT presentation

Number of Views:149
Avg rating:3.0/5.0
Slides: 99
Provided by: Lee149
Category:

less

Transcript and Presenter's Notes

Title: Mid 2 revision and Asssociation Rules (II)


1
Mid 2 revision and Asssociation Rules (II)
Lecture 16
  • Prof. Sin-Min Lee
  • Department of Computer Science

2
Example Data Instance
STUDENT
COURSE
Takes
cid subj sem
550-0103 DB F03
700-1003 AI S03
501-0103 Arch F03
sid name
1 Jill
2 Qun
3 Nitin
4 Marty
sid exp-grade cid
1 A 550-0103
1 A 700-1003
3 A 700-1003
3 C 500-0103
4 C 500-0103
PROFESSOR
Teaches
fid name
1 Ives
2 Saul
8 Roth
fid cid
1 550-0103
2 700-1003
8 501-0103
3
Natural Join and Intersection
  • Natural join special case of join where ? is
    implicit attributes with same name must be
    equal
  • STUDENT ? Takes STUDENT ?STUDENT.sid
    Takes.sid Takes
  • Intersection as with set operations, derivable
    from difference

A ? B
B-A
A-B
A B
4
Division
  • A somewhat messy operation that can be expressed
    in terms of the operations we have already
    defined
  • Used to express queries such as The fid's of
    faculty who have taught all subjects
  • Paraphrased The fids of professors for which
    there does not exist a subject that they havent
    taught

5
Division Using Our Existing Operators
  • All possible teaching assignments Allpairs
  • NotTaught, all (fid,subj) pairs for which
    professor fid has not taught subj
  • Answer is all faculty not in NotTaught

?fid,subj (PROFESSORx ?subj(COURSE))
Allpairs - ?fid,subj(Teaches ? COURSE)
?fid(PROFESSOR) - ?fid(NotTaught)
?fid(PROFESSOR) - ?fid(
?fid,subj (PROFESSOR x ?subj(COURSE)) -
?fid,subj(Teaches ? COURSE))
6
Division R1 R2
  • Requirement schema(R1) ¾ schema(R2)
  • Result schema schema(R1) schema(R2)
  • Professors who have taught all courses
  • What about Courses that have been taught by all
    faculty?

?fid (?fid,subj(Teaches ? COURSE) ?subj(COURSE))
7
Division
  • Goal Produce the tuples in one relation, r, that
    match all tuples in another relation, s
  • r (A1, An, B1, Bm)
  • s (B1 Bm)
  • r/s, with attributes A1, An, is the set of all
    tuples ltagt such that for every tuple ltbgt in s,
    lta,bgt is in r
  • Can be expressed in terms of projection, set
    difference, and cross-product

8
Division (contd)
9
Division - Example
  • List the Ids of students who have passed all
    courses that were taught in spring 2000
  • Numerator
  • StudId and CrsCode for every course passed by
    every student
  • ?StudId, CrsCode (?Grade? F (Transcript) )
  • Denominator
  • CrsCode of all courses taught in spring 2000
  • ?CrsCode (?SemesterS2000 (Teaching) )
  • Result is numerator/denominator

10
Relational Calculus
  • Important features
  • Declarative formal query languages for relational
    model
  • Based on the branch mathematical logic known as
    predicate calculus
  • Two types of RC
  • 1) tuple relational calculus
  • 2) domain relational calculus
  • A single statement can be used to perform a query

11
Tuple Relational Calculus
  • based on specifying a number of tuple variables
  • a tuple variable refers to any tuple

12
Generic Form
  • t COND (t)
  • where
  • t is a tuple variable and
  • COND(t) is Boolean expression involving t

13
Simple example 1
  • To find all employees whose salary is greater
    than 50,000
  • t EMPLOYEE(t) and t.Salarygt5000
  • where
  • EMPLOYEE(t) specifies the range of tuple variable
    t
  • The above operation selects all the attributes

14
Simple example 2
  • To find only the names of employees whose salary
    is greater than 50,000
  • t.FNAME, t.NAME EMPLOYEE(t) and t.Salarygt5000
  • The above is equivalent to
  • SELECT T.FNAME, T.LNAME
  • FROM EMPLOYEE T
  • WHERE T.SALARY gt 5000

15
Elements of a tuple calculus
  • In general, we need to specify the following in a
    tuple calculus expression
  • Range Relation (I.e, R(t)) FROM
  • Selected combination WHERE
  • Requested attributes SELECT

16
More ExampleQ0
  • Retrieve the birthrate and address of the
    employee(s) whose name is John B. Smith
  • t.BDATE, t.ADDRESS EMPLOYEE(t) AND
    t.FNAMEJohn AND t.MINITB AND
    t.LNAMESmith

17
Formal Specification of tuple Relational Calculus
  • A general format
  • t1.A1, t2.A2,,tn.An COND ( t1 ,t2 ,, tn,
    tn1, tn2,,tnm)
  • where
  • t1,,tnm are tuple var
  • Ai attribute?R(ti)
  • COND (formula)
  • Where COND corresponds to statement about the
    world, which can be True or False

18
Elements of formula
  • A formula is made of Predicate Calculus atoms
  • an atom of the from R(ti)
  • ti.A op tj.B op?, lt,gt,..
  • F1 And F2 where F1 and F2 are formulas
  • F1 OR F2
  • Not (F1)
  • F(?t) (F) or F (?t) (F)
  • ? Y friends (Y, John)
  • ?X likes(X, ICE_CREAM)

19
Example Queries Using the Existential Quantifier
  • Retrieve the name and address of all employees
    who work for the Research department
  • t.FNAME, t.LNAME, t.ADDRESS EMPLOYEE(t) AND (?
    d) (DEPARTMENT (d) AND d.DNAMEResearch AND
    d.DNUMBERt.DNO)

20
More Example
  • For every project located in Stafford, retrieve
    the project number, the controlling department
    number, and the last name, birthrate, and address
    of the manger of that department.

21
Cont.
  • p.PNUMBER,p.DNUM,m.LNAME,m.BDATE,
    m.ADDRESSPROJECT(p) and EMPLOYEE(M) and
    P.PLOCATIONStafford and (? d) (DEPARTMENT(D)
    AND P.DNUMd.DNUMBER and d.MGRSSNm.SSN))

22
Logical Equivalences
  • There are two logical equivalences that will be
    heavily used
  • p?q ? ?p ? q (Whenever p is true, q must also
    be true.)
  • ?x. p(x) ? ??x. ?p(x) (p is true for all x)
  • The second can be a lot easier to check!

23
Normalization
  • Review on Keys
  • superkey a set of attributes which will uniquely
    identify each tuple in a relation
  • candidate key a minimal superkey
  • primary key a chosen candidate key
  • secondary key all the rest of candiate keys
  • prime attribute an attribute that is a part of a
    candidate key (key column)
  • nonprime attribute a nonkey column

24
Normalization
  • Functional Dependency Type by Keys
  • whole (candidate) key ? nonprime attribute
    full FD (no violation)
  • partial key ? nonprime attribute partial FD
    (violation of 2NF)
  • nonprime attribute ? nonprime attribute
    transitive FD (violation of 3NF)
  • not a whole key ? prime attribute violation of
    BCNF

25
Functional Dependencies
  • Let R be a relation schema
  • ? ? R and ? ? R
  • The functional dependency
  • ? ? ?holds on R iff for any legal relations
    r(R), whenever two tuples t1 and t2 of r have
    same values for ?, they have same values for ?.
  • t1? t2 ? ? t1? t2 ?
  • On this instance, A ? B does NOT hold, but B ? A
    does hold.

A B
  • 4
  • 1 5
  • 3 7

26
1. Closure
  • Given a set of functional dependencies, F, its
    closure, F , is all FDs that are implied by FDs
    in F.
  • e.g. If A ? B, and B ? C,
  • then clearly A ? C

27
Armstrongs Axioms
  • We can find F by applying Armstrongs Axioms
  • if ? ? ?, then ? ? ?
    (reflexivity)
  • if ? ? ?, then ? ? ? ? ?
    (augmentation)
  • if ? ? ?, and ? ? ?, then ? ? ? (transitivity)
  • These rules are
  • sound (generate only functional dependencies that
    actually hold) and
  • complete (generate all functional dependencies
    that hold).

28
Additional rules
  • If ? ? ? and ? ? ?, then ? ? ? ? (union)
  • If ? ? ? ?, then ? ? ? and ? ? ? (decomposition)
  • If ? ? ? and ? ? ? ?, then ? ? ? ?
    (pseudotransitivity)
  • The above rules can be inferred from Armstrongs
    axioms.

29
Example
  • R (A, B, C, G, H, I)F A ? B A ? C CG
    ? H CG ? I B ? H
  • Some members of F
  • A ? H
  • by transitivity from A ? B and B ? H
  • AG ? I
  • by augmenting A ? C with G, to get AG ? CG
    and then transitivity with CG ? I
  • CG ? HI
  • by augmenting CG ? I to infer CG ? CGI,
  • and augmenting of CG ? H to infer CGI ? HI,
  • and then transitivity

30
2. Closure of an attribute set
  • Given a set of attributes A and a set of FDs F,
    closure of A under F is the set of all attributes
    implied by A
  • In other words, the largest B such that
  • A ? B
  • Redefining super keys
  • The closure of a super key is the entire
    relation schema
  • Redefining candidate keys
  • 1. It is a super key
  • 2. No subset of it is a super key

31
Computing the closure for A
  • Simple algorithm
  • 1. Start with B A.
  • 2. Go over all functional dependencies, ? ? ? ,
    in F
  • 3. If ? ? B, then
  • Add ? to B
  • 4. Repeat till B changes

32
Example
  • R (A, B, C, G, H, I)F A ? B A ? C CG
    ? H CG ? I B ? H
  • (AG) ?
  • 1. result AG
  • 2. result ABCG (A ? C and A ? B)
  • 3. result ABCGH (CG ? H and CG ? AGBC)
  • 4. result ABCGHI (CG ? I and CG ? AGBCH
  • Is (AG) a candidate key ?
  • 1. It is a super key.
  • 2. (A) BC, (G) G.
  • YES.

33
Support
  • Simplest question find sets of items that appear
    frequently in the baskets.
  • Support for itemset I the number of baskets
    containing all items in I.
  • Given a support threshold s, sets of items that
    appear in gt s baskets are called frequent
    itemsets.

34
Example
  • Itemsmilk, coke, pepsi, beer, juice.
  • Support 3 baskets.
  • B1 m, c, b B2 m, p, j B3 m, b
  • B4 c, j B5 m, p, b B6 m,
    c, b, j
  • B7 c, b, j B8 b, c
  • Frequent itemsets m, c, b, j, m,
    b, c, b, j, c.

35
Association Rules
  • Association rule R Itemset1 gt Itemset2
  • Itemset1, 2 are disjoint and Itemset2 is
    non-empty
  • meaning if transaction includes Itemset1 then
    it also has Itemset2
  • Examples
  • A,B gt E,C
  • A gt B,C

36
Example
  • B1 m, c, b B2 m, p, j
  • B3 m, b B4 c, j
  • B5 m, p, b B6 m, c, b, j
  • B7 c, b, j B8 b, c
  • An association rule m, b ? c.
  • Confidence 2/4 50.

_ _

37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
From Frequent Itemsets to Association Rules
  • Q Given frequent set A,B,E, what are possible
    association rules?
  • A gt B, E
  • A, B gt E
  • A, E gt B
  • B gt A, E
  • B, E gt A
  • E gt A, B
  • __ gt A,B,E (empty rule), or true gt A,B,E

42
Classification vs Association Rules
  • Classification Rules
  • Focus on one target field
  • Specify class in all cases
  • Measures Accuracy
  • Association Rules
  • Many target fields
  • Applicable in some cases
  • Measures Support, Confidence, Lift

43
  • Apriori Algorithm (1997)
  • Let I a,b,c, be a set of all items in the
    domain
  • Let T S S ? I be a bag of all transaction
    records of item sets
  • Let support(S) ? A A ? T ? S ? A
  • Let L1 a a ? I ? support(a) ?
    minSupport
  • ?k (k gt 1 ? Lk-1 ? ?) Let
  • Lk Si ? Sj (Si ? Lk-1) ? (Sj ? Lk-1) ?
  • ( Si Sj 1 ) ? (
    Sj Si 1) ?
  • ( ?S ((S ? Si ? Sj) ?
    (S k-1)) ? S ? Lk-1 ) ?
  • ( support(Si ? Sj) ?
    minSupport )
  • The set of all frequent item sets is given by
  • L ?Lk
  • and the set of all association rules is
    given by
  • R A ? C A ? ?(Lk) ? (C Lk A) ? (A ?
    ?) ? (C ? ?)
  • support(Lk) /
    support(A) ? minConfidence

44
Rule Support and Confidence
  • Suppose R I gt J is an association rule
  • sup (R) sup (I ? J) is the support count
  • support of itemset I ? J (I or J)
  • conf (R) sup(J) / sup(R) is the confidence of R
  • fraction of transactions with I ? J that have J
  • Association rules with minimum support and count
    are sometimes called strong rules

45
Association Rules Example
  • Q Given frequent set A,B,E, what association
    rules have minsup 2 and minconf 50 ?
  • A, B gt E conf2/4 50
  • A, E gt B conf2/2 100
  • B, E gt A conf2/2 100
  • E gt A, B conf2/2 100
  • Dont qualify
  • A gtB, E conf2/6 33lt 50
  • B gt A, E conf2/7 28 lt 50
  • __ gt A,B,E conf 2/9 22 lt 50

46
Find Strong Association Rules
  • A rule has the parameters minsup and minconf
  • sup(R) gt minsup and conf (R) gt minconf
  • Problem
  • Find all association rules with given minsup and
    minconf
  • First, find all frequent itemsets

47
Finding Frequent Itemsets
  • Start by finding one-item sets (easy)
  • Q How?
  • A Simply count the frequencies of all items

48
Finding itemsets next level
  • Apriori algorithm (Agrawal Srikant)
  • Idea use one-item sets to generate two-item
    sets, two-item sets to generate three-item sets,
  • If (A B) is a frequent item set, then (A) and (B)
    have to be frequent item sets as well!
  • In general if X is frequent k-item set, then all
    (k-1)-item subsets of X are also frequent
  • Compute k-item set by merging (k-1)-item sets

49
(No Transcript)
50
Finding Association Rules
  • A typical question find all association rules
    with support s and confidence c.
  • Note support of an association rule is the
    support of the set of items it mentions.
  • Hard part finding the high-support (frequent )
    itemsets.
  • Checking the confidence of association rules
    involving those sets is relatively easy.

51
Naïve Algorithm
  • A simple way to find frequent pairs is
  • Read file once, counting in main memory the
    occurrences of each pair.
  • Expand each basket of n items into its n (n
    -1)/2 pairs.
  • Fails if items-squared exceeds main memory.

52
(No Transcript)
53
Filter
Filter
Construct
Construct
C1
L1
C2
L2
C3
First pass
Second pass
54
Agrawal, Srikant 94
Fast Algorithms for Mining Association Rules, by
Rakesh Agrawal and Ramakrishan Sikant, IBM
Almaden Research Center
55
(No Transcript)
56
(No Transcript)
57
(No Transcript)
58
(No Transcript)
59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
(No Transcript)
63
(No Transcript)
64
C1
L1
Database
Items TID
1 3 4 100
2 3 5 200
1 2 3 5 300
2 5 400
Set-of-itemsets TID
1,3,4 100
2,3,5 200
1,2,3,5 300
2,5 400
Support Itemset
2 1
3 2
3 3
3 5
C2
L2
C2
itemset
1 2
1 3
1 5
2 3
2 5
3 5
Set-of-itemsets TID
1 3 100
2 3,2 5 3 5 200
1 2,1 3,1 5, 2 3, 2 5, 3 5 300
2 5 400
Support Itemset
2 1 3
3 2 3
3 2 5
2 3 5
C3
L3
C3
Set-of-itemsets TID
2 3 5 200
2 3 5 300
Support Itemset
2 2 3 5
itemset
2 3 5
65
(No Transcript)
66
(No Transcript)
67
(No Transcript)
68
  • Dynamic Programming Approach
  • Want proof of principle of optimality and
    overlapping subproblems
  • Principle of Optimality
  • The optimal solution to Lk includes the
    optimal solution of Lk-1
  • Proof by contradiction
  • Overlapping Subproblems
  • Lemma of every subset of a frequent item set is a
    frequent item set
  • Proof by contradiction

69
(No Transcript)
70
(No Transcript)
71
(No Transcript)
72
(No Transcript)
73
(No Transcript)
74
(No Transcript)
75
(No Transcript)
76
(No Transcript)
77
(No Transcript)
78
(No Transcript)
79
(No Transcript)
80
(No Transcript)
81
The Apriori Algorithm Example
  • Consider a database, D , consisting of 9
    transactions.
  • Suppose min. support count required is 2 (i.e.
    min_sup 2/9 22 )
  • Let minimum confidence required is 70.
  • We have to first find out the frequent itemset
    using Apriori algorithm.
  • Then, Association rules will be generated using
    min. support min. confidence.

TID List of Items
T100 I1, I2, I5
T100 I2, I4
T100 I2, I3
T100 I1, I2, I4
T100 I1, I3
T100 I2, I3
T100 I1, I3
T100 I1, I2 ,I3, I5
T100 I1, I2, I3
82
Step 1 Generating 1-itemset Frequent Pattern
Compare candidate support count with minimum
support count
Scan D for count of each candidate
Itemset Sup.Count
I1 6
I2 7
I3 6
I4 2
I5 2
Itemset Sup.Count
I1 6
I2 7
I3 6
I4 2
I5 2
C1
L1
  • In the first iteration of the algorithm, each
    item is a member of the set of candidate.
  • The set of frequent 1-itemsets, L1 , consists of
    the candidate 1-itemsets satisfying minimum
    support.

83
Step 2 Generating 2-itemset Frequent Pattern
Generate C2 candidates from L1
Compare candidate support count with minimum
support count
Itemset
I1, I2
I1, I3
I1, I4
I1, I5
I2, I3
I2, I4
I2, I5
I3, I4
I3, I5
I4, I5
Itemset Sup. Count
I1, I2 4
I1, I3 4
I1, I4 1
I1, I5 2
I2, I3 4
I2, I4 2
I2, I5 2
I3, I4 0
I3, I5 1
I4, I5 0
Itemset Sup Count
I1, I2 4
I1, I3 4
I1, I5 2
I2, I3 4
I2, I4 2
I2, I5 2
Scan D for count of each candidate
L2
C2
C2
84
Step 2 Generating 2-itemset Frequent Pattern
Cont.
  • To discover the set of frequent 2-itemsets, L2 ,
    the algorithm uses L1 Join L1 to generate a
    candidate set of 2-itemsets, C2.
  • Next, the transactions in D are scanned and the
    support count for each candidate itemset in C2 is
    accumulated (as shown in the middle table).
  • The set of frequent 2-itemsets, L2 , is then
    determined, consisting of those candidate
    2-itemsets in C2 having minimum support.
  • Note We havent used Apriori Property yet.

85
Step 3 Generating 3-itemset Frequent Pattern
Compare candidate support count with min support
count
Itemset Sup. Count
I1, I2, I3 2
I1, I2, I5 2
Itemset Sup Count
I1, I2, I3 2
I1, I2, I5 2
Scan D for count of each candidate
Scan D for count of each candidate
Itemset
I1, I2, I3
I1, I2, I5
C3
L3
C3
  • The generation of the set of candidate
    3-itemsets, C3 , involves use of the Apriori
    Property.
  • In order to find C3, we compute L2 Join L2.
  • C3 L2 Join L2 I1, I2, I3, I1, I2, I5,
    I1, I3, I5, I2, I3, I4, I2, I3, I5, I2,
    I4, I5.
  • Now, Join step is complete and Prune step will
    be used to reduce the size of C3. Prune step
    helps to avoid heavy computation due to large Ck.

86
Step 3 Generating 3-itemset Frequent Pattern
Cont.
  • Based on the Apriori property that all subsets of
    a frequent itemset must also be frequent, we can
    determine that four latter candidates cannot
    possibly be frequent. How ?
  • For example , lets take I1, I2, I3. The 2-item
    subsets of it are I1, I2, I1, I3 I2, I3.
    Since all 2-item subsets of I1, I2, I3 are
    members of L2, We will keep I1, I2, I3 in C3.
  • Lets take another example of I2, I3, I5 which
    shows how the pruning is performed. The 2-item
    subsets are I2, I3, I2, I5 I3,I5.
  • BUT, I3, I5 is not a member of L2 and hence it
    is not frequent violating Apriori Property. Thus
    We will have to remove I2, I3, I5 from C3.
  • Therefore, C3 I1, I2, I3, I1, I2, I5
    after checking for all members of result of Join
    operation for Pruning.
  • Now, the transactions in D are scanned in order
    to determine L3, consisting of those candidates
    3-itemsets in C3 having minimum support.

87
Step 4 Generating 4-itemset Frequent Pattern
  • The algorithm uses L3 Join L3 to generate a
    candidate set of 4-itemsets, C4. Although the
    join results in I1, I2, I3, I5, this itemset
    is pruned since its subset I2, I3, I5 is not
    frequent.
  • Thus, C4 f , and algorithm terminates, having
    found all of the frequent items. This completes
    our Apriori Algorithm.
  • Whats Next ?
  • These frequent itemsets will be used to generate
    strong association rules ( where strong
    association rules satisfy both minimum support
    minimum confidence).

88
Step 5 Generating Association Rules from
Frequent Itemsets
  • Procedure
  • For each frequent itemset l, generate all
    nonempty subsets of l.
  • For every nonempty subset s of l, output the rule
    s ? (l-s) if
  • support_count(l) / support_count(s) gt min_conf
    where min_conf is minimum confidence threshold.
  • Back To Example
  • We had L I1, I2, I3, I4, I5,
    I1,I2, I1,I3, I1,I5, I2,I3, I2,I4,
    I2,I5, I1,I2,I3, I1,I2,I5.
  • Lets take l I1,I2,I5.
  • Its all nonempty subsets are I1,I2, I1,I5,
    I2,I5, I1, I2, I5.

89
Step 5 Generating Association Rules from
Frequent Itemsets Cont.
  • Let minimum confidence threshold is , say 70.
  • The resulting association rules are shown below,
    each listed with its confidence.
  • R1 I1 I2 ? I5
  • Confidence scI1,I2,I5/scI1,I2 2/4 50
  • R1 is Rejected.
  • R2 I1 I5 ? I2
  • Confidence scI1,I2,I5/scI1,I5 2/2 100
  • R2 is Selected.
  • R3 I2 I5 ? I1
  • Confidence scI1,I2,I5/scI2,I5 2/2 100
  • R3 is Selected.

90
Step 5 Generating Association Rules from
Frequent Itemsets Cont.
  • R4 I1 ? I2 I5
  • Confidence scI1,I2,I5/scI1 2/6 33
  • R4 is Rejected.
  • R5 I2 ? I1 I5
  • Confidence scI1,I2,I5/I2 2/7 29
  • R5 is Rejected.
  • R6 I5 ? I1 I2
  • Confidence scI1,I2,I5/ I5 2/2 100
  • R6 is Selected.
  • In this way, We have found three strong
    association rules.

91
Example
Simple algorithm
ABCDE
Large itemset
ACDE?B
ABCE?D
Rules with minsup
CDE?AB
BCE?AD
ABE?CD
ADE?BC
ACD?BE
ACE?BD
ACE?BD
ABC?ED
ABCDE
Fast algorithm
ACDE?B
ABCE?D
ACE?BD
92
(No Transcript)
93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
(No Transcript)
98
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com