Parallel Mining of Association Rules

Slides: 44
Provided by: tinghi
1
Parallel Mining of Association Rules
Rakesh Agrawal, John C. Shafer
Presented by Ting Hian Ong, Xu XingJian
http://www.comp.nus.edu.sg/tinghian/cs6203
2
  • Introduction
  • Overview of Serial Algorithm
  • Parallel Algorithms
  • Count Distribution (CD)
  • Data Distribution (DD)
  • Candidate Distribution (CaD)
  • Parallel Rule Generation
  • Performance Sensitivity Analysis
  • Conclusions
  • Q & A

3
Ultra-large databases
Possibility of faster access and manipulation
  • DATA MINING
  • The efficient discovery of previously unknown
    patterns in large databases
  • → The need for fast algorithms for discovering
    association rules

4
  • Why Parallel Algorithms?
  • Databases (raw transaction data instead of
    samples) to be mined are often very large - in GB
    and TB
  • The need for fast algorithms for discovering
    association rules
  • Transaction databases have to be scanned
    repeatedly when discovering the frequent itemsets
  • Requires a lot of computation power, memory and
    I/O, which can only be provided by parallel computers
    using parallel algorithms

5
  • Three parallel algorithms introduced
  • Count Distribution (CD)
  • Data Distribution (DD)
  • Candidate Distribution (CaD)
  • Based on the serial algorithm Apriori

6
  • Association Rules
  • The problem of mining association rules is to
    generate all association rules that have certain
    user-specified minimum support and confidence.
  • Problem Decomposition
  • Find all sets of items whose support is greater
    than the user-specified minimum support (frequent
    itemsets)
  • Use frequent itemsets to generate the desired
    rules

7
Apriori Algorithm
L1 = {frequent 1-itemsets};
k = 2;
while (Lk-1 ≠ ∅) do
begin
    Ck = new candidates of size k generated from Lk-1;
    forall transactions t ∈ D do
        increment the count of all candidates in Ck that are contained in t;
    Lk = all candidates in Ck with minimum support;
    k = k + 1;
end
Answer = ∪k Lk
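The loop above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function name `apriori`, the use of an absolute count threshold `min_count`, and the simple union-based candidate construction (in place of the join on ordered items shown on the next slide) are all assumptions made for brevity.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items to obtain L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    answer = dict(frequent)
    k = 2
    while frequent:  # while L(k-1) is not empty
        prev = set(frequent)
        # Ck: unions of two frequent (k-1)-itemsets that have size k and
        # whose (k-1)-subsets are all frequent (assumed simplification).
        candidates = set()
        for a in prev:
            for b in prev:
                u = a | b
                if len(u) == k and all(
                    frozenset(s) in prev for s in combinations(u, k - 1)
                ):
                    candidates.add(u)
        # Scan the data and count every candidate contained in a transaction.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Lk: candidates that reach the minimum support.
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        answer.update(frequent)
        k += 1
    return answer
```

The returned dictionary plays the role of the final "Answer" union, with the support count kept alongside each itemset so that rule confidences can be computed later without rescanning the data.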
8
Apriori Algorithm: Candidate Generation
Join step:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2,
          p.itemk-1 < q.itemk-1;
Prune step:
    delete all itemsets c ∈ Ck such that some (k-1)-subset of c
    is not in Lk-1
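The join and prune steps can be sketched in Python as below. The function name `apriori_gen` and the representation of itemsets as sorted tuples are assumptions chosen for this illustration.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Itemsets are assumed to be sorted tuples of comparable items.
    """
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    # Join step: pair itemsets that agree on the first k-2 items and
    # differ in the last item, keeping the result in sorted order.
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.append(p + (q[-1],))
    # Prune step: drop any candidate with an infrequent (k-1)-subset.
    return [
        c for c in candidates
        if all(s in prev_set for s in combinations(c, len(c) - 1))
    ]
```

For example, with L3 = {123, 124, 134, 135, 234} the join step produces 1234 and 1345, and the prune step removes 1345 because its subset 145 is not in L3.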
9
  • Three parallel algorithms (CD, DD, CaD) based on
    Apriori
  • Discovering frequent itemsets (1) is much more
    expensive than generating rules (2)
  • Phase 1
  • Each node generates candidate k-itemsets locally
    from the frequent (k-1)-itemsets → how to
    partition?
  • Phase 2
  • Match candidate itemsets against transactions and
    collect the local counts → how to distribute?
  • Phase 3
  • Determine the global counts for itemsets → how
    to find?
  • Find frequent k-itemsets and replicate them in all
    nodes

10
  • Implemented on an IBM POWERparallel System SP2, a
    shared-nothing machine, where each of N
    processors has a private memory and a private
    disk.
  • Data is evenly distributed among the nodes

11
(No Transcript)
12
  • Objective: minimizing communication
  • Techniques
  • Straightforward parallelization of Apriori
  • Carry out redundant duplicate computations in
    parallel to avoid communication
  • Only requires communicating count values (no data
    tuples are exchanged)
  • Processors can scan the local data asynchronously
    in parallel

13
  • Algorithm
  • Pass 1
  • Each processor Pi generates its local candidate
    itemset Ci1 depending on the items present in its
    local data partition Di
  • Develop and Exchange local counts Ci1
  • Develop global support counts C1

14
  • Algorithm
  • Pass k > 1
  • Pi generates the complete Ck using the complete
    Lk-1 created at the end of pass (k-1). Each
    processor has the identical Lk-1, thus generates
    an identical Ck, and puts its count values in a
    common order into a count array
  • Pi makes a pass over data partition Di and
    develops local support counts for candidates in Ck
  • Pi exchanges local Ck counts with all other
    processors to develop global Ck counts. All
    processors must synchronize.
  • Pi computes Lk from Ck
  • Pi independently decides to terminate or continue
    to the next pass
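One pass k > 1 of Count Distribution can be illustrated with a small sequential simulation. This is only a sketch under stated assumptions: the processors are simulated by a loop rather than run in parallel, the function names are hypothetical, and the element-wise sum stands in for the real exchange of count arrays between processors.

```python
def local_counts(candidates, partition):
    """Count array of one processor; candidates are in a common order."""
    return [sum(1 for t in partition if c <= t) for c in candidates]

def count_distribution_pass(candidates, partitions, min_count):
    """Simulate one CD pass over N data partitions.

    Every simulated processor holds the same ordered candidate list and
    scans only its own partition. No transactions are exchanged, only
    the count arrays, which are summed position by position.
    """
    per_processor = [local_counts(candidates, p) for p in partitions]
    global_counts = [sum(column) for column in zip(*per_processor)]
    # Each processor can now compute the same Lk from the global counts.
    return {c: n for c, n in zip(candidates, global_counts) if n >= min_count}
```

Because the candidate list is identical and identically ordered everywhere, position i of every count array refers to the same itemset, which is what lets the exchange carry bare integers.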

15
(No Transcript)
16
  • Disadvantages
  • CD does not exploit the aggregate memory of the
    system
  • Must synchronize and develop global count at the
    end of each pass

17
  • Objective: utilize the aggregate main memory of the
    system effectively
  • Technique
  • Partitions the candidates into disjoint sets,
    which are assigned to different processors. Each
    processor works with the entire dataset but only
    a portion of the candidate set.
  • Each processor counts mutually exclusive
    candidates. On an N-processor configuration, DD
    can count in a single pass a candidate set that
    would require N passes in CD

18
Basic Idea
  • Example: 2 processors
  • Data Distribution only processes a subset of Ck
    on each processor to utilize the aggregate memory
  • Exchange data to develop global counts for Cki
19
  • Algorithm
  • Pass 1: Same as the CD algorithm
  • Pass k > 1
  • Pi generates Ck from Lk-1. It retains only 1/N of
    the itemsets, forming Cik
  • Pi develops support counts for itemsets in Cik
    for ALL transactions (using local data pages and
    data pages received from other processors)
  • At the end of the data pass, Pi calculates Lik
    using local Cik
  • Processors exchange Lik so that every processor
    has the complete Lk for generating Ck+1 for the
    next pass (requires processors to synchronize)
  • Pi can independently decide whether to terminate
    or continue on to the next pass
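A sequential simulation of one Data Distribution pass might look like the sketch below. The function name is hypothetical, the round-robin split of the candidates is an assumed way of giving each processor 1/N of Ck, and concatenating the partitions stands in for the broadcast of local data pages.

```python
def data_distribution_pass(candidates, partitions, min_count):
    """Simulate one DD pass k > 1 over N data partitions."""
    n = len(partitions)
    # Fix a common order so every processor agrees on who owns what.
    ordered = sorted(candidates, key=sorted)
    # Each processor sees ALL transactions: its local pages plus the
    # pages received from every other processor.
    all_transactions = [t for part in partitions for t in part]
    frequent = {}
    for i in range(n):
        own = ordered[i::n]  # processor i keeps 1/N of the candidates
        local_frequent = {}
        for c in own:
            count = sum(1 for t in all_transactions if c <= t)
            if count >= min_count:
                local_frequent[c] = count
        # Exchanging the local results gives everyone the complete Lk.
        frequent.update(local_frequent)
    return frequent
```

The result is the same Lk that Count Distribution would produce, but each simulated processor only ever holds its own share of the candidates in memory, at the price of scanning every transaction.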

20
(Figure: the local Lik of each processor are exchanged to form the complete Lk)
21
Disadvantages: heavy communication. Each
processor must broadcast its local data and
frequent itemsets to all other processors and
synchronize in every pass.
22
  • Problem
  • CD and DD require processors to synchronize at
    the end of each pass
  • Basic Idea: Remove dependence among processors
  • Data dependency: complete transactions are
    required to compute support counts (in CD)
  • Frequent itemset dependency: a global itemset Lk
    is needed during the pruning step of the Apriori
    candidate generation algorithm (in DD)
23
  • Remove Data Dependency
  • Each processor Pi works on Cki, a disjoint subset
    of Ck
  • Pi derives global support counts for Cki from
    local data.
  • Replicate data amongst processors in order to
    achieve the above
  • Reduce Frequent itemset dependency
  • Does not wait for the complete pruning
    information to arrive from other processors.
  • Prune the candidate set as much as possible
  • Late arriving pruning information is used in
    subsequent passes.

24
  • Algorithm
  • Pass k < l: Use either the CD or DD algorithm
  • Pass k = l
  • Partition Lk-1 among N processors
  • Pi generates Cik logically using only the Lik-1
    partition (use standard pruning)
  • Pi develops global counts for candidates in Cik
    and the database is repartitioned into DRi at
    the same time (requires communicating local data)
  • Pi receives Ljk from all other processors, needed
    for pruning Cik+1
  • Pi computes Lik from Cik and asynchronously sends
    it to the other N-1 processors
  • Pass k > l
  • Pi collects all frequent itemsets sent by other
    processors
  • Pi generates Cik using local Lik-1, taking care of
    pruning (Ljk-1)
  • Pi passes over DRi and counts Cik
  • Pi computes Lik from Cik and asynchronously sends
    it to the other N-1 processors

25
  • How to partition Lk ?
  • Partition the itemsets in Lk based on common k-1
    long prefixes
  • Assume items in the itemsets are
    lexicographically ordered
  • Example (in the paper, which contains an error: ADE
    is missing from L3)
  • L3 = {ABC, ABD, ABE, ACD, ACE, BCD, BCE, BDE,
    CDE}
  • L4 = {ABCD, ABCE, ABDE, ACDE, BCDE}
  • L5 = {ABCDE}
  • L6 = ∅
  • E = {ABC, ABD, ABE}: all have the common prefix AB
  • The Apriori candidate generation procedure
    generates ABCD, ABCE, ABDE, and ABCDE by joining
    only the items in E
  • Repartition the database according to the Lk partition
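Grouping itemsets by their common prefix can be sketched as follows. The function name and the greedy balancing rule (largest group first, always to the processor holding the fewest itemsets) are assumptions for illustration; the slides only state that the partition is based on common (k-1)-long prefixes. Itemsets are written as strings of lexicographically ordered single-letter items, as in the example above.

```python
from collections import defaultdict

def partition_by_prefix(frequent_itemsets, n_processors):
    """Split k-itemsets among processors so that all itemsets sharing
    a (k-1)-item prefix end up on the same processor."""
    groups = defaultdict(list)
    for itemset in sorted(frequent_itemsets):
        groups[itemset[:-1]].append(itemset)
    parts = [[] for _ in range(n_processors)]
    # Assumed balancing heuristic: deal out the largest groups first,
    # each to the processor that currently holds the fewest itemsets.
    for prefix in sorted(groups, key=lambda p: (-len(groups[p]), p)):
        lightest = min(range(n_processors), key=lambda i: len(parts[i]))
        parts[lightest].extend(groups[prefix])
    return parts
```

Since the join step only ever combines itemsets with the same prefix, keeping each prefix group on one processor lets that processor run the join without consulting the others.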

26
  • In candidate distribution, each processor works
    independently by counting only its portion of
    global candidate set using only local data
  • CaD must communicate the entire dataset during
    the redistribution pass (k = l, step 3), but only
    once. Unlike DD, a processor may selectively
    filter out transactions it sends to other
    processors, depending upon how the dependency
    graph is partitioned.

27
Given a frequent itemset l, examine a subset a and
generate the rule a ⇒ (l − a) with support =
support(l) and confidence = support(l) /
support(a).
Example: Frequent itemsets ABCD, AB.
Confidence = support(ABCD) / support(AB).
Only proceed to smaller subsets if rules
have the required minconf.
Example: Frequent itemset ABCD. If ABC ⇒ D doesn't satisfy
minconf, AB ⇒ CD will not have minconf.
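The rule generation step for a single frequent itemset can be sketched in Python as below. The function name `gen_rules` and the `support` dictionary (mapping each frequent itemset, as a frozenset, to its support count) are assumptions for this illustration.

```python
from itertools import combinations

def gen_rules(itemset, support, minconf):
    """Return (antecedent, consequent, confidence) rules for one itemset.

    Smaller antecedents are only examined when the rule with the larger
    antecedent reached minconf, as described above.
    """
    rules = []
    seen = set()

    def expand(antecedent):
        for sub in combinations(sorted(antecedent), len(antecedent) - 1):
            a = frozenset(sub)
            if not a or a in seen:
                continue  # skip the empty antecedent and repeats
            seen.add(a)
            conf = support[itemset] / support[a]
            if conf >= minconf:
                rules.append((a, itemset - a, conf))
                expand(a)  # only proceed to smaller subsets if the rule holds

    expand(itemset)
    return rules
```

The pruning is safe because a smaller antecedent has support at least as large as any antecedent containing it, so its confidence can only be lower or equal.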
28
  • Examination of the dataset is not required → cheap
  • Generating rules in parallel requires partitioning
    the set of all frequent itemsets. Each processor
    generates rules for its partition only, using the
    algorithm.
  • Sensitive to itemset length; balance by
    partitioning the itemsets of each length
    equally.
  • Each processor must have access to all frequent
    itemsets before rule generation begins, in order to
    calculate the confidence.
  • → In CaD this causes waiting time for slower processors
    to discover and transmit all frequent itemsets.
  • Because of this load imbalance, rule generation can be
    performed off-line, possibly on a serial processor.

29
  • Hardware specifications
  • a 32-node IBM SP2 Model 302
  • Each node is a Thin Node 2 consisting of a
    POWER2 processor running at 66.7 MHz with 256 MB
    memory
  • Each node has a 2 GB disk, of which less than 500 MB
    was available for tests
  • Combined communication hardware has a rated peak
    bandwidth of 80 MBps and latency < 40 µs. Actual
    point-to-point bandwidth reached 20 MBps
  • Message Passing Interface (MPI) was used to
    facilitate communication among processors

30
  • Six synthetic datasets of varying complexity were used
  • All dataset sizes were about 100 MB per
    processor
  • Data Parameters
  • T: Average transaction length
  • I: Average size of frequent itemsets
  • D: Average number of transactions
31
  • TEST PARAMETERS
  • Response Time
  • The time elapsed from the initiation of the
    execution to the end time of the last processor
    finishing the computation
  • Note
  • Run on the 6 datasets on a 16-node configuration
  • Since limited disk space was available, the
    response times for the serial version were measured on 1
    node's worth of data, or 1/16th of the database
  • Repartitioning for CaD was done in the 4th
    pass (best performance)

32
(Figure: response times of DD, CaD, CD, and Serial)
Response times for CD and CaD are much lower than
DD and close to the serial version run with 1/N
data
33
  • DD was able to exploit aggregate memory of the
    multiprocessor and make fewer passes in the case
    of datasets with large average transaction and
    frequent itemset lengths.
  • CaD makes just as many data passes as CD, because
    the large candidate sets that force CD into
    multiple subpasses all occur before CaD takes
    over with its redistribution pass.

34
(Figure: response time of DD, Normal vs. No Communication)
  • Normal DD: the same 100 MB of data replicated on
    each of the 16 nodes
  • No-communication DD: a node does not receive
    data from other nodes; it simply processes its local
    data 15 more times.
  • Half of the time taken by DD was for
    communication.
  • I/O savings due to DD making fewer passes become
    negligible

35
  • DD performs poorly for 2 reasons
  • Extra communication
  • Every node in the system must process every
    single transaction
  • CaD must communicate the entire dataset during
    the redistribution pass ONCE, and so suffers the
    same problems as DD.
  • Unfortunately a single pass of redistribution is
    costly. The savings from each processor being able to
    run completely independently with smaller
    candidate sets cannot compensate for the cost.
  • Although CD's overhead is small (less than 7.5%
    compared to the serial version), synchronization cost can be
    large if the data distributions are skewed or the
    nodes are not equally capable (different memory,
    processor speed, I/O bandwidth, and capacities)
  • Suggestion: CD + load balancing

36
  • TEST PARAMETERS (only on CD algorithm)
  • Scaleup
  • Increased the size of the database in direct
    proportion to the number of nodes in the system
  • Sizeup
  • Fixed the size of the multiprocessor at 16
    nodes, while increasing the database from 25MB
    per node to 400MB per node
  • Speedup
  • Fixed the size of each database at 400 MB and
    varied the number of processors

37
SCALE UP: CD scales linearly. It is able to keep the
response time almost constant as the database
and multiprocessor size increase. Reasons: the
itemsets found by CD don't change as the
database size increases, so the number of candidates
whose support must be summed in the communication
phase remains constant.
38
SIZE UP: CD shows sublinear performance; the
program actually becomes more efficient as the
database size increases. More efficient: increasing
size of database → more I/O and transaction
processing → smaller portion of time spent in
communication.
39
SPEED UP: CD has very good speedup performance,
up to 8 processors. Larger datasets show better
speedup characteristics. The more data processed
per node, the less significant the
communication time becomes.
40
  • Count Distribution attempts to minimize
    communication by replicating the candidate sets
    in each processor's memory
  • Data Distribution maximizes the use of aggregate
    memory by allowing each processor to work with the
    entire dataset but only a portion of the candidate
    set
  • Candidate Distribution eliminates the
    synchronization costs at the end of every pass,
    maximizes the use of aggregate memory while
    limiting heavy communication to a single
    redistribution pass

41
(No Transcript)
42
(No Transcript)
43
  • Count Distribution exhibited linear scale-up and
    excellent speed-up and size-up behaviour
  • Data Distribution lost out because of the cost of
    broadcasting local data from each processor to
    every other processor, and Candidate Distribution
    lost because of the cost of data redistribution.
  • Not all problems require an intricate
    parallelization

44
  • Q & A