SPRINT: A Scalable Parallel Classifier for Data Mining - PowerPoint PPT Presentation

1
SPRINT: A Scalable Parallel Classifier for Data
Mining
  • IBM Almaden Research Center, 1996

2
1. Introduction (1/2)
  • Classification research has recently focused on
    algorithms that can handle large databases.
  • In classification, we are given
  • A set of example records, called a training set
  • Each record consists of several fields, or
    attributes
  • Attributes are either continuous, coming from an
    ordered domain, or categorical, coming from an
    unordered domain
  • One of the attributes, called the classifying
    attribute, indicates the class of each record
  • Objective of classification
  • To build a model of the classifying attribute
    based upon the other attributes

3
1. Introduction (2/2)
  • Decision trees are well suited for data mining.
  • This paper therefore focuses on building a
    scalable and parallelizable decision-tree
    classifier
  • The paper presents a decision-tree-based
    classification algorithm, called SPRINT
  • Removes all of the memory restrictions
  • Fast and scalable
  • Easily parallelized

4
2. Serial Algorithm (1/2)
  • A decision-tree classifier is built in two phases
  • Growth phase
  • The tree is built by recursively partitioning the
    data until each partition is either pure or
    sufficiently small (figure 2)
  • Prune phase
  • Pruning requires access only to the fully grown
    decision tree

5
2. Serial Algorithm (2/2)
  • Two major issues have critical performance
    implications in the tree-growth phase
  • How to find split points that define node tests
  • Having chosen a split point, how to partition the
    data
  • SPRINT addresses these two issues differently
    from previous algorithms
  • It has no restriction on the size of the input
  • Yet it is a fast algorithm
  • It shares with SLIQ the advantage of a one-time
    sort, but uses different data structures

6
2.1 Data Structures (1/4)
  • Attribute lists
  • SPRINT initially creates an attribute list for
    each attribute in the data (figure 3)
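
A minimal sketch of building such attribute lists (the record values and
field names below are illustrative, not taken from the paper): each entry
holds an attribute value, a class label, and a rid, and lists for
continuous attributes receive their one-time sort at creation.

```python
# Illustrative training set in the spirit of the paper's Age/CarType example.
training_set = [
    {"rid": 0, "Age": 17, "CarType": "sports", "Class": "High"},
    {"rid": 1, "Age": 43, "CarType": "family", "Class": "Low"},
    {"rid": 2, "Age": 68, "CarType": "family", "Class": "Low"},
    {"rid": 3, "Age": 32, "CarType": "truck",  "Class": "Low"},
    {"rid": 4, "Age": 23, "CarType": "sports", "Class": "High"},
]

def build_attribute_lists(records, continuous, categorical):
    """One attribute list per attribute: (value, class label, rid) entries."""
    lists = {}
    for attr in continuous + categorical:
        entries = [(r[attr], r["Class"], r["rid"]) for r in records]
        if attr in continuous:
            entries.sort()  # one-time sort for continuous attributes
        lists[attr] = entries
    return lists

lists = build_attribute_lists(training_set, ["Age"], ["CarType"])
print(lists["Age"][0])  # smallest Age entry: (17, 'High', 0)
```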

7
2.1 Data Structures (2/4)
  • As the tree is grown and nodes are split to
    create new children, the attribute lists
    belonging to each node are partitioned and
    associated with the children (figure 4)

8
2.1 Data Structures (3/4)
  • Histograms
  • For continuous attributes, two histograms are
    associated with each decision-tree node that is
    under consideration for splitting
  • Cbelow maintains the class distribution for the
    attribute records that have already been
    processed
  • Cabove maintains it for those that have not
  • Categorical attributes also have a histogram
    associated with a node
  • However, only one histogram is needed, and it
    contains the class distribution for each value of
    the given attribute
  • We call this histogram a count matrix

9
2.1 Data Structures (4/4)
10
2.2 Finding split points (1/3)
  • While growing the tree, the goal at each node is
  • To determine the split point that best divides
    the training records belonging to that leaf
  • This paper uses the gini index
  • The gini index was chosen based on the authors'
    experience with SLIQ
  • For a data set S containing examples from n
    classes
  • gini(S) = 1 - Σ pj², where pj is the relative
    frequency of class j in S
  • If a split divides S into two subsets S1 and S2,
    the index of the divided data is ginisplit(S)
  • ginisplit(S) = (n1/n) gini(S1) + (n2/n) gini(S2)
  • Advantage of this index its calculation requires
    only the distribution of the classes in each of
    the partitions.
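
The two formulas above translate directly into Python (the class labels
below are a made-up example, not data from the paper):

```python
from collections import Counter

def gini(labels):
    """gini(S) = 1 - sum_j pj^2, where pj is the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(s1, s2):
    """ginisplit(S) = (n1/n) gini(S1) + (n2/n) gini(S2)."""
    n = len(s1) + len(s2)
    return len(s1) / n * gini(s1) + len(s2) / n * gini(s2)

labels = ["High", "High", "Low", "Low", "Low"]
print(gini(labels))                        # 1 - (0.4² + 0.6²) ≈ 0.48
print(gini_split(labels[:2], labels[2:]))  # both subsets pure: 0.0
```

Note that both functions need only class counts, which is exactly why the
Cbelow/Cabove histograms suffice to evaluate candidate splits.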

11
2.2 Finding split points (2/3)
  • Continuous attributes
  • For continuous attributes, the candidate split
    points are mid-points between every two
    consecutive attribute values in the training data
  • Histogram Cbelow is initialized to zeros whereas
    Cabove is initialized with the class distribution
    for all records for the node
  • For the root node, this distribution is obtained
    at the time of sorting
  • For other nodes this distribution is obtained
    when the node is created
  • Attribute records are read one at a time, and
    Cbelow and Cabove are updated for each record
    read (figure 5)
  • Note that Cbelow and Cabove have all the
    necessary information to compute the gini index.
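
A sketch of that one-pass scan (helper names and the sample list are
assumptions for illustration): each record's class count moves from Cabove
to Cbelow, and the midpoint between consecutive distinct values is scored
from the two histograms alone.

```python
from collections import Counter

def best_continuous_split(attr_list, classes):
    """Scan a sorted attribute list of (value, class) pairs once,
    returning (best gini_split value, best split point)."""
    def gini_from_counts(counts):
        n = sum(counts.values())
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in counts.values())

    c_above = Counter(cls for _, cls in attr_list)  # no records processed yet
    c_below = Counter({c: 0 for c in classes})
    total = len(attr_list)
    best = (float("inf"), None)
    for i, (value, cls) in enumerate(attr_list[:-1]):
        c_below[cls] += 1        # record moves from "above" to "below"
        c_above[cls] -= 1
        next_value = attr_list[i + 1][0]
        if value == next_value:
            continue             # no candidate midpoint between equal values
        n_below = i + 1
        g = (n_below / total) * gini_from_counts(c_below) + \
            ((total - n_below) / total) * gini_from_counts(c_above)
        if g < best[0]:
            best = (g, (value + next_value) / 2)
    return best

attr_list = [(17, "High"), (23, "High"), (32, "Low"), (43, "Low"), (68, "Low")]
print(best_continuous_split(attr_list, ["High", "Low"]))  # (0.0, 27.5)
```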

12
2.2 Finding split points (3/3)
  • Categorical attributes
  • For categorical split points, make a single scan
    through the attribute list, collecting counts in
    the count matrix for each combination of class
    label and attribute value found in the
    data (figure 6)
  • The important point
  • The information required for computing the gini
    index for any subset splitting is available in
    the count matrix
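
That single scan can be sketched in a few lines (the sample entries are
illustrative):

```python
from collections import defaultdict

# Attribute-list entries for a categorical attribute: (value, class label).
attr_list = [("family", "Low"), ("sports", "High"), ("sports", "High"),
             ("family", "Low"), ("truck", "Low")]

# One pass fills the count matrix: rows are attribute values, columns classes.
count_matrix = defaultdict(lambda: defaultdict(int))
for value, cls in attr_list:
    count_matrix[value][cls] += 1

print(dict(count_matrix["sports"]))  # {'High': 2}
```

Any candidate subset split is then scored by summing the matrix rows for
the values in each subset, so no further scans of the data are needed.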

13
2.3 Performing the split
  • Once the best split point has been found for a
    node, we execute the split by creating child
    nodes and dividing the attribute records between
    them (figure 4)
  • For the remaining attribute lists of the
    node (CarType in our example)
  • We have no test that can be applied to the
    attribute values to decide how to divide the
    records, so we work with the rids instead
  • As we partition the list of the splitting
    attribute (i.e., Age), we insert the rid of each
    record into a probe structure (hash table),
    noting to which child the record was moved
  • During this splitting operation, we also build
    class histograms for each new leaf
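
A sketch of the split step (values, rids, and the split point below are
illustrative): the splitting attribute's list is partitioned by the test,
the rids sent left are recorded in a hash table, and the non-splitting
attribute's list is then partitioned by probing on rid.

```python
# Attribute-list entries: (value, class label, rid).
age_list = [(17, "High", 0), (23, "High", 4), (32, "Low", 3),
            (43, "Low", 1), (68, "Low", 2)]
cartype_list = [("sports", "High", 0), ("family", "Low", 1),
                ("family", "Low", 2), ("truck", "Low", 3),
                ("sports", "High", 4)]
split_point = 27.5                      # node test: Age <= 27.5

probe = set()                           # hash table of rids sent to the left child
age_left, age_right = [], []
for value, cls, rid in age_list:
    if value <= split_point:
        probe.add(rid)
        age_left.append((value, cls, rid))
    else:
        age_right.append((value, cls, rid))

# The remaining attribute list is divided by probing on rid.
car_left = [e for e in cartype_list if e[2] in probe]
car_right = [e for e in cartype_list if e[2] not in probe]
print(sorted(e[2] for e in car_left))   # rids that went left: [0, 4]
```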

14
2.4 Comparison with SLIQ(1/2)
  • The technique of creating separate attribute
    lists from the original data was proposed by the
    SLIQ algorithm
  • In SLIQ
  • An entry in an attribute list contains an
    attribute value and a rid
  • The class label is kept in a separate data
    structure called a class list, which is indexed
    by rid
  • An entry in the class list contains a pointer to
    a node of the classification tree
  • Figure 7
  • The goal in designing SPRINT was not to
    outperform SLIQ on datasets where a class list
    can fit in memory
  • It was to develop an accurate classifier for
    datasets that are simply too large for any other
    algorithm, and to be able to build such a
    classifier efficiently
  • SPRINT is designed to be easily parallelizable

15
2.4 Comparison with SLIQ(2/2)
16
3. Parallelizing Classification
  • In parallel tree-growth, the primary problems are
  • Finding good split points
  • Partitioning the data using the discovered split
    points
  • As in any parallel algorithm, the following must
    be considered
  • Issues of data placement
  • Workload balancing
  • Both are resolved in the SPRINT algorithm
  • We assume a shared-nothing parallel environment
    where each of N processors has private memory and
    disk

17
3.1 Data Placement and Workload Balancing
  • SPRINT achieves uniform data placement and
    workload balancing by distributing the attribute
    lists evenly over the N processors of a
    shared-nothing machine.
  • The partitioning is achieved by first
    distributing the training-set examples equally
    among all the processors.

18
3.2 Finding split points
  • Very similar to the serial algorithm
  • In the serial version, the processor
  • Scans the attribute lists, either evaluating
    split points for continuous attributes or
    collecting distribution counts for categorical
    attributes
  • This does not change in the parallel algorithm
  • No extra work or communication is required while
    each processor is scanning its attribute-list
    partitions.
  • The differences between the serial and parallel
    algorithms
  • Appear only before and after the attribute-list
    partitions are scanned.
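
One such coordination step, sketched for categorical attributes (this is a
hedged illustration using plain Python rather than real MPI primitives, and
the sample matrices are assumptions): each processor builds a local count
matrix over its partition, and the global matrix is their element-wise sum.

```python
from collections import Counter

# Local count matrices, one per processor, keyed by attribute value.
local_matrices = [
    {"family": Counter(Low=2), "sports": Counter(High=1)},   # processor 0
    {"sports": Counter(High=1), "truck": Counter(Low=1)},    # processor 1
]

# Combine into the global count matrix by summing class counts per value.
global_matrix = {}
for m in local_matrices:
    for value, counts in m.items():
        global_matrix.setdefault(value, Counter()).update(counts)

print(global_matrix["sports"]["High"])  # 1 + 1 = 2
```

In a real shared-nothing implementation this exchange would use the
message-passing layer; the point is only that split evaluation needs the
summed counts, not the records themselves.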

19
3.3 Performing the Splits
  • Splitting is identical to the serial algorithm
  • The only additional step is that, before
    building the probe structure, we need to collect
    rids from all the processors

20
3.4 Parallelizing SLIQ
  • Two primary approaches for parallelizing SLIQ
  • One where the class list is replicated in the
    memory of every processor
  • called SLIQ/R
  • The other where it is distributed such that each
    processor's memory holds only a portion of the
    entire list
  • called SLIQ/D

21
4. Performance Evaluation
  • The primary metric for evaluating classifier
    performance is classification accuracy, the
    percentage of test samples that are correctly
    classified.
  • The other important metrics are classification
    time and the size of the decision tree.
  • The ideal goal for a decision-tree classifier is
    to produce compact, accurate trees in a short
    classification time

22
4.1 Datasets (1/2)
  • Due to the lack of a classification benchmark
    containing large datasets, a synthetic database
    is used for all of the experiments

23
4.1 Datasets (2/2)
  • Ten classification functions produce databases
    whose distributions have varying complexities
  • This paper presents results for two of these
    functions

24
4.2 Serial Performance
  • For serial analysis, we compare the response
    times of serial SPRINT and SLIQ on training sets
    of various sizes
  • Experiments were conducted on an IBM RS/6000 250
    workstation running AIX level 3.2.5
  • The CPU has a clock rate of 66MHz, and the
    machine has 16MB of main memory

25
4.3 Parallel Performance
  • To examine how well the SPRINT algorithm performs
    in parallel environments
  • The parallelization was implemented on an IBM SP2
  • Using the standard MPI communication
    primitives
  • Experiments were conducted on a 16-node IBM SP2
    Model 9076
  • Each node in the multiprocessor is a 370 Node
    consisting of a POWER1 processor running at
    62.5MHz with 128MB of real memory.

26
4.3.1 Comparison of Parallel Algorithms (1/2)
  • First compare parallel SPRINT to the two
    parallelizations of SLIQ.
  • In these experiments, each processor contained
    50,000 training examples and the number of
    processors varied from 2 to 16
  • The total training-set size ranges from 100,000
    records to 1.6 million records

27
4.3.1 Comparison of Parallel Algorithms (2/2)
28
4.3.2 Scaleup
29
4.3.3 Speedup
30
4.3.4 Sizeup
31
5. Conclusion (1/2)
  • There is a need for algorithms that build
    classifiers able to handle very large databases.
  • The recently proposed SLIQ algorithm was the
    first to address these concerns.
  • This paper presented a new classification
    algorithm called SPRINT
  • It removes all memory restrictions that limit
    existing decision-tree algorithms
  • Yet it exhibits the same excellent scaling
    behavior as SLIQ

32
5. Conclusion (2/2)
  • Design goals
  • Included the requirement that the algorithm be
    easily and efficiently parallelizable.
  • SPRINT does have an efficient parallelization
    that requires very few additions to the serial
    algorithm.
  • SPRINT handles datasets that are too large for
    SLIQ.
  • Implementation on the SP2, a shared-nothing
    multiprocessor, showed that SPRINT does indeed
    parallelize efficiently.
  • Parallel SPRINT's efficiency improves as the
    problem size increases.
  • Scaleup, speedup, and sizeup characteristics