Supporting Dynamic Load Balancing in a Parallel Data Mining Middleware
1
Supporting Dynamic Load Balancing in a Parallel
Data Mining Middleware
Tekin Bicer and Gagan Agrawal Department of
Computer Science and Engineering The Ohio State
University
Parallel Data-Mining Workshop 2011, Arizona.
PDM 2011
2
Motivation
  • Data mining applications
  • Non-dedicated machines in grids
  • Virtualized machines in clouds
  • Distributed large datasets
  • Distributed computing resources
  • Highly heterogeneous environments

3
A Parallel Data Mining Middleware - FREERIDE
  • The reduction object represents the intermediate
    state of the execution
  • The reduction function is commutative and associative
  • Sorting and grouping overheads are eliminated by the
    reduction function/object
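As a rough illustration of the idea (names and classes here are hypothetical, not FREERIDE's actual API): each data element updates a reduction object, and because the combine step is commutative and associative, per-node objects can be merged in any order without sorting or grouping the intermediate results.

```python
# Hypothetical sketch of generalized reduction with a reduction object.
# SumCountReduction is illustrative (e.g., one k-means cluster's running
# sum and count), not part of the FREERIDE middleware itself.

class SumCountReduction:
    """Reduction object: intermediate state for a running sum/count."""
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def local_reduce(self, value):
        # Accumulate one data element into the intermediate state.
        self.total += value
        self.count += 1

    def merge(self, other):
        # Commutative, associative combine of two intermediate states,
        # so per-node results can be merged in any order.
        self.total += other.total
        self.count += other.count
        return self

def process_partition(values):
    robj = SumCountReduction()
    for v in values:
        robj.local_reduce(v)
    return robj

# Two "nodes" process disjoint partitions, then their states are merged.
a = process_partition([1.0, 2.0, 3.0])
b = process_partition([4.0, 5.0])
merged = a.merge(b)
print(merged.total, merged.count)  # 15.0 5
```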

4
Remote Data Analysis
  • Co-locating data and compute resources gives the best
    performance
  • But this may not always be possible
  • Cost, availability, etc.
  • Data hosts and compute hosts are separated
  • Fits grid/cloud computing
  • Our parallel data mining middleware is an
    implementation of FREERIDE that supports remote
    data analysis

5
Outline
  • Motivation and Introduction
  • Load Balancing System Design
  • Experimental Evaluation
  • Conclusion

6
A Load Balancing System based on Independent Jobs
  • Reduction function is associative and commutative
  • Data can be divided into independent jobs
  • Jobs can be consumed according to the processing
    power of the compute resources

Suitable for pooling-based job scheduling
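The pooling idea above can be sketched as follows (an illustrative toy, not the middleware's code): independent jobs sit in a shared pool, each compute node pulls the next job whenever it becomes idle, and faster nodes naturally consume more jobs.

```python
# Toy sketch of pooling-based job scheduling: a shared pool of
# independent jobs consumed at each node's own pace.
import queue
import threading
import time

job_pool = queue.Queue()
for job_id in range(8):          # eight independent jobs
    job_pool.put(job_id)

consumed = {"fast": [], "slow": []}

def worker(name, work_time):
    while True:
        try:
            job = job_pool.get_nowait()
        except queue.Empty:
            return                # pool drained, node stops
        time.sleep(work_time)     # simulate processing one job
        consumed[name].append(job)

# A faster node finishes each job sooner, so it requests more often.
fast = threading.Thread(target=worker, args=("fast", 0.01))
slow = threading.Thread(target=worker, args=("slow", 0.04))
fast.start(); slow.start()
fast.join(); slow.join()

print(len(consumed["fast"]), len(consumed["slow"]))
```

The point of the sketch: no node is statically assigned a share of the data, so heterogeneity in processing power is absorbed by the pool itself.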
7
Jobs
  • Data are physically stored in many files across many
    nodes
  • Index information (data location, offset address, job
    size) is extracted for job creation, abstracting jobs
    from the physical layout of the data
  • Compute and data nodes are geographically
    separated
  • The available bandwidth between nodes needs to be
    considered
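A job descriptor built from this index information might look like the following (field names are illustrative assumptions, not the middleware's actual schema):

```python
# Hypothetical job descriptor: the extracted index lets the scheduler
# create jobs without knowing how the data are physically laid out.
from dataclasses import dataclass

@dataclass
class Job:
    data_node: str    # which node hosts the chunk
    file_path: str    # file containing the chunk
    offset: int       # byte offset of the chunk within the file
    size: int         # size of the job in bytes

def build_index(chunks_per_node):
    """Flatten per-node chunk lists into a single job index."""
    index = []
    for node, chunks in chunks_per_node.items():
        for path, offset, size in chunks:
            index.append(Job(node, path, offset, size))
    return index

# Example index over two data nodes with 64 MB chunks.
index = build_index({
    "dn0": [("/data/part0", 0, 64 << 20), ("/data/part0", 64 << 20, 64 << 20)],
    "dn1": [("/data/part1", 0, 64 << 20)],
})
print(len(index))  # 3
```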

8
Load Balancing System Workflow
[Diagram: data nodes (DN) and compute nodes (CN) exchanging requests and jobs through the job scheduler; the numbers 1-5 mark the steps below]

1. Each data node registers its bandwidth and index
   information with the scheduler
2. Compute nodes start requesting jobs from the job
   scheduler
3. The job scheduler creates jobs from the most
   suitable data nodes and assigns them to the
   requesting compute nodes
4. Compute nodes start retrieving and processing the
   assigned jobs from the data nodes
5. When the compute nodes are done processing the
   jobs, they request more from the job scheduler
9
Job Scheduler
  While true
    compNode ← ReceiveRequest()
    if CheckAssignedJobs(compNode)
      Set previously assigned jobs as processed
      Increase available bandwidth on the data node
    dataNode ← AvailDataNode(dataNodes)
    job ← CreateJob(dataNode, compNode.chunkNumber)
    Transfer(job, compNode)
    Decrease the available bandwidth on dataNode
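A minimal single-threaded realization of this loop could look as follows (a sketch only: the bandwidth bookkeeping as a simple counter and the selection rule are illustrative assumptions, not FREERIDE's implementation):

```python
# Sketch of the scheduler loop: track per-data-node available bandwidth,
# pick the least-loaded data node that still has jobs, and release the
# reserved bandwidth when a compute node reports back for more work.

data_nodes = {"dn0": 100, "dn1": 100}   # available bandwidth per data node
jobs_by_node = {"dn0": list(range(4)), "dn1": list(range(4, 8))}
assigned = {}                            # compute node -> (data node, job)

def avail_data_node():
    # AvailDataNode: data node with the most available bandwidth and jobs left.
    candidates = [n for n in data_nodes if jobs_by_node[n]]
    return max(candidates, key=lambda n: data_nodes[n]) if candidates else None

def handle_request(comp_node, bandwidth_cost=10):
    # If this compute node had a job, mark it processed and release bandwidth.
    if comp_node in assigned:
        done_node, _ = assigned.pop(comp_node)
        data_nodes[done_node] += bandwidth_cost
    node = avail_data_node()
    if node is None:
        return None                      # no jobs left anywhere
    job = jobs_by_node[node].pop(0)      # CreateJob(dataNode, ...)
    data_nodes[node] -= bandwidth_cost   # reserve bandwidth for the transfer
    assigned[comp_node] = (node, job)
    return (node, job)

# One compute node keeps requesting until the pool is drained.
served = []
while True:
    r = handle_request("cn0")
    if r is None:
        break
    served.append(r)
print(len(served))  # 8
```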

10
Outline
  • Motivation and Introduction
  • Load Balancing System Design
  • Experimental Evaluation
  • Conclusion

11
Goals for the Experiments
  • Evaluate the performance of the proposed system
  • Evaluate the overheads under different
    heterogeneous configurations
  • Observe the load balancing system's performance
    with different job granularities

12
Experimental Setup
  • FREERIDE-G
  • Data hosts and compute nodes are separated
  • Applications
  • K-means: 25.6 GB and 6.4 GB datasets
  • PCA: 17 GB and 4 GB datasets
  • 4 data nodes (fixed) and a varying number of
    compute nodes
  • Opteron 254, 4 GB memory
  • Connected with Mellanox InfiniBand

13
Effectiveness and Overheads
Maximum overhead: 2.69; speedups: 1.46-1.50
K-Means, 25.6 GB data, varying number of compute nodes
14
Effectiveness and Overheads
Speedups: 1.49-1.52; maximum overhead: 0
PCA, 17 GB data, varying number of compute nodes
15
Distribution of jobs with varying slowdown ratios
Absolute overhead: 8.5
PCA, 4 GB data, 8 compute nodes
16
Performance w.r.t. Different Chunk Granularities
Absolute overheads: 16.01, 6.31, 1.58, 0.72
K-Means, 6.4 GB data, 4 compute nodes
17
Conclusion
  • Load balancing is necessary for heterogeneous
    environments
  • Independent tasks enable effective work
    distribution
  • Our system has very low overhead
  • The provided load balancing API can easily be
    customized
  • e.g., data node selection, job granularity
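Such a customization point might look roughly like this (a hypothetical interface; the paper's actual API may differ): the scheduler takes a pluggable data-node selection policy and a job granularity parameter.

```python
# Sketch of a customizable scheduler interface: swap the node-selection
# policy or the chunk size without touching the scheduler itself.

def bandwidth_policy(data_nodes):
    """Default policy: pick the data node with the most available bandwidth."""
    return max(data_nodes, key=data_nodes.get)

def round_robin_policy_factory():
    """Alternative policy: cycle through data nodes in sorted order."""
    state = {"i": 0}
    def policy(data_nodes):
        nodes = sorted(data_nodes)
        node = nodes[state["i"] % len(nodes)]
        state["i"] += 1
        return node
    return policy

class Scheduler:
    def __init__(self, data_nodes, select_node, chunk_size):
        self.data_nodes = data_nodes      # node -> available bandwidth
        self.select_node = select_node    # customizable selection policy
        self.chunk_size = chunk_size      # customizable job granularity

    def next_assignment(self):
        node = self.select_node(self.data_nodes)
        return (node, self.chunk_size)

sched = Scheduler({"dn0": 80, "dn1": 120}, bandwidth_policy, chunk_size=64)
print(sched.next_assignment())  # ('dn1', 64)
```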

18
  • Thanks