Title: Grid Computing in Data Mining and Data Mining on Grid Computing
Slide 1: Grid Computing in Data Mining and Data Mining on Grid Computing
- David Cieslak (dcieslak_at_cse.nd.edu)
- Advisor: Nitesh Chawla (nchawla_at_cse.nd.edu)
- University of Notre Dame
Slide 2: Grid Computing in Data Mining
Slide 3: Data Mining Primer
- Data Mining: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (Fayyad, Piatetsky-Shapiro & Smyth, 1996)
- Classifier: a learning algorithm which trains a predictive model from data
- Ensemble: a set of classifiers working together to improve prediction
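The ensemble idea above can be sketched in a few lines: combine several weak classifiers by majority vote. This is a minimal illustration in plain Python (the toy threshold "classifiers" are invented for the example, not from the slides):

```python
# Minimal sketch of an ensemble via majority vote.
from collections import Counter

def ensemble_predict(classifiers, x):
    """Each classifier is any callable x -> label; the ensemble
    returns the most common label among their predictions."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy "classifiers" that threshold a single feature differently.
clfs = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(ensemble_predict(clfs, 0.6))  # two of three vote 1, so the ensemble says 1
```

Even when each individual threshold is unreliable, the vote smooths out their disagreements, which is the point of ensembles.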
Slide 4: Applications of Data Mining
- Network Intrusion Detection
- Categorizing Adult Income
- Finding Calcifications in Mammography
- Looking for Oil Spills
- Identifying Handwritten Digits
- Predicting Job Failure on a Computing Grid
- Anticipating Successful Companies
Slide 5: Condor Makes DM Tractable
- I use a small set of algorithms in high volume
  - Ex: run the same classifier on many datasets
- A single data mining operation may have easily parallelized segments
  - Ex: learn an ensemble of 10 classifiers on a dataset
- Introducing simple parallelism into data mining saves significant time
Slide 6: Common DM Task: 10-Fold CV
[Diagram: the original data, a 30 MB network traffic dataset]
Slide 7: Common DM Task: 10-Fold CV
[Diagram: the original 30 MB dataset is split into 10 training folds (27 MB each) and 10 testing folds (3 MB each)]
Slide 8: Common DM Task: 10-Fold CV
[Diagram: for each fold i, a learning algorithm (RIPPER) trains a classifier on the 27 MB training fold (about 2 hours), then evaluates it on the 3 MB testing fold (under 1 minute)]
Slide 9: Common DM Task: 10-Fold CV
- Average and aggregate various statistics and measures across folds
[Diagram: per-fold results from the 27 MB training / 3 MB testing splits feed into the aggregation]
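The fold-splitting step in slides 6-9 can be sketched without any libraries: partition the sample indices into 10 disjoint test folds, with each training fold being everything else (so 90% train / 10% test, matching the 27 MB / 3 MB split above). This is an illustrative sketch, not the deck's actual tooling:

```python
# Sketch of 10-fold cross-validation index splitting (plain Python).

def kfold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold CV."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        # Last fold absorbs any remainder so every sample is tested once.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

folds = list(kfold_indices(100, k=10))
print(len(folds))        # 10 folds
print(len(folds[0][1]))  # each test fold holds 10 of the 100 samples
```

A real run would shuffle (or stratify) the indices first; the deterministic split here just keeps the example short.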
Slide 10: Using Condor on 10 Folds
- Local host (about 5 mins):
  - Splits data
  - Uploads data and task to the pool
- Condor pool (about 2 hours, folds run in parallel):
  - Learn classifier
  - Evaluate classifier
  - Return results
- Local host (about 5 mins):
  - Receive results
  - Aggregate/average
- If there is 1 hour of learn/eval time per fold, Condor saves up to 18 hours of real time
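The "upload data and task to the pool" step above is typically a Condor submit file that queues one job per fold. The following is a hedged sketch only; the file names, script name, and fold layout are illustrative, not taken from the slides:

```text
# Hypothetical Condor submit file: one job per CV fold.
universe   = vanilla
executable = ripper-fold.sh
arguments  = $(Process)

# Ship fold i's training and testing data with job i.
transfer_input_files   = train_fold_$(Process).dat, test_fold_$(Process).dat
should_transfer_files  = YES
when_to_transfer_output = ON_EXIT

output = fold_$(Process).out
error  = fold_$(Process).err
log    = folds.log

queue 10
```

`queue 10` creates ten jobs whose `$(Process)` macro expands to 0-9, so each fold's learn/eval runs concurrently across the pool.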
Slide 11: A More Complex DM Task
- Over/under-sampling wrapper:
  - Split data into 50 folds (single)
  - Generate 10 undersamplings and 20 oversamplings per fold (pool)
  - Learn a classifier on each undersampling (pool)
  - Evaluate and select the best undersampling (single)
  - Learn a classifier combining the best undersampling with each oversampling (pool)
  - Evaluate the best combination (single)
  - Obtain results on test folds (pool)
  - Aggregate/average results (single)
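The wrapper above alternates serial "single" phases with parallel "pool" phases. One such alternation can be sketched as a map over candidate sampling rates followed by a serial selection; the `train_and_score` stand-in below is invented for illustration (a real run would dispatch each sampling to the Condor pool and invoke RIPPER):

```python
# Sketch of one pool/single alternation from the sampling wrapper.

def train_and_score(rate):
    """Stand-in for learning a classifier on one undersampling and
    scoring it; a toy score that peaks at a 50% sampling rate."""
    return 1.0 - abs(rate - 0.5)

under_rates = [i / 10 for i in range(1, 11)]         # 10 undersamplings
scores = [train_and_score(r) for r in under_rates]   # "pool" phase: parallel
best_under = under_rates[scores.index(max(scores))]  # "single" phase: select best
print(best_under)
```

The same pattern repeats for the oversamplings: fan work out to the pool, then do the cheap comparison serially on the local host.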
Slide 12: Condor Speed-Ups: Usage
- 10-fold CV evaluation:
  - Single machine: roughly one day
  - Using Condor: under one hour
- Over/under-sampling wrapper:
  - Single machine: days to weeks
  - Using Condor: under a day
- In 2006, I used 471,126 CPU hours via Condor
- I am slacking in 2007: 13,235 CPU hours
Slide 13: A Data Miner's Wishlist
- User specifies the task to the system
  - Outlines serial task phases
- System smartly divides the labor
- What is the logical task granule, based on:
  - Condor pool performance
  - Upload/download latency
  - Data size
  - Algorithm complexity
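Choosing the "logical task granule" amounts to balancing per-job overhead (scheduling plus data transfer) against compute time. The heuristic below is a hypothetical sketch of one way such a system might pick a batch size; none of it comes from the slides' system, and the 5% overhead target is an arbitrary assumption:

```python
# Hypothetical granule-size heuristic: batch enough tasks per job that
# per-job overhead stays below a target fraction of total runtime.
import math

def tasks_per_job(task_seconds, overhead_seconds, max_overhead_frac=0.05):
    """Smallest batch size n such that
    overhead / (overhead + n * task_seconds) <= max_overhead_frac."""
    n = math.ceil(overhead_seconds * (1 - max_overhead_frac)
                  / (max_overhead_frac * task_seconds))
    return max(1, n)

# 10-second tasks with 60 s of per-job overhead: batch 114 tasks per job.
print(tasks_per_job(10, 60))
# 1-hour tasks amortize the same overhead in a single task per job.
print(tasks_per_job(3600, 60))
```

A fuller version would also cap the batch by data size and by how many pool slots are free, per the wishlist above.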
Slide 14: Data Mining on Grid Computing
Slide 15: It's Ugly in the Real World
- Machine-related failures
  - Power outages, network outages, faulty memory, corrupted file systems, bad config files, expired certs, packet filters...
- Job-related failures
  - Crash on some args, bad executable, missing input files, mistakes in args, missing components, failure to understand dependencies...
- Incompatibilities between jobs and machines
  - Missing libraries; not enough disk/CPU/memory; wrong software installed, wrong version installed, wrong memory layout...
- Load-related failures
  - Slow actions induce timeouts; kernel tables (files, sockets, procs); router tables (addresses, routes, connections); competition with other users...
- Non-deterministic failures
  - Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
Slide 16: A Grand Challenge Problem
- A user submits one million jobs to the grid.
- Half of them fail.
- Now what?
  - Examine the output of every failed job?
  - Log in to every site to examine the logs?
  - Resubmit and hope for the best?
- We need some way of getting the big picture.
- We need to identify problems not seen before.
Slide 17: An Idea
- We have lots of structured information about the components of a grid.
- Can we perform some form of data mining to discover the big picture of what is going on?
  - User: "Your jobs work fine on RH Linux 12.1 and 12.3, but they always seem to crash on version 12.2."
  - Admin: "User joe is running 1000s of jobs that transfer 10 TB of data and fail immediately; perhaps he needs help."
- Can we act on this information to improve the system?
  - User: avoid resources that are not working for you.
  - Admin: assist the user in understanding and fixing the problem.
Slide 18: Job ClassAd, Machine ClassAd, User Job Log

Job ClassAd:
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"

Machine ClassAd:
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) > 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000

User Job Log:
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
...
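Before any mining can happen, ClassAds like the ones on this slide have to be turned into feature vectors. A minimal sketch of that first step follows; real ClassAds have a much richer expression language (the `CpuBusy` expression above, for instance), so this only handles the flat `name = value` attributes shown here:

```python
# Hedged sketch: parse a simplified ClassAd (one `name = value` per
# line) into a Python dict usable as a feature record.

def parse_classad(text):
    ad = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[name] = value.strip('"')   # string attribute
        else:
            try:
                ad[name] = float(value)   # numeric attribute
            except ValueError:
                ad[name] = value          # expressions etc. left as text
    return ad

ad = parse_classad('MyType = "Machine"\nMemory = 498\nNiceUser = FALSE')
print(ad["Memory"])  # 498.0
```

Joining each job's ad with the ad of the machine it ran on, plus the job-log outcome, yields the labeled examples the next slides mine.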
Slide 19: Failure Criteria
- exit != 0
- core dump
- evicted
- suspended
- bad output
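These criteria define the class label for mining. A labeling function over a job record might look like the sketch below; the field names are illustrative stand-ins, not actual ClassAd attribute names:

```python
# Hypothetical labeling step: mark a job record as a failure if any of
# the slide's criteria hold (nonzero exit, core dump, eviction,
# suspension, or bad output). Field names are invented for illustration.

def is_failure(job):
    return (job.get("exit_status", 0) != 0
            or job.get("core_dump", False)
            or job.get("evicted", False)
            or job.get("suspended", False)
            or job.get("bad_output", False))

print(is_failure({"exit_status": 1}))                    # True
print(is_failure({"exit_status": 0, "evicted": True}))   # True
print(is_failure({"exit_status": 0}))                    # False
```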
Slide 20: Learned Rules

------------------------- run 1 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, JobStart > 1.14626e09, MonitorSelfTime > 1.14626e09 (491/377)
  - exit1 :- Memory > 1930, Disk < 555320 (1670/1639)
  - default exit0 (11904/4503)
- Error rate on holdout data is 30.9852
- Running average of error rate is 30.9852

------------------------- run 2 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, Disk < 541186 (2076/1812)
  - default exit0 (12090/4606)
- Error rate on holdout data is 31.8791
- Running average of error rate is 31.4322

------------------------- run 3 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, MonitorSelfImageSize > 8.844e09 (1270/1050)
  - exit1 :- Memory > 1930, KeyboardIdle > 815995 (793/763)
  - exit1 :- Memory > 1927, EnteredCurrentState < 1.14625e09, VirtualMemory > 2.09646e06, LoadAvg > 30000, LastBenchmark < 1.14623e09, MonitorSelfImageSize < 7.836e09 (94/84)
  - exit1 :- Memory > 1927, TotalLoadAvg < 1.43e06, UpdatesTotal < 8069, LastBenchmark < 1.14619e09, UpdatesLost < 1 (77/61)
  - default exit0 (11940/4452)
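Each learned rule is just a conjunction of threshold tests over ClassAd attributes, so it translates directly into a predicate. The sketch below encodes one rule from run 2 above; this is an illustrative translation, not code from the actual system:

```python
# Illustrative translation of one learned rule:
#   exit1 :- Memory > 1930, Disk < 541186
# predicts failure (exit 1) on large-memory, small-disk machines;
# everything else falls to the default class (exit 0).

def predict_exit(machine_ad):
    if machine_ad["Memory"] > 1930 and machine_ad["Disk"] < 541186:
        return 1  # rule fires: job predicted to fail here
    return 0      # default class

print(predict_exit({"Memory": 2048, "Disk": 500000}))   # 1
print(predict_exit({"Memory": 498, "Disk": 19072712}))  # 0
```

Read in this direction, the rules double as placement advice: a scheduler (or user) could steer jobs away from machines where the predicate fires.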
Slide 21: Unexpected Discoveries
- Purdue Teragrid (91,343 jobs on 2,523 CPUs)
  - Jobs fail on machines with Memory > 1920 MB.
  - Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
- UND & UW (4,005 jobs on 1,460 CPUs)
  - Jobs fail on machines with less than 4 MB of disk.
  - Diagnosis: Condor failed in an unusual way when the job transfers input files that don't fit.
Slide 22: Many Open Problems
- Strengths and weaknesses of the approach:
  - Correlation != causation -> could be enough?
  - Limits of reported data -> increase resolution?
  - Not enough data points -> direct job placement?
- Acting on information:
  - Steering by the end user.
  - Applying learned rules back to the system.
  - Evaluating (and sometimes abandoning) changes.
  - Creating tools that assist with digging deeper.
- Data mining research:
  - Continuous intake; incremental construction.
  - Creating results that non-specialists can understand.
Slide 23: Acknowledgements
- Dr. Thain (University of Notre Dame)
  - Local Condor expert
  - Use of some slides for this presentation
- Cooperative Computing Lab
  - Maintains/improves the local Condor pool
  - Provides computing resources
Slide 24: Condor-Related Publications
- D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," HPDC-15, June 2006.
- N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," AAAI Workshop on the Evaluation Methods in Machine Learning, July 2006.
- N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under revision.
Slide 25: Questions?