Title: Grid Computing in Data Mining and Data Mining on Grid Computing
Slide 1: Grid Computing in Data Mining and Data Mining on Grid Computing
- David Cieslak (dcieslak_at_cse.nd.edu)
- Advisor: Nitesh Chawla (nchawla_at_cse.nd.edu)
- University of Notre Dame
Slide 2: Grid Computing in Data Mining
Slide 3: Data Mining Primer
- Data Mining: "The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." (Fayyad, Piatetsky-Shapiro & Smyth, 1996)
- Classifier: a learning algorithm which trains a predictive model from data
- Ensemble: a set of classifiers working together to improve prediction
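The ensemble idea above can be sketched in a few lines: combine several weak classifiers by majority vote. This is a minimal illustration in plain Python (the toy threshold "classifiers" are invented for the example, not from the slides):

```python
# Minimal sketch of an ensemble via majority vote.
from collections import Counter

def ensemble_predict(classifiers, x):
    """Each classifier is any callable x -> label; the ensemble
    returns the most common label among their predictions."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy "classifiers" that threshold a single feature differently.
clfs = [lambda x: int(x > 0.3), lambda x: int(x > 0.5), lambda x: int(x > 0.7)]
print(ensemble_predict(clfs, 0.6))  # two of three vote 1, so the ensemble says 1
```

Even when each individual threshold is unreliable, the vote smooths out their disagreements, which is the point of ensembles.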
Slide 4: Applications of Data Mining
- Network Intrusion Detection
- Categorizing Adult Income
- Finding Calcifications in Mammography
- Looking for Oil Spills
- Identifying Handwritten Digits
- Predicting Job Failure on a Computing Grid
- Anticipating Successful Companies
Slide 5: Condor Makes DM Tractable
- I use a small set of algorithms in high volume
  - Ex: run the same classifier on many datasets
- A single data mining operation may have easily parallelized segments
  - Ex: learn an ensemble of 10 classifiers on a dataset
- Introducing simple parallelism into data mining saves significant time
Slide 6: Common DM Task: 10-Fold CV
[Diagram: the original data, a 30 MB network traffic dataset]
Slide 7: Common DM Task: 10-Fold CV
[Diagram: the original 30 MB dataset is split into 10 training folds (27 MB each) and 10 testing folds (3 MB each)]
Slide 8: Common DM Task: 10-Fold CV
[Diagram: for each fold i, a learning algorithm (RIPPER) trains a classifier on the 27 MB training fold (about 2 hours), then evaluates it on the 3 MB testing fold (under 1 minute)]
Slide 9: Common DM Task: 10-Fold CV
- Average and aggregate various statistics and measures across folds
[Diagram: per-fold results from the 27 MB training / 3 MB testing splits feed into the aggregation]
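The fold-splitting step in slides 6-9 can be sketched without any libraries: partition the sample indices into 10 disjoint test folds, with each training fold being everything else (so 90% train / 10% test, matching the 27 MB / 3 MB split above). This is an illustrative sketch, not the deck's actual tooling:

```python
# Sketch of 10-fold cross-validation index splitting (plain Python).

def kfold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold CV."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    for i in range(k):
        # Last fold absorbs any remainder so every sample is tested once.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

folds = list(kfold_indices(100, k=10))
print(len(folds))        # 10 folds
print(len(folds[0][1]))  # each test fold holds 10 of the 100 samples
```

A real run would shuffle (or stratify) the indices first; the deterministic split here just keeps the example short.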
Slide 10: Using Condor on 10 Folds
- Local host (about 5 mins):
  - Splits data
  - Uploads data and task to the pool
- Condor pool (about 2 hours, folds run in parallel):
  - Learn classifier
  - Evaluate classifier
  - Return results
- Local host (about 5 mins):
  - Receive results
  - Aggregate/average
- If there is 1 hour of learn/eval time per fold, Condor saves up to 18 hours of real time
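The "upload data and task to the pool" step above is typically a Condor submit file that queues one job per fold. The following is a hedged sketch only; the file names, script name, and fold layout are illustrative, not taken from the slides:

```text
# Hypothetical Condor submit file: one job per CV fold.
universe   = vanilla
executable = ripper-fold.sh
arguments  = $(Process)

# Ship fold i's training and testing data with job i.
transfer_input_files   = train_fold_$(Process).dat, test_fold_$(Process).dat
should_transfer_files  = YES
when_to_transfer_output = ON_EXIT

output = fold_$(Process).out
error  = fold_$(Process).err
log    = folds.log

queue 10
```

`queue 10` creates ten jobs whose `$(Process)` macro expands to 0-9, so each fold's learn/eval runs concurrently across the pool.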
Slide 11: A More Complex DM Task
- Over/under-sampling wrapper:
  - Split data into 50 folds (single)
  - Generate 10 undersamplings and 20 oversamplings per fold (pool)
  - Learn a classifier on each undersampling (pool)
  - Evaluate and select the best undersampling (single)
  - Learn a classifier combining the best undersampling with each oversampling (pool)
  - Evaluate the best combination (single)
  - Obtain results on test folds (pool)
  - Aggregate/average results (single)
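The wrapper above alternates serial "single" phases with parallel "pool" phases. One such alternation can be sketched as a map over candidate sampling rates followed by a serial selection; the `train_and_score` stand-in below is invented for illustration (a real run would dispatch each sampling to the Condor pool and invoke RIPPER):

```python
# Sketch of one pool/single alternation from the sampling wrapper.

def train_and_score(rate):
    """Stand-in for learning a classifier on one undersampling and
    scoring it; a toy score that peaks at a 50% sampling rate."""
    return 1.0 - abs(rate - 0.5)

under_rates = [i / 10 for i in range(1, 11)]         # 10 undersamplings
scores = [train_and_score(r) for r in under_rates]   # "pool" phase: parallel
best_under = under_rates[scores.index(max(scores))]  # "single" phase: select best
print(best_under)
```

The same pattern repeats for the oversamplings: fan work out to the pool, then do the cheap comparison serially on the local host.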
Slide 12: Condor Speed-Ups: Usage
- 10-fold CV evaluation:
  - Single machine: roughly one day
  - Using Condor: under one hour
- Over/under-sampling wrapper:
  - Single machine: days to weeks
  - Using Condor: under a day
- In 2006, I used 471,126 CPU hours via Condor
- I am slacking in 2007: 13,235 CPU hours
Slide 13: A Data Miner's Wishlist
- User specifies the task to the system
  - Outlines serial task phases
- System smartly divides the labor
- What is the logical task granule, based on:
  - Condor pool performance
  - Upload/download latency
  - Data size
  - Algorithm complexity
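Choosing the "logical task granule" amounts to balancing per-job overhead (scheduling plus data transfer) against compute time. The heuristic below is a hypothetical sketch of one way such a system might pick a batch size; none of it comes from the slides' system, and the 5% overhead target is an arbitrary assumption:

```python
# Hypothetical granule-size heuristic: batch enough tasks per job that
# per-job overhead stays below a target fraction of total runtime.
import math

def tasks_per_job(task_seconds, overhead_seconds, max_overhead_frac=0.05):
    """Smallest batch size n such that
    overhead / (overhead + n * task_seconds) <= max_overhead_frac."""
    n = math.ceil(overhead_seconds * (1 - max_overhead_frac)
                  / (max_overhead_frac * task_seconds))
    return max(1, n)

# 10-second tasks with 60 s of per-job overhead: batch 114 tasks per job.
print(tasks_per_job(10, 60))
# 1-hour tasks amortize the same overhead in a single task per job.
print(tasks_per_job(3600, 60))
```

A fuller version would also cap the batch by data size and by how many pool slots are free, per the wishlist above.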
Slide 14: Data Mining on Grid Computing
Slide 15: It's Ugly in the Real World
- Machine-related failures
  - Power outages, network outages, faulty memory, corrupted file systems, bad config files, expired certs, packet filters...
- Job-related failures
  - Crash on some args, bad executable, missing input files, mistakes in args, missing components, failure to understand dependencies...
- Incompatibilities between jobs and machines
  - Missing libraries; not enough disk/CPU/memory; wrong software installed, wrong version installed, wrong memory layout...
- Load-related failures
  - Slow actions induce timeouts; kernel tables (files, sockets, procs); router tables (addresses, routes, connections); competition with other users...
- Non-deterministic failures
  - Multi-thread/CPU synchronization, event interleaving across systems, random number generators, interactive effects, cosmic rays...
Slide 16: A Grand Challenge Problem
- A user submits one million jobs to the grid.
- Half of them fail.
- Now what?
  - Examine the output of every failed job?
  - Log in to every site to examine the logs?
  - Resubmit and hope for the best?
- We need some way of getting the big picture.
- We need to identify problems not seen before.
Slide 17: An Idea
- We have lots of structured information about the components of a grid.
- Can we perform some form of data mining to discover the big picture of what is going on?
  - User: "Your jobs work fine on RH Linux 12.1 and 12.3, but they always seem to crash on version 12.2."
  - Admin: "User joe is running 1000s of jobs that transfer 10 TB of data and fail immediately; perhaps he needs help."
- Can we act on this information to improve the system?
  - User: avoid resources that are not working for you.
  - Admin: assist the user in understanding and fixing the problem.
Slide 18: Job ClassAd, Machine ClassAd, User Job Log

Job ClassAd:
MyType = "Job"
TargetType = "Machine"
ClusterId = 11839
QDate = 1150231068
CompletionDate = 0
Owner = "dcieslak"
JobUniverse = 5
Cmd = "ripper-cost-can-9-50.sh"
LocalUserCpu = 0.000000
LocalSysCpu = 0.000000
ExitStatus = 0
ImageSize = 40000
DiskUsage = 110000
NumCkpts = 0
NumRestarts = 0
NumSystemHolds = 0
CommittedTime = 0
ExitBySignal = FALSE
PoolName = "ccl00.cse.nd.edu"
CondorVersion = "6.7.19 May 10 2006"

Machine ClassAd:
MyType = "Machine"
TargetType = "Job"
Name = "ccl00.cse.nd.edu"
CpuBusy = ((LoadAvg - CondorLoadAvg) > 0.500000)
MachineGroup = "ccl"
MachineOwner = "dthain"
CondorVersion = "6.7.19 May 10 2006"
CondorPlatform = "I386-LINUX_RH9"
VirtualMachineID = 1
ExecutableSize = 20000
JobUniverse = 1
NiceUser = FALSE
VirtualMemory = 962948
Memory = 498
Cpus = 1
Disk = 19072712
CondorLoadAvg = 1.000000
LoadAvg = 1.130000

User Job Log:
Job 1 submitted.
Job 2 submitted.
Job 1 placed on ccl00.cse.nd.edu.
Job 1 evicted.
Job 1 placed on smarty.cse.nd.edu.
Job 1 completed.
Job 2 placed on dvorak.helios.nd.edu.
Job 2 suspended.
Job 2 resumed.
Job 2 exited normally with status 1.
...
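Before any mining can happen, ClassAds like the ones on this slide have to be turned into feature vectors. A minimal sketch of that first step follows; real ClassAds have a much richer expression language (the `CpuBusy` expression above, for instance), so this only handles the flat `name = value` attributes shown here:

```python
# Hedged sketch: parse a simplified ClassAd (one `name = value` per
# line) into a Python dict usable as a feature record.

def parse_classad(text):
    ad = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[name] = value.strip('"')   # string attribute
        else:
            try:
                ad[name] = float(value)   # numeric attribute
            except ValueError:
                ad[name] = value          # expressions etc. left as text
    return ad

ad = parse_classad('MyType = "Machine"\nMemory = 498\nNiceUser = FALSE')
print(ad["Memory"])  # 498.0
```

Joining each job's ad with the ad of the machine it ran on, plus the job-log outcome, yields the labeled examples the next slides mine.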
Slide 19: Failure Criteria
- exit != 0
- core dump
- evicted
- suspended
- bad output
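These criteria define the class label for mining. A labeling function over a job record might look like the sketch below; the field names are illustrative stand-ins, not actual ClassAd attribute names:

```python
# Hypothetical labeling step: mark a job record as a failure if any of
# the slide's criteria hold (nonzero exit, core dump, eviction,
# suspension, or bad output). Field names are invented for illustration.

def is_failure(job):
    return (job.get("exit_status", 0) != 0
            or job.get("core_dump", False)
            or job.get("evicted", False)
            or job.get("suspended", False)
            or job.get("bad_output", False))

print(is_failure({"exit_status": 1}))                    # True
print(is_failure({"exit_status": 0, "evicted": True}))   # True
print(is_failure({"exit_status": 0}))                    # False
```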
Slide 20: Learned Rules

------------------------- run 1 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, JobStart > 1.14626e09, MonitorSelfTime > 1.14626e09 (491/377)
  - exit1 :- Memory > 1930, Disk < 555320 (1670/1639)
  - default exit0 (11904/4503)
- Error rate on holdout data is 30.9852
- Running average of error rate is 30.9852

------------------------- run 2 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, Disk < 541186 (2076/1812)
  - default exit0 (12090/4606)
- Error rate on holdout data is 31.8791
- Running average of error rate is 31.4322

------------------------- run 3 -------------------------
- Hypothesis:
  - exit1 :- Memory > 1930, MonitorSelfImageSize > 8.844e09 (1270/1050)
  - exit1 :- Memory > 1930, KeyboardIdle > 815995 (793/763)
  - exit1 :- Memory > 1927, EnteredCurrentState < 1.14625e09, VirtualMemory > 2.09646e06, LoadAvg > 30000, LastBenchmark < 1.14623e09, MonitorSelfImageSize < 7.836e09 (94/84)
  - exit1 :- Memory > 1927, TotalLoadAvg < 1.43e06, UpdatesTotal < 8069, LastBenchmark < 1.14619e09, UpdatesLost < 1 (77/61)
  - default exit0 (11940/4452)
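Each learned rule is just a conjunction of threshold tests over ClassAd attributes, so it translates directly into a predicate. The sketch below encodes one rule from run 2 above; this is an illustrative translation, not code from the actual system:

```python
# Illustrative translation of one learned rule:
#   exit1 :- Memory > 1930, Disk < 541186
# predicts failure (exit 1) on large-memory, small-disk machines;
# everything else falls to the default class (exit 0).

def predict_exit(machine_ad):
    if machine_ad["Memory"] > 1930 and machine_ad["Disk"] < 541186:
        return 1  # rule fires: job predicted to fail here
    return 0      # default class

print(predict_exit({"Memory": 2048, "Disk": 500000}))   # 1
print(predict_exit({"Memory": 498, "Disk": 19072712}))  # 0
```

Read in this direction, the rules double as placement advice: a scheduler (or user) could steer jobs away from machines where the predicate fires.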
Slide 21: Unexpected Discoveries
- Purdue Teragrid (91,343 jobs on 2,523 CPUs)
  - Jobs fail on machines with Memory > 1920 MB.
  - Diagnosis: Linux machines with > 3 GB have a different memory layout that breaks some programs that do inappropriate pointer arithmetic.
- UND & UW (4,005 jobs on 1,460 CPUs)
  - Jobs fail on machines with less than 4 MB of disk.
  - Diagnosis: Condor failed in an unusual way when the job transfers input files that don't fit.
Slide 22: Many Open Problems
- Strengths and weaknesses of the approach:
  - Correlation != causation -> could be enough?
  - Limits of reported data -> increase resolution?
  - Not enough data points -> direct job placement?
- Acting on information:
  - Steering by the end user.
  - Applying learned rules back to the system.
  - Evaluating (and sometimes abandoning) changes.
  - Creating tools that assist with digging deeper.
- Data mining research:
  - Continuous intake; incremental construction.
  - Creating results that non-specialists can understand.
Slide 23: Acknowledgements
- Dr. Thain (University of Notre Dame)
  - Local Condor expert
  - Use of some slides for this presentation
- Cooperative Computing Lab
  - Maintains/improves the local Condor pool
  - Provides computing resources
Slide 24: Condor-Related Publications
- D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," HPDC-15, June 2006.
- N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," AAAI Workshop on the Evaluation Methods in Machine Learning, July 2006.
- N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under revision.
Slide 25: Questions?