Grid Computing in Data Mining and Data Mining on Grid Computing

1
Grid Computing in Data Mining and Data Mining
on Grid Computing
  • David Cieslak (dcieslak_at_cse.nd.edu)
  • Advisor: Nitesh Chawla (nchawla_at_cse.nd.edu)
  • University of Notre Dame

2
Grid Computing in Data Mining
  • How you help me

3
Data Mining Primer
  • Data Mining: "The non-trivial process of
    identifying valid, novel, potentially useful, and
    ultimately understandable patterns in data."
    (Fayyad, Piatetsky-Shapiro & Smyth, 1996)
  • Classifier: A learning algorithm which trains a
    predictive model from data
  • Ensemble: A set of classifiers working together
    to improve prediction

4
Applications of Data Mining
  • Network Intrusion Detection
  • Categorizing Adult Income
  • Finding Calcifications in Mammography
  • Looking for Oil Spills
  • Identifying Handwritten Digits
  • Predicting Job Failure on a Computing Grid
  • Anticipating Successful Companies

5
Condor Makes DM Tractable
  • I use a small set of algorithms in high volume
  • Ex: Run the same classifier on many datasets
  • A single data mining operation may have easily
    parallelized segments
  • Ex: Learn an ensemble of 10 classifiers on a
    dataset
  • Introducing simple parallelism into data mining
    saves significant time

6
Common DM Task: 10-Fold CV
  • Original Data: Network Traffic Dataset (30 MB)
7
Common DM Task: 10-Fold CV
  • Original Data: 30 MB
  • 10 Training Folds: 27 MB each
  • 10 Testing Folds: 3 MB each
8
Common DM Task: 10-Fold CV
  • Training Fold i (27 MB) goes to the Learning
    Algorithm (RIPPER) to train the classifier
  • Testing Fold i (3 MB) is used to evaluate the
    classifier
  • Timings shown on the diagram: < 1 min and 2 hours
9
Common DM Task: 10-Fold CV
  • Average and aggregate various statistics and
    measures across folds
  • Per fold: 27 MB training data, 3 MB testing data
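The 10-fold procedure on the preceding slides can be sketched in a few lines of Python. This is only an illustration: the majority-class "learner" stands in for RIPPER, and the toy labels, function names, and seed are assumptions that do not come from the presentation.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and deal them into k roughly equal test folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_majority(labels):
    """Stand-in learner (not RIPPER): always predict the most common label."""
    return Counter(labels).most_common(1)[0][0]


def cross_validate(labels, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, then average accuracy."""
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        prediction = train_majority(train_labels)
        correct = sum(1 for i in test_idx if labels[i] == prediction)
        accuracies.append(correct / len(test_idx))
    return accuracies, mean(accuracies)


# Toy, hypothetical labels: 90% of rows in one class.
labels = ["normal"] * 90 + ["attack"] * 10
per_fold, average = cross_validate(labels)
```

Each pass through the loop is independent of the others, which is what makes the task a natural fit for a pool of machines.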
10
Using Condor on 10 Folds
  • Local Host (5 mins): Splits Data; Upload Data and
    Task to Pool
  • Condor Pool (2 Hours): Learn Classifier; Evaluate
    Classifier; Return results
  • Local Host (5 mins): Receive Results;
    Aggregate/Average
  • If there is 1 Hour Learn/Eval time, Condor saves
    up to 18 hours in real time
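As a rough check on the savings, the arithmetic can be worked through with the figures shown on this slide (5 mins, 2 Hours, 5 mins). The sketch assumes all 10 folds run concurrently in the pool and that the two 5-minute figures are local-host overhead before and after the pool phase; both are readings of the diagram, not statements from the presenter.

```python
def serial_hours(folds, hours_per_fold):
    """Wall-clock time when one machine processes the folds back to back."""
    return folds * hours_per_fold


def pooled_hours(hours_per_fold, overhead_minutes):
    """Wall-clock time when every fold runs at once, plus local overhead."""
    return hours_per_fold + overhead_minutes / 60


folds = 10
serial = serial_hours(folds, 2.0)     # 10 folds x 2 hours each
pooled = pooled_hours(2.0, 5 + 5)     # 2 hours in the pool + 10 minutes locally
saved = serial - pooled               # a little under 18 hours
```

Under these assumptions the saving comes out just below 18 hours, in line with the "up to 18 hours" figure.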

11
A More Complex DM Task
  • Over/Under Sampling Wrapper
  • Split data into 50 folds (single)
  • Generate 10 undersamplings and 20
    oversamplings per fold (pool)
  • Learn classifier on each undersampling
    (pool)
  • Evaluate and select best undersampling
    (single)
  • Learn classifier combining best undersampling
    with each oversampling (pool)
  • Evaluate best combination (single)
  • Obtain results on test folds (pool)
  • Aggregate/Average results (single)
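The slide does not say how the samplings are generated. As a minimal sketch, assuming plain random undersampling of the majority class and random oversampling (with replacement) of the minority class, the candidate sets for one fold could be produced like this. The sampling levels, class sizes, and function names are hypothetical.

```python
import random


def undersample(majority, fraction, rng):
    """Keep a random fraction of the majority-class rows (no replacement)."""
    keep = max(1, round(len(majority) * fraction))
    return rng.sample(majority, keep)


def oversample(minority, factor, rng):
    """Grow the minority class by drawing extra rows with replacement."""
    extra = round(len(minority) * (factor - 1))
    return minority + rng.choices(minority, k=extra)


rng = random.Random(0)
majority = [("neg", i) for i in range(100)]   # toy rows for one fold
minority = [("pos", i) for i in range(10)]

under_levels = [i / 10 for i in range(1, 11)]     # 10 hypothetical levels
over_levels = [1 + i / 2 for i in range(1, 21)]   # 20 hypothetical levels

candidates_under = [undersample(majority, f, rng) for f in under_levels]
candidates_over = [oversample(minority, f, rng) for f in over_levels]
```

Each candidate set can be trained on independently, which is why the steps marked (pool) parallelize well while the selection steps marked (single) stay on one machine.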

12
Condor Speed-Ups & Usage
  • 10-Fold CV Evaluation
  • Single Machine: roughly one day
  • Using Condor: under one hour
  • Over/Under Sampling Wrapper
  • Single Machine: days to weeks
  • Using Condor: under a day
  • In 2006, I used 471,126 CPU hours via Condor
  • I am slacking in 2007: 13,235 CPU hours

13
A Data Miner's Wishlist
  • User specifies task to system
  • Outlines serial task phases
  • System smartly divides labor
  • What is the logical task granule? It depends on:
  • Condor Pool Performance
  • Upload/download latency
  • Data size
  • Algorithm Complexity
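One way to picture the granule question is a toy cost model that trades per-job transfer overhead against the number of machines available. Everything here is an assumption made for illustration: the model, its parameters, and the example numbers are not from the presentation, and a real pool would also need to account for queueing and evictions.

```python
import math


def makespan_minutes(n_units, units_per_job, unit_minutes,
                     overhead_minutes, workers):
    """Estimate wall-clock time when work units are bundled into jobs.

    Jobs run in waves of at most `workers` at a time, and each job pays a
    fixed upload/download overhead plus the compute time of its units.
    """
    jobs = math.ceil(n_units / units_per_job)
    waves = math.ceil(jobs / workers)
    return waves * (overhead_minutes + units_per_job * unit_minutes)


def best_granule(n_units, unit_minutes, overhead_minutes, workers):
    """Return the bundle size with the smallest estimated makespan."""
    return min(
        range(1, n_units + 1),
        key=lambda g: makespan_minutes(
            n_units, g, unit_minutes, overhead_minutes, workers
        ),
    )


# Hypothetical workload: 500 units of 2 minutes each, 5 minutes of
# transfer overhead per job, 100 machines in the pool.
granule = best_granule(500, 2, 5, 100)
```

With these made-up numbers, bundling five units per job fills the pool in a single wave, which beats both one unit per job (too much overhead) and larger bundles (idle machines).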

14
Data Mining on Grid Computing
  • How I help you

15
It's Ugly in the Real World
  • Machine related failures
  • Power outages, network outages, faulty memory,
    corrupted file system, bad config files, expired
    certs, packet filters...
  • Job related failures
  • Crash on some args, bad executable, missing input
    files, mistake in args, missing components,
    failure to understand dependencies...
  • Incompatibilities between jobs and machines
  • Missing libraries, not enough disk/cpu/mem, wrong
    software installed, wrong version installed,
    wrong memory layout...
  • Load related failures
  • Slow actions induce timeouts; kernel tables:
    files, sockets, procs; router tables: addresses,
    routes, connections; competition with other
    users...
  • Non-deterministic failures
  • Multi-thread/CPU synchronization, event
    interleaving across systems, random number
    generators, interactive effects, cosmic rays...

16
A Grand Challenge Problem
  • A user submits one million jobs to the grid.
  • Half of them fail.
  • Now what?
  • Examine the output of every failed job?
  • Log in to every site to examine the logs?
  • Resubmit and hope for the best?
  • We need some way of getting the big picture.
  • Need to identify problems not seen before.

17
An Idea
  • We have lots of structured information about the
    components of a grid.
  • Can we perform some form of data mining to
    discover the big picture of what is going on?
  • User: Your jobs work fine on RH Linux 12.1 and
    12.3 but they always seem to crash on version
    12.2.
  • Admin: User joe is running 1000s of jobs that
    transfer 10 TB of data and fail immediately;
    perhaps he needs help.
  • Can we act on this information to improve the
    system?
  • User: Avoid resources that are not working for
    you.
  • Admin: Assist the user in understanding and
    fixing the problem.

18
Job ClassAd
    MyType = "Job"
    TargetType = "Machine"
    ClusterId = 11839
    QDate = 1150231068
    CompletionDate = 0
    Owner = "dcieslak"
    JobUniverse = 5
    Cmd = "ripper-cost-can-9-50.sh"
    LocalUserCpu = 0.000000
    LocalSysCpu = 0.000000
    ExitStatus = 0
    ImageSize = 40000
    DiskUsage = 110000
    NumCkpts = 0
    NumRestarts = 0
    NumSystemHolds = 0
    CommittedTime = 0
    ExitBySignal = FALSE
    PoolName = "ccl00.cse.nd.edu"
    CondorVersion = "6.7.19 May 10 2006"
Machine ClassAd
    MyType = "Machine"
    TargetType = "Job"
    Name = "ccl00.cse.nd.edu"
    CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
    MachineGroup = "ccl"
    MachineOwner = "dthain"
    CondorVersion = "6.7.19 May 10 2006"
    CondorPlatform = "I386-LINUX_RH9"
    VirtualMachineID = 1
    ExecutableSize = 20000
    JobUniverse = 1
    NiceUser = FALSE
    VirtualMemory = 962948
    Memory = 498
    Cpus = 1
    Disk = 19072712
    CondorLoadAvg = 1.000000
    LoadAvg = 1.130000
User Job Log
    Job 1 submitted.
    Job 2 submitted.
    Job 1 placed on ccl00.cse.nd.edu
    Job 1 evicted.
    Job 1 placed on smarty.cse.nd.edu.
    Job 1 completed.
    Job 2 placed on dvorak.helios.nd.edu
    Job 2 suspended
    Job 2 resumed
    Job 2 exited normally with status 1.
    ...
19
Failure Criteria
  • exit != 0
  • core dump
  • evicted
  • suspended
  • bad output
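The failure criteria on this slide can be turned into a labeling function for training data. The sketch below is an assumption about how such labeling might look; the function name, the event strings, and the `output_ok` flag are hypothetical, and the two example jobs only loosely mirror the user job log from the earlier slide (the exit status of the first job is assumed to be 0).

```python
# Events that count as a failure on their own, per the slide's criteria.
FAILURE_EVENTS = {"core dump", "evicted", "suspended"}


def is_failure(exit_status, events, output_ok=True):
    """Label a job as failed if any of the slide's criteria applies."""
    if exit_status != 0:
        return True
    if FAILURE_EVENTS & set(events):
        return True
    return not output_ok


# Hypothetical encodings of three jobs.
job1 = is_failure(0, ["placed", "evicted", "placed", "completed"])
job2 = is_failure(1, ["placed", "suspended", "resumed", "exited"])
clean = is_failure(0, ["placed", "completed"])
```

Under these criteria a job that was evicted once counts as a failure even if it later completed, so the labels describe trouble encountered along the way rather than only the final outcome.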
20
  • ------------------------- run 1 -------------------------
  • Hypothesis:
  • exit=1 :- Memory>=1930, JobStart>=1.14626e+09,
    MonitorSelfTime>=1.14626e+09 (491/377).
  • exit=1 :- Memory>=1930, Disk<=555320 (1670/1639).
  • default exit=0 (11904/4503).
  • Error rate on holdout data is 30.9852%
  • Running average of error rate is 30.9852%
  • ------------------------- run 2 -------------------------
  • Hypothesis:
  • exit=1 :- Memory>=1930, Disk<=541186
    (2076/1812).
  • default exit=0 (12090/4606).
  • Error rate on holdout data is 31.8791%
  • Running average of error rate is 31.4322%
  • ------------------------- run 3 -------------------------
  • Hypothesis:
  • exit=1 :- Memory>=1930,
    MonitorSelfImageSize>=8.844e+09 (1270/1050).
  • exit=1 :- Memory>=1930, KeyboardIdle>=815995
    (793/763).
  • exit=1 :- Memory>=1927,
    EnteredCurrentState<=1.14625e+09,
    VirtualMemory>=2.09646e+06, LoadAvg>=30000,
    LastBenchmark<=1.14623e+09,
    MonitorSelfImageSize<=7.836e+09 (94/84).
  • exit=1 :- Memory>=1927, TotalLoadAvg<=1.43e+06,
    UpdatesTotal<=8069, LastBenchmark<=1.14619e+09,
    UpdatesLost<=1 (77/61).
  • default exit=0 (11940/4452).
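To show how a learned rule might be applied back to the system, here is a minimal sketch that checks the run 2 rule against a machine's attributes. The transcript preserves the comparisons only as "gt" and "lt", so reading them as >= and <= is an assumption, as are the function name and the second example machine; the first machine uses the Memory and Disk values from the Machine ClassAd slide.

```python
import operator

# Comparison operators assumed for the rule conditions.
OPS = {">=": operator.ge, "<=": operator.le}

# Run 2 rule: predict exit=1 when Memory>=1930 and Disk<=541186.
RULE_RUN2 = [("Memory", ">=", 1930), ("Disk", "<=", 541186)]


def predict_exit(machine_ad, rule=RULE_RUN2):
    """Return 1 if every condition holds for this machine, else default 0."""
    matched = all(
        OPS[op](machine_ad[attr], value) for attr, op, value in rule
    )
    return 1 if matched else 0


ccl00 = {"Memory": 498, "Disk": 19072712}        # values from the ClassAd slide
large_mem_small_disk = {"Memory": 2048, "Disk": 500000}  # hypothetical machine

prediction_ccl00 = predict_exit(ccl00)
prediction_other = predict_exit(large_mem_small_disk)
```

A scheduler or user could use such a check to steer jobs away from machines a rule flags, while keeping in mind that the rule captures a correlation and not necessarily the cause.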

21
Unexpected Discoveries
  • Purdue Teragrid (91343 jobs on 2523 CPUs)
  • Jobs fail on machines with (Memory > 1920 MB)
  • Diagnosis: Linux machines with > 3 GB have a
    different memory layout that breaks some programs
    that do inappropriate pointer arithmetic.
  • UND & UW (4005 jobs on 1460 CPUs)
  • Jobs fail on machines with less than 4MB disk.
  • Diagnosis: Condor failed in an unusual way when
    the job transfers input files that don't fit.

22
Many Open Problems
  • Strengths and Weaknesses of Approach
  • Correlation != Causation -> could be enough?
  • Limits of reported data -> increase resolution?
  • Not enough data points -> direct job placement?
  • Acting on Information
  • Steering by the end user.
  • Applying learned rules back to the system.
  • Evaluating (and sometimes abandoning) changes.
  • Creating tools that assist with digging deeper.
  • Data Mining Research
  • Continuous intake and incremental construction.
  • Creating results that non-specialists can
    understand.

23
Acknowledgements
  • Dr. Thain (University of Notre Dame)
  • Local Condor expert
  • Use of some slides for this presentation
  • Cooperative Computing Lab
  • Maintain/Improve local Condor Pool
  • Provide computing resources

24
Condor Related Publications
  • D. Cieslak, D. Thain, N. Chawla, "Troubleshooting
    Distributed Systems via Data Mining," (HPDC-15),
    June 2006
  • N. Chawla, D. Cieslak, "Evaluating Calibration of
    Probability Estimation Trees," AAAI Workshop on
    the Evaluation Methods in Machine Learning, July
    2006
  • N. Chawla, D. Cieslak, L. Hall, A. Joshi,
    "Killing Two Birds with One Stone: Countering
    Cost and Imbalance," Data Mining and Knowledge
    Discovery, Under Revision

25
Questions?