Job/Data Co-Location with Condor at UWM


1
Job/Data Co-Location with Condor at UWM
  • W. Guan, B. Mellado, German Montoya, Sau Lan Wu,
    Neng Xu (University of Wisconsin-Madison)

2
Why we use Condor DAGMan for data analysis
  • Users can easily reuse their old code.
  • We already had Condor installed, with the whole
    Condor team as our technical backup.
  • Users can put lots of analysis jobs in the queue.
    A multi-user environment is easy to maintain.
  • Finding file locations is faster than with Xrootd,
    especially for large datasets (more than 20,000
    files).
  • Users can use their own merge methods. The merge
    method in PROOF-batch is also not working yet
    (Gerri is still working on it).

3
Analysis jobs with Condor DAGMan
  • The steps (diagram with numbered steps 1 to 7;
    components: Submitter, Database, the Xrootd pool)
  • To analyze dataset1
  • Files in dataset1:
    Host1: /atlas/xrootd/file1
    Host2: /atlas/xrootd/file2
    Host3: /atlas/xrootd/file3
    Host1: /atlas/xrootd/file4
  • DAGMan jobs:
    Job1: Host1 /atlas/xrootd/file1
    Job2: Host2 /atlas/xrootd/file2
    Job3: Host3 /atlas/xrootd/file3
  • The output HIST files are sent back by Condor.
  • The Merge job runs locally and puts the output in
    the local xrootd pool, e.g.
    root://higgs10.cs.wisc.edu//atlas/output/file.HIST.root
    Users can open it directly with ROOT.
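
The steps on this slide can be sketched in Python as follows. This is only an illustration under assumptions, not the scripts used at UWM: it creates one job per file (the slide itself lists three jobs), pins each job to the host holding its file through a `requirements` expression on the `Machine` attribute, and adds a merge node that runs after all analysis jobs. The names `analyze.sh`, `merge.sub`, `jobN.sub` and `outN.HIST.root` are hypothetical.

```python
# File list for dataset1 as shown on the slide: (host, path) pairs.
DATASET1 = [
    ("Host1", "/atlas/xrootd/file1"),
    ("Host2", "/atlas/xrootd/file2"),
    ("Host3", "/atlas/xrootd/file3"),
    ("Host1", "/atlas/xrootd/file4"),
]


def submit_description(host, path, index):
    """Text of a submit description for one analysis job.

    The job is pinned to the host that holds the file. analyze.sh and the
    output file name are hypothetical placeholders.
    """
    return "\n".join([
        "universe = vanilla",
        "executable = analyze.sh",
        f"arguments = {path}",
        f'requirements = (Machine == "{host}")',
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
        f"transfer_output_files = out{index}.HIST.root",
        f"log = job{index}.log",
        "queue",
    ]) + "\n"


def dag_description(files):
    """Text of a DAG file: one node per file, then a merge node as their child."""
    names = [f"Job{i}" for i in range(1, len(files) + 1)]
    lines = [f"JOB {name} {name.lower()}.sub" for name in names]
    lines.append("JOB Merge merge.sub")  # merge.sub is a hypothetical local job
    lines.append(f"PARENT {' '.join(names)} CHILD Merge")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for i, (host, path) in enumerate(DATASET1, start=1):
        with open(f"job{i}.sub", "w") as fh:
            fh.write(submit_description(host, path, i))
    with open("dataset1.dag", "w") as fh:
        fh.write(dag_description(DATASET1))
```

The resulting `dataset1.dag` would then be submitted with `condor_submit_dag dataset1.dag`.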
4
Condor DAGMan for data analysis
  • Condor DAGMan
  • Based on the databases, users only work with
    datasets. DQ2 and UW_ls provide the file lists.
  • We made some optimizations to reduce the overhead
    of Condor job submission.
  • This method is good for any I/O-intensive task.
    Users can even run directly over the ESDs or AODs.
  • Users can put lots of jobs in the queue.
  • These DAGMan jobs run on a special fast
    queue. They suspend other normal batch jobs.
  • Condor takes care of job priorities among
    multiple users.

5
Some examples
  • dq2-ls-local
  • UW-ls

6
Some examples
  • Use dq2-ls-local inside PROOF
  • Submit a Condor DAGMan job

7
Problems with Condor DAGMan
  • Problem 1
  • Idle jobs: these jobs never get matched because
    the destination machine is down.
  • Held jobs: these jobs get held because the
    output didn't get produced. They stay in the
    queue forever.
  • With these two types of jobs, the DAG will never
    finish.
  • Problem 2
  • Performance is slow when there are too many jobs
    in the DAG.
  • Matching takes time.
  • Too many shadow processes are started if DAGMan
    releases too many jobs at once.

8
Solutions with Condor DAGMan
  • Problem 1
  • We use
  • PeriodicRemove = ((JobStatus == 1 && CurrentTime
    - EnteredCurrentStatus > 300) || (JobStatus == 5 &&
    CurrentTime - EnteredCurrentStatus > 60))
  • to remove those unfinished jobs. This is OK for
    held jobs but not for idle jobs. We check
    the machine status before creating the jobs.
  • Problem 2
  • We reduce the total number of jobs by processing
    multiple files in each job.
  • Less matching, fewer shadow processes, fewer output
    files.
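
Both solutions can be illustrated with a small Python sketch. It is not the code used at UWM. It assumes the removal expression on this slide means "idle (JobStatus 1) for more than 300 seconds, or held (JobStatus 5) for more than 60 seconds", and that a set of reachable hosts is already known. The function names, `hosts_up` and `files_per_job` are hypothetical.

```python
from collections import defaultdict

IDLE, HELD = 1, 5  # HTCondor JobStatus codes for idle and held jobs


def should_remove(job_status, entered_current_status, current_time):
    """Python mirror of the removal policy: idle > 300 s or held > 60 s."""
    waited = current_time - entered_current_status
    return (job_status == IDLE and waited > 300) or (
        job_status == HELD and waited > 60
    )


def plan_jobs(files, hosts_up, files_per_job):
    """Group files by host, skip hosts that are down, batch several files per job.

    files is a list of (host, path) pairs. Returns (jobs, skipped), where jobs
    is a list of (host, [paths]) and skipped holds the (host, path) pairs whose
    host is not in hosts_up.
    """
    by_host = defaultdict(list)
    skipped = []
    for host, path in files:
        if host in hosts_up:
            by_host[host].append(path)
        else:
            skipped.append((host, path))
    jobs = []
    for host, paths in by_host.items():
        # One job per chunk of files_per_job files on the same host.
        for start in range(0, len(paths), files_per_job):
            jobs.append((host, paths[start:start + files_per_job]))
    return jobs, skipped


if __name__ == "__main__":
    files = [
        ("Host1", "/atlas/xrootd/file1"),
        ("Host2", "/atlas/xrootd/file2"),
        ("Host3", "/atlas/xrootd/file3"),
        ("Host1", "/atlas/xrootd/file4"),
    ]
    jobs, skipped = plan_jobs(files, hosts_up={"Host1", "Host2"}, files_per_job=2)
    print(jobs)     # two jobs instead of three: both Host1 files share one job
    print(skipped)  # the Host3 file, because Host3 is treated as down
```

Checking host status before the jobs are created keeps jobs that could never match out of the DAG, and batching files per host reduces the number of matches and shadow processes.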