1
Long-Term Massive Production Runs on Alliance
Resources: Experience
  • Vladimir Litvin, Harvey Newman,
  • Sergei Shevchenko
  • Caltech CMS

2
Introduction
  • The Caltech Higgs diphoton decay channel study is in
    its second stage: 10M background events are being
    simulated with full detector simulation,
    reconstructed and analyzed.
  • 4.5M events have been completed since April 2003
  • Physics results have been presented at the Les
    Houches conference and reported at CMS weeks
    and in CMS notes.
  • In 2004-2005 a 50-100M event run on TeraGrid Alliance
    facilities is planned.

3
CPU Usage Summary
1 SU = 1 Node Hour
4
Analysis Chain
  • Production analysis chain (sketched below)
  • - cmsim125 and ORCA v6 (with ObjyDB) have been
      used
  • - The FORTRAN part can be run everywhere
  • - The ORCA C++ part was run on the Caltech pTier2
      only, due to ObjyDB RH6.x restrictions
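A minimal sketch of how one run chains the two stages together; the wrapper-script and log-file names are hypothetical, only cmsim125, ORCA v6 and the Objectivity/DB back end are from the slide:

    #!/bin/csh
    # Stage 1 (FORTRAN, can run on any farm): full detector simulation (cmsim125)
    ./run_cmsim125.csh >& cmsim.log    # hypothetical wrapper around cmsim125
    # Stage 2 (C++, Caltech pTier2 only because of the ObjyDB/RH6.x restriction):
    # ORCA v6 reconstruction writing into the Objectivity/DB federation
    ./run_orca6.csh >& orca.log        # hypothetical wrapper around the ORCA executables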

5

Operational Issues
  • There is a hierarchy of operational issues
  • We will concentrate mainly on the lowest
    (infrastructure) level: clusters, network, mass
    storage system and social issues
  • Even at this level there are a lot of open issues
  • 3-tier architecture

6
NCSA(I)
  • NCSA Platinum Cluster technical summary
  • IBM eServer x330 thin servers (dual-processor)
  • 1.5 GB ECC SDRAM (compute nodes)
  • Access nodes: 4 (8 CPUs)
  • Compute nodes: 484 (968 CPUs)
  • Storage nodes: 32 (64 CPUs)
  • Intel PIII 1 GHz, 256 kB full-speed L2 cache (peak
    performance 1 Gflop)
  • Network interconnect
  • Access nodes: Gigabit Ethernet
  • Compute nodes: Myrinet 2000
  • Disks: 10 GB local per node and 4 NFS-mounted file
    systems of 650 GB each

7
NCSA Statistics
  • NCSA per day SU usage
  • NCSA total usage

8
NCSA Statistics
  • NCSA fraction of idle (Q) and running (R) time
    per day

9
NCSA Statistics
  • NCSA fraction of idle (Q) and running (R) time vs
    number of jobs completed per day

10
NCSA (II)
  • Multiple runs should be submitted in one PBS job.
    Due to NCSA computing policy, PBS cannot allocate a
    single CPU: it always allocates a whole NODE. The
    same will be true on the future TeraGrid
  • The reason: two tasks from two different users could
    otherwise both request large amounts of RAM on the
    same node
  • While one CPU is allocated, the second CPU of the
    node sits idle anyway
  • The smaller the number of allocated nodes, the lower
    the priority of the PBS job
  • A large number of nodes allocated per PBS job is not
    good from the Caltech HPSS usage point of view
  • The "fair" maui policy is even more unfair

11
NCSA (III)
  • Job submission

    #!/bin/csh
    #PBS -q standard
    #PBS -N cmsim
    #PBS -l nodes=4:ppn=2:prod
    #PBS -l walltime=12:00:00
    set PBSHOST = `hostname`
    foreach node ( `cat $PBS_NODEFILE` )        # loop over all allocated nodes
        if ( $node == $PBSHOST ) then
            $GEN_EXE > out.file &               # run locally on the first node
        else
            ( ssh -a -x -q $node "$GEN_EXE > out.file" ) &   # start the run remotely
        endif
    end
    wait                                        # wait for the runs on all nodes
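Such a script is then handled with the standard PBS commands; a usage sketch (the job id shown is hypothetical):

    qsub cmsim.csh          # submit; PBS prints a job id such as 123456.platinum
    qstat -u $USER          # list our jobs and their states (Q = queued, R = running)
    qdel 123456             # remove a job that misbehaves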

12
NCSA (IV)
  • Walltime limit exceeded problem
  • - Huge LAN traffic from other computing nodes in
      the same segment
  • - Job was started incorrectly
  • Random order of running jobs: if 100 identical jobs
    were started, the first running job might be any of
    them, so it is hard to predict which chunk of events
    is already done
  • PBS MAXJOB limit was set to 50; jobs are killed by
    PBS when the number of jobs exceeds this limit (a
    throttled-submission sketch follows)
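One way to live with the MAXJOB cap is to submit the runs in throttled batches and only add jobs once the queue has drained; a minimal csh sketch, where the job-script name, the cmsim job name and the threshold of 40 are assumptions:

    #!/bin/csh
    # Submit 100 runs while keeping our queued+running jobs well under MAXJOB = 50
    foreach run ( `seq 1 100` )
        set njobs = `qstat -u $USER | grep -c cmsim`
        while ( $njobs >= 40 )
            sleep 600                            # wait before re-checking the queue
            set njobs = `qstat -u $USER | grep -c cmsim`
        end
        qsub -v RUN=$run cmsim.csh               # RUN tells the job which chunk to simulate
    end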

13
NCSA (V)
  • Blocked jobs (see the polling sketch below)
  • maui can start a job and, if it fails, the job is
    blocked and sits in the queue without any
    notification. Unblocking must be done by hand, by
    asking the support team to do so. There is no way to
    be notified by maui/PBS.
  • Accounting problems
  • PBS accounting cannot correctly calculate the CPU
    time used when ssh is used to start runs on the
    other allocated nodes
  • NCSA's own Sybase-based accounting system is wrong
    as well (it overestimates the CPU usage)
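Since maui gives no notification when it blocks a job, the queue has to be polled by hand or from cron; a minimal sketch (the mail address is a placeholder):

    #!/bin/csh
    # Look for silently blocked jobs so the support team can be asked to release them
    showq -b > blocked.txt                       # maui: list blocked jobs
    if ( ! -z blocked.txt ) then
        mail -s "blocked jobs on platinum" user@example.edu < blocked.txt
    endif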

14
NCSA (VI)
  • The NCSA custom-made usage utility overestimates the
    CPU hours used
  • The corrected PBS accounting is too crude for the
    estimate as well

15
Caltech pTier2 (I)
  • Server node
  • Dell PowerEdge 4400, 2 GB SDRAM
  • 3 RAID arrays, 3 TB in total
  • Computing nodes
  • 20 rack-mounted dual-CPU Intel PIII 800 MHz
  • 512 MB SDRAM at 133 MHz
  • 10 GB local disk
  • Additional (P4) compute nodes
  • Used for the official CMS Pre-Challenge Production
    (PCP)
  • Not part of the production discussed today

16
Caltech Tier2 (II)
  • Caltech Tier2 layout (only the PIII nodes are shown)

17
Caltech Tier2 Statistics
  • Caltech per day SU usage
  • Caltech total SU usage

18
Caltech Tier2 Ganglia monitor
  • Screenshot

19
Caltech pTier2 Statistics
  • Caltech Tier2 fraction of idle (Q) and running
    (R) time per day

20
Caltech pTier2 Statistics
  • Caltech Tier2 fraction of idle (Q) and running
    (R) time vs total number of jobs

21
Caltech pTier2 (III)
  • NFS-related problem: high CPU load average without
    any real work to do
  • Typical diagnostics in /var/log/messages:

    Jul 24 06:27:38 t007 kernel: nfs: server tier2 not responding, still trying
    Jul 24 06:27:38 t007 last message repeated 4 times
    Jul 24 06:29:30 t007 kernel: nfs: task 3281 can't get a request slot
    Jul 24 06:42:52 t007 automount[5189]: expired /data/raid1
    Jul 24 06:44:06 t007 kernel: nfs: task 3283 can't get a request slot

  • It looks like the NFS server cannot send data on
    request due to overload or other reasons, and the
    errors appear once the number of pending requests
    from one client exceeds a certain threshold (a
    client-side mount-tuning sketch follows)
22
Caltech Tier2 (IV)
  • Good node (Ganglia)
  • Bad node (Ganglia)

23
Caltech Tier2 (V)
  • Good node (Ganglia)
  • Bad node (Ganglia)

24
Condor (UW-Madison)
  • Condor flock of chaotically distributed nodes
  • Intel/Linux: 614 nodes
  • Intel/WinNT 5.0: 113 nodes
  • SUN4u/Solaris 2.8: 105 nodes
  • SUN4x/Solaris 2.8: 6 nodes
  • We are using it for the FORTRAN part only (a submit
    sketch follows)
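A minimal sketch of a submit description that would steer the FORTRAN (cmsim) jobs to the Intel/Linux part of the flock; the file names and the number of queued jobs are assumptions:

    # cmsim.sub -- Condor submit description (sketch)
    universe     = vanilla
    executable   = cmsim.csh
    requirements = (Arch == "INTEL") && (OpSys == "LINUX")
    output       = cmsim.$(Process).out
    error        = cmsim.$(Process).err
    log          = cmsim.log
    queue 100

It would be submitted and monitored with the standard Condor commands, e.g. condor_submit cmsim.sub and condor_q -submitter $USER.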

25
Condor (UW-Madison) Statistics
  • Condor per day SU usage
  • Condor total usage

26
Condor (UW-Madison)
  • NFS problem: an evicted job cannot open an existing
    file after it restarts on a new node
  • The failure rate depends on the type of job and
    varied from ...
  • The reason is unknown

27
Caltech HPSS
  • Hardware
  • 5 SP2 four-processor Silver nodes
  • 1 SP2 eight-processor High node
  • IBM 3494 robotic tape library with
  • 6 IBM Magstar 3590 drives (10 GB per tape, 9 MB/s)
  • 2300 tape slots (23 TB capacity)
  • StorageTek 4410 robotic tape silo with
  • 4 STK Redwood drives (50 GB per tape, 11 MB/s)
  • 6000 tape slots (300 TB capacity)

28
Caltech HPSS
  • HPSS cannot keep more than 200-300 connections open
    at the same time; 400 connections killed the whole
    system (a client-side throttling sketch follows)
  • The future TeraGrid Caltech unit will have 100 TB of
    disk under PVFS
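The practical consequence is that transfers from the farm have to be throttled on the client side; a minimal csh sketch, where the file layout, the wrapper around the HPSS client and the batch size of 100 are all assumptions:

    #!/bin/csh
    # Stage output files to HPSS in batches so the number of simultaneous
    # connections stays well below the 200-300 that the system can survive
    set batch = 0
    foreach f ( run*/events.fz )
        ./hpss_put.csh $f &                  # hypothetical wrapper around the HPSS client
        @ batch = $batch + 1
        if ( $batch >= 100 ) then
            wait                             # let the current batch of transfers finish
            set batch = 0
        endif
    end
    wait                                     # wait for the last partial batch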

29
Conclusion (I)
  • Alliance resources could be a great asset for future
    CMS milestones such as DC05, the Physics and
    Computing TDRs, etc.
  • Full use of such resources requires overcoming some
    operational problems at the outset, but we have been
    largely successful in dealing with such problems

30
Conclusion (II)
  • In addition to application and Grid middleware
    problems, we have a set of infrastructure-level
    technical problems, policies and open social issues
    which can prevent full utilization of future grid
    capacity
  • Technical issues are closely connected with computing
    center policies
  • PBS has a set of limitations: the MAXJOB limit, the
    finest allocation granularity being a node rather
    than a CPU, and the wrong ordering of running jobs

31
Conclusion (III)
  • Lack of good monitoring and accounting system.
    This is the basis of any future activity of
    resource brokers and it is practically impossible
    to have robust broker without detailed
    statistical information
  • Reliable Mass Storage System is the critical
    issue for LHC data handling. Current limitation
    on opened connections at Caltech HPPS is raised
    questions about MSS reliability. Further
    investigations with HPSS and PVFS are essential