LACSI Priorities - PowerPoint PPT Presentation

Slides: 19
Provided by: lacsi
Transcript and Presenter's Notes

Title: LACSI Priorities


1
LACSI Priorities and Strategies
Systems Overview
Feb 2005
Discussion Points
http://lacsi.rice.edu/.../systems_overview.ppt
2
This Year's Agenda
  • Brief overview of FY05 activities.
  • Evaluate re: long- vs. short-term, research vs.
    development.
  • Which will have met their goals?
  • Declare victory and move on?
  • Transfer to other organizations/funding?
  • Which should be rethought?
  • Continuations?
  • Rethink long-term priorities.
  • Reassess technology trends: research, industry
  • Reassess the needs of LANL, ASC, NNSA,
  • Identify a strategy by which LACSI resources can
    have the most positive impact.
  • Funding constraints
  • Expertise of participants
  • Leverage other projects and funding sources.

3
History: PS 2003 (Planning for FY04)
There were six thrusts:
  1. Reliability
  2. Adaptability
  3. Commodity
  4. Compiler/System Interface
  5. Advanced Architectures (WANs)
  (6. Systems for Scalable Visualization)
Problems with this approach:
  • 1-3 were too abstract.
  • 4-6 were too concrete.
  • Identified lots of interesting problems, but way too many for
    available funding.
  • Poor match with academic SOW, LANL work.

4
History: PS 2004 (for FY05)
  • Focus on
  • Needs of ASCI HPC at LANL for new systems.
  • Items on which this group will have a significant
    impact.

5
Background: The Future of Cluster Interconnects
  • Commodity networks vs MPP interconnects
  • GigE and 10GigE
  • Questions of cost and reliability for full
    bandwidth interconnects.
  • Cheap (e.g. Broadcom) NICs being built into
    motherboards.
  • Futures of Quadrics, Myrinet are suspect.
  • Infiniband looks viable, but may never achieve
    commodity status.
  • It will be fast.
  • Quad data rate 12X Infiniband → 120 Gb/sec = 15 GB/sec
  • Much faster than today's memories and I/O buses.
  • LACSI Systems Challenge: end-point nodes that can
    handle very high bandwidth.
  • Processors, NICs, and interconnect
    architectures?
  • Strawman: very high speed NIC on a cache-coherent
    bus, where the NIC is a peer of the CPU, e.g.
    HyperTransport.
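
The 120 Gb/sec figure on this slide can be reproduced with a little arithmetic. The sketch below is illustrative only: the per-lane rate (2.5 Gb/sec at single data rate) and the 8b/10b encoding overhead are the editor's assumptions about InfiniBand signaling, not numbers taken from the slides, and the function names are made up for the example.

```python
# Back-of-the-envelope link rates for a 12X quad-data-rate (QDR) link.
# Assumptions (not from the slides): 2.5 Gb/sec per lane at single
# data rate, and 8b/10b line encoding (8 payload bits per 10 line bits).
LANES = 12                  # "12X" link width
SDR_GBPS_PER_LANE = 2.5     # single-data-rate signaling per lane
QDR_MULTIPLIER = 4          # quad data rate

def raw_gbps(lanes=LANES, multiplier=QDR_MULTIPLIER):
    """Raw signaling rate of the link in gigabits per second."""
    return lanes * SDR_GBPS_PER_LANE * multiplier

def to_gbytes(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8

signal_gbps = raw_gbps()               # 120.0 Gb/sec, as on the slide
payload_gbps = signal_gbps * 8 / 10    # 96.0 Gb/sec after 8b/10b overhead

print(signal_gbps, to_gbytes(signal_gbps))    # 120.0 15.0
print(payload_gbps, to_gbytes(payload_gbps))  # 96.0 12.0
```

Under these assumptions the slide's 15 GB/sec is the raw signaling rate; the usable payload rate would be somewhat lower, which does not change the slide's point that such links outrun contemporary memory and I/O buses.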

6
Messaging \ Reliability
  • Open-MPI: a successor to LA-MPI, LAM, and FT-MPI
  • Improve MPI reliability at all levels.
  • Transport
  • Failure modes
  • Monitoring for reliability and user feedback
  • (Integration of performance monitoring/reliability
    framework?)

7
Networking-Messaging \ Performance
  • Coordinated activities among all LACSI
    networking researchers to address future cluster
    interconnects.
  • Node Architectures
  • Interfacing the NICs to the nodes
  • Protocol design and structure
  • Assignment of work to the hardware components
  • Implementation of protocols for performance
  • Assignment of work to NICs (co-processors) to offload
    the CPU
  • Zero-copy, zero-map implementations for latency,
    bandwidth, and efficiency.
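
As a language-level analogy for the zero-copy bullet above, the sketch below contrasts a copying slice with a shared-buffer view and then receives data directly into a preallocated buffer. It is an illustration only: it says nothing about how a NIC or an MPI transport would implement zero-copy, and the names `packet` and `receive_in_place` are hypothetical.

```python
import socket

# A preallocated message buffer: a 4-byte header followed by 8 payload bytes.
packet = bytearray(b"HDR:" + b"x" * 8)

copied = bytes(packet[4:])       # slicing a bytearray allocates a new object
view = memoryview(packet)[4:]    # a memoryview slice shares the same memory

packet[4] = ord("y")             # modify the underlying buffer
seen_by_view = view[0]           # the view observes the change
seen_by_copy = copied[0]         # the copy still holds the old byte

def receive_in_place(sock, buffer):
    """Receive into caller-owned memory instead of a freshly allocated bytes object."""
    return sock.recv_into(buffer)

# Loopback demonstration with a connected socket pair.
left, right = socket.socketpair()
left.sendall(b"payload!")
received = receive_in_place(right, view)   # bytes land directly in `packet`
left.close()
right.close()

print(received, bytes(packet))
```

The design point is the same one the slide makes at the hardware level: every avoided copy saves memory bandwidth and latency, at the cost of the sender and receiver having to agree on who owns the buffer and when it may be reused.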

8
Networking-Messaging \ Utility
  • Ensure that the messaging layers correctly
    implement standards.
  • Subsetting is tolerable, but correctness is
    required.
  • Tier 1 standards
  • MPI
  • MPI/IO
  • MPI-2
  • Tier 2
  • Everything else
  • Issues
  • Defines constraints on useful
    networking/messaging activities.
  • Research vs. Development vs. Deployment tar baby.

9
Clustering \ Performance
  • Continued efforts on inherent performance of
    Clustermatic
  • Build performance monitoring infrastructure into
    Clustermatic systems.
  • Differs from other monitoring (e.g. fault monitoring) in
    that performance monitoring is continuous and
    pervasive.

10
Clustering \ Reliability
  • Address application reliability through support
    of compiler-driven (assisted) checkpointing
    mechanisms.
  • Dynamic application reconfiguration
  • Fault prediction
  • Reliability characterization
  • HAPI: Health API
  • Providing drivers for health monitoring sensors
  • Administrator/User Level Tools
  • Actuators
  • Fail-over
  • Compute nodes
  • Master nodes

11
Clustering \ Utility
  • Improved System Administration through
    Clustermatic
  • Work on tools needed to improve
    administrator/user productivity
  • Improvements in the Single System Image
  • Scripting in SSI vs. pile-of-workstations
    models.
  • File system issues.
  • Private namespaces, the V9 FS
  • Programming Models and Runtime Systems.
  • The right HLLs for performance and productivity.
  • Systems section or separate compilers section?

12
What was missing from the draft?
  • WAN activities: IP for high-bandwidth, high-latency
    networks.
  • Good progress being made by Feng et al.
  • Work was added after the PS meeting.

13
FY05 Projects on the Academic SOW
  • Project and task definitions tailored to 1-year
    contract cycle
  • Efficient, Portable, and Scalable Support for MPI
    Messaging
  • Scott Rixner, Alan Cox
  • Operating System Issues Related to Scalability
  • Arthur B. Maccabe, Patrick G. Bridges
  • Scalability of TCP
  • Application Impact of Fault-handling Placement
  • Infiniband Testbed
  • Open-MPI
  • Jack Dongarra
  • Highly Scalable Fault Tolerance
  • Dan Reed, Kevin Gamiel
  • Clustermatic Performance Instrumentation
  • Rob Fowler, Patrick Bridges, John Mellor-Crummey

14
FY06 Issues: Whence MPI?
  • Status and future of MPI extensions: Open-MPI
  • Fault tolerance
  • Performance
  • Development vs. research issues.
  • Alternatives and successors?

15
FY06 Issues: The Runtime Software Stack
  • Kernel Issues
  • Linux vs. alternatives (K42, Plan9, BSD variants,
    etc.) on clusters.
  • Flexible, adaptive, rich, general-purpose
    execution environment vs. small, fast,
    surveyable/controllable special-purpose environment.
  • Other
  • File Systems
  • Communication interfaces, models, drivers
  • Beyond bproc: SSI on non-Linux systems
  • Rethink everything?

16
FY06 Issues: System Management
  • Health monitoring, reporting, actuators, etc.
  • Fundamental research to predict future behavior.
  • Actuators: reconfiguration, checkpointing, etc.
  • Improving the management interface.
  • Eclipse parallel tools
  • Development vs. research?
  • Long-term vs. short-term?

17
FY06 Issues: Performance Instrumentation
  • Processor chips and systems are becoming more
    difficult to understand.
  • Multi-issue, out-of-order processors with memory
    parallelism are difficult enough.
  • New chips/systems will have hardware
    multithreading (shared pipes and other CPU
    resources), multiple-cores, more complex memory
    systems, other shared resources.
  • Hardware instrumentation will necessarily support:
  • Activity views: the performance story of one
    thread, where a thread may visit multiple
    resources at hardware speeds.
  • Resource views: measure the cost of contention
    for shared resources and attribute the costs in a
    useful way to activities.
  • Tracing: time series of the behavior of one activity
    or resource.
  • Profiling: spatial view of classes of
    activities and resources.
  • Vendors will not implement instrumentation unless
    there's a business case.
  • Important customers need to demonstrate demand.
  • Software needed to justify hardware investment.
  • Counters Workshop at HPCA-11 is a step.
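
The four terms on this slide (activity view, resource view, tracing, profiling) can be read as different projections of one event stream. The sketch below is a toy model under that reading: the event layout, the sample numbers, and the `profile` and `trace` helpers are all hypothetical, and real hardware counters expose nothing this tidy.

```python
from collections import defaultdict

# Hypothetical events: (timestamp, thread, resource, stall cycles).
EVENTS = [
    (0, "t0", "L2", 12),
    (1, "t1", "L2", 30),
    (2, "t0", "mem", 200),
    (3, "t1", "mem", 180),
    (4, "t0", "L2", 10),
]
THREAD, RESOURCE, CYCLES = 1, 2, 3   # field positions within an event

def profile(events, key_index):
    """Profiling: total cost aggregated over time, grouped by one field."""
    totals = defaultdict(int)
    for event in events:
        totals[event[key_index]] += event[CYCLES]
    return dict(totals)

def trace(events, key_index, key):
    """Tracing: the time series (timestamp, cost) for one thread or resource."""
    return [(e[0], e[CYCLES]) for e in events if e[key_index] == key]

activity_view = profile(EVENTS, THREAD)     # cost per thread
resource_view = profile(EVENTS, RESOURCE)   # cost per shared resource
t0_trace = trace(EVENTS, THREAD, "t0")      # one thread's story over time

print(activity_view)   # {'t0': 222, 't1': 210}
print(resource_view)   # {'L2': 52, 'mem': 380}
print(t0_trace)        # [(0, 12), (2, 200), (4, 10)]
```

The hard part the slide points at is producing such events at all: attributing contention on a shared resource back to the activity that suffered it requires hardware support, which is why the business case matters.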
