John David Eriksen - PowerPoint PPT Presentation

Learn more at: http://www.ann.ece.ufl.edu

1
High-Performance, Dependable Multiprocessor

John David Eriksen
Jamie Unger-Fink

2
Background and Motivation

Traditional space computing limited primarily to
mission-critical applications
Spacecraft control
Life support
Data collected in space and processed on the
ground
Data sets in space applications continue to grow

3
Background and Motivation

Communication bandwidth not growing fast enough
to cope with increasing size of data sets
Instruments and sensors grow in capability
Increasing need for on-board data processing
Perform data filtering and other operations
on-board
Autonomous systems demand more computing power

4
Related Work

Advanced Onboard Signal Processor (AOSP)
Developed in the 1970s and 1980s
Helped develop understanding of the effects of radiation on
computing systems and components
Advanced Architecture Onboard Processor (AAOP)
Engineered new approaches to onboard data
processing

5
Related Work

Space Touchstone
First COTS-based, fault-tolerant (FT), high-performance system
Remote Exploration and Experimentation
Extended FT techniques to parallel and cluster
computing
Focused on low-cost, high-performance, power-efficient
compute cluster designs

6
Goal

Address the growing need for on-board data
processing
Bring COTS systems to space
COTS (Commercial Off-The-Shelf)
Less expensive
General-purpose
Need special considerations to meet requirements
of aerospace environments
Fault-tolerance
High reliability
High availability

7
Dependable Multiprocessor is

A reconfigurable cluster computer with
centralized control.

8
Dependable Multiprocessor is

A hardware architecture
High-performance characteristics
Scalable
Upgradable (thanks to reliance on COTS)
A parallel processing environment
Supports a common scientific computing development
environment (FEMPI)
A fault-tolerant computing platform
System controllers provide FT properties
A toolset for predicting application behavior
Fault behavior, performance, availability

9
Hardware Architecture

Redundant radiation-hardened system controller
Cluster of COTS-based reconfigurable data
processors
Redundant COTS-based packet-switched networks
Radiation-hardened mass data store
Redundancy available in
System controller
Network
Configurable N-of-M sparing in compute nodes
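
As a rough illustration of N-of-M sparing, the sketch below keeps N nodes in service out of a pool of M and promotes a spare when an active node fails. The class, method, and node names are hypothetical and are not taken from the DM system.

```python
class SparingPool:
    """Keep n_active nodes in service out of a pool of M nodes (hypothetical model)."""

    def __init__(self, nodes, n_active):
        if n_active > len(nodes):
            raise ValueError("need at least n_active nodes")
        self.n_active = n_active
        self.active = list(nodes[:n_active])   # the N nodes doing work
        self.spares = list(nodes[n_active:])   # the remaining M - N nodes
        self.failed = []

    def fail(self, node):
        """Mark a node as failed and promote a spare if one is left."""
        if node in self.active:
            self.active.remove(node)
            self.failed.append(node)
            if self.spares:
                self.active.append(self.spares.pop(0))
        elif node in self.spares:
            self.spares.remove(node)
            self.failed.append(node)

    def operational(self):
        """True while N nodes are still in service."""
        return len(self.active) == self.n_active


# 3-of-4 sparing: one failure is absorbed, a second one is not.
pool = SparingPool(["dp0", "dp1", "dp2", "dp3"], n_active=3)
pool.fail("dp1")
```

With four nodes and N = 3, the pool stays operational after the first failure and degrades after the second.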

10
Hardware Architecture
11
Hardware Architecture

Scalability
Variable number of compute nodes
Cluster-of-cluster
Compute nodes
IBM PowerPC 750FX general processor
Xilinx Virtex-II 6000 FPGA co-processor
Reconfigurable to fulfill various roles
DSP processor
Data compression
Vector processing
Applications implemented in hardware can be very
fast
Memory and other support chips

12
Hardware Architecture
13
Hardware Architecture
14
Hardware Architecture

Network Interconnect
Gigabit Ethernet for data exchange
A low-latency, low-bandwidth bus used for control
Mission Interface
Provides interface to the rest of the space vehicle's
computer systems
Radiation-hardened

15
Hardware Architecture

Current hardware implementation
Four data processors
Two redundant system controllers
One mass data store
Two Gigabit Ethernet networks, including two
network switches
Software-controlled instrumented power supply
Workstation running spacecraft system emulator
software

16
Hardware Architecture
17
Software Architecture
18
Software Architecture

Platform layer is the lowest layer: interfaces
hardware to middleware; contains hardware-specific
software and network drivers
Uses Linux, which allows use of many existing
software tools
Mission Layer
Middleware includes DM System Services: fault
tolerance, job management, etc.

19
Middleware
20
Middleware

DM Framework is application-independent and
platform-independent
API for communicating with the mission layer; SAL
(System Abstraction Layer) for the platform layer
Allows for future applications by facilitating
porting to new platforms

21
High Availability Middleware

HA Middleware foundation includes Availability
Management (AMS), Distributed Messaging (DMS),
Cluster Management (CMS)
Primary functions
Resource monitoring
Fault detection, diagnosis, recovery and
reporting
Cluster configuration
Event logging
Distributed messaging
Based on small, cross-platform kernel

22
Availability Management Service

Hosted on the cluster's system controller
Managed Resources include
Applications
Operating System
Chassis
I/O cards
Redundant CPUs
Networks
Peripherals
Clusters
Other middleware

23
Distributed Messaging Service

Provides a reliable messaging layer for
communications in DM cluster
Used for checkpointing, client/server
communications, event notification, fault
management, time-critical communications
Application opens a DMS connection (channel) to
pass data to interested subscribers
Since messaging is in middleware instead of lower
layers, the application doesn't have to specify
a destination address explicitly
Messages are classified and machines choose to
receive messages of a certain type
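
The type-based delivery described on this slide can be sketched as a small publish/subscribe bus. This is a minimal in-process model, not the DMS API: `MessageBus`, `subscribe`, and `publish` are hypothetical names, and a real DMS would deliver across nodes.

```python
from collections import defaultdict


class MessageBus:
    """Hypothetical stand-in for a type-based messaging channel."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, msg_type, callback):
        """Register interest in one class of message."""
        self._subscribers[msg_type].append(callback)

    def publish(self, msg_type, payload):
        """Deliver to every subscriber of msg_type; the sender names no destination."""
        delivered = 0
        for callback in self._subscribers[msg_type]:
            callback(payload)
            delivered += 1
        return delivered


bus = MessageBus()
fault_log = []
bus.subscribe("fault", fault_log.append)              # receiver opts in by type
bus.publish("fault", {"node": "dp2", "kind": "timeout"})
bus.publish("event", {"note": "no subscribers yet"})  # nobody listening, nothing delivered
```

The publisher only tags the message with a type; which machines receive it is decided entirely by who subscribed.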

24
Cluster Management Service

Manages physical nodes or instances of HA
middleware
Discovers and monitors nodes in a cluster
Passes node failures to AMS and FT Manager via DMS

25
Other Middleware

Database Management
Logging Services
Tracing

26
Control Process

Interface to control computer or ground station
Communicates with system via DMS
Monitors system health with FT Manager
Heartbeat

27
Fault-tolerance Manager

Detects and recovers from system faults
FTM refers to a set of recovery policies at runtime
Relies on distributed software agents to gather
system and application liveness information
Avoids monitoring bottleneck
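
One common way to gather liveness information is heartbeat timeouts. The sketch below assumes that approach: agents report heartbeats, and anything silent for longer than a timeout becomes a suspect for the recovery policy to handle. `HeartbeatMonitor`, the agent names, and the timeout value are all hypothetical; timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Track the last heartbeat per agent and flag those that go silent."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def beat(self, agent, now):
        """Record a heartbeat from an agent at time `now` (seconds)."""
        self.last_seen[agent] = now

    def suspects(self, now):
        """Agents whose last heartbeat is older than the timeout."""
        return sorted(
            agent for agent, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.beat("job_manager", now=0.0)
monitor.beat("dp0_agent", now=0.0)
monitor.beat("dp0_agent", now=4.0)   # dp0_agent keeps reporting
late = monitor.suspects(now=5.0)     # job_manager has been silent for 5 s
```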

28
Job Manager

Provides application scheduling, resource
allocation
Opportunistic load balancing scheduler
Jobs are registered and tracked by the JM via
tables
Checkpointing to allow seamless recovery of the
JM
Sends heartbeats to the FT Manager via middleware
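
The slide only names the scheduler, so the following is the textbook opportunistic load balancing (OLB) heuristic rather than the JM's actual code: each job, in arrival order, goes to whichever node becomes free first, without regard to how long the job is expected to run there. The function name and the job and node names are hypothetical; durations are used only to simulate when nodes free up.

```python
import heapq


def olb_schedule(jobs, nodes):
    """Assign each job, in arrival order, to the node that frees up first.

    jobs:  list of (job_name, duration) pairs
    nodes: list of node names
    """
    free_at = [(0.0, node) for node in nodes]  # (time node becomes free, node)
    heapq.heapify(free_at)
    assignment = {}
    for name, duration in jobs:
        ready, node = heapq.heappop(free_at)   # earliest-available node
        assignment[name] = node
        heapq.heappush(free_at, (ready + duration, node))
    return assignment


plan = olb_schedule([("a", 5), ("b", 1), ("c", 1), ("d", 1)], ["dp0", "dp1"])
```

Here the long job `a` occupies `dp0`, so the three short jobs all land on `dp1` as it repeatedly becomes available first.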

29
FEMPI

Fault-Tolerant Embedded Message Passing Interface
Application independent FT middleware
Message Passing Interface (MPI) Standard
Built on top of HA middleware

30
FEMPI Interface

Recovery from failure should be automatic, with
minimal impact
Needs to maintain global awareness of the
processes in parallel applications
3 Stages
Fault Detection
Notification
Recovery
Process failures vs Network failures
Survives the crash of n-1 processes in an
n-process job

31
FPGA Co-Processor Services

Proprietary nature of FPGA industry
USURP: USURP's Standard for Unified
Reconfigurable Platforms
Standard to interact with hardware
Provides middleware for portability
Black box IP cores
Wrappers mask FPGA board

32
USURP

Not a universal tool for mapping high-level code
to hardware designs
OpenFPGA
Adaptive Computing System (ACS) vs USURP
Object Oriented Models vs Software APIs
IGOL
BLAST
CARMA

33
USURP HW/SW Interface

Responsible for
Unifying vendor APIs
Standardizing HW interface
Organization of data for the user application
core
Exposing the developer to common FPGA resources.

34
Checkpoint Interface

User level protocol for system recovery
Consists of
Server Process that runs on Mass Data Store
DMS
API for applications
C-type interfaces
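
A user-level checkpoint protocol of this shape can be sketched as a save/restore pair backed by a store. The slides describe C-type interfaces and a server process on the mass data store; the Python class below is only a hypothetical in-memory stand-in for that server, with invented method names and an invented job name.

```python
import json


class CheckpointServer:
    """In-memory stand-in for a checkpoint server (hypothetical)."""

    def __init__(self):
        self._store = {}

    def save(self, app_id, state):
        """Serialize and keep the latest checkpoint for an application."""
        self._store[app_id] = json.dumps(state)

    def restore(self, app_id):
        """Return the last saved state, or None if none exists."""
        blob = self._store.get(app_id)
        return None if blob is None else json.loads(blob)


server = CheckpointServer()
state = {"iteration": 40, "partial_sum": 12.5}
server.save("fft_job", state)
state["iteration"] = 41                 # work done after the checkpoint
recovered = server.restore("fft_job")   # after a crash, resume from iteration 40
```

Serializing on save means later changes to the application's in-memory state cannot alter the stored checkpoint.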

35
ABFT Library

Algorithm-based Fault Tolerance Library
Collection of mathematical routines that can
detect and correct faults
BLAS-3 Library
Matrix multiply, LU decomposition, QR
decomposition, singular value decomposition (SVD)
and fast Fourier transform (FFT)
Uses checksums
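
To show how checksums let a routine detect and correct a fault, here is a minimal sketch of the classic checksum scheme for matrix multiplication: a checksum row is appended to A and a checksum column to B, so the product carries row and column checksums that pinpoint a single corrupted element. This is not the ABFT library's code; all function names are hypothetical, and integers are used so the comparisons are exact (floating-point data would need a tolerance).

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(c_full):
    """Return (row, col) of a single corrupted data element, or None."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct(c_full, pos):
    """Rebuild the corrupted element from its row checksum."""
    i, j = pos
    n_cols = len(c_full[0]) - 1
    others = sum(c_full[i][k] for k in range(n_cols) if k != j)
    c_full[i][j] = c_full[i][n_cols] - others


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
c_full[0][1] += 9            # inject a single fault into the result
pos = locate_error(c_full)   # mismatching row and column identify the element
correct(c_full, pos)
```

The intersection of the one inconsistent row and the one inconsistent column locates the fault, and the row checksum supplies the correct value.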

36
Replication Services