Design of Memory Systems for Spaceborne Computers

1
Design of Memory Systems for Spaceborne Computers
  • Richard B. Katz
  • NASA Office of Logic Design
  • 2004 MAPLD International Conference
  • September 8-10, 2004
  • Washington, D.C.

2
Agenda
This seminar will discuss the design of memory
systems for spaceborne computers. While normally
associated with computers, many of the concepts
in this seminar also apply to the "configuration
memory" of FPGAs. The seminar will include a
discussion of the following topics:
  • Memory classification
  • Review and discussion of spaceborne memory system
    architectures in both manned and robotic NASA
    missions
  • Robust memory system design and criteria
  • Impact of software on memory system integrity
  • Frequently seen problems and lessons learned
  • Component considerations - Cell and device
    failures - Lock up
  • Recommendations

3
Memory Classification
  • While normally associated with computers, many of
    the concepts in this paper also apply to the
    configuration memory of FPGAs.
  • Fixed
  • The contents of the memory are physically fixed
    by the structure of the memory element.
  • Examples: core rope memories (wire wound through
    or around a core), fusible link PROMs, and
    antifuse-based PROMs.
  • Erasable
  • The contents of the memory are non-volatile, like
    the fixed memories, but the contents can be
    changed. In many cases this involves an erase
    operation and then a write.
  • Examples: core, plated wire, electrically
    erasable programmable read only memories
    (EEPROM), erasable programmable read only
    memories (EPROM), ferroelectric memories, and
    flash. The "ROM" in EPROM and EEPROM is a poor
    part of the name, as it implies permanence,
    which is incorrect. Devices such as EEPROM may
    need refreshing over long missions, as many are
    rated for a 10-year storage lifetime, giving
    them dynamic characteristics.
  • Volatile
  • The contents of the memory are volatile: they do
    not retain contents either after the cycling of
    power or during brownout conditions. This
    class is subdivided into two subclasses: static,
    which retains state indefinitely while power is
    applied, and dynamic, where the memory must be
    periodically read and refreshed.
  • Examples include SRAM, DRAM, and SDRAM.

4
Saturn V Launch Vehicle Duplex Memory
Each of the two core memory units was accessed in
parallel and each contained parity. If an error
was detected in the memory unit currently
designated as prime, then data from the secondary
unit was used with the secondary unit now given
the prime designation. Hardware automatically
wrote corrected data upon the detection of an
error.
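
The duplex scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not the Saturn V hardware: the class names, the use of even parity, and the behavior when both copies fail parity are assumptions made for the example.

```python
def parity_bit(word):
    """Even-parity bit for a data word: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") & 1


class MemoryUnit:
    """One memory unit holding a (data, parity) pair per address."""

    def __init__(self, size):
        self.cells = [(0, 0)] * size

    def write(self, addr, word):
        self.cells[addr] = (word, parity_bit(word))

    def read(self, addr):
        """Return (word, parity_ok)."""
        word, stored = self.cells[addr]
        return word, parity_bit(word) == stored


class DuplexMemory:
    """Two units accessed in parallel; one is designated prime."""

    def __init__(self, size):
        self.units = [MemoryUnit(size), MemoryUnit(size)]
        self.prime = 0

    def write(self, addr, word):
        for unit in self.units:
            unit.write(addr, word)

    def read(self, addr):
        results = [unit.read(addr) for unit in self.units]  # parallel access
        word, ok = results[self.prime]
        if ok:
            return word
        other = 1 - self.prime
        alt_word, alt_ok = results[other]
        if not alt_ok:
            # Assumption: the model simply reports a double failure.
            raise RuntimeError("parity error in both units")
        self.units[self.prime].write(addr, alt_word)  # write corrected data back
        self.prime = other                            # secondary becomes prime
        return alt_word
```

A read that finds bad parity in the prime unit returns the secondary unit's word, repairs the failed location, and swaps the prime designation, all without software involvement in this model.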
5
Apollo Guidance Computer
The advantages of the ropes are numerous. The
program, once wired in, cannot be electrically
altered, a substantial asset for mission
reliability. [2] The permanent memory requires
very few active components and very little power
to operate. It also has properties that make it
indestructible short of mechanical damage; that
is, there is no inflight failure of any kind that
can destroy this part of the memory.
In case of inflight
failure that destroys the information in this
erasable memory, the computation can be
restarted by reading in only a very few words.
[3]
Memories in the AGC were single string; each
memory used a parity bit for error detection.
Fixed storage was core rope, a permanent memory
technology, with coincident-current core
implementing erasable memory. Involuntary
instructions, which operated as an interrupt and
not under program control, could shift data into
specific words of memory. Data could also be
entered via the astronauts' keyboard and the
"PACE" digital command system before launch. [3]
6
Galileo Attitude Control Computer
[Block diagram labels, each shown twice: RTG Power
For Keep-Alive; CMOS Memory Array; ROM; GSE/DMA;
Arbiter/Controller; CDH/DMA]
Memory units were accessed one at a time. There
was no parity and RAM contents were protected by
write protect registers and monitored by
checksums in the background. Primary and
secondary memory designs were switched via a
discrete command. ROM contents implemented
safe-hold mode. DMA was functional either with
the processor clamped in reset or executing
flight software. A heartbeat was sent to the
CDH via DMA.
7
Single String Computer A
Single Board Computer
Conceptual diagram.
Code redundantly stored in three EEPROM modules.
Switching between copies is implemented in
software and all software must be running to be
able to accept and process the command to switch
images. The critical boot code and interrupt
vectors cannot be made fault tolerant in this
software-centric architecture.
Command to the flight software.
µP
Logic Device
Simplified software-centric architecture.
Switching between critical boot sections is done
by software, leaving single point failures in
this architecture. There is no parity or EDAC.
[Diagram labels: EEPROM Module 1, EEPROM Module 2,
EEPROM Module 3, each containing Boot Code]
8
Single String Computer B
These two computers are based on the same base
SBC but reflect different engineering approaches.
Single Board Computer
Conceptual diagram.
Code redundantly stored in three EEPROM modules.
Switching between copies is implemented in
hardware by an external discrete command.
µP
Hardware command selects between one of two spare
modules.
Hardware command for either on- or off-board boot
code selection.
Simplified hardware-centric architecture.
Switching between critical boot sections is done
by hardware discretes, eliminating the EEPROM as
a single point failure. Common mode EEPROM
failure modes do remain.
[Diagram labels: EEPROM Module 1, EEPROM Module 2,
EEPROM Module 3, each containing Boot Code]
9
Lunar Orbiter Laser Altimeter (Proposed)
Block diagram of proposed processing electronics.
S/C CMD and telemetry interfaces can read and
write all memory locations directly; the
processor may be clamped in reset for these
operations. The microprocessor may boot to
safe-hold from on-chip ROM or RAM or off-chip
PROM, EEPROM, or RAM. Default science algorithms
are stored in PROM with the EEPROM providing
operational flexibility for new algorithms that
are uploaded.
10
Requirement: Design Against Any Credible
Off-Nominal Event
These Events Are Considered Both Credible and
Likely
  • Power Transitions and Disruptions
  • Power Up Transient
  • Power Down Transient
  • Glitches or brownouts on power lines
  • Software Faults
  • Cell and Device Failure
  • Asynchronous Reset

11
Power Transitions and Disruptions
  • Three Cases
  • Power Up Transient
  • Power Down Transient
  • Glitches or brownouts on power lines
  • Many designers use a simple RC timing circuit for
    the generation of a POR, or Power On Reset,
    signal. Looking closely at the acronym, it has
    the word "On" in it; the "O" does not stand
    for "Off."
  • The RC timing circuit will result in a signal
    that lags the supply and will not be asserted
    early enough to protect erasable memory contents
    during power down and transients.
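
The lag can be estimated with the standard RC charge and discharge equations. The sketch below is a rough illustration only: the function names, the component values in the note that follows, the ideal step on the supply, and the absence of a discharge diode are all assumptions, not values from the presentation.

```python
import math


def rc_release_delay(r_ohms, c_farads, v_supply, v_threshold):
    """Time after an ideal supply step until the RC node rises to the
    reset threshold, i.e. how long reset stays asserted at power up."""
    return -r_ohms * c_farads * math.log(1.0 - v_threshold / v_supply)


def rc_assert_delay(r_ohms, c_farads, v_start, v_threshold):
    """Time after the supply collapses to 0 V until the RC node decays to
    the reset threshold, assuming the capacitor discharges only through R.
    Reset is NOT asserted during this interval."""
    return r_ohms * c_farads * math.log(v_start / v_threshold)
```

With the assumed values R = 100 kΩ, C = 1 µF, a 5 V supply, and a 2.5 V threshold, both functions evaluate to RC·ln 2, roughly 69 ms. In this idealized case the reset would not assert until about 69 ms after the supply had already collapsed, far too late to protect a memory that is in the middle of a write.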

(contd on next slide)
12
Power Transitions and Disruptions (contd)
  • Reset circuit characteristics
  • Power-on: Assert early and hold until after all
    voltages and circuits are stable
  • Power-off: Assert prior to the removal of power
  • Glitches and brownouts: Similar to the
    power-off case.
  • Often best generated in the power supply
  • Carefully analyze the signals controlling the
    memories
  • Controls are often implemented by an FPGA that is
    not guaranteed to be under control during
    power-on, power-off, and periods when power is
    disrupted. FPGA and configuration memory device
    internal power-on reset circuits may be active
    along with initialization sequences; charge pumps
    have to supply sufficient charge and voltage to
    turn on high-voltage isolation FETs; etc.
  • Erasable memory device protection is an analog
    function and digital components must be used with
    extreme care. Along with timing, many memory
    devices require non-standard voltage levels and
    currents for protection.

13
Software Faults
  • Consider the likelihood of a software fault to be
    100%.
  • Device Protection
  • Many erasable devices implement software write
    protection to guard against inadvertent writes
    to the memory.
  • JEDEC has published a standard on this type of
    protection.
  • Do not keep the keys to unlock the memory
    on-board unless absolutely necessary.
  • Subsystem Protection
  • System level write protection limits, implemented
    in hardware, to protect against software faults.
  • Some systems implement this in software, which is
    risky; see the first bullet above.
  • Use external hardware discrete command as an
    additional barrier to prevent inadvertent writes.
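
The layered barriers listed above can be modeled as a short behavioral sketch. Everything here is hypothetical: the class name, the key value, and the single protected address boundary are illustration only, and in a real system the discrete and the protection limit would be hardware, not Python attributes.

```python
UNLOCK_KEY = 0xA5C3  # hypothetical key; supplied with the command, not stored in flight software


class ProtectedEeprom:
    """Behavioral model of an erasable memory behind three write barriers."""

    def __init__(self, size, protected_below):
        self.data = bytearray(size)
        self.protected_below = protected_below  # hardware write-protection limit (assumed)
        self.hw_write_enable = False            # state of an external hardware discrete

    def write(self, addr, value, key=None):
        """Return True only if every barrier permits the write."""
        if not self.hw_write_enable:      # barrier 1: external discrete command
            return False
        if addr < self.protected_below:   # barrier 2: subsystem hardware limit
            return False
        if key != UNLOCK_KEY:             # barrier 3: device-level unlock key
            return False
        self.data[addr] = value & 0xFF
        return True
```

A runaway program in this model cannot alter the memory unless the discrete has been commanded, the address lies outside the protected region, and the key arrived with the command.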

14
Cell and Device Failure: General Guidelines to be
Tailored for Each Mission and Application
  • High-reliability, radiation-hardened CMOS RAM and
    PROM are available.
  • Designing against cell and device failure should
    be consistent with mission rules on single point
    failures.
  • Examine the "radiation-hardened" label carefully,
    as some devices marked as such are in fact SEU
    soft.
  • Commercial off the shelf (COTS) and Single Event
    Upset (SEU) soft devices should have parity for
    error detection or error detection and correction
    (EDAC) circuits, as required for the application.
  • Analyze and test devices for lockup states.
    These can occur in many memory types from illegal
    loads into command registers, poor signal
    integrity, poor power quality, or an SEU. Some
    device lockup states require power cycling to
    clear.
  • Consider the likelihood of an EEPROM or flash
    device fault to be 100%. There are enough
    failures in the industry to justify such an
    approach.
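
As an illustration of what an EDAC circuit computes, the following is a minimal sketch of a single-error-correcting, double-error-detecting (SEC/DED) extended Hamming code over a 4-bit nibble. The word size, bit layout, and function names are assumptions chosen to keep the example small; flight EDAC is implemented in hardware over wider words.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.
    Index 0 is the overall parity bit; indices 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall_odd = sum(code) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        code[syndrome] ^= 1                 # single error; syndrome 0 is the parity bit
        status = "corrected"
    else:
        return None, "uncorrectable"        # even parity, nonzero syndrome: double error
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Any single flipped bit is corrected and any two flipped bits are flagged. Three or more flipped bits can be miscorrected or pass undetected, which is why multi-bit upsets are a concern for schemes of this kind.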

15
Asynchronous Reset
  • Consider the system effects on the memory
    subsystem from an asynchronous reset.
  • Power disruptions, as discussed above, are
    included here.
  • Reset either from another on-board computer or a
    ground command, perhaps in an attempt to clear a
    fault.
  • Will write cycles be aborted while being set up or
    in process, leaving a non-volatile memory in an
    undefined state or altering RAM contents so that
    they are no longer valid for a warm boot?
  • Hardware memory controllers
  • Flight software, which in some systems is
    involved in generating sequences and timing for
    non-volatile memories.
  • Will hardware operations be given time and energy
    to complete on-going operations? Many
    non-volatile memory devices take on the order of
    10 ms to complete an operation.

16
Frequently Seen Problems
  • Reset signals to memory devices not properly
    driven.
  • Higher current requirements are frequently
    ignored, resulting in too large a voltage drop
    across a pull-up resistor.
  • Non-standard logic thresholds are frequently
    ignored, resulting in too small a DC noise
    margin.
  • The two issues above, either singly or in
    concert, can result in the device going into a
    protection mode and not operating, causing memory
    fetch operations to fail and present incorrect
    data on a byte-wide basis to a CPU.
  • Power-off and brownout electrical conditions are
    often ignored. Non-volatile memories are not
    protected.
  • Device internal write protection not used.
  • FPGAs provide control of the non-volatile memory
    devices
  • FPGA transient behavior not understood or
    considered
  • FPGA state machine response to SEUs not
    considered.

(contd on next slide)
17
Frequently Seen Problems
  • Non-volatile, erasable memories are used for boot
    and safe hold.
  • Risky in general as there is no fixed memory.
    Many implementations are single string.
  • Risky in particular since there are a lot of
    unexplained failures in the industry.
  • Software architectures require that entire
    computer systems, hardware and software, be
    operational to accept any commands. Thus, if
    there are any problems, there is often little or
    nothing that can be done from the ground.
  • Lockup states in memory devices are often not
    considered either in memory controller designs
    (soft resets) or system designs (power cycle
    required for clearing of faults).
  • Critical switching between memory images for
    booting is implemented as a software function,
    which cannot be guaranteed to function under all
    credible faults, resulting in system lockup.

(contd on next slide)
18
Frequently Seen Problems
  • DMA functions require software to be operational
    to initiate transfers, which cannot be guaranteed
    to function under all credible faults.
  • Technology often not understood. For example,
    some memory devices, while logically permitting
    byte writes, only perform subpage writes. This
    results in an incorrect count of write cycles
    per location, and many erasable memory
    technologies are write cycle limited.
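
The write-cycle miscount can be made concrete with a small bookkeeping sketch. The class name and the subpage size used in the note below are assumptions for illustration; the real subpage size is device specific.

```python
class WearTracker:
    """Counts write cycles per physical subpage rather than per byte."""

    def __init__(self, size_bytes, subpage_bytes):
        self.subpage_bytes = subpage_bytes
        n_subpages = (size_bytes + subpage_bytes - 1) // subpage_bytes
        self.cycles = [0] * n_subpages

    def record_byte_write(self, addr):
        # A logical byte write rewrites the whole subpage that contains it.
        self.cycles[addr // self.subpage_bytes] += 1

    def cycles_at(self, addr):
        """Write cycles actually experienced by the cell at this address."""
        return self.cycles[addr // self.subpage_bytes]
```

With an assumed 64-byte subpage, writing each of the 64 bytes once costs every location in that subpage 64 write cycles, not the single cycle a per-byte count would suggest.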

19
Some Component Considerations: Non-volatile Memory
Lockup
SEFI data for the R1701L PROM: This "stuck at"
mode, not necessarily 0, requires power cycling
of this serial device to clear. [5] See also
[6] and other reports for similar results.
SEE Test Results for AT28C010 (EEPROM) [4]: Types
I and II are Single Event Functional Interrupts
(SEFI) and required power cycling to restore
functionality. Errors can be multi-bit,
defeating SEC/DED EDAC schemes.
Some, but not all, non-volatile memory components
can enter lockup states and become stuck,
requiring the cycling of power to restore
functionality. Careful system-level consideration
of the use of such devices is needed with regard
to error detection and clearing, protection of
device I/O pins, and the loss of system
functionality and propagation of errors until
recovery is achieved.
20
Some Component Considerations: Synchronous DRAM
(SDRAM) Lockup
BURST LENGTH

  A2  A1  A0    M3 = 0       M3 = 1
  0   0   0     1            1
  0   0   1     2            2
  0   1   0     4            4
  0   1   1     8            8
  1   0   0     RESERVED     RESERVED
  1   0   1     RESERVED     RESERVED
  1   1   0     RESERVED     RESERVED
  1   1   1     FULL PAGE    RESERVED

Loss of functionality for the Hyundai 256M SDRAM
(Auto Refresh Operation Mode) [7]
Examination of a command field, Burst Length, for
a Load Mode Register command for one SDRAM type.
SDRAMs contain finite state machines and some
models may lock up, requiring the cycling of
power, if RESERVED commands are loaded. For some
models, this can result in potential damage to a
device. Other ways of entering illegal and
potentially damaging states are an SEU, as
shown in the chart on the right, an error in the
controlling device, poor signal integrity, or poor
power quality. Careful system-level consideration
of the use of such devices is needed with regard
to error detection and clearing, spare
replacement devices in the event of damage, and
the loss of system functionality and propagation
of errors until recovery is achieved.
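
One defensive measure a memory controller can take is to refuse to issue a Load Mode Register command whose Burst Length field is RESERVED. The sketch below encodes the table above and is only a model: the function name is hypothetical, and placing the A2..A0 field in bits 2..0 and M3 in bit 3 of the mode word is an assumption based on the table's labels. Other mode register fields are not checked here.

```python
# Legal (A2..A0 code, M3) combinations from the Burst Length table.
BURST_LENGTH = {
    (0b000, 0): 1, (0b000, 1): 1,
    (0b001, 0): 2, (0b001, 1): 2,
    (0b010, 0): 4, (0b010, 1): 4,
    (0b011, 0): 8, (0b011, 1): 8,
    (0b111, 0): "FULL PAGE",
}


def decode_burst_length(mode_word):
    """Return the burst length for a mode word, or raise ValueError
    if the Burst Length field holds a RESERVED encoding."""
    code = mode_word & 0b111        # assumed: A2..A0 in bits 2..0
    m3 = (mode_word >> 3) & 1       # assumed: M3 in bit 3
    try:
        return BURST_LENGTH[(code, m3)]
    except KeyError:
        raise ValueError(
            "RESERVED burst length: A2..A0=%s, M3=%d" % (format(code, "03b"), m3)
        ) from None
```

A check like this guards only against bad values from the controlling logic. It does nothing for a mode register upset inside the SDRAM itself, which still needs detection and recovery at the system level.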
21
Recommendations
  • Boot and Safe-Hold Code
  • High-reliability, radiation-hardened, fixed
    memories should normally be employed for boot and
    safe-hold functions.
  • For applications such as instruments, DMA
    functions, properly implemented, can load
    memories with boot code. In this case, the
    instrument should be safed by hardware logic.
  • DMA functions should not require any operational
    software. A hardware discrete command to clamp a
    processor into reset is also recommended.
  • Hardware discrete commands should be used for
    switching critical memory banks, not software.
  • Checking Memory Validity
  • Parity should be used where practical.
  • CRC or block parity is useful for the storage of
    frames or blocks of data.
  • Checksums should be run in the background during
    idle time.
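
A background check of this kind can be sketched as a scrubber that verifies one block per idle slot. The class name and block size are hypothetical, and CRC-32 from Python's standard zlib module stands in here for whatever checksum a given system uses.

```python
import zlib


class BackgroundScrubber:
    """Verifies one memory block per call against a stored reference CRC."""

    def __init__(self, memory, block_size):
        self.memory = memory            # bytes-like object, e.g. a bytearray
        self.block_size = block_size
        self.expected = [
            zlib.crc32(bytes(memory[i:i + block_size]))
            for i in range(0, len(memory), block_size)
        ]
        self.next_block = 0

    def idle_step(self):
        """Check the next block; return its index if corrupted, else None."""
        i = self.next_block
        start = i * self.block_size
        actual = zlib.crc32(bytes(self.memory[start:start + self.block_size]))
        self.next_block = (i + 1) % len(self.expected)
        return i if actual != self.expected[i] else None
```

Calling `idle_step()` from an idle loop walks the whole memory over time without blocking foreground work. The reference CRCs would have to be refreshed after any legitimate write, which this sketch does not model.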

(contd on next slide)
22
Recommendations
  • Lockup States Must Be Considered
  • Select devices that do not have lockup states, if
    possible.
  • No device with a lockup state should be
    mission-critical or safety-critical.
  • Memory controllers should be tolerant of these
    conditions and at a minimum attempt to clear
    lockup states in devices.
  • System devices should be tolerant of these
    conditions and be able to cycle power to clear
    those lockup states that require power cycling
    while meeting all mission requirements.
  • Systems should require the minimum of resources
    to function to enhance the probability of
    survival in the presence of either faults or
    off-nominal events.
  • Erasable memory devices should permit an analog
    measurement of the state of a bit. For example, for
    an EEPROM cell, the amount of charge on the cell
    should be represented by an analog signal that is
    digitized. This enables margins to be determined
    and trends to be measured, detecting weak cells
    or other problems as early as practical during
    test.
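
If such an analog readout were available, margin and trend screening might look like the following sketch. The function name, the use of ADC counts, the read threshold, and the simple one-interval linear projection are all assumptions for illustration, not a described test procedure.

```python
def weak_cells(history, threshold, min_margin):
    """Flag cells whose margin is low now or is projected to be low soon.

    history: dict mapping cell id -> list of digitized charge readings,
             oldest first (arbitrary ADC counts).
    threshold: reading at which the cell would flip its logical state.
    min_margin: smallest acceptable distance above the threshold.
    Returns a dict of cell id -> (current margin, drift over the history).
    """
    flagged = {}
    for cell, readings in history.items():
        margin = readings[-1] - threshold
        drift = readings[-1] - readings[0]
        projected = margin + min(drift, 0)   # assume the same loss happens again
        if margin < min_margin or projected < min_margin:
            flagged[cell] = (margin, drift)
    return flagged
```

A cell that still reads correctly but is losing charge quickly is flagged alongside a cell whose margin is already thin, which a purely digital read would not reveal.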

(contd on next slide)
23
Recommendations
  • Erect Barriers to Prevent Inadvertent
    Contamination of Erasable Memory Contents
  • Write protection registers implemented in
    hardware to prevent software errors from
    corrupting memory contents
  • Use device-specific protection functions, such as
    memory protect hardware pins and required
    software sequences, to restrict writes. Do not
    store software keys on board; make them part of
    a command and not of the core software.
  • Select erasable memory devices that are not
    self-contained. That is, if a clock signal and
    high voltage are required to alter the memory
    contents, they should not be generated on-chip
    but at the system level. This permits the logic
    designer to insert barriers between the logic
    signals required to write (clock signals) and
    energy source (high voltage) and the memory
    device.
  • Refreshing of critical code, such as boot code,
    that is stored in erasable memory should not be
    done to mitigate faulty devices. Instead, use
    reliable fixed memory technology.

(contd on next slide)
24
Recommendations
  • Verify Margins of All Protection Signals
  • DC voltage margin
  • AC voltage margins (e.g., cross talk)
  • Timing (protection signals for power up, power
    down, and during glitches). The power down rate
    of voltage buses is often ignored or idealized.
  • Ensure that all in-process, critical write cycles
    have time to complete properly.
  • Consideration of effects and propagation of
    logical resets
  • Ensuring enough energy is in the system to permit
    write cycles to properly finish before the
    voltage is out of specification.
  • Third party device packaging houses
  • Verify that they fully understand the technology
    and the original manufacturer's test procedures
    and screening criteria
  • Compare failure rates of third party houses with
    those reported by the original die manufacturer
  • Ensure that proper and complete testing for space
    missions is performed
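
The energy bullet above can be turned into a first-order sizing estimate. The sketch assumes a constant-power load, an ideal hold-up capacitor, and zero detection latency; the function name and the example numbers in the note are illustrative only.

```python
def min_holdup_capacitance(power_w, write_time_s, v_start, v_min):
    """Smallest capacitance (farads) whose stored energy between v_start
    and v_min covers power_w for write_time_s:
        0.5 * C * (v_start**2 - v_min**2) >= power_w * write_time_s
    """
    if v_start <= v_min:
        raise ValueError("v_start must exceed v_min")
    return 2.0 * power_w * write_time_s / (v_start ** 2 - v_min ** 2)
```

For an assumed 0.5 W load, a 10 ms write, and a bus allowed to sag from 5.0 V to 4.5 V, this gives 0.01 / 4.75 F, a little over 2100 µF. Real designs would add margin for capacitor tolerance, aging, temperature, and the time needed to detect the power loss.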

(contd on next slide)
25
Recommendations
  • Understand All Failure Modes and Consider Common
    Mode Failures and their system effects.
  • Certain models of EEPROM, flash, DRAM, and SDRAM
    have been seen to have various lockup modes or
    test modes that can be entered by credible,
    off-nominal events.
  • Non-hardened SRAM, DRAM, SDRAM, etc., can have
    stuck bits from radiation.
  • Storing multiple copies of the same code in the
    same technology is risky if the fundamental
    technology is not reliable. With the current
    rash of industry failures of EEPROM, for example,
    using multiple copies of the same device type,
    even with hardware selection, is a form of
    Russian roulette. Storing redundant copies of
    code in separate blocks of one device can be
    subject to common mode failures.
  • Treating bit, block, and device failures in
    software can be done in many instances, such as
    recorders. For critical boot code, however,
    repairing failures should not be treated as a
    software maintenance task that must be completed
    before the next reset. That would be a form of
    "foam logic."

26
References
  1. Space Vehicle Design Criteria, (Guidance and
    Control) Spaceborne Digital Computer Systems,
    NASA SP-8070, March 1971, National Aeronautics
    and Space Administration
  2. The Apollo Guidance Computer, Ramon L. Alonso
    and Albert L. Hopkins, R-416, August, 1963.
  3. General Design Characteristics of the Apollo
    Guidance Computer, Eldon C. Hall, R-410, May
    1963.
  4. Single Event Functional Interrupt (SEFI)
    Sensitivity in EEPROMs, R. Koga, 1998 MAPLD
    International Conference, September, 1998,
    Greenbelt, MD.
  5. Single-Event Upset Test Results for the Xilinx
    R1701L PROM, S. M. Guertin, JPL Report, August
    24, 2000
  6. SEE and TID Extension Testing of the Xilinx
    XQR18V04 4Mbit Radiation Hardened Configuration
    PROM, Carl Carmichael, Joe Fabula, Candice Yui,
    and Gary Swift, 2002 MAPLD International
    Conference, September 10-12, 2002, Laurel, MD.
  7. "Permanent Single Event Functional Interrupts
    (SEFIs) in 128- and 256-megabit Synchronous
    Dynamic Random Access Memories (SDRAMs)," R.
    Koga, P. Yu, K.B. Crawford, S.H. Crain, and V.T.
    Tran, 2001 IEEE Radiation Effects Data Workshop.