Computers for the Post-PC Era

Transcript and Presenter's Notes

1
Computers for the Post-PC Era
  • David Patterson
  • University of California at Berkeley
  • Patterson@cs.berkeley.edu
  • UC Berkeley IRAM Group
  • UC Berkeley ISTORE Group
  • istore-group@cs.berkeley.edu
  • February 2000

2
Perspective on Post-PC Era
  • Post-PC Era will be driven by 2 technologies:
  • 1) Gadgets: Tiny Embedded or Mobile Devices
  • ubiquitous in everything
  • e.g., successor to PDA, cell phone, wearable
    computers
  • 2) Infrastructure to Support such Devices
  • e.g., successor to Big Fat Web Servers, Database
    Servers

3
Outline
  • 1) Example microprocessor for Post-PC gadgets
  • 2) Motivation and the ISTORE project vision
  • AME: Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

4
Intelligent RAM (IRAM)
  • Microprocessor + DRAM on a single chip:
  • 10X capacity vs. SRAM
  • on-chip memory latency 5-10X, bandwidth 50-100X
  • improve energy efficiency 2X-4X (no off-chip
    bus)
  • serial I/O 5-10X v. buses
  • smaller board area/volume
  • IRAM advantages extend to
  • a single chip system
  • a building block for larger systems

5
Revive Vector Architecture
  • Cost ($1M each)? Single-chip CMOS MPU/IRAM
  • Low latency, high BW memory system? IRAM
  • Code density? Much smaller than VLIW
  • Compilers? For sale, mature (>20 years); we retarget
    Cray compilers
  • Performance? Easy to scale speed with technology
  • Power/Energy? Parallel to save energy, keep performance
  • Limited to scientific applications? Multimedia apps
    vectorizable too: N x 64b, 2N x 32b, 4N x 16b

6
V-IRAM1: Low Power v. High Perf.

[Block diagram: a 2-way superscalar processor (16K I-cache, 16K D-cache) feeding a vector instruction queue; a vector load/store unit and vector registers organized as 4 x 64, 8 x 32, or 16 x 16 lanes; serial I/O; and a memory crossbar switch with 4 x 64 links to the on-chip DRAM macros.]
7
VIRAM-1 System on a Chip
  • Prototype scheduled for tape-out mid 2000
  • 0.18 um EDL process
  • 16 MB DRAM, 8 banks
  • MIPS scalar core and
    caches @ 200 MHz
  • 4 x 64-bit vector unit
    pipelines @ 200 MHz
  • 4 x 100 MB/s parallel I/O lines
  • 17x17 mm, 2 Watts
  • 25.6 GB/s memory (6.4 GB/s per direction
    and per Xbar)
  • 1.6 Gflops (64-bit), 6.4 GOPs (16-bit)

[Floorplan: two 8 MB (64 Mbit) DRAM halves with the crossbar (Xbar) and I/O between them. Peak-rate arithmetic for the figures above is sketched below.]
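
The headline rates above follow from simple peak-rate arithmetic. A minimal sketch, assuming each 64-bit vector pipeline completes one multiply-add (2 operations) per cycle and each 64-bit lane splits into four 16-bit sub-operations; these assumptions are ours, not stated on the slide:

```python
# Peak-rate arithmetic behind the VIRAM-1 figures (sketch; the
# 2-ops/cycle multiply-add assumption is ours, not from the slide).
CLOCK_HZ = 200e6        # scalar core and vector pipelines at 200 MHz
PIPES_64BIT = 4         # 4 x 64-bit vector unit pipelines
OPS_PER_PIPE = 2        # assumed fused multiply-add = 2 ops/cycle

gflops_64 = PIPES_64BIT * OPS_PER_PIPE * CLOCK_HZ / 1e9
print(f"64-bit peak: {gflops_64:.1f} GFLOPS")   # 1.6 GFLOPS, as on the slide

# Splitting each 64-bit lane into 4 x 16-bit operations quadruples the rate.
gops_16 = gflops_64 * 4
print(f"16-bit peak: {gops_16:.1f} GOPS")       # 6.4 GOPS, as on the slide
```
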
8
Media Kernel Performance
9
Baseline system comparison
  • All numbers in cycles/pixel
  • MMX and VIS results assume all data in L1 cache

10
IRAM Chip Challenges
  • Merged Logic-DRAM process cost: cost of wafer,
    impact on yield, testing cost of logic and DRAM
  • Price of on-chip DRAM v. separate DRAM chips?
  • Delay in transistor speeds, memory cell sizes in
    merged process vs. logic-only or DRAM-only
  • DRAM block flexibility via DRAM compiler (vary
    size, width, no. of subbanks) vs. fixed block
  • Do apps' advantages in memory bandwidth, energy,
    and system size offset these challenges?

11
Other examples: IBM Blue Gene
  • 1 PetaFLOPS in 2005 for $100M?
  • Application: protein folding
  • Blue Gene chip:
  • 32 multithreaded RISC processors + ??MB embedded
    DRAM + high-speed network interface on a single
    20 x 20 mm chip
  • 1 GFLOPS / processor
  • 2 x 2 board = 64 chips (2K CPUs)
  • Rack = 8 boards (512 chips, 16K CPUs)
  • System = 64 racks (512 boards, 32K chips, 1M CPUs)
  • Total: 1 million processors in just 2,000 sq. ft.
    (scaling arithmetic sketched below)
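
The system totals are just the packaging hierarchy multiplied out; a quick sketch of that arithmetic, taking the 1 GFLOPS/processor figure from the slide:

```python
# Blue Gene packaging hierarchy from the slide, multiplied out.
CPUS_PER_CHIP, CHIPS_PER_BOARD, BOARDS_PER_RACK, RACKS = 32, 64, 8, 64
GFLOPS_PER_CPU = 1

chips = CHIPS_PER_BOARD * BOARDS_PER_RACK * RACKS   # 32,768 chips
cpus = chips * CPUS_PER_CHIP                        # ~1 million processors
print(f"{chips:,} chips, {cpus:,} CPUs")
print(f"peak ~ {cpus * GFLOPS_PER_CPU / 1e6:.2f} PetaFLOPS")   # ~1 PFLOPS
```
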

12
Other examples: Sony PlayStation 2
  • Emotion Engine: 6.2 GFLOPS, 75 million polygons
    per second (Microprocessor Report, 135)
  • Superscalar MIPS core + vector coprocessor +
    graphics/DRAM
  • Claim: Toy Story realism brought to games

13
Outline
  • 1) Example microprocessor for Post-PC gadgets
  • 2) Motivation and the ISTORE project vision
  • AME: Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

14
The problem space: big data
  • Big demand for enormous amounts of data
  • today: high-end enterprise and Internet
    applications
  • enterprise decision-support, data mining
    databases
  • online applications: e-commerce, mail, web,
    archives
  • future: infrastructure services, richer data
  • computational storage back-ends for mobile
    devices
  • more multimedia content
  • more use of historical data to provide better
    services
  • Today's SMP server designs can't easily scale to
    meet these huge demands

15
One approach: traditional NAS
  • Network-attached storage makes storage devices
    first-class citizens on the network
  • network file server appliances (NetApp, SNAP,
    ...)
  • storage-area networks (CMU NASD, NSIC OOD, ...)
  • active disks (CMU, UCSB, Berkeley IDISK)
  • These approaches primarily target performance
    scalability
  • scalable networks remove bus bandwidth
    limitations
  • migration of layout functionality to storage
    devices removes overhead of intermediate servers
  • There are bigger scaling problems than scalable
    performance!

16
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or complexity
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

17
The ISTORE project vision
  • Our goal
  • develop principles and investigate hardware/software
    techniques for building storage-based server
    systems that
  • are highly available
  • require minimal maintenance
  • robustly handle evolutionary growth
  • are scalable to O(10000) nodes

18
Principles for achieving AME (1)
  • No single points of failure
  • Redundancy everywhere
  • Performance robustness is more important than
    peak performance
  • performance robustness implies that real-world
    performance is comparable to best-case
    performance
  • Performance can be sacrificed for improvements in
    AME
  • resources should be dedicated to AME
  • compare: biological systems spend > 50% of
    resources on maintenance
  • can make up performance by scaling system

19
Principles for achieving AME (2)
  • Introspection
  • reactive techniques to detect and adapt to
    failures, workload variations, and system
    evolution
  • proactive techniques to anticipate and avert
    problems before they happen

20
Outline
  • 1) Example microprocessor for Post-PC gadgets
  • 2) Motivation and the ISTORE project vision
  • AME: Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

21
Hardware techniques
  • Fully shared-nothing cluster organization
  • truly scalable architecture
  • architecture that tolerates partial failure
  • automatic hardware redundancy
  • No central processor unit: distribute processing
    with storage
  • if AME is important, must provide resources to be
    used for AME
  • Nodes responsible for health of their storage
  • Serial lines, switches also growing with Moore's
    Law: less need today to centralize vs. bus-oriented
    systems

22
Hardware techniques (2)
  • Heavily instrumented hardware
  • sensors for temp, vibration, humidity, power,
    intrusion
  • helps detect environmental problems before they
    can affect system integrity
  • Independent diagnostic processor on each node
  • provides remote control of power, remote console
    access to the node, selection of node boot code
  • collects, stores, processes environmental data
    for abnormalities
  • non-volatile "flight recorder" functionality
    (sketched below)
  • all diagnostic processors connected via
    independent diagnostic network
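
A minimal sketch of the "flight recorder" idea: the diagnostic processor keeps a bounded, append-only log of environmental readings and applies a simple threshold check. Sensor names, limits, and the alarm action are illustrative, not ISTORE's actual design.

```python
from collections import deque
from time import time

class FlightRecorder:
    """Bounded log of environmental readings (illustrative sketch)."""
    def __init__(self, capacity=10_000):
        self.log = deque(maxlen=capacity)       # oldest entries age out

    def record(self, sensor, value, limit):
        entry = (time(), sensor, value)
        self.log.append(entry)                  # always logged
        if value > limit:                       # naive abnormality check
            print("ABNORMAL:", entry)           # stand-in for a real alert

rec = FlightRecorder()
rec.record("temperature_C", 38.5, limit=45.0)   # normal reading
rec.record("vibration_g", 2.7, limit=1.0)       # flagged as abnormal
```
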

23
Hardware techniques (3)
  • On-demand network partitioning/isolation
  • Allows testing, repair of online system
  • managed by diagnostic processor and network
    switches via diagnostic network
  • Built-in fault injection capabilities
  • power control to individual node components
  • injectable glitches into I/O and memory busses
  • managed by diagnostic processor
  • used for proactive hardware introspection
  • automated detection of flaky components
  • controlled testing of error-recovery mechanisms
  • important for AME benchmarking

24
Hardware techniques (4)
  • Benchmarking
  • One reason for 1000X processor performance was
    ability to measure (vs. debate) which is better
  • e.g., which is most important to improve: clock
    rate, clocks per instruction, or instructions
    executed?
  • Need AME benchmarks
  • what gets measured gets done
  • benchmarks shape a field
  • quantification brings rigor

25
ISTORE-1 hardware platform
  • 80-node x86-based cluster, 1.4TB storage
  • cluster nodes are plug-and-play, intelligent,
    network-attached storage bricks
  • a single field-replaceable unit to simplify
    maintenance
  • each node is a full x86 PC w/256MB DRAM, 18GB
    disk
  • more CPU than NAS; fewer disks/node than cluster

Intelligent Disk "Brick": portable PC CPU (Pentium
II/266), DRAM, redundant NICs (4 x 100 Mb/s
links), Diagnostic Processor
  • ISTORE Chassis
  • 80 nodes, 8 per tray
  • 2 levels of switches
  • 20 x 100 Mb/s
  • 2 x 1 Gb/s
  • Environment Monitoring
  • UPS, redundant PS,
  • fans, heat and vibration sensors...

26
ISTORE Brick Block Diagram
[Block diagram: Mobile Pentium II module with CPU, North Bridge, South Bridge, 256 MB DRAM, BIOS, PCI, SCSI disk (18 GB), and 4 x 100 Mb/s Ethernets; plus a Diagnostic Processor with dual UART, Super I/O, monitor/control, Flash, RTC, and RAM, attached to the independent Diagnostic Net.]
  • Sensors for heat and vibration
  • Control over power to individual nodes
27
A glimpse into the future?
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • ISTORE HW in 5-7 years:
  • building block: 2006 MicroDrive integrated with
    IRAM
  • 9GB disk, 50 MB/sec from disk
  • connected via crossbar switch
  • 10,000 nodes fit into one rack!
  • O(10,000) scale is our ultimate design point

28
Software techniques
  • Fully-distributed, shared-nothing code
  • centralization breaks as systems scale up to
    O(10,000)
  • avoids single-point-of-failure front ends
  • Redundant data storage
  • required for high availability, simplifies
    self-testing
  • replication at the level of application objects
  • application can control consistency policy
  • more opportunity for data placement optimization

29
Software techniques (2)
  • River storage interfaces
  • NOW Sort experience: performance heterogeneity
    is the norm
  • e.g., disks: outer vs. inner track (1.5X),
    fragmentation
  • e.g., processors: load (1.5-5X)
  • So: demand-driven delivery of data to apps
    (sketched below)
  • via distributed queues and graduated declustering
  • for apps that can handle unordered data delivery
  • automatically adapts to variations in performance
    of producers and consumers
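
A minimal, single-process sketch of demand-driven delivery: producers of different speeds push records into a shared queue and a consumer simply pulls whatever is ready, so faster producers naturally contribute more. This stands in for River's distributed queues; graduated declustering itself is not shown, and all names here are illustrative.

```python
import queue, threading, time

q = queue.Queue(maxsize=64)        # shared queue (in-process stand-in)

def producer(name, delay, n=20):
    for i in range(n):
        time.sleep(delay)          # models disk/CPU speed variation
        q.put((name, i))

def consumer(total):
    got = [q.get() for _ in range(total)]      # pull whatever is ready next
    fast = sum(1 for name, _ in got if name == "fast")
    print(f"first {total} records: {fast} from the fast producer")

threads = [threading.Thread(target=producer, args=("fast", 0.001)),
           threading.Thread(target=producer, args=("slow", 0.005)),
           threading.Thread(target=consumer, args=(30,))]
for t in threads: t.start()
for t in threads: t.join()
```
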

30
Software techniques (3)
  • Reactive introspection
  • use statistical techniques to identify normal
    behavior and detect deviations from it (sketched
    below)
  • policy-driven automatic adaptation to abnormal
    behavior once detected
  • initially, rely on human administrator to specify
    policy
  • eventually, system learns to solve problems on
    its own by experimenting on isolated subsets of
    the nodes
  • one candidate: reinforcement learning
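
A minimal sketch of the "identify normal behavior, detect deviations" step using a trailing window and a z-score test; the metric, window size, and threshold are illustrative.

```python
import statistics

def detect_anomalies(samples, window=20, z_threshold=3.0):
    """Flag samples that deviate strongly from the trailing window."""
    flagged = []
    for i in range(window, len(samples)):
        hist = samples[i - window:i]
        mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
        if sigma > 0 and abs(samples[i] - mu) / sigma > z_threshold:
            flagged.append(i)
    return flagged

latencies_ms = [10, 11, 9, 10, 12, 10, 11, 10, 9, 10] * 3 + [95]  # one spike
print(detect_anomalies(latencies_ms))    # -> [30]: only the spike is flagged
```
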

31
Software techniques (4)
  • Proactive introspection
  • continuous online self-testing of HW and SW
  • in deployed systems!
  • goal is to shake out Heisenbugs before they're
    encountered in normal operation
  • needs data redundancy, node isolation, fault
    injection
  • techniques:
  • fault injection: triggering hardware and software
    error handling paths to verify their
    integrity/existence
  • stress testing: push HW/SW to their limits
  • scrubbing: periodic restoration of potentially
    decaying hardware or software state (sketched
    below)
  • self-scrubbing data structures (like MVS)
  • ECC scrubbing for disks and memory
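
A minimal sketch of the scrubbing idea applied to stored data: periodically re-read blocks, verify a checksum, and restore any block that has silently decayed from a redundant copy. The block layout and checksum choice are illustrative.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def scrub(primary: dict, replica: dict, sums: dict) -> list:
    """Re-read every block, detect silent corruption, repair from replica."""
    repaired = []
    for block_id, data in primary.items():
        if checksum(data) != sums[block_id]:        # decay detected
            primary[block_id] = replica[block_id]   # restore the good copy
            repaired.append(block_id)
    return repaired

blocks = {0: b"hello", 1: b"world"}
replica = dict(blocks)
sums = {i: checksum(b) for i, b in blocks.items()}
blocks[1] = b"w0rld"                                # simulate silent bit rot
print(scrub(blocks, replica, sums))                 # -> [1]
```
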

32
Applications
  • ISTORE is not one super-system that demonstrates
    all these techniques!
  • Initially provide library to support AME goals
  • Initial application targets
  • cluster web/email servers
  • self-scrubbing data structures, online
    self-testing
  • statistical identification of normal behavior
  • decision-support database query execution system
  • River-based storage, replica management
  • information retrieval for multimedia data
  • self-scrubbing data structures, structuring
    performance-robust distributed computation

33
Outline
  • 1) Example microprocessor for Post-PC gadgets
  • 2) Motivation and the ISTORE project vision
  • AME: Availability, Maintainability, Evolutionary
    growth
  • ISTORE's research principles
  • Proposed techniques for achieving AME
  • Benchmarks for AME
  • Conclusions and future work

34
Availability benchmarks
  • Questions to answer
  • what factors affect the quality of service
    delivered by the system, and by how much/how
    long?
  • how well can systems survive typical failure
    scenarios?
  • Availability metrics
  • traditionally, percentage of time system is up
  • time-averaged, binary view of system state
    (up/down)
  • traditional metric is too inflexible
  • doesn't capture spectrum of degraded states
  • time-averaging discards important temporal
    behavior
  • Solution: measure variation in system quality of
    service metrics over time
  • performance, fault-tolerance, completeness,
    accuracy

35
Availability benchmark methodology
  • Goal: quantify variation in QoS metrics as events
    occur that affect system availability (harness
    sketched below)
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure and trace quality of service metrics
  • Use fault injection to compromise system
  • hardware faults (disk, memory, network, power)
  • software faults (corrupt input, driver error
    returns)
  • maintenance events (repairs, SW/HW upgrades)
  • Examine single-fault and multi-fault workloads
  • the availability analogues of performance micro-
    and macro-benchmarks
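
A minimal harness sketch of this methodology: record the QoS metric per interval while the workload runs, and inject a fault partway through so its effect appears in the time series. The workload, fault effect, and numbers are placeholders, not measurements.

```python
import random

def run_availability_benchmark(intervals=30, fault_at=10):
    """Record a QoS metric (e.g. hits/sec) per interval; inject one fault."""
    qos, degraded = [], False
    for t in range(intervals):
        if t == fault_at:
            degraded = True          # stand-in for real fault injection
        nominal = 100.0              # placeholder hits/sec under no faults
        rate = nominal * (0.6 if degraded else 1.0) + random.gauss(0, 2)
        qos.append((t, rate))
    return qos

for t, rate in run_availability_benchmark():
    print(t, round(rate, 1))         # plot this series against the normal band
```
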

36
Methodology: reporting results
  • Results are most accessible graphically
  • plot change in QoS metrics over time
  • compare to normal behavior?
  • 99% confidence intervals calculated from no-fault
    runs (computation sketched below)
  • Graphs can be distilled into numbers?
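
A small sketch of the "99% confidence interval from no-fault runs" computation used to draw the normal band, assuming roughly normal per-interval QoS and the usual 2.576 z-value:

```python
import statistics

def normal_band(no_fault_runs, z=2.576):          # z-value for 99% confidence
    """Per-interval 99% CI of the QoS metric across several no-fault runs."""
    band = []
    for interval_samples in zip(*no_fault_runs):  # same interval, every run
        mu = statistics.mean(interval_samples)
        se = statistics.stdev(interval_samples) / len(interval_samples) ** 0.5
        band.append((round(mu - z * se, 1), round(mu + z * se, 1)))
    return band

runs = [[100, 102, 99, 101],          # hits/sec per interval, no-fault run 1
        [98, 103, 100, 100],          # run 2
        [101, 99, 98, 102]]           # run 3
print(normal_band(runs))
```
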

37
Example results: software RAID-5
  • Test systems: Linux/Apache and Win2000/IIS
  • SpecWeb 99 to measure hits/second as QoS metric
  • fault injection at disks based on empirical fault
    data
  • transient, correctable, uncorrectable, timeout
    faults
  • 15 single-fault workloads injected per system
  • only 4 distinct behaviors observed:
  • (A) no effect
  • (B) system hangs
  • (C) RAID enters degraded mode
  • (D) RAID enters degraded mode and starts
    reconstruction
  • both systems hung (B) on simulated disk hangs
  • Linux exhibited (D) on all other errors
  • Windows exhibited (A) on transient errors and (C)
    on uncorrectable, sticky errors

38
Example results: multiple faults
[Graphs of hits per second over time during the multiple-fault scenario, one panel for Windows 2000/IIS and one for Linux/Apache.]
  • Windows reconstructs 3x faster than Linux
  • Windows reconstruction noticeably affects
    application performance, while Linux
    reconstruction does not

39
Conclusions (1): Benchmarks
  • Linux and Windows take opposite approaches to
    managing benign and transient faults
  • Linux is paranoid and stops using a disk on any
    error
  • Windows ignores most benign/transient faults
  • Windows is more robust except when disk is truly
    failing
  • Linux and Windows have different reconstruction
    philosophies
  • Linux uses idle bandwidth for reconstruction
  • Windows steals app. bandwidth for reconstruction
  • Windows rebuilds fault-tolerance more quickly
  • Win2k favors fault-tolerance over performance;
    Linux favors performance over fault-tolerance

40
Conclusions (2): ISTORE
  • Availability, Maintainability, and Evolutionary
    growth are key challenges for server systems
  • more important even than performance
  • ISTORE is investigating ways to bring AME to
    large-scale, storage-intensive servers
  • via clusters of network-attached,
    computationally-enhanced storage nodes running
    distributed code
  • via hardware and software introspection
  • we are currently performing application studies
    to investigate and compare techniques
  • Availability benchmarks: a powerful tool?
  • revealed undocumented design decisions affecting
    SW RAID availability on Linux and Windows 2000

41
Conclusions (3)
  • IRAM attractive for two Post-PC applications
    because of low power, small size, high memory
    bandwidth
  • Gadgets Embedded/Mobile devices
  • Scalable infrastructure
  • ISTORE: hardware/software architecture for
    large-scale network services
  • Post-PC infrastructure requires:
  • new goals: Availability/Maintainability/Evolution
  • new principles: Introspection, Performance
    Robustness
  • new techniques: Isolation/fault insertion, SW
    scrubbing

42
Future work
  • IRAM: fab and test chip
  • ISTORE:
  • implement AME-enhancing techniques in a variety
    of Internet, enterprise, and info retrieval
    applications
  • select the best techniques and integrate into a
    generic runtime system with AME API
  • add maintainability benchmarks
  • can we quantify administrative work needed to
    maintain a certain level of availability?
  • Perhaps look at data security via encryption?
  • Consider denial of service? (or a job for IATF?)

43
The UC Berkeley IRAM/ISTORE Projects: Computers
for the Post-PC Era
  • For more information:
  • http://iram.cs.berkeley.edu/istore
  • istore-group@cs.berkeley.edu

44
Backup Slides
  • (mostly in the area of benchmarking)

45
Case study
  • Software RAID-5 plus web server
  • Linux/Apache vs. Windows 2000/IIS
  • Why software RAID?
  • well-defined availability guarantees
  • RAID-5 volume should tolerate a single disk
    failure
  • reduced performance (degraded mode) after failure
  • may automatically rebuild redundancy onto spare
    disk
  • simple system
  • easy to inject storage faults
  • Why web server?
  • an application with measurable QoS metrics that
    depend on RAID availability and performance

46
Benchmark environment: metrics
  • QoS metrics measured
  • hits per second
  • roughly tracks response time in our experiments
  • degree of fault tolerance in storage system
  • Workload generator and data collector
  • SpecWeb99 web benchmark
  • simulates realistic high-volume user load
  • mostly static read-only workload; some dynamic
    content
  • modified to run continuously and to measure
    average hits per second over each 2-minute
    interval

47
Benchmark environment: faults
  • Focus on faults in the storage system (disks)
  • How do disks fail?
  • according to Tertiary Disk project, failures
    include
  • recovered media errors
  • uncorrectable write failures
  • hardware errors (e.g., diagnostic failures)
  • SCSI timeouts
  • SCSI parity errors
  • note: no head crashes, no fail-stop failures

48
Disk fault injection technique
  • To inject reproducible failures, we replaced one
    disk in the RAID with an emulated disk
  • a PC that appears as a disk on the SCSI bus
  • I/O requests processed in software, reflected to
    local disk
  • fault injection performed by altering SCSI
    command processing in the emulation software
    (sketched below)
  • Types of emulated faults
  • media errors (transient, correctable,
    uncorrectable)
  • hardware errors (firmware, mechanical)
  • parity errors
  • power failures
  • disk hangs/timeouts
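
A minimal sketch of the emulated-disk idea: a software layer that services block requests normally (reflecting them to a local disk in the real setup) but can be told to return a chosen error for a chosen block. The class, fault names, and interface are illustrative, not the actual emulator's.

```python
class EmulatedDisk:
    """Serves reads from a backing store; can inject a chosen fault (sketch)."""
    def __init__(self):
        self.blocks = {}             # lba -> data (real setup reflects to disk)
        self.fault = None            # e.g. ("uncorrectable_media_error", lba)

    def inject(self, kind, lba):
        self.fault = (kind, lba)

    def read(self, lba):
        if self.fault and self.fault[1] == lba:
            raise IOError(f"injected {self.fault[0]} at LBA {lba}")
        return self.blocks.get(lba, b"\x00" * 512)

disk = EmulatedDisk()
disk.inject("uncorrectable_media_error", lba=42)
try:
    disk.read(42)
except IOError as e:
    print(e)                         # the RAID layer above would see this error
```
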

49
System configuration
  • RAID-5 volume: 3 GB capacity, 1 GB used per disk
    (capacity arithmetic sketched below)
  • 3 physical disks, 1 emulated disk, 1 emulated
    spare disk
  • 2 web clients connected via 100Mb switched
    Ethernet
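
The 3 GB figure is standard RAID-5 capacity arithmetic: one disk's worth of parity spread across the four active disks, with the hot spare not counted.

```python
# RAID-5 capacity check for the configuration on this slide.
active_disks = 3 + 1       # 3 physical + 1 emulated disk (spare not counted)
per_disk_gb = 1            # 1 GB used per disk
usable_gb = (active_disks - 1) * per_disk_gb   # one disk's worth of parity
print(usable_gb, "GB usable")                  # -> 3 GB, as on the slide
```
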

50
Results: single-fault experiments
  • One experiment for each type of fault (15 total)
  • only one fault injected per experiment
  • no human intervention
  • system allowed to continue until stabilized or
    crashed
  • Four distinct system behaviors observed:
  • (A) no effect: system ignores fault
  • (B) RAID system enters degraded mode
  • (C) RAID system begins reconstruction onto spare
    disk
  • (D) system failure (hang or crash)

51
System behavior: single-fault
[QoS-over-time graphs, one per observed behavior: (A) no effect, (B) enter degraded mode, (C) begin reconstruction, (D) system failure.]
52
System behavior: single-fault (2)
  • Windows ignores benign faults
  • Windows can't automatically rebuild
  • Linux reconstructs on all errors
  • Both systems fail when disk hangs

53
Interpretation: single-fault experiments
  • Linux and Windows take opposite approaches to
    managing benign and transient faults
  • these faults do not necessarily imply a failing
    disk
  • Tertiary Disk: 368/368 disks had transient SCSI
    errors; 13/368 disks had transient hardware
    errors; only 2/368 needed replacing
  • Linux is paranoid and stops using a disk on any
    error
  • fragile: system is more vulnerable to multiple
    faults
  • but no chance of slowly-failing disk impacting
    perf.
  • Windows ignores most benign/transient faults
  • robust: less likely to lose data, more
    disk-efficient
  • less likely to catch slowly-failing disks and
    remove them
  • Neither policy is ideal!
  • need a hybrid?

54
Results: multiple-fault experiments
  • Scenario:
  • (1) disk fails
  • (2) data is reconstructed onto spare
  • (3) spare fails
  • (4) administrator replaces both failed disks
  • (5) data is reconstructed onto new disks
  • Requires human intervention
  • to initiate reconstruction on Windows 2000
  • simulate 6 minute sysadmin response time
  • to replace disks
  • simulate 90 seconds of time to replace hot-swap
    disks

55
Interpretation: multi-fault experiments
  • Linux and Windows have different reconstruction
    philosophies
  • Linux uses idle bandwidth for reconstruction
  • little impact on application performance
  • increases length of time system is vulnerable to
    faults
  • Windows steals app. bandwidth for reconstruction
  • reduces application performance
  • minimizes system vulnerability
  • but must be manually initiated (or scripted)
  • Windows favors fault-tolerance over performance;
    Linux favors performance over fault-tolerance
  • the same design philosophies seen in the
    single-fault experiments

56
Maintainability Observations
  • Scenario: administrator accidentally removes and
    replaces a live disk in degraded mode
  • double failure: no guarantee on data integrity
  • theoretically, can recover if writes are queued
  • Windows recovers, but loses active writes
  • journalling NTFS is not corrupted
  • all data not being actively written is intact
  • Linux will not allow removed disk to be
    reintegrated
  • total loss of all data on RAID volume!

57
Maintainability Observations (2)
  • Scenario: administrator adds a new spare
  • a common task that can be done with hot-swap
    drive trays
  • Linux requires a reboot for the disk to be
    recognized
  • Windows can dynamically detect the new disk
  • Windows 2000 RAID is easier to maintain
  • easier GUI configuration
  • more flexible in adding disks
  • SCSI rescan and NTFS deal with administrator
    goofs
  • less likely to require administration due to
    transient errors
  • BUT must manually initiate reconstruction when
    needed