IRAM and ISTORE Projects
Transcript and Presenter's Notes

Title: IRAM and ISTORE Projects


1
IRAM and ISTORE Projects
  • Aaron Brown, James Beck, Rich Fromm, Joe Gebis,
    Paul Harvey, Adam Janin, Dave Judd,
    Kimberly Keeton, Christoforos Kozyrakis, David
    Martin, Rich Martin, Thinh Nguyen, David
    Oppenheimer, Steve Pope, Randi Thomas,
    Noah Treuhaft, Sam Williams, John
    Kubiatowicz, Kathy Yelick, and David Patterson
  • http://iram.cs.berkeley.edu/istore
  • Fall 2000 DIS DARPA Meeting

2
IRAM and ISTORE Vision
  • Integrated processor in memory provides efficient
    access to high memory bandwidth
  • Two Post-PC applications:
  • IRAM: a single-chip system for embedded and
    portable applications
  • Target: media processing (speech, images, video,
    audio)
  • ISTORE: building block, when combined with disk,
    for storage and retrieval servers
  • Up to 10K nodes in one rack
  • Non-IRAM prototype addresses key scaling issues:
    availability, manageability, evolution

Photo from Itsy, Inc.
3
IRAM Overview
  • A processor architecture for embedded/portable
    systems running media applications
  • Based on media processing and embedded DRAM
  • Simple, scalable, energy and area efficient
  • Good compiler target

[Block diagram: MIPS64 5Kc core with FPU, 8KB instruction cache, CP and
SysAD interfaces; vector unit with an 8KB vector register file, 512B flag
register file, two arithmetic units (Arith 0/1), two flag units (Flag 0/1),
and a memory unit with TLB, on 256b/64b datapaths; a memory crossbar
connects to eight 2MB DRAM macros (DRAM0-DRAM7); JTAG and DMA off-chip
interfaces.]
4
Architecture Details
  • MIPS64 5Kc core (200 MHz)
  • Single-issue scalar core with 8 KByte I/D caches
  • Vector unit (200 MHz)
  • 8 KByte register file (32 64b elements per
    register)
  • 256b datapaths, can be subdivided into 16b, 32b,
    64b
  • 2 arithmetic units (1 with single-precision FP), 2 flag processing units
  • Memory unit
  • 4 address generators for strided/indexed accesses
  • Main memory system
  • 8 2-MByte DRAM macros
  • 25ns random access time, 7.5ns page access time
  • Crossbar interconnect
  • 12.8 GBytes/s peak bandwidth per direction
    (load/store)
  • Off-chip interface
  • 2-channel DMA engine and 64b SysAD bus

5
Floorplan
  • Technology: IBM SA-27E
  • 0.18µm CMOS, 6 metal layers
  • 290 mm2 die area
  • 225 mm2 for memory/logic
  • Transistor count: 150M
  • Power supply:
  • 1.2V for logic, 1.8V for DRAM
  • Typical power consumption: 2.0 W
  • 0.5 W (scalar) + 1.0 W (vector) + 0.2 W (DRAM) +
    0.3 W (misc)
  • Peak vector performance:
  • 1.6/3.2/6.4 Gops w/o multiply-add (64b/32b/16b
    operations)
  • 3.2/6.4/12.8 Gops w. madd
  • 1.6 Gflops (single-precision)
  • Tape-out planned for March '01

6
Alternative Floorplans
  • VIRAM-8MB: 4 lanes, 8 MBytes, 190 mm2,
    3.2 Gops at 200 MHz (32-bit ops)
  • VIRAM-2Lanes: 2 lanes, 4 MBytes, 120 mm2,
    1.6 Gops at 200 MHz
  • VIRAM-Lite: 1 lane, 2 MBytes, 60 mm2,
    0.8 Gops at 200 MHz
7
VIRAM Compiler
[Compiler structure: C, C++, and Fortran95 frontends feed Cray's PDGCS
optimizer, with code generators for T3D/T3E, C90/T90/SV1, and SV2/VIRAM.]
  • Based on Cray's production compiler
  • Challenges:
  • narrow data types and scalar/vector memory
    consistency
  • Advantages relative to media extensions:
  • powerful addressing modes and an ISA independent of
    datapath width

8
Exploiting On-Chip Bandwidth
  • Vector ISA uses high bandwidth to mask latency
  • Compiled matrix-vector multiplication: 2
    flops/element (a sketch of the kernel follows this list)
  • Easy compilation problem; stresses memory
    bandwidth
  • Compare to 304 Mflops (64-bit) for Power3
    (hand-coded)
  • Performance scales with number of lanes up to 4
  • Need more memory banks than the default DRAM macro
    for 8 lanes
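
For concreteness, the compiled kernel is essentially the textbook loop below
(a minimal C sketch: the row-major layout, single-precision types, and names
are my own assumptions, not the benchmark source). It performs one multiply
and one add, i.e. 2 flops, per matrix element, and every element of A is
touched exactly once, so sustained speed is set by how fast the on-chip DRAM
can stream the matrix rather than by arithmetic:

    #include <stddef.h>

    /* Dense matrix-vector multiply, y = A*x, with A stored row-major (n x m).
     * 2 flops per matrix element; each element of A is read once, so the
     * loop is memory-bandwidth bound. A vectorizing compiler can map these
     * loops onto the vector unit. */
    void mvm(size_t n, size_t m, const float *A, const float *x, float *y)
    {
        for (size_t i = 0; i < n; i++) {
            float sum = 0.0f;
            for (size_t j = 0; j < m; j++)
                sum += A[i * m + j] * x[j];   /* 1 multiply + 1 add */
            y[i] = sum;
        }
    }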

9
Compiling Media Kernels on IRAM
  • The compiler generates code for narrow data
    widths, e.g., 16-bit integer
  • Compilation model is simple, more scalable
    (across generations) than MMX, VIS, etc.
  • Strided and indexed loads/stores are simpler than
    pack/unpack
  • Maximum vector length is longer than the datapath
    width (256 bits); all lane scalings are handled with a
    single executable

10
Vector vs. SIMD Example
  • Simple image processing example: conversion from
    RGB to YUV (a scalar C reference follows the equations)
  • Y = ( 9798R + 19235G +  3736B) / 32768
  • U = (-4784R -  9437G +  4221B) / 32768 + 128
  • V = (20218R - 16941G -  3277B) / 32768 + 128
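
For reference, a plain scalar C rendering of the conversion above (a sketch:
the packed-RGB buffer layout, separate output planes, and the omission of
clamping are assumptions; the coefficients are copied verbatim from the
equations). This is the per-pixel work that both the 22-instruction VIRAM
listing and the 121-instruction MMX listing on the following slides
implement; the stride-3 accesses to the packed rgb buffer are exactly what
the strided vlds/vsts instructions in the VIRAM code express directly:

    #include <stdint.h>
    #include <stddef.h>

    /* Scalar reference for the fixed-point RGB-to-YUV conversion above.
     * rgb is packed R,G,B per pixel; y, u, v are separate output planes.
     * Saturation/clamping is omitted for brevity. */
    void rgb_to_yuv(const uint8_t *rgb, uint8_t *y, uint8_t *u, uint8_t *v,
                    size_t npixels)
    {
        for (size_t i = 0; i < npixels; i++) {
            int32_t r = rgb[3 * i], g = rgb[3 * i + 1], b = rgb[3 * i + 2];
            y[i] = (uint8_t)(( 9798 * r + 19235 * g +  3736 * b) / 32768);
            u[i] = (uint8_t)((-4784 * r -  9437 * g +  4221 * b) / 32768 + 128);
            v[i] = (uint8_t)((20218 * r - 16941 * g -  3277 * b) / 32768 + 128);
        }
    }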

11
VIRAM Code (22 instructions)
  • RGBtoYUV:
  • vlds.u.b    r_v,  r_addr, stride3, addr_inc   # load R
  • vlds.u.b    g_v,  g_addr, stride3, addr_inc   # load G
  • vlds.u.b    b_v,  b_addr, stride3, addr_inc   # load B
  • xlmul.u.sv  o1_v, t0_s, r_v                   # calculate Y
  • xlmadd.u.sv o1_v, t1_s, g_v
  • xlmadd.u.sv o1_v, t2_s, b_v
  • vsra.vs     o1_v, o1_v, s_s
  • xlmul.u.sv  o2_v, t3_s, r_v                   # calculate U
  • xlmadd.u.sv o2_v, t4_s, g_v
  • xlmadd.u.sv o2_v, t5_s, b_v
  • vsra.vs     o2_v, o2_v, s_s
  • vadd.sv     o2_v, a_s, o2_v
  • xlmul.u.sv  o3_v, t6_s, r_v                   # calculate V
  • xlmadd.u.sv o3_v, t7_s, g_v
  • xlmadd.u.sv o3_v, t8_s, b_v
  • vsra.vs     o3_v, o3_v, s_s
  • vadd.sv     o3_v, a_s, o3_v
  • vsts.b      o1_v, y_addr, stride3, addr_inc   # store Y

12
MMX Code (part 1)
  • RGBtoYUV
  • movq mm1, eax
  • pxor mm6, mm6
  • movq mm0, mm1
  • psrlq mm1, 16
  • punpcklbw mm0, ZEROS
  • movq mm7, mm1
  • punpcklbw mm1, ZEROS
  • movq mm2, mm0
  • pmaddwd mm0, YR0GR
  • movq mm3, mm1
  • pmaddwd mm1, YBG0B
  • movq mm4, mm2
  • pmaddwd mm2, UR0GR
  • movq mm5, mm3
  • pmaddwd mm3, UBG0B
  • punpckhbw mm7, mm6
  • pmaddwd mm4, VR0GR
  • paddd mm0, mm1
  • paddd mm4, mm5
  • movq mm5, mm1
  • psllq mm1, 32
  • paddd mm1, mm7
  • punpckhbw mm6, ZEROS
  • movq mm3, mm1
  • pmaddwd mm1, YR0GR
  • movq mm7, mm5
  • pmaddwd mm5, YBG0B
  • psrad mm0, 15
  • movq TEMP0, mm6
  • movq mm6, mm3
  • pmaddwd mm6, UR0GR
  • psrad mm2, 15
  • paddd mm1, mm5
  • movq mm5, mm7
  • pmaddwd mm7, UBG0B
  • psrad mm1, 15
  • pmaddwd mm3, VR0GR

13
MMX Code (part 2)
  • paddd mm6, mm7
  • movq mm7, mm1
  • psrad mm6, 15
  • paddd mm3, mm5
  • psllq mm7, 16
  • movq mm5, mm7
  • psrad mm3, 15
  • movq TEMPY, mm0
  • packssdw mm2, mm6
  • movq mm0, TEMP0
  • punpcklbw mm7, ZEROS
  • movq mm6, mm0
  • movq TEMPU, mm2
  • psrlq mm0, 32
  • paddw mm7, mm0
  • movq mm2, mm6
  • pmaddwd mm2, YR0GR
  • movq mm0, mm7
  • pmaddwd mm7, YBG0B
  • movq mm4, mm6
  • pmaddwd mm6, UR0GR
  • movq mm3, mm0
  • pmaddwd mm0, UBG0B
  • paddd mm2, mm7
  • pmaddwd mm4,
  • pxor mm7, mm7
  • pmaddwd mm3, VBG0B
  • punpckhbw mm1,
  • paddd mm0, mm6
  • movq mm6, mm1
  • pmaddwd mm6, YBG0B
  • punpckhbw mm5,
  • movq mm7, mm5
  • paddd mm3, mm4
  • pmaddwd mm5, YR0GR
  • movq mm4, mm1
  • pmaddwd mm4, UBG0B
  • psrad mm0, 15

14
MMX Code (pt. 3: 121 instructions total)
  • pmaddwd mm7, UR0GR
  • psrad mm3, 15
  • pmaddwd mm1, VBG0B
  • psrad mm6, 15
  • paddd mm4, OFFSETD
  • packssdw mm2, mm6
  • pmaddwd mm5, VR0GR
  • paddd mm7, mm4
  • psrad mm7, 15
  • movq mm6, TEMPY
  • packssdw mm0, mm7
  • movq mm4, TEMPU
  • packuswb mm6, mm2
  • movq mm7, OFFSETB
  • paddd mm1, mm5
  • paddw mm4, mm7
  • psrad mm1, 15
  • movq ebx, mm6
  • packuswb mm4,
  • movq ecx, mm4
  • packuswb mm5, mm3
  • add ebx, 8
  • add ecx, 8
  • movq edx, mm5
  • dec edi
  • jnz RGBtoYUV

15
IRAM Status
  • Chip:
  • ISA has not changed significantly in over a year
  • Verilog complete, except SRAM for scalar cache
  • Testing framework in place
  • Compiler:
  • Backend code generation complete
  • Continued performance improvements, especially
    for narrow data widths
  • Application benchmarks:
  • Hand-coded kernels better than MMX, VIS, general-purpose DSPs
  • DCT, FFT, MVM, convolution, image composition, ...
  • Compiled kernels demonstrate ISA advantages
  • MVM, sparse MVM, decrypt, image composition, ...
  • Full applications: H.263 encoding (done), speech
    (underway)

16
Scaling to 10K Processors
  • IRAM + micro-disk offer huge scaling
    opportunities
  • Still many hard system problems (AME)
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or complexity
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

17
Is Maintenance the Key?
  • Rule of Thumb: Maintenance cost ≈ 10X HW cost
  • so over a 5-year product life, ~95% of the cost is
    maintenance

18
Hardware Techniques for AME
  • Cluster of Storage Oriented Nodes (SON)
  • Scalable, tolerates partial failures, automatic
    redundancy
  • Heavily instrumented hardware
  • Sensors for temp, vibration, humidity, power,
    intrusion
  • Independent diagnostic processor on each node
  • Remote control of power; collects environmental
    data for abnormalities
  • Diagnostic processors connected via independent
    network
  • On-demand network partitioning/isolation
  • Allows testing, repair of online system
  • Managed by diagnostic processor
  • Built-in fault injection capabilities
  • Used for hardware introspection
  • Important for AME benchmarking

19
ISTORE-1 system
  • Hardware: plug-and-play intelligent devices with
    self-monitoring, diagnostics, and fault injection
    hardware
  • intelligence used to collect and filter
    monitoring data
  • diagnostics and fault injection enhance
    robustness
  • networked to create a scalable shared-nothing
    cluster
  • Scheduled for 4Q '00

20
ISTORE-1 System Layout
[System layout: PE1000s (PowerEngines 100Mb switches), PE5200s
(PowerEngines 1Gb switches), and the UPS units used.]
21
ISTORE Brick Node Block Diagram
[Block diagram: Mobile Pentium II module (CPU, North Bridge, South Bridge,
BIOS, 256MB DRAM, Super I/O, dual UART) with an 18GB SCSI disk, 4x100Mb/s
Ethernets on PCI, and a diagnostic processor (with flash, RAM, and RTC)
providing monitoring and control over the diagnostic network.]
  • Sensors for heat and vibration
  • Control over power to individual nodes
22
  • ISTORE Brick Node
  • Pentium-II/266MHz
  • 256 MB DRAM
  • 18 GB SCSI (or IDE) disk
  • 4x100Mb Ethernet
  • m68k diagnostic processor + CAN diagnostic
    network
  • Packaged in standard half-height RAID array
    canister

23
Software Techniques
  • Reactive introspection
  • Mining available system data
  • Proactive introspection
  • Isolation + fault insertion => test recovery code
  • Semantic redundancy
  • Use of coding and application-specific
    checkpoints
  • Self-Scrubbing data structures
  • Check (and repair?) complex distributed
    structures
  • Load adaptation for performance faults
  • Dynamic load balancing for regular computations
  • Benchmarking
  • Define quantitative evaluations for AME

24
Network Redundancy
  • Each brick node has 4 100Mb Ethernets
  • TCP striping used for performance (a simplified
    sketch follows this list)
  • Demonstration on 2-node prototype using 3 links
  • When a link fails, packets on that link are
    dropped
  • Nodes detect failures using independent pings
  • More scalable approach being developed
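
A simplified sketch of the striping idea (this is not the ISTORE code:
socket setup, message framing, and in-order reassembly at the receiver are
all omitted, and every name here is illustrative). Chunks are dealt
round-robin across the links; a link that the independent ping monitor has
marked down, or whose send fails, is simply skipped:

    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define NLINKS 4

    int  link_fd[NLINKS];   /* one TCP connection per 100Mb Ethernet link */
    bool link_up[NLINKS];   /* maintained by an independent ping monitor  */

    /* Send len bytes, spreading fixed-size chunks round-robin over the
     * links currently believed to be up. Returns bytes sent, or -1 if
     * every link is down. */
    ssize_t striped_send(const char *buf, size_t len, size_t chunk)
    {
        size_t sent = 0;
        int next = 0;

        while (sent < len) {
            int tries = 0;
            while (!link_up[next] && tries++ < NLINKS)   /* skip dead links */
                next = (next + 1) % NLINKS;
            if (!link_up[next])
                return -1;                               /* all links down */

            size_t n = (len - sent < chunk) ? len - sent : chunk;
            ssize_t rc = send(link_fd[next], buf + sent, n, 0);
            if (rc < 0) {                 /* send failed: mark link down */
                link_up[next] = false;
                continue;
            }
            sent += (size_t)rc;
            next = (next + 1) % NLINKS;
        }
        return (ssize_t)sent;
    }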

25
Load Balancing for Performance Faults
  • Failure is not always a discrete property
  • Some fraction of components may fail
  • Some components may perform poorly
  • Graph shows effect of Graduated Declustering on
    cluster I/O with disk performance faults (a toy sketch
    of the load-shifting idea follows)
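
As a toy illustration only (this is not the actual Graduated Declustering
algorithm; the two-replica setup, units, and names are invented), the
load-shifting idea is that a reader splits its next batch of requests
between mirrored replicas in proportion to the bandwidth each replica is
currently delivering, so a disk with a performance fault automatically
sheds work to its peer:

    /* Toy load-shifting sketch: split `batch` block requests between two
     * mirrored replicas in proportion to their recently observed delivered
     * bandwidth. A slow replica gets proportionally fewer requests. */
    typedef struct {
        double observed_mb_s;   /* measured delivered bandwidth, MB/s */
    } replica_t;

    void split_requests(const replica_t *r0, const replica_t *r1,
                        int batch, int *n0, int *n1)
    {
        double total = r0->observed_mb_s + r1->observed_mb_s;
        if (total <= 0.0) {            /* no measurements yet: split evenly */
            *n0 = batch / 2;
            *n1 = batch - *n0;
            return;
        }
        *n0 = (int)((double)batch * r0->observed_mb_s / total + 0.5);
        *n1 = batch - *n0;
    }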

26
Availability benchmarks
  • Goal: quantify variation in QoS as fault events
    occur
  • Leverage existing performance benchmarks
  • to generate fair workloads
  • to measure/trace quality of service metrics
  • Use fault injection to compromise system
  • Results are most accessible graphically (a harness
    skeleton is sketched after this list)
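
A skeleton of such a harness might look like the sketch below (purely
illustrative: the probe and injection callbacks, the one-second sampling
interval, and the log format are assumptions, not the ISTORE benchmark
code). The workload runs continuously, a QoS metric is sampled every
interval, and a single fault is injected partway through, so the resulting
time series can be graphed directly:

    #include <stdio.h>

    typedef double (*qos_probe_t)(void);  /* e.g., hits/sec over last second */
    typedef void   (*fault_fn_t)(void);   /* e.g., disable a disk or a link  */

    /* Run for duration_s seconds, inject one fault at fault_at_s, and log
     * (time, QoS) pairs for plotting. probe() is assumed to block for
     * roughly the sampling interval while the workload keeps running. */
    void availability_run(qos_probe_t probe, fault_fn_t inject,
                          int duration_s, int fault_at_s, FILE *log)
    {
        for (int t = 0; t < duration_s; t++) {
            if (t == fault_at_s)
                inject();                        /* compromise the system once */
            double qos = probe();
            fprintf(log, "%d %.2f\n", t, qos);   /* plot column 2 vs. column 1 */
        }
    }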

27
Example Faults in Software RAID
  • Compares Linux and Solaris reconstruction
  • Linux: minimal performance impact, but longer
    window of vulnerability to a second fault
  • Solaris: large perf. impact, but restores
    redundancy fast

28
Towards Manageability Benchmarks
  • Goal is to gain experience with a small piece of
    the problem
  • can we measure the time and learning-curve costs
    for one task?
  • Task: handling disk failure in a RAID system
  • includes detection and repair
  • Same test systems as availability case study
  • Windows 2000/IIS, Linux/Apache, Solaris/Apache
  • Five test subjects and fixed training session
  • (Too small to draw statistical conclusions)

29
Sample results: time
  • Graphs plot human time, excluding wait time

30
Analysis of time results
  • Rapid convergence across all OSs/subjects
  • despite high initial variability
  • final plateau defines minimum time for task
  • plateau invariant over individuals/approaches
  • Clear differences in plateaus between OSs
  • Solaris < Windows < Linux
  • note statistically dubious conclusion given
    sample size!

31
ISTORE Status
  • ISTORE Hardware
  • All 80 Nodes (boards) manufactured
  • PCB backplane in layout
  • Finish 80 node system December 2000
  • Software
  • 2-node system running -- boots OS
  • Diagnostic Processor SW and device driver done
  • Network striping done; fault adaptation ongoing
  • Load balancing for performance heterogeneity done
  • Benchmarking
  • Availability benchmark example complete
  • Initial maintainability benchmark complete,
    revised strategy underway

32
BACKUP SLIDES
  • IRAM

33
Modular Vector Unit Design
  • Single 64b lane design replicated 4 times
  • Reduces design and testing time
  • Provides a simple scaling model (up or down)
    without major control or datapath redesign
  • Lane scaling independent of DRAM scaling
  • Most instructions require only intra-lane
    interconnect
  • Tolerance to interconnect delay scaling

34
Performance FFT (1)
35
Performance FFT (2)
36
Media Kernel Performance
37
Base-line system comparison
  • All numbers in cycles/pixel
  • MMX and VIS results assume all data in L1 cache

38
Vector Architecture State
39
Vector Instruction Set
  • Complete load-store vector instruction set
  • Uses the MIPS64 ISA coprocessor 2 opcode space
  • Ideas work with any core CPU: ARM, PowerPC, ...
  • Architecture state:
  • 32 general-purpose vector registers
  • 32 vector flag registers
  • Data types supported in vectors:
  • 64b, 32b, 16b (and 8b)
  • 91 arithmetic and memory instructions
  • Not specified by the ISA:
  • Maximum vector register length
  • Functional unit datapath width
    (a strip-mining sketch of why this works follows)
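
The reason one binary can span 1-, 2-, and 4-lane implementations is that
compiled loops are strip-mined against whatever maximum vector length the
hardware reports at run time. The C sketch below shows the idea in scalar
form; get_max_vl() and the plain inner loop are stand-ins for the real ISA
mechanism (a control register and vector instructions), so treat it as an
illustration rather than the VIRAM interface:

    #include <stddef.h>

    /* Hypothetical stand-in for querying the implementation's maximum
     * vector length for 16-bit elements. */
    extern size_t get_max_vl(void);

    /* Strip-mined 16-bit vector add: the strip length comes from the
     * hardware, so the same executable exploits however many lanes the
     * implementation provides. */
    void vadd16(short *c, const short *a, const short *b, size_t n)
    {
        size_t mvl = get_max_vl();
        for (size_t i = 0; i < n; i += mvl) {
            size_t vl = (n - i < mvl) ? n - i : mvl;  /* short final strip */
            for (size_t j = 0; j < vl; j++)           /* one vector op     */
                c[i + j] = a[i + j] + b[i + j];
        }
    }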

40
Compiler/OS Enhancements
  • Compiler support
  • Conditional execution of vector instructions
  • Using the vector flag registers
  • Support for software speculation of load
    operations
  • Operating system support
  • MMU-based virtual memory
  • Restartable arithmetic exceptions
  • Valid and dirty bits for vector registers
  • Tracking of maximum vector length used

41
BACKUP SLIDES
  • ISTORE

42
ISTORE: A Server for the Post-PC Era
  • Aaron Brown, Dave Martin, David Oppenheimer, Noah
    Treuhaft, Dave Patterson, Katherine Yelick
  • University of California at Berkeley
  • Patterson@cs.berkeley.edu
  • UC Berkeley ISTORE Group
  • istore-group@cs.berkeley.edu
  • August 2000

43
ISTORE as Storage System of the Future
  • Availability, Maintainability, and Evolutionary
    growth: key challenges for storage systems
  • Maintenance cost >10X purchase cost per year
  • Even 2X purchase cost for 1/2 maintenance cost
    wins
  • AME improvement enables even larger systems
  • ISTORE also has cost-performance advantages:
  • Better space, power/cooling costs (@ colocation
    site)
  • More MIPS, cheaper MIPS, no bus bottlenecks
  • Compression reduces network cost, encryption
    protects data
  • Single interconnect supports evolution of
    technology; a single network technology to
    maintain/understand
  • Match to future software storage services
  • Future storage service software targets clusters

44
Lampson: Systems Challenges
  • Systems that work
  • Meeting their specs
  • Always available
  • Adapting to changing environment
  • Evolving while they run
  • Made from unreliable components
  • Growing without practical limit
  • Credible simulations or analysis
  • Writing good specs
  • Testing
  • Performance
  • Understanding when it doesn't matter

"Computer Systems Research: Past and Future,"
keynote address, 17th SOSP, Dec. 1999, Butler
Lampson, Microsoft
45
Jim Gray: Trouble-Free Systems
  • Manager
  • Sets goals
  • Sets policy
  • Sets budget
  • System does the rest.
  • Everyone is a CIO (Chief Information Officer)
  • Build a system
  • used by millions of people each day
  • Administered and managed by a ½ time person.
  • On hardware fault, order replacement part
  • On overload, order additional equipment
  • Upgrade hardware and software automatically.

"What Next? A Dozen Remaining IT Problems,"
Turing Award Lecture, FCRC, May 1999, Jim Gray,
Microsoft
46
Jim Gray: Trustworthy Systems
  • Build a system used by millions of people that
  • Only services authorized users
  • Service cannot be denied (can't destroy data or
    power).
  • Information cannot be stolen.
  • Is always available (out less than 1 second per
    100 years = 8 9's of availability)
  • 1950s: 90% availability; today: 99% uptime for
    web sites, 99.99% for well-managed sites
    (50 minutes/year); 3 extra 9's in 45 years.
  • Goal: 5 more 9's, or 1 second per century.
  • And prove it.

47
Hennessy: What Should the New World Focus Be?
  • Availability
  • Both appliance & service
  • Maintainability
  • Two functions
  • Enhancing availability by preventing failure
  • Ease of SW and HW upgrades
  • Scalability
  • Especially of service
  • Cost
  • per device and per service transaction
  • Performance
  • Remains important, but it's not SPECint

"Back to the Future: Time to Return to Longstanding
Problems in Computer Systems?" Keynote address,
FCRC, May 1999, John Hennessy, Stanford
48
The real scalability problems: AME
  • Availability
  • systems should continue to meet quality of
    service goals despite hardware and software
    failures
  • Maintainability
  • systems should require only minimal ongoing human
    administration, regardless of scale or
    complexity. Today, the cost of maintenance is
    10-100X the cost of purchase
  • Evolutionary Growth
  • systems should evolve gracefully in terms of
    performance, maintainability, and availability as
    they are grown/upgraded/expanded
  • These are problems at today's scales, and will
    only get worse as systems grow

49
Principles for achieving AME
  • No single points of failure, lots of redundancy
  • Performance robustness is more important than
    peak performance
  • Performance can be sacrificed for improvements in
    AME
  • resources should be dedicated to AME
  • biological systems: >50% of resources on
    maintenance
  • can make up performance by scaling system
  • Introspection
  • reactive techniques to detect and adapt to
    failures, workload variations, and system
    evolution
  • proactive techniques to anticipate and avert
    problems before they happen

50
Hardware Techniques (1): SON
  • SON: Storage Oriented Nodes
  • Distribute processing with storage
  • If AME is really important, provide resources!
  • Most storage servers are limited by the speed of their CPUs!!
  • Amortize the sheet metal, power, cooling, and network for
    the disk to add a processor, memory, and a real
    network?
  • Embedded processors: 2/3 the perf at 1/10 the cost and power?
  • Serial lines and switches are also growing with Moore's
    Law; less need today to centralize vs. bus-
    oriented systems
  • Advantages of cluster organization
  • Truly scalable architecture
  • Architecture that tolerates partial failure
  • Automatic hardware redundancy

51
Hardware techniques (2)
  • Heavily instrumented hardware
  • sensors for temp, vibration, humidity, power,
    intrusion
  • helps detect environmental problems before they
    can affect system integrity
  • Independent diagnostic processor on each node
  • provides remote control of power, remote console
    access to the node, selection of node boot code
  • collects, stores, processes environmental data
    for abnormalities
  • non-volatile flight recorder functionality
  • all diagnostic processors connected via
    independent diagnostic network

52
Hardware techniques (3)
  • On-demand network partitioning/isolation
  • Internet applications must remain available
    despite failures of components, therefore can
    isolate a subset for preventative maintenance
  • Allows testing, repair of online system
  • Managed by diagnostic processor and network
    switches via diagnostic network
  • Built-in fault injection capabilities
  • Power control to individual node components
  • Injectable glitches into I/O and memory busses
  • Managed by diagnostic processor
  • Used for proactive hardware introspection
  • automated detection of flaky components
  • controlled testing of error-recovery mechanisms

53
Hardware culture (4)
  • Benchmarking
  • One reason for 1000X processor performance was
    ability to measure (vs. debate) which is better
  • e.g., which is most important to improve: clock
    rate, clocks per instruction, or instructions
    executed?
  • Need AME benchmarks
  • what gets measured gets done
  • benchmarks shape a field
  • quantification brings rigor

54
Example single-fault result
  • Compares Linux and Solaris reconstruction
  • Linux: minimal performance impact, but longer
    window of vulnerability to a second fault
  • Solaris: large perf. impact, but restores
    redundancy fast

55
Deriving ISTORE
  • What is the interconnect?
  • FC-AL? (Interoperability? Cost of switches?)
  • Infiniband? (When? Cost of switches? Cost of
    NICs?)
  • Gbit Ethernet?
  • Pick Gbit Ethernet as the commodity switch and link
  • As the mainstream technology, it is the fastest improving
    in cost-performance
  • We assume Gbit Ethernet switches will get cheap
    over time (network processors, volume, ...)

56
Deriving ISTORE
  • Number of disks per Gbit port?
  • Bandwidth of a 2000 disk:
  • Raw bit rate: 427 Mbit/sec
  • Data transfer rate: 40.2 MByte/sec
  • Capacity: 73.4 GB
  • Disk trends:
  • BW: +40%/year
  • Capacity, areal density, $/MB: +100%/year
  • 2003 disks:
  • 500 GB capacity (<8X)
  • 110 MB/sec or 0.9 Gbit/sec (2.75X)
  • Number of disks per Gbit port = 1 (the arithmetic
    is sketched below)
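
The 2003 figures follow from compounding the stated trend rates for three
years; a small C check of the arithmetic (the baseline numbers and growth
rates are exactly the ones on this slide):

    #include <math.h>
    #include <stdio.h>

    /* Compound the year-2000 disk figures forward three years using the
     * trends above: bandwidth +40%/year, capacity +100%/year. */
    int main(void)
    {
        double years    = 3.0;
        double bw_2000  = 40.2;   /* MByte/sec data transfer rate */
        double cap_2000 = 73.4;   /* GB                           */

        double bw_2003   = bw_2000  * pow(1.4, years);  /* ~110 MB/sec       */
        double cap_2003  = cap_2000 * pow(2.0, years);  /* ~590 GB, i.e. <8X */
        double gbit_2003 = bw_2003 * 8.0 / 1000.0;      /* ~0.9 Gbit/sec     */

        printf("2003 disk: %.0f GB, %.0f MB/sec (%.2f Gbit/sec)\n",
               cap_2003, bw_2003, gbit_2003);
        printf("disks per Gbit Ethernet port: %.1f\n", 1.0 / gbit_2003);
        return 0;
    }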

57
ISTORE-1 Brick
  • Webster's Dictionary: brick: "a handy-sized unit
    of building or paving material typically being
    rectangular and about 2 1/4 x 3 3/4 x 8 inches"
  • ISTORE-1 Brick: 2 x 4 x 11 inches (1.3x)
  • Single physical form factor, fixed cooling
    required, compatible network interface to
    simplify physical maintenance, scaling over time
  • Contents should evolve over time: contains the most
    cost-effective MPU, DRAM, disk, compatible NI
  • If useful, could have special bricks (e.g., DRAM-
    rich)
  • Suggests a network that will last and evolve: Ethernet

58
ISTORE-1 hardware platform
  • 80-node x86-based cluster, 1.4TB storage
  • cluster nodes are plug-and-play, intelligent,
    network-attached storage bricks
  • a single field-replaceable unit to simplify
    maintenance
  • each node is a full x86 PC w/ 256MB DRAM, 18GB
    disk
  • more CPU than NAS; fewer disks/node than a cluster

59
Common Question RAID?
  • Switched Network sufficient for all types of
    communication, including redundancy
  • Hierarchy of buses is generally not superior to
    switched network
  • Veritas, others offer software RAID 5 and
    software Mirroring (RAID 1)
  • Another use of processor per disk

60
A Case for Intelligent Storage
  • Advantages
  • Cost of Bandwidth
  • Cost of Space
  • Cost of Storage System v. Cost of Disks
  • Physical Repair, Number of Spare Parts
  • Cost of Processor Complexity
  • Cluster advantages: dependability, scalability
  • 1 v. 2 Networks

61
Cost of Space, Power, Bandwidth
  • Co-location sites (e.g., Exodus) offer space,
    expandable bandwidth, stable power
  • Charge $1000/month per rack (~10 sq. ft.)
  • Includes 1 20-amp circuit/rack; charges
    $100/month per extra 20-amp circuit/rack
  • Bandwidth cost: $500 per Mbit/sec per month

62
Cost of Bandwidth, Safety
  • Network bandwidth cost is significant
  • 1000 Mbit/sec/month => $6,000,000/year
  • Security will increase in importance for storage
    service providers
  • XML => server-side format conversion for gadgets
  • => Storage systems of the future need greater
    computing ability
  • Compress to reduce the cost of network bandwidth 3X:
    save ~$4M/year? (the arithmetic is sketched below)
  • Encrypt to protect information in transit for B2B
  • => Increasing processing per disk for future storage
    apps
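
The dollar figures follow directly from the co-location rate on the
previous slide; a quick check of the arithmetic (the 3X compression ratio
is the slide's own assumption):

    #include <stdio.h>

    /* $500 per Mbit/sec per month, sustained at 1000 Mbit/sec. */
    int main(void)
    {
        double rate_mbits   = 1000.0;
        double usd_per_mbit = 500.0;    /* per Mbit/sec per month */

        double per_year   = rate_mbits * usd_per_mbit * 12.0;  /* $6M/year   */
        double compressed = per_year / 3.0;                    /* 3X smaller */

        printf("1000 Mbit/sec for a year: $%.1fM\n", per_year / 1e6);
        printf("saved by 3X compression:  $%.1fM/year\n",
               (per_year - compressed) / 1e6);
        return 0;
    }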

63
Cost of Space, Power
  • Sun Enterprise server/array (64 CPUs / 60 disks):
  • 10K server (64 CPUs): 70 x 50 x 39 in.
  • A3500 array (60 disks): 74 x 24 x 36 in.
  • 2 Symmetra UPS (11KW): 2 @ 52 x 24 x 27 in.
  • ISTORE-1: 2X savings in space
  • ISTORE-1: 1 rack (big) switches, 1 rack (old)
    UPSs, 1 rack for 80 CPUs/disks (3/8 VME rack
    unit/brick)
  • ISTORE-2: 8X-16X space?
  • Space, power cost/year for 1000 disks: Sun
    $924k, ISTORE-1 $484k, ISTORE-2 $50k

64
Disk Limit Bus Hierarchy
[Diagram: server bus hierarchy: CPU and memory on the memory bus, an
internal I/O bus (PCI) to a RAID controller, an external I/O bus (SCSI,
15 disks/bus) to the disk array, and a storage area network (FC-AL)
between server and storage.]
  • Data rate vs. disk rate:
  • SCSI: Ultra3 (80 MHz), Wide (16 bit) = 160
    MByte/s
  • FC-AL: 1 Gbit/s = 125 MByte/s
  • Use only 50% of a bus (a quick utilization
    calculation follows this list):
  • Command overhead (~20%)
  • Queuing theory (<70%)

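
A quick utilization calculation using the rates above (the ~40 MB/sec
per-disk streaming rate is an assumption carried over from the earlier
disk-trends slide, not a number stated here):

    #include <stdio.h>

    /* How many streaming disks can a shared bus really support if only
     * ~50% of its raw rate is usable (command overhead, queuing)? */
    int main(void)
    {
        double disk_mb_s = 40.0;    /* assumed per-disk streaming rate */
        double scsi_mb_s = 160.0;   /* SCSI Ultra3, Wide               */
        double fcal_mb_s = 125.0;   /* FC-AL, 1 Gbit/sec               */
        double usable    = 0.5;     /* usable fraction of a shared bus */

        printf("SCSI bus:   ~%.1f streaming disks (vs. 15 attached)\n",
               scsi_mb_s * usable / disk_mb_s);
        printf("FC-AL loop: ~%.1f streaming disks\n",
               fcal_mb_s * usable / disk_mb_s);
        return 0;
    }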
65
Physical Repair, Spare Parts
  • ISTORE: compatible modules based on a hot-pluggable
    interconnect (LAN) with few Field Replaceable
    Units (FRUs): node, power supplies, switches,
    network cables
  • Replace a node (disk, CPU, memory, NI) if any part fails
  • Conventional: heterogeneous system with many
    server modules (CPU, backplane, memory cards, ...)
    and disk array modules (controllers, disks, array
    controllers, power supplies, ...)
  • Store all components available somewhere as FRUs
  • Sun Enterprise 10k has 100 types of spare parts
  • Sun 3500 Array has 12 types of spare parts

66
ISTORE Complexity v. Perf
  • Complexity increase:
  • HP PA-8500: issues 4 instructions per clock cycle,
    56-instruction out-of-order execution window, 4Kbit
    branch predictor, 9-stage pipeline, 512 KB I
    cache, 1024 KB D cache (>80M transistors just in
    caches)
  • Intel XScale: 16 KB I, 16 KB D, 1 instruction per clock,
    in-order execution, no branch prediction, 6-stage
    pipeline
  • Complexity costs in development time, development
    power, die size, cost:
  • 550 MHz HP PA-8500: 477 mm2, 0.25 micron/4M, $330,
    60 Watts
  • 1000 MHz Intel StrongARM2 (XScale) @ 1.5 Watts,
    800 MHz at 0.9 W, 50 MHz @ 0.01 W, 0.18 micron
    (old chip: 50 mm2, 0.35 micron, $18)
  • => Count cost for the whole system, not per processor/disk

67
ISTORE Cluster Advantages
  • Architecture that tolerates partial failure
  • Automatic hardware redundancy
  • Transparent to application programs
  • Truly scalable architecture
  • Given that maintenance is 10X-100X capital costs,
    cluster-size limits today are maintenance and floor
    space cost - generally NOT capital costs
  • As a result, it is THE target architecture for
    new software apps for the Internet

68
ISTORE 1 vs. 2 networks
  • Current systems all have a LAN + disk interconnect
    (SCSI, FC-AL)
  • The LAN is improving fastest: most investment, most
    features
  • SCSI, FC-AL: poor network features, improving
    slowly, relatively expensive switches and
    bandwidth
  • FC-AL switches don't interoperate
  • Two sets of cables, wiring?
  • SysAdmin trained in 2 networks, 2 SW interfaces,
    ...?
  • Why not a single network based on the best HW/SW
    technology?
  • Note: there can still be 2 instances of the
    network (e.g. external, internal), but only one
    technology

69
Initial Applications
  • ISTORE-1 is not one super-system that
    demonstrates all these techniques!
  • Initially provide middleware, library to support
    AME
  • Initial application targets
  • information retrieval for multimedia data (XML
    storage?)
  • self-scrubbing data structures, structuring
    performance-robust distributed computation
  • Example: home video server using XML interfaces
  • email service
  • self-scrubbing data structures, online
    self-testing
  • statistical identification of normal behavior

70
A glimpse into the future?
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • ISTORE HW in 5-7 years:
  • 2006 brick: System-on-a-Chip integrated with
    MicroDrive
  • 9GB disk, 50 MB/sec from disk
  • connected via crossbar switch
  • From brick to domino
  • If low power, 10,000 nodes fit into one rack!
  • O(10,000) scale is our ultimate design point

71
Conclusion ISTORE as Storage System of the
Future
  • Availability, Maintainability, and Evolutionary
    growth: key challenges for storage systems
  • Maintenance cost ≈ 10X purchase cost per year, so
    over a 5-year product life, ~95% of the cost of
    ownership
  • Even 2X purchase cost for 1/2 maintenance cost
    wins
  • AME improvement enables even larger systems
  • ISTORE has cost-performance advantages:
  • Better space, power/cooling costs (@ colocation
    site)
  • More MIPS, cheaper MIPS, no bus bottlenecks
  • Compression reduces network cost, encryption
    protects data
  • Single interconnect supports evolution of
    technology; a single network technology to
    maintain/understand
  • Match to future software storage services
  • Future storage service software targets clusters

72
Questions?
  • Contact us if you're interested: email
    patterson@cs.berkeley.edu, http://iram.cs.berkeley.edu/
  • "If it's important, how can you say it's
    impossible if you don't try?"
  • Jean Morreau, a founder of the European Union

73
Clusters and TPC Software 8/00
  • TPC-C: 6 of the Top 10 performance results are clusters,
    including all of the Top 5; 4 are SMPs
  • TPC-H: SMPs and NUMAs
  • 100 GB: all SMPs (4-8 CPUs)
  • 300 GB: all NUMAs (IBM/Compaq/HP, 32-64 CPUs)
  • TPC-R: all are clusters
  • 1000 GB: NCR World Mark 5200
  • TPC-W: all web servers are clusters (IBM)

74
Clusters and TPC-C Benchmark
  • Top 10 TPC-C Performance (Aug. 2000) Ktpm
  • 1. Netfinity 8500R c/s Cluster 441
  • 2. ProLiant X700-96P Cluster 262
  • 3. ProLiant X550-96P Cluster 230
  • 4. ProLiant X700-64P Cluster 180
  • 5. ProLiant X550-64P Cluster 162
  • 6. AS/400e 840-2420 SMP 152
  • 7. Fujitsu GP7000F Model 2000 SMP 139
  • 8. RISC S/6000 Ent. S80 SMP 139
  • 9. Bull Escala EPC 2400 c/s SMP 136
  • 10. Enterprise 6500 Cluster Cluster 135

75
Cost of Storage System v. Disks
  • Examples show the cost of the way we build current
    systems (2 networks, many buses, CPU, ...)

    System      Date   Cost    Maint.  Disks  Disks/CPU  Disks/IObus
    NCR WM      10/97  $8.3M   --       1312    10.2        5.0
    Sun 10k      3/98  $5.2M   --        668    10.4        7.0
    Sun 10k      9/99  $6.2M   $2.1M    1732    27.0       12.0
    IBM Netinf   7/00  $7.8M   $1.8M    7040    55.0        9.0

  • => Too complicated, too heterogeneous
  • And databases are often CPU- or bus-bound!
  • ISTORE: disks per CPU = 1.0
  • ISTORE: disks per I/O bus = 1.0

76
Common Question Why Not Vary Number of
Processors and Disks?
  • Argument: if you can vary the numbers of each to match
    the application, is that a more cost-effective solution?
  • Alternative Model 1: Dual Nodes + E-switches
  • P-node: Processor, Memory, 2 Ethernet NICs
  • D-node: Disk, 2 Ethernet NICs
  • Response:
  • As D-nodes run a network protocol, they still need a
    processor and memory, just smaller; how much is
    saved?
  • Saves processors/disks, costs more NICs/switches:
    N ISTORE nodes vs. N/2 P-nodes + N D-nodes
  • Isn't ISTORE-2 a good HW prototype for this
    model? Only run the communication protocol on N
    nodes, run the full app and OS on N/2

77
Common Question Why Not Vary Number of
Processors and Disks?
  • Alternative Model 2: N disks/node
  • Processor, Memory, N disks, 2 Ethernet NICs
  • Response:
  • Potential I/O bus bottleneck as disk BW grows
  • 2.5" ATA drives are limited to 2/4 disks per ATA
    bus
  • How does a research project pick N? What's
    natural?
  • Is there sufficient processing power and memory
    to run the AME monitoring and testing tasks as
    well as the application requirements?
  • Isn't ISTORE-2 a good HW prototype for this
    model? Software can act as simple disk interface
    over network and run a standard disk protocol,
    and then run that on N nodes per apps/OS node.
    Plenty of Network BW available in redundant
    switches

78
SCSI v. IDE $/GB
  • Prices from PC Magazine, 1995-2000