Rough Schedule - PowerPoint PPT Presentation

About This Presentation
Title:

Rough Schedule

Slides: 19
Provided by: yel3
Transcript and Presenter's Notes


1
Rough Schedule
  • 1:30-2:15 IRAM overview
  • 2:15-3:00 ISTORE overview
  • break
  • 3:15-3:30 Financial
  • 4:00-5:00 Future

2
IRAM Hardware and Software
  • Kathy Yelick
  • Computer Science Division
  • UC Berkeley

3
Intelligent RAM: IRAM
  • Microprocessor and DRAM on a single chip
    • 10X capacity vs. DRAM
    • on-chip memory latency 5-10X, bandwidth 50-100X
    • improve energy efficiency 2X-4X (no off-chip bus)
    • serial I/O 5-10X v. buses
    • smaller board area/volume
  • IRAM advantages extend to
    • a single chip system
    • a building block for larger systems

4
VIRAM System on a Chip
  • 0.18 um EDL process
  • 16 MB DRAM, 8 banks
  • MIPS scalar core and caches @ 200 MHz
  • 4 64-bit vector unit pipelines @ 200 MHz
  • 17x17 mm, 2 Watts target
  • 25.6 GB/s memory (6.4 GB/s per direction
    and per Xbar)
  • 0.8 Gflops (64-bit), 6.4 GOPs (16-bit)
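
The peak figures above can be roughly reproduced from the clock rate and the datapath widths. The following is a back-of-the-envelope sketch, not something taken from the slides: the 256-bit link width is borrowed from the datapath width quoted on the permutation slides, and the counts of two crossbars, two directions, and two 16-bit-capable arithmetic units per lane are assumptions chosen because they are consistent with the quoted totals.

```python
# Back-of-the-envelope check of the VIRAM peak numbers quoted above.
# Assumptions (NOT stated on the slide): 256-bit crossbar links,
# 2 crossbars x 2 directions, and 2 arithmetic units per lane for 16-bit ops.

CLOCK_HZ = 200e6   # 200 MHz, from the slide
LANES = 4          # four 64-bit vector pipelines, from the slide
LANE_BITS = 64     # width of each pipeline, from the slide


def peak_gops(element_bits, units_per_lane=1):
    """Peak operations per second, in units of 1e9, for one element width."""
    elements_per_lane = LANE_BITS // element_bits
    return CLOCK_HZ * LANES * elements_per_lane * units_per_lane / 1e9


def memory_gbytes_per_sec(link_bits=256, directions=2, crossbars=2):
    """Aggregate memory bandwidth in GB/s under the assumptions above."""
    per_link = (link_bits / 8) * CLOCK_HZ / 1e9
    return per_link * directions * crossbars


if __name__ == "__main__":
    print("64-bit peak (G ops/s):", peak_gops(64))
    print("16-bit peak (G ops/s):", peak_gops(16, units_per_lane=2))
    print("memory bandwidth (GB/s):", memory_gbytes_per_sec())
```

Under these assumptions the three results line up with the 0.8, 6.4, and 25.6 figures on the slide; a different split between units, directions, and crossbars could give the same totals.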

[Figure labels: Memory (64 Mbits / 8 MBytes), Xbar, Memory (64 Mbits / 8 MBytes)]

5
IRAM Chip Update
  • IBM supplying embedded DRAM/Logic (100%)
    • Agreement in place and technology files available
  • MIPS supplying scalar core (100%)
    • MIPS processor, caches, TLB
  • MIT supplying FPU (100%)
  • VIRAM-1 tape-out scheduled for late 2000
  • Simplifications
    • Floating point
    • Network interface

6
VIRAM-1 Chip Design Status
  • MIPS scalar core
    • Synthesizable RTL code received from MIPS
    • Cache RAMs to be compiled for IBM technology
    • FPU RTL code almost complete
  • Vector unit
    • RTL models for sub-blocks developed; currently
      being integrated and tested
    • Control logic to be compiled for IBM technology
    • Full-custom layout for multipliers/adders
      developed; layout for shifters to be developed
  • Memory system
    • Synthesizable model for DRAM controllers done
    • To be integrated with IBM DRAM macros
    • Full-custom layout for crossbar under development
  • Testing infrastructure
    • Environment developed for automatic test
      validation
    • Directed tests for single/multiple instruction
      groups developed
    • Random instruction sequence generator developed

7
IRAM Architecture Update
  • ISA mostly frozen since 6/99
    • Changes in 2H 99 for better fixed-point model and
      some instructions for short vectors (auto
      increment and in-register permutations)
    • Minor changes in 1H 00 to address new
      co-processor interface in MIPS core
  • ISA manual publicly available
    • http://www.cs.berkeley.edu
  • Suite of simulators actively used
    • vsim-isa (functional)
      • Major rewrite underway for new scalar processor
      • All UCB code
    • vsim-p (performance), vsim-db (debugger),
      vsim-sync (memory synchronization)

8
IRAM Compiler Status
  • Retarget of Cray backend
  • Steps in compiler development
    • Build MIPS backend (done)
    • Build VIRAM backend for vectorized loops (done)
    • Instruction scheduling for VIRAM-1 (works, but
      could be improved)
    • Insertion of memory barriers (using Cray
      strategy, improving)
    • Optimizations for short loops (reduce overhead)
    • Feedback results to Cray, new version from Cray
      (ongoing)

9
IRAM Compiler Update
  • Study of compiler quality using 100 Dongarra
    loops
    • 70 vectorized
      • Average 10x reduction in dynamic instruction
        count
      • Average vector length of 42
    • 30 did not vectorize, usually due to a dependence
      • Some reductions missed
      • Vector version of math libraries (sin, cos,
        etc.) needed
      • Some failed due to bugs in benchmark
  • Identified 2 specific areas for improvements in
    loop overhead
    • Use VL and MVL more carefully
    • Use auto-increment instruction more extensively
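
The remark about VL and MVL refers to how a vectorized loop is cut into strips no longer than the maximum vector length. The sketch below illustrates that strip-mining pattern in plain Python on a SAXPY loop; it is not the compiler's actual output. The function name and the default `mvl=64` are hypothetical values picked for illustration.

```python
def strip_mined_saxpy(a, x, y, mvl=64):
    """Compute a*x[i] + y[i] in strips of at most `mvl` elements.

    Returns (result, strips). `mvl` stands in for the machine's maximum
    vector length; 64 is an illustrative value, not a VIRAM parameter.
    """
    n = len(x)
    result = list(y)
    strips = 0
    i = 0
    while i < n:
        vl = min(mvl, n - i)          # "set VL" for this strip
        # One vector instruction's worth of work: vl elements at once.
        for j in range(i, i + vl):
            result[j] = a * x[j] + result[j]
        i += vl                       # address bump (auto-increment would fold this in)
        strips += 1
    return result, strips


if __name__ == "__main__":
    out, strips = strip_mined_saxpy(2, [1] * 100, [1] * 100)
    print(out[:3], strips)
```

Each strip costs a fixed amount of loop overhead (setting VL, bumping addresses, branching), so fewer and longer strips mean fewer dynamic instructions, which is where careful use of VL and MVL and of auto-increment addressing would help.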

10
Compiled Applications Update
  • Applications using compiler
    • Speech processing under development
      • Developed new small-memory algorithm for
        speech processing
      • Uses some existing kernels (FFT and MM)
      • Vector search algorithm is most challenging
    • DIS image understanding application under
      development
      • Compiles, but does not yet vectorize well
    • Singular Value Decomposition
      • Better than 2 VLIW machines (TI C67 and
        TM 1100)
      • Challenging BLAS-1 and BLAS-2 operations work
        well on IRAM because of memory bandwidth
  • Kernels
    • SAXPY, MVM, etc.
    • Will include DIS stress-marks
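
One way to see why BLAS-1 and BLAS-2 kernels depend on memory bandwidth is to count floating-point operations per byte moved. The sketch below uses idealized textbook counts; the 4-byte element size and the assumption that each operand is moved exactly once are simplifications for illustration, not measurements from the project.

```python
def saxpy_intensity(n, bytes_per_elem=4):
    """Flops per byte for y = a*x + y (BLAS-1) on vectors of length n."""
    flops = 2 * n                        # one multiply and one add per element
    traffic = 3 * n * bytes_per_elem     # load x, load y, store y
    return flops / traffic


def mvm_intensity(n, bytes_per_elem=4):
    """Flops per byte for y = A*x (BLAS-2) with an n x n matrix."""
    flops = 2 * n * n
    traffic = (n * n + 2 * n) * bytes_per_elem   # A read once, x read, y written
    return flops / traffic


def mm_intensity(n, bytes_per_elem=4):
    """Flops per byte for C = A*B (BLAS-3), assuming ideal operand reuse."""
    flops = 2 * n ** 3
    traffic = 3 * n * n * bytes_per_elem         # A and B read, C written
    return flops / traffic


if __name__ == "__main__":
    for name, fn in [("SAXPY", saxpy_intensity),
                     ("MVM", mvm_intensity),
                     ("MM", mm_intensity)]:
        print(name, round(fn(1000), 3))
```

SAXPY and MVM stay below one flop per byte no matter how large n gets, so their speed is set by how fast memory can feed the arithmetic units, while matrix multiply can reuse operands and is far less sensitive to bandwidth.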

11
[Figure: 10n x n SVD, rank 10. From Herman, Loo, Tang, CS252 project.]

12
Hand-Coded Applications Update
  • Image processing kernels (old FPU model)
  • Note BLAS-2 performance

13
Problem: General Element Permutation
  • Hardware for a full vector permutation
    instruction (128 16b elements, 256b datapath)
  • Datapath: 16 x 16 (x 16b) crossbar; scales as
    O(N^2)
  • Control: 16 16-to-1 multiplexors; scales as
    O(N log N)
  • Time/energy wasted on wide vector register file
    port
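
The scaling claims can be made concrete with a small count. This sketch takes N = 16 from the 16 x 16 crossbar above and uses crosspoints and multiplexor select bits as stand-in cost units; those units are an illustrative assumption, not the designers' own cost model.

```python
def crossbar_crosspoints(n):
    """Crosspoints in an n x n crossbar: every input can reach every output."""
    return n * n


def mux_select_bits(n):
    """Select bits for n multiplexors of n-to-1 each (n a power of two)."""
    bits_per_mux = (n - 1).bit_length()   # log2(n) for a power-of-two n
    return n * bits_per_mux


if __name__ == "__main__":
    for n in (16, 32):
        print(n, crossbar_crosspoints(n), mux_select_bits(n))
```

For N = 16 this gives 256 crosspoints and 64 select bits; doubling N to 32 quadruples the crosspoints to 1024 while the select bits grow to 160.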

14
Simple Vector Permutations
  • Simple steps of butterfly permutations
    • A register provides the butterfly radix
    • Separate instructions for moving elements to
      left/right
  • Sufficient semantics for
    • Fast reductions of vector registers (dot
      products)
    • Fast FFT kernels
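
A reduction built from such butterfly steps can be sketched as follows. This is a plain-Python model: `shift_left` is a hypothetical stand-in for a "move elements left by the radix" instruction, and the exact semantics of the real VIRAM instructions are not given on the slide, so the zero-fill behavior here is an assumption.

```python
def shift_left(v, radix):
    """Return a copy of v with every element moved `radix` positions left.

    Vacated positions on the right are zero-filled (an assumption).
    """
    return v[radix:] + [0] * radix


def butterfly_reduce(v):
    """Sum a power-of-two-length vector in log2(len(v)) permute-and-add steps.

    Returns (total, steps).
    """
    n = len(v)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    v = list(v)
    radix = n // 2
    steps = 0
    while radix >= 1:
        moved = shift_left(v, radix)              # one permutation step
        v = [a + b for a, b in zip(v, moved)]     # one vector add
        radix //= 2
        steps += 1
    return v[0], steps


def dot(x, y):
    """Dot product: elementwise multiply, then butterfly reduction."""
    total, _ = butterfly_reduce([a * b for a, b in zip(x, y)])
    return total


if __name__ == "__main__":
    print(butterfly_reduce([1, 2, 3, 4, 5, 6, 7, 8]))
    print(dot([1, 2, 3, 4], [1, 1, 1, 1]))
```

Each step halves the radix, so an N-element register is reduced in log2(N) permute-and-add pairs without leaving the register file.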

15
Hardware for Simple Permutations
  • Hardware for 128 16b elements, 256b datapath
  • Datapath: 2 buses, 8 tristate drivers, 4
    multiplexors, 4 shifters (by 0, 16b, 32b only);
    scales as O(N)
  • Control: 6 control cases; scales as O(N)
  • Other benefits
    • Consecutive result elements written together
    • Buses used only for small radices

16
FFT Uses In-Register Permutations
[Figure label: Without in-register permutations]

17
Summary
  • IRAM takes advantage of high on-chip bandwidth
    • BLAS-2 performance confirms this
  • Vector IRAM ISA utilizes this bandwidth
    • Unit, strided, and indexed memory access
      patterns supported
    • Exploits fine-grained parallelism, even with
      pointer chasing
  • Compiler
    • Well-understood compiler model, semi-automatic
    • Still some work on code generation quality
  • Application benchmarks
    • Compiled and hand-coded
    • Include FFT, SVD, MVM, sparse MVM, and other
      kernels used in image and signal processing
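
The three memory access patterns named in the summary can be illustrated with a toy memory model. The function names and the flat Python list standing in for memory are hypothetical; this shows only which addresses each pattern touches, not how the VIRAM hardware implements them.

```python
def unit_stride(memory, base, vl):
    """Load vl consecutive elements starting at base."""
    return [memory[base + i] for i in range(vl)]


def strided(memory, base, stride, vl):
    """Load vl elements spaced `stride` apart, e.g. a matrix column."""
    return [memory[base + i * stride] for i in range(vl)]


def indexed(memory, base, indices):
    """Gather: load memory[base + k] for each k in an index vector."""
    return [memory[base + k] for k in indices]


if __name__ == "__main__":
    mem = list(range(100))             # toy memory: value equals address
    print(unit_stride(mem, 10, 4))     # a contiguous run
    print(strided(mem, 0, 4, 4))       # column 0 of a row-major 4-wide matrix
    print(indexed(mem, 20, [3, 0, 7])) # arbitrary offsets, as in sparse MVM
```

Indexed access is what lets kernels such as sparse MVM vectorize, since the column indices of the nonzeros serve directly as the index vector.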

18
IRAM as Building Block for ISTORE
  • System-on-a-chip enables computer, memory,
    redundant network interfaces without
    significantly increasing size of disk
  • Target for 5-7 years
  • Building block: 2006 MicroDrive integrated with
    IRAM
  • 9GB disk, 50 MB/sec disk (projected)
  • connected via crossbar switch
  • O(10) Gflops
  • 10,000 nodes fit into one rack!