CS 201 Computer Systems Programming Chapter 4
1
CS 201 Computer Systems Programming, Chapter 4:
Computer Taxonomy
  • Herbert G. Mayer, PSU CS
  • Status: 10/15/2013

2
Syllabus
  • Introduction
  • Common Architecture Attributes
  • General Limitations
  • Data-Stream and Instruction-Stream
  • Generic Architecture Model
  • Instruction Set Architecture (ISA)
  • Iron Law of Performance
  • Uniprocessor (UP) Architectures
  • Multiprocessor (MP) Architectures
  • Hybrid Architectures
  • References

3
Introduction: Uniprocessors
  • Single Accumulator Architectures, earliest in the
    1940s e.g. Atanasoff, Zuse, von Neumann
  • General-Purpose Register Architectures (GPR)
  • 2-Address Architecture, i.e. GPR with one operand
    implied, e.g. IBM 360
  • 3-Address Architecture, i.e. GPR with all
    operands of arithmetic operation explicit, e.g.
    VAX 11/70
  • Stack Machines (e.g. B5000, B6000, HP3000)
  • Pipelined architecture, e.g. CDC 5000, Cyber 6000
  • Vector Architecture, e.g. Amdahl 470/6, competing
    with IBM's 360 in the 1970s
  • blurs the line to Multiprocessor

4
Introduction: Multiprocessors
  • Shared Memory Architecture, e.g. Illiac IV, BSP
  • Distributed Memory Architecture
  • Systolic Architecture, see Intel iWarp and CMU's
    Warp architecture
  • Data Flow Machine, see Jack Dennis's work at MIT

5
Introduction: Hybrid Architectures
  • Superscalar Architecture, see Intel 80860, AKA
    i860
  • VLIW Architecture
  • see Multiflow computer
  • or systolic array architecture, like Warp at CMU
    or iWarp at Intel in the 1990s
  • Pipelined Architecture: debatable whether it is a
    hybrid architecture
  • EPIC Architecture, see HP and Intel Itanium
    architecture

6
Common Architecture Attributes
  • Main memory (main store), external from processor
  • Program instructions stored in main memory
  • Also, data stored in memory; typical for von
    Neumann architecture
  • Data available in memory, distributed over static
    memory, stack, heap, reserved OS space, free
    space, IO space
  • Instruction pointer (AKA instruction counter,
    program counter pc), other special registers
  • Von Neumann memory bottleneck: everything
    travels on the same, single bus

7
Common Architecture Attributes
  • Accumulator (register, 1 or many) holds result of
    arithmetic-logical operation
  • Memory Controller handles memory access requests
    from processor; moves bits to/from memory; is
    part of the chipset
  • Current trend is to move some of the memory
    controller or IO controller onto the CPU chip;
    caveat: that does not mean the chipset IS part of
    the CPU!
  • Logical processor unit includes FP unit, Integer
    unit, control unit, register file, load-store
    unit, pathways
  • Physical processor unit includes heat sensors,
    frequency control, voltage regulator, and more

8
General Limitations
  • Compute-Bound
  • type of application, in which the vast majority
    of execution time is spent fetching and executing
    instructions; time to load and store data to/from
    memory is a small fraction of the overall time
  • Memory-Bound
  • application, in which the majority of execution
    time is spent loading and storing data in memory;
    time executing instructions is small vs. time
    to access memory
  • IO-Bound
  • application, in which the majority of execution
    time is spent accessing secondary storage; time
    executing instructions, even the time accessing
    memory, is small vs. time to access secondary
    storage
  • Backup-Bound
  • Like IO-Bound, but backup storage medium can be
    even slower than typical secondary storage devices

9
Data-Stream and Instruction-Stream
  • Classification developed by Michael J. Flynn,
    1966
  • Single-Instruction, Single-Data Stream (SISD)
    Architecture
  • PDP-11
  • Single-Instruction, Multiple-Data Stream (SIMD)
    Architecture
  • Array Processors, Solomon, Illiac IV, BSP, TMC
  • Multiple-Instruction, Single-Data Stream (MISD)
    Architecture
  • Pipelined architecture
  • Multiple-Instruction, Multiple-Data Stream
    Architecture (MIMD)
  • true multiprocessor

10
Generic Architecture Model
11
Instruction Set Architecture (ISA)
  • ISA is the boundary between Software and Hardware
  • Specifies the logical machine visible to
    programmer and compiler
  • Is the functional specification for processor
    designers
  • That boundary is sometimes a very low-level piece
    of system SW that handles exceptions, interrupts,
    and HW-specific services that could fall into the
    domain of the OS

12
Instruction Set Architecture (ISA)
  • Specified by the ISA are:
  • Operations: what to perform and in which order
  • Active, temporary operand storage in the CPU
  • accumulator, stack, registers
  • note that the stack can be word-sized, even
    bit-sized (e.g. the extreme design of the
    successor to NCR's Century architecture of the
    1970s)
  • Number of operands per instruction; some
    implicit, others explicit
  • Operand location: where and how to locate/specify
    the operands; register, literal, data in memory
  • Type and size of operands: bit, byte, word,
    double-word, . . .
  • Instruction encoding in binary
  • Data types: int, float, double, decimal, char, bit

13
Instruction Set Architecture (ISA)
14
Iron Law of Performance
  • Clock-rate doesn't count! Bus width doesn't
    count. Number of registers and operations
    executed in parallel doesn't count!
  • What counts is how long it takes for my
    computational task to complete. That time is of
    the essence of computing!
  • If a MIPS-based solution running at 1 GHz
    completes a program X in 2 minutes, while an
    Intel Pentium 4-based solution running at 3 GHz
    completes that same program X in 2.5 minutes,
    programmers are more interested in the MIPS
    solution
  • If a solution on an Intel CPU can be expressed in
    an object program of size Y bytes, but on an IBM
    architecture of size 1.1 Y bytes, the Intel
    solution is generally more attractive
  • Meaning of this:
  • Wall-clock time (Time) is time I have to wait for
    completion
  • Program Size is overall complexity of
    computational task
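
  The Iron Law itself is conventionally stated as:

    Time / Program = Instructions / Program
                     × Cycles / Instruction
                     × Time / Cycle

  i.e. wall-clock time is the product of dynamic
  instruction count, CPI, and cycle time.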

15
Iron Law of Performance
16
Uniprocessor (UP) Architectures
  • Single Accumulator Architecture (SAA)
  • Single register to hold operation results
  • Conventionally called accumulator
  • Accumulator used as destination of arithmetic
    operations, and as (one) source
  • SAA has central processing unit, memory unit,
    connecting memory bus; typical for von Neumann
    architecture
  • The pc points to the next instruction in memory
    to be executed
  • Sample: ENIAC

17
Uniprocessor (UP) Architectures
  • General-Purpose Register (GPR) Architecture
  • Accumulates ALU results in n registers, typically
    4, 8, 16, 64
  • Allows register-to-register operations, fast!
  • GPR is essentially a multi-register extension of
    SAA
  • Two-address architecture specifies one source
    operand explicitly, another implicitly, plus one
    destination
  • Three-address architecture specifies two source
    operands explicitly, plus an explicit destination
  • Variations allow additional index registers, base
    registers, multiple index registers, etc.

18
Uniprocessor (UP) Architectures
  • Stack Machine Architecture (SMA)
  • AKA zero-address architecture, since arithmetic
    operations require no explicit operands, hence no
    operand addresses; all are implied to be on the
    stack, except for push and pop
  • Wake-up call to students: What is the equivalent
    of push/pop on a GPR architecture? (one answer is
    sketched after the next slide)
  • Pure Stack Machine (SMA) has no registers
  • Hence performance is inherently poor, as all
    operations involve memory on a stack machine
  • However, one can design an SMA that implements
    the n top-of-stack elements as registers, i.e. as
    a Stack Cache, n = 4, 8, . . .
  • Sample architectures: Burroughs B5000, HP 3000
  • Implement impure stack operations that bypass
    top-of-stack (tos) operand addressing
  • Sample code sequence to compute follows on the
    next slide

19
Uniprocessor (UP) Architectures
  • Stack Machine Architecture (SMA)
  • res = a * ( 145 + b ) -- operand sizes are
    implied!
  • push a      -- destination implied: stack
  • pushlit 145 -- also destination implied
  • push b      -- ditto
  • add         -- 2 sources, and destination implied
  • mult        -- 2 sources, and destination implied
  • pop res     -- source implied: stack
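
  For comparison, the same computation on a 3-address
  GPR machine; a sketch in hypothetical notation (not
  from the slides), and one possible answer to the
  wake-up call: loads and stores replace push and
  pop, and every operand is named explicitly:

    load  r1, a        -- explicit source a, destination r1
    load  r2, b        -- explicit source b
    addi  r2, r2, 145  -- r2 = 145 + b, literal explicit
    mult  r3, r1, r2   -- r3 = a * ( 145 + b )
    store r3, res      -- explicit destination res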

20
Uniprocessor (UP) Architectures
  • Pipelined Architecture (PA)
  • Arithmetic Logic Unit, ALU, split into separate,
    sequentially connected units in PA
  • Each unit is referred to as a stage; more
    precisely, the time at which the action is done
    is the stage
  • Each of these stages/units can be initiated once
    per cycle
  • Yet each subunit is implemented in HW just once
  • Multiple subunits operate in parallel on
    different sub-ops, each executing a different
    stage; each stage is part of one instruction
    execution, many stages running in parallel
  • Non-unit time, i.e. a differing number of cycles
    per operation, causes operations to terminate at
    different times
  • Operations abort in an intermediate stage when
    the flow of control changes, e.g. due to a
    branch, exception, return, conditional branch,
    call

21
Uniprocessor (UP) Architectures
  • Pipelined Architecture (PA)

22
Uniprocessor (UP) Architectures
  • Pipelined Architecture (PA)
  • Operation must stall in case of data or control
    dependence; such a stall is AKA an interlock
  • Ideally each instruction can be partitioned into
    the same number of stages, i.e. sub-operations
  • Operations to be pipelined can sometimes be
    evenly partitioned into equal-length
    sub-operations
  • That equal-length time quantum might as well be a
    single sub-clock
  • In practice this is hard/impossible for the
    architect to achieve; compare for example integer
    add and floating point divide!

23
Uniprocessor (UP) Architectures
  • Pipelined Architecture (PA)
  • Ideally all operations have independent operands
  • i.e. one operand being computed is not needed as
    source of the next few operations
  • if they were needed, and often they are, then
    this would cause a dependence, which causes a
    stall
  • read after write (RAW), write after read (WAR)
  • write after write with use in between (WAW); a
    sketch follows this list
  • Also, ideally, all instructions just happen to be
    arranged sequentially one after another
  • In reality, there are branches, calls, returns
    etc.
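
  A minimal sketch of a RAW dependence (illustration,
  not from the slides), in generic 3-address
  notation:

    add r1, r2, r3   -- writes r1 in its wb stage
    mul r4, r1, r5   -- reads r1 as source: RAW, must
                     -- stall until r1 is written

  Scheduling independent instructions between the two
  can hide the stall.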

24
Uniprocessor (UP) Architectures
  • Simplified Pipelined Resource Diagram (rendered
    in text below)
  • if: fetch an instruction
  • de: decode the instruction
  • op1: fetch or generate the first operand, if any
  • op2: fetch or generate the second operand, if
    any
  • exec: execute that stage of the overall
    operation
  • wb: write result back to destination, if any;
    e.g. noop has no destination, halt has no
    destination
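
  Such a resource diagram, rendered in text; each
  column is one cycle, each row one instruction:

    cycle:    1    2    3    4    5    6    7    8
    instr 1:  if   de   op1  op2  exec wb
    instr 2:       if   de   op1  op2  exec wb
    instr 3:            if   de   op1  op2  exec wb

  Once the pipeline is full, one instruction
  completes per cycle, though each takes six cycles
  end to end.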

25
Uniprocessor (UP) Architectures
  • Superscalar Architecture (shown under Hybrid)
  • Identical to regular uniprocessor architecture
  • But some arithmetic or logical units are
    replicated
  • E.g. may have multiple floating point (FP)
    multipliers
  • Or FP multiplier and FP adder may work at the
    same time
  • The key is: on a superscalar architecture,
    sometimes more than one instruction can execute
    at one moment!
  • Provided that there is no data dependence!
  • First superscalar machines included CDC 6600,
    Intel i960CA, and AMD 29000 series
  • Object code can look identical to code for strict
    uni-processor, yet the HW fetches more than just
    the next instruction, and performs data
    dependence analysis

26
Uniprocessor (UP) Architectures
  • Vector Architecture (VA)
  • Registers implemented as HW array of identical
    registers, named vr[i]
  • VA may also have scalar registers, named r0, r1,
    etc.
  • Scalar register can also be the first of the
    vector registers
  • Vector registers can load/store block of
    contiguous data
  • Still in sequence, but overlapped; the number of
    steps to complete a load/store of a vector also
    depends on bus width
  • Vector machine can perform multiple operations of
    the same kind on whole contiguous blocks of
    operands
  • Still in sequence, but overlapped, and all
    operands are readily available
  • Otherwise operates like a GPR architecture, but
    on vector operands; if vector size is 1, then VA
    is identical to UP

27
Uniprocessor (UP) Architectures
Vector Architecture (VA)
28
Uniprocessor (UP) Architectures
Sample Vector Architecture operation:

  ldv    vr1, mem[i]        -- loads 64 memory locs from mem[i..i+63]
  stv    vr2, mem[j]        -- stores vr2 in 64 contiguous locs
  vadd   vr1, vr2, vr3      -- register-register vector add
  cvaddf r0, vr1, vr2, vr3  -- has conditional meaning

  -- sequential equivalent:
  for i = 0 to 63 do
    if bit i in r0 is 1 then
      vr1[i] = vr2[i] + vr3[i]  -- e.g. cvadd r0, r1, r2, r3
    else
      -- do not move corresponding bits
    end if
  end for

  -- parallel syntax equivalent:
  forall i = 0 to 63 do parallel
    if bit i in r0 is 1 then
      vr1[i] = vr2[i] + vr3[i]
    end if
  end parallel for
29
Multiprocessor (MP) Architectures
  • Shared Memory Architecture (SMA)
  • Equal access to memory for all n processors, p0
    to p(n-1)
  • Only one will succeed in accessing shared memory,
    when there are multiple, quasi-simultaneous
    accesses
  • Simultaneous memory access must be deterministic;
    needs an arbiter to ensure determinism
  • Von Neumann bottleneck is tighter than in a
    conventional UP system
  • Generally there are twice as many loads as there
    are stores in typical object code
  • Occasionally, some processors are idle due to
    memory conflict
  • Typical number of processors n = 4, but n = 8 and
    greater possible, with large 2nd level cache,
    even larger 3rd level
  • Only limited commercial success and acceptance,
    programming burden frequently on programmer
  • Morphing in the 2000s into multi-core and
    hyper-threaded architectures, where programming
    burden is on multi-threading OS or the programmer

30
Multiprocessor (MP) Architectures
  • Shared Memory Architecture (SMA)

31
Multiprocessor (MP) Architectures
  • Distributed Memory Architecture (DMA)
  • Processors have private memories, AKA local
    memories
  • Yet the programmer has to see a single, logical
    memory space, regardless of local distribution
  • Hence each processor pi always has access to its
    own memory Mem[i]
  • The collection of all memories Mem[i], i = 0..n-1,
    is the logical data space
  • Thus, processors must access each other's
    memories
  • Done via Message Passing or Virtual Shared Memory
    (a sketch of such primitives follows this list)
  • Messages must be routed; the route must be
    determined
  • Route may be long, i.e. require multiple,
    intermediate nodes
  • Blocking when a message is expected but hasn't
    arrived yet
  • Blocking when the destination cannot receive
  • Growing message buffer size increases the illusion
    of asynchronicity of sending and receiving
    operations
  • Key parameters: time for 1 hop, and the package
    overhead to send an empty message
  • Message may also be delayed because of network
    congestion
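
  A sketch of hypothetical blocking message-passing
  primitives (the names send/recv are illustrative,
  not from the slides):

    send(pj, msg)  -- blocks while pj's buffer is full;
                   -- unbuffered: blocks until pj receives
    recv(pi, buf)  -- blocks until a message from pi
                   -- has arrived in buf

  Larger buffers let send return sooner, producing
  the growing illusion of asynchronicity noted above.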

32
Multiprocessor (MP) Architectures
  • Distributed Memory Architecture (DMA)

33
Multiprocessor (MP) Architectures
  • Systolic Array Architecture (SAA)
  • Very few designed: CMU and Intel, for (then) ARPA
  • Each processor has private memory
  • Network is pre-defined by the Systolic Pathway
    (SP)
  • Each node is pre-connected via SP to some subset
    of other processors
  • Node connectivity determined by network topology
  • Systolic pathway is a high-performance network;
    sending and receiving may be synchronized
    (blocking) or asynchronous (data received are
    buffered)
  • Typical network topologies: line, ring, torus,
    hex grid, mesh, etc.
  • Sample below is a ring; wrap-around along x and y
    dimensions not shown
  • Processor can write to x or y gate; sends word
    off on x or y SP
  • Processor can read from x or y gate; consumes
    word from x or y SP
  • Buffered SA: can write to gate, even if receiver
    cannot read
  • Reading from a gate when no message is available
    blocks
  • Automatic code generation for non-buffered SA is
    hard; the compiler must keep track of
    interprocessor synchronization
  • Can view SP as an extension of memory with
    infinite capacity, but with sequential access

34
Multiprocessor (MP) Architectures
  • Systolic Array Architecture (SAA)

35
Multiprocessor (MP) Architectures
  • Systolic Array Architecture (SAA)
  • Note that each pathway, x or y, may be
    bi-directional
  • May have any number of pathways; nothing magic
    about 2 (x and y), could be 3 or more
  • Possible to have I/O capability with each node
  • Typical application: large polynomials of the
    form below (a per-node sketch follows this list)
  • y = k0 + k1·x^1 + k2·x^2 + .. + k(n-1)·x^(n-1)
    = Σ ki·x^i
  • Next example shows a torus without displaying the
    wrap-around pathways across both dimensions
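
  As an illustration (not from the slides), a linear
  systolic array can evaluate such a polynomial with
  one coefficient ki per node (ordered from highest
  power), each node performing one Horner step on the
  values flowing through its gates; notation
  hypothetical:

    read  x, r1       -- running sum from left neighbor
    read  x, r2       -- the point x itself
    mult  r1, r1, r2  -- sum = sum * x
    addl  r1, r1, ki  -- sum = sum * x + ki (ki is local)
    write y, r2       -- forward x unchanged
    write y, r1       -- forward the updated sum

  After the last node, the emerging value is
  y = Σ ki·x^i.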

36
Multiprocessor (MP) Architectures
  • Systolic Array Architecture (SAA)

37
Hybrid Architectures
  • Superscalar Architecture (SA)
  • Replicates (duplicates) some operations in HW
  • Seems like scalar architecture w.r.t. object
    code; can compute some operations of UP in
    parallel, e.g. fadd and fmult
  • Is almost a parallel architecture, if it has
    multiple copies of some hardware units, say two
    fadd units
  • Is not an MP architecture; the complete ALU is
    not replicated
  • Has multiple parts of an ALU, possibly multiple
    FPA units, or FPM units, and/or integer units
  • Arithmetic operations simultaneous with load and
    store operations; note data dependences!
  • Instruction fetch is speculative, since the
    number of parallel operations is unknown; rule:
    fetch too much! But fetch no more than the
    longest possible superscalar pattern

38
Hybrid Architectures
  • Superscalar Architecture (SA)
  • Code sequence looks like a sequence of
    instructions for a scalar processor
  • Example: 80486 code executed on Pentium
    processors
  • More famous and successful example: the 80860
    processor
  • Object code can be custom-tailored by the
    compiler, i.e. the compiler can have a
    superscalar target processor in mind and bias
    code emission, knowing that some code sequences
    are better suited for superscalar execution
  • Fetch enough instruction bytes to support the
    longest possible object sequence
  • Decoding is the bottleneck for CISC, way easier
    for RISC with its fixed-size 32-bit units
  • Sample of superscalar: the i80860 could run in
    parallel one FPA, one FPM, two integer ops, and a
    load or store (see the sketch below)
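
  A hypothetical group of instructions that such
  hardware could issue in the same cycle (a sketch,
  not actual i860 code); no member reads a result
  produced by another member:

    fadd f1, f2, f3  -- FP adder
    fmul f4, f5, f6  -- FP multiplier
    add  r1, r2, r3  -- integer op 1
    sub  r4, r5, r6  -- integer op 2
    load r7, mem[a]  -- load/store unit

  Any data dependence among these would force the HW
  to serialize the affected instructions.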

39
Hybrid Architectures
  • Superscalar Architecture (SA)

40
Hybrid Architectures
  • Very Long Instruction Word Architecture (VLIW)
  • Very Long Instruction Word, typically 128 bits or
    more
  • VLIW machine also has scalar operations
  • VLIW code is no longer scalar, but explicitly
    parallel
  • Limitations like in superscalar; VLIW is not a
    general MP architecture
  • subinstructions do not have concurrent memory
    access
  • dependences must be resolved before code emission
  • But the VLIW opcode is designed to execute in
    parallel
  • VLIW suboperations can be defined as no-op, thus
    just the other suboperations run in parallel (see
    the sample word below)
  • Compiler/programmer explicitly packs
    parallelizable operations into a VLIW instruction
  • Just like horizontal microcode compaction
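
  A sketch of one VLIW instruction word with four
  slots (layout hypothetical, not the actual
  Multiflow or iWarp encoding):

    [ fadd f1,f2,f3 | fmul f4,f5,f6 | add r1,r2,r3 | noop ]

  All occupied slots execute in parallel; the noop
  slot simply leaves its unit idle for this cycle.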

41
Hybrid Architectures
  • VLIW
  • Sample: Compute instruction of CMU Warp and
    Intel iWarp
  • Could be a 1-bit (or few-bit) opcode for the
    compute instruction plus sub-opcodes for the
    subinstructions
  • Data dependence example: result of FPA cannot be
    used as operand for FPM in the same VLIW
    instruction (see the sketch after this list)
  • But provided proper SW pipelining (not covered in
    CS 201) both subinstructions may refer to the
    same FP register
  • Result of int1 cannot be used as operand for
    int2, etc.
  • With SW pipelining both subinstructions may refer
    to the same int register
  • Thus, the need to software-pipeline
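
  A sketch of the dependence rule, using the same
  hypothetical encoding as above:

    [ fadd f1, f2, f3 | fmul f4, f1, f5 ]  -- illegal: fmul reads the
                                           -- f1 written in this word
    [ fadd f1, f2, f3 | noop             ] -- legal: split across
    [ noop            | fmul f4, f1, f5  ] -- two words

  With software pipelining, the fmul slot can instead
  consume the f1 produced by the fadd of the previous
  iteration, filling both slots every cycle.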

42
Hybrid Architectures
  • Itanium EPIC Architecture
  • Explicitly Parallel Instruction Computing
  • Group instructions into bundles
  • Straighten out the branches by associating a
    predicate with instructions; avoids branches and
    executes speculatively
  • Execute instructions in parallel, say the else
    clause and the then clause of an If Statement
  • Decide at run time which of the predicates is
    true, and (post) complete just that path from
    multiple choices; discard the others (see the
    sketch below)
  • Use speculation to straighten branch tree
  • Use rotating register file
  • Has many registers, not just 64 GPRs
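
  A sketch of predication in IA-64 style assembly
  (syntax approximate):

    cmp.eq p1, p2 = r1, r2  // sets complementary predicates
    (p1) add r3 = r4, r5    // then-clause, posted only if p1
    (p2) sub r3 = r4, r6    // else-clause, posted only if p2

  Both predicated instructions can issue in parallel;
  only the one whose predicate is true posts its
  result, so the branch disappears.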

43
Hybrid Architectures
  • Itanium
  • Groups and bundles lump multiple compute steps
    into one that can be run in parallel
  • Parallel comparisons allow fast decisions
  • Predication associates a condition (the
    predicate) with 2 simultaneously executed
    instruction sequences, only 1 of which will be
    posted
  • Speculation fetches operands, not knowing for
    sure whether this results in use; a branch may
    invalidate the early fetch
  • Branch elimination straightens out code with
    jumps
  • Branch prediction
  • Large register file

44
Hybrid Architectures
  • Itanium
  • Numerous branch registers speed up execution by
    having some branch destinations in registers,
    fast to load into the ip reg
  • Multiple CFM registers, Current Frame Marker
    regs; avoid slowness due to memory access
  • See separate lecture note

45
References
  • http://cs.illinois.edu/cs@illinois/history
  • http://www.arl.wustl.edu/~pcrowley/cse526/bsp2.pdf
  • http://dl.acm.org/citation.cfm?id=102450
  • http://csg.csail.mit.edu/Dataflow/talks/DennisTalk.pdf
  • http://en.wikipedia.org/wiki/Flynn's_taxonomy
  • http://www.ajwm.net/amayer/papers/B5000.html
  • http://www.robelle.com/smugbook/classic.html
  • http://en.wikipedia.org/wiki/ILLIAC_IV
  • http://www.intel.com/design/itanium/manuals.htm
  • http://www.csupomona.edu/~hnriley/www/VonN.html
  • http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf
  • VLIW Architecture: http://www.nxp.com/acrobat_download2/other/vliw-wp.pdf
  • ACM reference to Multiflow computer architecture:
    http://dl.acm.org/citation.cfm?id=110622&coll=portal&dl=ACM