Lecture 11: Multithreaded Architectures

1
Lecture 11: Multithreaded Architectures
  • Graduate Computer Architecture
  • Fall 2005
  • Shih-Hao Hung
  • Dept. of Computer Science and Information
    Engineering
  • National Taiwan University

2
Concept
  • Data Access Latency
  • Cache misses (L1, L2)
  • Memory latency (remote, local)
  • Often unpredictable
  • Multithreading (MT)
  • Tolerate or mask long and often unpredictable
    latency operations by switching to another
    context, which is able to do useful work.

3
Why Multithreading Today?
  • ILP is exhausted, TLP is in.
  • Large performance gap between memory and
    processor.
  • Too many transistors on chip
  • More existing MT applications today.
  • Multiprocessors on a single chip.
  • Long network latency, too.

4
Classical Problem, 1960s and 1970s
  • I/O latency prompted multitasking
  • IBM mainframes
  • Multitasking
  • I/O processors
  • Caches within disk controllers

5
Requirements of Multithreading
  • Storage is needed to hold multiple contexts: PC,
    registers, status word, etc.
  • Coordination to match an event with a saved
    context
  • A way to switch contexts
  • Long latency operations must use resources not in
    use
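
The requirements above can be sketched as a small software model. This is only an illustration, not a description of any real machine: the names Context, ContextStore, park, complete, and switch are hypothetical, and the round-robin choice in switch is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Per-thread state that must be stored: PC, registers, status word."""
    thread_id: int
    pc: int = 0
    registers: list = field(default_factory=lambda: [0] * 8)
    status: int = 0


class ContextStore:
    """Hypothetical model: ready contexts plus contexts parked on an event."""

    def __init__(self, contexts):
        self.ready = list(contexts)   # contexts able to do useful work
        self.waiting = {}             # event tag -> context waiting on it
        self.next_tag = 0

    def park(self, ctx):
        """Save a context that started a long-latency operation; return its tag."""
        tag = self.next_tag
        self.next_tag += 1
        self.ready.remove(ctx)
        self.waiting[tag] = ctx
        return tag

    def complete(self, tag):
        """Match a completion event with its saved context and make it ready."""
        ctx = self.waiting.pop(tag)
        self.ready.append(ctx)
        return ctx

    def switch(self):
        """Pick the next ready context (round-robin), or None if all wait."""
        if not self.ready:
            return None
        ctx = self.ready.pop(0)
        self.ready.append(ctx)
        return ctx


store = ContextStore([Context(0), Context(1)])
running = store.switch()          # thread 0 runs first
tag = store.park(running)         # thread 0 issues a long-latency load
running = store.switch()          # thread 1 does useful work meanwhile
resumed = store.complete(tag)     # the load returns; thread 0 is ready again
```

The tag returned by park plays the role of the coordination mechanism: it is the only link between the outstanding operation and the saved context.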

6
Processor Utilization vs. Latency
R: the run length to a long-latency event
L: the amount of latency
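
A minimal sketch of how utilization depends on R and L, under a simple deterministic model. The number of threads N and the context switch cost C are assumptions added here for illustration; the sketch also assumes every run has the same length and that the switch overlaps the outstanding latency.

```python
def utilization(n_threads, run_length, latency, switch_cost=0.0):
    """Processor utilization under a simple deterministic model.

    run_length (R): busy cycles between long-latency events.
    latency (L): cycles each long-latency event takes.
    n_threads (N) and switch_cost (C) are assumptions of this sketch.
    """
    r, l, c = run_length, latency, switch_cost
    linear = n_threads * r / (r + l)    # too few threads to hide all of L
    saturated = r / (r + c)             # L fully hidden; only C is lost
    return min(linear, saturated)


single = utilization(1, run_length=20, latency=80)
five = utilization(5, run_length=20, latency=80)
costly = utilization(8, run_length=20, latency=80, switch_cost=5)
```

With one thread the processor is busy only R / (R + L) of the time. Adding threads raises utilization linearly until the latency is fully hidden, after which only the switch cost limits it.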
7
Problem of the 1980s
  • Problem was revisited due to the advent of
    graphics workstations
  • Xerox Alto, TI Explorer
  • Concurrent processes are interleaved to make the
    workstations more responsive.
  • These processes could drive or monitor display,
    input, file system, network, user processing
  • Process switch was slow so the subsystems were
    microprogrammed to support multiple contexts

8
Scalable Multiprocessor (1990s)
  • Dance hall: a shared interconnect with memory on
    one side and processors on the other.
  • Or processors may have local memory

9
How do the processors communicate?
  • Shared Memory
  • Potential long latency on every load
  • Cache coherency becomes an issue
  • Examples include NYU's Ultracomputer, IBM's RP3,
    BBN's Butterfly, MIT's Alewife, and later
    Stanford's DASH.
  • Synchronization occurs through shared variables,
    locks, flags, and semaphores.
  • Message Passing
  • Programmer deals with latency. This lets the
    programmer minimize the number of messages while
    maximizing their size, and minimize delay by
    sending a message so that it reaches the
    receiver at the time the receiver expects it.
  • Examples include Intel's iPSC and Paragon,
    Caltech's Cosmic Cube, and Thinking Machines'
    CM-5
  • Synchronization occurs through send and receive
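
The two communication styles can be contrasted in a short sketch. It uses Python threads inside one process, so it only models the programming styles and not the hardware; the names counter, mailbox, add_shared, and receiver are hypothetical.

```python
import queue
import threading

# Shared memory: threads communicate through a shared variable and a lock.
counter = 0
counter_lock = threading.Lock()


def add_shared(n):
    global counter
    for _ in range(n):
        with counter_lock:      # synchronization through a lock
            counter += 1


# Message passing: the receiver sees only what arrives in its mailbox.
mailbox = queue.Queue()
received = []


def receiver():
    while True:
        msg = mailbox.get()     # receive blocks until a message arrives
        if msg is None:         # sentinel marks the end of the stream
            break
        received.append(sum(msg))


workers = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(4)]
rx = threading.Thread(target=receiver)
for t in workers + [rx]:
    t.start()
mailbox.put(list(range(100)))   # one large message instead of 100 small ones
mailbox.put(None)
for t in workers + [rx]:
    t.join()
```

In the shared-memory half every update touches the shared variable, which is where the potential latency on every access comes from. In the message-passing half the sender chooses to batch 100 values into a single message.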

10
Cycle-by-Cycle Interleaved Multithreading
  • Denelcor HEP1 (1982), HEP2
  • Horizon, which was never built
  • Tera, MTA

11
Cycle-by-Cycle Interleaved Multithreading
  • Features
  • An instruction from a different context is
    launched at each clock cycle
  • No interlocks or bypasses thanks to a
    non-blocking pipeline
  • Optimizations
  • Keeping context state in the processor (PC,
    registers, status)
  • Assigning a tag to each remote request and then
    matching it on completion
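
A minimal simulation of cycle-by-cycle interleaving. It assumes that a context may have only one instruction in the pipeline at a time, which is one way to obtain a pipeline without interlocks or bypasses; the function name interleave and its parameters are hypothetical.

```python
from collections import deque


def interleave(n_contexts, pipeline_depth, cycles):
    """Issue at most one instruction per cycle, rotating over contexts.

    Assumption of this sketch: a context may not issue again until its
    previous instruction has left the pipeline.
    Returns the context id issued in each cycle (None for a bubble).
    """
    ready = deque(range(n_contexts))    # contexts allowed to issue
    in_flight = deque()                 # (completion_cycle, context_id)
    trace = []
    for cycle in range(cycles):
        # Contexts whose instruction has completed may issue again.
        while in_flight and in_flight[0][0] <= cycle:
            ready.append(in_flight.popleft()[1])
        if ready:
            ctx = ready.popleft()
            in_flight.append((cycle + pipeline_depth, ctx))
            trace.append(ctx)
        else:
            trace.append(None)          # no context available: bubble
    return trace


full = interleave(n_contexts=4, pipeline_depth=4, cycles=8)
alone = interleave(n_contexts=1, pipeline_depth=4, cycles=8)
```

With at least as many contexts as pipeline stages, an instruction is launched every cycle. With a single context, this model issues only once every pipeline_depth cycles.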

12
Challenges with this approach
  • I-Cache
  • Instruction bandwidth
  • I-cache misses: since instructions are fetched
    from many different contexts, instruction
    locality is degraded and the I-cache miss rate
    rises.
  • Register file access time
  • Register file access time increases because the
    register file must grow significantly to
    accommodate many separate contexts.
  • In fact, the HEP and Tera use SRAM to implement
    the register file, which means longer access
    times.
  • Single thread performance
  • Single-thread performance is significantly
    degraded, since the pipeline is forced to switch
    to another context every cycle even if no other
    thread is available.
  • Very high bandwidth network, which is fast and
    wide
  • Retries on load empty or store full

13
Improving Single Thread Performance
  • Do more operations per instruction (VLIW)
  • Allow multiple instructions to issue into
    pipeline from each context.
  • This could lead to pipeline hazards, so other
    safe instructions could be interleaved into the
    execution.
  • For Horizon and Tera, the compiler detects such
    data dependencies and the hardware enforces them
    by switching to another context when one is
    detected.
  • Switch on load
  • Switch on miss
  • Switching on load or miss will increase the
    context switch time.

14
Simultaneous Multithreading (SMT)
  • Tullsen et al. (U. of Washington), ISCA 1995
  • A way to utilize pipeline with increased
    parallelism from multiple threads.

15
Simultaneous Multithreading
16
SMT Architecture
  • Straightforward extension to conventional
    superscalar design.
  • multiple program counters and some mechanism by
    which the fetch unit selects one each cycle,
  • a separate return stack for each thread for
    predicting subroutine return destinations,
  • per-thread instruction retirement, instruction
    queue flush, and trap mechanisms,
  • a thread id with each branch target buffer entry
    to avoid predicting phantom branches, and
  • a larger register file, to support logical
    registers for all threads plus additional
    registers for register renaming.
  • The size of the register file affects the
    pipeline and the scheduling of load-dependent
    instructions.
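
Two of the additions listed above can be sketched in a few lines: a branch target buffer whose entries carry a thread id, and the register file sizing rule. The class and function names, the direct-mapped organization, and the example numbers are assumptions for illustration, not details of any particular SMT design.

```python
class TaggedBTB:
    """Direct-mapped branch target buffer whose entries carry a thread id."""

    def __init__(self, n_entries=16):
        self.n_entries = n_entries
        self.entries = [None] * n_entries   # entry: (thread_id, pc, target)

    def update(self, thread_id, pc, target):
        self.entries[pc % self.n_entries] = (thread_id, pc, target)

    def predict(self, thread_id, pc):
        entry = self.entries[pc % self.n_entries]
        # Hit only if both the PC and the owning thread match; otherwise a
        # branch of another thread would be predicted (a phantom branch).
        if entry is not None and entry[0] == thread_id and entry[1] == pc:
            return entry[2]
        return None


def register_file_size(n_threads, logical_regs, rename_regs):
    """Physical registers: logical registers for all threads plus renaming."""
    return n_threads * logical_regs + rename_regs


btb = TaggedBTB()
btb.update(thread_id=0, pc=0x40, target=0x80)
hit = btb.predict(thread_id=0, pc=0x40)
phantom = btb.predict(thread_id=1, pc=0x40)
size = register_file_size(n_threads=8, logical_regs=32, rename_regs=100)
```

Without the thread id check, thread 1 would inherit thread 0's prediction at the same PC even though the two threads may be running unrelated code there.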

17
SMT Performance (Tullsen 1996)
18
Commercial Machines w/ MT Support
  • Intel Hyper-Threading (HT)
  • Dual threads
  • Pentium 4, XEON
  • Sun CoolThreads
  • UltraSPARC T1
  • 4 threads per core
  • IBM
  • POWER5

19
IBM POWER5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
20
IBM POWER5
http://www.research.ibm.com/journal/rd/494/mathis.pdf
21
SMT Summary
  • Pros
  • Increased throughput w/o adding much cost
  • Fast response for multitasking environment
  • Cons
  • Slower single-thread performance

22
Multicore
  • Multiple processor cores on a chip
  • Chip multiprocessor (CMP)
  • Sun's Chip Multithreading (CMT)
  • UltraSPARC T1 (Niagara)
  • Intel's Pentium D
  • AMD dual-core Opteron
  • Also a way to utilize TLP, but
  • 2 cores → 2X costs
  • No good for single thread performance
  • Can be used together with SMT

23
Chip Multithreading (CMT)
24
Sun UltraSPARC T1 Processor
http://www.sun.com/servers/wp.jsp?tab=3&group=CoolThreads%20servers
25
8 Cores vs 2 Cores
  • Are 8 cores too aggressive?
  • Good for server applications, given
  • Lots of threads
  • Scalable operating environment
  • Large memory space (64bit)
  • Good for power efficiency
  • Simple pipeline design for each core
  • Good for availability
  • Not intended for PCs, gaming, etc.

26
SPECWeb 2005
  • IBM x346, 3 GHz Xeon
  • T2000, 8-core 1.0 GHz T1 processor

27
Sun Fire T2000 Server
28
Server Pricing
  • UltraSPARC
  • Sun Fire T1000 Server: 6-core 1.0 GHz T1
    processor, 2 GB memory, 1 x 80 GB disk,
    list price $5,745
  • Sun Fire T2000 Server: 8-core 1.0 GHz T1
    processor, 8 GB DDR2 memory, 2 x 73 GB disk,
    list price $13,395
  • x86
  • Sun Fire X2100 Server: dual-core AMD Opteron
    175, 2 GB memory, 1 x 80 GB disk,
    list price $2,295
  • Sun Fire X4200 Server: 2 x dual-core AMD
    Opteron 275, 4 GB memory, 2 x 73 GB disk,
    list price $7,595
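
One rough way to compare the list prices above is cost per hardware thread. The sketch below uses the core counts and prices from this slide and the 4 threads per core stated earlier for the UltraSPARC T1; one thread per Opteron core is an assumption of the sketch, and the configurations differ in memory and disk, so the result is only indicative.

```python
# (name, cores, threads per core, list price) from the pricing slide.
# One thread per Opteron core is an assumption of this sketch.
servers = [
    ("Sun Fire T1000", 6, 4, 5745),
    ("Sun Fire T2000", 8, 4, 13395),
    ("Sun Fire X2100", 2, 1, 2295),
    ("Sun Fire X4200", 4, 1, 7595),
]

price_per_thread = {
    name: price / (cores * threads_per_core)
    for name, cores, threads_per_core, price in servers
}

for name, value in price_per_thread.items():
    print(f"{name}: {value:.2f} per hardware thread")
```

The metric says nothing about per-thread speed; it only shows how the list price is spread over the available hardware contexts.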