William Stallings Computer Organization and Architecture 8th Edition

Description:

William Stallings Computer Organization and Architecture 8th Edition Chapter 18 Multicore Computers * SMT IS SUPERSCALAR WITH PARALLEL THREADS IN THE ISSUE SLOTS ... – PowerPoint PPT presentation

Provided by: siteUott6

Transcript and Presenter's Notes

1
William Stallings Computer Organization and
Architecture, 8th Edition

Chapter 18
Multicore Computers

2
Hardware Performance Issues

Microprocessors have seen an exponential increase
in performance
Improved organization
Increased clock frequency
Increase in Parallelism
Pipelining
Superscalar (multi-issue)
Simultaneous multithreading (SMT)
Diminishing returns
More complexity requires more logic
Increasing chip area for coordinating and signal
transfer logic
Harder to design, make and debug

3
Alternative Chip Organizations
http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg
4
Intel Hardware Trends
Exponential speedup trend
ILP has come and gone
http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg
http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg
5
Increased Complexity

Power requirements grow exponentially with chip
density and clock frequency
Can use more chip area for cache
Memory transistors are smaller than logic transistors
Power density of memory is an order of magnitude lower than that of logic
Predicted by 2015
100 billion transistors on a 300 mm² die
Cache of 100 MB
1 billion transistors for logic

http://www.tomshardware.com/reviews/core-duo-notebooks-trade-battery-life-quicker-response,1206-4.html
http://techreport.com/r.x/core-i7/die-callout.jpg
6
Power and Memory Considerations
We passed 50%!!! Is this a RAM or a processor?
More action
Less action
7
Increased Complexity

Pollack's rule
Performance is roughly proportional to the square
root of the increase in complexity
Doubling complexity gives about 40% more performance
Multicore has the potential for near-linear
improvement (needs some programming effort and
won't work for all problems)
Unlikely that one core can use all of a huge
cache effectively, so add PEs to make an MPSoC
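
The arithmetic behind Pollack's rule can be sketched in a few lines. This is only an illustration: the function names are made up here, and the "ideal multicore" line assumes perfectly parallel work, which the slide itself warns is not always achievable.

```python
import math


def pollack_speedup(complexity_ratio: float) -> float:
    """Single-core speedup predicted by Pollack's rule (square root of complexity)."""
    if complexity_ratio <= 0:
        raise ValueError("complexity_ratio must be positive")
    return math.sqrt(complexity_ratio)


def ideal_multicore_speedup(core_count: int) -> float:
    """Upper bound for multicore: linear in the number of cores (assumes no serial code)."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return float(core_count)


if __name__ == "__main__":
    for ratio in (2, 4, 8):
        single = pollack_speedup(ratio)
        multi = ideal_multicore_speedup(ratio)
        print(f"{ratio}x transistors: one bigger core {single:.2f}x, "
              f"{ratio} simple cores up to {multi:.2f}x")
```

Doubling the complexity gives a factor of about 1.41, which is the "40% more performance" quoted on the slide.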

8
Chip Utilization of Transistors
Cache
CPU
9
Software Performance Issues

Performance benefits dependent on effective
exploitation of parallel resources (obviously)
Even small amounts of serial code impact
performance (not so obvious)
10% inherently serial code on an 8-processor system gives
only 4.7 times the performance
Many overheads of MPSoC
Communication
Distribution of work
Cache coherence
Some applications effectively exploit multicore
processors
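
The 4.7 figure follows from Amdahl's law. A minimal sketch, assuming the usual form speedup = 1 / (f + (1 - f) / N), where f is the serial fraction and N the number of processors (the function name is illustrative, and the overheads listed on the slide are ignored):

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup over one processor when a fixed fraction of the work is serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # The slide's case: 10% serial code on 8 processors.
    print(f"{amdahl_speedup(0.10, 8):.1f}")
    # Even with very many processors the speedup stays below 1 / 0.10 = 10.
    print(f"{amdahl_speedup(0.10, 1024):.2f}")
```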

10
Effective Applications for Multicore Processors

Database (e.g. SELECT *)
Servers handling independent transactions
Multi-threaded native applications
Lotus Domino, Siebel CRM
Multi-process applications
Oracle, SAP, PeopleSoft
Java applications
Java VM is multi-threaded with scheduling and
memory management (not so good at SSE?)
Sun's Java Application Server, BEA's WebLogic,
IBM WebSphere, Tomcat
Multi-instance applications
One application running multiple times

11
Multicore Organization

Main design variables
Number of core processors on chip (dual, quad, ...)
Number of levels of cache on chip (L1, L2, L3, ...)
Amount of shared cache vs. not shared (1MB, 4MB, ...)
The following slide has examples of each
organization
ARM11 MPCore
AMD Opteron
Intel Core Duo
Intel Core i7

12
Multicore Organization Alternatives
No shared
AMD Opteron
ARM11 MPCore
Shared
Intel Core i7
Intel Core Duo
13
Advantages of shared L2 Cache

Constructive interference reduces overall miss
rate (A wants X, then B wants X: good!)
Data shared by multiple cores is not replicated at
cache level (one copy of X for both A and B)
With proper frame replacement algorithms, the
amount of shared cache dedicated to each core is
dynamic
Threads with less locality can have more cache
Easy inter-process communication through shared
memory
Cache coherency confined to the small L1 caches
By contrast, a dedicated L2 cache gives each core
more rapid access
Good for threads with strong locality
Shared L3 cache may also improve performance

14
Core i7 and Duo

Let us review these two Intel architectures

15
Individual Core Architecture

Intel Core Duo uses superscalar cores
Intel Core i7 uses simultaneous multi-threading
(SMT)
Scales up number of threads supported
4 SMT cores, each supporting 4 threads, appear as
16 cores (my Core i7 has 2 threads per core)

Core i7
Core 2 duo
16
Intel x86 Multicore Organization - Core Duo (1)

2006
Two x86 superscalar, shared L2 cache
Dedicated L1 cache per core
32KB instruction and 32KB data
Thermal control unit per core
Manages chip heat dissipation with sensors, clock
speed is throttled
Maximize performance within thermal constraints
Improved ergonomics (quiet fan)
Advanced Programmable Interrupt Controller (APIC)
Inter-processor interrupts between cores
Routes interrupts to appropriate core
Includes timer so OS can self-interrupt a core

17
Intel x86 Multicore Organization - Core Duo (2)

Power Management Logic
Monitors thermal conditions and CPU activity
Adjusts voltage (and thus power consumption)
Can switch on/off individual logic subsystems to
save power
Split-bus transactions can sleep on one end
2MB shared L2 cache
Dynamic allocation
MESI support for L1 caches
Extended to support multiple Core Duo in SMP (not
SMT)
L2 data shared between local cores (fast) or
external
Bus interface is FSB

18
Intel Core Duo Block Diagram
19
Intel x86 Multicore Organization - Core i7

November 2008
Four x86 SMT processors
Dedicated L2, shared L3 cache
Speculative pre-fetch for caches
On chip DDR3 memory controller
Three 8 byte channels (192 bits) giving 32GB/s
No front side bus (just like labs 1 and 2 with the
SDRAM controller)
QuickPath Interconnect (QPI video if time allows)
Cache coherent point-to-point link
High speed communications between processor chips
6.4G transfers per second, 16 bits per transfer
Dedicated bi-directional pairs
Total bandwidth 25.6GB/s
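
The two bandwidth figures on this slide can be checked with simple arithmetic. The sketch below is illustrative only: the function names are invented, and the DDR3 transfer rate of 1333 MT/s is an assumption (the slide gives the channel width and the 32GB/s total, not the transfer rate).

```python
def memory_bandwidth_gb_s(channels: int, bytes_per_channel: int,
                          mega_transfers_per_s: float) -> float:
    """Aggregate memory bandwidth in GB/s for parallel channels."""
    return channels * bytes_per_channel * mega_transfers_per_s / 1000.0


def qpi_bandwidth_gb_s(giga_transfers_per_s: float, bits_per_transfer: int,
                       directions: int = 2) -> float:
    """Link bandwidth in GB/s, counting one dedicated lane group per direction."""
    bytes_per_transfer = bits_per_transfer / 8
    return giga_transfers_per_s * bytes_per_transfer * directions


if __name__ == "__main__":
    # Three 8-byte channels; 1333 MT/s is assumed, not taken from the slide.
    print(f"DDR3: {memory_bandwidth_gb_s(3, 8, 1333):.1f} GB/s")
    # 6.4 GT/s, 16 bits per transfer, both directions counted.
    print(f"QPI:  {qpi_bandwidth_gb_s(6.4, 16):.1f} GB/s")
```

The QPI total counts both directions: 6.4 GT/s times 2 bytes is 12.8 GB/s each way.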

20
Intel Core i7 Block Diagram
21
ARM11 MPCore

ARM vs. x86 and Microsoft
Intel started this
fight by challenging ARM with its Atom processor,
which is moving downmarket and towards
smartphones. Apparently, the major ARM vendors
are feeling the threat, are now moving upmarket
and are beginning to make their run at low-end
PCs and storage appliances to put the pressure
back on Intel.
http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

22
ARM11 MPCore

Up to 4 processors each with own L1 instruction
and data cache
Distributed Interrupt Controller (DIC)
Recall the APIC from Intel's Core architecture
Timer per CPU
Watchdog (feed or it barks!)
Warning alerts for software failures
Counts down from predetermined values
Issues warning at zero
CPU interface
Interrupt acknowledgement, masking and completion
acknowledgement
CPU
Single ARM11 called MP11
Vector floating-point unit (VFP)
FP co-processor
L1 cache
Snoop control unit
L1 cache coherency

http://barfblog.foodsafety.ksu.edu/DogObedienceTraining.jpg
23
ARM11 MPCore Block Diagram
24
ARM11 MPCore Interrupt Handling

Distributed Interrupt Controller (DIC) collates
from many sources (ironically it is a centralized
controller)
It provides
Masking (who can ignore an interrupt)
Prioritization (CPU A is more important than CPU
B)
Distribution to target MP11 CPUs
Status tracking (of interrupts)
Software interrupt generation
Number of interrupts independent of MP11 CPU
design
Memory mapped DIC control registers
Accessed by CPUs via private interface through
SCU
DIC can
Route interrupts to single or multiple CPUs
Provide inter-processor communication
Thread on one CPU can cause activity by thread on
another CPU

25
DIC Routing

Direct to specific CPU
To defined group of CPUs
To all CPUs
OS can generate interrupt to
All but self
Self
Other specific CPU
Typically combined with shared memory for
inter-processor communication
16 interrupt IDs available for inter-processor
communication (per CPU)
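
The three software-generated routing choices (other specific CPU, all but self, self) can be modelled as a small target-resolution function. This is a toy model with invented names; it does not reproduce the DIC register layout, and the default of 4 CPUs comes from the "up to 4 processors" on the earlier slide.

```python
from enum import Enum


class RoutingMode(Enum):
    TARGET_LIST = "specific CPUs"
    ALL_BUT_SELF = "all but self"
    SELF = "self"


def resolve_targets(mode, sender, cpu_count=4, target_list=()):
    """Return the set of CPU numbers that receive a software interrupt."""
    all_cpus = set(range(cpu_count))
    if sender not in all_cpus:
        raise ValueError(f"sender {sender} is not a valid CPU")
    if mode is RoutingMode.SELF:
        return {sender}
    if mode is RoutingMode.ALL_BUT_SELF:
        return all_cpus - {sender}
    requested = set(target_list)
    if not requested <= all_cpus:
        raise ValueError("target_list names a CPU that does not exist")
    return requested


if __name__ == "__main__":
    print(sorted(resolve_targets(RoutingMode.ALL_BUT_SELF, sender=1)))
    print(sorted(resolve_targets(RoutingMode.TARGET_LIST, sender=0, target_list=[2, 3])))
    print(sorted(resolve_targets(RoutingMode.SELF, sender=3)))
```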

26
Interrupt States

Inactive
Non-asserted
Completed by that CPU but pending or active in
others
E.g. allgather
Pending
Asserted
Processing not started on that CPU
Active
Started on that CPU but not complete
Can be pre-empted by higher priority interrupt
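
The three states are tracked per CPU, which is why an interrupt can be inactive on one CPU while still pending or active on another. A minimal sketch of that bookkeeping follows; the class and method names are hypothetical, and pre-emption by a higher-priority interrupt is not modelled.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    ACTIVE = "active"


class InterruptLine:
    """Tracks the state of one interrupt separately for each CPU."""

    def __init__(self, cpu_count):
        self.state = {cpu: State.INACTIVE for cpu in range(cpu_count)}

    def _move(self, cpu, expected, new):
        if self.state[cpu] is not expected:
            raise ValueError(f"CPU {cpu} is {self.state[cpu].value}, not {expected.value}")
        self.state[cpu] = new

    def assert_on(self, cpus):
        """Interrupt asserted: inactive -> pending on every target CPU."""
        for cpu in cpus:
            self._move(cpu, State.INACTIVE, State.PENDING)

    def start(self, cpu):
        """Processing started on this CPU: pending -> active."""
        self._move(cpu, State.PENDING, State.ACTIVE)

    def complete(self, cpu):
        """Processing finished on this CPU: active -> inactive."""
        self._move(cpu, State.ACTIVE, State.INACTIVE)


if __name__ == "__main__":
    line = InterruptLine(cpu_count=2)
    line.assert_on([0, 1])
    line.start(0)
    line.complete(0)
    # CPU 0 is done, CPU 1 has not started yet.
    print({cpu: s.value for cpu, s in line.state.items()})
```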

27
Interrupt Sources

Inter-processor interrupts (IPI)
Private to CPU
ID0-ID15 (16 IPIs per CPU as mentioned earlier)
Software triggered
Priority depends on receiving CPU not source
Private timer and/or watchdog interrupt
ID29 and ID30
Legacy FIQ line
Legacy FIQ pin, per CPU, bypasses interrupt
distributor
Directly drives interrupts to CPU
Hardware
Triggered by programmable events on associated
interrupt lines
Up to 224 lines
Start at ID32
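
The ID ranges above can be summarised as a lookup. This sketch uses only the numbers given on the slide: 224 hardware lines starting at ID32 run up to ID255. IDs the slide does not mention are reported as unlisted rather than guessed, and the function name is illustrative.

```python
def classify_interrupt_id(interrupt_id: int) -> str:
    """Map an interrupt ID to the source category listed on the slide."""
    if 0 <= interrupt_id <= 15:
        return "inter-processor interrupt"
    if interrupt_id in (29, 30):
        return "private timer or watchdog"
    if 32 <= interrupt_id <= 32 + 224 - 1:
        return "hardware interrupt line"
    return "not listed on the slide"


if __name__ == "__main__":
    for sample in (0, 15, 29, 31, 32, 255, 256):
        print(sample, classify_interrupt_id(sample))
```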

28
ARM11 MPCore Interrupt Distributor
29
Cache Coherency

Snoop Control Unit (SCU) resolves most shared
data bottleneck issues
Note: L1 cache coherency is based on MESI, similar to
Intel's Core architecture
3 types of SCU shared data resolution
Direct data intervention
Copying clean entries between L1 caches without
accessing external memory or L2
Can resolve local L1 miss from remote L1 rather
than L2
Reduces read after write from L1 to L2
Duplicated tag RAMs
Cache tags implemented as separate block of RAM,
a copy is held in the SCU. So the SCU knows when
2 CPUs have the same cache lines.
Tag RAM has same length as number of lines in
cache
TAG duplicates used by SCU to check data
availability before sending coherency commands
Only send to CPUs that must update coherent data
cache
Less bus locking due to less communication during
coherency step
Migratory lines
Allows moving dirty data between CPUs without
writing to L2 and reading back from external
memory (see Stallings, Section 18.5, p. 703)
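
The duplicated tag RAM idea can be illustrated with a toy model: the SCU keeps its own copy of which lines each L1 holds, so a coherency command goes only to the CPUs that actually have the line. This is a simplified sketch with invented names, using Python sets in place of tag RAM; it is not ARM's implementation.

```python
class DuplicatedTags:
    """Toy SCU-side copy of the L1 tags, one set of line addresses per CPU."""

    def __init__(self, cpu_count):
        self.tags = [set() for _ in range(cpu_count)]

    def record_fill(self, cpu, line_address):
        """An L1 loaded this line, so mirror the tag in the SCU copy."""
        self.tags[cpu].add(line_address)

    def record_evict(self, cpu, line_address):
        """An L1 dropped this line, so remove the mirrored tag."""
        self.tags[cpu].discard(line_address)

    def coherency_targets(self, writer, line_address):
        """CPUs other than the writer that hold the line and must be notified."""
        return [cpu for cpu, held in enumerate(self.tags)
                if cpu != writer and line_address in held]


if __name__ == "__main__":
    scu = DuplicatedTags(cpu_count=4)
    scu.record_fill(0, 0x1000)
    scu.record_fill(2, 0x1000)
    scu.record_fill(3, 0x2000)
    # CPU 0 writes line 0x1000: only CPU 2 needs a coherency command.
    print(scu.coherency_targets(writer=0, line_address=0x1000))
```

Without the duplicated tags, every write to shared data would have to be broadcast to all CPUs, which is the extra bus traffic the slide says is avoided.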

30
Performance Effect of Multiple Cores
31
Recommended Reading

Multicore Association web site
Stallings chapter 18
ARM web site
(if we have time) http://www.intel.com/technology/quickpath/index.htm
http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html
http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143
