William Stallings Computer Organization and Architecture 8th Edition - PowerPoint PPT Presentation

Description:

William Stallings Computer Organization and Architecture 8th Edition, Chapter 18: Multicore Computers. SMT is superscalar with parallel threads in the issue slots ...

Slides: 32
Provided by: siteUott6
Transcript and Presenter's Notes

1
William Stallings Computer Organization and
Architecture, 8th Edition
  • Chapter 18
  • Multicore Computers

2
Hardware Performance Issues
  • Microprocessors have seen an exponential increase
    in performance
  • Improved organization
  • Increased clock frequency
  • Increase in Parallelism
  • Pipelining
  • Superscalar (multi-issue)
  • Simultaneous multithreading (SMT)
  • Diminishing returns
  • More complexity requires more logic
  • Increasing chip area for coordinating and signal
    transfer logic
  • Harder to design, make and debug

3
Alternative Chip Organizations
http://www.cadalyst.com/files/cadalyst/nodes/2008/6351/i4.jpg
4
Intel Hardware Trends
Exponential speedup trend
ILP has come and gone
http://smoothspan.files.wordpress.com/2007/09/clockspeeds.jpg
http://www.ixbt.com/cpu/semiconductor/intel-65nm/power_density.jpg
5
Increased Complexity
  • Power requirements grow exponentially with chip
    density and clock frequency
  • Can use more chip area for cache
  • Smaller
  • Order of magnitude lower power requirements
  • By 2015
  • 100 billion transistors on 300mm² die
  • Cache of 100MB
  • 1 billion transistors for logic

http://www.tomshardware.com/reviews/core-duo-notebooks-trade-battery-life-quicker-response,1206-4.html
http://techreport.com/r.x/core-i7/die-callout.jpg
6
Power and Memory Considerations
We passed 50!!! Is this a RAM or a processor?
More action
Less action
7
Increased Complexity
  • Pollack's rule
  • Performance is roughly proportional to square
    root of increase in complexity
  • Double complexity gives 40% more performance
  • Multicore has the potential for near-linear
    improvement (needs some programming effort and
    won't work for all problems)
  • Unlikely that one core can use all of a huge
    cache effectively, so add PEs to make an MPSoC
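
The arithmetic behind Pollack's rule can be sketched as follows. This is a minimal illustration, assuming performance scales exactly with the square root of complexity and that the multicore case scales perfectly linearly (the slide only claims "near-linear"). The function names are hypothetical.

```python
import math


def pollack_performance(complexity_ratio: float) -> float:
    """Relative single-core performance under Pollack's rule (square-root scaling)."""
    return math.sqrt(complexity_ratio)


def ideal_multicore_performance(core_count: int) -> float:
    """Upper bound for multicore: assumes fully parallel work and no overheads."""
    return float(core_count)


if __name__ == "__main__":
    single = pollack_performance(2.0)       # one core, twice the complexity
    dual = ideal_multicore_performance(2)   # two cores, same total complexity
    print(f"One core, double complexity: {single:.2f}x")
    print(f"Two cores (ideal case):      {dual:.2f}x")
```

Doubling complexity gives a factor of about 1.41, which is the "40% more performance" quoted on the slide.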

8
Chip Utilization of Transistors
Cache
CPU
9
Software Performance Issues
  • Performance benefits dependent on effective
    exploitation of parallel resources (obviously)
  • Even small amounts of serial code impact
    performance (not so obvious)
  • 10% inherently serial code on an 8-processor
    system gives only 4.7 times performance
  • Many overheads of MPSoC
  • Communication
  • Distribution of work
  • Cache coherence
  • Some applications effectively exploit multicore
    processors
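
The 4.7 figure above follows from Amdahl's law. The sketch below reproduces it. It ignores the communication, work-distribution and cache-coherence overheads listed on the slide, so real speedups would be lower. The function name is hypothetical.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Speedup when only the parallel part of the work is spread over the processors."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # 10% serial code on an 8-processor system, as on the slide.
    print(round(amdahl_speedup(0.10, 8), 1))  # 4.7
    # Even with a very large processor count the speedup stays below 1 / 0.10 = 10.
    print(round(amdahl_speedup(0.10, 1024), 1))
```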

10
Effective Applications for Multicore Processors
  • Database (e.g. SELECT)
  • Servers handling independent transactions
  • Multi-threaded native applications
  • Lotus Domino, Siebel CRM
  • Multi-process applications
  • Oracle, SAP, PeopleSoft
  • Java applications
  • Java VM is multi-threaded with scheduling and
    memory management (not so good at SSE ?)
  • Sun's Java Application Server, BEA's WebLogic,
    IBM WebSphere, Tomcat
  • Multi-instance applications
  • One application running multiple times

11
Multicore Organization
  • Main design variables
  • Number of core processors on chip (dual, quad,
    ...)
  • Number of levels of cache on chip (L1, L2, L3,
    ...)
  • Amount of shared cache vs. not shared (1MB, 4MB,
    ...)
  • The following slide has examples of each
    organization
  • ARM11 MPCore
  • AMD Opteron
  • Intel Core Duo
  • Intel Core i7

12
Multicore Organization Alternatives
No shared
AMD Opteron
ARM11 MPCore
Shared
Intel Core i7
Intel Core Duo
13
Advantages of shared L2 Cache
  • Constructive interference reduces overall miss
    rate (A wants X, then B wants X: good!)
  • Data shared by multiple cores not replicated at
    cache level (one copy of X for both A and B)
  • With proper frame replacement algorithms, the
    amount of shared cache dedicated to each core is
    dynamic
  • Threads with less locality can have more cache
  • Easy inter-process communication through shared
    memory
  • Cache coherency confined to small L1
  • By contrast, a dedicated L2 cache gives each core
    more rapid access
  • Good for threads with strong locality
  • Shared L3 cache may also improve performance

14
Core i7 and Duo
  • Let us review these two Intel architectures

15
Individual Core Architecture
  • Intel Core Duo uses superscalar cores
  • Intel Core i7 uses simultaneous multi-threading
    (SMT)
  • Scales up number of threads supported
  • 4 SMT cores, each supporting 4 threads, appear as
    16 cores (my Core i7 has 2 threads per core)

Core i7
Core 2 duo
16
Intel x86 Multicore Organization - Core Duo (1)
  • 2006
  • Two x86 superscalar, shared L2 cache
  • Dedicated L1 cache per core
  • 32KB instruction and 32KB data
  • Thermal control unit per core
  • Manages chip heat dissipation with sensors, clock
    speed is throttled
  • Maximize performance within thermal constraints
  • Improved ergonomics (quiet fan)
  • Advanced Programmable Interrupt Controller (APIC)
  • Inter-processor interrupts between cores
  • Routes interrupts to appropriate core
  • Includes timer so OS can self-interrupt a core

17
Intel x86 Multicore Organization - Core Duo (2)
  • Power Management Logic
  • Monitors thermal conditions and CPU activity
  • Adjusts voltage (and thus power consumption)
  • Can switch on/off individual logic subsystems to
    save power
  • Split-bus transactions can sleep on one end
  • 2MB shared L2 cache
  • Dynamic allocation
  • MESI support for L1 caches
  • Extended to support multiple Core Duo in SMP (not
    SMT)
  • L2 data shared between local cores (fast) or
    external
  • Bus interface is FSB

18
Intel Core Duo Block Diagram
19
Intel x86 Multicore Organization - Core i7
  • November 2008
  • Four x86 SMT processors
  • Dedicated L2, shared L3 cache
  • Speculative pre-fetch for caches
  • On chip DDR3 memory controller
  • Three 8-byte channels (192 bits) giving 32GB/s
  • No front side bus (just like labs 1 & 2 with the
    SDRAM controller)
  • QuickPath Interconnect (QPI video if time allows)
  • Cache coherent point-to-point link
  • High speed communications between processor chips
  • 6.4G transfers per second, 16 bits per transfer
  • Dedicated bi-directional pairs
  • Total bandwidth 25.6GB/s
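
The bandwidth figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers given above. The memory transfer rate is not stated on the slide, so the value computed here is merely what the quoted 32GB/s would imply. All names are hypothetical.

```python
def link_bandwidth_gb_s(transfers_per_s: float, bits_per_transfer: int,
                        directions: int = 1) -> float:
    """Bandwidth in GB/s = transfers/s * bytes per transfer * number of directions."""
    bytes_per_transfer = bits_per_transfer / 8
    return transfers_per_s * bytes_per_transfer * directions / 1e9


# QPI: 6.4G transfers per second, 16 bits per transfer, one link in each direction.
QPI_TOTAL_GB_S = link_bandwidth_gb_s(6.4e9, 16, directions=2)

# Memory controller: three 8-byte channels.
MEMORY_BUS_BITS = 3 * 8 * 8

# Transfer rate implied by 32GB/s over 24 bytes per transfer (derived, not from the slide).
IMPLIED_MEMORY_TRANSFERS_PER_S = 32e9 / (3 * 8)

if __name__ == "__main__":
    print(f"QPI total bandwidth: {QPI_TOTAL_GB_S:.1f} GB/s")
    print(f"Memory bus width:    {MEMORY_BUS_BITS} bits")
    print(f"Implied memory rate: {IMPLIED_MEMORY_TRANSFERS_PER_S / 1e9:.2f} G transfers/s")
```

Each QPI direction carries 12.8GB/s, and the two directions together give the 25.6GB/s total quoted above.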

20
Intel Core i7 Block Diagram
21
ARM11 MPCore
  • ARM vs. x86 and Microsoft: Intel started this
    fight by challenging ARM with its Atom processor,
    which is moving downmarket and towards
    smartphones. Apparently, the major ARM vendors
    are feeling the threat, are now moving upmarket
    and are beginning to make their run at low-end
    PCs and storage appliances to put the pressure
    back on Intel.
  • http://www.tgdaily.com/trendwatch-features/41561-the-coming-arm-vs-intel-pc-battle

22
ARM11 MPCore
  • Up to 4 processors each with own L1 instruction
    and data cache
  • Distributed Interrupt Controller (DIC)
  • Recall the APIC from Intel's Core architecture
  • Timer per CPU
  • Watchdog (feed or it barks!)
  • Warning alerts for software failures
  • Counts down from predetermined values
  • Issues warning at zero
  • CPU interface
  • Interrupt acknowledgement, masking and completion
    acknowledgement
  • CPU
  • Single ARM11 called MP11
  • Vector floating-point unit (VFP)
  • FP co-processor
  • L1 cache
  • Snoop control unit
  • L1 cache coherency

http://barfblog.foodsafety.ksu.edu/DogObedienceTraining.jpg
23
ARM11 MPCore Block Diagram
24
ARM11 MPCore Interrupt Handling
  • Distributed Interrupt Controller (DIC) collates
    interrupts from many sources (ironically it is a
    centralized controller)
  • It provides
  • Masking (who can ignore an interrupt)
  • Prioritization (CPU A is more important than CPU
    B)
  • Distribution to target MP11 CPUs
  • Status tracking (of interrupts)
  • Software interrupt generation
  • Number of interrupts independent of MP11 CPU
    design
  • Memory mapped DIC control registers
  • Accessed by CPUs via private interface through
    SCU
  • DIC can
  • Route interrupts to single or multiple CPUs
  • Provide inter-process communication
  • Thread on one CPU can cause activity by thread on
    another CPU

25
DIC Routing
  • Direct to specific CPU
  • To defined group of CPUs
  • To all CPUs
  • OS can generate interrupt to
  • All but self
  • Self
  • Other specific CPU
  • Typically combined with shared memory for
    inter-process communication
  • 16 interrupt IDs available for inter-process
    communication (per CPU)
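
The three software-generated routing options can be sketched as a target-selection function. This is an illustrative model only, not the DIC register interface: the mode strings, the function name and the 4-CPU default are assumptions.

```python
def ipi_targets(sender: int, mode: str, targets=(), cpu_count: int = 4) -> set:
    """Return the set of CPUs that receive a software-generated interrupt.

    mode is one of the hypothetical labels:
      "all_but_self" - every CPU except the sender
      "self"         - only the sender
      "list"         - the specific CPUs named in targets
    """
    all_cpus = set(range(cpu_count))
    if sender not in all_cpus:
        raise ValueError(f"unknown sender CPU {sender}")
    if mode == "all_but_self":
        return all_cpus - {sender}
    if mode == "self":
        return {sender}
    if mode == "list":
        chosen = set(targets)
        if not chosen <= all_cpus:
            raise ValueError(f"unknown target CPU(s) {sorted(chosen - all_cpus)}")
        return chosen
    raise ValueError(f"unknown routing mode {mode!r}")


if __name__ == "__main__":
    print(ipi_targets(0, "all_but_self"))   # CPUs 1, 2 and 3
    print(ipi_targets(2, "self"))           # CPU 2 only
    print(ipi_targets(1, "list", [3]))      # CPU 3 only
```

In practice the sender would first place a message in shared memory and then raise the interrupt, which is the combination the slide describes.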

26
Interrupt States
  • Inactive
  • Non-asserted
  • Completed by that CPU but pending or active in
    others
  • E.g. allgather
  • Pending
  • Asserted
  • Processing not started on that CPU
  • Active
  • Started on that CPU but not complete
  • Can be pre-empted by higher priority interrupt
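
The per-CPU interrupt states can be sketched as a small state machine. This is a simplified model that assumes the transitions inactive to pending to active to inactive and leaves out pre-emption by higher-priority interrupts. The class and method names are hypothetical.

```python
from enum import Enum


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    ACTIVE = "active"


class InterruptLine:
    """Tracks the state of one interrupt separately for each CPU."""

    def __init__(self, cpu_count: int):
        self.state = [State.INACTIVE] * cpu_count

    def assert_for(self, cpus) -> None:
        """Interrupt asserted: pending on each target CPU that is not already handling it."""
        for cpu in cpus:
            if self.state[cpu] is State.INACTIVE:
                self.state[cpu] = State.PENDING

    def acknowledge(self, cpu: int) -> None:
        """CPU starts processing: pending -> active."""
        if self.state[cpu] is not State.PENDING:
            raise ValueError(f"CPU {cpu} has no pending interrupt")
        self.state[cpu] = State.ACTIVE

    def complete(self, cpu: int) -> None:
        """CPU finishes processing: active -> inactive."""
        if self.state[cpu] is not State.ACTIVE:
            raise ValueError(f"CPU {cpu} is not processing this interrupt")
        self.state[cpu] = State.INACTIVE


if __name__ == "__main__":
    line = InterruptLine(cpu_count=2)
    line.assert_for([0, 1])
    line.acknowledge(0)
    line.complete(0)
    # CPU 0 is inactive again while the interrupt is still pending on CPU 1.
    print([s.value for s in line.state])
```

The final state shows the second meaning of "inactive" from the slide: completed by that CPU but still pending on another.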

27
Interrupt Sources
  • Inter-processor Interrupts (IPI)
  • Private to CPU
  • ID0-ID15 (16 IPIs per CPU as mentioned earlier)
  • Software triggered
  • Priority depends on receiving CPU not source
  • Private timer and/or watchdog interrupt
  • ID29 and ID30
  • Legacy FIQ line
  • Legacy FIQ pin, per CPU, bypasses interrupt
    distributor
  • Directly drives interrupts to CPU
  • Hardware
  • Triggered by programmable events on associated
    interrupt lines
  • Up to 224 lines
  • Start at ID32
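
The ID ranges listed above can be summarized in a lookup function. This is a sketch based only on the ranges given on this slide. IDs the slide does not mention are reported as "not listed" rather than guessed, and the function name and labels are hypothetical.

```python
IPI_IDS = range(0, 16)                    # ID0-ID15, software triggered
PRIVATE_TIMER_WATCHDOG_IDS = (29, 30)     # private timer and/or watchdog
HARDWARE_FIRST_ID = 32
HARDWARE_LINE_COUNT = 224                 # "up to 224 lines"
HARDWARE_IDS = range(HARDWARE_FIRST_ID, HARDWARE_FIRST_ID + HARDWARE_LINE_COUNT)


def classify_interrupt(irq_id: int) -> str:
    """Map an interrupt ID to the source category named on the slide."""
    if irq_id in IPI_IDS:
        return "IPI"
    if irq_id in PRIVATE_TIMER_WATCHDOG_IDS:
        return "private timer/watchdog"
    if irq_id in HARDWARE_IDS:
        return "hardware"
    return "not listed"


if __name__ == "__main__":
    for irq in (0, 15, 29, 32, 255, 256):
        print(irq, classify_interrupt(irq))
```

With 224 lines starting at ID32, the highest hardware interrupt ID is 255.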

28
ARM11 MPCore Interrupt Distributor
29
Cache Coherency
  • Snoop Control Unit (SCU) resolves most shared
    data bottleneck issues
  • Note: L1 cache coherency is based on MESI,
    similar to Intel's Core architecture
  • 3 types of SCU shared data resolution
  • Direct data intervention
  • Copying clean entries between L1 caches without
    accessing external memory or L2
  • Can resolve local L1 miss from remote L1 rather
    than L2
  • Reduces read after write from L1 to L2
  • Duplicated tag RAMs
  • Cache tags implemented as separate block of RAM,
    a copy is held in the SCU. So the SCU knows when
    2 CPUs have the same cache lines.
  • Tag RAM has same length as number of lines in
    cache
  • TAG duplicates used by SCU to check data
    availability before sending coherency commands
  • Only send to CPUs that must update coherent data
    cache
  • Less bus locking due to less communication during
    coherency step
  • Migratory lines
  • Allows moving dirty data between CPUs without
    writing to L2 and reading back from external
    memory (see Stallings Ch. 18.5, p. 703)
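
The duplicated tag RAM idea can be sketched as follows. This is a conceptual model only, assuming the SCU keeps one set of tags per CPU and consults it before sending coherency commands. It does not model MESI states, direct data intervention or migratory lines, and all names are hypothetical.

```python
class SnoopControlUnit:
    """Keeps a copy of each CPU's L1 tags so coherency traffic can be filtered."""

    def __init__(self, cpu_count: int):
        self.dup_tags = [set() for _ in range(cpu_count)]

    def record_fill(self, cpu: int, tag: int) -> None:
        """A line with this tag was loaded into the CPU's L1 cache."""
        self.dup_tags[cpu].add(tag)

    def record_evict(self, cpu: int, tag: int) -> None:
        """The line was evicted from the CPU's L1 cache."""
        self.dup_tags[cpu].discard(tag)

    def coherency_targets(self, writer: int, tag: int) -> list:
        """CPUs other than the writer that hold the line and must be notified."""
        return [cpu for cpu, tags in enumerate(self.dup_tags)
                if cpu != writer and tag in tags]


if __name__ == "__main__":
    scu = SnoopControlUnit(cpu_count=4)
    scu.record_fill(0, 0x1A)
    scu.record_fill(2, 0x1A)
    scu.record_fill(3, 0x2B)
    # CPU 0 writes line 0x1A: only CPU 2 needs a coherency command.
    print(scu.coherency_targets(writer=0, tag=0x1A))
```

CPUs 1 and 3 are never disturbed in this example, which is the reduction in coherency traffic the slide attributes to the duplicated tags.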

30
Performance Effect of Multiple Cores
31
Recommended Reading
  • Multicore Association web site
  • Stallings chapter 18
  • ARM web site
  • (if we have time) http://www.intel.com/technology/quickpath/index.htm
  • http://www.arm.com/products/CPUs/ARM11MPCoreMultiprocessor.html
  • http://www.eetimes.com/news/design/features/showArticle.jhtml?articleID=23901143