1
Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Dimitris Kaseridis, Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu
  • Unique Chips and Systems (UCAS-4)

2
Outline
  • Brief Description of UltraSPARC T1 Architecture
  • Analysis Objectives / Methodology
  • Analysis of Results
  • Interference on Shared Resources
  • Scaling of Multiprogrammed Workloads
  • Scaling of Multithreaded Workloads

3
UltraSPARC T1 (Niagara)
  • A multithreaded processor that combines CMP and SMT into a CMT design
  • 8 cores, each handling 4 hardware context threads → 32 active hardware context threads
  • Simple in-order pipeline per core, with no branch predictor unit
  • Optimized for multithreaded performance → throughput
  • High throughput → hides memory and pipeline stalls/latencies by scheduling other available threads, with a zero-cycle thread-switch penalty (see the sketch below)
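A minimal, illustrative sketch (in Python, not the actual T1 thread-select logic) of the switch-on-stall idea behind this throughput orientation: each cycle the core issues from the next ready hardware thread, so a thread stalled on a long-latency miss costs nothing while another thread is ready.

# Toy model of switch-on-stall thread selection (illustrative only,
# not the real T1 thread-select hardware). Each cycle the core issues
# from the next ready thread in round-robin order; stalled threads are
# skipped, so misses are hidden as long as some thread is ready.

def select_thread(threads, last_issued):
    """threads: list of dicts with a 'stalled' flag; round-robin over ready ones."""
    n = len(threads)
    for offset in range(1, n + 1):
        candidate = (last_issued + offset) % n
        if not threads[candidate]["stalled"]:
            return candidate          # issue from this thread next cycle
    return None                       # all 4 threads stalled -> pipeline idles

# Example: thread 1 waits on a cache miss, so issue rotates over 0, 2, 3.
threads = [{"stalled": False}, {"stalled": True},
           {"stalled": False}, {"stalled": False}]
print(select_thread(threads, last_issued=0))   # -> 2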

4
UltraSPARC T1 Core Pipeline
  • The Thread Group (the four threads on a core) shares the L1 cache, TLBs, execution units, pipeline registers, and data path
  • In the pipeline diagram, the blue areas are replicated per hardware context thread

5
Objectives
  • Purpose
  • Analysis of the interference of multiple executing threads on the shared resources of Niagara
  • Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads
  • Methodology
  • Interference on Shared Resources (SPEC CPU2000)
  • Scaling of a Multiprogrammed Workload (SPEC CPU2000)
  • Scaling of a Multithreaded Workload (SPECjbb2005)

6
Analysis Objectives / Methodology
7
Methodology (1/2)
  • On-chip performance counters for real/accurate results
  • Niagara
  • Solaris 10 tools: cpustat, cputrack, and psrset to bind processes to H/W threads (see the sketch below)
  • 2 counters per hardware thread, with one dedicated to the instruction count
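A hedged sketch of how this measurement flow could be scripted. psrset and cputrack are the Solaris tools named on the slide, but the processor-set id, the event names (DC_miss, Instr_cnt), and the benchmark binary are assumptions used only for illustration; cpustat -h on the actual machine should list the valid events.

# Run a benchmark inside a processor set (so it stays on one hardware
# thread) and sample two counters with cputrack. Event names and set id
# are assumptions, not taken from the study.
import subprocess

def run_counted(cmd, pset_id="1", events="pic0=DC_miss,pic1=Instr_cnt"):
    # Assumes a processor set holding a single virtual CPU was created
    # beforehand, e.g. `psrset -c 4` (which prints the new set id).
    # psrset -e <setid> <cmd> executes cmd bound to that set; cputrack
    # then reports the per-LWP counter values for the traced command.
    argv = ["psrset", "-e", pset_id, "cputrack", "-c", events] + cmd
    return subprocess.run(argv, capture_output=True, text=True).stdout

# Hypothetical usage:
# print(run_counted(["./crafty", "crafty.in"]))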

8
Methodology (2/2)
  • Niagara has only one FP unit → only integer benchmarks were considered
  • The Performance Counter Unit works at the granularity of a single H/W context thread
  • No way to break down the effects of the co-running threads within a single H/W thread's counters
  • Software profiling tools are too invasive
  • Only pairs of benchmarks were considered, to allow correlation of benchmarks with events
  • Many iterations, using the average behavior (see the sketch below)
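A small sketch of the "many iterations, average behavior" step: per-iteration counter readings are reduced to a mean and a spread so that pair effects can be separated from run-to-run noise. The field names and values are placeholders for whatever the counters reported, not data from the study.

# Reduce repeated counter readings to mean and standard deviation.
from statistics import mean, stdev

def summarize(runs):
    """runs: list of dicts like {"Instr_cnt": ..., "DC_miss": ...}, one per iteration."""
    keys = runs[0].keys()
    return {k: (mean(r[k] for r in runs), stdev(r[k] for r in runs)) for k in keys}

runs = [{"Instr_cnt": 1.02e9, "DC_miss": 3.1e6},
        {"Instr_cnt": 0.99e9, "DC_miss": 3.4e6},
        {"Instr_cnt": 1.01e9, "DC_miss": 3.2e6}]
print(summarize(runs))   # mean and std-dev per event over the iterations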

9
Analysis of Results
  • Interference on shared resources
  • Scaling of a multiprogrammed workload
  • Scaling of a multithreaded workload

10
Interference on Shared Resources
  • Two modes considered
  • Same-core mode executes both benchmarks of a pair on the same core
  • Sharing of the pipeline, TLBs, and L1 bandwidth
  • More like an SMT
  • Two-cores mode executes each member of the pair on a different core
  • Sharing of L2 capacity/bandwidth and main memory
  • More like a CMP (see the placement sketch below)
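A sketch of the two placement modes, assuming (purely for illustration) that Solaris enumerates the 32 hardware strands so that virtual CPUs 4*c through 4*c+3 belong to core c; the benchmark names are placeholders.

# Choose virtual-CPU ids for the two members of a benchmark pair.
def placement(mode):
    if mode == "same_core":     # both benchmarks on core 0 -> SMT-like sharing
        return {"benchA": 0, "benchB": 1}    # two strands of core 0
    elif mode == "two_cores":   # one benchmark per core -> CMP-like sharing (L2 only)
        return {"benchA": 0, "benchB": 4}    # first strand of core 0 and of core 1
    raise ValueError(mode)

print(placement("same_core"))   # -> {'benchA': 0, 'benchB': 1}
print(placement("two_cores"))   # -> {'benchA': 0, 'benchB': 4}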

11
Interference same core (1/2)
  • On average, a 12% drop in IPC when running in a pair (see the calculation below)
  • Crafty followed by twolf showed the worst performance
  • Eon showed the best behavior, keeping IPC close to the single-thread case
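For reference, how such a drop can be derived from the two counters: IPC is the instruction count over cycles, and the drop compares the paired run against the benchmark running alone. The numbers below are placeholders, not measurements from the study.

def ipc(instr_count, cycles):
    # IPC from the per-thread instruction counter and elapsed cycles.
    return instr_count / cycles

def ipc_drop_pct(ipc_alone, ipc_paired):
    # Percentage drop of the paired run relative to the solo run.
    return 100.0 * (ipc_alone - ipc_paired) / ipc_alone

# Placeholder counter readings for one benchmark, alone vs. paired:
alone = ipc(1.00e9, 1.43e9)      # ~0.70 IPC
paired = ipc(1.00e9, 1.61e9)     # ~0.62 IPC
print(round(ipc_drop_pct(alone, paired), 1))   # -> 11.2 (% drop)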

12
Interference same core (2/2)
  • DC misses increased 20% on average (15% excluding crafty)
  • The worst DC misses were for vortex and perlbmk
  • The highest L2 miss ratios were not seen in the pairs with the largest IPC decrease → mcf and eon pairs with more than 70% L2 misses
  • Overall, a small performance penalty even when sharing the pipeline and L1/L2 bandwidth → the latency-hiding technique is promising

13
Interference two cores
  • Only stresses the L2 and the shared communication buses
  • On average, the L2 misses are almost the same as in the same-core case
  • The available resources are underutilized
  • Multiprogrammed workload with no data sharing

14
Scaling of Multiprogrammed Workload
  • Reduced benchmark pair set
  • Scaling 4 → 8 → 16 threads with the configurations shown on the slide

15
Scaling of Multiprogrammed Workload
  • Same core
  • Mixed mode

16
Scaling of Multiprogrammed same core
(Charts: IPC ratio and DC misses ratio)
  • 4 → 8 case
  • IPC / data cache misses not affected
  • L2 data misses increased, but IPC is not
  • Enough resources; running fully occupied
  • memory latency hiding
  • 8 → 16 case
  • More cores running the same benchmark
  • More footprint / requests to L2 / main memory
  • Increased L2 requirements / shared interconnect traffic → decreased performance

(Chart: L2 misses ratio; the ratio metric is sketched below)
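A sketch of the ratio metric behind these charts: each configuration's aggregate value normalized to the 4-thread baseline. The values are placeholders, not the measured data.

# Normalize aggregate IPC (or miss counts) to the 4-thread baseline.
def ratios(values_by_threads, baseline=4):
    base = values_by_threads[baseline]
    return {n: v / base for n, v in values_by_threads.items()}

aggregate_ipc = {4: 2.4, 8: 4.6, 16: 7.1}     # placeholder aggregate IPC values
print(ratios(aggregate_ipc))                  # -> {4: 1.0, 8: ~1.92, 16: ~2.96}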
17
Scaling of Multiprogrammed mixed mode
  • Mixed mode case
  • Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
  • Same behavior as the same-core case for DC and L2 misses, with an average difference of 1-2%
  • Overall, for both modes
  • Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop
  • Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput

(Chart: IPC ratio)
18
Scaling of Multithreaded Workload
  • Scaled from 1 up to 64 threads
  • 1 → 8 threads: 1 thread mapped per core
  • 8 → 16 threads: at most 2 threads mapped per core
  • 16 → 32 threads: up to 4 threads per core
  • 32 → 64 threads: more threads per core, so swapping is necessary

Configuration used for SPECjbb2005
19
Scaling of Multithreaded Workload
(Chart: SPECjbb2005 score per warehouse, with the GC effect marked)
20
Scaling of Multithreaded Workload
  • Ratio over the 8-thread case with 1 thread per core
  • Instruction fetch and DTLB are stressed the most
  • L1 data and L2 caches managed to scale even for more than 32 threads

(Chart annotation: GC effect)
21
Scaling of Multithreaded Workload
  • Scaling of Performance
  • Linear scaling of almost 0.66x per thread up to 32 threads
  • 20x speedup at 32 threads
  • SMT (2 threads/core) gives on average a 1.8x speedup over the CMP configuration (region 1)
  • SMT (up to 4 threads/core) gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively (see the arithmetic check below)
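A short arithmetic check using only the numbers stated on the slide: a slope of roughly 0.66x of ideal per added thread is consistent with the roughly 20x speedup at 32 threads, and the 1.8x and 1.3x gains compound to about the 2.3x quoted over the single-threaded CMP.

per_thread_slope = 0.66
print(round(per_thread_slope * 32, 1))   # -> 21.1, consistent with the ~20x reported

# 2 threads/core gives ~1.8x over 1 thread/core; 4 threads/core adds ~1.3x more
print(round(1.8 * 1.3, 2))               # -> 2.34, the ~2.3x over the single-threaded CMP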

22
Conclusions
  • Demonstration of interference on a real CMT
    system
  • The long-latency hiding technique is effective for L1 and L2 misses and therefore could be a promising alternative to aggressive speculation
  • Promising scaling of up to 20x for multithreaded workloads, with an average of 0.66x per thread
  • The instruction fetch subsystem and DTLBs were the most contended resources, followed by L2 cache misses

23
Q/A
  • Thank you
  • Questions?
  • The Laboratory for Computer Architecture
  • Web site: http://lca.ece.utexas.edu
  • Email: kaseridi@ece.utexas.edu