Title: Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
1. Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Dimitris Kaseridis, Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu
- Unique Chips and Systems (UCAS-4)
2. Outline
- Brief Description of UltraSPARC T1 Architecture
- Analysis Objectives / Methodology
- Analysis of Results
- Interference on Shared Resources
- Scaling of Multiprogrammed Workloads
- Scaling of Multithreaded Workloads
3. UltraSPARC T1 (Niagara)
- A multi-threaded processor that combines CMP and SMT in a CMT
- 8 cores, each one handling 4 hardware context threads → 32 active hardware context threads
- Simple in-order pipeline with no branch predictor unit per core
- Optimized for multithreaded performance → throughput
- High throughput → hide the memory and pipeline stalls/latencies by scheduling other available threads, with zero-cycle thread switch penalty
4. UltraSPARC T1 Core Pipeline
- Thread group shares L1 cache, TLBs, execution units, pipeline registers and data path
- Blue areas are replicated copies per hardware context thread
5. Objectives
- Purpose
- Analysis of the interference of multiple executing threads on the shared resources of Niagara
- Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads
- Methodology
- Interference on Shared Resources (SPEC CPU2000)
- Scaling of a Multiprogrammed Workload (SPEC CPU2000)
- Scaling of a Multithreaded Workload (SPECjbb2005)
6. Analysis Objectives / Methodology
7. Methodology (1/2)
- On-chip performance counters of Niagara for real/accurate results
- Solaris 10 tools: cpustat, cputrack and psrset to bind processes to H/W threads (see the sketch below)
- 2 counters per hardware thread, with one dedicated to the instruction count
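(A minimal sketch of how these measurements can be scripted. It assumes Solaris 10 with psrset and cputrack available and sufficient privileges for processor-set creation; the CPU IDs, the benchmark binary and the DC_miss/Instr_cnt event names are assumptions to be checked against the UltraSPARC T1 documentation, e.g. with cputrack -h.)

```python
# Hypothetical driver for the counter-based methodology above.
import re
import subprocess

def create_processor_set(cpu_ids):
    """Create a processor set from the given virtual CPU IDs and return its id."""
    out = subprocess.run(["psrset", "-c"] + [str(c) for c in cpu_ids],
                         check=True, capture_output=True, text=True).stdout
    return int(re.search(r"\d+", out).group())   # psrset prints "created processor set N"

def run_with_counters(pset_id, command, outfile="counters.txt"):
    """Run `command` inside the processor set, sampling the two counters every second.
    On the T1 the second counter (pic1) is dedicated to retired instructions."""
    cputrack = ["cputrack", "-T", "1",
                "-c", "pic0=DC_miss,pic1=Instr_cnt",
                "-o", outfile] + command
    subprocess.run(["psrset", "-e", str(pset_id)] + cputrack, check=True)

if __name__ == "__main__":
    pset = create_processor_set([0, 1, 2, 3])   # the four strands of one core (assumed IDs)
    run_with_counters(pset, ["./crafty"])       # hypothetical benchmark binary
```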
8. Methodology (2/2)
- Niagara has only one FP unit → only integer benchmarks were considered
- Performance Counter Unit works at the granularity of a single H/W context thread
- No way to break down the effects of more threads per H/W thread
- Software profiling tools are too invasive
- Only pairs of benchmarks were considered, to allow correlation of benchmarks with events
- Many iterations, using the average behavior (sketched below)
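(A short sketch of the pairing and averaging step. The benchmark list is the SPEC CPU2000 integer suite; the per-iteration IPC values are illustrative placeholders, not measured data.)

```python
# Sketch of pair enumeration and averaging over iterations.
from itertools import combinations
from statistics import mean

INT_BENCHMARKS = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
                  "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]

def benchmark_pairs(benchmarks):
    """All unordered pairs, since only pairs of benchmarks are co-scheduled."""
    return list(combinations(benchmarks, 2))

def ipc_drop(single_thread_ipc, paired_ipc):
    """Relative IPC drop of a benchmark when running as part of a pair,
    using the average over all iterations of each configuration."""
    return 1.0 - mean(paired_ipc) / mean(single_thread_ipc)

# Example with made-up numbers, in the ballpark of the ~12% average drop reported later.
print(len(benchmark_pairs(INT_BENCHMARKS)))              # 66 pairs
print(ipc_drop([0.70, 0.69, 0.71], [0.62, 0.61, 0.62]))  # ~0.12
```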
9. Analysis of Results
- Interference on shared resources
- Scaling of a multiprogrammed workload
- Scaling of a multithreaded workload
10. Interference on Shared Resources
- Two modes considered (CPU-binding sketch below)
- Same-core mode: executes both members of a pair on the same core
- Sharing of pipeline, TLBs, L1 bandwidth
- More like an SMT
- Two-cores mode: executes each member of a pair on a different core
- Sharing of L2 capacity/bandwidth and main memory
- More like a CMP
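(A sketch of the CPU-ID selection behind the two modes. It assumes the usual T1 numbering where core c exposes virtual CPUs 4c to 4c+3; the actual layout should be verified, e.g. with psrinfo -pv.)

```python
# Sketch of binding-mode CPU selection for a benchmark pair.
STRANDS_PER_CORE = 4

def strands_of(core):
    """Virtual CPU IDs of the hardware strands belonging to one core (assumed layout)."""
    return [core * STRANDS_PER_CORE + s for s in range(STRANDS_PER_CORE)]

def same_core_mode(core=0):
    """Both members of a pair share one core: pipeline, TLBs, L1 bandwidth."""
    return strands_of(core), strands_of(core)

def two_cores_mode(core_a=0, core_b=1):
    """Each member of a pair on its own core: only L2/memory are shared."""
    return strands_of(core_a), strands_of(core_b)

print(same_core_mode())   # ([0, 1, 2, 3], [0, 1, 2, 3])
print(two_cores_mode())   # ([0, 1, 2, 3], [4, 5, 6, 7])
```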
11. Interference: same core (1/2)
- On average, a 12% drop of IPC when running in a pair
- Crafty followed by twolf showed the worst performance
- Eon showed the best behavior, keeping the IPC close to the single-thread case
12. Interference: same core (2/2)
- DC misses increased 20% on average (15% when taking crafty out)
- Worst DC misses are for vortex and perlbmk
- The pairs with the highest L2 miss ratios are not the ones that show an important decrease in IPC → mcf and eon pairs with more than 70% L2 misses
- Overall, small performance penalty even when sharing pipeline and L1/L2 bandwidth → the latency hiding technique is promising
13. Interference: two cores
- Only stressing the L2 and the shared communication buses
- On average, the L2 misses are almost the same as in the same-core case
- The available resources are underutilized
- Multiprogrammed workload with no data sharing
14. Scaling of Multiprogrammed Workload
- Reduced benchmark pair set
- Scaling 4 → 8 → 16 threads with configurations
15. Scaling of Multiprogrammed Workload
- Same core mode
- Mixed mode
16. Scaling of Multiprogrammed: same core
(Figures: IPC ratio, DC misses ratio)
- 4 → 8 case
- IPC / data cache misses not affected
- L2 data misses increased, but IPC is not affected
- Enough resources, running fully occupied
- Memory latency hiding
- 8 → 16 case
- More cores running the same benchmark
- Larger aggregate footprint / more requests to L2 / main memory
- L2 requirements / shared interconnect traffic decreased performance
(Figure: L2 misses ratio)
17. Scaling of Multiprogrammed: mixed mode
- Mixed mode case
- Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
- Same behavior as the same-core case for DC and L2 misses, with an average difference of 1 - 2%
- Overall, for both modes
- Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop
- Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput
(Figure: IPC ratio)
18. Scaling of Multithreaded Workload
- Scaled from 1 up to 64 threads (mapping sketched below)
- 1 → 8 threads: 1 thread mapped per core
- 8 → 16 threads: at maximum 2 threads mapped per core
- 16 → 32 threads: up to 4 threads per core
- 32 → 64 threads: more threads than hardware contexts, swapping is necessary
Configuration used for SPECjbb2005
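(A short sketch of this thread-to-core mapping for 8 cores with 4 strands each; beyond 32 threads the hardware contexts are oversubscribed and the OS must swap threads.)

```python
# Sketch of the SPECjbb2005 thread-to-core mapping used while scaling 1 to 64 threads.
import math

CORES, STRANDS_PER_CORE = 8, 4

def threads_per_core(total_threads):
    """Threads assigned to each core, capped by the 4 hardware strands."""
    return min(STRANDS_PER_CORE, math.ceil(total_threads / CORES))

def needs_swapping(total_threads):
    """True once the 32 hardware contexts are oversubscribed."""
    return total_threads > CORES * STRANDS_PER_CORE

for n in (8, 16, 32, 64):
    print(n, threads_per_core(n), needs_swapping(n))
# 8 -> 1 per core, 16 -> 2 per core, 32 -> 4 per core, 64 -> 4 per core with swapping
```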
19. Scaling of Multithreaded Workload
(Figure: SPECjbb2005 score per warehouse; the GC effect is visible)
20. Scaling of Multithreaded Workload
- Ratios over the 8-thread case with 1 thread per core (normalization sketched below)
- Instruction fetch and DTLB are stressed the most
- L1 data and L2 caches managed to scale even for more than 32 threads
(Figure annotation: GC effect)
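(A minimal sketch of the normalization used here: each metric measured at N threads is divided by its value in the 8-thread, 1-thread-per-core baseline. The sample values are placeholders, not measured data.)

```python
# Sketch of the "ratio over the 8-thread case" normalization.
def ratio_over_baseline(values_by_threads, baseline=8):
    """Map {thread count: value} into {thread count: value / baseline value}."""
    base = values_by_threads[baseline]
    return {n: v / base for n, v in sorted(values_by_threads.items())}

dtlb_misses = {8: 1.0e6, 16: 2.6e6, 32: 6.0e6, 64: 9.5e6}   # illustrative values only
print(ratio_over_baseline(dtlb_misses))   # {8: 1.0, 16: 2.6, 32: 6.0, 64: 9.5}
```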
21. Scaling of Multithreaded Workload
- Scaling of performance
- Linear scaling of almost 0.66x per thread up to 32 threads
- 20x speedup at 32 threads (0.66 × 32 ≈ 21)
- SMT (2 threads/core) gives on average a 1.8x speedup over the CMP configuration (region 1)
- SMT (up to 4 threads/core) gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
22. Conclusions
- Demonstration of interference on a real CMT system
- The long-latency hiding technique is effective for L1 and L2 misses and therefore could be a good/promising alternative to aggressive speculation
- Promising scaling up to 20x for multithreaded workloads, with an average of 0.66x per thread
- The instruction fetch subsystem and DTLBs are the most contended resources, followed by L2 cache misses
23. Q/A
- Thank you
- Questions?
- The Laboratory for Computer Architecture
- Web site: http://lca.ece.utexas.edu
- Email: kaseridi@ece.utexas.edu