Title: Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
1. Performance Analysis of Multiple Threads/Cores Using the UltraSPARC T1 (Niagara)
Dimitris Kaseridis, Lizy K. John
The University of Texas at Austin, Laboratory for Computer Architecture
http://lca.ece.utexas.edu
- Unique Chips and Systems (UCAS-4)
2. Outline
- Brief Description of UltraSPARC T1 Architecture
- Analysis Objectives / Methodology
- Analysis of Results
- Interference on Shared Resources
- Scaling of Multiprogrammed Workloads
- Scaling of Multithreaded Workloads
3. UltraSPARC T1 (Niagara)
- A multi-threaded processor that combines CMP and SMT in a CMT
- 8 cores, each one handling 4 hardware context threads → 32 active hardware context threads
- Simple in-order pipeline with no branch predictor unit per core
- Optimized for multithreaded performance → throughput
- High throughput → hide the memory and pipeline stalls/latencies by scheduling other available threads, with zero-cycle thread switch penalty
4. UltraSPARC T1 Core Pipeline
- Thread group shares L1 cache, TLBs, execution units, pipeline registers and data path
- Blue areas are replicated copies per hardware context thread
5. Objectives
- Purpose
- Analysis of the interference of multiple executing threads on the shared resources of Niagara
- Scaling abilities of CMT architectures for both multiprogrammed and multithreaded workloads
- Methodology
- Interference on Shared Resources (SPEC CPU2000)
- Scaling of a Multiprogrammed Workload (SPEC CPU2000)
- Scaling of a Multithreaded Workload (SPECjbb2005)
6. Analysis Objectives / Methodology
7. Methodology (1/2)
- On-chip performance counters of Niagara for real/accurate results
- Solaris 10 tools: cpustat, cputrack and psrset to bind processes to H/W threads (see the sketch below)
- 2 counters per hardware thread, with one dedicated to the instruction count
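(A minimal sketch of how these measurements can be scripted. It assumes Solaris 10 with psrset and cputrack available and sufficient privileges for processor-set creation; the CPU IDs, the benchmark binary and the DC_miss/Instr_cnt event names are assumptions to be checked against the UltraSPARC T1 documentation, e.g. with cputrack -h.)

```python
# Hypothetical driver for the counter-based methodology above.
import re
import subprocess

def create_processor_set(cpu_ids):
    """Create a processor set from the given virtual CPU IDs and return its id."""
    out = subprocess.run(["psrset", "-c"] + [str(c) for c in cpu_ids],
                         check=True, capture_output=True, text=True).stdout
    return int(re.search(r"\d+", out).group())   # psrset prints "created processor set N"

def run_with_counters(pset_id, command, outfile="counters.txt"):
    """Run `command` inside the processor set, sampling the two counters every second.
    On the T1 the second counter (pic1) is dedicated to retired instructions."""
    cputrack = ["cputrack", "-T", "1",
                "-c", "pic0=DC_miss,pic1=Instr_cnt",
                "-o", outfile] + command
    subprocess.run(["psrset", "-e", str(pset_id)] + cputrack, check=True)

if __name__ == "__main__":
    pset = create_processor_set([0, 1, 2, 3])   # the four strands of one core (assumed IDs)
    run_with_counters(pset, ["./crafty"])       # hypothetical benchmark binary
```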
8. Methodology (2/2)
- Niagara has only one FP unit → only integer benchmarks were considered
- Performance Counter Unit works at the granularity of a single H/W context thread
- No way to break down the effects of more threads per H/W thread
- Software profiling tools are too invasive
- Only pairs of benchmarks were considered, to allow correlation of benchmarks with events
- Many iterations, using the average behavior (sketched below)
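(A short sketch of the pairing and averaging step. The benchmark list is the SPEC CPU2000 integer suite; the per-iteration IPC values are illustrative placeholders, not measured data.)

```python
# Sketch of pair enumeration and averaging over iterations.
from itertools import combinations
from statistics import mean

INT_BENCHMARKS = ["gzip", "vpr", "gcc", "mcf", "crafty", "parser",
                  "eon", "perlbmk", "gap", "vortex", "bzip2", "twolf"]

def benchmark_pairs(benchmarks):
    """All unordered pairs, since only pairs of benchmarks are co-scheduled."""
    return list(combinations(benchmarks, 2))

def ipc_drop(single_thread_ipc, paired_ipc):
    """Relative IPC drop of a benchmark when running as part of a pair,
    using the average over all iterations of each configuration."""
    return 1.0 - mean(paired_ipc) / mean(single_thread_ipc)

# Example with made-up numbers, in the ballpark of the ~12% average drop reported later.
print(len(benchmark_pairs(INT_BENCHMARKS)))              # 66 pairs
print(ipc_drop([0.70, 0.69, 0.71], [0.62, 0.61, 0.62]))  # ~0.12
```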
9. Analysis of Results
- Interference on shared resources
- Scaling of a multiprogrammed workload
- Scaling of a multithreaded workload
10. Interference on Shared Resources
- Two modes considered (CPU-binding sketch below)
- Same-core mode: executes both members of a pair on the same core
- Sharing of pipeline, TLBs, L1 bandwidth
- More like an SMT
- Two-cores mode: executes each member of a pair on a different core
- Sharing of L2 capacity/bandwidth and main memory
- More like a CMP
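(A sketch of the CPU-ID selection behind the two modes. It assumes the usual T1 numbering where core c exposes virtual CPUs 4c to 4c+3; the actual layout should be verified, e.g. with psrinfo -pv.)

```python
# Sketch of binding-mode CPU selection for a benchmark pair.
STRANDS_PER_CORE = 4

def strands_of(core):
    """Virtual CPU IDs of the hardware strands belonging to one core (assumed layout)."""
    return [core * STRANDS_PER_CORE + s for s in range(STRANDS_PER_CORE)]

def same_core_mode(core=0):
    """Both members of a pair share one core: pipeline, TLBs, L1 bandwidth."""
    return strands_of(core), strands_of(core)

def two_cores_mode(core_a=0, core_b=1):
    """Each member of a pair on its own core: only L2/memory are shared."""
    return strands_of(core_a), strands_of(core_b)

print(same_core_mode())   # ([0, 1, 2, 3], [0, 1, 2, 3])
print(two_cores_mode())   # ([0, 1, 2, 3], [4, 5, 6, 7])
```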
11. Interference: same core (1/2)
- On average, a 12% drop of IPC when running in a pair
- Crafty followed by twolf showed the worst performance
- Eon showed the best behavior, keeping the IPC close to the single-thread case
12. Interference: same core (2/2)
- DC misses increased 20% on average (15% when taking crafty out)
- Worst DC misses are for vortex and perlbmk
- The pairs with the highest L2 miss ratios are not the ones that show an important decrease in IPC → mcf and eon pairs with more than 70% L2 misses
- Overall, small performance penalty even when sharing pipeline and L1/L2 bandwidth → the latency hiding technique is promising
13. Interference: two cores
- Only stressing the L2 and the shared communication buses
- On average, the L2 misses are almost the same as in the same-core case
- The available resources are underutilized
- Multiprogrammed workload with no data sharing
14. Scaling of Multiprogrammed Workload
- Reduced benchmark pair set
- Scaling 4 → 8 → 16 threads with configurations
15. Scaling of Multiprogrammed Workload
- Same core mode
- Mixed mode
16. Scaling of Multiprogrammed: same core
(Figures: IPC ratio, DC misses ratio)
- 4 → 8 case
- IPC / data cache misses not affected
- L2 data misses increased, but IPC is not affected
- Enough resources, running fully occupied
- Memory latency hiding
- 8 → 16 case
- More cores running the same benchmark
- Larger aggregate footprint / more requests to L2 / main memory
- L2 requirements / shared interconnect traffic decreased performance
(Figure: L2 misses ratio)
17. Scaling of Multiprogrammed: mixed mode
- Mixed mode case
- Significant decrease in IPC when moving both from 4 → 8 and from 8 → 16 threads
- Same behavior as the same-core case for DC and L2 misses, with an average difference of 1 - 2%
- Overall, for both modes
- Niagara demonstrated that moving from 4 to 16 threads can be done with less than a 40% average performance drop
- Both modes showed that significantly increased L1 and L2 misses can be handled, favoring throughput
(Figure: IPC ratio)
18. Scaling of Multithreaded Workload
- Scaled from 1 up to 64 threads (mapping sketched below)
- 1 → 8 threads: 1 thread mapped per core
- 8 → 16 threads: at maximum 2 threads mapped per core
- 16 → 32 threads: up to 4 threads per core
- 32 → 64 threads: more threads than hardware contexts, swapping is necessary
Configuration used for SPECjbb2005
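(A short sketch of this thread-to-core mapping for 8 cores with 4 strands each; beyond 32 threads the hardware contexts are oversubscribed and the OS must swap threads.)

```python
# Sketch of the SPECjbb2005 thread-to-core mapping used while scaling 1 to 64 threads.
import math

CORES, STRANDS_PER_CORE = 8, 4

def threads_per_core(total_threads):
    """Threads assigned to each core, capped by the 4 hardware strands."""
    return min(STRANDS_PER_CORE, math.ceil(total_threads / CORES))

def needs_swapping(total_threads):
    """True once the 32 hardware contexts are oversubscribed."""
    return total_threads > CORES * STRANDS_PER_CORE

for n in (8, 16, 32, 64):
    print(n, threads_per_core(n), needs_swapping(n))
# 8 -> 1 per core, 16 -> 2 per core, 32 -> 4 per core, 64 -> 4 per core with swapping
```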
19. Scaling of Multithreaded Workload
(Figure: SPECjbb2005 score per warehouse; the GC effect is visible)
20. Scaling of Multithreaded Workload
- Ratios over the 8-thread case with 1 thread per core (normalization sketched below)
- Instruction fetch and DTLB are stressed the most
- L1 data and L2 caches managed to scale even for more than 32 threads
(Figure annotation: GC effect)
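(A minimal sketch of the normalization used here: each metric measured at N threads is divided by its value in the 8-thread, 1-thread-per-core baseline. The sample values are placeholders, not measured data.)

```python
# Sketch of the "ratio over the 8-thread case" normalization.
def ratio_over_baseline(values_by_threads, baseline=8):
    """Map {thread count: value} into {thread count: value / baseline value}."""
    base = values_by_threads[baseline]
    return {n: v / base for n, v in sorted(values_by_threads.items())}

dtlb_misses = {8: 1.0e6, 16: 2.6e6, 32: 6.0e6, 64: 9.5e6}   # illustrative values only
print(ratio_over_baseline(dtlb_misses))   # {8: 1.0, 16: 2.6, 32: 6.0, 64: 9.5}
```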
21. Scaling of Multithreaded Workload
- Scaling of performance
- Linear scaling of almost 0.66x per thread up to 32 threads
- 20x speedup at 32 threads (0.66 × 32 ≈ 21)
- SMT (2 threads/core) gives on average a 1.8x speedup over the CMP configuration (region 1)
- SMT (up to 4 threads/core) gives a 1.3x and 2.3x speedup over the 2-way SMT per core and the single-threaded CMP, respectively
22. Conclusions
- Demonstration of interference on a real CMT system
- The long-latency hiding technique is effective for L1 and L2 misses and therefore could be a good/promising alternative to aggressive speculation
- Promising scaling up to 20x for multithreaded workloads, with an average of 0.66x per thread
- The instruction fetch subsystem and DTLBs are the most contended resources, followed by L2 cache misses
23. Q/A
- Thank you
- Questions?
- The Laboratory for Computer Architecture
- Web site: http://lca.ece.utexas.edu
- Email: kaseridi@ece.utexas.edu