A Study on HyperThreading - PowerPoint PPT Presentation

1
A Study on Hyper-Threading
  • Vimal Reddy
  • Ambarish Sule
  • Aravindh Anantaraman

2
Microarchitectural trends
  • Higher degrees of instruction-level parallelism
  • Different generations:
  • I. Serial processors: fetch and execute each
    instruction back to back
  • II. Pipelined processors: overlap different
    phases of instruction processing for higher
    throughput
  • III. Superscalar processors: overlap different
    phases of instruction processing, and issue and
    execute multiple instructions in parallel, for IPC
    > 1
  • IV. ???

3
Superscalar limits
  • Limitations of the superscalar approach:
  • - The amount of ILP in most programs is limited
  • - The nature of ILP in programs can be bursty
  • - Bottom line: resources can be utilized
    better

4
Simultaneous Multithreading
  • Finds parallelism at thread level
  • Executes multiple instructions from multiple
    threads each cycle
  • No significant increase in chip area over a
    superscalar processor

5
(Diagram: SMT pipeline. Fetch unit with multiple PCs and
thread selection; replicated architectural state in fetch:
replicated RAS, thread ids in the BTB. Decode, then register
renaming with replicated architectural state: multiple rename
map tables, multiple arch. map tables, multiple active lists.
Integer and FP queues and registers, integer load/store units,
instruction and data caches. Selective squash; per-thread
memory disambiguation.)
From ece721 notes, Prof. Eric Rotenberg, NCSU
6
Hyper-Threading
  • Brings the benefits of Simultaneous Multi-Threading
    (SMT) to the Intel Architecture
  • Motivation (same as that for SMT):
  • High processor utilization
  • Better throughput (by exploiting thread-level
    parallelism - TLP)
  • More power-efficient than CMP, which duplicates
    full processor cores

7
Hyper-Threading Contd.
  • 2 logical processors (2 threads in SMT
    terminology)
  • Shared instruction trace cache and L1 D-cache
  • 2 PCs and 2 register renamers
  • Other resources partitioned equally between the 2
    threads
  • Recombines partitioned resources when single-threaded
    (no degradation of single-thread
    performance)

Intel NetBurst Microarchitecture Pipeline With
Hyper-Threading Technology
8
Project Goal
  • Measure the performance of micro-benchmarks (kernels)
    on a Pentium 4. Form workloads that utilize different
    processor resources and study their behavior.

9
Pentium4 Functional Units
  • 3 integer ALU units (2 double speed)
  • 1 unit for floating-point computation
  • Separate address generation units for loads and stores
10
Micro-benchmarks
  • Created 3 types of kernels:
  • Floating-point-intensive kernel (flt)
  • Performs FP add, subtract, multiply, divide operations
    a large number of times
  • Targets the single FP unit
  • Integer-intensive kernel (int)
  • Performs integer add, subtract, and shift a large
    number of times
  • Targets the integer units (2 double speed and 1 slow)
  • Memory-intensive kernel (mem, mem_s)
  • Dynamically allocates a linked list larger than the
    L1 D-cache and traverses it
  • Targets the shared data cache and the memory hierarchy
    as a whole

11
Micro-benchmarks (contd.)
(Code listings: integer kernel, floating-point kernel,
memory-intensive kernel.)
12
Workbench
  • Machine: Pentium 4 Northwood, 2.53-2.66 GHz, with
    Hyper-Threading
  • Operating system: Linux 2.4.18-SMP kernel; the OS
    views each hardware thread as a processor
  • BIOS setting to turn HT on/off
  • Perl script to fork processes at the same time
  • top (Linux utility) to monitor processes
    (processor and memory utilization)
  • time utility to get timing statistics for each
    program
  • Ran each experiment 10 times and took the average
    execution time

13
Methodology
  • Run different workload combinations:
  • flt+flt: 2 floating-point kernels
  • mem_s+mem_s: 2 small memory-intensive kernels
  • int+flt: 1 integer and 1 float kernel, and so on
  • Run in 3 modes:
  • 1. back-to-back: run each program individually
  • 2. HT Off: no Hyper-Threading, but OS context
    switching
  • 3. HT On: Hyper-Threading on, and OS context
    switching
  • Find contending workloads: compete for
    resources and degrade performance (increase
    execution time with HT on)
  • Find complementary workloads: utilize idle
    resources and increase performance (decrease
    execution time with HT on)

14
Experiments: Single-thread performance
  • Hyper-Threading does not degrade single thread
    performance

15
Experiments (Contd.)
  • Contention for single FP unit increases
    execution time
  • Contention for data cache can lead to thrashing

16
Experiments (Contd.)
  • Integer workloads perform well: the 3 integer units
    (2 double speed) are well utilized
  • Workloads with complementary resource requirements
    perform well (int+flt, mem+int)
  • The OS plays an important role when the number of
    programs > number of hardware contexts available

17
Experiments (Contd.)
18
Experiments (contd.)
  • Execution time with the 3-kernel workload is less
    than that for 2!
  • Scheduling is important!
  • int+flt+flt: the int kernel gets 100% of 1 hardware
    thread; flt and flt split the other 50/50
  • flt+flt+int: the flt kernel gets 100% of 1 hardware
    thread; int and flt split the other 50/50. Has
    higher execution time!

19
Project Goal
  • Model Hyper-Threading on a simulator. Vary key
    parameters and study first-order effects.

20
Simulator details
  • Execution-driven, cycle-accurate simulator based
    on the SimpleScalar toolset
  • Extended the simulator to model SMT and
    Hyper-Threading:
  • Resource sharing by tagging entries with a thread
    id (I-cache, D-cache)
  • Resource replication through multiple
    instantiation (PC, map tables, branch history,
    RAS)
  • Resource partitioning by having separate
    instances but imposing a global limit on entries
    (active list, load/store buffers, IQs)
  • Stop simulation after completion of all threads
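The three resource-modeling styles above might be structured as in this sketch. It is not the project's SimpleScalar code; all names are hypothetical.

```c
#include <stdbool.h>

#define NTHREADS 2

/* Shared resource: one structure whose entries carry a thread id
   (as with the I-cache and D-cache) */
struct cache_entry { unsigned tag; int thread_id; bool valid; };

/* Replicated resource: one full instance per thread
   (as with the PC, map tables, branch history, RAS) */
struct thread_state { unsigned pc; unsigned rename_map[32]; };
struct thread_state threads[NTHREADS];

/* Partitioned resource: separate per-thread occupancy counters,
   with a single global limit on total entries
   (active list, load/store buffers, IQs) */
struct partitioned_queue {
    int used[NTHREADS];
    int global_limit;
};

/* allocate one entry for thread tid; fail when the global limit is hit */
bool pq_alloc(struct partitioned_queue *q, int tid) {
    if (q->used[0] + q->used[1] >= q->global_limit)
        return false;
    q->used[tid]++;
    return true;
}
```

Keeping per-thread counters under one global cap lets the same code model both a hard 50/50 partition (cap enforced per thread) and a fully shared pool, by changing only the limit check.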

21
Simulator details
22
Simulator SMT/HT validation
23
Experiment: Modeling L1 data cache interference
24
Experiment: Modeling issue queue partitioning
25
Experiment: Modeling total issue queue size with partitioning
26
Experiment: Varying load/store buffer sizes
(Pentium 4: 48 load, 24 store)
27
Experiment: Comparison of fetch policies
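Reference [4] introduces the ICOUNT fetch policy; assuming it was among the policies compared, a minimal sketch of ICOUNT thread selection (function and variable names are hypothetical):

```c
#define NTHREADS 2

/* ICOUNT: each cycle, fetch from the thread with the fewest
   instructions in the pre-execute pipeline stages, keeping the
   thread mix in the front end balanced */
int icount_pick(const int inflight[NTHREADS]) {
    int best = 0;
    for (int t = 1; t < NTHREADS; t++)
        if (inflight[t] < inflight[best])
            best = t;   /* favor the least-represented thread */
    return best;
}
```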
28
References
  • [1] Prof. Eric Rotenberg, course notes, ECE 792E:
    Advanced Microarchitecture, Fall 2002, NC State
    University.
  • [2] Deborah T. Marr et al., "Hyper-Threading
    Technology Architecture and Microarchitecture,"
    Intel Technology Journal, Vol. 6, Issue 1, 1st Qtr
    2002.
  • [3] Vimal Reddy, Ambarish Sule, Aravindh
    Anantaraman, "Hyperthreading on the Pentium 4,"
    ECE 792E project, Fall 2002. http://www.tinker.ncsu.
    edu/ericro/ece721/student_projects/avananta.pdf
  • [4] D. M. Tullsen et al., "Exploiting Choice:
    Instruction Fetch and Issue on an Implementable
    Simultaneous Multithreading Processor," 23rd
    Annual ISCA, pp. 191-202, May 1996.

29
Questions