NetThreads: Programming NetFPGA with Threaded Software

Transcript and Presenter's Notes
1
NetThreads: Programming NetFPGA with Threaded Software
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
2
Real-Life Customers
  • Hardware
  • NetFPGA board: 4 GigE ports, Virtex II Pro FPGA
  • Collaboration with CS researchers
  • Interested in performing network experiments
  • Not in coding Verilog
  • Want to use GigE link at maximum capacity
  • Requirements
  • Easy-to-program system
  • Efficient system

What would the ideal solution look like?
3
Envisioned System (Someday)
  • Many compute engines
  • Delivers the expected performance
  • Hardware handles communication and synchronization

[Diagram: many compute engines and hardware accelerators, exploiting both data-level and control-flow parallelism]
Processors inside an FPGA?
4
Soft Processors in FPGAs
  • Soft processors: processors implemented in the FPGA fabric
  • FPGAs increasingly implement SoCs with CPUs
  • Commercial soft processors: NIOS-II and MicroBlaze

What is the performance requirement?
5
Performance In Packet Processing
  • The application defines the throughput required

Edge routing (~1 Gbps/link)
Home networking (100 Mbps/link)
Scientific instruments (< 100 Mbps/link)
  • Our measure of throughput
  • Bisection search for the minimum packet inter-arrival time (sketch below)
  • Must not drop any packet
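
The bisection search itself is easy to express in C. Below is a minimal sketch (our addition): drops_observed() is a hypothetical stand-in for the real traffic generator, which sends a packet train at a fixed inter-arrival time and reports whether the device dropped anything.

    #include <stdbool.h>

    /* Hypothetical helper: send a packet train with the given
     * inter-arrival time and report whether any packet was dropped. */
    extern bool drops_observed(unsigned interval_ns);

    /* Precondition: lo_ns drops packets, hi_ns does not. */
    unsigned min_interarrival_ns(unsigned lo_ns, unsigned hi_ns)
    {
        while (hi_ns - lo_ns > 1) {
            unsigned mid = lo_ns + (hi_ns - lo_ns) / 2;
            if (drops_observed(mid))
                lo_ns = mid;   /* too fast: look at slower intervals */
            else
                hi_ns = mid;   /* lossless: try to go faster */
        }
        return hi_ns;          /* smallest lossless inter-arrival */
    }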

Are soft processors fast enough?
6
Realistic Goals
  • 10^9 bps stream with a normal inter-frame gap of 12 bytes
  • 2 processors running at 125 MHz
  • Cycle budget (derived below)
  • 152 cycles for a minimally-sized 64B packet
  • 3060 cycles for a maximally-sized 1518B packet
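
As a sanity check (this derivation is ours, not on the slide), the budget is the wire time of one packet plus the 12-byte gap, converted to 125 MHz cycles and summed over both processors:

\[
\frac{(64+12)\,\text{B} \times 8\ \text{b/B}}{10^{9}\ \text{b/s}} \times 125\ \text{MHz} \times 2\ \text{processors} = 76 \times 2 = 152\ \text{cycles},
\]

and likewise \((1518+12)\,\text{B}\) gives \(1530 \times 2 = 3060\) cycles.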

Soft processors can do non-trivial processing at line rate!
How can they efficiently be organized?
7
Key Design Features
8
Efficient Network Processing
Multithreaded soft processor
9
Multiprocessor System Diagram
[Diagram: two processors with instruction and data memories share a synchronization unit, input/output packet buffers, a data cache, and off-chip DDR; packets flow from the input buffer to the output buffer]
- Overcomes the 2-port limitation of block RAMs
- Shared data cache is not the main bottleneck in our experiments
10
Performance of Single-Threaded Processors
  • Single-issue, in order pipeline
  • Should commit 1 instruction every cycle, but
  • stall on instruction dependences
  • stall on memory, I/O, and accelerator accesses
  • Throughput depends on sequential execution
  • packet processing
  • device control
  • event monitoring

Many concurrent threads
Solution to Avoid Stalls: Multithreading
11
Avoiding Processor Stall Cycles
[Pipeline diagram: single-threaded execution through the 5 stages (F, D, E, M, W); data or control hazards stall the pipeline]
  • 4 threads eliminate hazards in a 5-stage pipeline
  • 5-stage pipeline is 77% more area efficient [FPL'07]

12
Multithreading Evaluation
13
Infrastructure
  • Compilation
  • modified versions of GCC 4.0.2 and Binutils 2.16
    for the MIPS-I ISA
  • Timing
  • no free PLL: processors run at the speed of the Ethernet MACs, 125 MHz
  • Platform
  • 2 processors, 4 MAC + 1 DMA ports, 64 MB of 200 MHz DDR2 SDRAM
  • Virtex II Pro 50 (speed grade 7)
  • 16 KB private instruction caches and a shared write-back data cache
  • Capacity would be increased on a more modern FPGA
  • Validation
  • Reference trace from a MIPS simulator
  • ModelSim and online instruction trace collection

- A PC server can send maximally-sized packets at 0.7 Gbps
- A simple packet-echo application can keep up
- Complex applications are the bottleneck, not the architecture
14
Our Benchmarks
Realistic, non-trivial applications (Classifier, NAT, UDHCP) dominated by control flow
15
What is limiting performance?
Packet backlog due to synchronization serializing tasks
Let's focus on the underlying problem: synchronization
16
Addressing Synchronization Overhead
17
Real Threads Synchronize
  • All threads execute the same code
  • Concurrent threads may access shared data
  • Critical sections ensure correctness

Threads 1-4 each execute:
    Lock()
    shared_var = f()
    Unlock()
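
In C this pattern looks roughly like the sketch below (our illustration): lock()/unlock() and flow_count are placeholder names, not the actual NetThreads API.

    #include <stdint.h>

    extern void lock(int id);     /* placeholder synchronization API */
    extern void unlock(int id);

    static uint32_t flow_count;   /* data shared by all threads */

    void account_packet(void)
    {
        lock(0);                  /* enter critical section */
        flow_count++;             /* read-modify-write of shared state */
        unlock(0);                /* let other threads proceed */
    }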
Impact on round-robin scheduled threads?
18
Multithreaded processor with Synchronization
[Pipeline diagram: round-robin multithreaded execution through the 5 stages (F, D, E, M, W); one thread acquires a lock and later releases it]
19
Synchronization Wrecks Round-Robin Multithreading
[Pipeline diagram: while the lock is held, the other threads stall in the round-robin schedule]
With round-robin thread scheduling and contention on locks:
< 4 threads execute concurrently
> 18 cycles are wasted while blocked on synchronization
20
Better Handling of Synchronization
[Pipeline diagram (before): with lock-aware scheduling, ready threads fill the pipeline slots (F, D, E, M, W) that blocked threads would otherwise waste]
21
Thread scheduler
  • Suspend any thread waiting for a lock
  • Round-robin among the remaining threads
  • Unlock operation resumes threads across processors (behavioral sketch below)
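
A behavioral model of this policy is sketched below in C for clarity; the real scheduler is pipeline hardware, and everything beyond the three bullets above (the flag layout, the thread count) is our assumption.

    #define NTHREADS 4

    static int blocked_on[NTHREADS] = { -1, -1, -1, -1 };
                                      /* -1 = runnable, else awaited lock id */
    static int last;                  /* thread issued most recently */

    /* Round-robin over threads, skipping any thread waiting for a lock. */
    int pick_next_thread(void)
    {
        for (int i = 1; i <= NTHREADS; i++) {
            int t = (last + i) % NTHREADS;
            if (blocked_on[t] < 0)
                return last = t;
        }
        return -1;                    /* every thread blocked: pipeline stalls */
    }

    /* An unlock resumes all waiters, possibly on the other processor. */
    void on_unlock(int lock_id)
    {
        for (int t = 0; t < NTHREADS; t++)
            if (blocked_on[t] == lock_id)
                blocked_on[t] = -1;
    }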

- Multithreaded processor hides hazards across active threads
- With fewer than N active threads, hazard detection is needed again
But hazard detection was on the critical path of the single-threaded processor
Is there a low-cost solution?
22
Static Hazard Detection
  • Hazards can be determined at compile time

- Hazard distances are encoded as part of the
instructions
Static hazard detection allows scheduling without an extra pipeline stage
Very low area overhead (5%), no frequency penalty
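
To illustrate what "determined at compile time" means, the sketch below scans backwards for the nearest earlier instruction that produces one of the current instruction's source registers; the struct layout and the distance cap are our assumptions, not the actual encoding.

    /* Minimal instruction representation, for illustration only. */
    struct insn { int dest, src1, src2; };   /* register numbers */

    #define MAX_DIST 7   /* assumed cap so the distance fits in a few bits */

    /* Distance (in instructions) from code[i] back to the nearest
     * producer of one of its sources; MAX_DIST means "no hazard". */
    int hazard_distance(const struct insn *code, int i)
    {
        for (int d = 1; d <= MAX_DIST && d <= i; d++) {
            const struct insn *p = &code[i - d];
            if (p->dest == code[i].src1 || p->dest == code[i].src2)
                return d;
        }
        return MAX_DIST;
    }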
23
Thread Scheduler Evaluation
24
Results on 3 benchmark applications
- Thread scheduling improves throughput by 63%, 31%, and 41%
- Why isn't the 2nd processor always improving throughput?
25
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- What is the bottleneck?
26
Impact of Allowing Packet Drops
- System still under-utilized
- Throughput still dominated by serialization
27
Future Work
  • Adding custom hardware accelerators
  • Same interconnect as processors
  • Same synchronization interface
  • Evaluate speculative threading
  • Alleviate the need for fine-grained synchronization
  • Reduce conservative synchronization overhead

28
Conclusions
  • Efficient multithreaded design
  • Parallel threads hide stalls on one thread
  • Thread scheduler mitigates synchronization costs
  • System Features
  • System is easy to program in C
  • Performance from parallelism is easy to get

On the lookout for relevant applications suitable for benchmarking
NetThreads available with compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
29
Geoff Salmon, Monia Ghobadi, Yashar Ganjali (CS Dept.)
Martin Labrecque, Gregory Steffan (ECE Dept.)
University of Toronto
NetThreads available with compiler at http://netfpga.org/netfpgawiki/index.php/Projects:NetThreads
30
Backup
31
Software Network Processing
  • Not meant for
  • Straightforward tasks accomplished at line speed
    in hardware
  • E.g. basic switching and routing
  • Advantages compared to Hardware
  • Complex applications are best described in high-level software
  • Easier to design and fast time-to-market
  • Can interface with custom accelerators,
    controllers
  • Can be easily updated
  • Our focus: stateful applications
  • Data structures modified by most packets
  • Difficult to pipeline the code into balanced
    stages
  • Run-to-Completion/Pool-of-Threads model for parallelism (sketch below)
  • Each thread processes a packet from beginning to end
  • No thread-specific behavior
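
A minimal sketch of that loop (our illustration; get_packet(), process(), and send_packet() are placeholder names for the packet I/O and the application logic):

    struct packet;

    extern struct packet *get_packet(void);   /* claim next packet from input */
    extern void process(struct packet *p);    /* application logic */
    extern void send_packet(struct packet *p);

    /* Every thread runs the same loop: one packet, start to finish. */
    void thread_main(void)
    {
        for (;;) {
            struct packet *p = get_packet();
            process(p);
            send_packet(p);
        }
    }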

32
Impact of allowing packet drops
[Chart: impact of allowing packet drops on the NAT benchmark]
33
Cycle Breakdown in Simulation
[Charts: cycle breakdown for Classifier, NAT, and UDHCP]
- Removed cycles stalled waiting for a lock
- Throughput still dominated by serialization
34
More Sophisticated Thread Scheduling
  • Add pipeline stage to pick hazard-free
    instruction
  • Result
  • Increased instruction latency
  • Increased hazard window
  • Increased branch mis-prediction cost

Can we add hazard detection without an extra pipeline stage?
35
Implementation
  • Where to store the hazard distance bits?
  • Block RAM widths are multiples of 9 bits
  • A 36-bit word leaves 4 bits available beyond the 32-bit instruction
  • Also encode lock and unlock flags (sketch below)
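
A sketch of unpacking such a word (our illustration): the slide does not give the exact split of the 4 bits between hazard distance and the lock/unlock flags, so the 2/1/1 layout below is an assumption.

    #include <stdint.h>

    struct fetched {
        uint32_t insn;        /* 32-bit MIPS instruction */
        unsigned dist   : 2;  /* assumed: encoded hazard distance */
        unsigned lock   : 1;  /* assumed: acquire-lock flag */
        unsigned unlock : 1;  /* assumed: release-lock flag */
    };

    /* Split a 36-bit block-RAM word (in the low bits of a uint64_t)
     * into the instruction and its metadata. */
    struct fetched unpack(uint64_t word36)
    {
        struct fetched f;
        f.insn   = (uint32_t)(word36 & 0xFFFFFFFFu);
        f.dist   = (word36 >> 32) & 0x3;
        f.lock   = (word36 >> 34) & 0x1;
        f.unlock = (word36 >> 35) & 0x1;
        return f;
    }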

[Diagram: 36-bit word = 32 instruction bits + 4 metadata bits]
How to convert instructions from 36 bits to 32 bits?
36
Instruction Compaction: 36 → 32 bits
R-Type Instructions
Example: add rd, rs, rt
J-Type Instructions
Example: j label
I-Type Instructions
Example: addi rt, rs, immediate
- De-compaction: 2 block RAMs + some logic between DDR and cache
- Not on a critical path of the pipeline