RAMP-RED, What's Next?

Provided by: MarkHo85

1
RAMP-RED, What's Next?
  • Sewook Wee
  • (P.I. Christos Kozyrakis)
  • Computer Systems Laboratory
  • Stanford University

2
What is RAMP-RED? (Review)
  • a.k.a. ATLAS: FPGA implementation of the TCC
    architecture
  • The TCC project (http://tcc.stanford.edu)
  • Goal: make parallel programming practical
  • 10% of the effort → 90% of the performance
  • Key abstraction: transactional memory
  • Optimistic execution of atomic, isolated code
  • Research on architecture, PL, OS, and application
    issues
  • ATLAS: a fast development platform for software
    research
  • Used with both user-level and system software
  • 100x speedup over simulation
  • Rich support for profiling and application
    analysis
  • Accurate performance estimates for software
    tuning
  • A tool to share with our software/application
    colleagues

3
Architecture Model
  • 8-way CMP with TM support in caches/coherence
    protocol
  • See the PACT'05 paper for details
  • Uniform memory access through star-like
    interconnect
  • Separate processor for the OS

4
ATLAS on BEE2
  • Clock Frequency
  • Control/user FPGAs @ 100 MHz
  • FPGA-FPGA links @ 100 MHz

[Figure: BEE2 board diagram showing the Control FPGA, User FPGAs 0 to 3, Ethernet, and DRAM]

5
ATLAS Software architecture
[Figure: ATLAS software architecture diagram; labels: Linux PPC, Violation, Commit]

6
Evaluation
  • Compared ATLAS to our Tassel TCC simulator
  • Configured similarly
  • Tassel overview
  • Execution-driven simulator
  • Runs user-level code only
  • Assumes CPI = 1 for non-load/store instructions
  • Models the details of the memory hierarchy
  • Tassel supports fast-forwarding
  • Functional-only simulation of portions of the
    code
  • Fast-forwarding is controlled by the programmer
  • Skip initialization or repeated execution
  • Must be careful to avoid skipping the important
    part

7
Accuracy: Estimated Speedup

8
Accuracy: Execution Time Breakdown

9
Speed: Wall-clock Time Improvement

10
Wall-clock Time Scaling Trends

11
What's Next?
  • ATLAS is a FAST development platform for
    software research.
  • ATLAS should be USEFUL, not just fast.
  • More debugging features
  • Easier performance tuning
  • User evaluation study

12
Parallel Debugger
  • To achieve correctness easily, a handy debugger is
    crucial.
  • printf alone is simply not enough.
  • printf may raise a syscall exception, which is
    irrevocable.
  • Bugs from interactions among transactions will not
    be caught.
  • Good news from TM
  • In TM land, each transaction runs logically
    isolated.
  • Other nodes' execution is not important.
  • Only commits from other nodes matter.
  • Many policies and related issues
  • Hardware breakpoint vs. software breakpoint
  • Local breakpoint vs. global breakpoint
  • Physical channel to connect the debugger
  • Behavior of other nodes when one node breaks

13
Initial Status
  • XMD through JTAG chain
  • We need a host PC anyway for cross-compiling,
    stat post-processing, and so on.
  • Forwarding the debug exception through Linux PPC and
    connecting a remote GDB to Linux PPC seems too
    complicated.
  • Hardware breakpoint
  • A software breakpoint requires self-modifying code.
  • A poison instruction cannot be propagated to the
    I-side before official commit.
  • Local breakpoint
  • Hardware breakpoints go well with local BPs.
  • Other processors may continue for now.
  • XMD does not provide a hook for writing a breakpoint
    handler.
  • The TCC PPC has no channel to interrupt other
    processors.

14
Way to Go
  • Software BP
  • Back door to propagate the poison instruction without
    official commit.
  • Open nesting is an option. (ATLAS doesn't
    support it yet.)
  • Bypass write may work.
  • Software BP goes well with global BP.
  • Need a way to interrupt all other nodes.
  • What should the other nodes do when one hits a BP?
  • All-stop? Ignore?
  • Execute, but not commit? (Token arbiter helps.)
  • XMD BP handler hook is needed.
  • Enables declaring more complicated BPs.
  • Implements local BP using global BP.

15
Deterministic Parallel Execution Replay
[Figure: interleavings of operations on a shared variable A (updates to A and "Divide by A"), annotated "Bug Patch: Simply Add Delays"; panels labeled "BUG!", "Hidden", and "Well-hidden"]
  • Many bugs are hidden in the parallel execution
    sequence.
  • They do not happen every time.
  • Evaluating a fix is not straightforward.
  • How would you guarantee that the bug is fixed,
    not just hidden?

16
Transactional Memory can Help!
  • Tons of papers have been published in ISCA, HPCA,
    and ASPLOS.
  • How to minimize the footprint size
  • RTR (ASPLOS 2006): 1 byte per 1K instructions
  • A transactional memory application only needs to
    record the commit order of transactions.
  • By the nature of TM, the officially committed run of
    each transaction should see the same memory copy
    in every execution.
  • Step forward! By tweaking the commit order log, you
    can make ATLAS run following your own commit order
    scenario. → In theory, you can cover all
    possible corner cases.
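
The commit-order idea above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the ATLAS mechanism: the names `run`, `increment`, and `divide` are hypothetical, transactions are modeled as Python functions that read a memory snapshot and return a write set, and the shared variable A with a divide-by-A step is loosely borrowed from the replay figure.

```python
from typing import Callable, Dict, List

Memory = Dict[str, int]
# A transaction reads a snapshot of memory and returns its write set.
Transaction = Callable[[Memory], Memory]


def run(transactions: Dict[str, Transaction],
        commit_order: List[str],
        initial: Memory) -> Memory:
    """Execute transactions so that they commit in the logged order."""
    memory = dict(initial)
    for name in commit_order:
        # The committed run of a transaction sees exactly the state
        # produced by the commits that precede it in the log.
        write_set = transactions[name](dict(memory))
        memory.update(write_set)
    return memory


# Hypothetical transactions on a shared variable A.
def increment(mem: Memory) -> Memory:
    return {"A": mem["A"] + 1}


def divide(mem: Memory) -> Memory:
    # Raises ZeroDivisionError if another commit has set A to 0.
    return {"result": 10 // mem["A"]}


if __name__ == "__main__":
    txs = {"inc": increment, "div": divide}
    # Replaying the same log always gives the same final memory.
    print(run(txs, ["div", "inc"], {"A": -1}))
    # Tweaking the log forces the rare interleaving that exposes the bug.
    try:
        run(txs, ["inc", "div"], {"A": -1})
    except ZeroDivisionError:
        print("bug reproduced with commit order inc, div")
```

In this model the log is just a list of transaction names, which is why tweaking it is enough to steer execution into a chosen corner case.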

17
Performance Monitor
  • The programmer wants to see how well the app is
    parallelized.
  • Total execution time and its breakdown are simply not
    enough. → A dynamic resource utilization log is
    needed.
  • By logging the utilization factor and PC per
    transaction, this chart can be implemented.
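
As a rough sketch of how a per-transaction log could be reduced to the data behind such a chart, the following assumes a hypothetical record layout (`TxRecord`) with a processor ID, start and end cycles, useful cycles, and a PC string. None of these field names come from ATLAS; they are illustrative only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TxRecord:
    proc: int    # processor that ran the transaction
    start: int   # cycle at which the transaction began
    end: int     # cycle at which it committed (exclusive)
    useful: int  # cycles of useful work within [start, end)
    pc: str      # source location of the transaction (hypothetical format)


def utilization(records: List[TxRecord], num_procs: int,
                bucket: int, total: int) -> List[float]:
    """Average utilization of all processors per time bucket (0.0 to 1.0)."""
    n_buckets = (total + bucket - 1) // bucket
    busy = [0.0] * n_buckets
    for r in records:
        # Utilization factor of this transaction, spread evenly over its span.
        factor = r.useful / (r.end - r.start)
        for b in range(r.start // bucket, (r.end - 1) // bucket + 1):
            lo = max(r.start, b * bucket)
            hi = min(r.end, (b + 1) * bucket)
            busy[b] += factor * (hi - lo)
    return [x / (bucket * num_procs) for x in busy]


if __name__ == "__main__":
    log = [TxRecord(0, 0, 100, 100, "app.c:10")]                     # sequential phase
    log += [TxRecord(p, 100, 200, 100, "app.c:42") for p in range(4)]  # parallel phase
    print(utilization(log, num_procs=4, bucket=100, total=200))
```

Low-utilization buckets can then be mapped back to the logged PCs to find the poorly parallelized regions.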

[Figure: utilization timeline for Proc 0 to Proc 3 with regions labeled Sequential, Well-parallelized, Sequential, and Poorly-parallelized; callout "radix.c:55"]

18
TM-oriented Performance Tuning
  • Inspired by TAPE (ISCA 2005)
  • Two important events that hurt performance:
    Violation and Overflow
  • A runtime HW/SW monitor keeps track of the most
    expensive ones.
  • Violation
  • Commit node, Violated node, Cost, PC, Data Object
  • Overflow
  • Node ID, PC, Cost
  • LRU overflow? Or write-set buffer overflow?
  • # of occurrences
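
A minimal sketch of the "track the most expensive ones" step for violations, assuming events carry the fields listed above. The `Violation` record and the `most_expensive` helper are hypothetical names, and grouping by (PC, data object) is one possible choice rather than the monitor's actual policy.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Violation:
    commit_node: int    # node whose commit caused the violation
    violated_node: int  # node that had to roll back
    cost: int           # cycles of work lost
    pc: int             # PC of the violated access
    data_object: str    # name of the conflicting data object


def most_expensive(events: List[Violation],
                   k: int) -> List[Tuple[Tuple[int, str], int, int]]:
    """Return the k (pc, data_object) pairs with the highest total cost."""
    totals: Dict[Tuple[int, str], List[int]] = defaultdict(lambda: [0, 0])
    for e in events:
        entry = totals[(e.pc, e.data_object)]
        entry[0] += e.cost  # accumulated cost
        entry[1] += 1       # number of occurrences
    ranked = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(key, cost, count) for key, (cost, count) in ranked[:k]]


if __name__ == "__main__":
    events = [
        Violation(0, 1, 100, 0x40, "hist"),
        Violation(2, 1, 50, 0x40, "hist"),
        Violation(0, 3, 120, 0x80, "queue"),
    ]
    for key, cost, count in most_expensive(events, 2):
        print(key, cost, count)
```

Overflow events could be aggregated the same way, keyed by node ID and PC.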

19
Observer
  • Logging overhead can affect dynamic execution
    behavior.
  • In ATLAS, logging to a file may incur a syscall
    exception.
  • Logging may consume too much memory (or bus)
    bandwidth.
  • Logging information from hardware modules needs a
    simple interface.
  • We want a very simple and standard way to log
    runtime information from hardware and software.
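
One way to picture such an interface is a bounded log buffer that producers write to without blocking and a separate light-weight consumer drains. The sketch below is a software model only; the `LogBuffer` class, its drop-when-full policy, and the `drain` batch size are assumptions, not a description of the ATLAS hardware.

```python
from collections import deque
from typing import Any, Deque, List, Tuple

Entry = Tuple[str, Any]  # (source name, payload)


class LogBuffer:
    """Bounded buffer: producers never block, a consumer drains in batches."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: Deque[Entry] = deque()
        self.dropped = 0  # entries lost because the buffer was full

    def log(self, source: str, payload: Any) -> bool:
        """Append one entry; drop it (and count the drop) if the buffer is full."""
        if len(self.entries) >= self.capacity:
            self.dropped += 1
            return False
        self.entries.append((source, payload))
        return True

    def drain(self, sink: List[Entry], max_entries: int) -> int:
        """Move up to max_entries oldest entries to the sink; return the count."""
        moved = 0
        while self.entries and moved < max_entries:
            sink.append(self.entries.popleft())
            moved += 1
        return moved


if __name__ == "__main__":
    buf = LogBuffer(capacity=2)
    buf.log("hw", {"event": "violation"})
    buf.log("sw", {"event": "commit"})
    buf.log("hw", {"event": "overflow"})  # buffer full: dropped
    out: List[Entry] = []
    buf.drain(out, max_entries=8)
    print(out, "dropped:", buf.dropped)
```

Dropping on overflow keeps the logging path from perturbing execution, at the price of incomplete logs; stalling the producer is the alternative trade-off.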

[Figure: Observer data path; labels: LOG Buffer (BRAM), OPB, Light-weight Processor (MicroBlaze), XAUI Channel / Virtual Ethernet, File System (NFS)]

20
User Evaluation Study
  • With all these features, we want to perform a user
    evaluation study of the ATLAS system.
  • Parallel programming class project
  • FCRC tutorial session
  • Evaluation items
  • Is it true that TCC makes programming easier?
  • Is ATLAS a better platform than a software simulator?
  • Are the debugging and performance tuning
    infrastructures good enough? What more do we
    need?

21
Conclusion
  • RAMP-RED delivers a FAST development platform for
    software research.
  • It should be USEFUL, not just fast.
  • Handy parallel debugger
  • Deterministic replay
  • Visualized performance monitor
  • TM-oriented performance tuning
  • We hope many of you will participate in the User
    Evaluation Study.
  • The 1000s-core CMP is a question for tomorrow; debugging
    and performance tuning are for TODAY!

22
People
  • Grad Students
  • Sewook Wee (API, Apps, System Software, BEE2
    board integration)
  • Nju Njoroge (TCC RTL design)
  • Jared Casper (Switch RTL Design)
  • Jiwon Seo (Python on ATLAS)
  • Undergrads
  • Daxia Ge
  • Lewis Mbae
  • Faculty
  • Christos Kozyrakis
  • Kunle Olukotun