MPhil/Master dissertation - PowerPoint PPT Presentation

About This Presentation
Title:

MPhil/Master dissertation

Description:

HW-SW Co-Design Framework for Parallel Distributed Computing on NoC-based ... Each NIOS-II Avalon based tile is generated effortlessly through QuartusII SOPC ... – PowerPoint PPT presentation

Slides: 43
Provided by: cephi

Transcript and Presenter's Notes

Title: MPhil/Master dissertation


1
HW-SW Co-Design Framework for Parallel
Distributed Computing on NoC-based MPSoC
architectures
  • MPhil/Master dissertation
  • presented by
  • Jaume Joven Murillo
  • and supervised by
  • Dr. Jordi Carrabina Bordoll

2
Presentation outline
  1. Introduction
  2. Basic concepts & state of the art in NoCs and
    MPSoCs
  3. Design framework and working methodology
  4. HW-SW NoC-based MPSoC implementation
  5. Experimental results
  6. Conclusions & future work

3
1. Introduction
  • 1.1 - Introduction & research project analysis
  • 1.2 - Objectives of the research project

4
Introduction
  • The continuous evolution of technology
    (Moore's law) means that every IC can contain
    a large number of transistors (up to 2020,
    according to the SIA roadmap)
  • Productivity gap
  • Adopted solutions
  • Component reuse (IP cores)
  • Soft-cores processors
  • HW-SW co-design
  • Novel design methodologies
  • Communication centric
  • Novel on-chip paradigms
  • Networks-on-Chips (NoCs)
  • System-level languages
  • SystemC, UML, ...
  • Develop complex ICs with billions of transistors
    in the near future
  • Multiprocessor-System-on-Chip (MPSoC) /
    Multi-cores / Chip-multiprocessors (CMP)
  • Sea of tiles (IP cores) interconnected by a
    Network-on-Chip

5
Objectives of the research project
  • Develop a HW-SW co-design framework for parallel
    distributed computing on-chip applying
    platform-based design concepts
  • Performs a co-evolution strategy of two concurrent
    phases (HW and SW)
  • Hardware architecture
  • Scalable Distributed-Memory NoC-based MPSoC
    (NUMA)
  • Software framework
  • Software drivers
  • embedded Message Passing Interface (eMPI)
  • Run benchmarks & test applications
  • Explore concurrency and parallelism in on-chip
    environments

6
2. Basic Concepts and state of the art in NoCs
and MPSoCs
  • 2.1 - On-chip communication schemes
  • 2.2 - Basic concepts on NoCs
  • 2.3 - NoC topologies
  • 2.4 - Switching modes & routing schemes
  • 2.5 - Flow control & micro-network stack
  • 2.6 - State of the art in NoCs/MPSoCs

7
On-chip communication architectures
  • Point-to-point
  • Fixed dedicated wires
  • Not flexible, Not shared
  • Null reusability
  • Bus-based interconnection (OCB)
  • Shared communication infrastructure
  • Multi-level, hierarchical or segmented buses
  • Bus becomes a bottleneck
  • On-chip network (NoC)
  • Distributed nature
  • Maximum flexibility & scalability
  • Exploits reusability, parallel
    operations/transactions
  • Regular geometry
  • Predictable layout and performance
  • Best testability & verification time
  • Must guarantee a certain QoS

8
Basic concepts on NoCs
  • Tile
  • Computational nodes
  • Router/Switch
  • Communication nodes
  • Switching and routing strategy
  • Network adapter (NA, NI, NIC)
  • Decouple computation from communication
  • Adapts network & tile clock domains (GALS)
  • Links
  • Dedicated P2P communication channels
  • Flow control protocol (Handshake or credit-based)
  • NoC-based systems
  • High degree of composition and traffic diversity
  • Good floorplanning & minimal buffering are
    desirable
  • Conventional/Traditional networks
  • Homogeneous and coarse grained

9
NoC topologies
  • Typical of multiprocessor systems, but now on a
    chip
  • Regular
  • Predictable in terms of
  • Power consumption,
  • Performance (bandwidth, latency)
  • Area usage
  • Good floorplanning
  • Non-regular
  • Mixing regular topologies
  • Mesh-Torus, Ring-Mesh, Ring-hypercube
  • Direct
  • At least one tile attached to each node
  • Indirect
  • A subset of nodes are not connected to any core
  • Its selection is a trade-off between
  • Network complexity or on-chip area costs
  • Communication requirements or network performance

10
Switching modes & routing schemes
  • Circuit switching
  • Involves the establishment & release of a
    circuit between source and destination
  • Buffer-less switching scheme
  • Packet switching
  • Forwards the data to next hop
  • Buffering is mandatory
  • Different packet switching modes
  • Store-and-forward
  • Stall at two nodes and the link between them
  • Wormhole
  • Combines packet switching & circuit switching
  • Reduce buffer size
  • Stall at all nodes and links spanned by the
    packet
  • Virtual cut-through
  • Next hop must store the whole packet
  • Stall at local node
  • Buffering
  • Buffer size → width, depth
  • Location in the router
  • Shared or distributed
  • Affects power consumption & area usage

11
Switching modes & routing schemes
  • Routing schemes
  • Deterministic
  • Path determined by the source & destination
    addresses
  • Easy to implement
  • Not optimal under congestion
  • Adaptive
  • Path decided on a per-hop basis
  • Complex in its implementation
  • Routing must be deadlock/livelock-free
  • Offers great benefits under congestion

12
Flow control & micro-network stack
  • Flow control protocol (ensures the correct
    transport of packets)
  • Handshake
  • Request & acknowledge signals (req, ack/nAck)
  • Simpler and cheaper than credit-based
  • Credit-based
  • All network components keep counters for the
    available buffer space
  • Data received → counter--; data sent → counter++;
    if counter == 0 → buffer full
  • Network µ-stack layers
  • Transport
  • Network Adapter has to pack/unpack messages into
    network layer packets
  • Network
  • Where & how a packet is transmitted
  • Data-link
  • Protocol to transmit a flit/phit
  • Physical
  • Number & length of wires
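
The credit-based scheme on this slide can be sketched in a few lines of C. This is only an illustration of the counter rule stated above (receive decrements, send increments, zero means full); the type and function names, and the idea of storing the buffer depth next to the counter, are assumptions and not taken from the dissertation.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical credit counter for one router input buffer. */
typedef struct {
    int depth;    /* total number of buffer slots (assumed parameter) */
    int credits;  /* free slots still available */
} credit_port_t;

static void credit_init(credit_port_t *p, int buffer_depth) {
    assert(buffer_depth > 0);
    p->depth = buffer_depth;
    p->credits = buffer_depth;
}

/* counter == 0 means the buffer is full. */
static bool credit_buffer_full(const credit_port_t *p) {
    return p->credits == 0;
}

/* Data received: one slot is consumed (counter--).
 * Returns false when the flit must be refused because the buffer is full. */
static bool credit_on_receive(credit_port_t *p) {
    if (credit_buffer_full(p)) {
        return false;
    }
    p->credits--;
    return true;
}

/* Data sent to the next hop: one slot is released (counter++). */
static void credit_on_send(credit_port_t *p) {
    if (p->credits < p->depth) {
        p->credits++;
    }
}
```

An upstream component would call `credit_buffer_full()` before injecting a flit, which is the extra bookkeeping that makes this scheme costlier than a plain handshake.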

13
State of the art in NoCs/MPSoCs
  • NoCs have been an emerging hot topic in recent years
  • Research at all µ-stack levels
  • System/Application Level
  • Design methodologies, co-exploration
  • Programming models & OS support
  • Network Adapters
  • Network architecture
  • Link level
  • Research on MPSoC
  • HW-SW interfaces
  • Implementation of parallel programming models
  • Shared memory or message passing
  • ccNUMA MPSoC architecture using NIOS-II
  • MPSoC using segmented buses (HIBI)

14
3. Design framework and working methodology
  • 3.1 - HW-SW Co-design flow
  • 3.2 - Proposed NoC-based MPSoC architecture
  • 3.3 - Prototyping platform

15
HW-SW Co-design flow
  • System specification
  • Architecture exploration
  • µP, VLIW, DSP
  • NoC routers, buses
  • NIC interfaces
  • Architecture design and HW-SW co-design
  • RTL architecture
  • IP core integration (Soft-cores)
  • Software design
  • Benchmarks/Applications
  • embedded MPI (eMPI)
  • NIC driver
  • Integration and system-verification
  • SystemC
  • On-chip co-debugging
  • Functional prototype

Tools shown alongside the flow: Quartus II SOPC; Microsoft Visual Studio; Eclipse IDE for NIOS-II; ModelSim, GTKWave, SignalTap; Synplify or Quartus II

16
Proposed NoC-based MPSoC architecture
  • Distributed-memory NoC-based MPSoC components
  • NoC communication architecture
  • Soft/Hard IP core processors (Pi)
  • Distributed memory subsystem (Mi)
  • Network Interface Controller (NICi)
  • Driver for Network Interface Controller (NIC
    driver)
  • embedded Message Passing Interface (eMPI)

17
Proposed NoC-based MPSoC architecture
  • NoC topology
  • 2D-Mesh (regular, predictable)
  • XY Routing
  • Deterministic, minimal & deadlock-free
  • Switching mode
  • Ephemeral Circuit switching
  • Store & forward
  • Flow control
  • 4-phase handshake
  • Tile composition
  • NIOS-II Soft-core processor
  • On-chip RAM or SSRAM controller
  • NIC interface to NoC
  • Timer (IRQs, multi-threaded)
  • UART, JTAG, Performance Counter
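
XY routing on a 2D-Mesh is simple enough to show directly. The following is a minimal sketch of generic dimension-order routing, not the author's RTL: the port names and the orientation (x grows to the east, y grows to the south) are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical output ports of a mesh router. */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Dimension-order (XY) routing: first correct the X offset, then the Y offset.
 * The decision depends only on the current and destination coordinates,
 * which is what makes the scheme deterministic. */
static port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y) {
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_SOUTH;
    if (dst_y < cur_y) return PORT_NORTH;
    return PORT_LOCAL;  /* arrived: deliver to the attached tile */
}

/* Follow the route hop by hop and count the links traversed. */
static int xy_hops(int x, int y, int dst_x, int dst_y) {
    int hops = 0;
    for (;;) {
        port_t p = xy_route(x, y, dst_x, dst_y);
        if (p == PORT_LOCAL) return hops;
        if (p == PORT_EAST) x++;
        else if (p == PORT_WEST) x--;
        else if (p == PORT_SOUTH) y++;
        else y--;
        hops++;
        assert(hops < 1024);  /* sketch-only guard; meshes here are far smaller */
    }
}
```

Because every packet finishes its X movement before turning into Y, the cyclic channel dependencies that cause deadlock cannot form, and the path length always equals the Manhattan distance (minimal routing).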

18
Prototyping platform
  • Stratix EP1S25 DSP prototyping/development board
  • Altera FPGA Stratix EP1S25F780C5
  • Contains 25,660 LEs
  • Includes 1,944,576 bits of on-chip memory
  • 224 - M512 RAM blocks (32x18b)
  • 138 - M4K RAM blocks (128x36b)
  • 2 - M-RAM blocks
  • 6 PLLs
  • 597 maximum user I/O pins
  • Off-chip memory
  • 2 Mbytes of SSRAM configured as two independent
    banks
  • 32 Mbits of flash memory
  • Other I/O
  • LEDs, RS232, buttons, switches, 7-segment displays

19
4. HW-SW NoC-based MPSoC implementation
  • 4.1 - NoC-based MPSoC block diagram
  • 4.2 - Communication channel
  • 4.3 - Design of the Network Interface Controller
  • 4.4 - Router design
  • 4.5 - Software design
  • 4.6 - Applications and benchmarks

20
NoC-based MPSoC block diagram
  • Distributed-memory NoC-based MPSoC based on
    NIOS-II soft-core processor
  • Each NIOS-II Avalon based tile is generated
    effortlessly through Quartus II SOPC
  • Our custom HW design
  • Implementation of flow control in
    each communication channel
  • Design of Network Interface Controller
  • Design of the router

21
Communication channel
  • Implements full-duplex 4-phase handshake protocol
  • Between NIC-Router or between routers
  • 4-phase signalling is unambiguous
  • Two independent and synchronous FSMs have been
    designed
  • Packet definition
  • The definition of each subfield
  • XY address, message id, message length, sequence
    number, flags, priority
  • Size of each subfield
  • Fixes the router and NIC implementation
  • Our packet format for a 2D-Mesh
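
To make the 4-phase handshake concrete, here is a small software model of one direction of a channel. It is a sketch under assumptions: the state names, the `channel_t` wires and the step functions are hypothetical, and the real design is synchronous hardware with one such pair of FSMs per direction for full duplex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wires of one unidirectional channel (hypothetical model). */
typedef struct { bool req; bool ack; uint32_t data; } channel_t;

typedef enum { TX_IDLE, TX_WAIT_ACK, TX_WAIT_ACK_LOW } tx_state_t;
typedef enum { RX_IDLE, RX_WAIT_REQ_LOW } rx_state_t;

typedef struct { tx_state_t state; } tx_fsm_t;
typedef struct { rx_state_t state; uint32_t last; int received; } rx_fsm_t;

/* One step of the sender FSM. Returns true when `flit` was put on the wires. */
static bool tx_step(tx_fsm_t *tx, channel_t *ch, bool have_flit, uint32_t flit) {
    switch (tx->state) {
    case TX_IDLE:
        if (have_flit && !ch->ack) {        /* phase 1: raise req with valid data */
            ch->data = flit;
            ch->req = true;
            tx->state = TX_WAIT_ACK;
            return true;
        }
        break;
    case TX_WAIT_ACK:
        if (ch->ack) {                      /* phase 3: ack seen, drop req */
            ch->req = false;
            tx->state = TX_WAIT_ACK_LOW;
        }
        break;
    case TX_WAIT_ACK_LOW:
        if (!ch->ack) tx->state = TX_IDLE;  /* ack released, channel idle again */
        break;
    }
    return false;
}

/* One step of the receiver FSM. */
static void rx_step(rx_fsm_t *rx, channel_t *ch) {
    switch (rx->state) {
    case RX_IDLE:
        if (ch->req) {                      /* phase 2: latch data, raise ack */
            rx->last = ch->data;
            rx->received++;
            ch->ack = true;
            rx->state = RX_WAIT_REQ_LOW;
        }
        break;
    case RX_WAIT_REQ_LOW:
        if (!ch->req) {                     /* phase 4: req dropped, drop ack */
            ch->ack = false;
            rx->state = RX_IDLE;
        }
        break;
    }
}

/* Drive both FSMs until one flit has been transferred; returns steps used. */
static int handshake_transfer(tx_fsm_t *tx, rx_fsm_t *rx, channel_t *ch, uint32_t flit) {
    int steps = 0;
    bool accepted = false;
    do {
        if (tx_step(tx, ch, !accepted, flit)) accepted = true;
        rx_step(rx, ch);
        steps++;
    } while (!(accepted && tx->state == TX_IDLE));
    assert(!ch->req && !ch->ack);           /* both wires have returned to zero */
    return steps;
}
```

Both signals return to zero after every flit, so a level on `req` or `ack` can never be mistaken for a new transfer. That return-to-zero property is why the 4-phase protocol is unambiguous, at the cost of extra signalling per flit.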

22
Design of the Network Interface Controller
  • The NIC is the interface between the tiles and
    routers of our NoC
  • Decouples tile computation from the NoC
    communication infrastructure
  • Key to achieving a good packet injection rate
    over the NoC
  • Build flits/packets
  • Bus peripheral (slave)
  • Polling or IRQs
  • Register & memory mappings
  • N+1 bits of addressable bus space
  • Custom instruction (CI-based NIC)
  • Attached in the processor datapath
  • Is not master or slave

23
Router design
  • Circuit switching
  • Ephemeral circuit switching
  • Two latency cycles
  • One for XY routing
  • Another for PathSwitchMatrix

24
Router design
  • Packet switching
  • Store and forward
  • Full or shared/unified output queue
  • Now, the latency to traverse the router depends
    on
  • FIFO capacity (depth)
  • Output queue policies

25
Software design
  • HW-SW platform stack view of our
    distributed-memory NIOSII-based MPSOC with 2D
    Mesh interconnection strategy
  • Software components
  • NIC driver: low-level communication API
  • eMPI: high-level communication API for message
    passing
  • Optionally, an operating system (OS) might be
    included between the HdS (drivers) and the
    high-level communication APIs

26
Software design
  • The NIC software driver contains 3 basic
    functions
  • Interact transparently with a given NIC component
    exploiting all HW capabilities
  • volatile int *NIC = (int *)(NIC_BASE);
  • volatile int *NIC_TX = (int *)(NIC_BASE + 0x4);
  • Status register masks
  • 0x1 → dataPending
  • 0x2 → txBusy

/* NIC driver: read the NIC status register */
int nicStatus() { return *(NIC + 0x0); }
/* NIC driver: blocking receive, spins until dataPending is set */
int nicRecv() { while (!(nicStatus() & 0x1)); return *(NIC + 0x1); }
/* NIC driver: blocking send, spins while txBusy is set */
void nicSend(int data, int address) { while (nicStatus() & 0x2); *(NIC_TX + address) = data; }
27
Software design
  • The eMPI software API will be our high-level
    design language
  • Implements message passing over our on-chip
    network
  • Steps to create our eMPI
  • Select a minimal working subset of standard MPI
    functions
  • MPI_Init(), MPI_Finalize(), MPI_Comm_size(),
    MPI_Comm_rank()
  • MPI_Send(), MPI_Recv()
  • Porting from the de facto standard MPI to our
    on-chip network
  • Lightweight message passing interface with low
    memory overhead (15-20 KB)
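
As a rough illustration of how a message-passing layer can sit on top of the NIC driver, the sketch below builds a blocking send/receive pair from word-level `nicSend()`/`nicRecv()` calls. It is not the eMPI implementation: the `sketch_send`/`sketch_recv` names and the "length word followed by payload" framing are assumptions, and the NIC is replaced by a loopback FIFO so the example can run on a host PC.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the NIC hardware: a loopback FIFO (assumption for this sketch). */
#define FIFO_WORDS 64
static int fifo[FIFO_WORDS];
static size_t fifo_head, fifo_tail;

static void nicSend(int data, int address) {
    (void)address;                  /* loopback: the destination is ignored */
    assert(fifo_tail < FIFO_WORDS);
    fifo[fifo_tail++] = data;
}

static int nicRecv(void) {
    assert(fifo_head < fifo_tail);  /* real hardware would block here instead */
    return fifo[fifo_head++];
}

/* Hypothetical blocking send: one length word, then `count` payload words. */
static void sketch_send(const int *buf, int count, int dest) {
    nicSend(count, dest);
    for (int i = 0; i < count; i++) {
        nicSend(buf[i], dest);
    }
}

/* Hypothetical blocking receive. Returns the number of words received,
 * or -1 if the message did not fit (excess words are drained and dropped). */
static int sketch_recv(int *buf, int max_count) {
    int count = nicRecv();
    for (int i = 0; i < count; i++) {
        int word = nicRecv();
        if (i < max_count) {
            buf[i] = word;
        }
    }
    return (count <= max_count) ? count : -1;
}
```

A full MPI-style call would also carry a tag, a datatype and a communicator; keeping only the minimal subset listed on the slide is what keeps the memory footprint small.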

28
Applications and benchmarks
  • The software framework lets us run parallel
    applications over the hardware architecture
  • All applications and benchmarks were implemented
    using the NIC driver instead of the eMPI software API
  • COMMS1 & COMMS2
  • Ping-pong benchmarks

29
Applications and benchmarks
  • Parallelization of Mandelbrot set
  • Iterative loop using complex numbers
  • Complex numbers are C = a + bi (a, b are C/C++
    float or double)
  • Ideal for message-passing
    parallelization
  • Mandelbrot set eMPI function calls
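
The Mandelbrot computation parallelizes well because every point is independent. The sketch below shows the standard escape-time loop for C = a + bi and one simple way to split image rows among processors. The row-block partitioning, the iteration limit and all names are assumptions for illustration; the slides do not say how the work was actually divided.

```c
#include <assert.h>

#define MAX_ITER 256  /* assumed iteration limit */

/* Escape-time iteration z = z*z + c with c = a + bi, starting from z = 0.
 * Returns the number of iterations before |z| exceeds 2, capped at MAX_ITER. */
static int mandel_iters(double a, double b) {
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < MAX_ITER && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + a;  /* real part of z*z + c */
        zi = 2.0 * zr * zi + b;            /* imaginary part of z*z + c */
        zr = t;
        n++;
    }
    return n;
}

/* Split `rows` image rows among `size` processors as evenly as possible.
 * Processor `rank` computes rows [*first, *first + *count). */
static void rows_for_rank(int rows, int rank, int size, int *first, int *count) {
    assert(size > 0 && rank >= 0 && rank < size);
    int base = rows / size;
    int extra = rows % size;  /* the first `extra` ranks get one more row */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

In a message-passing version, each processor would compute its own block of rows and send the iteration counts back to one collecting processor, so the only communication is the result transfer.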

30
5. Experimental results
  • 5.1 - Hardware costs: area usage
  • 5.2 - Hardware costs: area and power usage
  • 5.3 - Software framework requirements
  • 5.4 - On-chip network throughput and bandwidth
  • 5.5 - Application results
  • 5.6 - Comparative results

31
HW costs: area usage
  • Router comparison between our Ephemeral Circuit
    Switching vs. our Packet Switching with
    unified/shared output queue
  • On a 2D-Mesh each router has between 3 and 5
    ports
  • Ephemeral Circuit Switching is between 2.5 and 3.8
    times smaller than our Packet Switching with
    unified/shared output queue

32
HW costs: area usage
  • Evolution of NxN 2D-Mesh NoC-based MPSoC
  • Ratio of HW resources
  • CS: 20% comm. / 80% comp.
  • PS: 45% comm. / 55% comp.
  • Ephemeral circuit switching is a low cost
    architecture
  • Area resources
  • On-chip memory requirements

[Figure: logic elements (LEs) vs. NxN 2D-Mesh size, one plot for Ephemeral Circuit Switching and one for Packet Switching (store and forward)]
33
HW costs: area and power usage
  • 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit
    Switching
  • Does not use any on-chip memory
  • Communication infrastructure (15%) is extremely
    small compared to the computational components
    (85%)
  • HW resources distribution
  • Running at 20 MHz we can achieve around 60 DMIPS
  • Overall system metrics
  • 49.65 mW/MHz
  • 3 DMIPS/MHz
  • Dynamic power usage
  • 993.31 mW
  • Static 548.39 mW
  • Dynamic 442.92 mW
  • The NoC accounts for only 0.5%

34
Software framework requirements
  • Each processor needs its own RAM memory
  • Distributed-memory architecture
  • At least 64KB of RAM per processor
  • To load the software framework
  • Application data and algorithm
  • On-chip FPGA memory resources
  • High throughput (few cycles to access)
  • Low capacity (KB)
  • External SSRAM available on the prototyping board
  • Low throughput (many cycles to access)
  • Large capacity (MB)
  • Trade-off between capacity and throughput

35
On-chip network throughput & bandwidth
  • 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit
    Switching
  • Maximum channel bandwidth is about 168.84 Mbps at
    63.24 MHz
  • Bandwidth decreases with the number of hops
    (end-to-end flow control)

Metrics in COMMS1: 2x2 Mesh with Ephemeral Circuit Switching running at 20 MHz (flit size 32 bits)
Metric                     Bus slave NIC   CI-based NIC
Injection rate
Channel bandwidth          4 Mbps          16.2 Mbps
Aggregate bandwidth        12 Mbps         48.6 Mbps
Bisection bandwidth        8 Mbps          32.4 Mbps
Maximum network capacity   48 Mbps         192.2 Mbps
36
Application results
  • Test of the parallelization of Mandelbrot set in
    several architectures
  • Sequential execution on Simple NIOS-II
    monoprocessor
  • Parallel execution on a Dual-core NIOS-II
    architecture
  • Parallel execution on a 2x2 Mesh NoC-based with
    Ephemeral circuit switching

Speedup 4x
Speedup 2x
37
Comparative results
On-chip multiprocessor architectures compared (columns: Symmetric Multiprocessor [Hung05] | HIBI-based MPSoC [Salminen05] | Our NoC-based MPSoC)
Architecture: ccNUMA (cache coherency), shared memory | Hierarchical bus/network scheme; both shared memory and message passing | Network-on-Chip; both shared memory and message passing
Interconnection capabilities: Avalon bus, Cache Coherency Module | Hierarchical bus; OCP compliant, DMA and FIFO capabilities; HIBI wrappers; circuit switching and wormhole routing | Mesh-based NoC; Avalon NIC wrappers following VSIA; low-cost Ephemeral circuit switching
Scalability: Limited scalability | Good scalability | Highly scalable
Processor core: NIOS-II soft-core processor | NIOS-II soft-core processor | NIOS-II soft-core processor
Hardware results (LEs / on-chip memory usage): 4 CPUs: 11,708 LEs / 185,920 b; 8 CPUs: 24,302 LEs / 371,840 b | 4 CPUs: 24,207 LEs / 314.3 KB; 8 CPUs: 36,402 LEs / 2.911 Kb | 4 CPUs (2x2 Mesh): 7,528 LEs / none; 9 CPUs (3x3 Mesh): 17,780 LEs / none; 64 KB used for each processor
Software results: Functional verification tests | Parallel MPEG-4 Simple Profile encoder was tested | Includes a software API called eMPI; parallel generation of the Mandelbrot set using eMPI/NIC driver
Other relevant results: SMP 1-, 2-, 4- and 8-way; standard running frequency between 60 and 80 MHz | Theoretical maximum bandwidth of the HIBI prototype is 328 MB/s @ 82 MHz; standard running frequency 78 MHz | Around 1 Gbps of aggregated bandwidth for custom HW @ 63.24 MHz, or 100 Mbps @ 20 MHz from NIOS-II with message passing API using CI-based NIC
38
6. Conclusions & future work
  • 6.1 Conclusions
  • 6.2 Future work

39
Conclusions
  • I have proposed a complete HW-SW framework for
    distributed-memory NoC-based MPSoC architecture
  • eMPI is a viable solution to on-chip parallelism
    using message passing
  • The methodology has been formalized as a HW-SW
    co-design flow
  • Complete system level design tool chain
  • Validity tested on a physical platform (FPGA)
  • Methodology is also valid for ASIC development
  • This research work lets us perform distributed
    parallel computing on a chip effortlessly
  • Useful parallel on-chip platform for many
    high-performance computing and low power
    emerging applications
  • Multimedia applications
  • Smart cams
  • Software-defined radio
  • Lack of verification and support tools to create
    complex MPSoCs

40
Future work
  • Long term
  • Extend this architecture to implement
    heterogeneous systems
  • Extend this architecture to a hybrid memory
    model (shared & distributed memory system)
  • Large memory bank as a tile
  • Cache coherence
  • Mechanism to access the shared medium
  • A complete SystemC simulation model would be
    useful
  • Evolution of Ephemeral Circuit Switching
    architecture
  • Build wormhole packet switching
  • Include a NIC queue in our Ephemeral circuit
    switching architecture
  • Change the fixed PriorityEncoder within
    PathSwitchMatrix
  • Test our architecture with bus-slave NIC with
    IRQs

41
Future work
  • Evolution of software framework
  • Improve the NIC software driver functions
  • Extend the eMPI SW API with other useful message
    passing collective communication functions
  • broadcast, scatter, gather, scan, reduce,
    allreduce, alltoall, reducescatter, barrier
    synchronization, ...
  • Application-level
  • Use real applications
  • Coarse grain or fine grain parallelism
  • Run GALS scheme with multiple clock domains

42
The end. Thank you!
43
HW-SW design using UML and SystemC
  • Useful to verify complex MPSoC systems
  • SystemC lets us model all HW-SW components
  • UML lets us automatically generate the SystemC
    code
  • Simple example: MeshXYRouting
  • Comparator