Silicon Operating System for Large Scale Heterogeneous Cores and its FPGA Implementation

1
Silicon Operating System for Large Scale
Heterogeneous Cores and its FPGA Implementation
  • Huang, Xiang
  • Department of Electrical Engineering
  • National Cheng Kung University
  • Tainan, Taiwan, R.O.C.
  • (06)2757575 ext. 62400 / 2825, Office 6F, 95602
  • Email: hxhxxhxxx@gmail.com
  • Website: http://j92a21b.ee.ncku.edu.tw/broad/index.html

2
Abstract (1/3)
  • Grand challenge applications demand high
    performance supercomputing clusters to satisfy
    their requirements.
  • Competent node architecture in supercomputing
    clusters is critical to meet the requirements of
    the varied, computationally demanding
    applications.
  • This accentuates the need for heterogeneous
    multicore node architectures in supercomputing
    clusters, thus paving the way for the novel
    concept of Simultaneous Multiple Application
    (SMAPP) execution with non space-time sharing.

3
Abstract (2/3)
  • The OS is the other side of the coin in attaining
    exa-flop performance in supercomputing clusters.
  • Conventional OSs being software driven, their
    performance becomes a bottleneck owing to the
    complexities of parallel mapping and scheduling
    of different applications across the underlying
    nodes.
  • In this context it is desirable for the kernel of
    the OS to be made completely hardware based.
  • Further, simultaneous multiple application
    execution with non space-time sharing calls for a
    parallel, hierarchy-based multi-host system.
  • Hence the hardware design of an OS for
    supercomputing clusters meeting these demands,
    known as the Silicon Operating System (SILICOS),
    was evolved at the Waran Research Foundation
    (WARFT).

4
Abstract (3/3)
  • This thesis analyses the architecture and design
    of SILICOS in greater depth.
  • The SILICOS architecture is integrated with the
    Warft India Many Core (WIMAC) simulator, a
    clock-driven, cycle-accurate simulator.

5
1. Origin and History (1/10)
  • The execution of Simultaneous Multiple
    Application (SMAPP) with non space-time sharing
    will be a major step towards attaining exa-flop
    computing.
  • Some positives of SMAPP are
  • Enhanced resource utilization, due to a large
    scale increase in the execution of independent
    instructions as the number of applications and
    their problem sizes increase.
  • Cost effectiveness across multiple applications
    run in a single cluster.
  • Elimination of conventional space sharing and
    time sharing, leading to increased performance.
  • SMAPP is supported by virtue of heterogeneous
    multi-core node architectures based on the
    CUBEMACH (CUstom Built hEterogeneous Multi-Core
    ArCHitectures) design paradigm [2].
  • The CUBEMACH design paradigm achieves low power
    yet high performance.

6
1. Origin and History (2/10)
  • Figure 1: Concept of SMAPP

7
1. CUBEMACH (3/10)
  • The CUBEMACH design paradigm is aimed at the
    creation of high performance, low power and cost
    effective heterogeneous multicore architectures
    capable of executing a wide range of applications
    without space or time sharing [1].
  • The use of Hardwired Algorithm Level Functional
    Units (ALFUs) [3] and the corresponding backbone
    instruction set architecture, called the
    Algorithm Level Instruction Set Architecture
    (ALISA), brings increased performance through a
    much reduced number of generated instructions and
    hence memory fetches.

8
1. Algorithm Level Functional Unit (4/10)
  • Why ALFU and why not ALU?
  • ALFUs handle higher order computations by
    processing blocks of data in a single operation,
    compared to using a set of ALUs to execute the
    same computations.
  • 1 ALFU instruction = several ALU instructions.
  • ALFU based cores are proven to offer better
    performance at reduced power compared to ALU
    based cores [3].
  • ALISA is a superset of other instruction set
    styles such as vector instructions, CISC and
    VLIW, which are used in various multi-core/many
    core processors.
  • A single ALISA instruction encompasses the data
    dependencies associated with several equivalent
    ALU instructions and helps minimize the number of
    cache misses.
  • Parallel issue of ALISA instructions poses a
    major challenge to compilers and cannot be
    handled by a purely software based compiler;
    hence we have resorted to a hardware based
    compiler.
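The ALFU-versus-ALU contrast above can be sketched in software. The sketch below is purely illustrative (the function names are hypothetical, not actual ALISA operations): the "ALU view" issues one scalar instruction per element, while the "ALFU view" consumes the whole data block as a single algorithm-level operation.

```python
# Illustrative contrast between ALU-style scalar instructions and a
# single block-level ALFU-style operation (hypothetical names; this is
# a software analogy, not actual ALISA semantics).

def alu_dot(a, b):
    # ALU view: one multiply and one add issued per element, so
    # roughly 2*n scalar instructions plus loop overhead and fetches.
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc

def alfu_dot(a, b):
    # ALFU view: the whole block of data is consumed by one
    # algorithm-level operation, issued as a single instruction.
    return sum(x * y for x, y in zip(a, b))

a, b = [1, 2, 3, 4], [5, 6, 7, 8]
assert alu_dot(a, b) == alfu_dot(a, b) == 70
```

The result is identical; the difference is how many instructions must be generated, fetched and issued to obtain it.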

9
1. Customizable Compiler On Silicon (5/10)
  • The Compiler-On-Silicon [4] is an easily
    customizable hardware based compiler that can be
    adapted to different CUBEMACH architectures for
    different classes of applications.
  • Compiler-On-Silicon is made up of a two stage
    hierarchy:
  • The Primary Compiler On Silicon.
  • The Secondary Compiler On Silicon.
  • The hardware based dependency analyzer in COS is
    the key to increasing the rate of instruction
    generation.

10
1. Customizable Compiler On Silicon (6/10)
  • Figure 3: Hierarchical Architecture of the
    Compiler on Silicon

11
1. On Core Network Architecture (7/10)
  • The CUBEMACH architecture uses a novel cost
    effective on-chip network called the On Core
    Network (OCN).
  • The hierarchy of the OCN comprises a Sub-Local
    Router for each group of ALFUs (a population) and
    a Local Router for communication across
    populations.
  • Populations of ALFUs form a core, and Global
    Routers establish communication across cores.

12
1. On Core Network Architecture (8/10)
  • Figure 4: Hierarchical OCN Architecture of a
    single core

13
1. Silicon Operating System (9/10)
  • The OS of current day supercomputers is managed
    by a stripped kernel present in the nodes.
  • Core OS functionalities such as process
    scheduling, memory management, I/O handling and
    exception handling are monitored by this stripped
    kernel.
  • This level of operation suffices for conventional
    parallel execution of applications.
  • But in the case of SMAPP with non space-time
    sharing, the communication complexity involved is
    huge and hence needs to be managed by efficient
    mapping strategies.
  • The hardware design of an OS for supercomputing
    clusters meeting these demands, known as the
    Silicon Operating System (SILICOS), was evolved
    at WARFT [1].

14
2. Overview of Linux Kernels (1/8)
  • The Linux kernel abstracts and mediates access to
    all hardware resources, including the CPU.
  • One important aspect of the Linux kernel is its
    support for multitasking.
  • Each process can act individually in the system
    with exclusive memory access and other hardware
    usage.
  • The kernel provides this facility by running the
    processes concurrently, giving each process an
    equal share of the hardware resources and
    maintaining inter-process security.

15
2. Overview of Linux Kernels (2/8)
  • The Linux kernel, as defined by Iwan T. Bowman
    [8], is composed of five main subsystems:
  • the Process Scheduler (SCHED)
  • the Memory Manager (MM)
  • the Virtual File System (VFS)
  • the Network Interface (NET)
  • the Inter-Process Communication (IPC) subsystem

16
2. Loop Unroller and Dependency Analyser (3/8)
  • Dependencies across application libraries play a
    major role in the allocation of libraries to the
    underlying nodes.
  • The information on dependent libraries needs to
    be passed on to the process scheduler for
    efficient scheduling, thus extracting maximum
    work from the underlying nodes.
  • In addition, complex applications may contain
    loops across the dependent libraries; these loops
    need to be unrolled in order to effectively
    identify the execution time of each iteration and
    to schedule those libraries.
  • In this regard, a dependency analyzer is needed
    to extract the dependencies and execution times
    of the dependent libraries and also to perform
    loop unrolling.

17
2. Loop Unroller and Dependency Analyser (4/8)
  • The graph traversal unit traverses the dependency
    graph and extracts the dependent libraries.
  • The information from this unit is updated in the
    library detail table.
  • The loop unroller unit forms an integral part of
    the dependency analyzer.
  • It unrolls loops by replicating the libraries
    using the loop index value.
  • This unit thus greatly assists the dependency
    analyzer in time stamp generation.
  • After extracting the dependencies across the
    libraries, the information is used to generate
    the time stamps of the child libraries.
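The unroll-and-timestamp idea above can be sketched as follows. This is a minimal software model under stated assumptions: the library names, the dependency-graph encoding, and the timestamp rule (a child may start only after its latest-finishing parent) are illustrative, not the actual SILICOS data structures.

```python
# Minimal sketch of loop unrolling and time stamp generation for
# dependent libraries (illustrative encoding, not the SILICOS design).

def unroll(library, trip_count):
    # Replicate the looped library once per iteration, tagging each
    # copy with the loop index value.
    return [f"{library}[{i}]" for i in range(trip_count)]

def timestamps(deps, exec_time):
    # deps maps a library to its parent libraries; assume a child may
    # start only after the latest parent has finished executing.
    ts = {}
    def start(lib):
        if lib not in ts:
            ts[lib] = max((start(p) + exec_time[p]
                           for p in deps.get(lib, [])), default=0)
        return ts[lib]
    for lib in exec_time:
        start(lib)
    return ts

body = unroll("fft", 3)             # ['fft[0]', 'fft[1]', 'fft[2]']
deps = {"fft[1]": ["fft[0]"], "fft[2]": ["fft[1]"]}
exec_time = {lib: 10 for lib in body}
print(timestamps(deps, exec_time))  # {'fft[0]': 0, 'fft[1]': 10, 'fft[2]': 20}
```

The resulting time stamps are what the scheduler would consume to place each replicated library on the underlying nodes.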

18
2. ISA of the Dependency Analyzer (5/8)
Figure 12: Overall Architecture of the Dependency
Analyzer
19
2. Design of Hardware Based Programmable
Scheduler for SMAPP (6/8)
  • The existing scheduler is not programmable and
    hence cannot accommodate new scheduling
    heuristics.
  • The scheduler therefore needs to be made adaptive
    so that users themselves can choose the
    scheduling heuristics.
  • The optimization techniques we have adopted for
    our scheduler are
  • Game Theory-Simulated Annealing based Scheduling
  • Ant Colony Optimization based Scheduling

20
2. Design of Hardware Based Programmable
Scheduler for SMAPP (7/8)
  • Game Theory-Simulated Annealing based Scheduling
  • The communication and computation complexity of
    the nodes are taken as the cost function in the
    GT-SA based approach.
  • By varying the parameters of the cluster system,
    an optimal cost function of the nodes in the
    secondary host is achieved.
  • This scheduler unit schedules libraries to the
    underlying nodes by maintaining the computation
    and communication complexity (cost functions) of
    the node plane in its optimized state.
  • The GT-SA based scheduler unit compares the
    current state of the cost function with a next
    state obtained by varying system parameters such
    as available load and buffer queue lengths.
  • The unit also accepts poor next states, based on
    a probability equation, in order not to get stuck
    in a local minimum.
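The acceptance step described above can be sketched with the standard simulated annealing (Metropolis) criterion: a worse next state is accepted with probability exp(-delta/T), which is one common form of the "probability equation" mentioned; whether GT-SA uses exactly this form is an assumption, and the cost values and temperature below are illustrative.

```python
import math
import random

# Sketch of the annealing acceptance step: better states are always
# accepted; worse states are accepted with probability exp(-delta/T),
# letting the scheduler escape local minima (assumed Metropolis form).

def accept(current_cost, next_cost, temperature, rng=random.random):
    delta = next_cost - current_cost
    if delta <= 0:
        return True                       # improvement: always accept
    return rng() < math.exp(-delta / temperature)

# An improving move is always taken.
assert accept(10.0, 8.0, temperature=1.0)
# At a nearly frozen temperature, a worse state is (almost) never taken.
assert not accept(10.0, 20.0, temperature=1e-9, rng=lambda: 0.5)
```

As the temperature cools over scheduling iterations, acceptance of poor states becomes rare and the cost function settles toward its optimized state.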

21
2. Design of Hardware Based Programmable
Scheduler for SMAPP (8/8)
  • Ant Colony Optimization based Scheduling
  • The behavior of ants finding the shortest path to
    food via pheromone deposition has been adopted in
    this scheduling algorithm [21].
  • Here, application libraries to be mapped onto a
    distant node need to traverse the shortest path
    across the nodes to reach the destination node.
  • The host can make use of this scheduling unit to
    choose the path to reach a destination node for
    broadcasting.
  • Information about the distances between nodes and
    the shortest paths is constantly updated by this
    unit and hence can be used to reduce the
    communication complexity across the nodes.
  • Thus, based on the network topology and the
    traffic in the network, optimized paths to reach
    a particular node are identified.
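The core ACO mechanics above can be sketched in two small steps: a pheromone-biased next-hop choice, and an update that evaporates pheromone everywhere while depositing more on shorter tours. The node names, evaporation rate rho, and deposit constant q are illustrative assumptions, not values from the scheduler.

```python
import random

# Sketch of ant-colony path selection and pheromone update
# (illustrative constants; not the actual scheduler's parameters).

def choose_next(pheromone, neighbors, rng=random.random):
    # Roulette-wheel selection: a neighbor is picked with probability
    # proportional to the pheromone on its edge.
    weights = [pheromone[n] for n in neighbors]
    r, acc = rng() * sum(weights), 0.0
    for n, w in zip(neighbors, weights):
        acc += w
        if r <= acc:
            return n
    return neighbors[-1]

def update(pheromone, tours, rho=0.5, q=1.0):
    # tours: list of (path_edges, path_length) pairs. Evaporate on
    # every edge, then deposit q/length on each tour's edges, so
    # shorter routes accumulate more pheromone over iterations.
    for e in pheromone:
        pheromone[e] *= (1.0 - rho)
    for edges, length in tours:
        for e in edges:
            pheromone[e] += q / length

ph = {"AB": 1.0, "AC": 1.0}
update(ph, [(["AB"], 2.0), (["AC"], 4.0)])
assert ph["AB"] > ph["AC"]   # the shorter route via B gains more pheromone
```

Repeated over the network traffic seen by the unit, this bias is what keeps the shortest-path information current and steers libraries along low-cost routes to their destination nodes.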

22
3. Xilinx Virtex FPGA Family
  • The Xilinx Virtex family of FPGAs is being used
    to prototype the SILICOS architecture.
  • The Xilinx Virtex-7 FPGA kit, the latest in the
    Virtex family, consists of 2,000,000 logic cells
    and provides 68 Mb of RAM.
  • It contains 3,600 DSP slices, providing higher
    bandwidth and aiding the programming of parallel
    processing logic into the FPGA.