Title: Silicon Operating System for Large Scale Heterogeneous Cores and its FPGA Implementation
- Huang, Xiang
- Department of Electrical Engineering
- National Cheng Kung University
- Tainan, Taiwan, R.O.C.
- Phone: (06)2757575 ext. 62400 / 2825, Office 6F, 95602
- Email: hxhxxhxxx_at_gmail.com
- Website: http://j92a21b.ee.ncku.edu.tw/broad/index.html
Abstract (1/3)
- Grand-challenge applications hunger for high-performance supercomputing clusters to satisfy their requirements.
- Competent node architecture in supercomputing clusters is critical to meet the requirements of the varied, computationally demanding applications.
- This accentuates the need for heterogeneous multicore node architectures in supercomputing clusters, paving the way for the novel concept of Simultaneous Multiple Application (SMAPP) execution with non space-time sharing.
Abstract (2/3)
- The OS is the other side of the coin in attaining exa-flop performance in supercomputing clusters.
- Conventional OSs are software driven, so their performance becomes a bottleneck: it involves the complexities associated with parallel mapping and scheduling of different applications across the underlying nodes.
- In this context it is desirable to make the kernel of the OS completely hardware based.
- Further, simultaneous multiple application execution with non space-time sharing calls for a parallel, hierarchy-based multi-host system.
- Hence a hardware OS design for supercomputing clusters that meets these demands, known as the Silicon Operating System (SILICOS), was evolved at the Waran Research Foundation (WARFT).
Abstract (3/3)
- This thesis analyses the architecture and design of SILICOS in greater depth.
- The SILICOS architecture is integrated with the Warft India Many Core (WIMAC) simulator, a clock-driven, cycle-accurate simulator.
1. Origin and History (1/10)
- The execution of Simultaneous Multiple Applications (SMAPP) with non space-time sharing will be a major step toward attaining exa-flop computing.
- Some advantages of SMAPP are:
- Enhanced resource utilization, due to a large-scale increase in independent instructions available for execution as the number of applications and their problem sizes grow.
- Cost effectiveness across multiple applications run in a single cluster.
- Elimination of conventional space sharing and time sharing, leading to increased performance.
- SMAPP is supported by heterogeneous multicore node architectures based on the CUBEMACH (CUstom Built hEterogeneous Multi-Core ArCHitectures) design paradigm [2].
- The CUBEMACH design paradigm achieves low power yet high performance.
1. Origin and History (2/10)
- Figure 1: Concept of SMAPP
1. CUBEMACH (3/10)
- The CUBEMACH design paradigm aims at creating high-performance, low-power and cost-effective heterogeneous multicore architectures capable of executing a wide range of applications without space or time sharing [1].
- The use of hardwired Algorithm Level Functional Units (ALFUs) [3] and the corresponding backbone instruction set architecture, called the Algorithm Level Instruction Set Architecture (ALISA), brings increased performance because far fewer instructions are generated and hence far fewer memory fetches occur.
1. Algorithm Level Functional Unit (4/10)
- Why an ALFU and not an ALU?
- ALFUs handle higher-order computations by processing blocks of data in a single operation, compared to using a set of ALUs to execute the same computations.
- 1 ALFU instruction = several ALU instructions.
- ALFU-based cores have been shown to offer better performance at reduced power compared to ALU-based cores [3].
- ALISA is a superset of other instruction sets, such as vector instructions, CISC and VLIW, used in various multicore/many-core processors.
- A single ALISA instruction encompasses the data dependencies associated with several equivalent ALU instructions and helps minimize the number of cache misses.
- Parallel issue of ALISA instructions poses a major challenge to compilers and cannot be handled by a purely software-based compiler; hence a hardware-based compiler is adopted.
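The claim that one ALFU instruction stands in for several ALU instructions can be illustrated with a back-of-the-envelope instruction count. The counting model below (a dot product, with an assumed load/store breakdown) is an illustration only, not the actual ALISA encoding:

```python
# Hypothetical instruction-count comparison; the operation mix is an
# assumption for illustration, not the real ALFU/ALU microarchitecture.

def alu_instruction_count_dot(n):
    """ALU view of a length-n dot product: n multiplies, n-1 additions,
    2n operand loads and 1 result store, each a separate instruction."""
    return n + (n - 1) + 2 * n + 1

def alfu_instruction_count_dot(n):
    """ALFU view: one block-level DOT instruction carries the operand
    block descriptors, so a single instruction is fetched and issued."""
    return 1

n = 64
print(alu_instruction_count_dot(n))   # 256 ALU-level instructions
print(alfu_instruction_count_dot(n))  # 1 ALFU-level instruction
```

Under this model the instruction-fetch traffic shrinks by roughly the block size, which is the mechanism behind the reduced memory fetches claimed above.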
1. Customizable Compiler On Silicon (5/10)
- The Compiler-On-Silicon [4] is an easily customizable hardware-based compiler that can be tailored to different CUBEMACH architectures for different classes of applications.
- The Compiler-On-Silicon is organized as a two-stage hierarchy:
- the Primary Compiler On Silicon
- the Secondary Compiler On Silicon
- The hardware-based dependency analyzer in the COS is the key to increasing the rate of instruction generation.
1. Customizable Compiler On Silicon (6/10)
- Figure 3: Hierarchical Architecture of the Compiler on Silicon
1. On Core Network Architecture (7/10)
- The CUBEMACH architecture uses a novel, cost-effective on-chip network called the On Core Network (OCN).
- The OCN hierarchy comprises a Sub-Local Router for each group of ALFUs (a population) and a Local Router for across-population communication.
- Populations of ALFUs form a core; Global Routers establish communication across cores.
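The three-level hierarchy above can be sketched as a routing decision over a (core, population, ALFU) address triple. The triple-based addressing and the name `router_level` are assumptions for illustration; the slides only fix the Sub-Local/Local/Global roles:

```python
# Sketch of the OCN's three-level routing decision, assuming each ALFU
# is addressed by a (core, population, alfu) triple. Router roles follow
# the text; the addressing scheme is an illustrative assumption.

def router_level(src, dst):
    """Return the router level that carries a message from src to dst."""
    s_core, s_pop, _ = src
    d_core, d_pop, _ = dst
    if s_core != d_core:
        return "global"      # across cores
    if s_pop != d_pop:
        return "local"       # across populations within one core
    return "sub-local"       # within one population of ALFUs

print(router_level((0, 0, 1), (0, 0, 3)))  # sub-local
print(router_level((0, 0, 1), (0, 2, 0)))  # local
print(router_level((0, 0, 1), (1, 0, 0)))  # global
```

Keeping most traffic at the sub-local level is what makes the hierarchical OCN cost-effective compared to a flat network.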
1. On Core Network Architecture (8/10)
- Figure 4: Hierarchical OCN Architecture of a single core
1. Silicon Operating System (9/10)
- The OS of current-day supercomputers is managed by a stripped-down kernel present in the nodes.
- Core OS functionalities such as process scheduling, memory management, I/O handling and exception handling are monitored by this stripped-down kernel.
- This level of operation at the cluster suffices for conventional parallel execution of applications.
- But in the case of SMAPP with non space-time sharing, the communication complexity involved is huge and hence needs to be managed by efficient mapping strategies.
- The hardware design of an OS for supercomputing clusters that meets these demands, known as the Silicon Operating System (SILICOS), was evolved at WARFT [1].
2. Overview of Linux Kernels (1/8)
- The Linux kernel abstracts and mediates access to all hardware resources, including the CPU.
- One important aspect of the Linux kernel is its support for multitasking.
- Each process can act individually in the system, with exclusive memory access and hardware usage.
- The kernel provides this facility by running processes concurrently, giving each process a fair share of the hardware resources, and maintaining inter-process security.
2. Overview of Linux Kernels (2/8)
- The Linux kernel, as described by Ivan T. Bowman [8], is composed of five main subsystems:
- the Process Scheduler (SCHED)
- the Memory Manager (MM)
- the Virtual File System (VFS)
- the Network Interface (NET)
- the Inter-Process Communication (IPC) subsystem
2. Loop Unroller and Dependency Analyser (3/8)
- Dependencies across application libraries play a major role in allocating libraries to the underlying nodes.
- Information on dependent libraries must be passed to the process scheduler for efficient scheduling, thereby extracting maximum work from the underlying nodes.
- In addition, complex applications may contain loops across dependent libraries; these loops need to be unrolled to effectively identify the execution time of each iteration and to schedule those libraries.
- Hence a dependency analyzer is needed to extract the dependency and execution time of each dependent library and to perform loop unrolling.
2. Loop Unroller and Dependency Analyser (4/8)
- The graph-traversal unit traverses the dependency graph and extracts the dependent libraries.
- This information is recorded in the library-detail table.
- The loop-unroller unit forms an integral part of the dependency analyzer.
- It unrolls loops by replicating the libraries according to the loop index value.
- After the dependencies across libraries are extracted, this information is used to generate the time stamps of child libraries; the loop-unroller unit thus greatly assists in time-stamp generation.
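A minimal sketch of the unroller's role, under assumed placeholder data structures: each library copy is tagged with its iteration index, and time stamps are generated by chaining execution times. The serial-chain timing model and all names (`unroll`, `time_stamps`, the library names) are illustrative assumptions, not the unit's actual design:

```python
# Illustrative loop unrolling and time-stamp generation; data structures
# and the serial-chain timing model are assumptions for this sketch.

def unroll(loop_body, trip_count):
    """Replicate each library in the loop body once per iteration,
    tagging each copy with its iteration index."""
    return [(lib, i) for i in range(trip_count) for lib in loop_body]

def time_stamps(unrolled, exec_time):
    """Assign each copy a start time, assuming copies execute as a
    serial chain: a copy starts when its predecessor finishes."""
    stamps, t = {}, 0
    for copy in unrolled:
        stamps[copy] = t
        t += exec_time[copy[0]]
    return stamps

body = ["libA", "libB"]
ts = time_stamps(unroll(body, 3), {"libA": 5, "libB": 2})
print(ts[("libA", 0)], ts[("libB", 0)], ts[("libA", 1)])  # 0 5 7
```

With per-copy time stamps in hand, the scheduler can place each iteration's libraries on nodes individually instead of treating the loop as one opaque block.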
2. ISA of the Dependency Analyzer (5/8)
- Figure 12: Overall Architecture of the Dependency Analyzer
2. Design of a Hardware-Based Programmable Scheduler for SMAPP (6/8)
- The existing scheduler is not programmable and hence cannot accommodate new scheduling heuristics.
- The scheduler therefore needs to be made adaptive, so that the user can choose the scheduling heuristics.
- The optimization techniques adopted for the scheduler are:
- Game Theory-Simulated Annealing (GT-SA) based scheduling
- Ant Colony Optimization (ACO) based scheduling
2. Design of a Hardware-Based Programmable Scheduler for SMAPP (7/8)
- Game Theory-Simulated Annealing based scheduling
- The communication and computation complexity of the nodes is taken as the cost function in the GT-SA approach.
- By varying the parameters of the cluster system, an optimal cost function for the nodes in the secondary host is achieved.
- This scheduler unit schedules libraries to the underlying nodes while keeping the computation and communication complexity (the cost function) of the node plane in its optimized state.
- The GT-SA scheduler unit compares the current state of the cost function with a next state obtained by varying system parameters such as the available load and the queue lengths of buffers.
- The unit also accepts poorer next states with a certain probability, in order not to get stuck in a local minimum.
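The accept-poorer-states rule described above is the standard Metropolis criterion from simulated annealing. A minimal sketch follows; the cost values, temperature and probability equation are placeholders, since the unit's actual parameters are not given in the slides:

```python
# Metropolis acceptance rule used in simulated annealing; costs and
# temperature here are illustrative placeholders.
import math
import random

def accept(current_cost, next_cost, temperature, rng=random.random):
    """Always accept an improving state; accept a poorer state with
    probability exp(-delta/T), which lets the search escape local minima."""
    delta = next_cost - current_cost
    if delta <= 0:
        return True
    return rng() < math.exp(-delta / temperature)

print(accept(10.0, 8.0, temperature=1.0))   # improvement: always True
print(accept(10.0, 12.0, temperature=5.0))  # poorer state: probabilistic
```

At high temperature poor states are accepted often (broad exploration); as the temperature is lowered the rule converges toward greedy descent on the cost function.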
2. Design of a Hardware-Based Programmable Scheduler for SMAPP (8/8)
- Ant Colony Optimization based scheduling
- The behavior of ants finding the shortest path to food by depositing pheromone has been adopted in this scheduling algorithm [21].
- Here, application libraries to be mapped onto a distant node must traverse the shortest path across the nodes to reach the destination node.
- The host can make use of this scheduling unit to choose the path along which to broadcast libraries to a destination node.
- Information about inter-node distances and shortest paths is constantly updated by this unit, and can therefore be used to reduce the communication complexity across the nodes.
- Thus, based on the network topology and the traffic in the network, an optimized path to reach a particular node is identified.
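The pheromone mechanism can be sketched as probabilistic next-hop selection plus evaporate-and-deposit updates. The node names, pheromone values and rates below are made-up illustrative numbers, not those of the actual scheduling unit:

```python
# Core ACO primitives: pheromone-weighted next-hop choice and the
# evaporation/deposit update; all values are illustrative assumptions.
import random

def choose_next_hop(pheromone, neighbors, rng=random.random):
    """Pick a neighbor with probability proportional to its pheromone
    level (roulette-wheel selection)."""
    total = sum(pheromone[n] for n in neighbors)
    r, acc = rng() * total, 0.0
    for n in neighbors:
        acc += pheromone[n]
        if r <= acc:
            return n
    return neighbors[-1]

def reinforce(pheromone, path, deposit=1.0, evaporation=0.1):
    """Evaporate pheromone everywhere, then deposit along the path an
    ant (here, a routed library) actually took."""
    for n in pheromone:
        pheromone[n] *= (1.0 - evaporation)
    for n in path:
        pheromone[n] += deposit

ph = {"node1": 1.0, "node2": 3.0}
print(choose_next_hop(ph, ["node1", "node2"]))  # biased toward node2
```

Repeated reinforcement concentrates pheromone on short, low-traffic routes, which is how the unit's distance and shortest-path tables stay adapted to the current network state.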
3. Xilinx Virtex FPGA Family
- Xilinx Virtex-family FPGAs are used for prototyping the SILICOS architecture.
- The Xilinx Virtex-7 FPGA, the latest in the Virtex family, provides up to 2,000,000 logic cells and 68 Mb of RAM.
- It contains 3,600 DSP slices, providing higher bandwidth and supporting the programming of parallel processing logic into the FPGA.