Membrane Computing in the Connex Environment - PowerPoint PPT Presentation

1
Membrane Computing in the Connex Environment
Gheorghe Stefan, BrightScale Inc., Sunnyvale, CA
Politehnica University of Bucharest
gstefan_at_brightscale.com
2
Outline
  • Integral Parallel Architecture
  • The Connex Chip
  • The Connex Architecture
  • How to Use the Connex Environment
  • Concluding Remarks

3
Integral Parallel Architecture
  • The Ubiquitousness of Parallelism Asks for
    Integral Parallel Architectures
  • Partial Recursive Functions and Parallel
    Computation
  • A Functional Taxonomy of Parallel Computation

4
Parallelism cannot be avoided anymore
  • Intel's approach
  • Multi-processors
  • The best approach for multi-threading on MIMD
    architecture
  • Inefficient on SIMD architecture
  • Ignores the MISD architecture
  • Many-processors, asking for another taxonomy
  • They work as accelerators
  • They perform critical functions
  • Berkeley's 13 dwarfs is a functional approach for
    many-processors
  • Real applications ask for all kinds of parallelism
    to solve corner cases, the places where the
    devil hides

5
Partial Recursive Functions and Parallel
Computation
  • Composition Rule: the Basic Parallel Structures
  • Primitive Recursive Rule
  • Minimalization Rule

6
Composition: the Associated Structure
  • f(x0, ..., xn-1) = g(h0(x0, ..., xn-1), h1(x0, ..., xn-1),
    ..., hm-1(x0, ..., xn-1))



[Figure: the inputs x0, ..., xn-1 feed the parallel blocks h0, h1, ..., hm-1; their outputs feed g, which produces f(x0, ..., xn-1) = g(h0, h1, ..., hm-1)]
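A minimal C sketch (illustrative names, not Connex/VectorC code) of the composition rule: the functions h0, h1, h2 have no mutual data dependencies, so they can be evaluated in parallel before g combines their results.

    #include <stdio.h>

    /* Illustrative h-functions: three independent computations on the same inputs. */
    static int h0(const int *x, int n) { int s = 0; for (int i = 0; i < n; i++) s += x[i]; return s; }
    static int h1(const int *x, int n) { int m = x[0]; for (int i = 1; i < n; i++) if (x[i] > m) m = x[i]; return m; }
    static int h2(const int *x, int n) { int p = 1; for (int i = 0; i < n; i++) p *= x[i]; return p; }

    /* g combines the intermediate results. */
    static int g(int a, int b, int c) { return a + b + c; }

    /* f(x0..xn-1) = g(h0(x), h1(x), h2(x)); the three h-calls are independent
       and could be evaluated on parallel resources. */
    static int f(const int *x, int n) { return g(h0(x, n), h1(x, n), h2(x, n)); }

    int main(void) {
        int x[] = {1, 2, 3, 4};
        printf("f = %d\n", f(x, 4));   /* g(10, 4, 24) = 38 */
        return 0;
    }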
7
Data Parallel Composition
  • X = {x0, ..., xn-1} → {h(x0), h(x1), ..., h(xn-1)}

[Figure: each input xi goes to its own copy of h, producing h(x0), h(x1), ..., h(xn-1) in parallel]
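A minimal C sketch of data-parallel composition, written as a sequential loop: the same h is applied independently to every element of X, which is what the cell array performs in one step (h and the array size are illustrative).

    #include <stdio.h>

    #define N 8

    /* The single function h applied to every element. */
    static int h(int x) { return x * x; }

    int main(void) {
        int x[N] = {0, 1, 2, 3, 4, 5, 6, 7};
        int y[N];

        /* Data-parallel composition: y[i] = h(x[i]) for all i.
           On a SIMD array each iteration maps to one processing cell. */
        for (int i = 0; i < N; i++)
            y[i] = h(x[i]);

        for (int i = 0; i < N; i++)
            printf("%d ", y[i]);
        printf("\n");
        return 0;
    }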
8
Speculative Composition
  • function vector H = {h0, h1, ..., hn-1}, scalar x
  • H(x) = {h0(x), h1(x), ..., hn-1(x)}


[Figure: the scalar x is broadcast to the blocks h0, h1, ..., hn-1, which produce h0(x), h1(x), ..., hn-1(x) in parallel]
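A hedged C sketch of speculative composition: the same scalar x is fed to a vector of distinct functions, all evaluated at once; unneeded results are later discarded (function names are illustrative).

    #include <stdio.h>

    typedef int (*fn)(int);

    static int h0(int x) { return x + 1; }
    static int h1(int x) { return 2 * x; }
    static int h2(int x) { return x * x; }

    int main(void) {
        fn H[] = {h0, h1, h2};      /* the function vector H = {h0, h1, h2} */
        const int n = 3;
        int x = 5, r[3];

        /* Speculative (MISD-like) composition: every hi(x) is computed,
           even though a later decision may keep only one of them. */
        for (int i = 0; i < n; i++)
            r[i] = H[i](x);

        for (int i = 0; i < n; i++)
            printf("h%d(x) = %d\n", i, r[i]);
        return 0;
    }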
9
Serial Composition
  • f(x) = g(h(x))


[Figure: x flows through h and then through g, producing f(x) = g(h(x))]
Time parallelism, the general case:
f(x) = g1(g2(g3( ... gp(x) ... )))
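A short C sketch of serial composition: f is the chain g1(g2(... gp(x) ...)); on hardware each gi becomes a pipe stage so that p inputs can be in flight simultaneously, while the sequential model below only shows the functional chaining (stage functions are illustrative).

    #include <stdio.h>

    typedef int (*stage_fn)(int);

    static int g1(int x) { return x + 3; }
    static int g2(int x) { return x * 2; }
    static int g3(int x) { return x - 1; }

    /* f(x) = g1(g2(g3(x))): apply the stages from the innermost outward.
       In hardware each stage is one pipe segment working on a different input. */
    static int f(int x) {
        stage_fn pipe[] = {g3, g2, g1};     /* innermost stage first */
        for (int i = 0; i < 3; i++)
            x = pipe[i](x);
        return x;
    }

    int main(void) {
        printf("f(10) = %d\n", f(10));      /* g1(g2(g3(10))) = g1(g2(9)) = g1(18) = 21 */
        return 0;
    }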
10
Reduction Composition
  • f(x0, ..., xm-1) = g(x0, ..., xm-1)



[Figure: a reduction network maps the vector x0, x1, ..., xm-1 to the scalar g(x0, x1, ..., xm-1)]
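A minimal C sketch of reduction composition: g collapses a whole vector into one scalar; a sum is used here, but min, max, OR, etc. follow the same pattern (names are illustrative).

    #include <stdio.h>

    #define M 8

    /* Reduction: g(x0, ..., xm-1) collapses a vector into a scalar. */
    static int reduce_sum(const int *x, int m) {
        int acc = 0;
        for (int i = 0; i < m; i++)
            acc += x[i];
        return acc;
    }

    int main(void) {
        int x[M] = {3, 1, 4, 1, 5, 9, 2, 6};
        printf("g(x) = %d\n", reduce_sum(x, M));   /* 31 */
        return 0;
    }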
11
Primitive recursive rule
  • f(x,y) = h(x, f(x, y-1)), where f(x,0) = g(x)
  • f(x,y) = h(x, h(x, h(x, ..., h(x, g(x)) ... )))
  • A parallel solution makes sense only if the
    function must be computed many times
  • Implementations:
  • Data parallel composition
  • Loop in a serial composition (see the sketch below)
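A small C sketch of the primitive recursive rule under the definition above: the recursion unrolls into a loop that applies h y times on top of g(x); g and h are illustrative placeholders.

    #include <stdio.h>

    static int g(int x)        { return x; }        /* base case: f(x,0) = g(x)         */
    static int h(int x, int r) { return x + r; }    /* step:      f(x,y) = h(x, f(x,y-1)) */

    /* The recursion unrolled as a loop: h is applied y times on top of g(x).
       With these placeholders f(x,y) = x*(y+1), but any g, h fit the pattern. */
    static int f(int x, int y) {
        int r = g(x);
        for (int i = 0; i < y; i++)
            r = h(x, r);
        return r;
    }

    int main(void) {
        printf("f(3,4) = %d\n", f(3, 4));   /* h applied 4 times on top of g(3) = 3 -> 15 */
        return 0;
    }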

12
Data Parallel Composition for the Primitive
Recursive Rule
  • x, Y = {y0, ..., yn-1} → {f(x,y0), f(x,y1), ...,
    f(x,yn-1)}

[Figure: each pair (x, yi) is handled by its own processing element, producing f(x, y0), f(x, y1), ..., f(x, yn-1) in parallel]
13
Serial Composition for the Primitive Recursive
Rule
  • x, <Y> = <y0, ..., yn-1> → <F> = <f(x,y0), f(x,y1),
    ..., f(x,yn-1)>



[Figure: the stream x, <Y> flows through a chain of h blocks with a selector (sel), producing the stream <F>]
14
Minimalization rule
  • f(x) = min { y | m(x,y) = 0 }
  • Implementations:
  • Speculative composition + reduction composition
    (see the sketch below)
  • Serial composition + reduction composition
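A hedged C sketch of the speculative + reduction implementation: m(x,y) is evaluated for a whole range of candidate y values, then a reduction returns the first index where the result is 0 (the predicate m and the range N are illustrative).

    #include <stdio.h>

    #define N 16   /* number of speculated y values */

    /* Illustrative predicate: m(x,y) = 0 exactly when y*y >= x. */
    static int m(int x, int y) { return (y * y >= x) ? 0 : 1; }

    /* Speculative composition: compute m(x,y) for y = 0..N-1 "in parallel",
       then a reduction (first index with value 0) yields f(x). */
    static int f(int x) {
        int spec[N];
        for (int y = 0; y < N; y++)
            spec[y] = m(x, y);
        for (int y = 0; y < N; y++)     /* reduction: first zero */
            if (spec[y] == 0)
                return y;
        return -1;                      /* no candidate in range: f(x) undefined here */
    }

    int main(void) {
        printf("f(10) = %d\n", f(10));  /* smallest y with y*y >= 10 is 4 */
        return 0;
    }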

15
Speculative Composition + Reduction Composition
for Minimalization


[Figure: x is broadcast to n blocks computing m(x,0), m(x,1), ..., m(x,n-1); the pairs (m(x,i), i) feed a reduction block that selects the first i with m(x,i) = 0, so f(x) = i]
16
Serial Composition + Reduction Composition for
Minimalization

[Figure: a pipe of stages Pi-5, ..., Pi-2, Pi-1, Pi (Pi is the i-th pipe stage); the values yi-1, yi-2, ..., yi-s and a selection code produce yi. An example of dynamic reconfiguration.]
17
Functional Taxonomy of Parallel Computation
  • Data Parallel Computation uses SIMD-like
    machines
  • Time Parallel Computation is a very special
    sort of MIMD, used to compute only one function
  • Speculative Computation is a MISD machine,
    completely ignored by current implementations

18
Integral Parallel Architecture
  • An Integral Parallel Architecture (IPA) uses all
    kinds of parallelism to build a real machine, in
    two versions:
  • complex IPA: all types of parallel mechanisms
    tightly interleaved on the same physical
    structure (pipelined, superscalar, speculative
    general-purpose processors)
  • intensive IPA: all types of parallel mechanisms
    highly separated, implemented on specific
    physical structures (accelerators for embedded
    computation in a SoC approach)

19
Intensive IPA
  • Intensive IPAs are used as accelerators for
    complex IPAs
  • Monolithic intensive IPA: the same machine works
    in two modes
  • Data parallel
  • Time parallel
  • Segregated intensive IPA: two distinct machines
    are used, one for data parallel computation and
    the other for time parallel (i.e. speculative)
    computation

20
The Connex Chip
  • The organization of the BA1024:
  • multi-core area of 4 MIPS processors
  • many-core data parallel area of 1,024 simple PEs
  • speculative time parallel pipe of 8 PEs
  • interfaces (DDR, PCI, video and audio interfaces
    for 2 HDTV channels)

21
The Connex System
  • Connex Array: 1,024 linearly connected 16-bit
    Processing Cells
  • Sequencer: 32-bit stack machine with program
    memory and data memory (4KB data, 32Kb program);
    issues in each cycle (on a 2-stage pipe) one
    64-bit instruction for the Connex Array and a
    24-bit instruction for itself
  • I/O Controller: 32-bit stack machine (4KB data,
    4KB program memory); controls a 3.2 GB/s I/O
    channel, which works in parallel with code
    running on the Connex Array
  • Processing Cell: integer unit (16-bit ALU), data
    memory (16-bit RAM for data, addresses 0..255),
    Boolean unit, and registers R0-R7 (R0: Connex
    link, R1: I/O, R2: AUX)
[Figure: block diagram of the Connex system]
22
Connex Array Structure
  • Processing Cells are linearly connected using
    only register R0
  • The I/O plane consists of all the R1 registers,
    supervised mainly by the I/O Controller
  • Conditional execution is based on the state of
    the Boolean unit
  • The integer unit, Boolean unit and data memory
    execute in each cycle command fields from a
    64-bit instruction issued by the Sequencer
  • Vector reduction operations deliver scalar results
    into the TOS of the Sequencer (which receives data
    from the array of cells through a 3-stage pipe)

[Figure: cells 0, 1, ..., 1023, each with registers R0-R7, a 16-bit ALU, and an on/off activity flag]
23
I/O System
[Figure: I/O system block diagram: Connex Array with its I/O Plane, IS (Instruction Sequencer), IOC (I/O Controller) with interrupts, a switch fabric (128-bit word), a DDR-DRAM controller, and four DRAM devices]
24
[Figure: BA1024 block diagram: ConnexArray programmable media processor (multi-codec processing, pre-analysis, 3D filter, scaling, video merge/blend, motion-adaptive de-interlacing), Instruction Sequencer, I/O Sequencer, configurable switch fabric, 64-bit wide DRAM interface, four BT.656/1120 video interfaces, I2S and S/PDIF audio interfaces, PCI v2.2 or generic bus, Flash, and Test/ICE]
25
The Connex Architecture
  • Vectors and selections
  • Programming Connex
  • Performance

26
Vectors and Selections
  • Linear array of processing elements → vectors
  • Local data memory in each processing element →
    array of vectors
  • Data-dependent operations at the level of each
    processing element → selections

27
Full Line Operations
[Figure: 256 lines (0..255) of 1,024 16-bit data operands (columns 0..1023)]
Line k = Line i OP Line j
Line k = Line i OP scalar value (the scalar repeated
for all elements)
where OP is +, -, ×, XOR, etc.
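A C sketch (not VectorC) of a full-line operation, Line k = Line i OP Line j over all 1,024 columns; on the array this is one issued instruction, while sequentially it is an element-wise loop over 16-bit values.

    #include <stdio.h>
    #include <stdint.h>

    #define COLS 1024

    /* Line k = Line i + Line j, element by element over all columns.
       Any other OP (-, XOR, ...) replaces the '+'. */
    static void line_add(int16_t *k, const int16_t *i, const int16_t *j) {
        for (int c = 0; c < COLS; c++)
            k[c] = (int16_t)(i[c] + j[c]);
    }

    int main(void) {
        static int16_t line_i[COLS], line_j[COLS], line_k[COLS];
        for (int c = 0; c < COLS; c++) { line_i[c] = (int16_t)c; line_j[c] = 1; }

        line_add(line_k, line_i, line_j);
        printf("line_k[0]=%d line_k[1023]=%d\n", line_k[0], line_k[1023]);  /* 1 and 1024 */
        return 0;
    }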
28
Columns Active Based On Repeating Patterns
[Figure: the same line operation, applied only in the active columns]
Mark all odd columns active, or mark every third
column active, or mark every third and fourth
column active, etc.
29
Columns Active Based On Data Content
[Figure: the same line operation, applied only in the active columns]
Apparently random columns are active (marked),
based on data-dependent results of previous
operations.
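A C sketch of data-dependent selection: a Boolean mask derived from previous results marks the active columns, and the next operation touches only those columns. It mirrors the WHERE/ENDW construct used later in the VectorC example, but the code below is an illustrative scalar model, not the Connex API.

    #include <stdio.h>
    #include <stdint.h>

    #define COLS 16   /* small width for readability; the Connex array has 1,024 columns */

    int main(void) {
        int16_t v[COLS]   = {3, -5, 7, -1, 0, -9, 4, -2, 6, -8, 1, -3, 5, -7, 2, -4};
        uint8_t active[COLS];

        /* Step 1: build the selection from data content (here: negative values). */
        for (int c = 0; c < COLS; c++)
            active[c] = (v[c] < 0);

        /* Step 2: the operation is applied only in active columns (v = -v there),
           much like WHERE (V < 0) V = -V; ENDW in the VectorC example. */
        for (int c = 0; c < COLS; c++)
            if (active[c])
                v[c] = (int16_t)(-v[c]);

        for (int c = 0; c < COLS; c++)
            printf("%d ", v[c]);
        printf("\n");
        return 0;
    }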
30
Outer-Loop Parallelism
[Figure: the 1,024-column array partitioned into adjacent 8x8 blocks spanning 8 lines]
Example: 128 sets of 8x8 data run in parallel in a
1,024-cell array.
31
Programming Connex
    int main() {
      vector V1 = 2;                 // V1 = {2, 2, ..., 2}
      vector V2 = 3;                 // V2 = {3, 3, ..., 3}
      vector V = 0;                  // V = {0, 0, ..., 0}
      vector Index = indexvector();  // Index = {0, 1, ...}
      V = mm_absdiff(V1, V2);        // V = {1, 1, ..., 1}
      return 0;
    }
    // Find the absolute difference between two vectors
    vector mm_absdiff(vector V1, vector V2) {
      vector V;
      V = V1 - V2;
      WHERE (V < 0)
        V = -V;                      // V = abs(V)
      ENDW
      return V;
    }
  • VectorC is an extension/restriction of C
  • Code that operates on scalar data is written in
    regular C notation
  • Connex-specific operators are defined as functions
    for features not available in C, e.g. operations
    on vectors and selections (Boolean vectors)
  • VectorC uses sequential operators and control
    structures on the vector and select data types
  • Using VectorC, the Connex machine is programmed
    the same way as conventional sequential machines

32
Overall performance of the BA1024
  • 200 GOP/sec
  • 3.2 GB/sec external bandwidth
  • 400 GB/sec internal bandwidth
  • > 60 GOP/Watt
  • > 2 GOP/mm2
  • Note: 1 OP = one 16-bit simple integer operation
    (excluding multiplication)

33
How to Use the Connex Environment for Membrane
Computation
  • Example (G. Paun)
  • the initial configuration: [1 [2 [3 a f c ]3 ]2 ]1
  • R1: e → (e, out), f → f
  • R2: b → d, d → de, ff → f, cf → cdδ
  • R3: a → ab, a → bδ, f → ff

34
The first example of processing
Initial vector: (1,) (2,) (3,) (0,a) (0,f) (0,c) (3,) (2,) (1,) ...

a f c ...                        apply a → ab, f → ff
a b f f c ...                    // 11 clock cycles; apply a → ab, f → ff
a b b f f f f c ...              // 15 clock cycles; apply a → bδ, f → ff
b b b f f f f f f f f c ...      // 27 clock cycles; apply b → d, ff → f
d d d f f f f c ...              // 10 clock cycles; apply d → de, ff → f
d e d e d e f f c ...            // 10 clock cycles; apply d → de, cf → cdδ
d e e d e e d e e d f c ...      // 10 clock cycles; apply e → (e, out), f → f
d d d d f c  e e e e e e ...     // 15 clock cycles

total: 98 clock cycles
35
The second example of processing
Initial vector: (1,) (2,) (3,) (1,a) (1,f) (1,c) (3,) (2,) (1,) ...

1a 1f 1c ...
→ 1a 1b 2f 1c ...      // in 5 clock cycles
→ 1a 2b 4f 1c ...      // in 5 clock cycles
→ 3b 8f 1c ...         // in 10 clock cycles
→ 3d 4f 1c ...         // in 7 clock cycles
→ 3d 3e 2f 1c ...      // in 8 clock cycles
→ 4d 3e 1f 1c ...      // in 8 clock cycles
→ 4d 1f 1c  3e ...     // in 5 clock cycles

total: 48 clock cycles
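To make the count-based representation concrete, here is a hedged C sketch of one maximally parallel step on multiplicities, applying a → ab and f → ff to the configuration 1a 1f 1c; only the bookkeeping is illustrated, not the Connex implementation.

    #include <stdio.h>

    /* Multiplicities of the objects inside membrane 3: a, b, c, f. */
    struct membrane { int a, b, c, f; };

    /* One maximally parallel step applying a -> ab and f -> ff:
       every a produces an extra b (a itself survives), every f doubles. */
    static void step_a_ab_f_ff(struct membrane *m) {
        m->b += m->a;   /* each a adds one b */
        m->f *= 2;      /* each f becomes ff */
    }

    int main(void) {
        struct membrane m = {1, 0, 1, 1};          /* 1a 1c 1f, as in the second example */
        step_a_ab_f_ff(&m);
        printf("%da %db %dc %df\n", m.a, m.b, m.c, m.f);   /* 1a 1b 1c 2f */
        return 0;
    }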
36
The third example of processing
The third membrane is duplicated (multiplied), but
the contents can be different:

1a 1f 1c  2a 1f 1c ...
→ 1a 1b 2f 1c  2a 2b 2f 1c ...    // in 5 clock cycles
→ 1a 2b 4f 1c  2a 4b 4f 1c ...    // in 5 clock cycles
→ 3b 8f 1c  6b 8f 1c ...          // in 10 clock cycles
→ 3d 4f 1c  6d 4f 1c ...          // in 7 clock cycles
→ 3d 3e 2f 1c  6d 6e 2f 1c ...    // in 8 clock cycles
→ 4d 3e 1f 1c  7d 6e 1f 1c ...    // in 8 clock cycles
→ 4d 1f 1c  7d 1f 1c  9e ...      // in 10 clock cycles

total: 53 clock cycles
For up to 200 level-3 membranes the number of clock
cycles remains 53.
37
Concluding Remarks
  • Functional taxonomy vs. Flynn's taxonomy
  • The Connex architecture accelerates membrane
    computation
  • An efficient P-architecture asks for a few
    additional features on top of the Connex
    architecture
  • Why not a P-language?

38
Main technical contributors to the Connex project
Emanuele Altieri, BrightScale Inc., CA
Lazar Bivolarski, BrightScale Inc., CA
Frank Ho, BrightScale Inc., CA
Mihaela Malita, St. Anselm College, NH
Bogdan Mitu, BrightScale Inc., CA
Dominique Thiebaut, Smith College, MA
Tom Thomson, BrightScale Inc., CA
Dan Tomescu, BrightScale Inc., CA
39
  • Thank You
  • Mihaela's webpage on VectorC:
  • www.anselm.edu/homepage/mmalita/
  • Q&A