Cell/B.E. - PowerPoint PPT Presentation

Provided by: jakuby

1
Cell/B.E.
  • Jirí Dokulil

2
Introduction
  • Cell Broadband Engine
  • developed by Sony, Toshiba and IBM
  • 64bit PowerPC
  • PowerPC Processor Element (PPE)
  • runs OS
  • SIMD
  • Synergistic Processor Element (SPE)
  • 8x
  • computations, no OS
  • big endian

3
Architecture

4
Memory access
  • PPE
  • load store
  • cache
  • SPE
  • DMA
  • up to 16 concurrent per SPE
  • no direct access to memory
  • no need for out-of-order processing, no
    speculation
  • local storage
  • no cache

5
PPE
  • PowerPC Processor Element
  • PPU (PowerPC Processor Unit)
  • PPSS (PowerPC Processor Storage Subsystem)
  • 64-bit, dual-thread PowerPC Architecture RISC
    core
  • 2x32KB L1 (instructions and data)
  • 512KB L2 (unified)
  • PowerPC instruction set
  • vector/SIMD extensions different from SPE
  • 32x 128bit vector registers

6
SPE
  • Synergistic Processor Element
  • SPU (Synergistic Processor Unit)
  • MFC (Memory Flow Controller)
  • RISC, SIMD
  • Synergistic Processor Unit Instruction Set
    Architecture
  • support for DMA and interprocessor messaging
  • 256KB LS
  • 128x128bit register file
  • DMA access to main memory
  • segment and page tables of PPE
  • channels
  • in MFC
  • unidirectional message-passing interfaces
  • memory-mapped I/O (MMIO) registers and queues

7
EIB
  • Element Interconnect Bus
  • four 16-byte-wide data rings
  • transfers 128 bytes at a time (one PPE cache line)
  • internal bandwidth 96bytes per clock cycle
  • latency depends on the number of hops
  • bus is a ring
  • half frequency of SPU

8
DMA
  • MFCs support naturally aligned DMA transfer sizes
    of 1, 2, 4, or 8 bytes, and multiples of 16 bytes
  • maximum transfer size of 16 KB per transfer
  • DMA list commands can initiate up to 2048
    transfers
  • peak transfer performance
  • if both the effective addresses and the LS
    addresses are 128-byte aligned
  • and the size of the transfer is an even multiple
    of 128 bytes
  • SMM (Synergistic Memory Management) unit
  • processes address translation
  • access-permission information
  • data supplied by the PPE operating system
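
The size and alignment rules on this slide can be written down as a small check. The following is a minimal sketch in portable C, not Cell SDK code: the helper names are hypothetical, only the rules listed above are encoded, and rejecting a size of 0 is a simplification made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_SIZE (16u * 1024u) /* 16 KB limit per transfer, from the slide */

/* Hypothetical helper: true if a single transfer obeys the size and
 * natural-alignment rules listed on the slide. Size 0 is rejected here,
 * which is a simplification of this sketch. */
bool dma_transfer_is_valid(uint64_t ea, uint32_t lsa, uint32_t size)
{
    if (size == 0 || size > DMA_MAX_SIZE)
        return false;
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Naturally aligned: each address is a multiple of the size. */
        return (ea % size == 0) && (lsa % size == 0);
    }
    if (size % 16 != 0)
        return false;
    /* Multiples of 16 bytes must start on 16-byte boundaries. */
    return (ea % 16 == 0) && (lsa % 16 == 0);
}

/* Hypothetical helper: true if the transfer also meets the peak-performance
 * conditions (128-byte aligned addresses, size a multiple of 128 bytes). */
bool dma_transfer_is_peak(uint64_t ea, uint32_t lsa, uint32_t size)
{
    return dma_transfer_is_valid(ea, lsa, size)
        && (ea % 128 == 0) && (lsa % 128 == 0) && (size % 128 == 0);
}
```

For example, a 0x4000-byte transfer between effective address 0x10000000 and LS address 0x500 passes both checks, while a 24-byte transfer is rejected because 24 is neither 1, 2, 4, 8 nor a multiple of 16.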

9
SIMD example
  • // 16 iterations of a loop
  • int rolled_sum(unsigned char bytes[16]) {
  •   int i;
  •   int sum = 0;
  •   for (i = 0; i < 16; ++i)
  •     sum += bytes[i];
  •   return sum;
  • }

10
SIMD example cont.
  • // Vectorized for vector/SIMD multimedia extension
  • int vectorized_sum(unsigned char bytes[16]) {
  •   vector unsigned char vbytes;
  •   union {
  •     int i[4];
  •     vector signed int v;
  •   } sum;
  •   vector unsigned int zero = (vector unsigned int){0};
  •   // Perform a misaligned vector load of the 16 bytes.
  •   vbytes = vec_perm(vec_ld(0, bytes), vec_ld(16, bytes), vec_lvsl(0, bytes));
  •   // Sum the 16 bytes of the vector
  •   sum.v = vec_sums((vector signed int)vec_sum4s(vbytes, zero),
      (vector signed int)zero);
  •   // Extract the sum and return the result.
  •   return (sum.i[3]);
  • }
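
To see why the result is read from element 3, it can help to trace the two reduction steps lane by lane. The sketch below is a scalar model in portable C, written for illustration only: the function names are hypothetical, saturation is left out (it cannot trigger when 16 bytes are added to zero), and no vector intrinsics are used.

```c
#include <stdint.h>

/* Scalar model of vec_sum4s: each 32-bit lane receives the sum of its four
 * bytes plus the matching lane of acc. Saturation is not modelled. */
void model_sum4s(const uint8_t bytes[16], const uint32_t acc[4], uint32_t out[4])
{
    for (int lane = 0; lane < 4; ++lane) {
        out[lane] = acc[lane];
        for (int b = 0; b < 4; ++b)
            out[lane] += bytes[4 * lane + b];
    }
}

/* Scalar model of vec_sums: lanes 0..2 are cleared and lane 3 receives the
 * sum of all four input lanes plus lane 3 of acc. Saturation is not modelled. */
void model_sums(const int32_t in[4], const int32_t acc[4], int32_t out[4])
{
    out[0] = out[1] = out[2] = 0;
    out[3] = in[0] + in[1] + in[2] + in[3] + acc[3];
}

/* Hypothetical scalar stand-in for vectorized_sum() from the slide. */
int modeled_vectorized_sum(const uint8_t bytes[16])
{
    const uint32_t zero_u[4] = {0, 0, 0, 0};
    const int32_t zero_s[4] = {0, 0, 0, 0};
    uint32_t partial[4];
    int32_t partial_s[4];
    int32_t total[4];

    model_sum4s(bytes, zero_u, partial);
    for (int lane = 0; lane < 4; ++lane)
        partial_s[lane] = (int32_t)partial[lane];
    model_sums(partial_s, zero_s, total);
    return total[3]; /* same element the slide reads as sum.i[3] */
}
```

The first step reduces 16 bytes to four partial sums and the second folds those four into the last lane, which is why the slide's code returns element 3 of the union.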

11
Communication
  • DMA
  • 2 command queues per SPE
  • one for commands by SPE
  • one for commands by PPE and other SPEs
  • commands have tags (32 different) used for status queries
  • one transfer or a list
  • mailboxes
  • for each SPE
  • communication with PPE
  • 2 outgoing (1 message)
  • 1 incoming (4 messages)
  • signals
  • 2 inbound channels

12
DMA
  • put, get
  • SPE or PPE initiated
  • tag
  • 5bit
  • ordering
  • out of order
  • barrier maintains order (within tag group)
  • fence after all previous (within tag group)
  • simple or lists
  • lists stored in LS (8 bytes per item) -> SPE only
  • up to 2048 transfers, 16KB each -> 32MB
  • compare to 256KB LS size
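
The list-size arithmetic on this slide can be checked directly. This is a small sketch in portable C using only the numbers given above; the constant and function names are hypothetical.

```c
#include <stdint.h>

enum {
    DMA_MAX_TRANSFER_BYTES = 16 * 1024, /* 16 KB per transfer */
    DMA_LIST_MAX_ELEMENTS  = 2048,      /* transfers per list command */
    DMA_LIST_ELEMENT_BYTES = 8,         /* LS bytes per list item */
    LS_BYTES               = 256 * 1024 /* local storage size */
};

/* Largest amount of data one list command can describe. */
uint32_t dma_list_max_bytes(void)
{
    return (uint32_t)DMA_LIST_MAX_ELEMENTS * DMA_MAX_TRANSFER_BYTES;
}

/* LS space consumed by a list of n_elements items. */
uint32_t dma_list_ls_footprint(uint32_t n_elements)
{
    return n_elements * DMA_LIST_ELEMENT_BYTES;
}

/* Number of list elements needed to cover total_bytes when each element
 * moves at most 16 KB (ceiling division). */
uint32_t dma_list_elements_for(uint32_t total_bytes)
{
    return (total_bytes + DMA_MAX_TRANSFER_BYTES - 1) / DMA_MAX_TRANSFER_BYTES;
}
```

A full list of 2048 items describes 32 MB and itself occupies 16 KB of LS, whereas filling the entire 256 KB LS would take only 16 maximum-size elements.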

13
DMA PPE raw access
  • MFC registers mapped to virtual address space
  • void *ps = get_ps(); // get the problem state area; it must be mapped by
    privileged software
  • unsigned int ls = 0x500;
  • unsigned long long ea = 0x10000000;
  • unsigned int size = 0x4000;
  • unsigned int tag = 5;
  • unsigned int classid = 0;
  • unsigned int cmd = MFC_GET_CMD;
  • unsigned int cmd_status;
  • do {
  •   *((volatile unsigned int *)(ps + MFC_LSA)) = ls;
  •   *((volatile unsigned long long *)(ps + MFC_EAH)) = ea;
  •   *((volatile unsigned int *)(ps + MFC_Size)) = (size << 16) | tag;
  •   *((volatile unsigned int *)(ps + MFC_ClassID)) = (classid << 16) | cmd;
  •   /* Read MFC_CMDStatus to enqueue command and check enqueue success. */
  •   cmd_status = *((volatile unsigned int *)(ps + MFC_CMDStatus)) & 0x3;
  • } while (cmd_status); /* Attempt to enqueue until success */
  • this only enqueues the command; it does not wait for the transfer to complete
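
The two combined register writes on this slide pack a pair of 16-bit fields into one 32-bit word each. Below is a minimal sketch of that packing in portable C; the helper names are hypothetical, and the code only mirrors the expressions shown above rather than touching any real MMIO register.

```c
#include <stdint.h>

/* Pack transfer size (upper 16 bits) and tag (lower 16 bits), mirroring
 * the "(size << 16) | tag" expression on the slide. */
uint32_t pack_size_tag(uint32_t size, uint32_t tag)
{
    return (size << 16) | (tag & 0xFFFFu);
}

/* Pack class id (upper 16 bits) and command opcode (lower 16 bits). */
uint32_t pack_classid_cmd(uint32_t classid, uint32_t cmd)
{
    return (classid << 16) | (cmd & 0xFFFFu);
}

/* The slide keeps the low two bits of the command status word; zero means
 * the command was enqueued, anything else means "write the registers again". */
int enqueue_succeeded(uint32_t cmd_status_word)
{
    return (cmd_status_word & 0x3u) == 0;
}
```

With the values from the slide, `pack_size_tag(0x4000, 5)` yields 0x40000005, which is the word written to the size/tag register.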

14
DMA PPE raw access cont.
  • test for completion (poll tag group status)
  • void *ps = get_ps();
  • unsigned int tag_mask = 1 << 5;
  • unsigned int tag_status;
  • *((volatile unsigned int *)(ps + Prxy_QueryMask)) = tag_mask;
  • __asm__("eieio"); /* force write to Prxy_QueryMask to complete */
  • do {
  •   tag_status = *((volatile unsigned int *)(ps + Prxy_TagStatus));
  • } while (!tag_status);
  • more tag groups
  • unsigned int tag_mask = (1u << 5) | (1u << 14) | (1u << 31);
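
A query mask for several tag groups is one bit per tag id. The helper below is a hypothetical sketch in portable C that builds such a mask from a list of ids; it uses an unsigned shift so that tag 31 is handled without signed overflow.

```c
#include <stddef.h>
#include <stdint.h>

/* Build a query mask with one bit set per tag group id (0-31).
 * Ids outside that range are reduced modulo 32 in this sketch. */
uint32_t tag_mask_for(const unsigned int *tags, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; ++i)
        mask |= UINT32_C(1) << (tags[i] & 31u);
    return mask;
}
```

Passing the ids 5, 14 and 31 produces the same mask as the last line of the slide.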

15
DMA SPE
  • no direct access to the virtual address space
  • only by DMA
  • direct access to own command channels
  • wrch assembly instruction
  • extern void dma_transfer(volatile void *lsa, // local storage address
  •                          unsigned int eah,   // high 32-bit effective address
  •                          unsigned int eal,   // low 32-bit effective address
  •                          unsigned int size,  // transfer size in bytes
  •                          unsigned int tag_id, // tag identifier (0-31)
  •                          unsigned int cmd);  // DMA command
  • in assembler
  • wrch $MFC_LSA, $3
  • wrch $MFC_EAH, $4
  • wrch $MFC_EAL, $5
  • wrch $MFC_Size, $6
  • wrch $MFC_TagID, $7
  • wrch $MFC_Cmd, $8
  • in C intrinsic
  • spu_mfcdma64(lsa, eah, eal, size, tag_id, cmd);
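
The interface above takes the effective address as two 32-bit halves. A minimal sketch of the split in portable C follows; the helper names are hypothetical and are not taken from the SDK headers.

```c
#include <stdint.h>

/* Upper 32 bits of a 64-bit effective address (the eah argument). */
uint32_t ea_high(uint64_t ea)
{
    return (uint32_t)(ea >> 32);
}

/* Lower 32 bits of a 64-bit effective address (the eal argument). */
uint32_t ea_low(uint64_t ea)
{
    return (uint32_t)(ea & 0xFFFFFFFFu);
}

/* Reassemble the two halves, useful for checking the split. */
uint64_t ea_join(uint32_t eah, uint32_t eal)
{
    return ((uint64_t)eah << 32) | eal;
}
```

For an address below 4 GB, such as 0x10000000 from the earlier example, the high half is simply 0.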

16
DMA SPE cont.
  • poll for completion
  • # Set tag group mask
  • wrch $MFC_WrTagMask, $0
  • # Set up for immediate tag status update.
  • il $1, 0
  • repeat:
  • wrch $MFC_WrTagUpdate, $1
  • rdch $1, $MFC_RdTagStat
  • brz $1, repeat
  • OR
  • #include <spu_intrinsics.h>
  • #include <spu_mfcio.h>
  • unsigned int tag_id = 0;
  • unsigned int tag_mask = 1 << tag_id;
  • spu_writech(MFC_WrTagMask, tag_mask);
  • do {
  • } while (!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE)); /* poll for update */

17
DMA SPE cont.
  • wait for completion (stall SPE)
  • # Set tag group mask
  • wrch $MFC_WrTagMask, $0
  • # 0x1 for any tag, 0x2 for all tags.
  • il $1, 0x1
  • # Wait for conditional tag status update (stall the SPU).
  • wrch $MFC_WrTagUpdate, $1
  • rdch $1, $MFC_RdTagStat
  • OR
  • #include <spu_intrinsics.h>
  • #include <spu_mfcio.h>
  • unsigned int tag_id = 0;
  • unsigned int tag_mask = 1 << tag_id;
  • spu_writech(MFC_WrTagMask, tag_mask);
  • /* Wait for all ids in tag group to complete (stall the SPU) */
  • spu_mfcstat(MFC_TAG_UPDATE_ALL);
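
The difference between the "any" and "all" update conditions comes down to how the set of idle tag groups is compared against the mask. The following is a hedged model in portable C, not hardware-accurate code: `idle_groups` is a hypothetical bitmap in which bit n is set when tag group n has no outstanding commands.

```c
#include <stdbool.h>
#include <stdint.h>

/* Status as seen by a reader: idle groups restricted to the mask. */
uint32_t tag_status(uint32_t idle_groups, uint32_t mask)
{
    return idle_groups & mask;
}

/* "Any" condition: at least one masked group is idle. */
bool tag_wait_any_done(uint32_t idle_groups, uint32_t mask)
{
    return tag_status(idle_groups, mask) != 0;
}

/* "All" condition: every masked group is idle. */
bool tag_wait_all_done(uint32_t idle_groups, uint32_t mask)
{
    return tag_status(idle_groups, mask) == mask;
}
```

With a mask covering groups 0 and 5, a state in which only group 5 is idle satisfies the "any" condition but not the "all" condition, so a stalling wait for "all" would continue.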

18
DMA SPE cont.
  • completion of DMA
  • source buffer can be reused
  • data may not have yet been written to the main
    storage
  • mailbox-ed notification can reach PPE before the
    data
  • SPE can do mfcsync
  • PPE can do lwsync
  • more efficient
  • SPE can notify via DMA
  • mfceieio must be used between DMAs for ordering

19
Mailboxes
  • 32bit messages
  • blocking for SPE (stalls SPE)
  • reading of empty inbound
  • writing of full outbound
  • SPE can poll the number of messages
  • non-blocking for PPE (and other devices)
  • reading returns zeros
  • writing overwrites last message
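
The overwrite behaviour described above is the reason PPE code has to check for free slots before writing. The sketch below models a four-entry inbound mailbox in portable C. It is a simplified illustration with hypothetical names: the real SPE stalls on an empty mailbox, whereas this model returns 0 so that it stays runnable.

```c
#include <stdbool.h>
#include <stdint.h>

#define IN_MBOX_DEPTH 4 /* inbound mailbox holds four messages */

typedef struct {
    uint32_t slot[IN_MBOX_DEPTH];
    unsigned count; /* number of queued messages */
} in_mbox_model;

/* PPE-side write: never blocks. When the queue is full the last entry is
 * overwritten, as the slide describes. Returns false if data was lost. */
bool ppe_write_in_mbox(in_mbox_model *mb, uint32_t msg)
{
    if (mb->count == IN_MBOX_DEPTH) {
        mb->slot[IN_MBOX_DEPTH - 1] = msg; /* overwrite last message */
        return false;
    }
    mb->slot[mb->count++] = msg;
    return true;
}

/* SPE-side message count: lets the SPE poll instead of stalling. */
unsigned spe_in_mbox_count(const in_mbox_model *mb)
{
    return mb->count;
}

/* SPE-side read. A real SPE would stall on an empty mailbox; this model
 * returns 0 instead. */
uint32_t spe_read_in_mbox(in_mbox_model *mb)
{
    if (mb->count == 0)
        return 0;
    uint32_t msg = mb->slot[0];
    for (unsigned i = 1; i < mb->count; ++i)
        mb->slot[i - 1] = mb->slot[i];
    mb->count--;
    return msg;
}
```

Writing five messages into an empty model without reading leaves the queue holding messages 1, 2, 3 and 5: the fourth message is lost to the overwrite.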

20
Mailboxes SPE
  • send (stalling)
  • wrch $SPU_WrOutMbox, $1
  • or
  • spu_writech(SPU_WrOutMbox, mb_value);
  • send (active waiting)
  • repeat:
  • rchcnt $2, $SPU_WrOutMbox
  • brz $2, repeat
  • wrch $SPU_WrOutMbox, $1
  • or
  • do {
  •   /* Do other useful work while waiting. */
  • } while (!spu_readchcnt(SPU_WrOutMbox));
  • spu_writech(SPU_WrOutMbox, mb_value);

21
Mailboxes SPE cont.
  • read (stalling)
  • rdch $1, $SPU_RdInMbox
  • or
  • mb_value = spu_readch(SPU_RdInMbox);
  • read (active waiting)
  • repeat:
  • rchcnt $1, $SPU_RdInMbox
  • brz $1, repeat
  • rdch $2, $SPU_RdInMbox
  • or
  • do {
  •   /* Do other useful work while waiting. */
  • } while (!spu_readchcnt(SPU_RdInMbox));
  • mb_value = spu_readch(SPU_RdInMbox);

22
Mailboxes PPE
  • reading the SPE's outbound mailbox
  • void *ps = get_ps();
  • unsigned int mb_status;
  • unsigned int new;
  • unsigned int mb_value;
  • do {
  •   mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
  •   new = mb_status & 0x000000FF;
  • } while (new == 0);
  • mb_value = *((volatile unsigned int *)(ps + SPU_Out_Mbox));

23
Mailboxes PPE cont.
  • writing to the SPE's inbound mailbox
  • problem of overrunning full mailbox
  • // send four messages without overrunning the mailbox
  • void *ps = get_ps();
  • unsigned int j, k = 0;
  • unsigned int mb_status;
  • unsigned int slots;
  • unsigned int mb_value[4] = {0x1, 0x2, 0x3, 0x4};
  • do {
  •   /* Poll the Mailbox Status Register until the SPU_In_Mbox_Count field
      indicates there is at least one slot available in the SPU Read Inbound
      Mailbox. */
  •   do {
  •     mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
  •     slots = (mb_status & 0x0000FF00) >> 8;
  •   } while (slots == 0);
  •   for (j = 0; j < slots && k < 4; j++)
  •     *((volatile unsigned int *)(ps + SPU_In_Mbox)) = mb_value[k++];
  • } while (k < 4);
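
Both PPE mailbox examples read the same status word and pick out one byte of it. A minimal sketch of that decoding in portable C is shown here; the function names are hypothetical and the field positions are taken from the masks used on the slides (0x000000FF and 0x0000FF00).

```c
#include <stdint.h>

/* Number of messages waiting in the SPU outbound mailbox (low byte). */
unsigned mbox_stat_out_count(uint32_t mb_status)
{
    return mb_status & 0x000000FFu;
}

/* Number of free slots in the SPU inbound mailbox (second byte). */
unsigned mbox_stat_in_free(uint32_t mb_status)
{
    return (mb_status & 0x0000FF00u) >> 8;
}

/* How many of the remaining messages may be written right now without
 * overrunning the inbound mailbox. */
unsigned mbox_writable_now(uint32_t mb_status, unsigned remaining)
{
    unsigned slots = mbox_stat_in_free(mb_status);
    return remaining < slots ? remaining : slots;
}
```

The inner `for` loop on the slide applies the same bound as `mbox_writable_now`: it writes at most `slots` messages per status read and stops once all four have been sent.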

24
CELL SDK 3.1
  • http://www.ibm.com/developerworks/power/cell/
  • Cell BE Programming Handbook Including PowerXCell 8i
  • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D
  • SPE Runtime Management Library
  • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3
  • PPU and SPU C/C++ Language Extension Specification
  • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E

25
libspe and libspe2
  • low-level APIs to access Cell from C/C++
  • new threading model in libspe2
  • use the threading library of your choice and call libspe2 from there;
    libspe2 itself provides no SPE threads
  • create e.g. a pthread thread and launch the SPE code from it; the call
    returns after the SPE finishes

26
Compilation
  • PPE object
  • g++ -m64 -c -Ox
  • SPE object
  • spu-gcc -Ox
  • no -m64
  • LS addresses are always 32bit
  • ppu-embedspu -m64 <symbol> <object> <output>
  • link
  • g++ -m64 <spe object> <ppe object> -lspe -lspe2

27
Referencing SPE code from PPE code
  • extern spe_program_handle_t <symbol>;
  • spe_program_load(spe_context, &<symbol>);

28
Launching SPE code (libspe2)
  • struct thread_data {
  •   spe_context_ptr_t context;
  •   program_data *pd;
  • };
  • void *ppu_pthread_function(void *arg) {
  •   thread_data &td = *(thread_data *) arg;
  •   spe_context_ptr_t context = td.context;
  •   unsigned int entry = SPE_DEFAULT_ENTRY;
  •   spe_context_run(context, &entry, 0, td.pd, NULL, NULL);
  •   pthread_exit(NULL);
  • }
  • spe_context_ptr_t context;
  • pthread_t pthread;
  • thread_data td;
  • context = spe_context_create(0, NULL);
  • spe_program_load(context, &spe_prg);

29
SPE code
  • #include <spu_mfcio.h>
  • int main(
  •     unsigned long long spe_id,
  •     unsigned long long program_data_ea,
  •     unsigned long long env) {
  •   program_data pd __attribute__((aligned(16)));
  •   int tag_id = 1;
  •   mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
  •   mfc_write_tag_mask(1 << tag_id);
  •   mfc_read_tag_status_any();
  • }

30
Program data
  • structure shared by SPE and PPE code
  • unsigned long long for 64bit pointers
  • void * is 32bit on SPE and 32/64bit on PPE
  • be careful with the alignment
  • DMA cannot handle misaligned transfers
  • size padded to 16byte
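
One way to follow these rules is to declare the shared structure with fixed-width fields and let the compiler verify size and alignment. The sketch below is illustrative C11 only: the field names are hypothetical (the slides do not show the contents of `program_data`), and `alignas` is used here in place of the GCC `aligned` attribute seen in the SPE code.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control block shared by PPE and SPE code. Pointers are
 * carried as 64-bit integers so the layout is identical on both sides. */
typedef struct {
    alignas(16) uint64_t input_ea; /* effective address of input buffer */
    uint64_t output_ea;            /* effective address of output buffer */
    uint32_t element_count;        /* number of elements to process */
    uint32_t pad[3];               /* explicit padding to 32 bytes */
} program_data;

/* Compile-time checks: the size is a multiple of 16 bytes and the type is
 * 16-byte aligned, so it can be moved with a single DMA transfer. */
_Static_assert(sizeof(program_data) % 16 == 0, "size must be padded to 16 bytes");
_Static_assert(alignof(program_data) == 16, "must be 16-byte aligned");

/* Store a pointer as a 64-bit effective address. */
uint64_t ea_from_pointer(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

Writing the padding out explicitly keeps the layout independent of each compiler's own padding decisions, which matters because the PPE and SPE objects are built by different compilers.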

31
DMA SPE side
  • (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size,
    uint32_t tag, uint32_t tid, uint32_t rid)
  • initiate transfer from LS
  • tag is number (e.g. 5)
  • mfc_putb, mfc_putf

32
DMA SPE side cont.
  • (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size,
    uint32_t tag, uint32_t tid, uint32_t rid)
  • mfc_getb, mfc_getf

33
DMA status SPE side
  • (void) mfc_write_tag_mask (uint32_t mask)
  • tag mask (e.g. 1 << 5)
  • (uint32_t) mfc_read_tag_status_any(void)
  • blocks until any of the specified tag groups has
    no outstanding operations
  • (uint32_t) mfc_read_tag_status_all(void)
  • blocks until all of the specified tag groups
    have no outstanding operations

34
Mailboxes SPE side
  • (uint32_t) spu_read_in_mbox(void)
  • (uint32_t) spu_stat_in_mbox(void)
  • (void) spu_write_out_mbox(uint32_t data)
  • (uint32_t) spu_stat_out_mbox(void)

35
Mailboxes PPE side
  • int spe_out_mbox_read (spe_context_ptr_t spe,
    unsigned int *mbox_data, int count)
  • int spe_out_mbox_status (spe_context_ptr_t spe)
  • int spe_in_mbox_write (spe_context_ptr_t spe,
    unsigned int *mbox_data, int count, unsigned int
    behavior)
  • SPE_MBOX_ALL_BLOCKING
  • blocks until all are sent
  • SPE_MBOX_ANY_BLOCKING
  • blocks until at least one message is sent
  • SPE_MBOX_ANY_NONBLOCKING
  • sends as many as possible without blocking
  • int spe_in_mbox_status (spe_context_ptr_t spe)

36
PPE direct access to SPE
  • void *spe_ls_area_get (spe_context_ptr_t spe)
  • less efficient than DMA
  • int spe_ls_size_get (spe_context_ptr_t spe)
  • void *spe_ps_area_get (spe_context_ptr_t spe,
    enum ps_area area)
  • enum ps_area
  • SPE_MFC_COMMAND_AREA
  • MFC registers
  • SPE_CONTROL_AREA
  • mailboxes
  • the get_ps function used in examples from the
    first part