1
Advanced Memory Management
2
Advanced Topics
  • Distributed Shared Memory
  • Application Controlled Memory Management
  • TLB Issues

3
Distributed Shared Memory
  • in hardware
  • in the operating system
  • in user space
4
Distributed Shared Memory
  • a way to facilitate the programming of
    distributed systems
  • two alternatives: DSM and message passing
  • DSM provides a uniform, single address space view
  • easy to use
  • separates the memory view from the threads
  • each node can access all the memory even though
    the machines do not physically share memory
  • a software layer allows application software to
    access shared data not resident at the node
  • can be integrated with the OS (IVY, Clouds, ...)
  • or run on top of the OS (NOW)

5
Issues in the design of DSM
  • virtual memory and DSM
  • find where the physical page frame is
  • make it coherent if duplicated
  • memory model and coherence protocol
  • what is guaranteed when data is accessed in
    parallel
  • how to make shared pages coherent
  • synchronization
  • will be used frequently in parallel programs on
    DSM
  • hardware support
  • speed: network, CPU
  • functionality: TLB, message processing

6
Distributed Shared Memory
  • Virtual Memory Mechanism
  • 1. the CPU generates a virtual address
  • 2. if the page is in local memory, keep going
  • else send a request for the page to ??? (the fault
    path is sketched below)
  • needs information about who has the page, or
  • about who knows where the page is
  • 3. the node that has the page replies with the
    page
  • with write privilege, invalidating its own copy,
    if it is a write request
  • with read-only privilege, keeping ownership,
    otherwise
  • when a page has to be replaced, either
  • swap it to disk or
  • send it to network DRAM
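
A minimal sketch in C of steps 1-3 above. All helper names
(in_local_memory, probable_owner, request_page, map_page) are
hypothetical, not any particular DSM's API:

    #include <stdbool.h>

    typedef unsigned long vpn_t;

    extern bool in_local_memory(vpn_t vpn);      /* assumed local lookup  */
    extern int  probable_owner(vpn_t vpn);       /* who (we think) has it */
    extern void request_page(int node, vpn_t vpn, bool write);
    extern void map_page(vpn_t vpn, bool writable);

    void dsm_page_fault(vpn_t vpn, bool is_write)
    {
        if (in_local_memory(vpn)) {       /* step 2: local hit, keep going */
            map_page(vpn, is_write);
            return;
        }
        /* step 2 (miss): ask the node that has the page, or a node
         * that knows where the page is */
        request_page(probable_owner(vpn), vpn, is_write);

        /* step 3 happens at the remote node: it replies with the page,
         * with write privilege (invalidating its own copy) for a write
         * request, or read-only privilege while keeping ownership */
        map_page(vpn, is_write);
    }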

7
Consistency Protocol
  • similar to those of distributed file systems and
    SMP caches
  • defines the action on a write to a shared page
  • invalidate (a page)
  • update (a page, a word, a cache line, ...)
  • the length of the write run decides which one is
    better
  • defines the state of a page
  • shared or exclusive
  • write or read
  • ownership, if any
  • defines the information about
  • how to find the owner
  • how to find replicated copies

8
Memory Model
  • defines the result of a series of memory operations
  • restricts the implementation of shared memory
  • may disallow caching
  • may allow out-of-order processing of memory
    operations
  • sequential consistency (Lamport)
  • the strongest model
  • the result should be the same as some serial
    interleaving of the operations
  • restricts any out-of-order accesses to memory
  • release consistency
  • based on synchronization operations
  • acquire
  • release
  • guarantees consistency only at the end of a
    release operation
  • allows buffering of memory accesses inside a
    critical section (see the sketch below)
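
A minimal sketch of what release consistency permits; acquire,
release, and buffered_write are hypothetical primitives (release is
assumed to flush all buffered writes before completing):

    extern void acquire(int lock);           /* get an up-to-date view */
    extern void release(int lock);           /* flush buffered writes  */
    extern void buffered_write(int *addr, int val);

    void update_shared(int *x)
    {
        acquire(0);
        buffered_write(x, *x + 1);  /* may stay local inside the
                                       critical section */
        release(0);                 /* only here must the write become
                                       globally visible */
    }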

9
GLUnix
  • DSM that runs on top of the OS
  • minimizes OS modification
  • fast prototyping
  • most applications can run without modification
  • modifications needed
  • a network protocol suited for page handling
  • use of network RAM as secondary storage
  • page fault handler
  • page tables

10
GLUnix(2)
  • Virtual OS layer
  • captures DSM-related interrupts and syscalls
    generated by running application programs
  • how to capture external interrupts?
  • software fault isolation
  • insert code to check before each instruction
    that may cause an interrupt
  • Issues
  • load balancing
  • a parallel computation finishes when its slowest
    component finishes
  • communicating processes
  • lots of small messages

11
IVY
  • the first DSM
  • invalidation-based
  • ownership-based
  • tested three algorithms for finding an owner
  • centralized server
  • fixed distributed manager
  • each node has the ownership of predetermined
    pages
  • dynamic distributed manager
  • the page table of each node records the probable
    owner of each page
  • if that node is not the true owner, it knows where
    to look next (see the sketch below)
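
A sketch in C of the probable-owner chase in the dynamic distributed
manager; the helper names are hypothetical:

    #include <stdbool.h>

    typedef unsigned long vpn_t;

    extern int  self(void);
    extern bool owns(int node, vpn_t vpn);
    extern int  probable_owner_of(int node, vpn_t vpn); /* per-node table */

    int find_owner(vpn_t vpn)
    {
        int node = probable_owner_of(self(), vpn);
        while (!owns(node, vpn))                  /* not the true owner:  */
            node = probable_owner_of(node, vpn);  /* it knows where to
                                                     look for the next
                                                     guess */
        return node;
    }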

12
IVY(2)
  • Process Migration
  • needed for load balancing
  • sends the PCB to the destination
  • sends the pages containing the stack to the
    destination with write privilege
  • why? those pages will be transferred anyway as
    the stack is accessed on the destination node
  • Good performance
  • runs only applications that suit DSM well
  • some applications even show super-linear speedup

13
Munin
  • DSMized the V kernel
  • Software Release Consistency
  • Multiple Consistency Protocols
  • the performance of a consistency protocol is
    sensitive to
  • the structure of the parallel program
  • the sharing pattern inside the program
  • a protocol is defined for each shared object
  • an annotation is needed on each object to define
    the protocol
  • if it is missing, the default protocol is used

14
Munin Annotation
  • what is defined in an annotation?
  • invalidate vs. update
  • replication allowed?
  • delayed operation allowed?
  • fixed owner?
  • multiple writers allowed?
  • static sharing pattern?
  • if the object is accessed by a single thread in a
    static pattern, updates are sent to that node
    even before the node requests them
  • flush changes to owner?
  • writable?

15
Munin Release Consistency
      thread-A                 thread-B
      lock(A)                  lock(B)
      X = X + 1                Y = Y + 1
      unlock(A)                unlock(B)
      (X and Y are in the same page)
  • when updated pages are flushed to the owner, how
    do we update the home page?
  • introduce a twin page
  • if there is no twin, there is no problem
  • else, the diffs against the twin page are applied
    (see the sketch below)
  • where to keep the twin pages?
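
A minimal sketch of the twin/diff idea in C, assuming 4 KB pages and
word-granularity diffs (Munin's actual representation may differ):

    #include <stddef.h>
    #include <string.h>

    #define PAGE_WORDS (4096 / sizeof(unsigned))

    /* copy the page to a twin before the first write */
    void make_twin(unsigned *twin, const unsigned *page)
    {
        memcpy(twin, page, 4096);
    }

    /* at flush time, apply only the words that differ from the twin,
     * so writers of different words in the same page do not clobber
     * each other's updates at the home copy */
    void flush_diff(unsigned *home, const unsigned *page,
                    const unsigned *twin)
    {
        for (size_t i = 0; i < PAGE_WORDS; i++)
            if (page[i] != twin[i])
                home[i] = page[i];
    }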

16
Munin Directory Structure
  • a hash function directs an address to an entry in
    the table
  • the entry contains an object description (rendered
    as a hypothetical C struct below)
  • start address and size
  • the protocol defined
  • state: valid, writable, modified, replicated
  • copyset: bitmap? linked list?
  • synchq: pointer to the synch object that governs
    this object
  • probable owner: best guess
  • home node: for bookkeeping
  • access-control semaphore
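
A hypothetical C rendering of the entry fields above; the types and
field widths are guesses for illustration, not Munin's code:

    #include <stddef.h>
    #include <stdint.h>

    struct munin_entry {
        void    *start;            /* start address of the object       */
        size_t   size;
        int      protocol;         /* protocol chosen by annotation     */
        unsigned valid:1, writable:1,
                 modified:1, replicated:1;  /* state                    */
        uint64_t copyset;          /* bitmap of nodes holding a copy    */
        void    *synchq;           /* synch object governing the object */
        int      probable_owner;   /* best guess                        */
        int      home_node;        /* for bookkeeping                   */
        int      access_sem;       /* access-control semaphore (id)     */
    };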

17
Munin(3)
  • Merging Sync with Data Transfer
  • a message transfer is expensive in DSM
  • most sync operations are used to control shared
    data
  • so let's merge them into a single message
  • how do we know which sync object governs which
    data object?
  • the programmer knows
  • it can be declared at variable declaration
  • when a lock is released to a node
  • the data it governs is sent along with it

18
Shasta
  • Motivations
  • fine-grain sharing will reduce false sharing
  • run binary executables
  • most commercial software is distributed this
    way
  • insert checks into application executables (like
    Blizzard-S) at loads and stores
  • ordinary overhead: 50-150%
  • support an SMP as a node
  • Virtual address space
  • conventional space is private
  • code, static data, stack
  • shared space is dynamically allocated (following
    the convention of SPLASH)

19
Shasta Coherence Protocol
  • three states
  • invalid, shared, exclusive
  • directory-based invalidation
  • a home node is assigned for each virtual page
  • the owner node is the last node that updated the
    page
  • the directory contains
  • a pointer to the owner
  • a full-map bit vector of all sharers
  • coherence unit and coherence information
  • blocks (multiples of lines): tracked by directory
    information
  • lines (64-128 bytes): tracked by the state table

20
Shasta(2)
  • Polling instead of interrupts for coherence
    actions
  • polling is much more efficient (only 3
    instructions)
  • simplifies the concurrency problems of handling a
    miss
  • while a miss is being checked, messages related to
    the miss can arrive
  • places to insert polling code
  • wherever the protocol waits for a message
  • depending on the desired response time
  • at every function call
  • at every loop back-edge

21
Shasta Shared Miss Check
  • each load and store should be checked to see if it
    is a miss
  • instructions that need not be checked
  • private and stack accesses
  • addresses calculated from the above addresses
  • check whether they use registers holding private
    data
  • normal operation
  • 1. check whether the target address is in the
    shared region
  • 2. if so, look up the state in the state table
  • 3. if needed, call the miss handling routine

22
Inserted Code for Store Check
    miss check for a store:

    1. lda   rx, offset(base)
    2. srl   rx, 39, ry
    3. beq   ry, nomiss
    4. srl   rx, 6, rx
    5. ldq_u ry, 0(rx)
    6. extbl ry, rx, ry
    7. beq   ry, nomiss
    8. call  miss_handler

    (figure: address-space layout showing the shared region, the
    state table, and the static data, text, and stack areas)
23
Inserted Code for Store Check(2)
  • no register save/restore
  • use unused registers
  • if no unused ones can be found, insert code to
    secure two registers
  • instructions 2 and 6 require smart address-space
    allocation
  • shared region
  • state table
  • operations (the same logic in C appears below)
  • 1. calculate the effective address of the target
  • 2-3. check whether the target address is within
    the shared region
  • 4. calculate the (byte) address in the state
    table for the target address
  • 5-6. extract the state information
  • 7. if it is 0 (exclusive), go to nomiss
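
The same logic written in C, as a sketch only: in Shasta the shifted
address itself points into the state table thanks to the smart
address-space allocation, whereas this version indexes an explicit
state_table array:

    #include <stdint.h>

    extern uint8_t *state_table;       /* one state byte per line */
    extern void miss_handler(void *addr);

    void store_check(void *target)
    {
        uintptr_t a = (uintptr_t)target;   /* instr. 1: effective address */
        if ((a >> 39) == 0)                /* instr. 2-3: outside the     */
            return;                        /* shared region: nomiss       */
        uint8_t state = state_table[a >> 6]; /* instr. 4-6: line state    */
        if (state == 0)                    /* instr. 7: 0 means exclusive */
            return;                        /* nomiss                      */
        miss_handler(target);              /* instr. 8 */
    }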

24
Shasta Optimization
  • code rescheduling
  • a shift instruction needs 2 cycles to generate its
    result
  • branch delay slots can be filled with the check
    code
  • rx and ry are unused registers, so there is no
    dependency
  • load checks (sketched below)
  • when a line becomes invalid, store a fixed flag
    value into each long word of the line
  • for a load check, compare the loaded long word
    with the flag value
  • if equal, call the miss handling routine
  • else, continue
  • the flag value should not be one that is used
    frequently in normal computation
  • not zero, not a small positive integer
  • 253 was chosen in Shasta
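
A sketch of the flagged load check (hypothetical names). A genuine
data value equal to the flag only costs a slow-path call: the handler
consults the state table and returns if there is no real miss:

    #define FLAG_VALUE 253          /* the value chosen in Shasta */

    extern void load_miss_handler(long *addr);  /* checks the state table,
                                                   fetches the line if the
                                                   state is truly invalid */
    long checked_load(long *addr)
    {
        long v = *addr;
        if (v == FLAG_VALUE)        /* probably an invalid line */
            load_miss_handler(addr);
        return *addr;               /* reload after a possible fetch */
    }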

25
Shasta Optimization(2)
  • store checks
  • a separate exclusive bit (1 bit) per line is
    maintained
  • a table of such bits occupies little space in
    the data cache, reducing cache misses when looking
    up the state table
  • batching miss checks
  • if several instructions touch the same line, one
    check is enough for all of them
  • multiple granularity
  • allow applications to define the block size in the
    malloc() call

26
Alpha LL and SC
  • synchronization primitives of the Alpha
  • a lock_flag and a lock_address per processor
  • operations
  • LL sets the lock_flag and the lock_address
  • the lock_flag is reset if another processor
    writes in its own cache at the lock_address
  • SC succeeds if the lock_flag is still set
  • an exact implementation would be expensive
    (inefficient) for Shasta
  • Alpha programming recommendations
  • for each SC there is a unique LL
  • no store or load between the LL and the SC
  • the LL and the SC refer to the same line

27
Shasta Approach to LL and SC
  • before LL
  • save the state of the line in a register
  • get the latest copy if the state is invalid
  • before SC
  • if the saved state is exclusive, OK
  • if invalid, return failure
  • if shared, send a special message to the home node
  • at the home node
  • if the requester is still a sharer, send OK
  • else send failure (the sequence is sketched below)
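
The sequence above, sketched in C; the states, helpers, and the
single saved-state variable are all assumptions:

    enum line_state { INVALID, SHARED, EXCLUSIVE };

    extern enum line_state line_state_of(void *addr);
    extern void fetch_latest(void *addr);
    extern int  home_still_sharer(void *addr); /* message to the home node */

    static enum line_state saved;    /* "register" holding the LL state */

    void emulated_ll(void *addr)
    {
        saved = line_state_of(addr);
        if (saved == INVALID)
            fetch_latest(addr);      /* get the latest copy */
    }

    int emulated_sc(void *addr)
    {
        if (saved == EXCLUSIVE) return 1;    /* OK   */
        if (saved == INVALID)   return 0;    /* fail */
        return home_still_sharer(addr);      /* shared: ask the home node */
    }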

28
Memory Barrier
  • a fence operation that forces all pending
    operations of the processor to be globally
    performed
  • inefficient, because everything pending is blindly
    completed at the MB
  • Shasta at an MB
  • finishes all the pending operations
  • the hardware also executes an MB operation for the
    SMP case

29
System Calls
  • validating arguments
  • what if the arguments are in the shared region?
  • copy the arguments from the shared region to local
    memory
  • an expensive operation
  • or validate arguments using a wrapper
  • make sure the arguments are in proper states
  • supporting multiple clusters
  • replace all related calls with Shasta's calls
  • process management
  • shared memory management
  • threads that share an address space
  • need inline checking even for accesses to private
    data and the stack, which is expensive
  • a page-based protocol can be used for the stack
  • access to remote files
  • a distributed file system is needed

30
Process Handling
  • processes are created/terminated dynamically
  • Issues
  • data and state information owned by a terminated
    process
  • more processes than processors
  • inactive ones may delay the servicing of
    requests from other processes
  • solution
  • a daemon process per processor while the
    application is running
  • it shares all the data with the processes allocated
    to the same processor
  • it runs at low priority and handles messages that
    arrive for its peer processes

31
Code modification
  • when?
  • at load time
  • it may slow down loading
  • but you have only binaries
  • caching will reduce the number of modifications
    of frequently used code
  • caveat: programs that generate code

32
Page Table for 64 bit OS
  • Motivations
  • the page table is huge for a 64-bit address space
  • an inverted page table is not a solution, due to
    increased physical memory sizes
  • most programs use the address space sparsely
  • Multi-level page tables
  • PTEs are structured into an n-ary tree
  • significantly reduces the PTEs for unused address
    space
  • when the height of the tree is large
  • too many memory references to find a PTE
  • when it is too small
  • we lose the benefit of multiple levels

33
Page Table for 64 bit OS
  • Hashed page tables
  • a hash function maps a VPN to a bucket
  • a bucket is a linked list of elements, each
    consisting of
  • a PTE (PPN, attributes, valid bit, ...)
  • the VPN (almost 8 bytes)
  • a next pointer (8 bytes)
  • space overhead: 16 bytes per PTE (see the layout
    below)
  • the next pointer can be eliminated by allocating a
    fixed number of elements per bucket
  • but the overflow problem remains
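
A hypothetical C layout of one bucket element, matching the overhead
accounting above:

    #include <stdint.h>

    struct hpt_element {
        uint64_t vpn;              /* tag: almost 8 bytes             */
        struct hpt_element *next;  /* chain pointer: 8 bytes          */
        uint64_t pte;              /* PPN, attributes, valid bit, ... */
    };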

34
Page Table for 64 bit OS
  • Clustered page table
  • each element of the linked list maps multiple
    pages
  • the VPN
  • a next pointer
  • n PTEs
  • a memory object (in virtual address space) usually
    occupies multiple pages
  • the space overhead of the hashed page table is
    amortized
  • more efficient than a linear table for sparse
    address spaces (layout sketched below)
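
The clustered counterpart: one tag and one pointer now cover n PTEs,
amortizing the 16-byte overhead (CLUSTER_N is an assumed tuning knob):

    #include <stdint.h>

    #define CLUSTER_N 16           /* pages mapped per element (assumed) */

    struct cpt_element {
        uint64_t vbn;              /* virtual block number (VPN / n) */
        struct cpt_element *next;
        uint64_t pte[CLUSTER_N];   /* n consecutive PTEs             */
    };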

35
Clustered Page Table
  • Operations
  • adding a PTE
  • hashed: memory allocation, list insertion, and
    PTE initialization per new PTE
  • clustered
  • memory allocation and list insertion once per n
    PTEs
  • initialization for each PTE
  • modifying a PTE
  • modification is done for a memory object, not for
    a page, so the clustered scheme is more efficient
  • synchronization
  • many threads use the page table concurrently
  • a cluster lock for a group of pages
  • reduces concurrency
  • but less blocking overhead
  • can support finer granularity with some overhead

36
TLB Issues
  • can this scheme support new TLB technologies such
    as superpages and subblocking?
  • Superpages
  • a superpage is 2^n times the base page size
  • each TLB entry must have a size field
  • why not segmentation?
  • complex, because a segment's size is arbitrary and
    it starts at an arbitrary location
  • reduces TLB misses, since each entry maps a wider
    region
  • good for frame buffers, kernel data, DB buffer
    pools
  • how about the file cache?

37
TLB Issues(2)
  • Subblocking
  • put multiple PPNs in a TLB entry
  • it may waste TLB space
  • partial subblocking
  • physical memory is aligned, so
  • one PPN per TLB entry
  • multiple valid bits are needed
  • Clustered page table
  • just needs a field to indicate whether a list
    element is for a normal cluster, a partial
    subblock, or a superpage
  • the mechanisms for the operations are naturally
    similar (entry layouts sketched below)
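
Hypothetical TLB-entry layouts illustrating the two techniques; the
field names and widths are assumptions:

    #include <stdint.h>

    struct tlb_entry_superpage {
        uint64_t vpn;
        uint64_t ppn;
        uint8_t  size_log2;        /* superpage covers 2^n base pages */
    };

    struct tlb_entry_partial_subblock {
        uint64_t vpn;              /* tag for a block of subpages        */
        uint64_t ppn;              /* one PPN: physical block is aligned */
        uint16_t valid_bits;       /* one valid bit per subpage          */
    };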

38
Application Controlled MM
  • why user-controlled anything?
  • computing usage is too diverse
  • multimedia data
  • real time
  • personal computing
  • large scientific applications (usually parallel
    computing)
  • a general OS cannot satisfy the needs that come
    from such diversity
  • user-controlled what?
  • almost any part of the OS: scheduling, memory
    management, file system, network protocol,
    security, ...

39
Application Controlled MM
  • Mechanisms for User Control
  • microkernel approach
  • the parts of the OS that need to be customized are
    provided as user processes, or
  • they are provided as library functions that can
    be bound into applications (ExoKernel)
  • modular but inefficient
  • binaries loadable into the OS
  • needs a dynamic linking method inside the OS
  • efficient but insecure

40
External Pager
  • Motivations: applications don't know and can't
    control
  • the amount of memory available to them
  • some programs can use as much memory as they are
    given
  • which parts are kept in memory
  • some data accesses are predictable
  • Some solutions
  • allow applications to pin pages in memory
  • takes away the OS's freedom to manage memory
  • hard to know how much to pin
  • allow applications to advise the VM system
  • the madvise() system call
  • a very primitive yet complex mechanism

41
External Pager
  • segment manager
  • a user-level pager
  • reclaims page frames
  • writes back page frames
  • on a page fault
  • the kernel forwards the event to the segment
    manager
  • via a signal or interrupt
  • the manager reclaims a page frame
  • and may write back a page
  • needs to maintain a list of free pages

42
External Pager
  • system calls
  • SetSegmentManager(seg, manager)
  • specify the manager of a segment
  • MigratePages(srcSeg, dstSeg, flags)
  • move pages from one segment to another
  • ModifyPageFlags(seg, flags)
  • set/clear the dirty bit, protection
  • GetPageAttribute(seg, pages)
  • determine the flags and mappings of pages
  • the manager can be part of the application
  • recursive page faults may occur
  • so a manager pins its stack in memory (a usage
    sketch follows)
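
A hedged usage sketch of the interface above; the call names come
from the slide, but the exact signatures, types, and the
PF_CLEAR_DIRTY flag are assumptions:

    extern int SetSegmentManager(int seg, int manager);
    extern int MigratePages(int srcSeg, int dstSeg, int flags);
    extern int ModifyPageFlags(int seg, int flags);
    extern int GetPageAttribute(int seg, int pages);

    #define PF_CLEAR_DIRTY 0x1     /* hypothetical flag value */

    void handle_fault(int app_seg, int free_seg)
    {
        /* take a frame from the free-page segment this manager owns
         * and move it into the faulting application segment */
        MigratePages(free_seg, app_seg, 0);
        ModifyPageFlags(app_seg, PF_CLEAR_DIRTY); /* e.g. after writeback */
    }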

43
External Pager
  • how a manager gets a free page
  • from a free-page segment that it manages
  • by reclaiming a page from another segment it
    manages
  • by requesting an additional page from the kernel
  • System page cache manager
  • the controller of the machine's global memory
    pool
  • segment managers get their segments from it
  • it may approve, deny, or partially fulfill
    requests

44
Memory Market Model
  • until now, scheduling has been about the time a
    program uses
  • with multiprocessors, memory will be more
    contended than the CPU
  • charge each process for space_used x time
  • an application requests an amount of DRAM
    initially
  • the kernel allocates DRAM according to the system
    status and the user request
  • applications choose whether they want a lot of
    memory so as to execute fast, OR
  • little memory, when the job is not very urgent
  • questions remain
  • interactions with CPU scheduling

45
Summary
  • the external pager is a trend
  • the OS should provide abstractions of the hardware
    that are complete in
  • functionality (don't hide useful functionality)
  • performance (don't hide performance)
  • other user-controlled approaches
  • gang scheduling on parallel machines
  • scheduler activations
  • user-level devices and file systems