Hybrid Software-Hardware Dynamic Memory Allocator - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Hybrid Software-Hardware Dynamic Memory Allocator


1
Hybrid (Software-Hardware) Dynamic Memory
Allocator
  • Prepared by Mustafa Özgür Akduran
  • Istanbul, 2006
  • Boğaziçi University

2
Outline
  • Introduction
  • Related Research
  • Proposed Hybrid Allocator
  • Complexity and Performance Comparison
  • Conclusion
  • References

3
Introduction
  • Need for
  • Efficient Implementation of Memory Management
    Functions
  • Memory Usage
  • Execution Performance

Dynamic Memory Allocation (DMA)
Garbage Collection
Modern Programming Languages
4
Introduction
Dynamic Memory Management (DMM)
  • Current systems
  • Execution time spent on memory management is 42%
  • Research continues on
  • Good execution performance
  • Memory locality
  • How to get free chunks of memory?
  • Software Allocator
  • Hardware Allocator

Pure Software
Low Cost Allocator
5
Introduction
  • Software Allocator
  • Different Search Techniques to organize available
    chunks of free memory
  • Disadvantage
  • The search can be on the allocator's critical path,
    causing a major performance bottleneck.
  • Hardware Allocator
  • Parallel Search
  • Speed up Memory Allocation
  • Improve Performance
  • Hide execution latency of freeing objects
  • Coalescing of free chunks of memory
  • Disadvantage
  • Potential Hardware Complexity

6
Introduction
A new hybrid software-hardware allocator combining
  • Chang's hardware allocator
  • The PHK (Poul-Henning Kamp) allocation algorithm used
    in the FreeBSD system
The aim is to balance hardware complexity with
performance by using hardware and software
together.
7
Related Research
  • PHK (Poul-Henning Kamp) Allocator
  • The two most popular general-purpose open source
    allocators are
  • 1. Doug Lea's allocator, used in the Linux system
  • 2. PHK, used in the FreeBSD system
  • The difference between them is less than 3% for
    memory-allocation-intensive benchmarks in SPEC
    CPU 2000
  • The PHK allocator was chosen because of its suitability
    for hardware/software co-design
  • FreeBSD (Berkeley Software Distribution) is an
    advanced operating system for x86-compatible
    (including Pentium and Athlon) architectures.
    It is derived from BSD, the version of UNIX
    developed at the University of California,
    Berkeley. It is developed and maintained by a
    large team of individuals.

8
Related Research
  • PHK (Poul-Henning Kamp) Allocator
  • Page-based allocator
  • Each page can only contain objects of one size
  • For a large object, a sufficient number of pages is
    allocated
  • For small objects (less than half a page), the object
    size is padded to the nearest power of 2
  • The allocator keeps a page directory for all
    allocated pages, and at the beginning of each
    small-object page a bitmap of allocation
    information is created
  • While allocating small objects, the PHK allocator
    performs a linear search on the bitmap to find
    the first available chunk in that page, as sketched below
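
A minimal C sketch of this small-object path, using illustrative names and layouts rather than the actual FreeBSD source: the request is padded to a power of two, and the page's bitmap is scanned linearly for the first free chunk (the step the hybrid design later offloads to hardware).

/* Illustrative sketch of PHK-style small-object allocation:
 * sizes are padded to a power of two, each page holds one size
 * class, and a per-page bitmap is scanned linearly. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct page_info {
    size_t   chunk_size;                  /* power-of-two object size */
    size_t   nchunks;                     /* PAGE_SIZE / chunk_size   */
    uint8_t *base;                        /* first byte of the page   */
    uint8_t  bitmap[PAGE_SIZE / 16 / 8];  /* 1 bit per chunk, at most 256 chunks */
};

/* Pad a request to the nearest power of two (minimum 16 bytes). */
static size_t pad_size(size_t n)
{
    size_t p = 16;
    while (p < n)
        p <<= 1;
    return p;
}

/* Linear search of the page bitmap for the first free chunk.
 * Returns NULL if the page is full. */
static void *alloc_from_page(struct page_info *pg)
{
    for (size_t i = 0; i < pg->nchunks; i++) {
        if (!(pg->bitmap[i / 8] & (1u << (i % 8)))) {
            pg->bitmap[i / 8] |= (uint8_t)(1u << (i % 8)); /* mark as used */
            return pg->base + i * pg->chunk_size;
        }
    }
    return NULL;
}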

9
Related Research
  • Chang's Hardware Allocator
  • Based on the buddy system invented by Knuth
  • The buddy memory allocation technique divides
    memory into partitions to satisfy a memory
    request as suitably as possible
  • This system splits memory into halves to try to
    give a best fit (see the sketch below)
  • Compared to the memory allocation techniques
    (such as paging) that modern OSes such as MS
    Windows and Linux use, buddy memory
    allocation is relatively easy to implement and
    does not require a memory management unit
  • Chang's algorithm is the first method based on a
    binary OR-tree and a binary AND-tree.
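
For orientation, a rough C sketch of the buddy bookkeeping only (not Chang's circuit and not code from the paper): a request maps to a power-of-two block order, and a block's buddy sits at its offset XOR the block size, which is what makes splitting into halves and coalescing cheap.

/* Buddy-system bookkeeping sketch: order k means a block of
 * MIN_BLOCK << k bytes. */
#include <stddef.h>

#define MIN_BLOCK 16u          /* smallest unit handed out            */
#define MAX_ORDER 8            /* MIN_BLOCK << MAX_ORDER = 4096 bytes */

/* Smallest order whose block can hold n bytes. */
static int order_for(size_t n)
{
    int k = 0;
    size_t sz = MIN_BLOCK;
    while (sz < n && k < MAX_ORDER) {
        sz <<= 1;
        k++;
    }
    return k;
}

/* A block at offset `off` with order `k` has its buddy at off ^ size;
 * freeing checks the buddy and, if it is also free, merges upward. */
static size_t buddy_of(size_t off, int k)
{
    return off ^ ((size_t)MIN_BLOCK << k);
}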

10
Related Research
OR gates
  • Chang's Hardware Allocator
  • Each leaf node of the OR-tree represents the base
    size of the smallest unit of memory that can be
    allocated
  • The leaves of the OR-tree together represent the
    entire memory
  • The AND-tree has the same number of leaves as the
    OR-tree
  • The input of the AND-tree is generated by a complex
    interconnection network from the OR-tree

11
Related Research
  • Chang's Hardware Allocator
  • OR-tree
  • Determines whether there is a large enough space for
    the allocation request
  • AND-tree
  • Finds the beginning address of that memory chunk
  • Flips the bits corresponding to the memory chunk
    in the bit-vector (a software model follows below)

Bit-vector
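
Below is a plain-C model of what the bit-vector and the two trees compute, under the assumed convention 0 = free and 1 = allocated; the hardware evaluates these reductions in parallel, whereas the model uses loops.

/* Software model of the logical function of the two trees over the
 * bit-vector, for a single object size. */
#include <stdbool.h>
#include <stdint.h>

#define NCHUNKS 256              /* e.g. 4096-byte page / 16-byte objects */

static bool any_free(const uint8_t bits[NCHUNKS])      /* OR-tree role  */
{
    for (int i = 0; i < NCHUNKS; i++)
        if (bits[i] == 0)
            return true;
    return false;
}

static int first_free(const uint8_t bits[NCHUNKS])     /* AND-tree role */
{
    for (int i = 0; i < NCHUNKS; i++)
        if (bits[i] == 0)
            return i;
    return -1;
}

static int allocate_chunk(uint8_t bits[NCHUNKS])        /* flip the bit  */
{
    int i = any_free(bits) ? first_free(bits) : -1;
    if (i >= 0)
        bits[i] = 1;
    return i;                    /* index converts to a byte offset */
}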
12
Related Research
13
Related Research
  • The interconnection between the OR-tree and the
    AND-tree is the most complex part of Chang's
    allocator
  • The interconnection has the same critical path
    delay as the OR/AND-tree
  • The final allocation result is produced by the output
    of the AND-tree through a set of multiplexers
  • The hardware complexity, in terms of the number of
    gates, is O(n log n), where n is the number of
    memory chunks

14
Proposed Hybrid Allocator
Problems of hardware-only and software-only allocators
  • Pure hardware allocators based on the buddy system
  • 1. Complexity of the hardware increases with the
    size of the memory managed
  • 2. Poor object locality
  • Pure software allocators
  • Poor execution performance
15
Proposed Hybrid Allocator
  • Uses a small, fixed amount of hardware to help manage
    the memory
  • The software portion, which is based on the PHK algorithm,
    provides better object locality than the buddy
    system
  • The hardware portion improves the execution performance
    of the software portion

New Hybrid Allocator
16
Proposed Hybrid Allocator
Software portion is responsible for
  • Creating page indexes
  • For large objects (greater than half a page), performing the
    allocation without any assistance from hardware
  • For small objects, locating the bitmap of a page with
    free memory and issuing a search request to the
    hardware

Hardware portion is responsible for
  • Searching the page index (or bitmap) in parallel to
    find a free chunk
  • Marking the bitmap to indicate an allocation
    (a sketch of this split follows below)
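
A hedged C sketch of this division of labour; hw_search_and_mark() and the other helpers are hypothetical placeholders for the hardware request and the PHK-style bookkeeping, not interfaces defined in the paper.

/* Hybrid allocation path: large objects stay in software, small
 * objects hand the bitmap search to the hardware unit. */
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

struct page;                                                  /* page with a bitmap header    */
extern struct page *find_page_with_free(size_t chunk_size);   /* software page-index lookup   */
extern void        *page_base(struct page *pg);
extern void        *alloc_pages(size_t nbytes);               /* large-object, software only  */
extern int          hw_search_and_mark(struct page *pg);      /* hardware: returns chunk index */

static void *hybrid_malloc(size_t n)
{
    if (n > PAGE_SIZE / 2)                      /* large object: no hardware assistance */
        return alloc_pages(n);

    size_t padded = 16;                         /* pad to the nearest power of two */
    while (padded < n)
        padded <<= 1;

    struct page *pg = find_page_with_free(padded);  /* software: locate a page with free memory */
    int idx = hw_search_and_mark(pg);               /* hardware: parallel bitmap search + mark  */
    return (uint8_t *)page_base(pg) + (size_t)idx * padded;
}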
17
Proposed Hybrid Allocator
  • The OR-tree is responsible for determining whether there is a
    free chunk in a page (similar to Chang's system)
  • The AND-tree locates the position of the first
    free chunk in the page (similar to Chang's
    system)
  • Because an OR-tree and an AND-tree are dedicated
    to one object size, the complex interconnections
    between the OR- and AND-trees are not needed (unlike
    Chang's); a software model of the per-size tree pair
    is sketched below
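
A software model of such a dedicated tree pair, assuming the 256-leaf case for 16-byte objects: every internal node records the OR of its children ("subtree has a free chunk"), and the first free chunk is located by descending toward the leftmost free leaf, with no routing between trees of different sizes.

/* Per-size tree pair modelled as a binary reduction over the bitmap. */
#include <stdint.h>

#define LEAVES 256                  /* one bit per 16-byte chunk in a 4096-byte page */

/* tree[1] is the root; node i has children 2*i and 2*i+1; leaves start at LEAVES. */
static uint8_t tree[2 * LEAVES];    /* 1 = subtree contains a free chunk */

static void build(const uint8_t free_bits[LEAVES])
{
    for (int i = 0; i < LEAVES; i++)
        tree[LEAVES + i] = free_bits[i];
    for (int i = LEAVES - 1; i >= 1; i--)
        tree[i] = tree[2 * i] | tree[2 * i + 1];   /* OR reduction */
}

static int first_free_leaf(void)
{
    if (!tree[1])
        return -1;                  /* page is full */
    int i = 1;
    while (i < LEAVES)              /* descend toward the leftmost free leaf */
        i = tree[2 * i] ? 2 * i : 2 * i + 1;
    return i - LEAVES;              /* chunk index within the page */
}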

18
Proposed Hybrid Allocator
  • A MUX uses the opcode to select the address of the bit
    that needs to be flipped
  • If the opcode is alloc, the address from the
    AND-tree is chosen
  • If the opcode is free, the address from the
    request is selected
  • D-latches are used as storage devices into which the
    bitmap is loaded from the page in accordance
    with the allocation size
  • A DEMUX is used to decode the address from the MUX
    (a functional model of this datapath follows below)
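
A functional C model of that datapath (illustrative names, word-wide bitmap instead of latched bits): the opcode selects which address reaches the bit-flipper and whether the bit is set or cleared.

/* Opcode-driven bit flip over the page bitmap. */
#include <stdint.h>

enum opcode { OP_ALLOC, OP_FREE };

/* bitmap: one word per 32 chunks (stands in for the D-latched bits). */
static void flip_bit(uint32_t *bitmap, enum opcode op,
                     int andtree_addr, int request_addr)
{
    int addr = (op == OP_ALLOC) ? andtree_addr   /* MUX: address from AND-tree */
                                : request_addr;  /*      address from request  */
    uint32_t mask = 1u << (addr % 32);           /* DEMUX: decode to one bit   */

    if (op == OP_ALLOC)
        bitmap[addr / 32] |= mask;               /* bit-flipper: mark allocated */
    else
        bitmap[addr / 32] &= ~mask;              /* bit-flipper: mark free      */
}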

19
Proposed Hybrid Allocator
  • Bit-flippers use the decoded address and the
    opcode to determine how to flip a desired bit

Block Diagram of Proposed Hardware Component (For
Page Size 4096 bytes and Object Size 16 bytes)
20
Proposed Hybrid Allocator
  • Overall design of the system with 4096-byte pages
  • For different object sizes, the hardware needed
    to support the bitmap is different
  • In our design, the preselected object sizes range from
    16 bytes to 2048 bytes, and hardware is included to
    support pages for these objects
  • A MUX is used to select the hardware unit that is
    responsible for supporting objects of a given
    size (see the sketch below)
  • The larger the object size, the smaller the
    amount of hardware needed to support the bitmaps
    indicating the availability of chunks in that page
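
An illustrative C sketch of that selection: the padded request size picks one of the eight size classes, and the corresponding unit operates on a bitmap of PAGE_SIZE / object size bits.

/* Map a padded object size to its size class and bitmap width. */
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Size-class index 0..7 for padded sizes 16..2048 bytes. */
static int size_class(size_t padded)
{
    int cls = 0;
    for (size_t s = 16; s < padded; s <<= 1)
        cls++;
    return cls;                       /* 16B -> 0, 32B -> 1, ..., 2048B -> 7 */
}

static size_t bitmap_bits(size_t padded)
{
    return PAGE_SIZE / padded;        /* 256 bits for 16B objects, 2 for 2048B */
}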

21
Proposed Hybrid Allocator
  • With 4096-byte pages, we have 8 different object
    sizes ranging from 16 bytes to 2048 bytes
  • For allocating
  • 2048-byte objects we need a tree with two leaves
  • 16-byte objects we need trees with 256 leaves
  • For the 16-byte object size we need only 255 AND/OR
    gates per tree
  • For the overall system, 1 + 3 + 7 + 15 + 31 + 63 + 127 + 255
    = 502 AND gates and 502 OR gates are needed (see the
    sum below)
  • This is a very small amount compared to the billions of
    transistors available on modern processor chips
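
The 502-gate count can be checked by summing one binary tree per size class, since a tree with 2^k leaves needs 2^k - 1 two-input gates (k runs from 1 for 2048-byte objects to 8 for 16-byte objects); the same total applies to the OR-trees:

\sum_{k=1}^{8}\bigl(2^{k}-1\bigr) \;=\; \bigl(2^{9}-2\bigr) - 8 \;=\; 510 - 8 \;=\; 502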

22
Complexity and Performance Comparison
  • Complexity Comparison
  • Existing hardware allocator designs implement the
    buddy system
  • The amount of hardware used to implement
    a buddy allocator depends on the size of the
    memory
  • That makes buddy-system-based allocators not
    scalable
  • Our design has much lower hardware complexity
    than Chang's (buddy-system) allocator

23
Complexity and Performance Comparison
M = total dynamic memory size, P = page size, S =
smallest allocated object size
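
A hedged reading of these symbols against the earlier complexity claims, assuming the n in Chang's O(n log n) counts memory chunks (n = M/S): Chang's gate count grows with the whole managed memory, while the hybrid's trees only ever cover one page of at most P/S chunks, which is why its hardware cost is fixed.

\text{Chang's allocator: } O\!\left(\tfrac{M}{S}\log\tfrac{M}{S}\right)\text{ gates}
\qquad
\text{Hybrid allocator: } O\!\left(\tfrac{P}{S}\right)\text{ gates (independent of } M\text{)}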
24
Complexity and Performance Comparison
  • Performance Analysis

The hardware-assisted PHK allocator is evaluated on a
conventional CPU model using the SimpleScalar simulation
tool set v2.0
25
Complexity and Performance Comparison
26
Complexity and Performance Comparison
27
Complexity and Performance Comparison
  • We show the reduction in memory management execution
    cycles, normalized to the execution cycles spent on
    memory management functions by the software-only
    allocator
  • The cfrac application shows the best performance
    improvement
  • Its average object size is 8 bytes, which means that most
    pages allocated contain 256 objects (8-byte requests are
    padded to 16 bytes, so a 4096-byte page holds 256 chunks)
  • A linear search over that many objects in the software
    implementation is very slow
  • The hardware speeds up the search, leading to a
    76.2% normalized performance improvement over
    software-only allocation
  • The benchmark espresso, with an average object size of
    250 bytes, shows the least improvement
    using the hybrid allocator
  • Pages allocated for espresso contain fewer than
    20 objects
  • A linear search of 20 objects is not significant,
    and the hardware allocator shows only a 48.0%
    normalized performance improvement
  • The other benchmarks have average object sizes of 16
    bytes to 48 bytes, so their performance gains are
    not as significant as cfrac's, but better than
    espresso's
  • On average, the hybrid allocator reduces
    memory management time by 58.9%. The average
    overall execution speedup of our design
    compared to a software-only allocator
    implementation is 12.7% (see the formula below)
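
Here "normalized performance improvement" is read as the fraction of the software-only allocator's memory-management cycles that the hybrid eliminates, which is consistent with the figures above (76.2% for cfrac, 58.9% on average):

\text{improvement} \;=\; \frac{C_{\text{software-only}} - C_{\text{hybrid}}}{C_{\text{software-only}}}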

28
Conclusion
Our design compared to hardware-only allocators
  • Significantly lower hardware complexity
  • Lower critical path delays
  • Our design has a fixed hardware complexity, which
    depends on the size of a memory page (not
    the total user memory being managed)

Compared to software-only allocators
  • Overall execution performance is 12.7% better on
    memory-intensive benchmarks
  • Memory management efficiency is improved by 58.9%
29
Conclusion
  • Future Work
  • Exploring variable-sized pages such that the
    number of allocated objects is the same in each
    page
  • All the bitmaps would then have the same number of bits
  • Thus, only one pair of AND-tree and
    OR-tree would be needed in the design
  • That would further reduce the hardware complexity
  • This would also improve the memory management
    efficiency of the allocator for large objects

30
References
  • [1] W. Li, S. P. Mohanty and K. Kavi, "A Page-based
    Hybrid (Software-Hardware) Dynamic Memory
    Allocator", IEEE Computer Architecture Letters
    (accepted in July 2006 for a future issue).
  • [2] J. M. Chang and E. F. Gehringer, "A
    High-Performance Memory Allocator for
    Object-Oriented Systems", IEEE Transactions on
    Computers, Mar. 1996, pp. 357-366.
  • [3] P. H. Kamp, "Malloc(3) Revisited",
    http://phk.freebsd.dk/pubs/malloc.pdf
  • [4] D. E. Knuth, The Art of Computer Programming,
    Vol. 1: Fundamental Algorithms, Addison-Wesley,
    1968.
  • [5] D. Burger and T. M. Austin, "The SimpleScalar
    Tool Set, Version 2.0", Tech. Report CS-1342, University
    of Wisconsin-Madison, June 1997.

31
Questions?