1
Chapters 6 and 8 (selections): Virtual Memory and
Parallel Processing
  • CS 271 Computer Architecture
  • Indiana University Purdue University Fort Wayne

2
The Operating System Machine Level
  • This level is also known as the OSM level
  • The OSM level consists of . . .
  • Conventional ISA-level machine language
    instructions
  • Additional OSM-level instructions
  • New conventional machine instructions reserved
    for use by the operating system
  • Calls to operating system service routines (API
    calls)
  • For example, an API call to read a file
  • We will focus on three areas
  • Virtual memory
  • Process concept
  • Parallel computer architectures

3
Virtual memory
  • The traditional solution to the problem of not
    enough memory was overlays
  • The programmer would break a program into pieces
    called overlays
  • Each overlay was small enough to fit into memory
  • The first overlay was brought in
  • When done, it was responsible for reading in the
    next overlay
  • The programmer was responsible for all the
    details
  • Virtual memory lets the operating system use
    the hard disk so that whatever RAM is installed
    appears to expand to the full address space
    allowed by the processor

4
Virtual memory
  • The virtual address space of a computer is the
    set of addresses that make sense at the
    conventional machine level
  • Typically depends on the number of bits used for
    addresses
  • Example with 32-bit addresses, the virtual
    address space is 2^32 bytes (4 GB)
  • The physical address space consists of the RAM
    memory addresses that are actually installed

5
Virtual memory
  • Virtual addresses in a program must be mapped to
    physical addresses dynamically
  • This means during run-time
  • This requires a memory map
  • A table relating a virtual address to the
    corresponding physical address
  • Two common techniques are used
  • Paging
  • Segmentation
  • We will only consider paging

6
Paged virtual memory
  • The virtual address space is divided into 2^m
    pages of fixed size 2^n
  • m + n = the number of virtual address bits
  • Virtual address (layout shown below)
  • Physical memory is logically divided into 2^k
    page frames of the same fixed size 2^n
  • Of course, k < m
  • Any page may be loaded into any available page
    frame

Virtual address layout:  [ page number (m bits) | displacement within page (n bits) ]
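As a minimal C sketch of this split, assuming n = 12 displacement bits (4 KB pages) within a 32-bit virtual address, so m = 20:

    #include <stdint.h>

    #define PAGE_BITS 12u                /* n: displacement bits within a page */
    #define PAGE_SIZE (1u << PAGE_BITS)  /* 2^n = 4096-byte pages              */

    /* Split a 32-bit virtual address into its two fields. */
    static inline uint32_t page_number(uint32_t vaddr)  { return vaddr >> PAGE_BITS; }
    static inline uint32_t displacement(uint32_t vaddr) { return vaddr & (PAGE_SIZE - 1u); }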
7
Pages and page frames
8
Paged virtual memory
  • The operating system maintains a page table for
    each process
  • The page table consists of page table entries (or
    PTEs)
  • Entry p in the table is the PTE of page p
  • The PTE of page p gives the page frame number
    where the page is loaded

9
The page table and PTEs
  • The present / absent bit indicates if the page
    has been loaded
  • 0 = not loaded
  • 1 = loaded
  • Sometimes called a residence bit or a valid bit

10
The page table and PTEs
  • Pages 2, 4, 7, 9, 10, 12, 13, and 15 are not
    presently loaded into memory

11
Address translation
  • Address translation refers to mapping virtual
    addresses to physical addresses
  • This is done dynamically by the memory
    management unit (MMU)
  • Given a virtual address let . . .
  • p be the page number
  • d be the displacement within the page

12
Address translation
  • The MMU does the following
  • Uses p to index into the page table to fetch the
    PTE
  • If the residence bit is 1, then extract the frame
    number f
  • This is a k-bit number
  • The physical address is then formed as shown
    below
  • If the residence bit is 0, generate a page fault
  • This is another type of internal interrupt
    similar to a trap
  • It is not fatal, but just temporarily blocks the
    process

Physical address layout:  [ frame number f (k bits) | displacement d (n bits) ]
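A minimal C sketch of the lookup, continuing the address-split sketch above; the pte_t layout and the page_fault entry point are illustrative assumptions (a real MMU does all of this in hardware):

    #include <stdint.h>

    typedef struct {
        uint32_t present : 1;   /* residence bit: 1 = page is loaded */
        uint32_t frame   : 20;  /* k-bit page frame number           */
    } pte_t;

    extern pte_t page_table[];           /* one PTE per page, indexed by p */
    extern void  page_fault(uint32_t p); /* hypothetical fault entry point */

    /* Translate a virtual address (p, d) to a physical address. */
    uint32_t translate(uint32_t vaddr)
    {
        uint32_t p = page_number(vaddr);   /* from the earlier sketch */
        uint32_t d = displacement(vaddr);
        while (!page_table[p].present)
            page_fault(p);                 /* blocks until the page is loaded */
        return ((uint32_t)page_table[p].frame << PAGE_BITS) | d;
    }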
13
Diagram from Stallings, Operating Systems,
Internals and Design Principles, 4th ed.,
Prentice-Hall (2001)
14
Page fault handler
  • The operating system page fault handler does the
    following
  • Locates an empty page frame
  • Finds the disk address for the missing page in a
    directory
  • Activates the DMA to copy the needed page from
    the disk into the empty page frame
  • Calls the operating system dispatcher routine to
    switch to another process
  • This allows the processor to do something useful
    while the DMA is working
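A C sketch of that sequence; the helper names are hypothetical stand-ins for the four steps, not a real OS API:

    #include <stdint.h>

    extern int  find_free_frame(void);           /* if none is free, a replacement
                                                    policy picks a victim (slide 17) */
    extern long disk_address_of(uint32_t page);  /* look the page up in a directory */
    extern void dma_start_read(long disk_addr, int frame);
    extern void dispatch_another_process(void);

    void page_fault_handler(uint32_t missing_page)
    {
        int  frame = find_free_frame();              /* 1. locate an empty page frame */
        long where = disk_address_of(missing_page);  /* 2. find the disk address      */
        dma_start_read(where, frame);                /* 3. DMA copies disk -> frame   */
        dispatch_another_process();                  /* 4. let the CPU do useful work */
    }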

15
DMA interrupt handler
  • When the DMA has completed the transfer, it
    issues an interrupt
  • The interrupt handler for the DMA does the
    following
  • Changes the residence bit in the PTE for the page
    to 1
  • Places the new frame number in the PTE frame
    field
  • Schedules the process that caused the page fault
    for later activation
  • When the process resumes, it tries address
    translation as before and this time succeeds
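Continuing the page-table sketch, the completion handler finishes the bookkeeping; make_ready and faulting_process are hypothetical names for the scheduler hooks:

    #include <stdint.h>

    extern int  faulting_process(uint32_t page);  /* hypothetical: which process waits  */
    extern void make_ready(int pid);              /* hypothetical: requeue that process */

    /* Runs when the DMA controller interrupts to signal transfer complete. */
    void dma_complete_handler(uint32_t page, int frame)
    {
        page_table[page].frame   = frame;   /* record where the page landed  */
        page_table[page].present = 1;       /* residence bit now 1           */
        make_ready(faulting_process(page)); /* schedule for later activation */
    }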

16
Paged virtual memory
  • Paging is transparent to the OSM level user
  • It is implemented at the ISA level
  • New ISA hardware must provide an automatic
    mechanism (the MMU hardware) to
  • Either translate the (m + n)-bit virtual address
    to a (k + n)-bit physical address
  • Or generate a page fault
  • This requires an additional memory cycle for each
    memory reference in order to fetch the needed PTE
  • More hardware may also be needed to facilitate
    this

17
Paged virtual memory
  • When a page fault occurs, all page frames are
    typically full
  • To make room for the needed page, one of the
    currently loaded pages must be sent back to the
    disk
  • How the unlucky page is chosen is determined by
    a page replacement policy
  • The ideal choice
  • Choose the page that will be needed the farthest
    in the future, if at all
  • Some page replacement algorithms
  • LRU Least Recently Used
  • FIFO First In First Out

18
LRU page replacement
  • Swap out the Least Recently Used page
  • This method performs well
  • To implement LRU, time stamp each page frame
    whenever it is referenced
  • A practical way to do this is to have a special
    memory cell associated with each page frame
  • For every reference, increment a global counter
    and copy it into the special cell of the
    associated frame
  • When a page needs to be replaced, the page fault
    handler searches for the frame with the lowest
    counter
  • This involves overhead and costly hardware
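A software model of that scheme, assuming NUM_FRAMES frames; in practice the counter cell is hardware attached to each frame:

    #include <stdint.h>

    #define NUM_FRAMES 64                   /* assumed frame count for the sketch */

    static uint64_t global_counter;         /* incremented on every reference     */
    static uint64_t last_used[NUM_FRAMES];  /* the "special cell" for each frame  */

    void on_reference(int frame) { last_used[frame] = ++global_counter; }

    int choose_victim_lru(void)             /* frame with the lowest counter      */
    {
        int victim = 0;
        for (int f = 1; f < NUM_FRAMES; f++)
            if (last_used[f] < last_used[victim])
                victim = f;
        return victim;
    }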

19
FIFO page replacement
  • Swap out the oldest loaded page
  • Implement FIFO as follows
  • Add a counter field to each PTE
  • When a new page is loaded, set the PTE counter to
    0
  • Whenever a page fault occurs, the page fault
    handler iterates through the page table and
    increments all of the PTE counters
  • While doing this, the handler makes note of the
    page with the highest counter and chooses that
    page for replacement
  • Implementation is much simpler than LRU
  • But how would FIFO work in a grocery store?
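That counter scheme as a C sketch, reusing the page_table from the earlier sketch (NUM_PAGES is an assumed size):

    #include <stdint.h>

    #define NUM_PAGES 1024u                /* assumed page count for the sketch   */

    static unsigned fifo_age[NUM_PAGES];   /* the counter field added to each PTE */

    void on_page_loaded(uint32_t page) { fifo_age[page] = 0; }

    uint32_t choose_victim_fifo(void)      /* oldest loaded page = highest count  */
    {
        uint32_t victim = NUM_PAGES;       /* sentinel: no candidate yet          */
        for (uint32_t p = 0; p < NUM_PAGES; p++) {
            if (!page_table[p].present)
                continue;                  /* only resident pages age             */
            fifo_age[p]++;                 /* one tick per page fault             */
            if (victim == NUM_PAGES || fifo_age[p] > fifo_age[victim])
                victim = p;
        }
        return victim;
    }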

20
Dirty pages
  • If a page that is to be replaced has not been
    modified (written), it need not be copied back to
    disk
  • The disk copy is an identical clean copy
  • If the page has been modified, the page is
    dirty and the disk copy is stale
  • The page in memory must be copied back to disk
  • Bookkeeping
  • Include an extra dirty bit in the PTE
  • Initialize the dirty bit to 0
  • Set the bit to 1 whenever there is a write to the
    page
  • This could be implemented by the microcode for
    memory writes
  • After a page fault, this bit determines whether
    the page needs to be copied back to disk
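As a sketch, assume the earlier pte_t gains a one-bit dirty field; write_back_to_disk is a hypothetical helper:

    #include <stdint.h>

    extern void write_back_to_disk(uint32_t page);  /* hypothetical: copy the page out */

    void on_memory_write(uint32_t page)  /* e.g. hooked into the write microcode   */
    {
        page_table[page].dirty = 1;      /* assumes pte_t adds: uint32_t dirty : 1 */
    }

    void evict(uint32_t victim)          /* called once a replacement is chosen    */
    {
        if (page_table[victim].dirty)
            write_back_to_disk(victim);  /* dirty: must be copied back to disk     */
        page_table[victim].present = 0;  /* the frame is now free to reuse         */
        page_table[victim].dirty   = 0;
    }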

21
Parallel processing
  • A large problem may sometimes be solved by
    distributing the computations over many CPUs that
    work on the problem simultaneously
  • The best way to organize the work is to
    decompose it into separate, independent processes
  • A process can be thought of as a running program
    together with all of its state information
  • A process can be interrupted at any point and
    resumed later
  • Each process runs on only one processor at a time
  • At least in the simple case of a process
    consisting of a single thread
  • A process can jump from one processor to another

22
Processes
  • Typically, many processes are concurrently active
    on a computer
  • Each process gives the illusion of running on a
    separate OSM-level computer
  • OSM-level instructions allow process . . .
  • Creation
  • Termination
  • Memory sharing and synchronization
  • Blockage
  • Scheduling
  • Etc.

23
Processes
  • The operating system needs to maintain state
    information for each process
  • PSW
  • PC
  • Registers
  • Stack
  • Allocated address space (memory)
  • Privilege
  • Pending I/O activity
  • Device ownership
  • etc.

24
Process states
  • At a given time, a process may be running, ready,
    or blocked

Process state transitions (from the diagram):
  • Ready → Running on dispatch
  • Running → Ready on time-out
  • Running → Blocked on block (event wait)
  • Blocked → Ready on wake-up (event completion)
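A minimal C rendering of those transitions; the names mirror the diagram and are illustrative, not an OS interface:

    typedef enum { RUNNING, READY, BLOCKED } proc_state;

    /* One function per legal transition; any other input is left unchanged. */
    proc_state dispatch(proc_state s) { return s == READY   ? RUNNING : s; }
    proc_state time_out(proc_state s) { return s == RUNNING ? READY   : s; }
    proc_state block   (proc_state s) { return s == RUNNING ? BLOCKED : s; } /* event wait       */
    proc_state wake_up (proc_state s) { return s == BLOCKED ? READY   : s; } /* event completion */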
25
Process states
  • The ready state involves a queue of waiting
    processes
  • A process makes a number of state transitions
    whenever there is a . . .
  • Page fault
  • Time-out
  • I/O request
  • A transition to the blocked state is the only
    transition that a process itself initiates

26
Concurrent processes
  • Asynchronous concurrent processes . . .
  • May collaborate on an application
  • Need to communicate
  • Need to synchronize
  • Concurrent processes may run on . . .
  • A single shared processor
  • Simulated parallel processing
  • Separate processors
  • True parallel processing

27
Single processor execution
  • Simulated parallel processing on a single
    processor is implemented using time slicing
  • A time slice is the maximum amount of processor
    time a process may use before it is preempted

28
Single processor execution
  • Each time slice terminates with a timer interrupt
  • The interrupt handler . . .
  • Saves the state of the interrupted process
  • Enqueues the interrupted process in the ready
    queue
  • Dequeues the next process to run from the ready
    queue
  • Loads the state of the new process
  • Transfers to the new process
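A C sketch of those five steps; the queue and context helpers are hypothetical names for what the slide describes:

    struct process;                               /* opaque process descriptor */

    extern void save_context(struct process *p);  /* registers, PC, PSW, ...   */
    extern void load_context(struct process *p);
    extern void enqueue(struct process *p);       /* tail of the ready queue   */
    extern struct process *dequeue(void);         /* head of the ready queue   */

    void timer_interrupt(struct process *current)
    {
        save_context(current);             /* 1. save the interrupted process   */
        enqueue(current);                  /* 2. put it back on the ready queue */
        struct process *next = dequeue();  /* 3. pick the next process to run   */
        load_context(next);                /* 4. load its state                 */
        /* 5. the return from interrupt transfers control to the new process */
    }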

29
Multiple processor execution
  • Symmetric multiprocessing (SMP)
  • Multiple processors share a common memory
  • Each processor is equivalent
  • If there are more processes than processors, then
    the CPUs must simulate parallelism with time
    slicing

(Diagram: CPU1, CPU2, CPU3, and CPU4 sharing a common memory)
30
Parallel computer architectures
  • High-level decomposition of parallel architectures

  • (a) On-chip parallelism
  • (b) A coprocessor
  • (c) A multiprocessor
  • (d) A multicomputer
  • (e) A grid
31
Homogeneous multiprocessors on a chip
  • (a) A dual-pipeline chip (Pentium 4
    hyperthreading)
  • Allows resources (functional units) to be shared
  • Does not scale up well
  • (b) A chip with two cores
  • A core is a complete CPU

32
Symmetric multiprocessors (SMP)
  • (a) A multiprocessor with 16 CPUs sharing a
    common memory
  • (b) An image partitioned into 16 sections, each
    being analyzed by a different CPU

33
Multicomputers
  • (a) A multicomputer with 16 CPUs, each with its
    own private memory
  • (b) The bit-map image of Fig. 8-17 split up among
    the 16 memories

34
Taxonomy of parallel computers
35
UMA symmetric multiprocessor architectures
  • (a) Without caching
  • (b) With caching
  • (c) With caching and private memories

36
UMA multiprocessors using crossbar switches
  • (a) An 8 x 8 crossbar switch
  • (b) An open crosspoint
  • (c) A closed crosspoint

37
Message-passing multicomputers
  • A generic multicomputer

38
Interconnection network topologies
  • The heavy dots represent switches (the CPUs and
    memories are not shown)
  • (a) A star
  • (b) A complete interconnect
  • (c) A tree
  • (d) A ring
  • (e) A grid
  • (f) A double torus
  • (g) A cube
  • (h) A 4D hypercube

39
Massively parallel processors (MPPs)
  • Typical supercomputer
  • Use standard CPUs
  • Intel Pentium
  • Sun UltraSPARC
  • IBM PowerPC
  • Set apart by a very high-performance proprietary
    interconnection network

40
BlueGene/L MPP
  • The BlueGene/L custom processor
    chip
  • Design goals
  • World's fastest MPP
  • Most efficient in terms of teraflops/dollar and
    teraflops/watt
  • 65,536 dual-processor nodes configured as a 32 x
    32 x 64 3-D torus
  • Peak 360 teraflops (sustained 280.6 teraflops)
  • 1.5 megawatts
  • 2500 square feet floor space

41
BlueGene/L MPP
42
COWs (Cluster of Workstations)
  • A cluster consists of dozens, hundreds, or
    thousands of PCs or workstations connected over a
    commercially available network
  • Two dominant types
  • Centralized
  • Typically all in one room
  • Decentralized
  • Connected by a LAN or the internet
  • Google

43
Software metrics
  • Real programs achieve less than the perfect
    (linear) speedup indicated by the dotted line
  • Data from a multicomputer consisting of 64
    Pentium Pro CPUs