Title: Chapters 6 and 8 (selections): Virtual Memory and Parallel Processing
1. Chapters 6 and 8 (selections): Virtual Memory and Parallel Processing
- CS 271 Computer Architecture
- Indiana University-Purdue University Fort Wayne
2. The Operating System Machine Level
- This level is also known as the OSM level
- The OSM level consists of . . .
  - Conventional ISA-level machine language instructions
  - Additional OSML instructions
    - New conventional machine instructions reserved for use by the operating system
    - Calls to operating system service routines (API calls)
      - For example, a call to support reading a file
- We will focus on three areas
  - Virtual memory
  - Process concept
  - Parallel computer architectures
3. Virtual memory
- The traditional solution to the problem of not enough memory was overlays
  - The programmer would break a program into pieces called overlays
  - Each overlay was small enough to fit into memory
  - The first overlay was brought in
  - When done, it was responsible for reading in the next overlay
  - The programmer was responsible for all the details
- Virtual memory allows the operating system to use the hard disk to make whatever RAM is installed appear to expand to the size of the address space allowed by the processor
4. Virtual memory
- The virtual address space of a computer is the set of addresses that make sense at the conventional machine level
  - Typically depends on the number of bits used for addresses
  - Examples
- The physical address space consists of the RAM memory addresses that are actually installed
5. Virtual memory
- Virtual addresses in a program must be mapped to physical addresses dynamically
  - This means during run-time
- This requires a memory map
  - A table relating a virtual address to the corresponding physical address
- Two common techniques are used
  - Paging
  - Segmentation
- We will only consider paging
6. Paged virtual memory
- The virtual address space is divided into 2^m pages of fixed size 2^n
  - m + n = the number of virtual address bits
- Virtual address format: [ page number (m bits) | displacement within page (n bits) ]
- Physical memory is logically divided into 2^k page frames of the same fixed size 2^n
  - Of course, k < m
- Any page may be loaded into any available page frame
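The page-number / displacement split above is plain bit arithmetic, sketched below. The function name and the m = 4, n = 12 example values are assumptions for illustration, not from the slides.

```python
# Illustrative sketch: split an (m + n)-bit virtual address into its
# page number (high m bits) and displacement within the page (low n bits).

def split_virtual_address(va: int, m: int, n: int) -> tuple[int, int]:
    """Return (page_number, displacement) for an (m + n)-bit virtual address."""
    assert 0 <= va < (1 << (m + n)), "address wider than m + n bits"
    displacement = va & ((1 << n) - 1)   # low n bits: displacement within page
    page_number = va >> n                # high m bits: page number
    return page_number, displacement

# Example: m = 4 (16 pages), n = 12 (4096-byte pages)
p, d = split_virtual_address(0x5ABC, m=4, n=12)
print(p, hex(d))  # 5 0xabc
```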
7. Pages and page frames
8. Paged virtual memory
- The operating system maintains a page table for each process
- The page table consists of page table entries (or PTEs)
  - Entry p in the table is the PTE of page p
  - The PTE of page p gives the page frame number where the page is loaded
9. The page table and PTEs
- The present / absent bit indicates if the page has been loaded
  - 0 = not loaded
  - 1 = loaded
- Sometimes called a residence bit or a valid bit
10. The page table and PTEs
- Pages 2, 4, 7, 9, 10, 12, 13, and 15 are not presently loaded into memory
11. Address translation
- Address translation refers to mapping virtual addresses to physical addresses
- This is done dynamically by the memory management unit (MMU)
- Given a virtual address, let . . .
  - p be the page number
  - d be the displacement within the page
12. Address translation
- The MMU does the following
  - Uses p to index into the page table to fetch the PTE
  - If the residence bit is 1, then extract the frame number f
    - This is a k-bit number
    - The physical address is [ frame number f (k bits) | displacement d within page (n bits) ]
  - If the residence bit is 0, generate a page fault
    - This is another type of internal interrupt, similar to a trap
    - It is not fatal, but just temporarily blocks the process
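The translation steps above can be sketched as follows. The page-table layout (a list of `(resident, frame)` pairs) and the `PageFault` exception are simplifications assumed for illustration, not a real MMU interface.

```python
# Sketch of MMU address translation with n displacement bits per page.

class PageFault(Exception):
    """Raised when the referenced page is not resident (residence bit = 0)."""

def translate(va, page_table, n):
    """Map a virtual address to a physical address or raise a page fault."""
    p = va >> n                      # page number: index into the page table
    d = va & ((1 << n) - 1)          # displacement within the page
    resident, frame = page_table[p]  # fetch the PTE
    if not resident:
        raise PageFault(p)           # internal interrupt; OS must load the page
    return (frame << n) | d          # physical address = frame number || displacement

# Page 0 is resident in frame 3; page 1 is absent
table = [(True, 3), (False, None)]
print(hex(translate(0x00A, table, n=12)))  # 0x300a
```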
13. Diagram from Stallings, Operating Systems: Internals and Design Principles, 4th ed., Prentice-Hall (2001)
14. Page fault handler
- The operating system page fault handler does the following
  - Locates an empty page frame
  - Finds the disk address for the missing page in a directory
  - Activates the DMA to copy the needed page from the disk into the empty page frame
  - Calls the operating system dispatcher routine to switch to another process
    - This allows the processor to do something useful while the DMA is working
15. DMA interrupt handler
- When the DMA has completed the transfer, it issues an interrupt
- The interrupt handler for the DMA does the following
  - Changes the residence bit in the PTE for the page to 1
  - Places the new frame number in the PTE frame field
  - Schedules the process that caused the page fault for later activation
- When the process resumes, it tries address translation as before and this time succeeds
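The fault-service sequence of the last two slides can be sketched as below. The dictionaries standing in for memory, disk, and the PTE are assumptions for illustration, and the DMA transfer is modeled as a synchronous copy rather than a true asynchronous transfer.

```python
# Sketch of the page-fault handler followed by the DMA interrupt handler.

memory = {}                              # frame number -> page contents

def handle_page_fault(pte, page, free_frames, disk):
    """OS page fault handler: find a frame, copy the missing page in."""
    frame = free_frames.pop()            # locate an empty page frame
    memory[frame] = disk[page]           # directory lookup + DMA copy from disk
    # (a real OS dispatches another process while the DMA is working)
    dma_complete(pte, frame)             # modeled here as an immediate completion

def dma_complete(pte, frame):
    """DMA interrupt handler: mark the page resident, record its frame."""
    pte["resident"] = True               # change the residence bit to 1
    pte["frame"] = frame                 # place the new frame number in the PTE
    # the faulting process would now be scheduled for later activation

pte = {"resident": False, "frame": None}
handle_page_fault(pte, page=7, free_frames=[2], disk={7: "page 7 data"})
print(pte)  # {'resident': True, 'frame': 2}
```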
16. Paged virtual memory
- Paging is transparent to the OSM level user
- It is implemented at the ISA level
- New ISA hardware must provide an automatic mechanism (the MMU hardware) to
  - Either translate the (m + n)-bit virtual address to a (k + n)-bit physical address
  - Or generate a page fault
- This requires an additional memory cycle for each memory reference in order to fetch the needed PTE
  - More hardware may also be needed to facilitate this
17. Paged virtual memory
- When a page fault occurs, all page frames are typically full
- To make room for the needed page, one of the currently loaded pages must be sent back to the disk
- How the unlucky page is chosen is determined by a page replacement policy
- The ideal choice
  - Choose the page that will be needed the farthest in the future, if at all
- Some page replacement algorithms
  - LRU (Least Recently Used)
  - FIFO (First In, First Out)
18. LRU page replacement
- Swap out the Least Recently Used page
- This method performs well
- To implement LRU, time-stamp each page frame whenever it is referenced
  - A practical way to do this is to have a special memory cell associated with each page frame
  - For every reference, increment a global counter and copy it into the special cell of the associated frame
  - When a page needs to be replaced, the page fault handler searches for the frame with the lowest counter
- This involves overhead and costly hardware
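The counter-based LRU scheme above can be sketched in software. The class and method names are illustrative assumptions; real hardware would maintain the per-frame cells itself.

```python
# Sketch of LRU replacement: a global counter is copied into a per-frame
# cell on every reference; the victim is the frame with the lowest counter.

class LRUFrames:
    def __init__(self, num_frames):
        self.clock = 0                   # global reference counter
        self.stamp = [0] * num_frames    # the special cell of each page frame

    def reference(self, frame):
        """Called on every memory reference to the page in `frame`."""
        self.clock += 1
        self.stamp[frame] = self.clock   # copy the counter into the frame's cell

    def victim(self):
        """Frame holding the least recently used page = lowest counter."""
        return min(range(len(self.stamp)), key=self.stamp.__getitem__)

lru = LRUFrames(3)
for f in [0, 1, 2, 0, 2]:                # frame 1 is referenced only once
    lru.reference(f)
print(lru.victim())  # 1
```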
19. FIFO page replacement
- Swap out the oldest loaded page
- Implement FIFO as follows
  - Add a counter field to each PTE
  - When a new page is loaded, set the PTE counter to 0
  - Whenever a page fault occurs, the page fault handler iterates through the page table and increments all of the PTE counters
  - While doing this, the handler makes note of the page with the highest counter and chooses that page for replacement
- Implementation is much simpler than LRU
- But how would FIFO work in a grocery store?
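The counter-per-PTE FIFO scheme above can be sketched as follows; the dict standing in for the page table is an assumption for illustration.

```python
# Sketch of FIFO replacement: each PTE counter starts at 0 when its page is
# loaded and is incremented on every fault; the resident page with the
# highest counter is the oldest and becomes the victim.

def on_load(counters, page):
    counters[page] = 0                   # new page: set the PTE counter to 0

def on_fault(counters):
    """Age every resident page; return the oldest as the replacement victim."""
    victim = None
    for page in counters:
        counters[page] += 1              # increment all of the PTE counters
        if victim is None or counters[page] > counters[victim]:
            victim = page                # note the page with the highest counter
    return victim

counters = {}
on_load(counters, 3)                      # pages loaded in the order 3, 5, 8,
on_fault(counters); on_load(counters, 5)  # with a fault before each new load
on_fault(counters); on_load(counters, 8)
print(on_fault(counters))  # 3 -- the first page loaded is replaced first
```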
20. Dirty pages
- If a page that is to be replaced has not been modified (written), it need not be copied back to disk
  - The disk copy is an identical clean copy
- If the page has been modified, the page is dirty and the disk copy is out of date
  - The page in memory must be copied back to disk
- Bookkeeping
  - Include an extra dirty bit in the PTE
  - Initialize the dirty bit to 0
  - Set the bit to 1 whenever there is a write to the page
    - This could be implemented by the microcode for memory writes
  - After a page fault, this bit determines whether the page needs to be copied back to disk
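The dirty-bit bookkeeping above can be sketched as below. The PTE is modeled as a dict and `writeback` stands in for the copy-back to disk; both are assumptions for illustration.

```python
# Sketch of dirty-bit bookkeeping for page replacement.

def load_page(pte):
    pte["dirty"] = False                 # initialize the dirty bit to 0

def write_page(pte):
    pte["dirty"] = True                  # every write sets the bit to 1
    # (in the slides' scheme, the microcode for memory writes would do this)

def evict(pte, page, writeback):
    """On replacement, copy the page back to disk only if it was modified."""
    if pte["dirty"]:
        writeback(page)                  # memory copy is newer than the disk copy
    # a clean page needs no write-back: the disk copy is identical

written = []
pte = {}
load_page(pte)
evict(pte, 4, written.append)            # clean page: nothing written back
write_page(pte)
evict(pte, 4, written.append)            # dirty page: copied back to disk
print(written)  # [4]
```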
21. Parallel processing
- A large problem may sometimes be solved by distributing the computations over many CPUs that work on the problem simultaneously
- The best way to organize the activity is to decompose it into separate independent processes
- A process can be thought of as a running program together with all of its state information
  - A process can be interrupted at any point and resumed later
- Each process runs on only one processor at a time
  - At least in the simple case of a process consisting of a single thread
- A process can jump from one processor to another
22. Processes
- Typically, many processes are concurrently active on a computer
- Each process gives the illusion of running on a separate OSML computer
- OSML instructions allow process . . .
  - Creation
  - Termination
  - Memory sharing and synchronization
  - Blockage
  - Scheduling
  - Etc.
23. Processes
- The operating system needs to maintain state information for each process
  - PSW
  - PC
  - Registers
  - Stack
  - Allocated address space (memory)
  - Privilege
  - Pending I/O activity
  - Device ownership
  - etc.
24. Process states
- At a given time, a process may be running, ready, or blocked
- State transitions:
  - Ready -> Running: dispatch
  - Running -> Ready: time-out
  - Running -> Blocked: block (event wait)
  - Blocked -> Ready: wake-up (event completion)
25. Process states
- The ready state involves a queue of waiting processes
- A process makes a number of state transitions whenever there is a . . .
  - Page fault
  - Time-out
  - I/O request
- A transition to the blocked state is the only transition that a process itself initiates
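The three-state model and its transitions can be sketched as a small transition table. The event names follow the diagram on slide 24; the table representation itself is an illustrative assumption.

```python
# Sketch of the running / ready / blocked process model as a transition table.

TRANSITIONS = {
    ("Ready",   "dispatch"): "Running",
    ("Running", "time-out"): "Ready",
    ("Running", "block"):    "Blocked",  # the only self-initiated transition
    ("Blocked", "wake-up"):  "Ready",
}

def step(state, event):
    """Apply one transition; an impossible (state, event) pair raises KeyError."""
    return TRANSITIONS[(state, event)]

state = "Ready"
for event in ["dispatch", "block", "wake-up", "dispatch", "time-out"]:
    state = step(state, event)
print(state)  # Ready
```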
26. Concurrent processes
- Asynchronous concurrent processes . . .
  - May collaborate on an application
  - Need to communicate
  - Need to synchronize
- Concurrent processes may run on . . .
  - A single shared processor
    - Simulated parallel processing
  - Separate processors
    - True parallel processing
27. Single processor execution
- Simulated parallel processing on a single processor is implemented using time slicing
- A time slice is the maximum time increment a process may run before being interrupted
28. Single processor execution
- Each time slice terminates with a timer interrupt
- The interrupt handler . . .
  - Saves the state of the interrupted process
  - Enqueues the interrupted process in the ready queue
  - Dequeues the next process to run from the ready queue
  - Loads the state of the new process
  - Transfers to the new process
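The timer-interrupt sequence above can be sketched with the ready queue as a deque of process names. Saving and loading register state is reduced to moving the name; everything here is an illustrative assumption.

```python
# Sketch of round-robin time slicing driven by timer interrupts.

from collections import deque

def timer_interrupt(running, ready_queue):
    """Handle one time-slice expiry; return the process that runs next."""
    ready_queue.append(running)          # save state + enqueue interrupted process
    return ready_queue.popleft()         # dequeue next process, load its state,
                                         # and transfer control to it

ready = deque(["B", "C"])
running = "A"
for _ in range(3):                       # three time slices elapse
    running = timer_interrupt(running, ready)
print(running, list(ready))  # A ['B', 'C'] -- round-robin cycles back to A
```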
29. Multiple processor execution
- Symmetric multiprocessing (SMP)
  - Multiple processors share a common memory
  - Each processor is equivalent
- If there are more processes than processors, then the CPUs must simulate parallelism with time slicing
- (figure: CPU1, CPU2, CPU3, and CPU4 sharing a common memory)
30. Parallel computer architectures
- High-level decomposition of parallel architectures
  - (a) On-chip parallelism
  - (b) A coprocessor
  - (c) A multiprocessor
  - (d) A multicomputer
  - (e) A grid
31. Homogeneous multiprocessors on a chip
- (a) A dual-pipeline chip (Pentium 4 hyperthreading)
  - Allows resources (functional units) to be shared
  - Does not scale up well
- (b) A chip with two cores
  - A core is a complete CPU
32. Symmetric multiprocessors (SMP)
- (a) A multiprocessor with 16 CPUs sharing a common memory
- (b) An image partitioned into 16 sections, each being analyzed by a different CPU
33. Multicomputers
- (a) A multicomputer with 16 CPUs, each with its own private memory
- (b) The bit-map image of Fig. 8-17 split up among the 16 memories
34. Taxonomy of parallel computers
35. UMA symmetric multiprocessor architectures
- (a) Without caching
- (b) With caching
- (c) With caching and private memories
36. UMA multiprocessors using crossbar switches
- (a) An 8 x 8 crossbar switch
- (b) An open crosspoint
- (c) A closed crosspoint
37. Message-passing multicomputers
38. Interconnection network topologies
- The heavy dots represent switches (the CPUs and memories are not shown)
- (a) A star
- (b) A complete interconnect
- (c) A tree
- (d) A ring
- (e) A grid
- (f) A double torus
- (g) A cube
- (h) A 4D hypercube
39. Massively parallel processors (MPPs)
- Typical supercomputer
- Use standard CPUs
  - Intel Pentium
  - Sun UltraSPARC
  - IBM PowerPC
- Set apart by a very high-performance proprietary interconnection network
40. BlueGene/L MPP
- The BlueGene/L custom processor chip
- Design goals
  - World's fastest MPP
  - Most efficient in terms of teraflops/dollar and teraflops/watt
- 65,536 dual-processor nodes configured as a 32 x 32 x 64 3-D torus
- Peak 360 teraflops (sustained 280.6 teraflops)
- 1.5 megawatts
- 2,500 square feet of floor space
41. BlueGene/L MPP
42. COWs (Clusters of Workstations)
- A cluster consists of dozens, hundreds, or thousands of PCs or workstations connected over a commercially available network
- Two dominant types
  - Centralized
    - Typically all in one room
  - Decentralized
    - Connected by a LAN or the Internet
    - Google
43. Software metrics
- Real programs achieve less than the perfect speedup indicated by the dotted line
- Data from a multicomputer consisting of 64 Pentium Pro CPUs