Title: Chapters 6 and 8 (selections): Virtual Memory and Parallel Processing
1. Chapters 6 and 8 (selections): Virtual Memory and Parallel Processing
- CS 271 Computer Architecture
- Indiana University-Purdue University Fort Wayne
2. The Operating System Machine Level
- This level is also known as the OSM level
- The OSM level consists of . . .
  - Conventional ISA-level machine language instructions
  - Additional OSML instructions
    - New conventional machine instructions reserved for use by the operating system
    - Calls to operating system service routines (API calls)
      - For example, a call to support reading a file
- We will focus on three areas
  - Virtual memory
  - Process concept
  - Parallel computer architectures
3. Virtual memory
- The traditional solution to the problem of not enough memory was overlays
  - The programmer would break a program into pieces called overlays
  - Each overlay was small enough to fit into memory
  - The first overlay was brought in
  - When done, it was responsible for reading in the next overlay
  - The programmer was responsible for all the details
- Virtual memory allows the operating system to use the hard disk to make whatever RAM is installed appear to expand to the size of the address space allowed by the processor
4. Virtual memory
- The virtual address space of a computer is the set of addresses that make sense at the conventional machine level
  - Typically depends on the number of bits used for addresses
  - Examples
- The physical address space consists of the RAM memory addresses that are actually installed
5. Virtual memory
- Virtual addresses in a program must be mapped to physical addresses dynamically
  - This means during run-time
- This requires a memory map
  - A table relating a virtual address to the corresponding physical address
- Two common techniques are used
  - Paging
  - Segmentation
- We will only consider paging
6. Paged virtual memory
- The virtual address space is divided into 2^m pages of fixed size 2^n
  - m + n = the number of virtual address bits
- Virtual address format: [ page number (m bits) | displacement within page (n bits) ]
- Physical memory is logically divided into 2^k page frames of the same fixed size 2^n
  - Of course, k < m
- Any page may be loaded into any available page frame
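The page-number / displacement split above is plain bit arithmetic, sketched below. The function name and the m = 4, n = 12 example values are assumptions for illustration, not from the slides.

```python
# Illustrative sketch: split an (m + n)-bit virtual address into its
# page number (high m bits) and displacement within the page (low n bits).

def split_virtual_address(va: int, m: int, n: int) -> tuple[int, int]:
    """Return (page_number, displacement) for an (m + n)-bit virtual address."""
    assert 0 <= va < (1 << (m + n)), "address wider than m + n bits"
    displacement = va & ((1 << n) - 1)   # low n bits: displacement within page
    page_number = va >> n                # high m bits: page number
    return page_number, displacement

# Example: m = 4 (16 pages), n = 12 (4096-byte pages)
p, d = split_virtual_address(0x5ABC, m=4, n=12)
print(p, hex(d))  # 5 0xabc
```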
7. Pages and page frames
8. Paged virtual memory
- The operating system maintains a page table for each process
- The page table consists of page table entries (or PTEs)
  - Entry p in the table is the PTE of page p
  - The PTE of page p gives the page frame number where the page is loaded
9. The page table and PTEs
- The present / absent bit indicates if the page has been loaded
  - 0 = not loaded
  - 1 = loaded
- Sometimes called a residence bit or a valid bit
10. The page table and PTEs
- Pages 2, 4, 7, 9, 10, 12, 13, and 15 are not presently loaded into memory
11. Address translation
- Address translation refers to mapping virtual addresses to physical addresses
- This is done dynamically by the memory management unit (MMU)
- Given a virtual address, let . . .
  - p be the page number
  - d be the displacement within the page
12. Address translation
- The MMU does the following
  - Uses p to index into the page table to fetch the PTE
  - If the residence bit is 1, then extract the frame number f
    - This is a k-bit number
    - The physical address is [ frame number f (k bits) | displacement d within page (n bits) ]
  - If the residence bit is 0, generate a page fault
    - This is another type of internal interrupt, similar to a trap
    - It is not fatal, but just temporarily blocks the process
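The translation steps above can be sketched as follows. The page-table layout (a list of `(resident, frame)` pairs) and the `PageFault` exception are simplifications assumed for illustration, not a real MMU interface.

```python
# Sketch of MMU address translation with n displacement bits per page.

class PageFault(Exception):
    """Raised when the referenced page is not resident (residence bit = 0)."""

def translate(va, page_table, n):
    """Map a virtual address to a physical address or raise a page fault."""
    p = va >> n                      # page number: index into the page table
    d = va & ((1 << n) - 1)          # displacement within the page
    resident, frame = page_table[p]  # fetch the PTE
    if not resident:
        raise PageFault(p)           # internal interrupt; OS must load the page
    return (frame << n) | d          # physical address = frame number || displacement

# Page 0 is resident in frame 3; page 1 is absent
table = [(True, 3), (False, None)]
print(hex(translate(0x00A, table, n=12)))  # 0x300a
```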
13. Diagram from Stallings, Operating Systems: Internals and Design Principles, 4th ed., Prentice-Hall (2001)
14. Page fault handler
- The operating system page fault handler does the following
  - Locates an empty page frame
  - Finds the disk address for the missing page in a directory
  - Activates the DMA to copy the needed page from the disk into the empty page frame
  - Calls the operating system dispatcher routine to switch to another process
    - This allows the processor to do something useful while the DMA is working
15. DMA interrupt handler
- When the DMA has completed the transfer, it issues an interrupt
- The interrupt handler for the DMA does the following
  - Changes the residence bit in the PTE for the page to 1
  - Places the new frame number in the PTE frame field
  - Schedules the process that caused the page fault for later activation
- When the process resumes, it tries address translation as before and this time succeeds
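The fault-service sequence of the last two slides can be sketched as below. The dictionaries standing in for memory, disk, and the PTE are assumptions for illustration, and the DMA transfer is modeled as a synchronous copy rather than a true asynchronous transfer.

```python
# Sketch of the page-fault handler followed by the DMA interrupt handler.

memory = {}                              # frame number -> page contents

def handle_page_fault(pte, page, free_frames, disk):
    """OS page fault handler: find a frame, copy the missing page in."""
    frame = free_frames.pop()            # locate an empty page frame
    memory[frame] = disk[page]           # directory lookup + DMA copy from disk
    # (a real OS dispatches another process while the DMA is working)
    dma_complete(pte, frame)             # modeled here as an immediate completion

def dma_complete(pte, frame):
    """DMA interrupt handler: mark the page resident, record its frame."""
    pte["resident"] = True               # change the residence bit to 1
    pte["frame"] = frame                 # place the new frame number in the PTE
    # the faulting process would now be scheduled for later activation

pte = {"resident": False, "frame": None}
handle_page_fault(pte, page=7, free_frames=[2], disk={7: "page 7 data"})
print(pte)  # {'resident': True, 'frame': 2}
```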
16. Paged virtual memory
- Paging is transparent to the OSM level user
- It is implemented at the ISA level
- New ISA hardware must provide an automatic mechanism (the MMU hardware) to
  - Either translate the (m + n)-bit virtual address to a (k + n)-bit physical address
  - Or generate a page fault
- This requires an additional memory cycle for each memory reference in order to fetch the needed PTE
  - More hardware may also be needed to facilitate this
17. Paged virtual memory
- When a page fault occurs, all page frames are typically full
- To make room for the needed page, one of the currently loaded pages must be sent back to the disk
- How the unlucky page is chosen is determined by a page replacement policy
- The ideal choice
  - Choose the page that will be needed the farthest in the future, if at all
- Some page replacement algorithms
  - LRU (Least Recently Used)
  - FIFO (First In, First Out)
18. LRU page replacement
- Swap out the Least Recently Used page
- This method performs well
- To implement LRU, time-stamp each page frame whenever it is referenced
  - A practical way to do this is to have a special memory cell associated with each page frame
  - For every reference, increment a global counter and copy it into the special cell of the associated frame
  - When a page needs to be replaced, the page fault handler searches for the frame with the lowest counter
- This involves overhead and costly hardware
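The counter-based LRU scheme above can be sketched in software. The class and method names are illustrative assumptions; real hardware would maintain the per-frame cells itself.

```python
# Sketch of LRU replacement: a global counter is copied into a per-frame
# cell on every reference; the victim is the frame with the lowest counter.

class LRUFrames:
    def __init__(self, num_frames):
        self.clock = 0                   # global reference counter
        self.stamp = [0] * num_frames    # the special cell of each page frame

    def reference(self, frame):
        """Called on every memory reference to the page in `frame`."""
        self.clock += 1
        self.stamp[frame] = self.clock   # copy the counter into the frame's cell

    def victim(self):
        """Frame holding the least recently used page = lowest counter."""
        return min(range(len(self.stamp)), key=self.stamp.__getitem__)

lru = LRUFrames(3)
for f in [0, 1, 2, 0, 2]:                # frame 1 is referenced only once
    lru.reference(f)
print(lru.victim())  # 1
```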
19. FIFO page replacement
- Swap out the oldest loaded page
- Implement FIFO as follows
  - Add a counter field to each PTE
  - When a new page is loaded, set the PTE counter to 0
  - Whenever a page fault occurs, the page fault handler iterates through the page table and increments all of the PTE counters
  - While doing this, the handler makes note of the page with the highest counter and chooses that page for replacement
- Implementation is much simpler than LRU
- But how would FIFO work in a grocery store?
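The counter-per-PTE FIFO scheme above can be sketched as follows; the dict standing in for the page table is an assumption for illustration.

```python
# Sketch of FIFO replacement: each PTE counter starts at 0 when its page is
# loaded and is incremented on every fault; the resident page with the
# highest counter is the oldest and becomes the victim.

def on_load(counters, page):
    counters[page] = 0                   # new page: set the PTE counter to 0

def on_fault(counters):
    """Age every resident page; return the oldest as the replacement victim."""
    victim = None
    for page in counters:
        counters[page] += 1              # increment all of the PTE counters
        if victim is None or counters[page] > counters[victim]:
            victim = page                # note the page with the highest counter
    return victim

counters = {}
on_load(counters, 3)                      # pages loaded in the order 3, 5, 8,
on_fault(counters); on_load(counters, 5)  # with a fault before each new load
on_fault(counters); on_load(counters, 8)
print(on_fault(counters))  # 3 -- the first page loaded is replaced first
```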
20. Dirty pages
- If a page that is to be replaced has not been modified (written), it need not be copied back to disk
  - The disk copy is an identical clean copy
- If the page has been modified, the page is dirty and the disk copy is out of date
  - The page in memory must be copied back to disk
- Bookkeeping
  - Include an extra dirty bit in the PTE
  - Initialize the dirty bit to 0
  - Set the bit to 1 whenever there is a write to the page
    - This could be implemented by the microcode for memory writes
  - After a page fault, this bit determines whether the page needs to be copied back to disk
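The dirty-bit bookkeeping above can be sketched as below. The PTE is modeled as a dict and `writeback` stands in for the copy-back to disk; both are assumptions for illustration.

```python
# Sketch of dirty-bit bookkeeping for page replacement.

def load_page(pte):
    pte["dirty"] = False                 # initialize the dirty bit to 0

def write_page(pte):
    pte["dirty"] = True                  # every write sets the bit to 1
    # (in the slides' scheme, the microcode for memory writes would do this)

def evict(pte, page, writeback):
    """On replacement, copy the page back to disk only if it was modified."""
    if pte["dirty"]:
        writeback(page)                  # memory copy is newer than the disk copy
    # a clean page needs no write-back: the disk copy is identical

written = []
pte = {}
load_page(pte)
evict(pte, 4, written.append)            # clean page: nothing written back
write_page(pte)
evict(pte, 4, written.append)            # dirty page: copied back to disk
print(written)  # [4]
```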
21. Parallel processing
- A large problem may sometimes be solved by distributing the computations over many CPUs that work on the problem simultaneously
- The best way to organize the activity is to decompose it into separate independent processes
- A process can be thought of as a running program together with all of its state information
  - A process can be interrupted at any point and resumed later
- Each process runs on only one processor at a time
  - At least in the simple case of a process consisting of a single thread
- A process can jump from one processor to another
22. Processes
- Typically, many processes are concurrently active on a computer
- Each process gives the illusion of running on a separate OSML computer
- OSML instructions allow process . . .
  - Creation
  - Termination
  - Memory sharing and synchronization
  - Blockage
  - Scheduling
  - Etc.
23. Processes
- The operating system needs to maintain state information for each process
  - PSW
  - PC
  - Registers
  - Stack
  - Allocated address space (memory)
  - Privilege
  - Pending I/O activity
  - Device ownership
  - etc.
24. Process states
- At a given time, a process may be running, ready, or blocked
- State transitions:
  - Ready -> Running: dispatch
  - Running -> Ready: time-out
  - Running -> Blocked: block (event wait)
  - Blocked -> Ready: wake-up (event completion)
25. Process states
- The ready state involves a queue of waiting processes
- A process makes a number of state transitions whenever there is a . . .
  - Page fault
  - Time-out
  - I/O request
- A transition to the blocked state is the only transition that a process itself initiates
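The three-state model and its transitions can be sketched as a small transition table. The event names follow the diagram on slide 24; the table representation itself is an illustrative assumption.

```python
# Sketch of the running / ready / blocked process model as a transition table.

TRANSITIONS = {
    ("Ready",   "dispatch"): "Running",
    ("Running", "time-out"): "Ready",
    ("Running", "block"):    "Blocked",  # the only self-initiated transition
    ("Blocked", "wake-up"):  "Ready",
}

def step(state, event):
    """Apply one transition; an impossible (state, event) pair raises KeyError."""
    return TRANSITIONS[(state, event)]

state = "Ready"
for event in ["dispatch", "block", "wake-up", "dispatch", "time-out"]:
    state = step(state, event)
print(state)  # Ready
```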
26. Concurrent processes
- Asynchronous concurrent processes . . .
  - May collaborate on an application
  - Need to communicate
  - Need to synchronize
- Concurrent processes may run on . . .
  - A single shared processor
    - Simulated parallel processing
  - Separate processors
    - True parallel processing
27. Single processor execution
- Simulated parallel processing on a single processor is implemented using time slicing
- A time slice is the maximum time increment a process may run before being interrupted
28. Single processor execution
- Each time slice terminates with a timer interrupt
- The interrupt handler . . .
  - Saves the state of the interrupted process
  - Enqueues the interrupted process in the ready queue
  - Dequeues the next process to run from the ready queue
  - Loads the state of the new process
  - Transfers to the new process
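The timer-interrupt sequence above can be sketched with the ready queue as a deque of process names. Saving and loading register state is reduced to moving the name; everything here is an illustrative assumption.

```python
# Sketch of round-robin time slicing driven by timer interrupts.

from collections import deque

def timer_interrupt(running, ready_queue):
    """Handle one time-slice expiry; return the process that runs next."""
    ready_queue.append(running)          # save state + enqueue interrupted process
    return ready_queue.popleft()         # dequeue next process, load its state,
                                         # and transfer control to it

ready = deque(["B", "C"])
running = "A"
for _ in range(3):                       # three time slices elapse
    running = timer_interrupt(running, ready)
print(running, list(ready))  # A ['B', 'C'] -- round-robin cycles back to A
```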
29. Multiple processor execution
- Symmetric multiprocessing (SMP)
  - Multiple processors share a common memory
  - Each processor is equivalent
- If there are more processes than processors, then the CPUs must simulate parallelism with time slicing
- (figure: CPU1, CPU2, CPU3, and CPU4 sharing a common memory)
30. Parallel computer architectures
- High-level decomposition of parallel architectures
  - (a) On-chip parallelism
  - (b) A coprocessor
  - (c) A multiprocessor
  - (d) A multicomputer
  - (e) A grid
31. Homogeneous multiprocessors on a chip
- (a) A dual-pipeline chip (Pentium 4 hyperthreading)
  - Allows resources (functional units) to be shared
  - Does not scale up well
- (b) A chip with two cores
  - A core is a complete CPU
32. Symmetric multiprocessors (SMP)
- (a) A multiprocessor with 16 CPUs sharing a common memory
- (b) An image partitioned into 16 sections, each being analyzed by a different CPU
33. Multicomputers
- (a) A multicomputer with 16 CPUs, each with its own private memory
- (b) The bit-map image of Fig. 8-17 split up among the 16 memories
34. Taxonomy of parallel computers
35. UMA symmetric multiprocessor architectures
- (a) Without caching
- (b) With caching
- (c) With caching and private memories
36. UMA multiprocessors using crossbar switches
- (a) An 8 x 8 crossbar switch
- (b) An open crosspoint
- (c) A closed crosspoint
37. Message-passing multicomputers
38. Interconnection network topologies
- The heavy dots represent switches (the CPUs and memories are not shown)
- (a) A star
- (b) A complete interconnect
- (c) A tree
- (d) A ring
- (e) A grid
- (f) A double torus
- (g) A cube
- (h) A 4D hypercube
39. Massively parallel processors (MPPs)
- Typical supercomputer
- Use standard CPUs
  - Intel Pentium
  - Sun UltraSPARC
  - IBM PowerPC
- Set apart by a very high-performance proprietary interconnection network
40. BlueGene/L MPP
- The BlueGene/L custom processor chip
- Design goals
  - World's fastest MPP
  - Most efficient in terms of teraflops/dollar and teraflops/watt
- 65,536 dual-processor nodes configured as a 32 x 32 x 64 3-D torus
- Peak 360 teraflops (sustained 280.6 teraflops)
- 1.5 megawatts
- 2,500 square feet of floor space
41. BlueGene/L MPP
42. COWs (Clusters of Workstations)
- A cluster consists of dozens, hundreds, or thousands of PCs or workstations connected over a commercially available network
- Two dominant types
  - Centralized
    - Typically all in one room
  - Decentralized
    - Connected by a LAN or the Internet
    - Google
43. Software metrics
- Real programs achieve less than the perfect speedup indicated by the dotted line
- Data from a multicomputer consisting of 64 Pentium Pro CPUs