Title: CS 3xx Introduction to High Performance Computer Architecture: Address Accessible Memories
- A.R. Hurson
- 325 CS Building,
- Missouri S&T
- hurson_at_mst.edu
- Memory System
- In pursuit of improving performance, and hence reducing CPU time, this section discusses the memory system.
- The goal is to develop means to reduce m and k.
- Memory System
- Memory Requirements for a Computer
- An internal storage medium to store intermediate as well as final results,
- An external storage medium to store input information, and
- An external storage medium to store permanent results for future use.
- Memory System
- Different parameters can be used to classify memory systems.
- In the following, we will use the access mode to classify memory systems.
- Access mode is defined as the way the information stored in the memory is accessed.
- Memory System Access Mode
- Address Accessible Memory: information is accessed by its address in the memory space.
- Content Addressable Memory: information is accessed by its contents (or partial contents).
- Memory System Access Mode
- Within the scope of address accessible memory we can distinguish several subclasses:
- Random Access Memory (RAM): access time is independent of the location of the information.
- Sequential Access Memory (SAM): access time is a function of the location of the information.
- Direct Access Memory (DAM): access time is partially independent of and partially dependent on the location of the information.
- Memory System Access Mode
- Even within each subclass, we can distinguish several sub-subclasses.
- For example, within the scope of Direct Access Memory we can recognize different groups:
- Movable head disk,
- Fixed head disk,
- Parallel disk
- Memory System
- Movable head disk: each surface has just one read/write head. To initiate a read or write, the read/write head must first be positioned on the right track (the seek time).
- Seeking is a mechanical movement and hence relatively slow and time consuming.
- Memory System
- Fixed head disk: each track has its own read/write head. This eliminates the seek time. However, this performance improvement comes at the expense of cost.
- Memory System
- Parallel disk: to respond to the growth in performance and capacity of semiconductor technology, secondary storage technology introduced RAID (Redundant Array of Inexpensive Disks).
- In short, RAID is a large array of small independent disks acting as a single high performance logical disk.
- Memory System RAID
- RAID increases performance and reliability through two concepts:
- Data Striping
- Redundancy
- Memory System RAID
- The concept of data striping (distributing data transparently over multiple disks) is used to allow parallel access to the data and hence to improve disk performance.
- In data striping, the data set is partitioned into equal-size segments, and the segments are distributed over multiple disks.
- The size of a segment is called the striping unit.
- Memory System RAID
- Redundant information allows reconstruction of data if a disk fails. There are two choices for storing redundant data:
- Store redundant information on a small number of separate disks (check disks).
- Distribute the redundant information uniformly over all disks.
- Redundant information can be an exact duplicate of the data, or we can use a parity scheme: additional information that can be used to recover from the failure of any one disk in the array.
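The parity scheme described above can be sketched in a few lines (a hypothetical illustration, not code from the slides; the four data blocks and their sizes are made up): the check disk holds the XOR of the data blocks, so the contents of any single failed disk can be rebuilt by XOR-ing the parity with the survivors.

```python
# Sketch of parity-based redundancy over a hypothetical 4-data-disk array.

def parity(blocks):
    """XOR a list of equal-length data blocks into one parity block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def rebuild(surviving_blocks, parity_block):
    """Recover a failed disk's block: XOR of the parity and the survivors."""
    return parity(surviving_blocks + [parity_block])

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # one block per data disk
p = parity(data)                             # block on the check disk

# Simulate losing disk 2 and rebuilding its block from the others.
survivors = [data[0], data[1], data[3]]
assert rebuild(survivors, p) == data[2]
```

The same XOR identity underlies levels 3, 4, and 5; the levels differ mainly in the striping unit and in where the parity blocks are placed.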
- Memory System RAID
- Level 0: Striping without redundancy.
- Offers the best write performance, since no redundant data needs to be updated.
- Offers the highest space utilization.
- Does not offer the best read performance.
- Memory System RAID
- Level 1: Mirrored, two identical copies.
- Each disk has a mirror image.
- Is the most expensive solution; space utilization is the lowest.
- Parallel reads are allowed.
- A write involves two disks; in some cases this will be done in sequence.
- Maximum transfer rate is equal to the transfer rate of one disk (no striping).
- Memory System RAID
- Level 0+1: Striping and mirroring.
- Parallel reads are allowed.
- Space utilization is the same as level 1.
- A write involves two disks, and the cost of a write is the same as level 1.
- Maximum transfer rate is equal to the aggregate bandwidth of the disks.
- Memory System RAID
- Level 2: Error Correcting Codes.
- The striping unit is a single bit.
- Hamming code is used as the redundancy scheme.
- Space utilization increases as the number of data disks increases.
- Maximum transfer rate is equal to the aggregate bandwidth of the disks: reads are very efficient for large requests but bad for small requests the size of an individual block.
- Memory System RAID
- Level 3: Bit-Interleaved Parity.
- The striping unit is a single bit.
- Unlike level 2, the check disk is just a single parity disk, and hence level 3 offers higher space utilization than level 2.
- The write protocol is similar to level 2: a read-modify-write cycle.
- As in level 2, only one I/O can be processed at a time, since each read and write request involves all disks.
- Memory System RAID
- Level 4: Block-Interleaved Parity.
- The striping unit is a disk block.
- Read requests the size of a single block can be served by just one disk.
- Parallel reads are possible for small requests; large requests can utilize the full bandwidth.
- A write involves the modified block and the check disk.
- Space utilization increases with the number of data disks.
- Memory System RAID
- Level 5: Block-Interleaved Distributed Parity.
- Parity blocks are uniformly distributed among all disks. This eliminates the bottleneck at the check disk.
- Several writes can potentially be done in parallel.
- Read requests have a higher level of parallelism.
- Memory System RAID
- Level 6: P+Q Redundancy.
- Can tolerate a higher level of failure than level 2.
- It requires two check disks and, similar to level 5, redundant blocks are uniformly distributed at the block level over the disks.
- For small and large read requests and for large write requests, the performance is similar to level 5.
- For small write requests, it behaves the same as level 2.
- Memory System RAM
- Random access memory can also be grouped into different classes:
- Read Only Memory (ROM)
- Programmable ROM
- Erasable Programmable ROM (EPROM)
- Electrically Alterable ROM (EAROM)
- Flash Memory
- Memory System RAM
- Read/Write Memory (RWM)
- Static RAM (SRAM)
- Dynamic RAM (DRAM)
- Synchronous DRAM
- Double-Data-Rate SDRAM
- Volatile/Non-Volatile Memory
- Destructive/Non-Destructive Read Memory
- Memory System
- Within the scope of Random Access Memory we are concerned with two major issues:
- Access Gap: the difference between the CPU cycle time and the main memory cycle time.
- Size Gap: the difference between the size of the main memory and the size of the information space.
- Memory System
- Within the scope of the memory system, the goal is to design and build a system with low cost per bit, high speed, and high capacity. In other words, in the design of a memory system we want to:
- Match the rate of information access with the processor speed.
- Attain adequate performance at a reasonable cost.
- Memory System
- The appearance of a variety of hardware as well as software solutions reflects the fact that the trade-off between cost, speed, and capacity can be made more attractive by combining different hardware systems coupled with special features: the memory hierarchy.
- Memory System Access gap
- The access gap problem was created by advances in technology. In early computers, such as the IBM 704, the CPU and main memory cycle times were identical: 12 µsec.
- The IBM 360/195 had a logic delay of 5 nsec per stage, a CPU cycle time of 54 nsec, and a main memory cycle time of .756 µsec.
- The CDC 7600 had CPU and main memory cycle times of 27.5 nsec and .275 µsec, respectively.
- Access gap
- How to reduce the access gap bottleneck:
- Software Solutions
- Devise algorithmic techniques to reduce the
number of accesses to the main memory. - Hardware Solutions
- Reduce the access gap.
- Advances in technology
- Interleaved memory
- Application of registers
- Cache memory
- Access gap Interleaved Memory
- A memory is n-way interleaved if it is composed of n independent modules, and the word at address i is in module number i mod n.
- This implies consecutive words in consecutive memory modules.
- If the n modules can be operated independently, and if the memory bus is time-shared among the memory modules, then one should expect an increase in bandwidth between the main memory and the CPU.
- Access gap Interleaved Memory
- Dependencies in the program (branches) and randomness in accessing the data will degrade the effect of memory interleaving.
- Access gap Interleaved Memory
- To show the effectiveness of memory interleaving, assume a purely sequential program of m instructions.
- For a conventional system in which main memory is composed of a single module, the system has to go through m fetch cycles and m execute cycles in order to execute the program.
- For a system in which main memory is composed of n modules, the system executes the same program with ⌈m/n⌉ fetch cycles and m execute cycles.
- Access gap Interleaved Memory
- The concept of modular memory can be traced back to the design of the so-called Harvard-class machines, where the main memory was composed of two modules, namely program memory and data memory.
- Access gap Interleaved Memory
- It was in the design of the ILLIAC II system that the concept of interleaved memory was introduced.
- In this machine, the memory was composed of two units: the even addresses generated by the CPU were sent to module 0, and the odd addresses were directed to module 1.
- Access gap Interleaved Memory
- In general, when the main memory is composed of n different modules, the addresses in the address space can be distributed among the memory modules in two different fashions:
- Consecutive addresses in consecutive memory modules: Low-Order Interleaving.
- Consecutive addresses in the same memory module: High-Order Interleaving.
- Access gap Interleaved Memory
- Whether low-order or high-order interleaving is used, a word address is composed of two parts:
- Module Address, and
- Word Address.
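The two address layouts can be sketched as follows (a minimal illustration; the module count and words-per-module are hypothetical, and both are powers of two so the split is a simple bit split):

```python
# Address decomposition for low-order vs. high-order interleaving,
# assuming a hypothetical memory of 8 modules x 1024 words.

N_MODULES = 8
WORDS_PER_MODULE = 1024

def low_order(addr):
    """Consecutive addresses in consecutive modules."""
    module = addr % N_MODULES        # low-order bits select the module
    word = addr // N_MODULES         # remaining bits select the word
    return module, word

def high_order(addr):
    """Consecutive addresses in the same module."""
    module = addr // WORDS_PER_MODULE  # high-order bits select the module
    word = addr % WORDS_PER_MODULE
    return module, word

# Addresses 0..3 land in modules 0..3 under low-order interleaving,
# but all in module 0 under high-order interleaving.
assert [low_order(a)[0] for a in range(4)] == [0, 1, 2, 3]
assert [high_order(a)[0] for a in range(4)] == [0, 0, 0, 0]
```

This is why low-order interleaving favors sequential fetch speed while high-order interleaving keeps a program's addresses confined to one module.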
- Questions
- Name and discuss the factors which most influence the speed, cost, and capacity of the main memory.
- Compare and contrast low-order interleaving against high-order interleaving.
- Dependencies in the program degrade the effectiveness of memory interleaving: justify this.
- Access gap Interleaved Memory
- Within the scope of interleaved memory, a memory conflict (contention, interference) occurs if two or more addresses are issued to the same memory module.
- In the worst case, all the issued addresses refer to the same memory module.
- In this case the system's performance degrades to the level of a single-module memory organization.
- Access gap Interleaved Memory
- To take advantage of interleaving, the CPU should be able to perform look-ahead fetches: issuing addresses before they are really needed.
- In the case of straight-line programs and in the absence of random data accesses, such a look-ahead policy can be enforced very easily and effectively.
- Interleaved Memory Effect of Branch
- Assume λ is the probability of a successful branch. Hence 1-λ is the probability of a sequential instruction (for a straight-line program, λ is zero).
- In the case of interleaving where memory is composed of n modules, the CPU employs a look-ahead policy and issues n instruction fetches to the memory.
- Naturally, memory utilization will be degraded if one of these n instructions generates a successful branch.
- Interleaved Memory Effect of Branch
- P(1) = λ: probability that the 1st instruction generates a successful branch.
- P(2) = λ(1-λ): probability that the 2nd instruction generates a successful branch.
- P(k) = λ(1-λ)^(k-1): probability that the kth instruction generates a successful branch.
- P(n) = (1-λ)^(n-1): probability that the first n-1 instructions are all sequential instructions.
- Interleaved Memory Effect of Branch
- Note that in the case of P(n), it does not matter whether or not the last instruction is a sequential instruction.
- The average number of memory modules used effectively is IB_n = Σ_{k=1}^{n} k·P(k) = (1 - (1-λ)^n)/λ.
- Interleaved Memory Effect of Branch
- Example
- For λ = 5% and n = 4, IB_n ≈ 3.8.
- For λ = 5% and n = 8, IB_n ≈ 6.8.
- For λ = 10% and n = 4, IB_n ≈ 3.4.
- For λ = 10% and n = 8, IB_n ≈ 5.7.
- Less branching, as expected, implies higher memory utilization.
- Memory utilization is not linear in the number of memory modules.
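The example values can be checked directly from the P(k) distribution given two slides earlier (a sketch; `lam` stands for the branch probability λ):

```python
# Expected number of modules used effectively: IB_n = sum_k k * P(k).

def ib(lam, n):
    # P(k) = lam*(1-lam)**(k-1) for k < n: the k-th fetch is the first branch.
    expected = sum(k * lam * (1 - lam) ** (k - 1) for k in range(1, n))
    # P(n) = (1-lam)**(n-1): the first n-1 fetches are all sequential.
    expected += n * (1 - lam) ** (n - 1)
    return expected

# Two of the slide's data points (10% branch probability):
assert round(ib(0.10, 4), 1) == 3.4
assert round(ib(0.10, 8), 1) == 5.7
# A straight-line program (lam = 0) keeps all n modules busy:
assert ib(0.0, 8) == 8
```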
- Interleaved Memory Effect of Random Data-Access
- In the case of data access, the effectiveness of the interleaved memory will be compromised if, among the n requests made to the memory, some refer to the same memory module.
- Interleaved Memory Effect of Random Data-Access
- Access requests are queued.
- A scanner checks the request at the head of the queue:
- If there is no conflict, the request is passed to the memory.
- If there is a conflict, scanning is suspended until the conflict is resolved.
- Interleaved Memory Effect of Random Data-Access
P(1): prob. of one successful access.
P(2): prob. of two successful accesses.
P(k): prob. of k successful accesses.
- Interleaved Memory Effect of Random Data-Access
- The average number of active memory modules is E = Σ_k k·P(k).
- If n = 16, one can conclude that on average just 4 modules can be kept busy under randomly generated access requests.
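Under the stop-on-first-conflict behaviour just described, the expectation can be computed exactly (a sketch; P(K ≥ k) is the probability that the first k requests all hit distinct modules):

```python
# Expected number of memory modules kept busy when randomly addressed
# requests are accepted until the first one hits an already-busy module.

def expected_active(n):
    # E[K] = sum over k of P(K >= k), where
    # P(K >= k) = (n-1)/n * (n-2)/n * ... * (n-k+1)/n.
    expected, p_at_least_k = 0.0, 1.0
    for k in range(1, n + 1):
        expected += p_at_least_k
        p_at_least_k *= (n - k) / n
    return expected

# For n = 16 modules, only about 4-5 modules stay busy on average,
# matching the slide's observation.
assert 4.0 < expected_active(16) < 5.0
```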
- Interleaved Memory Effect of Random Data-Access
- Naturally, the performance of interleaved memory under random data access can be improved by not allowing an access to a busy module to stop other accesses to the main memory.
- In other words, the conflicting access is queued and retried later.
- This concept was first implemented in the design of the CDC 6600 Stunt Box.
- Interleaved Memory Stunt Box
- The Stunt Box is designed to provide a maximum flow of addresses to the main memory.
- The Stunt Box is a piece of hardware that controls and regulates accesses to the main memory.
- The Stunt Box allows out-of-order access to the main memory.
- Interleaved Memory Stunt Box
- The Stunt Box is composed of three parts:
- Hopper
- Priority Network
- Tag Generator and Distributor
- Interleaved Memory Stunt Box
- Hopper: an assembly of four registers that retains storage reference information until storage conflicts are resolved.
- Priority Network: prioritizes the requests to the memory generated by the central processor and the peripheral processors.
- Tag Generator: used to control read/write conflicts.
- Stunt Box Flow of Data and Control
- Assuming an empty hopper, a storage address from one of the sources is entered in register M1.
- The access request in M1 is issued to the main memory.
- The contents of the registers in the hopper are circulated every 75 nanoseconds.
- If a request is accepted by the main memory, it will not be recirculated back to M1. Otherwise, after every 300 nanoseconds it will be sent back to the main memory for another access attempt.
- Stunt Box Flow of Data and Control
- Time events of a request:
- t = 0: enter M1
- t = 25: send to the central storage
- t = 75: M1 to M4
- t = 150: M4 to M3
- t = 225: M3 to M2
- t = 300: M2 to M1 (if not accepted)
- Stunt Box Example
- Assume a sequence of access requests is initiated
to the same memory module
- Stunt Box Example
- The previous chart indicates:
- Out-of-order access,
- A request to the memory will, sooner or later, be granted.
- Memory System Cache Memory
- Principle of Locality
- Analysis of a large number of typical programs has shown that most of their execution time is spent in a few main routines.
- As a result, a number of instructions are executed repeatedly. This may be in the form of a single loop, nested loops, or a few subroutines that repeatedly call each other.
- Memory System Cache Memory
- Principle of Locality
- It has been observed that a program spends 90% of its execution time in only 10% of the code: the principle of locality.
- The main observation is that many instructions in each of a few localized areas of the program are repeatedly executed, while the remainder of the program is accessed relatively infrequently.
- Memory System Cache Memory
- Principle of Locality: locality can be observed in two forms:
- Temporal Locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial Locality: if an item is referenced, nearby items tend to be referenced soon.
- Memory System Cache Memory
- Principle of Locality
- Now, if the active segments of a program can be arranged to reside in a fast memory, then the total execution time can be significantly reduced.
- Such a fast memory is called a cache (slave, buffer) memory.
- Memory System Cache Memory
- Principle of Locality
- Cache is a level of memory inserted between the main memory and the CPU.
- For economic reasons, the cache is much smaller than main memory.
- To make the cache effective, it must be considerably faster than the main memory.
- Memory System Cache Memory
- Principle of Locality
- The main memory and the cache are partitioned into blocks of equal size.
- Naturally, because of the size gap between the main memory and the cache, at each moment in time a portion of the main memory is resident in the cache.
- Memory System Cache Memory
- The concept of the cache was introduced in the mid 1960s by Wilkes.
- When a memory request is generated, it is first presented to the cache, and if the cache cannot respond, the request is then presented to the main memory.
- Memory System Cache Memory
- The idea of a cache is similar to virtual memory in that some active portion of a low-speed memory is stored in duplicate in a higher-speed memory.
- The difference between cache and virtual memory is a matter of implementation; the two approaches are conceptually the same because they both rely on the correlation properties observed in sequences of address references.
- Memory System Cache Memory
- Cache implementations are totally different from virtual memory implementations because of the speed requirements of a cache. If we assume that cache memory has an access time of one machine cycle, then main memory typically has an access time anywhere from 4 to 20 times longer, not the 500 times longer delay of a page fault in virtual memory.
- In general, caches are controlled by hardware algorithms.
- Memory System Cache Memory
- Cache vs. Virtual Memory
- Memory System Cache Memory
- Ranges of parameters for cache
- Memory System Cache Memory
- Address Mapping:
- Each reference to a memory word is presented to the cache.
- The cache searches its directory:
- If the item is in the cache, it is accessed from the cache.
- Otherwise, a miss occurs.
- Cache Memory Address Mapping
- In the previous diagram:
- A reference to address 01173 is responded to by the cache.
- A reference to address 01163 produces a miss.
- Memory System Cache Memory
- Replacement Policy:
- For each read operation that causes a cache miss, the item is retrieved from the main memory and copied into the cache. If the cache is full, this forces some other item in the cache to be identified and removed to make room for the new item.
- The collection of rules governing these activities is referred to as the Replacement Algorithm.
- Memory System Cache Memory
- Replacement Policy:
- The cache-replacement decision is critical; a good replacement algorithm can yield somewhat higher performance than a bad replacement algorithm.
- Memory System Cache Memory
- Let h be the probability of a cache hit (hit ratio), and let t_cache and t_main be the respective cycle times of the cache and main memory. Then:
- t_eff = t_cache + (1-h)·t_main
- (1-h) is the probability of a miss (miss ratio).
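The effective-access-time formula can be wrapped as a tiny helper (a sketch; the example numbers below are hypothetical, not from the slides):

```python
# Effective memory access time with a cache in front of main memory.

def t_eff(t_cache, t_main, hit_ratio):
    """t_eff = t_cache + (1 - h) * t_main."""
    return t_cache + (1 - hit_ratio) * t_main

# With a 1-cycle cache, a 10-cycle main memory, and a 95% hit ratio,
# the average access costs 1.5 cycles.
assert round(t_eff(1, 10, 0.95), 6) == 1.5
```

Note how strongly t_eff depends on the miss ratio: even a few percent of misses dominates the average once t_main is much larger than t_cache.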
- Cache Memory Issues of Concern
- Read Policy
- Load Through
- Write policy (on hit)
- Write through
- Write back ⇒ dirty bit
- Write policy (on miss)
- Write allocate
- No-write allocate
- Placement/replacement policy
- Address Mapping
- Cache Memory Issues of Concern
- In Case of a Miss:
- For a read operation, the block containing the referenced address is moved into the cache.
- For a write operation, the information is written directly into the main memory.
- Questions
- Compare and contrast the different write policies against each other.
- In the case of a miss, why are read and write operations treated differently?
- Cache Memory Issues of Concern
- Sources of cache misses:
- Compulsory: cold start misses.
- Capacity.
- Conflict: placement/replacement policy.
- Cache Memory Issues of Concern
- It has been shown that increasing the cache size and/or the degree of associativity will reduce the cache miss ratio.
- Naturally, compulsory miss ratios are independent of cache size.
- Memory System Cache Memory
- Mixed caches: the cache contains both instructions and data (unified caches).
- Instruction-only and data-only caches: dedicated caches for instructions and data.
- Memory System Cache Memory
- In general, miss ratios for instruction caches are lower than miss ratios for data caches.
- For smaller cache sizes, unified caches exhibit higher miss ratios than dedicated caches. However, as the cache size increases, the miss ratio of unified caches relative to dedicated caches decreases.
- Exam question
- a) Define the term interleaved memory.
- A memory is n-way interleaved if it is composed of n independent modules, and the word at address i is in module number i mod n. Modules are operated independently, and the memory bus is time-shared among the memory modules.
- Exam question
- b) Define high-order interleaving
- Consecutive addresses in the same memory module.
- c) Define low-order interleaving
- Consecutive addresses in consecutive memory
modules.
- Exam question
- d) Compare and contrast high-order interleaving with low-order interleaving:
- Speed: low-order interleaving.
- Fault tolerance: high-order interleaving.
- Block transfer: high-order interleaving.
- Enforcing security: high-order interleaving.
- Multiprocessing: high-order interleaving.
- Exam question
- e) Two issues affect the performance of an interleaved memory.
- What are they?
- Random access to data.
- Branches in programs.
- Show (prove) how they affect the effectiveness of the interleaved memory.
- Exam question
- f) With respect to part (e), discuss solutions (one for each case):
- Reduce the number of branches in the program, or
- Expect the compiler to reshuffle instructions in an attempt to place branch instructions in the rightmost modules.
- Application of the stunt box and its intention.
- Exam question
- As a computer architect, in the design process of an ALU, what initial issues does one have to keep in mind? Name them and discuss their importance.
- Functionality
- Representation of information
- Length and number of operand(s)
- Organization:
- Serial
- Parallel
- Modular
- Exam question
- The Cray Y-MP/8 (a vector processor) has a cycle time of 6 ns. During a cycle, the results of both an addition and a multiplication can be completed. Furthermore, there are eight processors operating simultaneously without interference in the best case. Calculate the peak performance of the Cray Y-MP/8 (in MIPS).
- Exam question - Solution
- Peak Performance = (8 processors × 2 results/cycle) / 6 nsec ≈ 2667 MIPS
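The arithmetic behind the peak-performance figure can be sketched as:

```python
# Peak performance of the Cray Y-MP/8 from the question's numbers:
# 8 processors, each completing an add and a multiply every 6 ns cycle.

CYCLE_TIME = 6e-9          # seconds per cycle
RESULTS_PER_CYCLE = 2      # one addition + one multiplication
PROCESSORS = 8

peak_ops_per_sec = PROCESSORS * RESULTS_PER_CYCLE / CYCLE_TIME
peak_mips = peak_ops_per_sec / 1e6

assert round(peak_mips) == 2667   # about 2667 MIPS
```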
- Exam question
- Assume we have a machine where the CPI is 2.0 when all memory accesses (instruction fetches and data fetches) hit in the cache. The only data accesses are loads and stores (note: these are one-address-type instructions), and these total 40% of the instructions (the rest of the instructions deal only with registers). If the miss penalty is 25 clock cycles and the miss rate is 2%, how much faster would the machine be if all accesses were cache hits?
- Exam question Solution
- CPI_ideal = 2
- CPI = 2 + (1 + .4) × .02 × 25 = 2.7
- Speedup = CPI/CPI_ideal = 2.7/2 = 1.35
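The solution can be checked numerically (a sketch using the question's numbers):

```python
# CPI with cache misses: each instruction makes 1 instruction fetch plus
# 0.4 data accesses on average; 2% of accesses miss at 25 cycles each.

base_cpi = 2.0
mem_refs_per_instr = 1 + 0.4
miss_rate = 0.02
miss_penalty = 25

cpi_with_misses = base_cpi + mem_refs_per_instr * miss_rate * miss_penalty
speedup = cpi_with_misses / base_cpi

assert round(cpi_with_misses, 2) == 2.7
assert round(speedup, 2) == 1.35
```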
- Exam question
- Apply the Column Compression technique to perform the following operation:
- 1110111
- 1101011
- Note: the numbers are in 2's complement format.
- Column Compression does not work for 2's complement numbers, so we need to convert them into positive numbers, or
- Apply the column compression technique on the numbers and then correct the result.
The partial products are compressed column by column; the final result is 0000000010111101.
- Exam question
- Figure 1 shows the ith-stage logic of a parallel ALU, where Ai, Bi, and Ci are the operands and the carry-in, respectively, and S2, S1, S0, and M are the control signals. Determine under what values of S2, S1, S0, M, and C1 (carry-in to the rightmost stage) the ALU performs the following operations:
- a) F ← A - B (Why?)
- b) F ← B (Why?)
F ← B: M = 0 (logic operation), C1 = x (don't care), S2 = 0 (disable A), S1 = 0 (disable 1's complement of B), S0 = 1 (pass B to the ALU)
- Exam question
- Use the SRT division method to perform the following operation:
- AQ/B, where
- AQ = .11001100 and B = .0111
- Show the step-by-step operation.
- Exam question Solution
- Note the divide overflow condition (A) > (B); to eliminate divide overflow, the dividend must be shifted to the right and 1 added to its power:
- AQ = .0110 0110
- Normalize B: .1110
1.1110 110   Negative result: shift and insert 0
1.1101 100   Skip over 1's
1.0110 011   Add B
0.0100 011   Positive result: shift and insert 1
0.100 0111
1.101 0111   Negative result: shift and insert 0
1.010 1110   Now we need to correct the remainder
Remainder correction — add B back to the negative remainder:
1.101 + 0.111 = 0.100
- Exam question
- Assume we are utilizing a parallel disk system (RAID)
composed of 6 and 8 disks (% of data disks
available).
- Exam question
- A memory is n-way interleaved if:
1) it is composed of n independent modules, 2)
address i is in module (i mod n), and 3) the bus is
shared among the modules.
- Exam question
- Define high-order interleaving: consecutive
addresses are in the same module.
- Define low-order interleaving: consecutive
addresses are in consecutive modules.
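The two interleaving schemes can be sketched as address-to-module mapping functions (a minimal illustration; the function names and module counts are mine, not the slides'):

```python
# Sketch: address-to-module mapping under the two interleaving schemes.

def low_order_module(addr, n):
    # Low-order interleaving: the low-order bits (addr mod n) select the
    # module, so consecutive addresses fall in consecutive modules.
    return addr % n

def high_order_module(addr, words_per_module):
    # High-order interleaving: the high-order bits select the module, so
    # consecutive addresses stay in the same module.
    return addr // words_per_module

# With 4 modules of 256 words each, addresses 0..3 hit modules 0, 1, 2, 3
# under low-order interleaving, but all hit module 0 under high-order.
```

Low-order interleaving is what makes n-way interleaving effective for sequential access: successive words reside in different modules and can be fetched in overlapped fashion over the shared bus.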
- Exam question
- Address accessible memory can be classified as:
1) RAM, 2) SAM, 3) DAM.
- Explain the access gap as clearly as possible:
the difference between the main memory cycle time
and the CPU cycle time.
- Memory System Cache Memory
- Address Mapping
- Direct Mapping
- Associative Mapping
- Set Associative Mapping
- Cache Memory Address Mapping
- In the following discussion assume:
- B = block size = 2^b
- C = number of blocks in the cache = 2^c
- M = number of blocks in main memory = 2^m
- S = number of sets in the cache = 2^s
- Cache Memory Address Mapping
- Direct Mapping
- Block K of main memory maps into block (K modulo
C) of the cache.
- Since more than one main memory block is mapped
into a given cache position, contention may arise
even when the cache is not full.
- Cache Memory Address Mapping
- Direct Mapping
- Address mapping can be implemented very easily.
- Replacement policy is trivial.
- In general, cache utilization is low.
- Cache Memory Address Mapping
- Direct Mapping
- Main memory address is of the following form:
tag (m − c bits) | block (c bits) | word (b bits)
- A tag-register of length m − c is dedicated to
each cache block.
- Cache Memory Address Mapping
- Direct Mapping
- The content of the tag-register of the addressed
cache block is compared against the tag portion of
the address.
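As a concrete sketch (field widths and function names are illustrative assumptions, not the slides' notation), the direct-mapping address split and tag comparison look like this:

```python
# Sketch: splitting a main-memory address into tag (m - c bits),
# cache-block index (c bits), and word offset (b bits) for direct mapping.

def split_address(addr, b, c):
    word = addr & ((1 << b) - 1)          # low b bits: word within block
    block = (addr >> b) & ((1 << c) - 1)  # next c bits: cache block index
    tag = addr >> (b + c)                 # remaining bits: tag
    return tag, block, word

def direct_lookup(tag_registers, addr, b, c):
    # Hit iff the tag register of cache block (K mod C) holds this tag.
    tag, block, _word = split_address(addr, b, c)
    return tag_registers[block] == tag

# Example parameters: b = 2 (4 words per block), c = 3 (8 cache blocks).
# Main-memory block K maps to cache block K mod 8, so the block-index
# field of the address is exactly K mod 8.
```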
- Cache Memory Address Mapping
- Associative Mapping
- A block of main memory can potentially reside in
any cache block position. This flexibility is
achieved by utilizing a wider tag-register.
- Address mapping requires a hardware facility that
allows simultaneous search of the tag-registers.
- A reasonable replacement policy can be adopted
(e.g., least recently used).
- The cache can be used very effectively.
- Cache Memory Address Mapping
- Associative Mapping
- Main memory address is of the following form:
tag (m bits) | word (b bits)
- A tag-register of length m is dedicated to each
cache block.
- Cache Memory Address Mapping
- Associative Mapping
- The tag portion of the address is searched (in
parallel) against the contents of the
tag-registers.
- If there is no match, a miss occurs: bring the
block from main memory into the proper cache block.
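A minimal model of the associative lookup (a dict stands in for the simultaneous hardware search of all tag-registers; the names are mine, not the slides'):

```python
# Sketch: fully associative lookup. The whole block number serves as the
# tag; a dict keyed by tag models the parallel tag-register search.

def associative_lookup(tag_to_block, addr, b):
    tag = addr >> b                     # entire block number is the tag
    word = addr & ((1 << b) - 1)        # word offset within the block
    if tag in tag_to_block:             # models simultaneous comparison
        return tag_to_block[tag][word]  # hit
    return None                         # miss: fetch block from memory

# Block 7 resident in the cache, block size 4 words (b = 2):
cache = {7: ["w0", "w1", "w2", "w3"]}
```

A real associative cache compares all tag-registers in a single cycle in hardware; the dict only models the outcome of that search, not its cost.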
- Cache Memory Address Mapping
- Set Associative Mapping
- Is a compromise between direct mapping and
associative mapping.
- Blocks of the cache are grouped into S sets, and
the mapping allows a block K of main memory to
reside in any block of set (K modulo S).
- Address mapping can be implemented easily, at a
more reasonable hardware cost relative to
associative mapping.
- Cache Memory Address Mapping
- Set Associative Mapping
- This scheme allows one to employ a reasonable
replacement policy within the blocks of a set and
hence offers better cache utilization than the
direct-mapping scheme.
- Cache Memory Address Mapping
- Set Associative Mapping
- Main memory address is of the following form:
tag (m − s bits) | set (s bits) | word (b bits)
- A tag-register of length m − s is dedicated to
each block in the cache.
- Cache Memory Address Mapping
- Set Associative Mapping
- The contents of the tag-registers of the selected
set are compared simultaneously against the tag
portion of the address.
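The set-associative lookup can be sketched as follows (the s set-index bits select one set and only that set's tag-registers are compared; sizes and names are illustrative assumptions):

```python
# Sketch: set-associative lookup. Main-memory block K may reside in any
# block of set (K mod S); only that set's tag-registers are searched.

def set_assoc_lookup(sets, addr, b, s):
    word = addr & ((1 << b) - 1)               # word offset within block
    set_index = (addr >> b) & ((1 << s) - 1)   # s bits select the set
    tag = addr >> (b + s)                      # remaining m - s tag bits
    for stored_tag, block in sets[set_index]:  # compare tags in one set
        if stored_tag == tag:
            return block[word]                 # hit
    return None                                # miss

# 2 sets (s = 1), 2 blocks per set, 4-word blocks (b = 2):
sets = [[(3, list("abcd"))], [(0, list("efgh"))]]
```

Direct mapping is the special case of one block per set, and associative mapping the special case of a single set holding every block.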
- Questions
- Compare and contrast a unified cache against
dedicated caches.
- Compare and contrast direct mapping, associative
mapping, and set associative mapping against each
other.
- Cache Memory IBM 360/85
- Main Memory (4-way interleaved)
- Size: 512 - 4096 K bytes
- Cycle Time: 1.04 µsec
- Block Size: 1 K bytes
- Cache Memory IBM 360/85
- Cache
- Size: 16 K bytes
- Access Time: 80 nsec
- Block Size: 1 K bytes
- Address Mapping: Associative Mapping
- Replacement Policy: Least Recently Used
- Read Policy: Read-Through
- Write Policy: Write-Back; a write access to main
memory does not cause any cache reassignment.
- Cache Memory IBM 360/85
- Hardware Configuration
- An associative memory of 16 words, each 14 bits
long, represents the collection of tag-registers.
- Each block is a collection of 16 units, each of
length 64 bytes.
- Each block has a validity register of length 16.
- Cache Memory IBM 360/85
- Hardware Configuration
- A validity bit represents the availability of a
unit of a block in the cache.
- A unit is the smallest granule of information
transferred between the main memory and the cache.
- The units in a block are brought in on a demand
basis.
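The validity-bit scheme described above can be modeled as follows (an illustrative sketch, not IBM's implementation): a resident block keeps one validity bit per 64-byte unit and fetches a unit from main memory only on first demand.

```python
# Sketch of a 360/85-style sector block: 16 units per block, one validity
# bit per unit, units fetched from main memory on demand.

class SectorBlock:
    UNITS = 16

    def __init__(self):
        self.valid = [False] * self.UNITS  # the 16-bit validity register
        self.units = [None] * self.UNITS   # one 64-byte unit per slot

    def read(self, unit_no, fetch_from_memory):
        if not self.valid[unit_no]:        # unit not yet in the cache
            self.units[unit_no] = fetch_from_memory(unit_no)
            self.valid[unit_no] = True     # demand fetch completed
        return self.units[unit_no]         # subsequent reads hit
```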
- Cache Memory IBM 360/85
- Main memory address format
- Cache Memory IBM 360/85
- Flow of Operations
- Cache Memory IBM 370/155
- Main Memory (4-way interleaved)
- Size: 256 - 2048 K bytes
- Cycle time: 2.100 µsec
- Block size: 32 bytes
- Cache Memory IBM 370/155
- Cache
- Size: 8 K bytes
- Cycle time: 230 nsec
- Block size: 32 bytes
- Address Mapping: set associative mapping, with
set size 2
- Cache Memory IBM 370/155
- Hardware Configuration
- Cache Memory IBM 370/155
- Main memory address format
- Cache Memory IBM 370/155
- Memory Organization and View
- Cache Memory 68040
- Two dedicated caches on processor chip
- Instruction cache
- Data cache
- Each cache is of size 4K bytes with a 4-way set
associative organization.
- Each cache is a collection of 64 sets of 4 blocks
each.
- Cache Memory 68040
- Each block is a collection of 4 long words; a
long word is 4 bytes long.
- Each cache block has a valid bit and a dirty bit.
- Either a write-back or a write-through policy can
be employed.
- The replacement policy selects a randomly chosen
block in each set.
- Cache Memory 68040
- Main Memory Address Format
- Cache Memory Pentium III
- Has two cache levels.
- Level 1
- Has dedicated caches for instructions and data.
- Each cache is 16K bytes.
- Data cache has a 4-way set associative
organization.
- Instruction cache has a 2-way set associative
organization.
- Both write-back and write-through policies can be
adopted.
- Cache Memory Pentium III
- Level 2
- It is a unified cache, either external to the
processor chip (Pentium III Katmai) or internal
to the processor chip (Pentium III Coppermine).
- If internal, it is 256 K bytes of SRAM with a
4-way set associative organization.
- If external, it is 512 K bytes with an 8-way set
associative organization.
- Either a write-back or a write-through policy can
be employed.
- Cache Memory
- How to make the cache faster?
- Make the cache itself faster: better technology.
- Make the cache larger.
- Sub-block cache blocks: a portion of a block is
the granule of information transferred between
the main memory and the cache.
- Use a write buffer: care should be taken to
preserve write-read order.
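The write-read ordering concern can be sketched as follows (an illustrative model, not any particular machine's design): a read must check the buffer for a pending write to the same address before going to memory, or it could return stale data.

```python
# Sketch: a write buffer with a write-read order check. Writes are queued
# so the CPU need not wait for main memory; reads consult the queue first.

from collections import OrderedDict

class WriteBuffer:
    def __init__(self, capacity=4):
        self.pending = OrderedDict()  # address -> value awaiting write
        self.capacity = capacity

    def write(self, memory, addr, value):
        if len(self.pending) >= self.capacity:
            old_addr, old_val = self.pending.popitem(last=False)
            memory[old_addr] = old_val   # drain the oldest entry
        self.pending[addr] = value       # CPU continues immediately

    def read(self, memory, addr):
        if addr in self.pending:         # write-read order check:
            return self.pending[addr]    # pending write wins over memory
        return memory.get(addr)
```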
- Cache Memory
- How to make the cache faster?
- Early restart: allow the CPU to continue as soon
as the requested data is in the cache
(read-through).
- Out-of-order fetch: attempt to fetch the
requested information first; should be used in
conjunction with read-through.
- Multi-level cache memory organization.