Title: Simulation of Decode Filter Cache using SimpleScalar simulator
1 Simulation of Decode Filter Cache using the SimpleScalar simulator
2 Motivation and Goals
- Instruction fetches and decodes are the major on-chip power consumers
- Optimize power consumption by reducing instruction fetches and decodes
- Simulate the DFC architecture using SimpleScalar
- Test the performance of the DFC
3 Prediction Mechanism
- Each sector in the DFC has the following fields:
  - (tag, sector_valid, next_address)
- If A is not equal to C, a different control path will be taken:
  - tag(A) != tag(C)   (1)
- A and B are consecutively accessed. If they belong to a small loop:
  - tag(A) = tag(B)   (2)
- Based on (1) and (2), the prediction for the next fetch:
  - tag(C) = tag(B)   (3)
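The tag comparison behind this prediction can be sketched in a few lines. This is only one possible reading of the slide, not the simulator code: the sector for the current fetch is assumed to record in next_address the address that followed it last time, and the next fetch is predicted to hit in the DFC only when that recorded address shares the current tag, as expected inside a small loop. The address split (16 bytes of instruction addresses per sector, 32 sectors) is taken from the memory hierarchy parameters; the names `Sector`, `tag_of`, `index_of`, and `predict_dfc_hit` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed geometry: 32 sectors, 4 instructions per sector, 4-byte instructions.
SECTORS = 32
INSTR_PER_SECTOR = 4
INSTR_BYTES = 4
SECTOR_BYTES = INSTR_PER_SECTOR * INSTR_BYTES  # address range covered by one sector


def index_of(addr: int) -> int:
    """Direct-mapped sector index of an instruction address (assumed split)."""
    return (addr // SECTOR_BYTES) % SECTORS


def tag_of(addr: int) -> int:
    """Tag of an instruction address (assumed split)."""
    return addr // (SECTOR_BYTES * SECTORS)


@dataclass
class Sector:
    """The three per-sector fields named on the slide."""
    tag: int = 0
    sector_valid: bool = False
    next_address: Optional[int] = None  # address fetched after this sector last time


def predict_dfc_hit(current_addr: int, sector: Sector) -> bool:
    """Predict a DFC hit for the next fetch when the recorded successor
    of the current sector has the same tag as the current address."""
    if not sector.sector_valid or sector.next_address is None:
        return False  # no history yet: fall back to the L1 I-cache
    return tag_of(sector.next_address) == tag_of(current_addr)


# Usage: a successor 16 bytes ahead stays in the same tag region,
# while a jump to a distant address does not.
near = Sector(tag=tag_of(0x1000), sector_valid=True, next_address=0x1010)
far = Sector(tag=tag_of(0x1000), sector_valid=True, next_address=0x8000)
```

Under these assumptions, `predict_dfc_hit(0x1000, near)` predicts a DFC hit and `predict_dfc_hit(0x1000, far)` sends the fetch to the instruction cache instead.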
4 Working Process
5 The Platform
- Host computer: ACPI x86-based PC
- Host computer operating system: Microsoft Windows Vista Ultimate
- Virtual machine: VMware Workstation version 6.03
- Linux operating system: Fedora Core 6
- Simulator: SimpleScalar version 3.0
6 Work done so far
- Set up the platform
- Read the source code of SimpleScalar
- Applied my DFC structure and working process to SimpleScalar
- Found benchmarks and compiled them on the platform
- Ran simulations using the given memory hierarchy parameters
7 MiBench
- dijkstra: constructs a large graph in an adjacency matrix representation and then calculates the shortest path between every pair of nodes using repeated applications of Dijkstra's algorithm.
- stringsearch: searches for given words in phrases using a case-insensitive comparison algorithm.
- rijndael encrypt/decrypt: Rijndael was selected as the National Institute of Standards and Technology's Advanced Encryption Standard (AES).
- CRC32: performs a 32-bit Cyclic Redundancy Check (CRC) on a file. CRC checks are often used to detect errors in data transmission.
8 Memory hierarchy parameters
Parameter     Value
Instr. size   4B
DFC           direct-mapped, 32 sectors, 4 decoded instr. per sector, 8B per decoded instr.
L1 I-cache    16KB, 2-way, 32B line, 1-cycle hit latency
L1 D-cache    8KB, 2-way, 32B line, 1-cycle hit latency
Memory        30-cycle latency
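The sizes implied by these parameters can be worked out directly. The following is a small arithmetic sketch, not part of the simulator; all variable names are hypothetical, and the bit split assumes a conventional offset/index/tag layout.

```python
# Parameters taken from the table above.
INSTR_BYTES = 4            # size of one undecoded instruction
DFC_SECTORS = 32
INSTR_PER_SECTOR = 4
DECODED_INSTR_BYTES = 8    # size of one decoded instruction in the DFC
L1I_BYTES = 16 * 1024
L1I_WAYS = 2
L1I_LINE_BYTES = 32

# Storage needed for the decoded instructions held by the DFC.
dfc_data_bytes = DFC_SECTORS * INSTR_PER_SECTOR * DECODED_INSTR_BYTES

# Range of instruction addresses the DFC can cover at one time.
dfc_covered_bytes = DFC_SECTORS * INSTR_PER_SECTOR * INSTR_BYTES

# Assumed address split for a direct-mapped DFC: offset within sector, sector index.
sector_addr_bytes = INSTR_PER_SECTOR * INSTR_BYTES
offset_bits = sector_addr_bytes.bit_length() - 1   # log2 of a power of two
index_bits = DFC_SECTORS.bit_length() - 1

# Number of sets in the 2-way L1 I-cache, for comparison.
l1i_sets = L1I_BYTES // (L1I_WAYS * L1I_LINE_BYTES)

print(f"DFC decoded storage: {dfc_data_bytes} B")
print(f"DFC covers {dfc_covered_bytes} B of code, "
      f"1/{L1I_BYTES // dfc_covered_bytes} of the L1 I-cache capacity")
print(f"offset bits: {offset_bits}, index bits: {index_bits}, L1 I-cache sets: {l1i_sets}")
```

The DFC therefore holds 1 KB of decoded instructions that correspond to 512 B of code, which is why only loops that fit within a few hundred bytes can be served entirely from it.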
9 Simulation results
- Reduction in instruction fetches and decodes
10 Simulation results
11 Simulation results
                 dijkstra     stringsearch   rijndael     CRC32
sim_num_insn     255620304    4437612        391487315    533385529
il1.accesses     43508918     1605417        236160209    972328
il1.hits         43399500     1568976        228694324    971600
il1.misses       109418       36441          7465885      728
il1.miss_rate    0.0025       0.0227         0.0316       0.0007
dfc.accesses     215740165    3269067        232531480    532674172
dfc.hits         212111386    2832195        155327106    532413201
dfc.misses       3628779      436872         77204374     260971
dfc.miss_rate    0.0168       0.1336         0.3320       0.0005
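The counters in this table can be cross-checked and turned into a per-benchmark summary. The sketch below is a post-processing example, not part of SimpleScalar; the dictionary layout and function names are hypothetical. It relies on one observation from the data itself: for every benchmark, dfc.hits + il1.accesses equals sim_num_insn, so dfc.hits / sim_num_insn is used here as the share of instructions served from the DFC without an I-cache access or a decode.

```python
# Counters copied from the table above.
RESULTS = {
    "dijkstra": {"sim_num_insn": 255620304, "il1.accesses": 43508918,
                 "dfc.accesses": 215740165, "dfc.hits": 212111386,
                 "dfc.misses": 3628779},
    "stringsearch": {"sim_num_insn": 4437612, "il1.accesses": 1605417,
                     "dfc.accesses": 3269067, "dfc.hits": 2832195,
                     "dfc.misses": 436872},
    "rijndael": {"sim_num_insn": 391487315, "il1.accesses": 236160209,
                 "dfc.accesses": 232531480, "dfc.hits": 155327106,
                 "dfc.misses": 77204374},
    "CRC32": {"sim_num_insn": 533385529, "il1.accesses": 972328,
              "dfc.accesses": 532674172, "dfc.hits": 532413201,
              "dfc.misses": 260971},
}


def dfc_miss_rate(stats: dict) -> float:
    """Misses divided by accesses, as reported in the dfc.miss_rate row."""
    return stats["dfc.misses"] / stats["dfc.accesses"]


def dfc_served_share(stats: dict) -> float:
    """Share of simulated instructions served by a DFC hit."""
    return stats["dfc.hits"] / stats["sim_num_insn"]


for name, stats in RESULTS.items():
    print(f"{name:13s} miss rate {dfc_miss_rate(stats):.4f}  "
          f"served from DFC {dfc_served_share(stats):.1%}")
```

With these numbers the served share is roughly 83% for dijkstra, 64% for stringsearch, 40% for rijndael, and 99.8% for CRC32, which shows how strongly the benefit depends on how well the benchmark's loops fit in the DFC.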
12 Conclusion
- The DFC stores decoded instructions and can be very small and energy-efficient.
- A hit in the DFC eliminates both the access to a much larger instruction cache and the entire decoding step.
- In the simulation results, the DFC hit rate ranges from about 67% (rijndael) to over 99.9% (CRC32), so most instruction fetches and decodes can be eliminated by using the DFC. It is therefore an effective way to reduce the power consumption of embedded processors.
13 Thank you!