IIT CS570 Graduate Advanced Computer Architecture - PowerPoint PPT Presentation

Title: IIT CS570 Graduate Advanced Computer Architecture. Author: David. Last modified by: sun. Created Date: 2/8/2005 3:17:21 AM. Document presentation format: PowerPoint PPT presentation


Transcript and Presenter's Notes



1
C-AMAT: Concurrent Average Memory Access Time
Xian-He Sun, April 2015, Illinois Institute of Technology, sun_at_iit.edu
With Yuhang Liu and Dawei Wang
2
Outline
  • Motivation
  • Memory System and Metrics
  • C-AMAT Definition and Contribution
  • Experimental Design and Verification
  • Application and Related Work
  • Conclusion

References
X.-H. Sun and D. Wang, "Concurrent Average Memory Access Time," IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.
D. Wang and X.-H. Sun, "APC: A Novel Memory Metric and Measurement Methodology for Modern Memory Systems," IEEE Transactions on Computers, vol. 63, no. 7, pp. 1626-1639, 2014.
3
Motivation
  • Processors are 400x faster than memory, and
    applications are becoming more data intensive
  • Data access becomes THE performance bottleneck of
    high-end computing
  • Many concurrency-based technologies have been developed
    to improve data access speed, but their impact on
    final performance is elusive and, therefore, they are
    not fully utilized
  • Existing memory optimization strategies are still
    primarily based on the sequential single-access
    assumption

4
Memory Wall Problem

[Figure: Processor-DRAM memory gap. Curves: µProc 1.20/yr.; µProc 1.52/yr. (2X/1.5 yr, Moore's Law); DRAM 7%/yr. (2X/10 yrs). The processor-memory performance gap grows 50% / year.]
  • 1980: no cache in microprocessors; 2010: 3-level
    cache on chip, 4-level cache off chip
  • 1989: the first Intel processor with on-chip L1
    cache was the Intel 486, 8KB size
  • 1995: the first Intel processor with on-chip L2
    cache was the Intel Pentium Pro, 256KB size
  • 2003: the first Intel processor with on-chip L3
    cache was the Intel Itanium 2, 6MB size

Source: Computer Architecture: A Quantitative Approach
5
Extremely Unbalanced Operation Latency
[Figure: operation latency in cycles; IO access: 515M cycles]
6
Data Access becomes Performance Bottleneck
GROMACS  (molecular dynamics) 
MPQC (Massively Parallel Quantum Chemistry)
Multi-Grid solver (CFD)
Microstructure
7
Data Access becomes Performance Bottleneck
Computational Fluid Dynamics
Adaptive Multigrid
Computational Finance
Data mining
8
Solution: Memory Hierarchy
[Figure: memory hierarchy. L1 cache (cache cntl, 32-128 bytes); L2 cache (<50MB, 1-10 ns)]
9
Data Access Concurrency Exists
10
Solution: Memory Hierarchy Parallelism
Pipeline, Non-blocking, Prefetching, Write buffer
Input-Output (I/O)
Parallel File System
Disks
11
Extremely Unbalanced Operation Latency
Assumption of Current Solutions
  • Memory Hierarchy: Locality
  • Concurrence: Data access pattern
  • Data stream

[Figure: operation latency in cycles; IO access: 515M cycles]
Performance varies widely
12
Existing Memory Metrics
  • Miss Rate (MR): the number of missed memory accesses over the
    number of total memory accesses
  • Misses Per Kilo-Instructions (MPKI): the number of missed memory
    accesses over the number of total committed instructions, × 1000
  • Average Miss Penalty (AMP): the sum of all single-miss latencies
    over the number of missed memory accesses
  • Average Memory Access Time (AMAT): AMAT = Hit time + MR × AMP
  • Flaw of existing metrics: they focus on a single component or
    a single memory access

Missing: memory parallelism/concurrency
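As a minimal sketch of how the four metrics on this slide relate, the following computes them from raw counts. All input numbers and function names are hypothetical and chosen only for illustration; they are not measurements from the talk.

```python
def miss_rate(misses: int, accesses: int) -> float:
    """MR: missed accesses over total accesses."""
    return misses / accesses


def mpki(misses: int, instructions: int) -> float:
    """MPKI: misses per 1000 committed instructions."""
    return misses / instructions * 1000


def average_miss_penalty(miss_latencies: list) -> float:
    """AMP: sum of the individual miss latencies over the number of misses."""
    return sum(miss_latencies) / len(miss_latencies)


def amat(hit_time: float, mr: float, amp: float) -> float:
    """AMAT = Hit time + MR x AMP (sequential, one access at a time)."""
    return hit_time + mr * amp


# Hypothetical workload: 1000 accesses, 4000 instructions, 50 misses.
accesses = 1000
instructions = 4000
hit_time = 2                               # cycles
miss_latencies = [20] * 30 + [40] * 20     # cycles per miss

mr = miss_rate(len(miss_latencies), accesses)
amp = average_miss_penalty(miss_latencies)
print("MR   =", mr)
print("MPKI =", mpki(len(miss_latencies), instructions))
print("AMP  =", amp)
print("AMAT =", amat(hit_time, mr, amp))
```

None of these functions takes overlap between accesses as an input, which is the flaw the slide points out.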
13
Concurrent AMAT (C-AMAT)
  • H is Hit time
  • CH is the hit concurrency
  • CM is the pure miss concurrency
  • pMR and pAMP are pure-miss ratio and pure-miss
    penalty
  • A pure-miss cycle is a miss cycle with no hit activity
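
The equation these symbols belong to is not preserved in the transcript. As a sketch, the form published in the Sun and Wang IEEE Computer paper cited in this talk is reproduced below; $C_H$ and $C_M$ are the CH and CM of the bullets above.

```latex
\[
\text{C-AMAT} \;=\; \frac{H}{C_H} \;+\; pMR \times \frac{pAMP}{C_M}
\]
```

When there is no concurrency ($C_H = C_M = 1$) every miss is a pure miss, so the expression reduces to the sequential AMAT = Hit time + MR × AMP.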

14
Different perspectives
  • Sequential perspective: AMAT
  • Concurrent perspective: C-AMAT

15
Pure-miss
  • Miss is not important (Pure miss is)
  • The penalty is due to pure miss
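
To make the pure-miss idea concrete, here is a small sketch that counts hit and miss activity cycle by cycle and derives the C-AMAT parameters. The timeline, the `(start_cycle, miss_cycles)` encoding, and the function name are assumptions made for this illustration, and the C-AMAT expression used is the one from the cited IEEE Computer paper (H/CH + pMR × pAMP/CM).

```python
from collections import Counter


def c_amat_from_timeline(accesses, hit_time):
    """accesses: list of (start_cycle, miss_cycles); every access spends
    hit_time cycles in the hit phase, then miss_cycles in the miss phase."""
    hit_count, miss_count = Counter(), Counter()
    for start, miss_cycles in accesses:
        for c in range(start, start + hit_time):
            hit_count[c] += 1
        for c in range(start + hit_time, start + hit_time + miss_cycles):
            miss_count[c] += 1

    # A pure-miss cycle has miss activity and no hit activity.
    pure_cycles = {c for c in miss_count if hit_count[c] == 0}

    # Pure-miss cycles seen by each access; an access with at least one
    # pure-miss cycle is a pure miss.
    pure_per_access = []
    for start, miss_cycles in accesses:
        phase = range(start + hit_time, start + hit_time + miss_cycles)
        n = sum(1 for c in phase if c in pure_cycles)
        if n:
            pure_per_access.append(n)

    n_acc = len(accesses)
    c_h = sum(hit_count.values()) / len(hit_count)        # hit concurrency
    pure_total = sum(pure_per_access)
    c_m = pure_total / len(pure_cycles) if pure_cycles else 1.0
    pmr = len(pure_per_access) / n_acc                    # pure-miss ratio
    pamp = pure_total / len(pure_per_access) if pure_per_access else 0.0

    misses = [m for _, m in accesses if m > 0]
    amat = hit_time + (len(misses) / n_acc) * (sum(misses) / len(misses) if misses else 0.0)
    return {
        "amat": amat,
        "c_amat": hit_time / c_h + pmr * pamp / c_m,
        "pure_miss_cycles": len(pure_cycles),
        "active_cycles": len(set(hit_count) | set(miss_count)),
    }


# Hypothetical timeline: four overlapping accesses, two of which miss.
result = c_amat_from_timeline([(0, 0), (1, 3), (2, 0), (3, 2)], hit_time=2)
print(result)
```

In this timeline four of the cycles carry miss activity, but only two of them are pure-miss cycles; the other two are hidden behind hits from later accesses. The resulting C-AMAT also equals the active memory cycles divided by the number of accesses.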

16
C-AMAT is Recursive
The equation on this slide gives the recurrence relation between
C-AMAT1 and C-AMAT2
17
The physical meaning of ?1
  • R1 = pure miss cycles / miss cycles
  • R2 = pure misses / misses
  • ?1 = R1 / R2
  • The penalty at L2 is C-AMAT2
  • The actual delay impact is ?1 × C-AMAT2
  • ?1 is the L1 (concurrency) data delay reducer
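
The bullets above can be sketched in code. This assumes the recurrence has the form C-AMAT1 = H1/CH1 + pMR1 × (reducer) × C-AMAT2, as in the IEEE Computer paper cited in this talk. The name `reducer1` is a stand-in for the symbol that appears as "?1" in this transcript, and all numbers are hypothetical.

```python
def delay_reducer(pure_miss_cycles, miss_cycles, pure_misses, misses):
    """R1 / R2, with R1 = pure miss cycles / miss cycles
    and R2 = pure misses / misses."""
    r1 = pure_miss_cycles / miss_cycles
    r2 = pure_misses / misses
    return r1 / r2


def c_amat_level1(h1, ch1, pmr1, reducer1, c_amat2):
    """Assumed recurrence: the L2 penalty C-AMAT2 is scaled by the
    pure-miss ratio and by the L1 concurrency delay reducer."""
    return h1 / ch1 + pmr1 * reducer1 * c_amat2


# Hypothetical L1 counts: 4 miss cycles of which 2 are pure,
# 2 misses of which both are pure misses.
reducer1 = delay_reducer(pure_miss_cycles=2, miss_cycles=4,
                         pure_misses=2, misses=2)
c_amat1 = c_amat_level1(h1=2, ch1=1.6, pmr1=0.5,
                        reducer1=reducer1, c_amat2=2.0)
print("reducer1 =", reducer1)
print("C-AMAT1  =", c_amat1)
```

With these inputs only half of the L2 penalty reaches L1 as delay, because half of the miss cycles overlap with hit activity.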

18
Architecture Impacts
  • CH could be contributed by multi-port caches,
    multi-banked caches, and pipelined cache structures
  • CM could be contributed by non-blocking cache
    structures and prefetching logic
  • These techniques can increase both CH and CM:
    out-of-order execution, multiple-issue pipelines,
    SMT, and CMP

19
Detecting System
Structure for detecting cache hit concurrency and
cache miss concurrency using the C-AMAT metric
20
Experimental Environment
  • Simulator: GEM5
  • Benchmarks: 29 benchmarks from the SPEC CPU2006 suite
  • For each benchmark, 10 million instructions were
    simulated to collect statistics
  • Average values of the corresponding memory
    metrics are shown
  • A good memory metric should match the actual
    design choices for modern processors

21
Default configuration
Default processor and cache configuration
parameters for simulated testing of C-AMAT
22
Experimental Results
L1 DCache AMAT and C-AMAT when Changing Issue
Pipeline Width
AMAT gets worse and C-AMAT gets better as
concurrency increases
23
Experimental Results
L1 DCache AMAT and C-AMAT when Changing MSHR Size
AMAT gets worse and C-AMAT gets better as
concurrency increases
24
Experimental Results
L2 Cache AMAT and C-AMAT when Changing MSHR Size
AMAT gets worse and C-AMAT gets better as
concurrency increases
More results can be found in X.-H. Sun and D.
Wang, "Concurrent Average Memory Access Time,"
IEEE Computer, vol. 47, no. 5, pp. 74-80, May 2014.
25
Potential of C-AMAT and Data Concurrency
  • Assume the total running time is T
  • Data stall time is d; d/T is up to 70%, that is,
    d is 0.7 T
  • Compute time is t, and t is 0.3 T
  • Therefore, data stall time can be up to 0.7/0.3 ≈
    2.3 times the compute time
  • If layered performance matching can be achieved,
    so that the overlapping effect of data access
    concurrency is sufficient, data stall time is only 1%
    of compute time
  • Then memory performance can be improved 230
    times!
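
The arithmetic behind the 230x figure can be checked in a few lines. The 70% stall share and the 1% target come from the bullets above; the variable names are only illustrative.

```python
stall_fraction = 0.7                      # d / T, upper bound quoted on the slide
compute_fraction = 1 - stall_fraction     # t / T

# Data stall time relative to compute time before optimization.
stall_to_compute = stall_fraction / compute_fraction

# Target after layered performance matching: stall is 1% of compute time.
target_stall_to_compute = 0.01

improvement = stall_to_compute / target_stall_to_compute
print(f"stall/compute before: {stall_to_compute:.2f}")
print(f"improvement factor:   {improvement:.0f}x")

# The slide rounds 0.7/0.3 to 2.3 first, which gives the quoted 230x.
rounded_improvement = round(2.3 / target_stall_to_compute)
print(f"with 2.3 rounded:     {rounded_improvement}x")
```

Without the intermediate rounding the factor is slightly above 230.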

26
Improvement potential due to concurrency
Aided by concurrency, memory system performance
can be improved by up to hundreds of times (230X) at
each layer of a memory hierarchy with layered
performance matching
27
How the 230x Improvement Is Achieved
Increasing data access concurrency to reach a 230x
speedup of memory system performance with our LPM
algorithm
28
Technique Impact Analysis (Original)
Figure 2.11 on page 96 in Hennessy and Patterson's
latest book
29
Technique Impact Analysis (Ours)
A new technique summary table with C-AMAT
30
The Impact of C-AMAT
  • New understanding of memory systems with a rigorous
    mathematical description
  • Unifies the influence of data locality and
    concurrency under one formulation
  • Foundation for developing new concurrency-based
    optimizations, and utilizing existing
    locality-based optimizations
  • Foundation for automatic tuning for best
    configuration, partition, and scheduling, etc.

31
C-AMAT in Action
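Traditional AMAT model
New C-AMAT model: $\text{CPU-time} = IC \times (CPI_{exe} + f_{mem} \times \text{C-AMAT} \times (1 - overlapRatio_{c\text{-}m})) \times \text{cycle-time}$
The term $f_{mem} \times \text{C-AMAT} \times (1 - overlapRatio_{c\text{-}m})$ is the data stall time
Only pure misses cause processor stalls, and
the penalty is formulated here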
Y.-H. Liu and X.-H. Sun, "Reevaluating data stall
time with the consideration of data access
concurrency," Journal of Computer Science and
Technology, vol. 30, no. 2, pp. 227-245, 2015.
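
A minimal sketch of the C-AMAT-based CPU-time model on this slide, read as CPU-time = IC × (CPIexe + fmem × C-AMAT × (1 − overlapRatio)) × cycle-time. The function name and every input value are hypothetical and serve only to show how the terms combine.

```python
def cpu_time(ic, cpi_exe, f_mem, c_amat, overlap_ratio_cm, cycle_time):
    """CPU time in seconds under the C-AMAT model.

    ic               instruction count
    cpi_exe          cycles per instruction spent on computation
    f_mem            fraction of instructions that access memory
    c_amat           concurrent average memory access time (cycles)
    overlap_ratio_cm share of memory time overlapped with computation
    cycle_time       seconds per cycle
    """
    data_stall_per_instr = f_mem * c_amat * (1 - overlap_ratio_cm)
    return ic * (cpi_exe + data_stall_per_instr) * cycle_time


# Hypothetical program: 1e9 instructions on a 2 GHz core.
base = cpu_time(ic=1e9, cpi_exe=1.0, f_mem=0.3, c_amat=1.75,
                overlap_ratio_cm=0.2, cycle_time=0.5e-9)
# Same program with more computation/memory overlap.
more_overlap = cpu_time(ic=1e9, cpi_exe=1.0, f_mem=0.3, c_amat=1.75,
                        overlap_ratio_cm=0.6, cycle_time=0.5e-9)
print(f"CPU time, overlap 0.2: {base:.3f} s")
print(f"CPU time, overlap 0.6: {more_overlap:.3f} s")
```

Lowering C-AMAT or raising the overlap ratio shrinks only the data stall term; the compute term is unaffected.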
32
C-AMAT in Action
  • Layered performance matching at each layer of the
    memory hierarchy
  • Using recursive C-AMAT to measure and mitigate
    layered performance mismatch
  • For instance, the impact of C-AMAT2 can be
    trimmed by pMR1 and ?1
  • The key is to reduce pure misses, not misses, and
    data concurrency can do so

Y.-H. Liu and X.-H. Sun, "LPM: Layered Performance
Matching in Memory Hierarchy," Illinois
Institute of Technology Technical Report
(IIT/CS-SCS-2014-08), 2014.
33
C-AMAT in Action
  • Online Reconfiguration and Smart Scheduling
  • A performance optimization tool has been
    developed based on C-AMAT
  • Provide measurement and optimization suggestions
  • Measure C-AMAT on existing computing systems
  • Optimization in hardware reconfiguration
  • Optimization in software task partitioning and
    scheduling

Y.-H. Liu and X.-H. Sun, "TuningC: A
Concurrency-aware Optimization Tool," Illinois
Institute of Technology Technical Report
(IIT/CS-SCS-2015-05), 2015.
34
Related Work: APC Versus C-AMAT
  • Access Per (memory active) Cycle (APC): APC = A/T,
    the number of accesses A over the number of memory
    active cycles T
  • APC is a measurement, a companion of C-AMAT
  • C-AMAT is an analysis and optimization tool
  • APC is very different from the traditional IPC:
    it uses memory active cycles (data centric/access)
    and an overlapping mode (concurrent data access)
  • C-AMAT does not depend on its five parameters for
    its value
  • C-AMAT = 1/APC
D. Wang and X.-H. Sun, "Memory Access Cycle and the
Measurement of Memory Systems," IEEE
Transactions on Computers, vol. 63, no. 7, pp.
1626-1639, July 2014.
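
As a sketch of how APC could be measured, the code below counts memory active cycles in overlapping mode: a cycle with any number of outstanding accesses counts once, and idle cycles are not counted. The interval encoding and function names are assumptions for this illustration, not the detecting hardware described in the talk.

```python
def memory_active_cycles(intervals):
    """intervals: list of (start_cycle, end_cycle) per access, end exclusive.
    Overlapping accesses share cycles, so each active cycle counts once."""
    active = set()
    for start, end in intervals:
        active.update(range(start, end))
    return len(active)


def apc(intervals):
    """APC = A / T: accesses per memory active cycle."""
    return len(intervals) / memory_active_cycles(intervals)


# Hypothetical trace: four overlapping accesses, an idle gap, then one more.
trace = [(0, 2), (1, 6), (2, 4), (3, 7), (10, 12)]
t_active = memory_active_cycles(trace)
apc_value = apc(trace)
print("memory active cycles T =", t_active)
print("APC                    =", apc_value)
print("C-AMAT = 1/APC         =", 1 / apc_value)
```

The three idle cycles in the gap do not count toward T, which is what separates APC from an IPC-style metric based on total cycles.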
35
Related Work: MLP
  • Memory Level Parallelism (MLP): the average number
    of long-latency main memory outstanding accesses
    when there is at least one such outstanding access
  • Assuming each off-chip memory access has a
    constant latency, say m cycles, APC_M = MLP/m
  • That means APC_M is directly proportional to MLP
  • APC is a superset of MLP
  • C-AMAT is an analytical tool and measurement; MLP
    is a measurement
  • MLP does not consider locality, while APC and
    C-AMAT do
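
The relation APC_M = MLP/m can be checked numerically under the slide's constant-latency assumption. The start cycles and the latency below are hypothetical, and the function name is illustrative only.

```python
from collections import Counter


def mlp_and_apc(start_cycles, latency):
    """Each main-memory access is outstanding for `latency` cycles.
    MLP averages the number of outstanding accesses over the cycles
    that have at least one; APC_M is accesses per such cycle."""
    outstanding = Counter()
    for start in start_cycles:
        for cycle in range(start, start + latency):
            outstanding[cycle] += 1
    active_cycles = len(outstanding)
    mlp = sum(outstanding.values()) / active_cycles
    apc_m = len(start_cycles) / active_cycles
    return mlp, apc_m


m = 10  # assumed constant off-chip latency in cycles
mlp, apc_m = mlp_and_apc(start_cycles=[0, 2, 3, 20], latency=m)
print("MLP     =", mlp)
print("APC_M   =", apc_m)
print("MLP / m =", mlp / m)
```

Because every access contributes exactly m outstanding cycles, the sum of outstanding counts is A × m, so MLP/m collapses to A/T.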

36
Conclusions
  • Data access delay is the premier bottleneck of
    computing
  • Hardware memory concurrency exists but is
    underutilized
  • C-AMAT unifies data concurrency with locality for
    combined data access optimizations
  • C-AMAT can improve AMAT performance 230 times
  • This 230X number could be even larger. With
    multicore technology, CPUs can be built faster.
    The question is whether data can be moved up fast
    enough

37
Conclusions
  • Develop C-AMAT-based technologies to reduce data
    access time!

38
  • Thank You
  • Questions ?